We will find and delete duplicate rows in SQL Server based on specific criteria is a pretty common task when we are working with databases.
In this post, you will learn how to:
- Find duplicate rows into a table using the row_number function.
- Find duplicate rows into a table using the cross apply operator.
- Delete those rows that do not meet the criteria.
Define a structure and filling it up
In this example, we create and fill up a table called post_history with these columns id, title, and create_date. This is script required.
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'post_history') AND type in (N'U')) BEGIN DROP TABLE post_history END GO CREATE TABLE post_history( post_history_id INT IDENTITY(1,1), title VARCHAR(250) NOT NULL, created_date DATETIME NOT NULL, ) GO INSERT INTO dbo.post_history ( title, created_date ) VALUES ( 'Post A', '20180101'), ( 'Post B', '20180201'), ( 'Post C', '20180301'), ( 'Post C', '20180304'), ( 'Post A', '20180110'), ( 'Post B', '20180211'), ( 'Post C', '20180302'), ( 'Post D', '20180302') SELECT * FROM dbo.post_history AS PH ORDER BY PH.title, ph.created_date ASC
After run the above script this will be the expected output.
The objective will be to delete duplicate rows and only keep the first one entered according to the create_date field and the field with which we will identify if the record is duplicate or it will not be title.
Delete duplicate rows using the row_number function
In the following instruction, we proceed to assign a position to each record within each title group, taking the created_date field as the ordering criteria. In this case, we sort in ascending order, since you want to delete all but the first chronologically entered record.
SELECT ROW_NUMBER() OVER(PARTITION BY ph.title ORDER BY PH.created_date ASC) Position, ph.title, ph.created_date FROM dbo.post_history AS PH
Once a position has been assigned, the next step is to delete all the records whose position is greater than 1.
DELETE FROM dbo.post_history WHERE post_history_id IN ( SELECT X.post_history_id FROM ( SELECT ROW_NUMBER() OVER(PARTITION BY ph.title ORDER BY PH.created_date ASC) Position, PH.post_history_id FROM dbo.post_history AS PH ) X WHERE X.Position >1 )
With the previous instruction, the data table will be as follows.
Delete duplicate rows using cross_apply operator
Likewise, we can use the CROSS APPLY statement for identifying and delete rows. In this variant, you create a subquery with the titles and for each query and you obtain through the CROSS APPLY the first matching record and therefore a deletion is used with the NOT IN statement.
DELETE FROM dbo.post_history WHERE post_history_id NOT IN ( SELECT PH.post_history_id FROM ( SELECT PH.title FROM dbo.post_history AS PH GROUP BY PH.title) Titles CROSS APPLY ( SELECT TOP (1) PH.post_history_id FROM dbo.post_history AS PH WHERE PH.title = Titles.title ORDER BY PH.created_date ASC ) PH );
Which is the correct way to delete duplicate rows? This will depend on each scenario and must be analyzed and tested. Do you have other styles for erasing this data? Please let me know.