Delete duplicate rows in SQL Server

Containers

We will find and delete duplicate rows in SQL Server based on specific criteria is a pretty common task when we are working with databases.

In this post, you will learn how to:

  • Find duplicate rows into a table using the row_number function.
  • Find duplicate rows into a table using the cross apply operator.
  • Delete those rows that do not meet the criteria.

Define a structure and filling it up

In this example, we create and fill up a table called post_history with these columns id, title, and create_date. This is script required.

IF  EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'post_history') AND type in (N'U'))
	BEGIN		
		DROP TABLE post_history
	END 
GO


CREATE TABLE post_history(
							post_history_id INT IDENTITY(1,1),
							title VARCHAR(250) NOT NULL,
							created_date DATETIME NOT NULL,
						)

GO
INSERT INTO dbo.post_history
(
    title,
    created_date
)
VALUES (   'Post A',    '20180101'),
	   (   'Post B',    '20180201'),
	   (   'Post C',    '20180301'),
	   (   'Post C',    '20180304'),
	   (   'Post A',    '20180110'),
	   (   'Post B',    '20180211'),
	   (   'Post C',    '20180302'),
	   (   'Post D',    '20180302')
	   


SELECT * 
	FROM dbo.post_history AS PH
	ORDER BY PH.title, ph.created_date ASC

After run the above script this will be the expected output.

The objective will be to delete duplicate rows and only keep the first one entered according to the create_date field and the field with which we will identify if the record is duplicate or it will not be title.

Delete duplicate rows using the row_number function

In the following instruction, we proceed to assign a position to each record within each title group, taking the created_date field as the ordering criteria. In this case, we sort in ascending order, since you want to delete all but the first chronologically entered record.

SELECT  ROW_NUMBER() OVER(PARTITION BY ph.title ORDER BY PH.created_date ASC) Position,
		ph.title,
		ph.created_date 
	FROM dbo.post_history AS PH

Once a position has been assigned, the next step is to delete all the records whose position is greater than 1.

DELETE FROM dbo.post_history
	WHERE post_history_id IN 
            (
                SELECT X.post_history_id
                    FROM (
                            SELECT  ROW_NUMBER() OVER(PARTITION BY ph.title 
                                    ORDER BY PH.created_date ASC) Position,
                                    PH.post_history_id
                                FROM dbo.post_history AS PH
                        ) X
                    WHERE X.Position >1
                )

With the previous instruction, the data table will be as follows.

Delete duplicate rows using cross_apply operator

Likewise, we can use the CROSS APPLY statement for identifying and delete rows. In this variant, you create a subquery with the titles and for each query and you obtain through the CROSS APPLY the first matching record and therefore a deletion is used with the NOT IN statement.

DELETE FROM dbo.post_history
WHERE post_history_id 
        NOT IN (
                SELECT PH.post_history_id
                FROM (
                    SELECT PH.title 
                        FROM dbo.post_history AS PH 
                        GROUP BY PH.title) Titles
                    CROSS APPLY
                    (
                        SELECT TOP (1)
                            PH.post_history_id
                        FROM dbo.post_history AS PH
                        WHERE PH.title = Titles.title
                        ORDER BY PH.created_date ASC
                    ) PH
            );

Conclusions

Which is the correct way to delete duplicate rows? This will depend on each scenario and must be analyzed and tested. Do you have other styles for erasing this data? Please let me know.


Leave a Reply

Your email address will not be published. Required fields are marked *