How to Identify and Remove Duplicate Records in SQL

Duplicate records in a database can cause inefficiencies and inaccuracies in data processing. Let’s walk through how to identify and handle duplicates in SQL with practical examples using a hypothetical artists table.


The Setup: Creating Our artists Table

To demonstrate, here’s the script to create and populate a sample table:

CREATE TABLE artists (
    id INT,
    first_name VARCHAR(40),
    last_name VARCHAR(40)
);

INSERT INTO artists VALUES 
(1, 'Thomas', 'Black'),
(2, 'Kate', 'Smith'),
(3, 'Natali', 'Wein'),
(4, 'Francesco', 'Benelli'),
(5, 'Kate', 'Smith'),
(6, 'Natali', 'Wein');

SELECT * FROM artists;

Initial Output:

id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli
 5 | Kate       | Smith
 6 | Natali     | Wein

Step 1: Identifying Duplicates

To find duplicate entries based on first_name and last_name, we can use GROUP BY with HAVING:

SELECT first_name, last_name, COUNT(*) AS record_count
FROM artists
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;

Output:

first_name | last_name | record_count
-----------+-----------+--------------
Kate       | Smith     | 2
Natali     | Wein      | 2
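GROUP BY collapses each duplicate group into a single row, so the ids of the individual copies are hidden. If you also want to see every duplicate row in full, one option is to join the grouped result back to the table (a sketch; some databases also accept a row-value `(first_name, last_name) IN (...)` comparison):

```sql
SELECT a.id, a.first_name, a.last_name
FROM artists a
JOIN (
    -- name pairs that appear more than once
    SELECT first_name, last_name
    FROM artists
    GROUP BY first_name, last_name
    HAVING COUNT(*) > 1
) d ON a.first_name = d.first_name
   AND a.last_name  = d.last_name
ORDER BY a.first_name, a.last_name, a.id;
```

On our sample data this lists the four rows with ids 2, 5, 3, and 6.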

Step 2: Deleting Duplicates

Scenario 1: Single Duplicate Per Record

If each record has at most one duplicate, the query below deletes the row with the higher id (MAX(id)) in each duplicate group, leaving the earlier copy in place:

DELETE FROM artists
WHERE id IN (
    SELECT MAX(id) 
    FROM artists
    GROUP BY first_name, last_name
    HAVING COUNT(*) > 1
);

Output After Deletion:

id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli

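One caveat: MySQL rejects a DELETE whose subquery reads from the same table it is deleting from ("You can't specify target table ... for update in FROM clause"). A common workaround, sketched below, is to wrap the subquery in a derived table so MySQL materializes it first:

```sql
DELETE FROM artists
WHERE id IN (
    -- the extra derived table (dups) sidesteps MySQL's
    -- "can't specify target table" restriction
    SELECT id FROM (
        SELECT MAX(id) AS id
        FROM artists
        GROUP BY first_name, last_name
        HAVING COUNT(*) > 1
    ) AS dups
);
```

PostgreSQL and SQL Server accept the original form without this workaround.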
Scenario 2: Multiple Duplicates Per Record

If a record can appear more than twice (e.g., three or more copies), we use NOT IN with a subquery to retain only the earliest record in each duplicate group:

DELETE FROM artists
WHERE id NOT IN (
    SELECT MIN(id)
    FROM artists
    GROUP BY first_name, last_name
);

This approach ensures only the first occurrence of each duplicate group remains.
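An alternative that avoids the subquery entirely is a self-join delete: remove any row whose group contains a row with a smaller id. The syntax is dialect-specific; in PostgreSQL it might look like this:

```sql
-- delete row "a" whenever an earlier row "b" has the same name pair
DELETE FROM artists a
USING artists b
WHERE a.first_name = b.first_name
  AND a.last_name  = b.last_name
  AND a.id > b.id;
```

MySQL expresses the same idea with a multi-table `DELETE a FROM artists a JOIN artists b ...` form instead of USING.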


Step 3: Using Window Functions for Flexibility

Window functions like ROW_NUMBER() provide granular control for identifying and managing duplicates. Let’s see how to rank records within duplicate groups:

SELECT id, first_name, last_name,
       ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id) AS row_num
FROM artists;

Output:

id | first_name | last_name | row_num
---+------------+-----------+---------
 1 | Thomas     | Black     | 1
 2 | Kate       | Smith     | 1
 5 | Kate       | Smith     | 2
 3 | Natali     | Wein      | 1
 6 | Natali     | Wein      | 2
 4 | Francesco  | Benelli   | 1

To delete duplicates while keeping the first occurrence:

WITH cte AS (
    SELECT id, 
           ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id) AS row_num
    FROM artists
)
DELETE FROM artists
WHERE id IN (
    SELECT id 
    FROM cte 
    WHERE row_num > 1
);

Final Output:

id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli
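In SQL Server, a CTE is itself an updatable target, so the same cleanup can be written by deleting through the CTE directly (dialect-dependent; PostgreSQL does not support this form):

```sql
WITH cte AS (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY first_name, last_name
        ORDER BY id
    ) AS row_num
    FROM artists
)
-- deleting from the CTE deletes the underlying rows
DELETE FROM cte
WHERE row_num > 1;
```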

Step 4: Special Cases and Pro Tips

1. What If There’s No Primary Key?

  • Use system-generated identifiers like PostgreSQL’s ctid or Oracle’s rowid.

  • Example with ctid in PostgreSQL:

DELETE FROM artists
WHERE ctid NOT IN (
    SELECT MIN(ctid)
    FROM artists
    GROUP BY first_name, last_name
);

2. Using a Backup and Reload Method

In situations where duplicates are pervasive and the table has no unique identifier, a safe approach is to back up the table, clear it, and reload only the distinct rows:

-- Back up the table
CREATE TABLE artists_backup AS SELECT * FROM artists;

-- Clear the original table
TRUNCATE TABLE artists;

-- Reload only the distinct rows
INSERT INTO artists
SELECT DISTINCT * FROM artists_backup;

⚠️ Important: Avoid truncating live production tables unless you’re absolutely sure about the implications.
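Whichever method you choose, it's worth re-running the duplicate check from Step 1 afterwards; if the cleanup worked, this query returns no rows:

```sql
SELECT first_name, last_name, COUNT(*) AS record_count
FROM artists
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;
```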


Conclusion

Dealing with duplicate records requires a thoughtful approach tailored to the specific database and business requirements. Whether you’re using simple GROUP BY queries, leveraging window functions, or relying on database-specific tools like ctid, these techniques equip you to efficiently identify and remove duplicates. Always test on non-production environments first and consider creating backups before making changes to critical data.

Which approach will you try first? Let me know your thoughts or if you have other tips for managing duplicates! 🚀