How to Identify and Remove Duplicate Records in SQL

Duplicate records in a database can cause inefficiencies and inaccuracies in data processing. Let’s walk through how to identify and handle duplicates in SQL with practical examples using a hypothetical artists table.


The Setup: Creating Our artists Table

To demonstrate, here’s the script to create and populate a sample table:

CREATE TABLE artists (
    id INT,
    first_name VARCHAR(40),
    last_name VARCHAR(40)
);

INSERT INTO artists VALUES 
(1, 'Thomas', 'Black'),
(2, 'Kate', 'Smith'),
(3, 'Natali', 'Wein'),
(4, 'Francesco', 'Benelli'),
(5, 'Kate', 'Smith'),
(6, 'Natali', 'Wein');

SELECT * FROM artists;

Initial Output:

id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli
 5 | Kate       | Smith
 6 | Natali     | Wein

Step 1: Identifying Duplicates

To find duplicate entries based on first_name and last_name, we can use GROUP BY with HAVING:

SELECT first_name, last_name, COUNT(*) AS record_count
FROM artists
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;

Output:

first_name | last_name | record_count
-----------+-----------+--------------
Kate       | Smith     | 2
Natali     | Wein      | 2
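GROUP BY collapses each duplicate group into a single row, so the ids of the individual copies are hidden. If you also want to see every duplicate row in full, one option is to join the grouped result back to the table (a sketch; some databases also accept a row-value `(first_name, last_name) IN (...)` comparison):

```sql
SELECT a.id, a.first_name, a.last_name
FROM artists a
JOIN (
    -- name pairs that appear more than once
    SELECT first_name, last_name
    FROM artists
    GROUP BY first_name, last_name
    HAVING COUNT(*) > 1
) d ON a.first_name = d.first_name
   AND a.last_name  = d.last_name
ORDER BY a.first_name, a.last_name, a.id;
```

On our sample data this lists the four rows with ids 2, 5, 3, and 6.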

Step 2: Deleting Duplicates

Scenario 1: Single Duplicate Per Record

If each record has at most one duplicate, the query below deletes the row with the higher id (MAX(id)) in each duplicate group, leaving the earlier copy in place:

DELETE FROM artists
WHERE id IN (
    SELECT MAX(id) 
    FROM artists
    GROUP BY first_name, last_name
    HAVING COUNT(*) > 1
);

Output After Deletion:

id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli

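One caveat: MySQL rejects a DELETE whose subquery reads from the same table it is deleting from ("You can't specify target table ... for update in FROM clause"). A common workaround, sketched below, is to wrap the subquery in a derived table so MySQL materializes it first:

```sql
DELETE FROM artists
WHERE id IN (
    -- the extra derived table (dups) sidesteps MySQL's
    -- "can't specify target table" restriction
    SELECT id FROM (
        SELECT MAX(id) AS id
        FROM artists
        GROUP BY first_name, last_name
        HAVING COUNT(*) > 1
    ) AS dups
);
```

PostgreSQL and SQL Server accept the original form without this workaround.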
Scenario 2: Multiple Duplicates Per Record

If a record can appear more than twice (e.g., three or more copies), we use NOT IN with a subquery to retain only the earliest record in each duplicate group:

DELETE FROM artists
WHERE id NOT IN (
    SELECT MIN(id)
    FROM artists
    GROUP BY first_name, last_name
);

This approach ensures only the first occurrence of each duplicate group remains.
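An alternative that avoids the subquery entirely is a self-join delete: remove any row whose group contains a row with a smaller id. The syntax is dialect-specific; in PostgreSQL it might look like this:

```sql
-- delete row "a" whenever an earlier row "b" has the same name pair
DELETE FROM artists a
USING artists b
WHERE a.first_name = b.first_name
  AND a.last_name  = b.last_name
  AND a.id > b.id;
```

MySQL expresses the same idea with a multi-table `DELETE a FROM artists a JOIN artists b ...` form instead of USING.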


Step 3: Using Window Functions for Flexibility

Window functions like ROW_NUMBER() provide granular control for identifying and managing duplicates. Let’s see how to rank records within duplicate groups:

SELECT id, first_name, last_name,
       ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id) AS row_num
FROM artists;

Output:

id | first_name | last_name | row_num
---+------------+-----------+---------
 1 | Thomas     | Black     | 1
 2 | Kate       | Smith     | 1
 5 | Kate       | Smith     | 2
 3 | Natali     | Wein      | 1
 6 | Natali     | Wein      | 2
 4 | Francesco  | Benelli   | 1

To delete duplicates while keeping the first occurrence:

WITH cte AS (
    SELECT id, 
           ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id) AS row_num
    FROM artists
)
DELETE FROM artists
WHERE id IN (
    SELECT id 
    FROM cte 
    WHERE row_num > 1
);

Final Output:

id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli
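In SQL Server, a CTE is itself an updatable target, so the same cleanup can be written by deleting through the CTE directly (dialect-dependent; PostgreSQL does not support this form):

```sql
WITH cte AS (
    SELECT ROW_NUMBER() OVER (
        PARTITION BY first_name, last_name
        ORDER BY id
    ) AS row_num
    FROM artists
)
-- deleting from the CTE deletes the underlying rows
DELETE FROM cte
WHERE row_num > 1;
```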

Step 4: Special Cases and Pro Tips

1. What If There’s No Primary Key?

  • Use system-generated identifiers like PostgreSQL’s ctid or Oracle’s rowid.

  • Example with ctid in PostgreSQL:

DELETE FROM artists
WHERE ctid NOT IN (
    SELECT MIN(ctid)
    FROM artists
    GROUP BY first_name, last_name
);

2. Using a Backup and Reload Method

In situations where duplicates are pervasive and the table has no unique identifier, a safe approach is to back up the table, clear it, and reload only the distinct rows:

-- Back up the table
CREATE TABLE artists_backup AS SELECT * FROM artists;

-- Clear the original table
TRUNCATE TABLE artists;

-- Reload only the distinct rows
INSERT INTO artists
SELECT DISTINCT * FROM artists_backup;

⚠️ Important: Avoid truncating live production tables unless you’re absolutely sure about the implications.
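Whichever method you choose, it's worth re-running the duplicate check from Step 1 afterwards; if the cleanup worked, this query returns no rows:

```sql
SELECT first_name, last_name, COUNT(*) AS record_count
FROM artists
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;
```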


Conclusion

Dealing with duplicate records requires a thoughtful approach tailored to the specific database and business requirements. Whether you’re using simple GROUP BY queries, leveraging window functions, or relying on database-specific tools like ctid, these techniques equip you to efficiently identify and remove duplicates. Always test on non-production environments first and consider creating backups before making changes to critical data.

Which approach will you try first? Let me know your thoughts or if you have other tips for managing duplicates! 🚀