Duplicate records in a database can cause inefficiencies and inaccuracies in data processing. Let’s walk through how to identify and handle duplicates in SQL, with practical examples using a hypothetical artists table.
The Setup: Creating Our artists Table
To demonstrate, here’s the script to create and populate a sample table:
CREATE TABLE artists (
    id INT,
    first_name VARCHAR(40),
    last_name VARCHAR(40)
);

INSERT INTO artists VALUES
    (1, 'Thomas', 'Black'),
    (2, 'Kate', 'Smith'),
    (3, 'Natali', 'Wein'),
    (4, 'Francesco', 'Benelli'),
    (5, 'Kate', 'Smith'),
    (6, 'Natali', 'Wein');
SELECT * FROM artists;
Initial Output:
id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli
 5 | Kate       | Smith
 6 | Natali     | Wein
Step 1: Identifying Duplicates
To find duplicate entries based on first_name and last_name, we can use GROUP BY with HAVING:
SELECT first_name, last_name, COUNT(*) AS record_count
FROM artists
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;
Output:
first_name | last_name | record_count
-----------+-----------+--------------
Kate       | Smith     | 2
Natali     | Wein      | 2
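The aggregate shows which name pairs are duplicated, but not which rows carry them. If you also want the ids of the affected rows, one option is to join the duplicate groups back to the table (a quick sketch, using only the columns defined above):
-- List every duplicated row together with its id,
-- by joining the duplicate groups back to the table.
SELECT a.id, a.first_name, a.last_name
FROM artists a
JOIN (
    SELECT first_name, last_name
    FROM artists
    GROUP BY first_name, last_name
    HAVING COUNT(*) > 1
) d ON a.first_name = d.first_name
   AND a.last_name = d.last_name
ORDER BY a.first_name, a.last_name, a.id;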
Step 2: Deleting Duplicates
Scenario 1: Single Duplicate Per Record
If each duplicated record has exactly one extra copy, we can delete the row with the higher id in each group (MAX(id)), which leaves the earliest record (MIN(id)) in place:
DELETE FROM artists
WHERE id IN (
    SELECT MAX(id)
    FROM artists
    GROUP BY first_name, last_name
    HAVING COUNT(*) > 1
);
Output After Deletion:
id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli
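One dialect caveat worth knowing: MySQL will not let a DELETE reference its own target table in a subquery. A common workaround, sketched below, is to wrap the subquery in a derived table so it is evaluated first:
-- MySQL-friendly variant: materialize the ids to delete in a
-- derived table so the DELETE no longer reads artists directly.
DELETE FROM artists
WHERE id IN (
    SELECT id FROM (
        SELECT MAX(id) AS id
        FROM artists
        GROUP BY first_name, last_name
        HAVING COUNT(*) > 1
    ) AS to_delete  -- the derived table alias is arbitrary
);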
Scenario 2: Multiple Duplicates Per Record
If a record can appear more than twice, we can use NOT IN with a subquery to retain only the earliest record in each duplicate group:
DELETE FROM artists
WHERE id NOT IN (
    SELECT MIN(id)
    FROM artists
    GROUP BY first_name, last_name
);
This approach ensures only the first occurrence of each duplicate group remains.
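An equivalent way to express this, in databases that allow an alias on the DELETE target (PostgreSQL, for instance), is a correlated EXISTS that removes any row with an older twin:
-- Delete any row for which an older row (smaller id)
-- with the same name pair exists.
DELETE FROM artists a
WHERE EXISTS (
    SELECT 1
    FROM artists b
    WHERE b.first_name = a.first_name
      AND b.last_name  = a.last_name
      AND b.id < a.id  -- an older copy exists, so row a is a duplicate
);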
Step 3: Using Window Functions for Flexibility
Window functions like ROW_NUMBER() provide granular control for identifying and managing duplicates. Let’s see how to rank records within duplicate groups:
SELECT id, first_name, last_name,
       ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id) AS row_num
FROM artists;
Output:
id | first_name | last_name | row_num
---+------------+-----------+---------
 1 | Thomas     | Black     | 1
 2 | Kate       | Smith     | 1
 5 | Kate       | Smith     | 2
 3 | Natali     | Wein      | 1
 6 | Natali     | Wein      | 2
 4 | Francesco  | Benelli   | 1
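Before running any DELETE, it can be worth previewing exactly which rows rank above 1, since those are the rows the next query will remove:
-- Preview the rows that would be deleted (row_num > 1).
SELECT id, first_name, last_name
FROM (
    SELECT id, first_name, last_name,
           ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id) AS row_num
    FROM artists
) ranked
WHERE row_num > 1;
The wrapping subquery is needed because window functions cannot appear directly in a WHERE clause.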
To delete duplicates while keeping the first occurrence:
WITH cte AS (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY first_name, last_name ORDER BY id) AS row_num
    FROM artists
)
DELETE FROM artists
WHERE id IN (
    SELECT id
    FROM cte
    WHERE row_num > 1
);
Final Output:
id | first_name | last_name
---+------------+-----------
 1 | Thomas     | Black
 2 | Kate       | Smith
 3 | Natali     | Wein
 4 | Francesco  | Benelli
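As an aside, some engines (SQL Server, for example) treat the CTE as an updatable target, so you can delete through it directly and skip the IN subquery. A minimal sketch of that variant:
-- SQL Server-style variant: delete through the CTE itself.
WITH cte AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY first_name, last_name
               ORDER BY id
           ) AS row_num
    FROM artists
)
DELETE FROM cte
WHERE row_num > 1;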
Step 4: Special Cases and Pro Tips
1. What If There’s No Primary Key?
Use system-generated identifiers like PostgreSQL’s ctid or Oracle’s rowid. Here’s an example with ctid in PostgreSQL:
DELETE FROM artists
WHERE ctid NOT IN (
    SELECT MIN(ctid)
    FROM artists
    GROUP BY first_name, last_name
);
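The Oracle counterpart is nearly identical, with rowid standing in for ctid:
-- Oracle equivalent using rowid.
DELETE FROM artists
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM artists
    GROUP BY first_name, last_name
);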
2. Using a Backup and Reload Method
In situations where duplicates are pervasive and the table has no unique identifier, a safe approach is to create a backup, deduplicate, and reload:
-- Back up the table
CREATE TABLE artists_backup AS SELECT * FROM artists;
-- Empty the original table
TRUNCATE TABLE artists;
-- Reload only distinct rows (DISTINCT * only collapses rows that
-- are identical in every column, so this assumes no unique id column)
INSERT INTO artists
SELECT DISTINCT * FROM artists_backup;
⚠️ Important: Avoid truncating live production tables unless you’re absolutely sure about the implications.
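Whichever method you choose, a quick sanity check afterwards is to rerun the duplicate-finding query from Step 1; it should come back empty:
-- After deduplication, this should return no rows.
SELECT first_name, last_name, COUNT(*) AS record_count
FROM artists
GROUP BY first_name, last_name
HAVING COUNT(*) > 1;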
Conclusion
Dealing with duplicate records requires a thoughtful approach tailored to the specific database and business requirements. Whether you’re using simple GROUP BY queries, leveraging window functions, or relying on database-specific tools like ctid, these techniques equip you to efficiently identify and remove duplicates. Always test on non-production environments first and consider creating backups before making changes to critical data.
Which approach will you try first? Let me know your thoughts or if you have other tips for managing duplicates! 🚀