There are 2 main methods to check for duplicates in SQL. The first method uses the keyword DISTINCT and the other one uses GROUP BY.
1 SELECT DISTINCT * FROM table
GROUP BY method
123 SELECT columns, COUNT(*) cntFROM tableGROUP BY columns
The advantage of the GROUP BY method is that you can see the number of duplicates. The drawback is more code.
Finally, to check for duplicates you just have to look at the number of rows before and after you used either of the 2 methods above. If the count before and after is not the same, then you have duplicates.
Sometimes you will notice that your query returned duplicates. Other times, often for big queries, you will think that there’s no duplicates even when there is.
For that reason it is good practice to check if your view contains duplicates, especially for complex queries.
Where do duplicates come from? They can come from a bug in the ETL process, when data is inserted. But most often than not, they will come from the way you designed your query.
When the relationship in between 2 tables is not 1 to 1 and you join them you can end up with duplicates.
One easy way to check if you have duplicates or not in SQL is to count the number of rows before you join your table, and after you do. If you see that the count is different before and after then it means you have duplicates.
To practice, create the following 2 tables:
123456789101112131415161718 CREATE TABLE items (item varchar(10));INSERT INTO items ( item ) VALUES('itemA'),('itemB'),('itemC')CREATE TABLE inventory (item varchar(10), warehouse varchar(10), quantity int);INSERT INTO inventory ( item, warehouse, quantity ) VALUES('itemA', 'A', 0),('itemA', 'B', 17),('itemB', 'A', 5),('itemB', 'B', 2),('itemC', 'A', 0),('itemC', 'B', 0)
Let’s say you work as an inventory analyst and you’re trying to identify items that have inventory. You can run the following query:
12345 select it.itemfrom items it left join inventory inv on it.item = inv.itemwhere quantity > 0
And this will return 3 rows: itemA, itemB, itemB.
We have duplicates because the second table can have multiple values for the same item.
Using one of the 2 methods above (DISTINCT or GROUP BY) would return the correct result: itemA, itemB.