Finding duplicates
By admin on Mar 2, 2009 | In Daily Commute - Standard hints/tips | Send feedback »
One very common requirement when working with data, is finding duplicate entries. The sql language is very powerful in doing this, yet many people prefer to use Excel to find duplicates. My opinion is this is fine for relatively small data-sets (less than 10,000 rows). You can use conditional formatting (for example using =COUNTIF(a:A,a2) > 1) or you can use the Filter/Advanced menu options to filter and copy to another location (you take out the duplicates by checking Unique records only). More modern versions of Excel now have a special command in the Data Menu: Data>Remove duplicates.
In Mysql (and indeed other versions of sql) the most common way of finding duplicate field entries is by using the GROUP BY clause. For example, to bring back job entries from the emp table that have more than 1 employee recorded against that particular job:
Code:
select job, count(*) | |
-> from emp | |
-> group by job | |
-> having count(*) > 1; |
If you wanted to bring back jobs recorded more than 3 times in the table you merely change the having clause:
Code:
select job, count(*) | |
-> from emp | |
-> group by job | |
-> having count(*) > 3; |
Incidentally, if you merely wanted to bring back distinct field(s) in a table, you can use GROUP BY without performing a count.
Now this is all very well, but supposing you have a requirement to bring back all fields in a record where the id field occurs as a duplicate. This can be achieved by using the IN clause in a subquery:
Code:
mysql> select a.* from emp a | |
-> where a.job in | |
-> (select b.job | |
-> from emp b | |
-> group by b.job having count(*) >1); |
All very well so far, but let's introduce a second table. Suppose we want to select records that occur twice within the two tables (and it's possible that the records appear within the same table)?
We can do this by using the UNION operator to bring back all records in both tables (remember to use UNION ALL, otherwise duplicates would be suppressed). We put this in an inline view, and then using GROUP BY to select only duplicates.
select empno, ename, job
from
(select empno, ename, job
from emp
UNION ALL
select empno, ename, job
from emp_temp) as empt
group by empno, ename, job having count(*) > 1;
I'm not sure how you would approach that in Excel! Easy enough in Mysql though!
Have fun!
Feedback awaiting moderation
This post has 54 feedbacks awaiting moderation...
Leave a comment
| « Making queries more efficient 1) Locating queries that run slowly | Powerful string manipulation » |