Tags: group by
Finding duplicates
By admin on Mar 2, 2009 | In Daily Commute - Standard hints/tips | Send feedback »
One very common requirement when working with data, is finding duplicate entries. The sql language is very powerful in doing this, yet many people prefer to use Excel to find duplicates. My opinion is this is fine for relatively small data-sets (less than 10,000 rows). You can use conditional formatting (for example using =COUNTIF(a:A,a2) > 1) or you can use the Filter/Advanced menu options to filter and copy to another location (you take out the duplicates by checking Unique records only). More modern versions of Excel now have a special command in the Data Menu: Data>Remove duplicates.
In Mysql (and indeed other versions of sql) the most common way of finding duplicate field entries is by using the GROUP BY clause. For example, to bring back job entries from the emp table that have more than 1 employee recorded against that particular job:
Code:
select job, count(*) | |
-> from emp | |
-> group by job | |
-> having count(*) > 1; |
If you wanted to bring back jobs recorded more than 3 times in the table you merely change the having clause:
Code:
select job, count(*) | |
-> from emp | |
-> group by job | |
-> having count(*) > 3; |
Incidentally, if you merely wanted to bring back distinct field(s) in a table, you can use GROUP BY without performing a count.
Now this is all very well, but supposing you have a requirement to bring back all fields in a record where the id field occurs as a duplicate. This can be achieved by using the IN clause in a subquery:
Code:
mysql> select a.* from emp a | |
-> where a.job in | |
-> (select b.job | |
-> from emp b | |
-> group by b.job having count(*) >1); |
All very well so far, but let's introduce a second table. Suppose we want to select records that occur twice within the two tables (and it's possible that the records appear within the same table)?
We can do this by using the UNION operator to bring back all records in both tables (remember to use UNION ALL, otherwise duplicates would be suppressed). We put this in an inline view, and then using GROUP BY to select only duplicates.
select empno, ename, job
from
(select empno, ename, job
from emp
UNION ALL
select empno, ename, job
from emp_temp) as empt
group by empno, ename, job having count(*) > 1;
I'm not sure how you would approach that in Excel! Easy enough in Mysql though!
Have fun!
Beautiful code
By admin on Aug 9, 2008 | In Daily Commute - Standard hints/tips, Formula 1 - More advanced stuff | Send feedback »
A bit of a delay since my last entry. I've started a new job which uses SQLServer (boo!) so my use of Mysql has been very small lately.
Anyway, let's take a little look at the group by clause. One of the most common ways I use it is to bring back records that are duplicated more than once. For example:
Code:
mysql> select ename, job, deptno,count(*) from emp group by ename, job, deptno having count(*) > 1 |
;
+-------+----------+--------+----------+
| ename | job | deptno | count(*) |
+-------+----------+--------+----------+
| WARD | SALESMAN | 30 | 2 |
+-------+----------+--------+----------+
1 row in set (0.00 sec)
Now how would you bring back (to delete or update) the latest insert of this duplicated record? In this case there are 2 records, one with empno of 7521, the latest record has the empno 8000. The following code (in my humble opinion) is very elegant, and is perhaps my favourite bit of sql I've seen.. thanks to this book sqlcookbook
Code:
mysql> select * from emp | |
-> where empno not in (select min(empno) | |
-> from emp | |
-> group by ename, job, deptno); |
+-------+-------+----------+------+------------+------+------+--------+
| empno | ename | job | mgr | hiredate | sal | comm | deptno |
+-------+-------+----------+------+------------+------+------+--------+
| 8000 | WARD | SALESMAN | 7698 | 1981-02-22 | 1250 | 500 | 30 |
+-------+-------+----------+------+------------+------+------+--------+
1 row in set (0.01 sec)
Here's another way of using the group by clause (again using subqueries) to bring back the employees with the highest and lowest salary in the dataset.
Code:
mysql> select ename, job, deptno | |
-> from emp | |
-> where sal in ((select min(sal) from emp), | |
-> (select max(sal) from emp)); |
+-------+-----------+--------+
| ename | job | deptno |
+-------+-----------+--------+
| SMITH | CLERK | 20 |
| KING | PRESIDENT | 10 |
+-------+-----------+--------+
2 rows in set (0.10 sec)
Now that, to me is also beautiful code.
Enjoy your coding - Until next time!