Tags: data analysis
Using Mysql to calculate average (mean, mode or median)
By admin on Sep 28, 2010 | In Daily Commute - Standard hints/tips, Showroom - Examples of Mysql usage | Send feedback »
To calculate the mean (or average) we can utilise the AVG function within Mysql.
Code:
select avg(sal) as mean_sal | |
from emp; |
mean_sal
========
2073.9286
Remember though that the function will not work on a NULL field. If you want to include NULL values (eg assume zero value) use COALESCE to default the NULL to zero.
Code:
select avg(coalesce(sal,0)) as mean_sal | |
from emp; |
We can now produce the mean figure per grouped column, for example by department
Code:
select deptno, avg(coalesce(sal,0)) as mean_sal | |
from emp | |
group by deptno; |
deptno mean_sal
====== ========
10 2916.6667
20 2175.0000
30 1568.3333
The mode is the value that occurs most frequently in a given set of data.
If I want to find out from the PRODUCT table, what is the mode of the FULLPRICE field for a certain productgroup (in this case 'CA'), I can use a subquery to find the mode.
The subquery will calculate the various counts for each Fullprice, while the main part of the query will choose the price that occurs the most.
Code:
SELECT FullPrice FROM product | |
WHERE productgroupcode = 'CA' | |
GROUP BY FullPrice | |
HAVING count(*) >= all (SELECT count(*) | |
FROM product | |
WHERE productgroupcode = 'CA' | |
GROUP BY FullPrice) |
FullPrice
=========
14
There is another common requirement in finding an average, which is called the median.
The median is the value of the middle item in a set of ordered elements.
To calculate the median we can use a self join (join the table to itself), and then use the HAVING clause to filter those rows where the number of
times (P.FULLPRICE = Q.FULLPRICE) is greater than or equal to the number of times (Q.FULLPRICE > P.FULLPRICE) :
Code:
SELECT avg(FullPrice) | |
FROM ( select q.FullPrice | |
FROM product q, product p | |
WHERE p.productgroupcode = q.productgroupcode | |
AND q.productgroupcode = 'CA' | |
GROUP BY q.FullPrice | |
HAVING sum(case when q.FullPrice = p.FullPrice then 1 else 0 end) | |
>= abs(sum(sign(q.FullPrice - p.FullPrice))) | |
) t |
avg(FullPrice)
=========
14.0000
We can show the individual workings with this code that moves the SUM calculations into the SELECT list:
Code:
select q.FullPrice, p.FullPrice, | |
sum(case when q.FullPrice = p.FullPrice then 1 else 0 end) as cnt1, | |
abs(sum(sign(q.FullPrice - p.FullPrice))) as cnt2 | |
FROM product q, product p | |
WHERE p.productgroupcode = q.productgroupcode | |
AND q.productgroupcode = 'CA' | |
group by q.FullPrice |
FullPrice --- cnt1 --- cnt2
10 --- 25 --- 130
11 --- 4 --- 38
12 --- 1 --- 16
14 --- 64 --- 56
15 --- 1 --- 2
20 --- 1 --- 4
24 --- 9 --- 24
26 --- 1 --- 12
29 --- 9 --- 48
39 --- 25 --- 120
630 --- 1 --- 30
Finally, here I will list the records from the PRODUCT table so you can see the whole set we worked upon.
Code:
SELECT PRODUCTID, FULLPRICE | |
FROM product | |
WHERE PRODUCTGROUPCODE = 'CA' | |
ORDER BY FULLPRICE |
PRODUCTID --- FULLPRICE
12175 ---- 10
12014 ---- 10
13297 ---- 10
10002 ---- 10
10003 ---- 10
10838 ---- 11
13109 ---- 11
13107 ---- 12
13299 ---- 14
12177 ---- 14
14030 ---- 14
13621 ---- 14
12016 ---- 14
11003 ---- 14
11057 ---- 14
11056 ---- 14
10001 ---- 15
12949 ---- 20
13360 ---- 24
13470 ---- 24
10955 ---- 24
13359 ---- 26
13298 ---- 29
12015 ---- 29
12176 ---- 29
13024 ---- 39
10863 ---- 39
13023 ---- 39
10862 ---- 39
10993 ---- 39
12141 ---- 630
Checking for non-text entries.
By admin on Nov 23, 2008 | In Uncategorized, Daily Commute - Standard hints/tips | 1 feedback »
One common task in data cleansing is to check that a text field contains valid data. Control characters, initials or even numbers can be unwelcome values in some fields.
If we're merely checking that the cell contains text, as opposed to numbers, I find it easier to do this in Excel and export if necessary.
For those interested, the formula to do this would be something like:
=IF(ISTEXT(A3),"Good","False")
However, we have several options of addressing this problem in Mysql
One interesting way of doing this could be to check that the characters contain a vowel.
However if we just use the substring function as follows, we could be around a long time. The following code shows how we would start this.
Code:
SELECT firstname FROM `customer` | |
WHERE substring(firstname,1,1) not in ('A','E', 'I', 'O', 'U') | |
AND substring(firstname,2,1) not in ('A','E', 'I', 'O', 'U') | |
AND substring(firstname,3,1) not in ('A','E', 'I', 'O', 'U') |
A better way would be to use the Locate function to see if a word contains a vowel. If the string is not found then locate returns zero.
Code:
SELECT firstname FROM `customer` | |
WHERE locate('A', firstname) = 0 | |
and locate('E',firstname) = 0 | |
and locate('I',firstname) = 0 | |
and locate('O',firstname) = 0 | |
and locate('U',firstname) = 0 |
Although I used locate I could easily have used the function instr instead.
Remember though that the syntax order is different
Code:
ie instr(<search string>, <pattern>) |
Yet another way would be to use NOT LIKE.
Code:
SELECT firstname FROM `customer` | |
WHERE firstname not like '%A%' | |
AND firstname not like '%E%' | |
AND firstname not like '%I%' | |
AND firstname not like '%O%' | |
AND firstname not like '%U%' |
Some of you may have realised there are flaws in using the above. What about case? Maybe some entries just have initials? There could be acronyms or even the word rhythm (no a,e,i,o or u)!!
Perhaps a better way would be to make sure the value of the character has a proper ascii value.
For instance, if I used the following bit of code to test the first character of a field, I can bring back those records that start with a number (or have a non-standard character value).
Code:
select firstname, ascii(substring(firstname,1,1)), gender | |
from customer | |
where ascii(substring(firstname,1,1)) not between 65 and 122 | |
and ascii(substring(firstname,1,1)) > 0 |
If numbers are valid here, we can test for values between 48 and 122. Here's a link to ascii table values:
ascii table
Of course the above code only works on testing the first character of the string.
In order to test every character of the string we need to iterate through a string's characters. I will show how to do this in my next entry.
Lastly, for advanced programmers, there is a great feature in Mysql - the ability to use regular expressions ie regexp
Again, this needs a greater explanation than I have time for on this entry. However there is a section devoted to this on the Mysql manual pages.
mysql manual regexp
Also, there are many sites that give tutorials in using regular expressions.
Here's a good one regular-expressions
If you do decide to use regular expressions, remember that although they can be very powerful, they can also hit performance issues on large tables.
In conclusion, there are many ways of approaching this sort of task in Mysql. I hope the above notes have given you a good idea of how we can do this.