Checking for non-text entries.
By admin on Nov 23, 2008 | In Uncategorized, Daily Commute - Standard hints/tips
One common task in data cleansing is to check that a text field contains valid data. Control characters, initials or even numbers can be unwelcome values in some fields.
If we're merely checking that the cell contains text, as opposed to numbers, I find it easier to do this in Excel and export if necessary.
For those interested, the formula to do this would be something like:
=IF(ISTEXT(A3),"Good","False")
However, we have several options for addressing this problem in MySQL.
One interesting way of doing this could be to check that the value contains a vowel.
However, if we just use the substring function as follows, we could be here a long time, because we need one condition for every character position. The following code shows how we would start:
Code:
SELECT firstname FROM `customer`
WHERE substring(firstname,1,1) not in ('A','E', 'I', 'O', 'U')
AND substring(firstname,2,1) not in ('A','E', 'I', 'O', 'U')
AND substring(firstname,3,1) not in ('A','E', 'I', 'O', 'U')
A better way would be to use the Locate function to see if a word contains a vowel. If the string is not found then locate returns zero.
Code:
SELECT firstname FROM `customer`
WHERE locate('A', firstname) = 0
and locate('E',firstname) = 0
and locate('I',firstname) = 0
and locate('O',firstname) = 0
and locate('U',firstname) = 0
Although I used locate, I could just as easily have used the function instr instead.
Remember, though, that the argument order is reversed: locate takes the pattern first, whereas instr takes the string to search first.
Code:
ie instr(<search string>, <pattern>)
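As a sketch, here is the locate query rewritten with instr. It uses the same customer table and firstname column, and only the argument order changes:

```sql
-- Same vowel check as the locate version, using instr(<search string>, <pattern>)
SELECT firstname FROM `customer`
WHERE instr(firstname, 'A') = 0
AND instr(firstname, 'E') = 0
AND instr(firstname, 'I') = 0
AND instr(firstname, 'O') = 0
AND instr(firstname, 'U') = 0;
```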
Yet another way would be to use NOT LIKE.
Code:
SELECT firstname FROM `customer`
WHERE firstname not like '%A%'
AND firstname not like '%E%'
AND firstname not like '%I%'
AND firstname not like '%O%'
AND firstname not like '%U%'
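Some of you may have realised there are flaws in using the above. What about case? (With a case-insensitive collation, which is the usual default, locate, instr and LIKE ignore case on ordinary text columns, but with a binary or case-sensitive collation they will not.) Maybe some entries just have initials? There could be acronyms, or even the word rhythm (no a, e, i, o or u)!
Perhaps a better way would be to make sure each character has an ASCII value in the expected range.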
For instance, if I used the following bit of code to test the first character of a field, I can bring back those records that start with a number (or have a non-standard character value).
Code:
select firstname, ascii(substring(firstname,1,1)), gender
from customer
where ascii(substring(firstname,1,1)) not between 65 and 122
and ascii(substring(firstname,1,1)) > 0
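If numbers are valid here, we can test for values between 48 and 122 instead. Be aware that both ranges also let some punctuation through: codes 91 to 96 sit between the upper- and lower-case letters, and codes 58 to 64 sit between the digits and 'A'. Here's a link to ascii table values: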
ascii table
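A tighter version of the first-character test might look like the sketch below. It assumes that only the unaccented letters A-Z and a-z are valid, so a name starting with an accented letter would be reported too. The alias first_code is just an illustrative name.

```sql
-- Report rows whose first character is not an unaccented letter.
SELECT firstname, ascii(substring(firstname,1,1)) AS first_code
FROM customer
-- ascii('') returns 0, so this condition skips empty strings
WHERE ascii(substring(firstname,1,1)) > 0
-- 65-90 is A-Z and 97-122 is a-z; anything else is reported
AND NOT (ascii(substring(firstname,1,1)) BETWEEN 65 AND 90
      OR ascii(substring(firstname,1,1)) BETWEEN 97 AND 122);
```

If digits are acceptable as well, adding a third range of 48 to 57 inside the brackets should cover them.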
Of course, the above code only tests the first character of the string.
In order to test every character of the string we need to iterate through a string's characters. I will show how to do this in my next entry.
Lastly, for advanced programmers, there is a great feature in MySQL: the ability to use regular expressions, i.e. regexp.
Again, this needs a fuller explanation than I have time for in this entry. However, there is a section devoted to it in the MySQL manual pages.
mysql manual regexp
Also, there are many sites that give tutorials on using regular expressions.
Here's a good one: regular-expressions
If you do decide to use regular expressions, remember that although they can be very powerful, they can also hit performance issues on large tables.
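As a taster, here is a minimal sketch using the same customer table. It brings back rows where firstname contains any character that is not a letter, so it checks the whole string rather than just the first character. It assumes plain unaccented letters only. Names containing spaces, hyphens, apostrophes or accents would be reported too, so widen the character class to suit your data.

```sql
-- [^A-Za-z] matches any single character that is not a letter,
-- so a row is returned if such a character appears anywhere in firstname
SELECT firstname FROM `customer`
WHERE firstname REGEXP '[^A-Za-z]';
```

To allow digits as well, the pattern could become '[^A-Za-z0-9]'.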
In conclusion, there are many ways of approaching this sort of task in MySQL. I hope the above notes have given you a good idea of how we can do this.