Tags: promysql
Calculating distance with Mysql
By admin on Sep 30, 2009 | In Formula 1 - More advanced stuff, Showroom - Examples of Mysql usage | Send feedback »
After the recent posting of data in Wikileaks, it's claimed that here in the UK we can now freely access our longitude and latitude co-ordinates for each post code. However, some of the data in their files may well be deliberately corrupted - you may be infringing copyright law by using that data, and the mistakes will prove that you've used illegal data. Free the Postcode is perhaps a better place to get free data - http://www.freethepostcode.org/
Now, it's possible to calculate distances between each point using the following calculation (originally from the fine book Pro Mysql):
The distance d between two points (x1,y1) and (x2,y2) can be calculated from the following equation (so long as x values are latitude and y values are longitude in radians, r is the radius of the sphere which is 3956 in miles):
Code:
d= acos(sin(x1)sin(x2)+cos(x1)cos(x2)*cos(y2-y1)) * r |
Initially I downloaded relevant postcode data file onto my Linux system. I then (by using regex in egrep) selected those post codes beginning NG and a single digit (Nottingham - I'm not interested in loading every single postcode at this stage) and use the shell command cut to extract only those fields I'm after. In this case it's a comma-delimited file so we use the -d switch to let the function cut know that commas are the delimiter.
Code:
egrep 'NG[0-9]+ [0-9]' freepostcodesfile | cut -d , -f 1,14,15 > ng.txt |
I did the same for OX postcodes so I had some data in two different towns to play around with.
Prior to inserting the postcode data I used the following script to create the table (I'll explain the last three columns a little later).
Code:
CREATE TABLE postcodes ( postcode CHAR(8) NOT NULL | |
, longitude DECIMAL(9,6) NOT NULL | |
, latitude DECIMAL(9,6) NOT NULL | |
, rad_xaxis DECIMAL(9,6) | |
, rad_yaxis DECIMAL(9,6) | |
, rad_zaxis DECIMAL(9,6) | |
, PRIMARY KEY(postcode)); |
To insert the data into my table I used :
LOAD DATA LOCAL INFILE '/home/usname/gis/ng.txt' INTO TABLE postcodes FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' (postcode, longitude, latitude);
and again for the ox file (Oxford postcodes)
LOAD DATA LOCAL INFILE '/home/usname/gis/ox.txt' INTO TABLE postcodes FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n' (postcode, longitude, latitude);
After putting our data into a table we can then use the trig functions available in Mysql to calculate distances between two points. However we have to be aware of time-consuming functions, that make the selection of records from the table where the distance is below a certain figure (ie after the calculation of all relevant records) can be very cumbersome. It's better to precompute and store much of this in the database table where we're recording the postcodes. To start with we need to change the latitude and longitude figures given from degrees into radians. This is done using the following formula:
radians= degrees*(PI/180)
Luckily we can obtain PI in Mysql by using PI()
What I will do here is precompute and populate 3 extra fields for each record (in radians) and store them as rad_xaxis, rad_yaxis and rad_zaxis, as per the following definition:
xaxis = cos(radians(Lat)) * cos(radians(Lon))
yaxis = cos(radians(Lat)) * sin(radians(Lon))
zaxis = sin(radians(Lat))
So we update our calculated fields as per the following:
Code:
update postcodes set rad_xaxis | |
= cos((latitude * (PI() /180))) * cos((longitude * (PI() /180))); | |
update postcodes set rad_yaxis | |
= cos((latitude * (PI() /180))) * sin((longitude * (PI() /180))); | |
update postcodes set rad_zaxis | |
= sin((latitude * (PI() /180))); |
The distance between points can then be calculated using SQL loosely like (acos( xaxis * $xaxis + yaxis * $yaxis + zaxis * $zaxis ) * 3956) (where those starting with a $ are precomputed for the start point in question in the same manner as above)
So to get my distance between say the post code OX1 1AB (central Oxford) and those postcodes starting 'OX4 3Y%'
Code:
select b.postcode, acos(b.rad_xaxis * a.rad_xaxis + b.rad_yaxis * a.rad_yaxis + b.rad_zaxis * a.rad_zaxis) * 3956 AS distance | |
from postcodes a, postcodes b | |
where a.postcode = 'OX1 1AB' | |
and b.postcode like 'OX4 3Y%' | |
and b.latitude > 0; |
Note I don't want to calculate where the file concerned didn't have data, so add to the where clause the restriction that the latitude figure is greater than 0.
I have done some calculations using this formula (for example where the fixed post code is NG7 4ER in Nottingham), and it seems to be more accurate over greater distances - I'm not sure that you could trust it for under 5 miles. Even so it's very responsive and you can add different parts of the UK and have a distance calculator for different areas easily available. It should be remembered that as the Earth is not a perfect sphere, our calculations here will not be totally accurate, but we seem to be within 200 yards for the distance between Oxford and Nottingham (as the crow flies).
Now it's often the case that you won't want all the distances between lots of random postcodes (even within your home city). However, suppose you want to bring back all supermarkets (or takeaway shops or pubs etc) within 6 miles of your designated postcode? So long as you have the supermarkets recorded in a table along with the correct post code (make sure the format is exactly the same as the postcode table) then you can use the following statement:
Code:
select b.postcode, acos(b.rad_xaxis * a.rad_xaxis | |
+ b.rad_yaxis * a.rad_yaxis | |
+ b.rad_zaxis * a.rad_zaxis) * 3956 AS distance, | |
s.name, s.address | |
from postcodes a, postcodes b, shops s | |
where b.postcode = s.postcode and a.postcode = 'OX1 1AB' | |
and b.latitude > 0 | |
and s.shoptype = 'Supermarket' | |
and acos(b.rad_xaxis *a.rad_xaxis | |
+ b.rad_yaxis * a.rad_yaxis | |
+ b.rad_zaxis * a.rad_zaxis ) * 3956 <= 6 order by distance \G |
This brought back the following records (those records I've recorded in my shops database table - in real life obviously there are others within the specified distance):
******* 1. row *******
postcode: OX2 7BY
distance: 2.66911298653386
name: Midcounties Co-Op
address: 228-240 Banbury Road, Oxford
******* 2. row *******
postcode: OX4 3XQ
distance: 3.50779225250805
name: Midcounties Co-Op
address: 21 Templars Square, Cowley
******* 3. row *******
postcode: OX3 8RA
distance: 3.88607301210902
name: Midcounties Co-Op
address: Atkyns Road, Headington
******* 4. row *******
postcode: OX4 6XJ
distance: 4.67029242514653
name: Tesco
address: Oxford Retail Park, Cowley
4 rows in set (0.01 sec)
How good is that? I'll have to round the figures though - it doesn't make any sense to calculate to 15 decimal points! Hope you get as much enjoyment from using this technique as I do. Until next time - Enjoy your Mysql coding!
Making queries more efficient (3) - The importance of indexes
By admin on Apr 10, 2009 | In Formula 1 - More advanced stuff, Highway Maintenance - DBA stuff | Send feedback »
It's often said that poor use of indexes (or indeed a missing index) is the fastest way to kill a system. An index works by providing a pre-sorted list that the query can access, instead of having to perform a large sort in memory. It must be remembered that indexes come at a cost though, mainly in using up disk storage space (and also in increasing INSERT and UPDATE times). One important thing to note when creating an index, is that it's preferable to minimize the size of indexed fields. An index on a small int field is far better than an index say, on a varchar(100) field. If it's necessary to index a varchar field, consider limiting the index to the first x places, for example:
Code:
CREATE INDEX addr_line_indx ON members (addr_line1(15)); |
The index generally works on a direct WHERE condition for example where colA = 123, or when looking for a min(field) or max(field) value. The index can also be used for LIKE comparisons, so long as the comparison string does not start with a wildcard. So, using "WHERE fielda LIKE 'FARSC%'" will use an index on that field, the following won't: "WHERE fielda LIKE '%FARSCAPE%'". To read up more on how Mysql uses indexes:
Another factor to consider in putting an index on a field, is that of selectivity. If there are too many data records of the same value (eg if the field records gender, 'M' or 'F') then Mysql will find it quicker to load all the records into memory and scan through them, instead of using the index. The Mysql database keeps statistics on table indexes so it can make an informed guess as to when to use an index or not. So, it only makes sense to use indexes where the field has a high selectivty ie the data set contains mostly or entirely unique values.
One thing to remember in creating an index is it doesn't have to restricted to just one column. For example, if you query often on two columns firstname and lastname, you can create a multi-column (or composite) index:
Code:
CREATE INDEX name_idx ON members (lastname, firstname) |
When you use a composite index you have to remember that the index is only used on the second field mentioned, when the WHERE clause includes the first field. It may be necessary in some cases to create a further separate index on that second field, when you know queries will only specify that second field in the WHERE clause. Also, if you use OR (ie WHERE lastname = 'SMITH' OR firstname = 'PETE') then the index is ignored. You can read more about multiple indexes here:
mysql manual on multiple indexes
Sometimes an index is not used in a query because a function or conversion is carried out on the field.
If you can isolate an indexed field on the left side of a WHERE condition, you can then persuade Mysql to use the index in the query. The best example of this is something I found in the Book Pro Mysql (Kruckenberg and Pipes):
Bad performance:
Code:
SELECT * FROM customer_orders | |
WHERE TO_DAYS(order_created) - TO_DAYS(NOW()) <= 7; |
Good performance (the index on order_created field will be used):
Code:
SELECT * FROM customer_orders | |
WHERE order_created >= DATE_SUB(NOW(), INTERVAL 7 DAY); |
There will be cases however where you can't perform a search on a field without using a function or conversion of that field. In these cases, if the query is needed often, it may be more efficient to create an additional column containing the calculated value of that column (always assuming that the extra storage space isn't an issue). For example (again from the ProMysql book), the following SQL will not use an index, even if one exists on the email_address column:
Code:
SELECT * FROM customers | |
WHERE email_address LIKE '%aol.com'; |
The workaround for this is to create an additional column containing the reverse of the email address, and then index that column. We can then utilise the index by means of the following query:
Code:
SELECT * FROM customers | |
WHERE email_address_reversed | |
LIKE CONCAT(REVERSE('%aol.com'), '%'); |
Of course, as well as the additional disk space required for the additional column and index, we also need to ensure that the email_address_reversed field is populated regularly. This could be via the use of triggers on updates and inserts. Or you could run a table update routine at regular intervals. It all depends on how often you need the query to run efficiently.
There's an important consideration I'm only going to touch upon in this article. Sometimes you will need to persuade mysql to use the right index. Although it will normally guess correctly via looking at the query and analysing statistics, there are cases where it chooses the inefficient option. This subject is too large to include in this article, but I can recommend you read the following link showing 7 ways to convince mysql to use the right index
7-ways-to-convince-mysql-to-use-the-right-index
One last thing that may make your queries run faster. If your table has had a lot of deletes, inserts and updates, it may well be worth using the command OPTIMIZE TABLE, for example
Code:
OPTIMIZE TABLE large_order_table; |
This will effectively de-fragment a mysql database ie it will reduce the size of the table on disk and reduce query response time. It also keeps the index statistics current which helps Mysql decide when to use a particular index on a column.
Well, I feel I've shown a lot of different ways in how to optimize your use of indexes in Mysql. Please feel free to leave comments if you find this information of use (or where you disagree!).
Until next time - Happy coding!