I have a system which will return all users from the database and order the results by lowest distance from a reference zip code.
For example: User will come on the site, enter zip code and it will return him all other users who are nearest to his zip (ascending order)
How am i doing this now and why is it a problem ?
The system contains more than 30 million users and their zipcodes. I am retreiving all the users in a particular state and city (narrows the dataset down to about 10,000).
This is where the problem actually happens. Now, all the result sent by mysql (10,000) rows to PHP are sent to a zipcode calculator library which calculates this distance between the base zip code and user's zip code - 10,000 times. Then orders the result by the zip code nearest.
As you can see, this is very badly optimized code. And the 10,000 records are looped through twice. Not to mention the amount of RAM each httpd process takes just transferring data to and fro mysql.
What I would like to ask the gurus in here that is there anyway to optimize this ?
I have a few ideas of my own, but i'm not sure how efficient they are.
Try to do all the zipcode calculation and ordering in mysql itself and return the paginated number of rows. For this, i will need to move the distance between zipcode calculation logic to a stored procedure. This way I am preventing the processing of 10,000 records in PHP. However, there is still a problem. I would not need to calculate distance for zip codes which have already been calculated (for 2 users having the same zip code).
Secondly, how do i order rows in mysql using a stored procedure ?
What do you guys think ? Is this a good way ? Can i expect a performance boost using this ? Do you have any other suggestions ?
I know this question is huge, and i really appreciate the time you have taken to read till the end. I would really like to hear your thoughts about this.
As I'm not overly familiar with PHP or MySQL, I can only give some basic tips but they should help. This also assumes you have no direct way of interfacing with the zip library from MySQL.
First, as it's doubtful that you have 10k zip codes in a city, take your existing query and do something like
This will probably return a few dozen zip codes max, and no duplicates. Run this through your zip code library. That library itself is probably a source of slowness, as it has to look up the zip codes, and do a bunch of fancy trig to get actual distance. Take the results of this, and insert it into a temp table with just the zip code and the distance.
Once done with that list, have another query that gets the rest of the user data you want, and JOIN into the the temp table on zip code to get your distance.
This should give you quite a large speedup. You can do whatever paging you need in the second query after the results have been calculated. And no more looping through 10k rows.