Aaron Francis looks into querying MySQL with a ST_Distance_Sphere clause — a function to calculate the linear distance in meters between two lon/lat points — in an efficient way. There’s a few things he does there:
Ordering records in a random order involves these operations:
Load all rows into memory matching your conditions
Assign a random value RANDOM() to each row in the database
Sort all the rows according to this random value
Retain only the desired number of records from all sorted records
His solution is to pre-add randomness to each record, in an extra column. For it he uses a the Geometric Datatype POINT type. In Postgres he then uses the following query that orders the records by distance measured against a new random point.
SELECT * FROM repositories ORDER BY randomness <-> point(0.753,0.294) LIMIT 3;
In MySQL you also have a POINT class(ever since MySQL 5.7.6) that you can use. However I don’t really see how that would work there, as you’d need to calculate the distance for each record using a call to ST_Distance:
SET @randomx = RAND();
SET @randomy = RAND();
SELECT *, ST_Distance(POINT(@randomx, @randomy), randomness) AS distance FROM repositories ORDER BY distance DESC LIMIT 0,3;
💁♂️ Using EXPLAIN on the query above verifies it doesn’t use an index, and thus goes over all records.
What I do see working instead, is use of a single float value to hold pre-randomness:
-- Add column + index
ALTER TABLE `repositories` ADD `randomness` FLOAT(17,16) UNSIGNED NOT NULL AFTER `randomness`;
ALTER TABLE `repositories` ADD INDEX(`randomness`);
-- Update existing records. New records should have this number pre-generated before inserting
UPDATE `repositories` SET randomness = RAND() WHERE 1;
With that column in place, you could then do something like this:
SET @randomnumber = RAND(); -- This number would typically be generated by your PHP code, and then be injected as a query param
SELECT * FROM repositories WHERE randomness < @randomnumber ORDER BY randomness DESC 0,3;
Unlike the query using POINT(), this last query will leverage the index created on the randomness column 🙂
There are plenty of times in my career when I’ve stored a boolean and later wished I’d had a timestamp. There are zero times when I’ve stored a timestamp and regretted that decision.
Hear hear! Over the years I’ve come to include 9 meta fields for most of the tables I create: added_at, added_by, added_ip, edited_at, edited_by, edited_ip, deleted_at, deleted_by, and deleted_ip. Handy for whenever you receive a phone call saying that a record has disappeared, allowing you to pinpoint it to a user and a specific timestamp.
One of the cool things about MySQL 5.7 is the fact that it sports a few spatial convenience functions (since 5.7.6), allowing one to do operations on geometric values.
One of those convenience functions is ST_Distance_Sphere, which allows one to calculate the (spherical) distance between two points.
Recently I was working an project where I needed said function, yet unfortunately they were running some version of MySQL 5.6 … on which ST_Distance_Sphere isn’t available.
Instead of requesting to upgrade the MySQL server, I decided to polyfillST_Distance_Sphere instead. It’s really easy since it’s a function, and it basically is nothing more than the Haversine Formula which I, as a mapping aficionado, know how to juggle.
DROP FUNCTION IF EXISTS `ST_Distance_Sphere`$$
CREATE FUNCTION `ST_Distance_Sphere` (point1 POINT, point2 POINT)
no sql deterministic
declare R INTEGER DEFAULT 6371000;
declare `φ1` float;
declare `φ2` float;
declare `Δφ` float;
declare `Δλ` float;
declare a float;
declare c float;
set `φ1` = radians(y(point1));
set `φ2` = radians(y(point2));
set `Δφ` = radians(y(point2) - y(point1));
set `Δλ` = radians(x(point2) - x(point1));
set a = sin(`Δφ` / 2) * sin(`Δφ` / 2) + cos(`φ1`) * cos(`φ2`) * sin(`Δλ` / 2) * sin(`Δλ` / 2);
set c = 2 * atan2(sqrt(a), sqrt(1-a));
return R * c;
Run the little snippet above on your server after having selected a database first. Once executed it’ll persistently be available (just like a Stored Procedure) for you to use. Usage is the same as if it were natively available:
At GitHub they use MySQL as their main datastore. The setup is a typical “single-writer-multiple-readers” design. They loadbalance between server pools using HAProxy, with some cleverness built in:
Instead [of checking whether a MySQL server is live with mysql-check], we make our HAProxy pools context aware. We let the backend MySQL hosts make an informed decision: “should I be included in a pool or should I not?”
The HAProxy config contains all known hosts. The hosts themselves can answer to checks running with either HTTP 200 (OK), HTTP 503 (Not OK), or HTTP 404 (Maintenance). Based upon those answers HAProxy will assess if there are enough hosts in the main mysql_ro_main pool to handle the load, and automatically switch to the failover mysql_ro_backup pool if need be.
Postgres served us well in the early days of Uber, but we ran into significant problems scaling Postgres with our growth. In this article, we’ll explore some of the drawbacks we found with Postgres and explain the decision to build Schemaless and other backend services on top of MySQL.