This page discusses the methods covered on HowtoSelectRandomRecords.
There will always be a bottleneck to performance somewhere. Understanding where the slowpoint in your system exists can help inform desisions you make about how to approach an issue like this.
Issues that vary between systems include:
The wide variability of environments and requirements will likely lead to different results for each approach.
The two criticism of this approach are
Again the usual warning: sorting by
random()will force a sequential scan on the records. It might slow the system down pretty fast when the amount of records grows larger.
The following has been copied over(mostly) from HowtoSelectRandomRecords.
The following is shamelessly lifted from this email by Jemery Kemper
Josef Pospisil wrote:
> I’m doing it like this:
>
>> questions = self.find_by_sql(
>> “select id from questions where area_id = ? and score = ?”,
>> area_id, score])
>> questions.sort! {rand(3) – 1}[1..count].collect do |q|
>> Question.find(q.id)
>> end
>
> So if you need only ids that collect thing is not necessary. If anyone
> have any comments on this code please mail me cause it’s one of the
> bottleneck for my app, and maybe I’m doing something wrong.
You can cut down the number of queries from N+1 to 2:
module Enumerable
def random_sample(n)
sort_by { rand }.slice(0, n)
end
end
class Question < <a href="http://wiki.rubyonrails.org/rails/pages/ActiveRecord" class="existingWikiWord">ActiveRecord</a>::Base
SELECT_IDS_BY_AREA_AND_SCORE = "select #{primary_key} from #{table_name} where area_id=? and score=?".freeze
def self.random_sample(area_id, score, n = 3)
area_id = area_id.id if area_id.is_a?(Area)
find(find_by_sql([SELECT_IDS_BY_AREA_AND_SCORE, area_id, score]).random_sample(n).map { |q| q.id })
end
end
Calling find with an array of ids will generate 1 "select * from questions where id in (?)" intead of N "select * from questions where id = ?".
This seems backwards to me.
Suppose that there are N records in the table and you need X of them at random. Of the two methods given:
I guess the assumption is that the second one still needs to scan the entire table, but 1) the others method does to (and they have to send the data to the client) and 2) I would expect any decent SQL server should be able to optimize its way out of sorting on a key that doesn’t depend on the table—that’s exactly the sort of thing that query optimization is all about.
— MarkusQTwo points. 1) selecting a list of primary keys has a much lower cost than selecting full( see below ) rows. 2) A simple(and naive) test confirms (for me, on my machine) that sorting the table can be an expensive operation.
Here are some numbers from my local machine that seem to support method two. This is my local (not production) windows machine with a (mostly) default mysql 4. (YMMV).:
mysql> SELECT id FROM documents d LIMIT 3; +-----+ | id | +-----+ | 501 | | 502 | | 503 | +-----+ 3 rows in set (0.00 sec) mysql> SELECT id FROM documents d ORDER BY RAND() LIMIT 3; +------+ | id | +------+ | 510 | | 1144 | | 507 | +------+ 3 rows in set (4.88 sec)
Note the different times. 0.00 vs. 4.88.
Investigating a little further we can see that using order by rand() in our select forces the whole table to be sorted.
mysql> explain SELECT * FROM documents d \G
*************************** 1. row ***************************
table: d
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 58
Extra:
1 row in set (0.00 sec)
mysql> explain SELECT * FROM documents d ORDER BY RAND() \G
*************************** 1. row ***************************
table: d
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 58
Extra: Using temporary; Using filesort
1 row in set (0.00 sec)
Apparently temporary and filesort are “the bad ones”.
Maybe I could optimize my table to avoid this problem, but all the same, its easier for me to fetch a full list of ids, grab a random sample and then retrieve those records.
Firstly, let me say that I’m glad we’re getting “data driven” on this question. It’s always a good idea to see if reality buys your theories before you try selling them.
My first response was “gosh, I was wrong”—but after thinking about it for a few minutes I’m not so sure. Your results do not really address the question, and after realizing this, I decided to perform my own test and came to a very different conclusion.
The problem is that your first query only returns data from the first three rows of the table, so of course it is fast. But what the first and second methods require is returning the ids from all the records, and letting the client select among them.
When I tried this on a reasonably large dataset (both queries would of course be very fast on small tables) I found the ORDER BY RAND() method to be over twice as fast as just the first query required by the other methods.
mysql> select post_id from nuke_bbposts;
+---------+
| post_id |
+---------+
| 1 |
| 2 |
| 3 |
...several thousand rows omitted
| 14622 |
| 14623 |
| 14624 |
| 14625 |
| 14626 |
+---------+
13477 rows in set (0.05 sec)
mysql> select post_id from nuke_bbposts order by rand() limit 3;
+---------+
| post_id |
+---------+
| 1623 |
| 2734 |
| 13006 |
+---------+
3 rows in set (0.02 sec)
This was done sitting on the server (as a web app often would be); if the database were on another machine I suspect that the discrepancy would be much larger. Likewise, if the table had more rows I would expect the third method (which, as you may have guessed, I strongly prefer) would do even better.
— MarkusQVolley … and return!
Good call on my poor example. Hopefully I can do better this round. Now testing on the production server (one of the TextDrive machines).
mysql> SELECT id FROM documents;
… just over a hundred rows omitted
| 276 |
| 277 |
| 278 |
| 279 |
| 280 |
——-
138 rows in set (0.00 sec)mysql> SELECT id FROM documents d ORDER BY RAND
I think the root of the discrepency is the structure and content of the tables at play. The table I’m using has (relatively) few records, but each record has alot of info (a few blobs). The large record size I guess makes the sort operation particularly expensive for me. Again, ymmv.
LeeOHow very, very odd.
I am aproching the end of my productive period (I’ve been working for just over 14 hours, with a break to play football (socer) with some 3..7 year-olds for a few hours). So I may not be thinking straight but:
RAND() in the chapter on functions).This has been fun and insightfull. I attribute the speed increase in my second set to the fact that it’s running on a bonafied, properly configured server(faster hardwardware, and tweaked os and db setup)
Now if we were to do really do this properly, we shoud be using the benchmarking and profiling scripts available in Rails 0.12 (see BenchmarkSuiteWish).
Just as a datapoint(s), I did the same on a Windows box with a dataset size of 13k records.
5 tries of each method yielded the following numbers:
select * from customers; => 62ms
select * from customers order by rand() limit 3; => 15 ms
MichaelC?
The first query should be
select id from customers, since only the IDs are required — not whole rows — LeeO
Why not the following?
mysql> SELECT count(*) FROM table1;
x = rand( answer_from query )
“SELECT * FROM table1 limit #{x},1”
That will get you 1 random record in 2 queries. 3 queries for 2 records, 4 for 3, etc. Of course, you chance duplication, but that can be handled on the ruby end quite easily.
-Matt L.
This appoach requires n+1 queries, and can potentially return duplicates. Consider the other approaches(see above) as they don’t suffer either shortcomming — LeeO
I’ve had immense success with a hybrid approach when using MySQL (5.0.15, in this case): using rand/limit to obtain only the primary keys, then using an additional query to fetch the associated records. I’m working with a database with over 500,000 records, and both approaches listed here were incredibly slow. So I modified the RandomSample code as follows:
module RandomSample
def self.append_features(base)
def base.random_sample(n = 1)
find(find_by_sql("select #{primary_key} from #{table_name} order
by rand() limit #{n}").map { |q| q.id })
end
end
end
This reduced the time to fetch a random sample from 15s with straight rand() and over 30s with RandomSample to 2s (on a dual 2GHz Opteron w\ 2GB RAM, at that)
— Bascule
Is it possible to add weighting into the mix? Let’s say, for example, that I need a random record pulled, but that I like to prefer records that are newer. Any thoughts?
—Joshua