Thread: Select count(*) takes a long time
Hello, Forgive me if this is the wrong forum for such a question. What can I do to make "select count(*) from ..." return quickly? I think I had read that PostgreSQL, unlike most RDBMSs, does not store the current row count and so must be re-calculated on the fly. In my particular situation I have a home page that must select a "featured" article by choosing one at random from a table of 300 thousand or so. Doing select count(*) takes a few seconds. I could probably select all the rows and get the data in the same time. I thought I came up with a good solution, now I "select ... from ... order by random() limit 1", which is nice because it only requires one query to get what I want but it's still slow. I now cache the "featured" article because this query is too slow to make during a web page request. I have indexes on all the columns referenced by the where clause. Am I missing something obvious or does everyone just deal with slow "select count(*)" when using PostgreSQL? Thanks, Jeff PS: I'm running PostgreSQL 7.1.2 on FreeBSD 4.3.
"Jeff Johnson" <jeff@jeffjohnson.net> writes: > think I had read that PostgreSQL, unlike most RDBMSs, does not store > the current row count and so must be re-calculated on the fly. Correct. This is one of the downsides of MVCC: there is no unique row count, in general, so not much point in trying to keep track of it. > particular situation I have a home page that must select a "featured" > article by choosing one at random from a table of 300 thousand or so. Doesn't seem like count(*) is an essential component of a solution to this problem. What are the available article identifiers and indexes? For example, if you had a timestamp column with an index, you could define "a random article" as "the first one after a randomly chosen time", which could be retrieved efficiently with select * from articles where timestamp >= 'targeted time' limit 1; The target time could be chosen as some random fraction between the start of your database and now(). This'd be skewed by variations in the rate of article posting, but it'd probably do for your purposes. If there is a serial number column then it's even easier, since the range of article numbers is from 1 to the sequence's last_value. > I thought I came up with a good solution, now I "select ... from ... > order by random() limit 1", which is nice because it only requires one > query to get what I want but it's still slow. That is most definitely *not* going to be fast, since it requires an explicit sort of all the rows. regards, tom lane
> Doesn't seem like count(*) is an essential component of a solution to > this problem. What are the available article identifiers and indexes? > For example, if you had a timestamp column with an index, you could > define "a random article" as "the first one after a randomly chosen > time", which could be retrieved efficiently with > > select * from articles where timestamp >= 'targeted time' limit 1; > > The target time could be chosen as some random fraction between the > start of your database and now(). This'd be skewed by variations in > the rate of article posting, but it'd probably do for your purposes. > > If there is a serial number column then it's even easier, since the > range of article numbers is from 1 to the sequence's last_value. I think that might work well for me, thanks! > > I thought I came up with a good solution, now I "select ... from ... > > order by random() limit 1", which is nice because it only > requires one > > query to get what I want but it's still slow. > > That is most definitely *not* going to be fast, since it requires an > explicit sort of all the rows. Good point :) Thanks, Jeff
"Jeff Johnson" wrote: >order by random() This should be forbidden by law :-) -- René Pijlman <rpijlman@spamcop.net>