Thread: shouldn't postgres know the numer of rows in a (sorted) result-set before returning the first row?

hi,

i have some system where i show pages results on a web-page - the query
that returns the paged result looks like this:

(table has a few hundred thousand rows, result-set is ~30000)

a) select asset.asset_id, asset.found_time from asset.asset WHERE
found_time > 1130926914 AND pool_id in (1,2,3,4) AND asset.status IS
NULL order by found_time desc LIMIT 50 OFFSET 0
this query returns data in 0.064secs.

if i now want to display the pure number of documents that this query
would generate without the limit clase i would do:

b) select count(asset.asset_id) from asset.asset WHERE found_time >
1130926914 AND pool_id in (1,2,3,4) AND asset.status IS NULL
this query takes > 6 seconds!

i understand that postgres has to read every row from the heap to make
sure that they are all still valid and count. but from my understanding
query (a) would have something like an uncorrected count (somewhere
internally) for the whole query as it has to performed an "order by" on
the result-set before returning the first row.

i would be interested in getting this uncorrected count "after sort"
but  "before first row" in query (a). so in a fresh DB with no
updates/deletes this would be the correct count, and i could avoid the
very expensive  (b).

i'd like to hack that feature into my local portgres, i'm not asking
for inclusion in the official postgres, but could someone direct me if
my idea is feasable and where to look in the code (8.1)?

regards,
thies



Re: shouldn't postgres know the numer of rows in a (sorted)

From
Richard Huxton
Date:
Thies C Arntzen wrote:
> i would be interested in getting this uncorrected count "after sort"
> but  "before first row" in query (a). so in a fresh DB with no
> updates/deletes this would be the correct count, and i could avoid the
> very expensive  (b).

You don't say what applicaton language you are using, but most offer a
pg_num_rows() interface which tells you how many results are in the
recordset you have fetched.

Your best bet to learn more is to read whatever documentation comes with
your client library.

--
   Richard Huxton
   Archonet Ltd


Am 16.11.2005 um 14:07 schrieb Richard Huxton:

You don't say what applicaton language you are using, but most offer a pg_num_rows() interface which tells you how many results are in the recordset you have fetched.


my query uses LIMIT and OFFSET - so pg_num_rows will return what i specify in LIMIT (or less). that's not the count i was asking for.

re, thies

Re: shouldn't postgres know the numer of rows in a (sorted)

From
Richard Huxton
Date:
Thies C. Arntzen wrote:
>
> Am 16.11.2005 um 14:07 schrieb Richard Huxton:
>
>> You don't say what applicaton language you are using, but most  offer
>> a pg_num_rows() interface which tells you how many results  are in the
>> recordset you have fetched.
>
>
> my query uses LIMIT and OFFSET - so pg_num_rows will return what i
> specify in LIMIT (or less). that's not the count i was asking for.

Ah - apologies, I didn't read your post closely enough.

I think the answer then is "no". In some cases PG can short-circuit the
query and stop once 50 are fetched, which means it doesn't always know.

With your query I'm not sure whether it can or not. Your timings however
suggest that this is what is happening, otherwise both queries would
take approximately the same amount of time.

One thing I have noticed though, is that the sort-order of your query
might not be well defined.

select asset.asset_id, asset.found_time from asset.asset WHERE
found_time > 1130926914 AND pool_id in (1,2,3,4) AND asset.status IS
NULL order by found_time desc LIMIT 50 OFFSET 0

Unless found_time is unique then you might get different results on two
queries (since asset_id ordering is undefined).

--
   Richard Huxton
   Archonet Ltd

On Wed, Nov 16, 2005 at 01:23:08PM +0100, Thies C Arntzen wrote:
> hi,
>
> i have some system where i show pages results on a web-page - the query
> that returns the paged result looks like this:
>
> (table has a few hundred thousand rows, result-set is ~30000)
>
> a) select asset.asset_id, asset.found_time from asset.asset WHERE
> found_time > 1130926914 AND pool_id in (1,2,3,4) AND asset.status IS
> NULL order by found_time desc LIMIT 50 OFFSET 0
> this query returns data in 0.064secs.
>
> if i now want to display the pure number of documents that this query
> would generate without the limit clase i would do:
>
> b) select count(asset.asset_id) from asset.asset WHERE found_time >
> 1130926914 AND pool_id in (1,2,3,4) AND asset.status IS NULL
> this query takes > 6 seconds!

Umm, the first query doesn't calculate all the output nor does it even
have an estimate of it. Why do you think it does?

> i understand that postgres has to read every row from the heap to make
> sure that they are all still valid and count. but from my understanding
> query (a) would have something like an uncorrected count (somewhere
> internally) for the whole query as it has to performed an "order by" on
> the result-set before returning the first row.

Not if you have an index on "found_time". In that case it can return
the top 50 without even looking at most of the table. That's what
indexes are for. The only estimate it has is the one in EXPLAIN, and it
can find that without running the query at all.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment

Am 16.11.2005 um 14:49 schrieb Martijn van Oosterhout:


i understand that postgres has to read every row from the heap to make 

sure that they are all still valid and count. but from my understanding 

query (a) would have something like an uncorrected count (somewhere 

internally) for the whole query as it has to performed an "order by" on 

the result-set before returning the first row.


Not if you have an index on "found_time". In that case it can return

the top 50 without even looking at most of the table. That's what

indexes are for. The only estimate it has is the one in EXPLAIN, and it

can find that without running the query at all.


hey martijn,

my question is more in the line of


whereby my special case is all about beeing able to provide an [possible inaccuate] count for a query if possible: my understanding is that would be the case if the "where clase" and the "order by" clause have been satisfied from the indices and the only step left is to validate the records in the result by reading them from the heap. 

and -again- i'm not asking for a new feature but i'd like to play with it and am asking for hackers advice;-)

what am i missing?
re, thies

On Wed, Nov 16, 2005 at 03:33:10PM +0100, Thies C. Arntzen wrote:
> my question is more in the line of
>
> http://archives.postgresql.org/pgsql-hackers/2005-01/msg00247.php
>
> whereby my special case is all about beeing able to provide an
> [possible inaccuate] count for a query if possible: my understanding
> is that would be the case if the "where clase" and the "order by"
> clause have been satisfied from the indices and the only step left is
> to validate the records in the result by reading them from the heap.

The problem is that in the index scan you indicated, it goes no further
through the index than necessary to produce your answer. And it can't
satisfy the where clause from the index (unless those columns are in
your index, you don't say).

The logic (very simplified) basically goes:

1. Get next entry in index
2. Does entry match, if not goto 1
3. Extract matching tuple from heap
4. Check visibility and where clause
5. If not match, goto 1
6. Return this tuple
7. Have we returned 50 rows yet, if not goto 1
8. finish

As you can see, when you get to 8 you have no idea how much of the
index you scanned and no idea how much of the table you scanned. You
really have *no* idea how many more there might be. For example, say
step 1 generated 300 tuples, step 2 passed 200 of them and step 4
passed 50 of those (which it returned). How many tuples will the query
return in the end?

It's your assumption that we actually examine more of the index than
necessary that's wrong.

> and -again- i'm not asking for a new feature but i'd like to play
> with it and am asking for hackers advice;-)

Get the result from EXPLAIN, it's about as good as any other estimate we
can produce...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Attachment