Thread: Re: [HACKERS] Much Ado About COUNT(*)

Re: [HACKERS] Much Ado About COUNT(*)

From: David Garamond
Merlin Moncure wrote:
> 6.  for large tables, you can get a pretty accurate count by doing:
> select count(*) * 10 from t where random() > .9;
> on my setup, this shaved about 15% off of the counting time...YMMV.

That's an interesting idea, using sampling to get an estimate.

Thanks for the tip.

--
dave

Re: [HACKERS] Much Ado About COUNT(*)

From: Greg Stark
David Garamond <lists@zara.6.isreserved.com> writes:

> Merlin Moncure wrote:
> > 6.  for large tables, you can get a pretty accurate count by doing:
> > select count(*) * 10 from t where random() > .9;
> > on my setup, this shaved about 15% off of the counting time...YMMV.
>
> That's an interesting idea, using sampling to get an estimate.

It's an interesting idea, but this particular implementation isn't going to
save any time. It still has to read every record, only now it also has to
spend extra time evaluating random() and doing the arithmetic.

In order for sampling to speed things up you would have to use an index to
actually reduce the number of records read.

The database could be clever and implement the same kind of sampling vacuum
does. That picks a random sampling of pages from the table without using an
index. But there's no way to implement the same kind of behaviour from the
user-visible features.
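
For what it's worth, if a rough number is all that's needed, the planner's own
statistics already provide one essentially for free. A minimal sketch (assuming
the table is named "t" and has been ANALYZEd reasonably recently):

  -- reltuples is the planner's row estimate, kept up to date by VACUUM/ANALYZE
  SELECT reltuples::bigint AS approx_rows
    FROM pg_class
   WHERE relname = 't';

It's only as fresh as the last ANALYZE, but it doesn't read the table at all.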

--
greg

Re: [HACKERS] Much Ado About COUNT(*)

From: Csaba Nagy
[snip]
> The database could be clever and implement the same kind of sampling vacuum
> does. That picks a random sampling of pages from the table without using an
> index. But there's no way to implement the same kind of behaviour from the
> user-visible features.
... meaning perhaps a new keyword accepted by SELECT, something like
"SAMPLE 1000", which would mean sampling records at a 1:1000 ratio? That would
simplify (and probably speed up a lot) some estimating queries...

Cheers,
Csaba.



Re: [HACKERS] Much Ado About COUNT(*)

From: Greg Stark
Csaba Nagy <nagy@ecircle-ag.com> writes:

> [snip]
> > The database could be clever and implement the same kind of sampling vacuum
> > does. That picks a random sampling of pages from the table without using an
> > index. But there's no way to implement the same kind of behaviour from the
> > user-visible features.
> ... meaning perhaps a new keyword accepted by SELECT, something like
> "SAMPLE 1000" ? Which would mean sample records in a 1:1000 ratio ?
> Would simplify (and probably speed up a lot) some estimating queries...

See:

  http://www.jlcomp.demon.co.uk/faq/random.html

I think the Oracle syntax looks like

  SELECT * FROM foo SAMPLE (0.1)

I don't think I would have picked this syntax but it seems like a better idea
to copy the existing practice rather than invent a new one.

There are some details, like what to do when there's a WHERE clause or joins.
Oracle disallows joins entirely and I'm unclear what the best thing to do
about where clauses would be.

--
greg

Re: [HACKERS] Much Ado About COUNT(*)

From: Csaba Nagy
[snip]
> See:
>
>   http://www.jlcomp.demon.co.uk/faq/random.html
>
> I think the Oracle syntax looks like
>
>   SELECT * FROM foo SAMPLE (0.1)
>
> I don't think I would have picked this syntax but it seems like a better idea
> to copy the existing practice rather than invent a new one.
>
> There are some details, like what to do when there's a WHERE clause or joins.
> Oracle disallows joins entirely and I'm unclear what the best thing to do
> about where clauses would be.

The where clauses could be applied exactly as for a normal select, with
the sampling being just a pre-filtering condition for what rows to
consider.

If there were a way to specify the table on which to apply the sampling, then
the whole construct could be replaced automatically by the inline view the
Oracle link recommends.
I doubt there would be any benefit in sampling more than one table in a query;
it should be enough to sample the biggest table and join the result with the
others. Sampling is only useful for really big tables anyway.

So the syntax above could be extended to:

SELECT * FROM foo SAMPLE (foo, 0.1)

and:

SELECT foo.*, bar.*
  FROM foo, bar
  WHERE foo.key = bar.key
  SAMPLE (foo, 0.1)

which means sample foo and join the result with bar.
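
For comparison, here is roughly what such a construct would have to desugar to
with plain SQL today (a sketch for a 0.1% sample; note that it still reads all
of foo, so it only saves work in the join, not in the scan itself):

  SELECT f.*, bar.*
    FROM (SELECT * FROM foo WHERE random() < 0.001) AS f
    JOIN bar ON f.key = bar.key;

The point of a real SAMPLE clause would be to let the executor skip most of
foo's pages instead of filtering them after the fact.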

All this makes sense from a user point of view; I wonder how big a PITA it
would be to implement...

Cheers,
Csaba.




Re: [HACKERS] Much Ado About COUNT(*)

From: Wes
On 1/13/05 9:50 AM, "Greg Stark" <gsstark@mit.edu> wrote:

> I think the Oracle syntax looks like
>
>   SELECT * FROM foo SAMPLE (0.1)
>
> I don't think I would have picked this syntax but it seems like a better idea
> to copy the existing practice rather than invent a new one.

Of course, in Oracle 'count(*)' is instantaneous.  It doesn't have to count
the physical records one by one.

Wes



Re: [HACKERS] Much Ado About COUNT(*)

From: Greg Stark
Wes <wespvp@syntegra.com> writes:

> On 1/13/05 9:50 AM, "Greg Stark" <gsstark@mit.edu> wrote:
>
> Of course, in Oracle 'count(*)' is instantaneous.  It doesn't have to count
> the physical records one by one.

That's simply false. Oracle does indeed have to count the records one by one.

It doesn't have to read and ignore the dead records since they're in a
separate place (but on the other hand it sometimes has to go read that
separate place when it sees records that were committed after your
transaction).

It can also do index-only scans, which often helps, but it's still not
instantaneous.

--
greg

Re: [HACKERS] Much Ado About COUNT(*)

From: Wes
On 1/13/05 6:44 PM, "Greg Stark" <gsstark@mit.edu> wrote:

> That's simply false. Oracle does indeed have to count the records one by one.
>
> It doesn't have to read and ignore the dead records since they're in a
> separate place (but on the other hand it sometimes have to go read that
> separate place when it sees records that were committed after your
> transaction).
>
> It can also do index-only scans, which often helps, but it's still not
> instantaneous.

Ok, I stand corrected - I was given some wrong information.  However, my
experience has been that count(*) on Oracle is a whole lot faster than
PostgreSQL - what appeared instantaneous on Oracle took some time on
PostgreSQL.  That was one of the first things I noticed when moving a
database application to PostgreSQL.  I've since disposed of the Oracle
database, so can't go back and retest.

Wes



Re: [HACKERS] Much Ado About COUNT(*)

From: Greg Stark
Wes <wespvp@syntegra.com> writes:

> On 1/13/05 6:44 PM, "Greg Stark" <gsstark@mit.edu> wrote:
>
> > That's simply false. Oracle does indeed have to count the records one by one.

> Ok, I stand corrected - I was given some wrong information.  However, my
> experience has been that count(*) on Oracle is a whole lot faster than
> PostgreSQL - what appeared instantaneous on Oracle took some time on
> PostgreSQL.  That was one of the first things I noticed when moving a
> database application to PostgreSQL.  I've since disposed of the Oracle
> database, so can't go back and retest.

If it was instantaneous then the data must have all been in cache. A lot of
Oracle kudos really come down to the fact that Oracle is often run on beefier
machines than others.

But if it was merely 2x as fast or so, more if the table was really wide, then
it could have just been because of the fast index-only scan.

If it was more than 2-4x as fast for a narrow table and you don't think the
whole thing was in cache then I would start to wonder about whether your
postgres table suffered from bloat from not having vacuum run frequently
enough or having the fsm settings too low.
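
Two quick sanity checks along those lines (a sketch; "foo" stands in for your
table name):

  -- compare physical size against the planner's row estimate to spot bloat
  SELECT relpages, reltuples FROM pg_class WHERE relname = 'foo';

  -- and let vacuum report dead-tuple and page details for the table
  VACUUM VERBOSE foo;

If relpages is much larger than the row count warrants, the table likely needs
more frequent vacuuming or higher fsm settings.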

--
greg

Re: [HACKERS] Much Ado About COUNT(*)

From: "Frank D. Engel, Jr."

This is probably stupid for some reason, but why not use a 64-bit
integer to track the number of records in the table? Increment when
adding records, decrement when deleting them... then COUNT(*) could
just return that in cases where a query is known to be looking at all
of the records?

On Jan 14, 2005, at 12:04 PM, Wes wrote:

> On 1/13/05 6:44 PM, "Greg Stark" <gsstark@mit.edu> wrote:
>
>> That's simply false. Oracle does indeed have to count the records one
>> by one.
>>
>> It doesn't have to read and ignore the dead records since they're in a
>> separate place (but on the other hand it sometimes have to go read
>> that
>> separate place when it sees records that were committed after your
>> transaction).
>>
>> It can also do index-only scans, which often helps, but it's still not
>> instantaneous.
>
> Ok, I stand corrected - I was given some wrong information.  However,
> my
> experience has been that count(*) on Oracle is a whole lot faster than
> PostgreSQL - what appeared instantaneous on Oracle took some time on
> PostgreSQL.  That was one of the first things I noticed when moving a
> database application to PostgreSQL.  I've since disposed of the Oracle
> database, so can't go back and retest.
>
> Wes




Re: [HACKERS] Much Ado About COUNT(*)

From: Richard Huxton
Frank D. Engel, Jr. wrote:
> This is probably stupid for some reason, but why not use a 64-bit
> integer to track the number of records in the table? Increment when
> adding records, decrement when deleting them... then COUNT(*) could just
> return that in cases where a query is known to be looking at all of the
> records?

Check the list archives for details, but you need to consider multiple
backends inserting/deleting concurrently. What you need is a separate
little table where you can log your transaction-id and the number of rows
added/removed; then you can figure out how many rows there are from different
viewpoints.
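
A minimal sketch of that idea (the table name and column are invented): each
transaction that adds or removes rows also logs its net delta, and since the
log rows are themselves subject to MVCC, each backend's sum matches its own
snapshot:

  CREATE TABLE foo_rowcount_log (delta bigint NOT NULL);

  -- a transaction that bulk-loaded 1000 rows into foo also does:
  INSERT INTO foo_rowcount_log VALUES (1000);

  -- "how many rows does foo have, from my point of view?"
  SELECT sum(delta) AS total_rows FROM foo_rowcount_log;

Richard's description also logs the transaction id alongside the delta; the
sketch leaves that out for brevity.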

--
   Richard Huxton
   Archonet Ltd

Re: [HACKERS] Much Ado About COUNT(*)

From: Martijn van Oosterhout
On Fri, Jan 14, 2005 at 12:39:04PM -0500, Frank D. Engel, Jr. wrote:
> This is probably stupid for some reason, but why not use a 64-bit
> integer to track the number of records in the table? Increment when
> adding records, decrement when deleting them... then COUNT(*) could
> just return that in cases where a query is known to be looking at all
> of the records?

Because there is no single value for count(*): if you're in a transaction
that has added records, it will be bigger than in a transaction that hasn't.
How does your integer deal with this?

The usual solutions to this involve locking, which is precisely what MVCC is
designed to avoid.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/


Re: [HACKERS] Much Ado About COUNT(*)

From: "Frank D. Engel, Jr."

Yep, that could cause problems.  Okay, now I'm joining the program.

The only fix I can see for the integer approach would be to keep tracking the
number of 'committed' records with it, and then, for each new transaction,
make a copy of the integer in order to remember its "place", plus an
additional integer to track that transaction's own offset (how many rows it
has added or removed) so it can be correctly applied at commit time.  It's
probably too messy to be worthwhile, though; more trouble than it would be
worth.

On Jan 14, 2005, at 1:38 PM, Martijn van Oosterhout wrote:

> On Fri, Jan 14, 2005 at 12:39:04PM -0500, Frank D. Engel, Jr. wrote:
>> This is probably stupid for some reason, but why not use a 64-bit
>> integer to track the number of records in the table? Increment when
>> adding records, decrement when deleting them... then COUNT(*) could
>> just return that in cases where a query is known to be looking at all
>> of the records?
>
> Because there is no single value for count(*), if you're in a
> transaction that has added records it will be bigger than in a
> transaction that hasn't. How does your integer deal with this?
>
> The usual solutions this involve locking, which is precisely what MVCC
> is designed to avoid.
>
> Hope this helps,





MOVE

From: PFC
Hello,

Here I'm implementing session management, with a connections table partitioned
between active and archived connections. A connection represents a connection
between a user and a chatroom.

I use partitioning for performance reasons.

The active table contains all the data for the active session: user_id,
chatroom_id, session start time, and other information.
The archive table contains just the user_id, chatroom_id, and session start
and end times, for logging purposes, and for displaying on the site which user
was logged into which chatroom and from when to when.

Thus, when a user disconnects from a chatroom, I must move one row from the
active table to the archive table. This poses no problem, as there is a UNIQUE
index on (user_id, chatroom_id): I select the row FOR UPDATE, insert it into
the archive table, then delete it.

Now, when a user logs out from the site, or when his session is purged by the
auto-expiration cron job, I must also expire ALL his open chatroom connections:

INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
DELETE FROM active WHERE user_id = ...;

Now, if the user inserts a connection between the two queries above, the
thing will fail (the connection will just be deleted). I know that there are
many ways to do it right:
- LOCK the table in exclusive mode
- use an additional primary key on the active table which is not related to
  the user_id and the chatroom_id, select the id's of the sessions to expire
  in a temporary table, and use that
- use an extra field in the table to mark that the rows are being processed
- use transaction isolation level SERIALIZABLE

However, all these methods somehow don't feel right, and as this is an
often-encountered problem, I'd really like to have an SQL command, say MOVE,
or SELECT AND DELETE, whatever, which acts like a SELECT, returning the rows
but deleting them as well. Then I'd just do (see the sketch below):

INSERT INTO archive (...) SELECT ... AND DELETE FROM active WHERE user_id = ...;

which would have the following advantages :
- No worries about locks:
  - less chance of bugs
  - higher performance, since locks by definition have to be waited on
- No need to do the request twice (so, it is twice as fast !)
- Simplicity and elegance
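
To make the proposal concrete, here is a sketch of what the bulk archiving
would look like with such a command (hypothetical syntax along the lines
above, not anything PostgreSQL actually accepts; the column names are just
examples):

  INSERT INTO archive (user_id, chatroom_id, start_time, end_time)
  SELECT user_id, chatroom_id, start_time, now() AND DELETE
    FROM active
   WHERE user_id = ...;

One statement, one pass over "active", and nothing can slip in between the
read and the delete.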

There would be a hidden bonus: if you acquire locks, you'd better COMMIT the
transaction as soon as possible to release them, whereas here you can happily
continue in the transaction.

I think this command would make a nice cousin to the also very popular
INSERT ... OR UPDATE, which tries to insert a row and, if it already exists,
UPDATEs it instead of inserting it !

What do you think ?





Re: MOVE

From: Martijn van Oosterhout
On Fri, Jan 14, 2005 at 08:49:24PM +0100, PFC wrote:
> the auto-expiration cron
> job, I must also expire ALL his open chatroom connections.
> INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
> DELETE FROM active WHERE user_id = ...;
>
> Now, if the user inserts a connection between the two queries above, the
> thing will fail (the
> connection will just be deleted). I know that there are many ways to do it
> right :

Why not just do it in a single transaction? I don't think you need to use
SERIALIZABLE at all; I think normal read-committed mode will do what you
want, no?

BEGIN;
INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
DELETE FROM active WHERE user_id = ...;
COMMIT;

The DELETE can only delete the rows returned by the SELECT; that's the whole
point of transactions...

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/


Re: MOVE

From: Tom Lane
Martijn van Oosterhout <kleptog@svana.org> writes:
> Why not just do it in a single transaction? I don't think you need to
> use SERIALIZABLE at all, I think normal read-committed mode will do
> what you want, no?

> BEGIN;
> INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
> DELETE FROM active WHERE user_id = ...;
> COMMIT;

No, that's exactly wrong: in read-committed mode the DELETE could delete
rows that were not seen by the SELECT.  It would work in serializable
mode though.
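
A minimal sketch of the serializable variant, mirroring the example above:

  BEGIN;
  SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
  INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
  DELETE FROM active WHERE user_id = ...;
  COMMIT;

Both statements then work from the same snapshot, so the DELETE cannot remove
a row the SELECT didn't copy; the price is that the transaction can fail with
a serialization error under concurrent updates and has to be retried.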

            regards, tom lane

Re: [HACKERS] Much Ado About COUNT(*)

From: Wes
On 1/14/05 12:47 PM, "Frank D. Engel, Jr." <fde101@fjrhome.net> wrote:

> It's probably too messy to be worthwhile this
> way, though.  More trouble than it would be worth.

It would be rather useful if there were a way to get a reasonably accurate
count (better than analyze provides) in a very short period.  When you've got
a relatively wide table with hundreds of millions to over a billion rows, and
you need to report how many rows are in the table, that can take a long time.

Wes



Re: MOVE

From: PFC
> BEGIN;
> INSERT INTO archive (...) SELECT ... FROM active WHERE user_id = ...;
> DELETE FROM active WHERE user_id = ...;
> COMMIT;
>
> The DELETE can only delete the rows returned by the select, that's the
> whole point of transactions...

    Well, in the case of having a unique index on user_id, and if no-one
updates the row between the insert and the delete, it will work ;)
    And if someone updates it in between, well, the update is not archived,
so you have to LOCK your row FOR UPDATE.
    And if this procedure is called twice at the same time, the rows will be
historized twice...
    etc...

    Which is precisely why I don't like this approach !

    As a side note, I've installed the 8.0 RC, and wow. The slow queries feel a
lot faster at the command prompt, and my small queries have become faster
too... very good work !

    In the end, I've implemented it with an AFTER DELETE trigger on the
'live' table, which, after the row has been deleted, inserts it in the
history table, using the magic variable OLD. This will work because the
row is already deleted, thus can't be concurrently updated by another
transaction (because a transaction trying to update a row will wait on the
lock acquired by the DELETE, and vice versa).
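
    Roughly, that trigger looks like this (a sketch with invented table and
column names, not the actual schema):

  CREATE OR REPLACE FUNCTION archive_session() RETURNS trigger AS $$
  BEGIN
      INSERT INTO sessions_history (user_id, chatroom_id, start_time, end_time)
          VALUES (OLD.user_id, OLD.chatroom_id, OLD.start_time, now());
      RETURN OLD;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER archive_on_delete
      AFTER DELETE ON active_sessions
      FOR EACH ROW EXECUTE PROCEDURE archive_session();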

    So, for ONE row at a time, in a trigger, it works beautifully, thank you
postgres ! I find the solution very elegant : when a session expires, it is
deleted from the live sessions table, and the trigger catches it just in time
to shove it into the sessions history table. Then some other tables, like
user-to-chatroom connections, which happen to have a user_id referencing the
live sessions table, get the ON DELETE CASCADE and are also purged and
historized automatically. I am very happy with this solution, BUT it's done
one row at a time, so it's slower than I'd like !

    The key is to insert the row deleted from the live table into the history
table AFTER it has been deleted, to avoid all the funky lock problems. Now
consider the following : you must DELETE several items at the same time and
historize them.

    If you INSERT then DELETE:
        - records can be inserted, updated or deleted between the two. The
inserted ones will be historized but not deleted (duplicates !), the deleted
ones will be lost forever, unhistorized, and the updated ones won't have
their updates historized. Not very good for concurrency !

    You can lock the rows FOR UPDATE beforehand; this solves the UPDATE and
DELETE problems, but not the INSERT problem.
    You can, of course, lock the entire table, but that reminds me too much of
the MySQL way.
    You can also use SERIALIZABLE mode, which solves all the problems, but if
something goes wrong, everything fails and you have to retry the whole
transaction, whereas a proper lock would just be waited on...
    If there is a primary key in the 'live' table, you can SELECT FOR UPDATE
into a temporary table, then delete using the pkeys in the temp table, then
insert from the temp table... ugly !

    That's why I'm bothering you about the possibility of DELETE returning the
DELETEd rows ;)

    It's not very useful if you process one row, but when you process several
at a time it would be really great, because instead of 2*N queries
(DELETE+INSERT hidden in a trigger) you'd just do one (INSERT ... DELETE AND
SELECT ... FROM ...). Today, if you don't want to do it in a trigger, you have
to have a unique index and do SELECT FOR UPDATE, INSERT, DELETE: that's three
queries per row.
    In a perfect world, this would then be used in an ON DELETE RULE which
would replace the DELETEs with deletes that insert the rows into the history
table.

    Also, I've thought about some other interesting applications: if DELETE
returns rows, why not UPDATE or even INSERT ?
    Many applications use INSERT... then SELECT currval(sequence). I also like
to set defaults in the database, for instance rows with timestamp fields
defaulting to now(), or things like that. I have a tree table with an ltree
field which is generated by a trigger from the parent's path and the current
row's id; some other fields are also inherited from the parent. Why not do
INSERT INTO ... AND SELECT ..., which would return the sequence field and any
other fields initialized by ON INSERT triggers... this would be neat...
instead of INSERT, SELECT currval, SELECT .. FROM table WHERE id=...
    Same thing for ON UPDATE triggers.
    You could replace some plpgsql procedures with one query and, more
importantly, not worry about locking headaches.
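
    Spelled out, the three-query pattern mentioned above looks like this (table
and sequence names are invented):

  INSERT INTO tree (parent_id, name) VALUES (1, 'child');
  SELECT currval('tree_id_seq');
  SELECT path FROM tree WHERE id = currval('tree_id_seq');

An INSERT that could return its row would collapse all of that, defaults and
trigger-filled columns included, into a single round trip.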

    Anyway, my problem is solved now with triggers, but I like the idea very
much (and Oracle has it) (and Tom once said a DELETE was just more or less
like a SELECT)... so ...

    Regards






Re: [HACKERS] Much Ado About COUNT(*)

From: Greg Stark
"Frank D. Engel, Jr." <fde101@fjrhome.net> writes:

> Yep, that could cause problems.  Okay, now I'm joining the program.
>
> The only thing I can see that would fix this
> ...

There are well understood mechanisms to fix this. It's a "SMOP" or "simple
matter of programming". What you would do is insert into a summary table a
record that indicates how many records you've inserted into the master table.
Periodically you have some daemon collect up those records and replace them
with a single record.

But this can already be done by hand, and it's not clear that having the
database do it automatically is necessarily a good idea. It would impose a
cost on every insert when most of the time it wouldn't be useful.
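
Done by hand, the periodic collapse might look something like this (a sketch;
foo_rowcount_log is a hypothetical one-column delta table fed by the inserts,
and the short exclusive lock keeps new deltas out while we fold):

  BEGIN;
  LOCK TABLE foo_rowcount_log IN EXCLUSIVE MODE;
  CREATE TEMP TABLE folded AS
      SELECT coalesce(sum(delta), 0) AS delta FROM foo_rowcount_log;
  DELETE FROM foo_rowcount_log;
  INSERT INTO foo_rowcount_log SELECT delta FROM folded;
  DROP TABLE folded;
  COMMIT;

After the fold, the count is a cheap sum over a handful of rows instead of a
full table scan.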

Moreover this is just a special case of a general problem called "materialized
views". If it were added to the database it would probably be more worthwhile
implementing a more general feature that could handle other aggregate
functions besides count(*) as well as other types of queries besides simple
unqualified aggregates.

--
greg