Re: Select max(foo) and select count(*) optimization - Mailing list pgsql-performance

From Christopher Browne
Subject Re: Select max(foo) and select count(*) optimization
Msg-id m3r7yddczi.fsf@wolfe.cbbrowne.com
In response to Select max(foo) and select count(*) optimization  (John Siracusa <siracusa@mindspring.com>)
List pgsql-performance
pg@rbt.ca (Rod Taylor) wrote:
>> Especially with very large tables, hearing the disks grind as Postgres scans
>> every single row in order to determine the number of rows in a table or the
>> max value of a column (even a primary key created from a sequence) is pretty
>> painful.  If the implementation is not too horrendous, this is an area where
>> an orders-of-magnitude performance increase can be had.
>
> Actually, it's very painful. For MySQL, they've accepted the concurrency
> hit in order to accomplish it -- PostgreSQL would require a more subtle
> approach.
>
> Anyway, with Rules you can force this:
>
> ON INSERT UPDATE counter SET tablecount = tablecount + 1;
>
> ON DELETE UPDATE counter SET tablecount = tablecount - 1;
>
> You need to create a table "counter" with a single row that will keep
> track of the number of rows in the table. Just remember, you've now
> serialized all writes to the table, but in your situation it may be
> worth while.

There's a still more subtle approach that relieves the serialization
constraint, at some cost...

- You add rules that _insert_ a row each time there is an
  insert/delete
   ON INSERT insert into counts(table, value) values ('our_table', 1);
   ON DELETE insert into counts(table, value) values ('our_table', -1);

- The "select count(*) from our_table" is replaced by "select
  sum(value) from counts where table = 'our_table'"

- Periodically, a "compression" process goes through and either:

    a) Deletes the rows for 'our_table' and replaces them with one
       row with a conventionally-scanned 'count(*)' value, or

    b) Computes "select table, sum(value) as value from counts group
       by table", deletes all the existing rows in counts, and replaces
       them by the preceding selection, or

    c) Perhaps does something incremental that's like b), but which
only processes parts of the "counts" table at once.  Process
       500 rows, then COMMIT, or something of the sort...
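
Concretely (hypothetical names again, with triggers standing in for
the rules, and the column renamed since "table" is a reserved word):

    -- Append-only ledger of row-count deltas.
    CREATE TABLE counts (
        table_name text   NOT NULL,
        value      bigint NOT NULL
    );
    CREATE INDEX counts_table_name_idx ON counts (table_name);

    CREATE FUNCTION log_count() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'INSERT' THEN
            INSERT INTO counts(table_name, value) VALUES ('our_table', 1);
        ELSE  -- DELETE
            INSERT INTO counts(table_name, value) VALUES ('our_table', -1);
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER our_table_log AFTER INSERT OR DELETE ON our_table
        FOR EACH ROW EXECUTE PROCEDURE log_count();

    -- Replacement for "select count(*) from our_table":
    SELECT sum(value) FROM counts WHERE table_name = 'our_table';

Writers no longer fight over one row; each change just appends.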

Note that this "counts" table can potentially grow _extremely_ large.
The "win" comes when it gets compressed, so that instead of scanning
through 500K items, it index-scans through 27, the 1 that has the
"497000" that was the state of the table at the last compression, and
then 26 singletons.

A further win comes if an INSERT that adds 50 rows can record a
single ('our_table', 50) in COUNTS, and a DELETE that removes 5000
rows a single ('our_table', -5000); see the sketch below.
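
(On newer PostgreSQL versions, 10 and later, a statement-level
trigger with a transition table gets exactly that batching, as a
replacement for the per-row trigger above; hypothetical names, and a
mirror-image DELETE trigger would use REFERENCING OLD TABLE.)

    CREATE FUNCTION log_count_batch() RETURNS trigger AS $$
    BEGIN
        -- One counts row per statement, however many rows it added.
        INSERT INTO counts(table_name, value)
            SELECT 'our_table', count(*) FROM new_rows;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER our_table_batch AFTER INSERT ON our_table
        REFERENCING NEW TABLE AS new_rows
        FOR EACH STATEMENT EXECUTE PROCEDURE log_count_batch();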

It's vital to run the "compression" reasonably often (much like VACUUM
:-)) in order that the COUNTS summary table stays relatively small.
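
A sketch of compression option b) (the lock keeps a delta row from
being summed but then missed by the DELETE, or deleted without being
summed; readers can still SELECT through it):

    BEGIN;
    -- EXCLUSIVE blocks writers but not readers for the duration.
    LOCK TABLE counts IN EXCLUSIVE MODE;

    CREATE TEMP TABLE counts_summary AS
        SELECT table_name, sum(value) AS value
          FROM counts
         GROUP BY table_name;

    DELETE FROM counts;
    INSERT INTO counts SELECT table_name, value FROM counts_summary;

    DROP TABLE counts_summary;
    COMMIT;

Kept short and run often, the pause writers see behind that lock
stays small.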
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www3.sympatico.ca/cbbrowne/wp.html
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are, by
definition, not smart enough to debug it.  -- Brian W. Kernighan
