Thread: Faster count(*)?

Faster count(*)?

From

"Owen Jacobson"

Date:

09 August 2005, 19:39:45

Salve.

I understand from various web searches and so on that PostgreSQL's MVCC
mechanism makes it very hard to use indices or table metadata to optimise
count(*).  Is there a better way to guess the "approximate size" of a table?

I'm trying to write a trigger that fires on insert and performs some
maintenance (collapsing overlapping boxes into a single large box,
specifically) as the table grows.  My initial attempt involved count(*) and,
as the number of pages in the table grew, that trigger bogged down the
database.

Any thoughts?

-O

Re: SPAM Faster count(*)?

From

dracula007@atlas.cz

Date:

09 August 2005, 20:30:05

I believe running count(*) means fulltable scan, and there's no way
to do it without it. But what about some "intermediate" table, with
the necessary counts?

That means to create a table with values (counts) you need, and on
every insert/delete/update increment or decrement the appropriate
values. This way you won't need the count(*) query anymore, and the
performance should be much better.

t.v.

> Salve.

> I understand from various web searches and so on that PostgreSQL's MVCC
> mechanism makes it very hard to use indices or table metadata to optimise
> count(*).  Is there a better way to guess the "approximate size" of a table?

> I'm trying to write a trigger that fires on insert and performs some
> maintenance (collapsing overlapping boxes into a single large box,
> specifically) as the table grows.  My initial attempt involved count(*) and,
> as the number of pages in the table grew, that trigger bogged down the
> database.

> Any thoughts?

Re: SPAM Faster count(*)?

From

Tom Lane

Date:

09 August 2005, 23:49:52

dracula007@atlas.cz writes:
> I believe running count(*) means fulltable scan, and there's no way
> to do it without it. But what about some "intermediate" table, with
> the necessary counts?

There's a fairly complete discussion in the PG list archives of a
reasonably-efficient scheme for maintaining such counts via triggers.
It wasn't efficient enough that we were willing to impose the overhead
on every application ... but if you really NEED a fast count(*) you
could implement it.  I'd like to see someone actually do it and put
up working code on pgfoundry; AFAIK it's only a paper design so far.

If you only want a very-approximate count, the best bet is to rely on
the planner's estimates, eg

regression=# explain select * from tenk1;                        QUERY PLAN                          
-------------------------------------------------------------Seq Scan on tenk1  (cost=0.00..458.00 rows=10000
width=244)                                          ^^^^^

Current best practice is to run the explain and parse out the "rows"
figure using a perl (or axe-of-choice) regexp, though we could be
persuaded to supply a simpler API if there's enough demand for it.
        regards, tom lane

Re: SPAM Faster count(*)?

From

Michael Fuhr

Date:

10 August 2005, 00:31:22

On Tue, Aug 09, 2005 at 10:49:14PM -0400, Tom Lane wrote:
> Current best practice is to run the explain and parse out the "rows"
> figure using a perl (or axe-of-choice) regexp, though we could be
> persuaded to supply a simpler API if there's enough demand for it.

Somebody else requested a row-count-estimate function a couple of
weeks ago:

http://archives.postgresql.org/pgsql-admin/2005-07/msg00256.php

-- 
Michael Fuhr

Re: SPAM Faster count(*)?

From

Andrew Sullivan

Date:

10 August 2005, 07:51:13

On Tue, Aug 09, 2005 at 10:49:14PM -0400, Tom Lane wrote:
> 
> Current best practice is to run the explain and parse out the "rows"
> figure using a perl (or axe-of-choice) regexp, though we could be
> persuaded to supply a simpler API if there's enough demand for it.

FWIW, this was another one of those things I must have heard a dozen
times at OSCON.  I suspect the simpler API would be popular,
particularly since post-8.0 the estimates are more reliable than they
used to be.

A

-- 
Andrew Sullivan  | ajs@crankycanuck.ca
The whole tendency of modern prose is away from concreteness.    --George Orwell

Re: SPAM Faster count(*)?

From

Michael Fuhr

Date:

10 August 2005, 10:32:35

On Tue, Aug 09, 2005 at 09:29:13PM -0600, Michael Fuhr wrote:
> On Tue, Aug 09, 2005 at 10:49:14PM -0400, Tom Lane wrote:
> > Current best practice is to run the explain and parse out the "rows"
> > figure using a perl (or axe-of-choice) regexp, though we could be
> > persuaded to supply a simpler API if there's enough demand for it.
> 
> Somebody else requested a row-count-estimate function a couple of
> weeks ago:
> 
> http://archives.postgresql.org/pgsql-admin/2005-07/msg00256.php

Here's a simple example that parses EXPLAIN output.  It should work
in 8.0.2 and later:

CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE   rec   record;   rows  integer;
BEGIN   FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP       rows := substring(rec."QUERY PLAN" FROM '
rows=([[:digit:]]+)');      EXIT WHEN rows IS NOT NULL;   END LOOP;
 
   RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

CREATE TABLE foo (r double precision);
INSERT INTO foo SELECT random() FROM generate_series(1, 1000);
ANALYZE foo;

SELECT count_estimate('SELECT * FROM foo WHERE r < 0.1');count_estimate 
----------------            97
(1 row)

EXPLAIN SELECT * FROM foo WHERE r < 0.1;                    QUERY PLAN                      
-----------------------------------------------------Seq Scan on foo  (cost=0.00..17.50 rows=97 width=8)  Filter: (r <
0.1::doubleprecision)
 
(2 rows)

-- 
Michael Fuhr

Re: Faster count(*)?

From

Richard Huxton

Date:

10 August 2005, 11:15:38

Owen Jacobson wrote:
> Salve.
> 
> I understand from various web searches and so on that PostgreSQL's MVCC
> mechanism makes it very hard to use indices or table metadata to optimise
> count(*).  Is there a better way to guess the "approximate size" of a table?

Plenty of good answers on how to estimate table-size, but it sounds like 
you just want to run your maintenance function "every so often".

1. Create a sequence "my_table_tracker_seq"
2. On insert, call nextval(my_table_tracker_seq)
3. If value modulo 1000 = 0, run the maintenance routine

That's about as fast as you can get, and might meet your needs. Of 
course you'll need to be more complex if you insert multiple rows at a time.

If you do a lot of deletion on the table and want to take that into 
account, have a second sequence you increment on deletion and subtract 
the one from the other.

Not always accurate enough, but it is quick.
--  Richard Huxton  Archonet Ltd

Re: SPAM Faster count(*)?

From

"Owen Jacobson"

Date:

10 August 2005, 13:26:31

Tom Lane wrote:

> dracula007@atlas.cz writes:
> > I believe running count(*) means fulltable scan, and there's no way
> > to do it without it. But what about some "intermediate" table, with
> > the necessary counts?
>
> There's a fairly complete discussion in the PG list archives of a
> reasonably-efficient scheme for maintaining such counts via triggers.
> It wasn't efficient enough that we were willing to impose the overhead
> on every application ... but if you really NEED a fast count(*) you
> could implement it.  I'd like to see someone actually do it and put
> up working code on pgfoundry; AFAIK it's only a paper design so far.
>
> If you only want a very-approximate count, the best bet is to rely on
> the planner's estimates, eg
>
> regression=# explain select * from tenk1;
>                          QUERY PLAN
> -------------------------------------------------------------
>  Seq Scan on tenk1  (cost=0.00..458.00 rows=10000 width=244)
>                                             ^^^^^
>
> Current best practice is to run the explain and parse out the "rows"
> figure using a perl (or axe-of-choice) regexp, though we could be
> persuaded to supply a simpler API if there's enough demand for it.

Yick.  Ok, given all of that, I've rewritten the trigger in question to fire
on different, indexable criteria (difference between "earliest" and "latest"
rows in the table).

Thanks, everyone.

Owen

Re: SPAM Faster count(*)?

From

Keith Worthington

Date:

12 August 2005, 01:41:37

Tom Lane wrote:
> dracula007@atlas.cz writes:
> 
>>I believe running count(*) means fulltable scan, and there's no way
>>to do it without it. But what about some "intermediate" table, with
>>the necessary counts?
> 
> 
> There's a fairly complete discussion in the PG list archives of a
> reasonably-efficient scheme for maintaining such counts via triggers.
> It wasn't efficient enough that we were willing to impose the overhead
> on every application ... but if you really NEED a fast count(*) you
> could implement it.  I'd like to see someone actually do it and put
> up working code on pgfoundry; AFAIK it's only a paper design so far.

I was kicking this around and came up with the following.  I have hit a 
couple of snags.

In the function I attempt to count the number of rows in a table being 
checked for the first time.  I wanted to use 'FROM TG_RELNAME' but as 
you can see I had to hard code my test table and comment out the trigger 
parameter. FROM tbl_demo--TG_RELNAME  Can someone tell me why that won't 
work?

Also the function doesn't seem to be getting ROW_COUNT properly.  The 
end result is that for this test the table is properly inserted into the 
monitoring table after it's forth insert but it is never updated after 
that.  Can someone help me see the forest through the trees?

-- Clean up the environment.
DROP TABLE tbl_row_count;
DROP TABLE tbl_demo;
DROP FUNCTION tf_update_row_count();

-- Build the table for holding the row counts.
CREATE TABLE tbl_row_count
(  relid oid NOT NULL,  row_count int8 NOT NULL DEFAULT 0
)
WITHOUT OIDS;
ALTER TABLE tbl_row_count OWNER TO postgres;
COMMENT ON COLUMN tbl_row_count.relid IS 'Contains relation id number.';
COMMENT ON COLUMN tbl_row_count.row_count IS 'Contains the number of 
rows in a relation.';

-- Build a table to test the trigger on.
CREATE TABLE tbl_demo
(  first_name varchar(30) NOT NULL
)
WITHOUT OIDS;
ALTER TABLE tbl_demo OWNER TO postgres;
COMMENT ON TABLE tbl_demo IS 'Table used for demonstrating a trigger.';

-- Create the trigger function to maintain the row counts.
CREATE OR REPLACE FUNCTION public.tf_update_row_count()  RETURNS "trigger" AS
$BODY$   DECLARE      v_row_count int8;   BEGIN
--    Store the row count before it disappears.      GET DIAGNOSTICS v_row_count = ROW_COUNT;
--    Check if this is a new table.      PERFORM relid         FROM public.tbl_row_count        WHERE relid = TG_RELID;
    IF FOUND THEN
 
--       Data for this table is already in the row count table.         IF TG_OP = 'INSERT' THEN            UPDATE
public.tbl_row_count              SET row_count = row_count + v_row_count             WHERE relid = TG_RELID;
ELSIFTG_OP = 'DELETE' THEN            UPDATE public.tbl_row_count               SET row_count = row_count - v_row_count
           WHERE relid = TG_RELID;         END IF;      ELSE
 
--       This is a new table so it needs to be counted.         SELECT count(*)           FROM tbl_demo--TG_RELNAME
     INTO v_row_count;         INSERT INTO public.tbl_row_count ( relid, row_count )                VALUES ( TG_RELID,
                      v_row_count                       );      END IF;      RETURN NULL;   END;
 
$BODY$  LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION public.tf_update_row_count() OWNER TO postgres;
GRANT EXECUTE ON FUNCTION public.tf_update_row_count() TO postgres;
GRANT EXECUTE ON FUNCTION public.tf_update_row_count() TO public;

-- Insert some initial data into the demo table.
INSERT INTO public.tbl_demo            ( first_name )            VALUES ( 'Keith' );
INSERT INTO public.tbl_demo            ( first_name )            VALUES ( 'Ed' );
INSERT INTO public.tbl_demo            ( first_name )            VALUES ( 'Kryss' );

-- Create the trigger on the demo table.
CREATE TRIGGER tgr_update_row_count  AFTER INSERT OR DELETE  ON public.tbl_demo  FOR EACH STATEMENT  EXECUTE PROCEDURE
public.tf_update_row_count();

-- Examine the starting state of the tables.
SELECT *  FROM public.tbl_demo;
SELECT *  FROM public.tbl_row_count;
SELECT relid,       relname,       row_count  FROM public.tbl_row_count  LEFT JOIN pg_class    ON ( tbl_row_count.relid
=pg_class.oid );
 

-- Insert a row.
INSERT INTO public.tbl_demo            ( first_name )            VALUES ( 'Jarus' );

-- Examine the new state of the tables.
SELECT *  FROM public.tbl_demo;
SELECT relid,       relname,       row_count  FROM public.tbl_row_count  LEFT JOIN pg_class    ON ( tbl_row_count.relid
=pg_class.oid );
 

-- Insert two more rows.
INSERT INTO public.tbl_demo            ( first_name )            VALUES ( 'Dani' );
INSERT INTO public.tbl_demo            ( first_name )            VALUES ( 'Mary' );

-- Examine the final state of the tables.
SELECT *  FROM public.tbl_demo;
SELECT relid,       relname,       row_count  FROM public.tbl_row_count  LEFT JOIN pg_class    ON ( tbl_row_count.relid
=pg_class.oid );
 

-- 
Kind Regards,
Keith