Re: count(*) performance improvement ideas - Mailing list pgsql-hackers

From Stephen Denne
Subject Re: count(*) performance improvement ideas
Date
Msg-id F0238EBA67824444BC1CB4700960CB48051D5D45@dmpeints002.isotach.com
In response to Re: count(*) performance improvement ideas  ("Stephen Denne" <Stephen.Denne@datamail.co.nz>)
Responses Re: count(*) performance improvement ideas  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
(There is a possible performance bug mentioned at the end of the email; the rest is further discussion regarding materialised views.)

I wrote
> Pavan Deolasee wrote
> > On Thu, Mar 20, 2008 at 3:24 AM, Stephen Denne
> > <Stephen.Denne@datamail.co.nz> wrote:
> >
> > >
> > >  Pavan also refers to deferred triggers, which has got me
> > thinking about another possible solution:
> > >
> > >  Instead of inserting a delta row, that will be updated a
> > lot of times, create an on commit drop temp table named after
> > the txid and the grouping value (which is an integer in my
> > case, perhaps hash it if you're grouping by something that
> > doesn't easily convert to part of a table name),
> > >  create an after insert initially deferred constraint to
> > call a function which will move the 'at commit' values of the
> > rows in the temp table to the summary table.
> > >  The temp table will only have one row inserted, updated
> > many times, then on commit the trigger is fired once, and the
> > temp table is dropped.
> > >
> > >  Does anyone think this will or won't work for some reason?
> >
> >
> > I think this should work, although you may need to take some
> > extra steps
> > to manage the summary table. Also, I think a single temp
> > table per transaction
> > should suffice. The temp table would have one row per "group
> > by" or "where"
> > condition on which you want to track the count. The
> > corresponding row will
> > be updated as and when the corresponding count changes. You
> > would need
> > INSERT/DELETE/UPDATE triggers to do that. If there are any
> > subtransaction
> > aborts, that will be taken care by MVCC.
>
> Thanks for that. I had gone ahead and tried out the idea, and
> it was working 'ok'.
>
> Using one table per transaction has the benefit of less temp
> tables (but the same number of triggers waiting to run). It
> also removes the grouping key from the table name.
>
> I was using a single table per grouping key, with a single
> updated row in it.
> The benefit was simpler queries, and I could create an ON
> INSERT trigger that would be triggered only once when the
> temp table was created, and a 'zero' row was inserted,
> thereby separating the setup of the trigger from the
> maintenance of the delta.
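
For the record, the per-grouping-key arrangement I was describing is shaped roughly like this (an untested, from-memory sketch; the summary table, its columns, and the function names are invented for illustration):

  -- Fires once at commit via the deferred constraint trigger; reads the
  -- temp table's final 'at commit' delta rather than NEW:
  CREATE OR REPLACE FUNCTION mv_flush_delta() RETURNS trigger AS $$
  BEGIN
      EXECUTE 'UPDATE doc_type_summary s'
           || ' SET doc_count = s.doc_count + d.delta'
           || ' FROM ' || quote_ident(TG_TABLE_NAME) || ' d'
           || ' WHERE s.doc_type_id = d.doc_type_id';
      RETURN NULL;
  END;
  $$ LANGUAGE plpgsql;

  -- The first change for a group in a transaction creates the temp table,
  -- arms the deferred trigger, and inserts the single 'zero' row; later
  -- changes just UPDATE that row's delta.
  CREATE OR REPLACE FUNCTION mv_setup_delta(grp integer) RETURNS void AS $$
  DECLARE
      tname text := 'temp_delta_' || txid_current() || '_' || grp;
  BEGIN
      EXECUTE 'CREATE TEMP TABLE ' || tname
           || ' (doc_type_id int, delta bigint) ON COMMIT DROP';
      EXECUTE 'CREATE CONSTRAINT TRIGGER ' || tname || '_flush'
           || ' AFTER INSERT ON ' || tname
           || ' DEFERRABLE INITIALLY DEFERRED FOR EACH ROW'
           || ' EXECUTE PROCEDURE mv_flush_delta()';
      EXECUTE 'INSERT INTO ' || tname || ' VALUES (' || grp || ', 0)';
  END;
  $$ LANGUAGE plpgsql;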

One temp table per grouping key would presumably allocate at least one disk page per grouping key.
This might result in pretty poor performance, though if the number of updates per grouping key is large, HOT would have plenty of room on the page to write new row versions.

Does creation & dropping of temp tables result in system catalog bloat?

> I took a wild guess at a way of finding out whether the temp
> table already exists:
>     not exists(select tablename from pg_catalog.pg_tables
> where tablename=temp_delta_txid_group)
> Is there a better/safer way?

Answering my own question:
There is a better way to do what I was doing (though I'm not sure about a better way to check for the existence of a temp table)...
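
(Untested thought: if pg_my_temp_schema() is available, restricting the catalog lookup to the session's own temporary schema might be a little safer than matching on tablename alone, since a permanent table, or another session's temp table, could share the name:)

    not exists(select 1 from pg_catalog.pg_class c
               where c.relnamespace = pg_catalog.pg_my_temp_schema()
                 and c.relname = temp_delta_txid_group)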

A custom variable class can be set up and used to record whether the transaction in question has been set up. (Thanks to Andreas Kretschmer for pointing those out in another thread on -general.)

Alter this setting within postgresql.conf to add 'mv':
custom_variable_classes = 'mv'

Add this setting to postgresql.conf:
mv.initialized = 'false'

Then set it to 'true' only locally, within the transaction.

Usage within a trigger:
  IF NOT (current_setting('mv.initialized')::boolean) THEN
      -- trigger a deferred constraint function:
      INSERT INTO mv_txid_doc_type_summary VALUES (txid_current());
      PERFORM set_config('mv.initialized', 'true', true);
  END IF;
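
To illustrate why this works: the third argument to set_config (is_local = true) makes the assignment local to the transaction, so it reverts when the transaction ends. A quick demonstration (a sketch, assuming the postgresql.conf settings above):

  SHOW mv.initialized;                               -- 'false', from postgresql.conf
  BEGIN;
  SELECT set_config('mv.initialized', 'true', true); -- is_local = true
  SELECT current_setting('mv.initialized');          -- 'true'
  COMMIT;
  SELECT current_setting('mv.initialized');          -- 'false' again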

The custom variable class can also be used to aggregate the deltas within a transaction, though there are some minor
difficulties:

1) They only store text.
2) You can only find out that a value has not been set by catching an exception (see the sketch after this list).
3) There is no way to list the settings.
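
For example, to work around difficulty 2, a PL/pgSQL function can trap the error and supply a default (a sketch; I believe reading an unset custom variable raises "unrecognized configuration parameter", SQLSTATE 42704, i.e. undefined_object):

  CREATE OR REPLACE FUNCTION mv_get_delta(key text) RETURNS bigint AS $$
  BEGIN
      RETURN current_setting('mv.delta.' || key)::bigint;
  EXCEPTION WHEN undefined_object THEN
      -- the variable has never been set in this session/transaction
      RETURN 0;
  END;
  $$ LANGUAGE plpgsql;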

The main benefit is that changing a variable's setting does not write a new row version.
Creating new transactionally scoped variables seems to take around 150 to 200 bytes of process RAM per variable (depending on the size of the value stored).
The time to create them for the first time for a connection seems to vary a bit, between 200 and 1000 per minute. No IO is being performed; CPU is at 100%.
These statistics were gathered while trying variations of this query:

select count(*)
from (select set_config('mv.test11.' || s1.s1, s2.s2::text, true)
      from (select s1 from generate_series(1,1000) as s1) as s1,
           (select s2 from generate_series(1,1000) as s2) as s2) as s3;

This query sets each one of 1000 variables to 1000 different values. It can take a few minutes, but strangely a second connection starting the same query just after the first one can finish in seconds.

Regards,
Stephen Denne.
