Thread: Postgresql vs. aggregates

Postgresql vs. aggregates

From: jao@geophile.com
I have an application with a table that tracks objects with a "size"
attribute. What we want to do is to periodically report on the number
of these objects and the sum of the object sizes. The table will
typically have 1-5 million rows.

I know that the obvious "select count(*), sum(size) ..." is not a good
idea, and we have seen the impact such a query has on performance.

Then we tried creating triggers to maintain a table with object_count
and sum_size columns:

    create table summary(object_count bigint, sum_size bigint)

We populate the table with exactly one row, initialized to (0, 0).
A trigger runs an update like this:

    update summary
    set object_count = object_count + 1,
        sum_size = sum_size + new.size
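
For reference, the full setup is roughly the following sketch (the
tracked table name and the function/trigger names here are simplified,
not our real ones):

    create or replace function maintain_summary() returns trigger as '
    begin
        update summary
        set object_count = object_count + 1,
            sum_size = sum_size + new.size;
        return new;
    end;
    ' language plpgsql;

    create trigger objects_summary
    after insert on objects
    for each row execute procedure maintain_summary();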

The problem is that our application has periods where objects are
being created at a rapid rate. Updates to the summary table slow down
over time, and the slowdown is accompanied by very low CPU idle% as
reported by vmstat.

I found that if I have a thread which vacuums the summary table every
few seconds (e.g. 30), then update times stay reasonable. We're
starting to reconsider this approach, wondering if the vacuuming or
the buildup of row versions (in those 30 seconds) is behind some
variability in performance that we're seeing. To that end, we are
measuring the impact of more frequent and less frequent
vacuuming. I'll skip the details here since this is not the main point
of this email.

Another idea we had is to maintain object_count and sum_size in
sequences. The trigger would use currval and setval to maintain the
values. The problem with this is that when we try to examine the
sequence on another connection, we can't use currval before bumping
the value with nextval, which would interfere with maintenance of the
totals. (The error message for a sequence s is "ERROR:  s.currval
is not yet defined in this session".)
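
Concretely, the sketch we had in mind was something like this (sequence
names are illustrative):

    create sequence object_count_seq;
    create sequence sum_size_seq;

    -- in the trigger, instead of updating the summary table:
    select setval('object_count_seq', currval('object_count_seq') + 1);
    select setval('sum_size_seq', currval('sum_size_seq') + new.size);

    -- but on a reporting connection this fails with the error above
    -- until that session has itself called nextval() or setval():
    select currval('object_count_seq');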

Yet another idea is to write C functions which maintain counts in
shared memory, simulating what we can't quite do with sequences.

I understand why asking for aggregate values computed over rapidly
changing data sets is not a completely meaningful thing to do. Yet we
have the requirement. Users of our product need to be able to look at
object_count and sum_size to get a sense of how an object load is
proceeding.

This shouldn't be so difficult. I've seen, on this mailing list,
the advice "don't do that, use a trigger" when someone wants to do
select count(*). But I haven't seen any discussion of the consequences
of using a trigger.

Can anyone recommend a good way to maintain aggregates using a
trigger? Is vacuuming every few seconds really the only way to go?

Jack Orenstein


----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.

Re: Postgresql vs. aggregates

From: "Scott Marlowe"
On Wed, 2004-06-09 at 16:32, jao@geophile.com wrote:
> I have an application with a table that tracks objects with a "size"
> attribute. What we want to do is to periodically report on the number
> of these objects and the sum of the object sizes. The table will
> typically have 1-5 million rows.
>
> I know that the obvious "select count(*), sum(size) ..." is not a good
> idea, and we have seen the impact such a query has on performance.
>
> Then we tried creating triggers to maintain a table with object_count
> and sum_size columns:
>
>     create table summary(object_count bigint, sum_size bigint)
>
> We populate the table with exactly one row, initialized to (0, 0).
> A trigger runs an update like this:
>
>     update summary
>     set object_count = object_count + 1,
>         sum_size = sum_size + new.size
>
> The problem is that our application has periods where objects are
> being created at a rapid rate. Updates to the summary table slow down
> over time, and the slowdown is accompanied by very low CPU idle% as
> reported by vmstat.

I'm wondering if these data are time sensitive, and if so, how
sensitive.  If they could be up to say a minute old or something, using
a materialized view of some sort might work well here.  Or a temp table
to hold all the incoming data until it needs to be dumped into the real
table.

Just a thought, might not work for you.

Also, do you have a lot of parallel updates to the same rows, or are
they all different rows?  If you're doing a lot of updating of the same
rows over and over, any database is going to be somewhat slow, either in
vacuuming like for postgresql, or locking as in row level locking
databases.

If they're all different, then this should be no different for
postgresql than for a row locking database, since each tuple will only
have two or three versions, instead of some smaller percentage of rows
having hundreds of versions.

Just wondering.


Re: Postgresql vs. aggregates

From: jao@geophile.com
Quoting Scott Marlowe <smarlowe@qwest.net>:

> On Wed, 2004-06-09 at 16:32, jao@geophile.com wrote:
> > Then we tried creating triggers to maintain a table with object_count
> > and sum_size columns:
> >
> >     create table summary(object_count bigint, sum_size bigint)
> >
> > We populate the table with exactly one row, initialized to (0, 0).
> > A trigger runs an update like this:
> >
> >     update summary
> >     set object_count = object_count + 1,
> >         sum_size = sum_size + new.size
> >
> > The problem is that our application has periods where objects are
> > being created at a rapid rate. Updates to the summary table slow down
> > over time, and the slowdown is accompanied by very low CPU idle% as
> > reported by vmstat.
>
> I'm wondering if these data are time sensitive, and if so, how
> sensitive.  If they could be up to say a minute old or something, using
> a materialized view of some sort might work well here.  Or a temp table
> to hold all the incoming data until it needs to be dumped into the real
> table.

So once a minute I'd run a stored procedure to compute statistics
and copy rows to the main table? That might be doable, but then
queries would have to go against a view combining the main table and
the temp table.
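
I imagine something along these lines (a sketch; "objects" and
"incoming" are names I'm making up here):

    -- inserts go to a staging table with no summary trigger:
    insert into incoming(id, size) values (1, 42);

    -- once a minute, fold the staging rows into the main table
    -- and the summary, all in a single transaction:
    begin;
    update summary
    set object_count = object_count + (select count(*) from incoming),
        sum_size = sum_size + (select coalesce(sum(size), 0) from incoming);
    insert into objects select * from incoming;
    delete from incoming;
    commit;

    -- queries consult a view spanning both tables:
    create view all_objects as
    select * from objects union all select * from incoming;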

Can you expand on the use of materialized views to solve this?

> Also, do you have a lot of parallel updates to the same rows, or are
> they all different rows?  If you're doing a lot of updating of the same
> rows over and over, any database is going to be somewhat slow, either in
> vacuuming like for postgresql, or locking as in row level locking
> databases.
>
> If they're all different, then this should be no different for
> postgresql than for a row locking database, since each tuple will only
> have two or three versions, instead of some smaller percentage of rows
> having hundreds of versions.

The summary table has exactly one row, and whenever an object
is added to the main table, this one summary row is updated. An
odd feature of my application is that there is only one object
creation going on at a time. The summary row would not be a concurrency
hotspot in a non-MVCC database system.

But that raises an interesting idea. Suppose that instead of one
summary row, I had, let's say, 1000. When my application creates
an object, I choose one summary row at random (or round-robin) and update
it. So now, instead of one row with many versions, I have 1000 with 1000x
fewer versions each. When I want object counts and sizes, I'd sum up across
the 1000 summary rows. Would that allow me to maintain performance
for summary updates with less frequent vacuuming?
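
A sketch of what I mean (the choice of 1000 slots is arbitrary, and
"new.id" stands in for whatever the trigger would hash on):

    create table summary(slot int primary key,
                         object_count bigint,
                         sum_size bigint);
    -- pre-populate with rows (0, 0, 0) through (999, 0, 0)

    -- the trigger updates one slot instead of a single global row:
    update summary
    set object_count = object_count + 1,
        sum_size = sum_size + new.size
    where slot = new.id % 1000;

    -- readers aggregate over the 1000 rows:
    select sum(object_count), sum(sum_size) from summary;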

I'd find all this much easier to reason about if I understood how
the versions of a row are organized and accessed. How does postgresql
locate the correct version of a row?

Jack Orenstein




Re: Postgresql vs. aggregates

From: Tom Lane
jao@geophile.com writes:
> I'd find all this much easier to reason about if I understood how
> the versions of a row are organized and accessed. How does postgresql
> locate the correct version of a row?

It doesn't, particularly.  A seqscan will of course visit all the
versions of a row, and an indexscan will visit all the versions
matching the indexscan key-column conditions.  It's up to the "time
qualification" tests (tqual.c) to accept only the version that is
visible to your transaction.

There are two or three implementation choices in the btree index
routines that are intended to increase the chances that you'll hit the
version you want sooner rather than later.  But they're only heuristics.
The bottom line is that we check all the versions till we find the
right one.

            regards, tom lane

Re: Postgresql vs. aggregates

From: Richard Huxton
jao@geophile.com wrote:

> But that raises an interesting idea. Suppose that instead of one
> summary row, I had, let's say, 1000. When my application creates
> an object, I choose one summary row at random (or round-robin) and update
> it. So now, instead of one row with many versions, I have 1000 with 1000x
> fewer versions each. When I want object counts and sizes, I'd sum up across
> the 1000 summary rows. Would that allow me to maintain performance
> for summary updates with less frequent vacuuming?

Perhaps the simplest approach might be to define the summary table as
containing a SERIAL and your count.
Every time you add another object insert (nextval(...), 1)
Every 10s summarise the table (i.e. replace 10 rows all "scored" 1 with
1 row scored 10)
Use sum() over the much smaller table to find your total.
Vacuum regularly.
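
A sketch of the scheme (names illustrative; the compaction step here
ignores rows arriving mid-transaction, which you would guard against in
practice):

    create table summary(id serial primary key,
                         object_count bigint,
                         sum_size bigint);

    -- per object: insert a new row, never update an existing one
    insert into summary(object_count, sum_size) values (1, new.size);

    -- every ~10s, collapse the accumulated rows into one:
    begin;
    insert into summary(object_count, sum_size)
    select sum(object_count), sum(sum_size) from summary;
    delete from summary where id < currval('summary_id_seq');
    commit;

    -- readers sum over the much smaller table:
    select sum(object_count), sum(sum_size) from summary;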


--
   Richard Huxton
   Archonet Ltd

Re: Postgresql vs. aggregates

From: "Nick Barr"
Hi,

----- Original Message -----
From: "Richard Huxton" <dev@archonet.com>
To: <jao@geophile.com>
Cc: "Scott Marlowe" <smarlowe@qwest.net>; <pgsql-general@postgresql.org>
Sent: Thursday, June 10, 2004 8:03 AM
Subject: Re: [GENERAL] Postgresql vs. aggregates


> jao@geophile.com wrote:
>
> > But that raises an interesting idea. Suppose that instead of one
> > summary row, I had, let's say, 1000. When my application creates
> > an object, I choose one summary row at random (or round-robin) and
> > update it. So now, instead of one row with many versions, I have 1000
> > with 1000x fewer versions each. When I want object counts and sizes,
> > I'd sum up across the 1000 summary rows. Would that allow me to
> > maintain performance for summary updates with less frequent vacuuming?
>
> Perhaps the simplest approach might be to define the summary table as
> containing a SERIAL and your count.
> Every time you add another object insert (nextval(...), 1)
> Every 10s summarise the table (i.e. replace 10 rows all "scored" 1 with
> 1 row scored 10)
> Use sum() over the much smaller table to find your total.
> Vacuum regularly.
>

Something along these lines except using a SUM instead of a COUNT.

http://archives.postgresql.org/pgsql-performance/2004-01/msg00059.php


Nick




Re: Postgresql vs. aggregates

From: Karsten Hilbert
> I understand why asking for aggregate values computed over rapidly
> changing data sets is not a completely meaningful thing to do. Yet we
> have the requirement. Users of our product need to be able to look at
> object_count and sum_size to get a sense of how an object load is
> proceeding.

If the customer just needs to know how things are *proceeding*,
will it not suffice to look at reasonable estimates?

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346