Thread: PG performance in high volume environment (many INSERTs and lots of aggregation reporting)

[Posted similar note to PG General but I suppose it's more appropriate
in this list. Apologies for cross-posting.]

Hi. Further to my bafflement with the "count(*)" queries as described
in this thread:

http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php

It seems that whenever this question has come up, Postgresql comes up
very short in terms of "count(*)" functions.

The performance is always slow, because of the planner's need to guess
and such. I don't fully understand how the statistics work (and the
explanation on the PG website is way too geeky) but the columns I work
with already have a stat level of 100. Not helping at all.

We are now considering a web based logging functionality for users of
our website. This means the table could be heavily INSERTed into. We
get about 10 million hits a day, and I'm guessing that we will have to
keep this data around for a while.

My question: with that kind of volume and the underlying aggregation
functions (by product id, dates, possibly IP addresses or at least
countries of origin..) will PG ever be a good choice? Or should I be
looking at some other kind of tools? I wonder if OLAP tools would be
overkill for something that needs to look like a barebones version of
google analytics limited to our site..

Appreciate any thoughts. If possible I would prefer to tone down any
requests for MySQL and such!

Thanks!

> My question: with that kind of volume and the underlying aggregation
> functions (by product id, dates, possibly IP addresses or at least
> countries of origin..) will PG ever be a good choice? Or should I be
> looking at some other kind of tools? I wonder if OLAP tools would be
> overkill for something that needs to look like a barebones version of
> google analytics limited to our site..

Some other databases might have an optimization that makes this much
faster than it would ordinarily be.

select count(*) from table;

But I don't think anyone has an optimization that makes this fast:

select column, count(*) from table group by 1;

How do you expect the database to get this information other than by
reading the whole table and counting up the number of occurrences of
each value?  I guess an OLAP cube might precompute all the answers for
you, but I don't think MySQL is going to do that.

One option is to write a script that runs in the background and
updates all your statistics every 10 minutes or so, dumping the
results into separate (and smaller) tables that you can query quickly.
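
One way such a background job might look, as a rough sketch (the table
and column names here are purely illustrative, not from the original
post):

-- raw log table the application INSERTs into
CREATE TABLE hits (
    hit_time    timestamptz NOT NULL DEFAULT now(),
    product_id  integer     NOT NULL,
    ip          inet
);

-- small, pre-aggregated table the reports query instead
CREATE TABLE hits_summary (
    day         date    NOT NULL,
    product_id  integer NOT NULL,
    hit_count   bigint  NOT NULL,
    PRIMARY KEY (day, product_id)
);

-- run from cron every 10 minutes or so: rebuild just the recent rows
BEGIN;
DELETE FROM hits_summary WHERE day >= current_date - 1;
INSERT INTO hits_summary (day, product_id, hit_count)
SELECT hit_time::date, product_id, count(*)
  FROM hits
 WHERE hit_time >= current_date - 1
 GROUP BY 1, 2;
COMMIT;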

Another option (which is probably what I would do for really high
volume logging of web traffic) is to write your log records to a flat
file and then postprocess them with perl or something and load the
summary statistics into your database later.  PostgreSQL is really
fast, but nothing is as fast as writing to a flatfile.
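
If you go the flat-file route, the summary rows your postprocessing
script produces can be bulk-loaded in one shot. A hedged example,
assuming a CSV of day,product_id,count rows and the hypothetical
hits_summary table sketched above:

COPY hits_summary (day, product_id, hit_count)
  FROM '/tmp/hits_summary.csv' WITH CSV;
-- (\copy in psql does the same without superuser rights or a server-side path)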

...Robert

Phoenix Kiula wrote:
> [Posted similar note to PG General but I suppose it's more appropriate
> in this list. Apologies for cross-posting.]
>
> Hi. Further to my bafflement with the "count(*)" queries as described
> in this thread:
>
> http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php
>
> It seems that whenever this question has come up, Postgresql comes up
> very short in terms of "count(*)" functions.

Sorry - I'm confused. That thread doesn't seem to contain a slow
count(*) query. You seem to be saying you're having problems with the
query taking 10-15 seconds, but the example takes less than half a
second. How have you identified the count() as being the problem here?

> The performance is always slow, because of the planner's need to guess
> and such. I don't fully understand how the statistics work (and the
> explanation on the PG website is way too geeky) but the columns I work
> with already have a stat level of 100. Not helping at all.

But your own email says it's slow sometimes:
  "My queries are fast in general *except* the first time"
I'm not sure how the planner comes into this.

> We are now considering a web based logging functionality for users of
> our website. This means the table could be heavily INSERTed into. We
> get about 10 million hits a day, and I'm guessing that we will have to
> keep this data around for a while.
>
> My question: with that kind of volume and the underlying aggregation
> functions (by product id, dates, possibly IP addresses or at least
> countries of origin..) will PG ever be a good choice?

A good choice compared to what?

> Or should I be
> looking at some other kind of tools? I wonder if OLAP tools would be
> overkill for something that needs to look like a barebones version of
> google analytics limited to our site..

Typically you'd summarise the data by hour/day via triggers / a
scheduled script if you weren't going towards a pre-packaged OLAP
toolkit. Otherwise you're going to have to scan the hundreds of millions
of rows you've accumulated.
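
A minimal sketch of the trigger variant, assuming a raw log table (call
it hits) with a timestamp and a product id. All names are illustrative,
and note that hot buckets can become contention points under very heavy
concurrent insert:

CREATE TABLE hits_hourly (
    hour        timestamptz NOT NULL,
    product_id  integer     NOT NULL,
    hit_count   bigint      NOT NULL DEFAULT 0,
    PRIMARY KEY (hour, product_id)
);

CREATE OR REPLACE FUNCTION hits_rollup() RETURNS trigger AS $$
BEGIN
    -- bump the bucket for this hour/product, creating it if needed
    UPDATE hits_hourly
       SET hit_count = hit_count + 1
     WHERE hour = date_trunc('hour', NEW.hit_time)
       AND product_id = NEW.product_id;
    IF NOT FOUND THEN
        INSERT INTO hits_hourly (hour, product_id, hit_count)
        VALUES (date_trunc('hour', NEW.hit_time), NEW.product_id, 1);
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER hits_rollup_trg
    AFTER INSERT ON hits
    FOR EACH ROW EXECUTE PROCEDURE hits_rollup();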

> Appreciate any thoughts. If possible I would prefer to tone down any
> requests for MySQL and such!

I'm not sure MySQL is going to help you here - if you were running lots
of small, simple queries it might make sense. If you want to aggregate
data by varying criteria I don't think there is any sensible
optimisation (other than pre-calculating summaries).

--
  Richard Huxton
  Archonet Ltd

On 1/28/09, Phoenix Kiula <phoenix.kiula@gmail.com> wrote:
> [Posted similar note to PG General but I suppose it's more appropriate
>  in this list. Apologies for cross-posting.]
>
>  Hi. Further to my bafflement with the "count(*)" queries as described
>  in this thread:
>
>  http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php
>
>  It seems that whenever this question has come up, Postgresql comes up
>  very short in terms of "count(*)" functions.
>
>  The performance is always slow, because of the planner's need to guess
>  and such. I don't fully understand how the statistics work (and the
>  explanation on the PG website is way too geeky) but the columns I work
>  with already have a stat level of 100. Not helping at all.

Your issue is not statistics/planner.  postgres just can't apply the
special-case optimization that some other databases do because of its
locking model.

All planners 'guess'; the main goal of statistics is to make that
guess better educated.

>  We are now considering a web based logging functionality for users of
>  our website. This means the table could be heavily INSERTed into. We
>  get about 10 million hits a day, and I'm guessing that we will have to
>  keep this data around for a while.

10m hits/day = 115 hits/sec.  This is no problem for even a workstation
box, assuming your disks can handle the syncs.  However, with extreme
insert-heavy loads it helps a lot to look at partitioning/rotation to
ease the pain of big deletes.
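
For example (8.x-style inheritance partitioning; the parent table and
all names are purely illustrative), monthly children make the rotation
a cheap DROP instead of a huge DELETE:

-- child table holding one month of a hypothetical parent "hits" log table
CREATE TABLE hits_2009_01 (
    CHECK (hit_time >= DATE '2009-01-01' AND hit_time < DATE '2009-02-01')
) INHERITS (hits);

-- route inserts to the current child (trigger/rule, or the app inserts directly);
-- with constraint_exclusion = on, queries restricted on hit_time skip the
-- children whose CHECK constraints rule them out.

-- when a month ages out, rotation is nearly instant:
DROP TABLE hits_2008_07;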

merlin

phoenix.kiula@gmail.com (Phoenix Kiula) writes:
> [Posted similar note to PG General but I suppose it's more appropriate
> in this list. Apologies for cross-posting.]
>
> Hi. Further to my bafflement with the "count(*)" queries as described
> in this thread:
>
> http://archives.postgresql.org/pgsql-general/2009-01/msg00804.php
>
> It seems that whenever this question has come up, Postgresql comes up
> very short in terms of "count(*)" functions.
>
> The performance is always slow, because of the planner's need to guess
> and such. I don't fully understand how the statistics work (and the
> explanation on the PG website is way too geeky) but the columns I work
> with already have a stat level of 100. Not helping at all.

That's definitely *NOT* due to "planner's need to guess"; it's due to
there being some *specific* work that PostgreSQL needs to do that some
other databases can avoid due to different storage strategies.

The matter is quite succinctly described here:

http://wiki.postgresql.org/wiki/Why_PostgreSQL_Instead_of_MySQL:_Comparing_Reliability_and_Speed_in_2007#Counting_rows_in_a_table

I'll just take one excerpt:
---------------------------
It is worth observing that it is only this precise form of aggregate
that must be so pessimistic; if augmented with a "WHERE" clause like

SELECT COUNT(*) FROM table WHERE status = 'something'

PostgreSQL, MySQL, and most other database implementations will take
advantage of available indexes against the restricted field(s) to
limit how many records must be counted, which can greatly accelerate
such queries.
---------------------------
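
For instance (hypothetical table, column, and index names), it is an
ordinary index on the restricted column that lets such a count read far
fewer rows:

CREATE INDEX hits_status_idx ON hits (status);

SELECT count(*) FROM hits WHERE status = 'something';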

It is common, in systems where aggregation reporting needs to be fast,
to pre-compute the aggregates, and that is in no way specific to
PostgreSQL.

If you need *really* fast aggregates, then it will be worthwhile to
put together triggers or procedures or something of the sort to help
pre-compute the aggregates.
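
A minimal sketch of that idea for the bare count(*) case (illustrative
names only; the single counter row becomes an update hotspot under heavy
concurrent writes, so real deployments often batch or shard it):

CREATE TABLE hits_count (n bigint NOT NULL);
INSERT INTO hits_count VALUES (0);

CREATE OR REPLACE FUNCTION hits_count_trg() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE hits_count SET n = n + 1;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE hits_count SET n = n - 1;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER hits_count_maintain
    AFTER INSERT OR DELETE ON hits
    FOR EACH ROW EXECUTE PROCEDURE hits_count_trg();

-- the "count" is now a single-row read:
SELECT n FROM hits_count;
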
--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://linuxfinances.info/info/wp.html
"When you have eliminated the impossible, whatever remains, however
improbable, must be the truth." -- Sir Arthur Conan Doyle (1859-1930),
English author. Sherlock Holmes, in The Sign of Four, ch. 6 (1889).
[...but see the Holmesian Fallacy, due to Bob Frankston...
<http://www.frankston.com/public/Essays/Holmesian%20Fallacy.asp>]

On Thu, Jan 29, 2009 at 1:56 PM, Chris Browne <cbbrowne@acm.org> wrote:
>
> It is common, in systems where aggregation reporting needs to be fast,
> to pre-compute the aggregates, and that is in no way specific to
> PostgreSQL.
>
> If you need *really* fast aggregates, then it will be worthwhile to
> put together triggers or procedures or something of the sort to help
> pre-compute the aggregates.

Just to add to this: at my last employer in Chicago, we had a database
from a very large database company (which shall not be named, because I
don't want to be sued) whose CEO makes more than all the people on this
mailing list combined.  This database had a large statistical dataset
that we replicated over to pgsql on a by-the-minute basis so we could
run big ugly queries anytime we felt like it without blowing out the
production database.

At night, or by hand, I would run queries such as select count(*) from
reallyreallyreallybigstatstable on it and compare it to PostgreSQL.
PostgreSQL would take about 4 or 5 minutes to run this on a local
server, a single-core P4 Dell workstation with a 4-disc software
RAID-10 set, while the really, really big server in production took
about 15 to 20 seconds.

Our local test server, which ran the same really big database that
cannot be named on a 16-disk RAID-6 array with gigs of memory and
4 CPU cores, took about 45 seconds to a minute to run the same select
count(*) query.

All of the machines showed high CPU and moderate I/O usage while
running said query.

So, there's probably some room for improvement in pgsql's way of doing
things, but it's not like the other database software was providing
instantaneous answers.  Basically, the second that a database server
becomes fast at running lots of update / select queries in a mixed
environment, things like fast select count(*) get slower.

To the OP: Try running 100 transactional clients against mysql
(updates/inserts/deletes/selects) while running a select count(*) and
see how it behaves.  Single-threaded use cases are kind of uninteresting
compared to lotsa users.  But if single-threaded use cases are your
bread and butter, then pgsql is possibly a poor choice of db.