Thread: "critical mass" reached?

"critical mass" reached?

From
Alex Howansky
Date:
Using 7.0.3, I've got a database that has about 30 tables. One in particular
seems to be giving us problems. It's a pretty simple table with one index that
logs certain web site activity. It gets about 100K insertions per day, and now
has about 6 million records.
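
For what it's worth, the table is shaped roughly like the sketch below (the
names are made up, not the real schema):

# Illustrative sketch only -- real table and column names not shown.
psql mydb <<'EOF'
CREATE TABLE activity_log (
    hit_time   timestamp DEFAULT now(),
    url        text,
    remote_ip  text
);
CREATE INDEX activity_log_hit_time_idx ON activity_log (hit_time);
EOF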

All of a sudden (yesterday), we seem to have reached a "critical mass". No
other tables or code have changed significantly (or more than normal). However,
the database performance has abruptly become abysmal -- the server which
previously hadn't broken a load average of 4 now spikes continuously between 20
and 40, rarely dropping below 10. Web server logs show normal activity. Also,
in the pgsql/data/base/dbname directory, I'm getting hundreds of pg_sorttemp
and pg_noname files lying around.

I thought there might be some data or index corruption, so I've even gone so
far as to initdb and recreate the database from scratch, using a previous
pg_dump output, but it has not helped.
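
For reference, the dump-and-reload cycle was roughly this (database name and
paths are just examples):

# Example names/paths -- roughly the dump-and-reload cycle.
pg_dump mydb > mydb.dump
# ... stop the postmaster, initdb a fresh data directory, restart ...
createdb mydb
psql mydb < mydb.dump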

Six million tables doesn't seem like it should be too much of a problem, but we
suspect this table might be the cause of the problem because it's the only one
that changes significantly from day to day. Memory is ok, there is no swapping,
disk space is plentiful, I don't know where else to look. Any ideas?

--
Alex Howansky
Wankwood Associates
http://www.wankwood.com/


Re: "critical mass" reached?

From
"Richard Huxton"
Date:
From: "Alex Howansky" <alex@wankwood.com>

> Using 7.0.3, I've got a database that has about 30 tables. One in particular
> seems to be giving us problems. It's a pretty simple table with one index that
> logs certain web site activity. It gets about 100K insertions per day, and now
> has about 6 million records.
>
> All of a sudden (yesterday), we seem to have reached a "critical mass". No
> other tables or code have changed significantly (or more than normal). However,
> the database performance has abruptly become abysmal -- the server which
> previously hadn't broken a load average of 4 now spikes continuously between 20
> and 40, rarely dropping below 10. Web server logs show normal activity. Also,
> in the pgsql/data/base/dbname directory, I'm getting hundreds of pg_sorttemp
> and pg_noname files lying around.

Presumably you're running vacuum analyze regularly (at least once a day I'd
guess) so I can only suspect that something has tipped the balance in the
cost estimations. Is there a particular query that's slow and can you post
an EXPLAIN?
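Something along these lines, substituting your real table and query (the names
below are invented):

# Invented table/query -- substitute your own slow statement.
psql mydb -c "EXPLAIN SELECT * FROM activity_log WHERE hit_time > '2001-03-01'"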

> I thought there might be some data or index corruption, so I've even gone so
> far as to initdb and recreate the database from scratch, using a previous
> pg_dump output, but it has not helped.

Looks like you've ruled out damage to the DB. What happens if you delete 3
million of the records in your log-table?
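i.e. something like this, with made-up names and only after taking a fresh dump:

# Made-up names; take a pg_dump first.
psql mydb <<'EOF'
DELETE FROM activity_log WHERE hit_time < '2001-01-01';
VACUUM ANALYZE activity_log;
EOF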

> Six million tables doesn't seem like it should be too much of a problem, but we
> suspect this table might be the cause of the problem because it's the only one
> that changes significantly from day to day. Memory is ok, there is no swapping,
> disk space is plentiful, I don't know where else to look. Any ideas?

Six million _tables_ is a lot, but you're right, 6M records is pretty small
compared to what some people are using.

See if you can't post an EXPLAIN of a problem query and the relevant table
defs.

- Richard Huxton


Re: "critical mass" reached?

From
Alex Howansky
Date:
> Presumably you're running vacuum analyze regularly (at least once a day I'd
> guess) so I can only suspect that something has tipped the balance in the
> cost estimations. Is there a particular query that's slow and can you post
> an EXPLAIN?

Oops, yes, sorry, forgot to mention that. Vacuum analyze is run nightly. There
isn't just one particular query that runs slow -- it's the database as a whole
(while apparently under the same average everyday load).
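
The nightly run is just a cron entry along these lines (paths and database name
are examples):

# Example crontab entry for the postgres user.
30 3 * * * psql mydb -c "VACUUM ANALYZE" >> /home/postgres/vacuum.log 2>&1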

> Looks like you've ruled out damage to the DB. What happens if you delete 3
> million of the records in your log-table?

We haven't got that far yet. I was hoping to get some other ideas prior to
doing something so drastic, but we'll try it ...

> Six million _tables_ is a lot, but you're right 6M records is pretty small
> compared to what some people are using.

Oops again. I gotta stop trying to debug at 3am... :)

--
Alex Howansky
Wankwood Associates
http://www.wankwood.com/




Re: "critical mass" reached?

From
Alex Howansky
Date:
> Hm.  As Richard remarks, 6M records is not an especially big table;
> there are people running larger ones.  The leftover sorttemp files sound
> like you are suffering backend crashes --- but you didn't mention
> anything about unexpected disconnects.

I haven't noticed any myself, but the majority of our connections come from a
public web-based app -- so I can't really tell if the consumer is experiencing
problems or not.

> The postmaster log would be a good place to look for more info (if
> you're not keeping one, turn it on).

I have a debug level 2 log of almost the entire day's activity. I scanned it
briefly but found nothing (it's 180 meg). Is there anything in particular I
should be looking for?

>  Also, make sure the postmaster is not being run with an environment of
> "ulimit -c 0" ... if the backends are crashing, we want to get some core
> files so we can see what's happening.

Ok, will verify. Thanks.

--
Alex Howansky
Wankwood Associates
http://www.wankwood.com/


Re: "critical mass" reached?

From
Tom Lane
Date:
Alex Howansky <alex@wankwood.com> writes:
> [ lots of bad stuff ]

Hm.  As Richard remarks, 6M records is not an especially big table;
there are people running larger ones.  The leftover sorttemp files sound
like you are suffering backend crashes --- but you didn't mention
anything about unexpected disconnects.

The postmaster log would be a good place to look for more info (if
you're not keeping one, turn it on).  Also, make sure the postmaster
is not being run with an environment of "ulimit -c 0" ... if the
backends are crashing, we want to get some core files so we can see
what's happening.
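
For instance (the data directory and log path below are only examples):

# Example only -- adjust PGDATA and log path to your installation.
ulimit -c unlimited
postmaster -d 2 -D /usr/local/pgsql/data >> /usr/local/pgsql/server.log 2>&1 &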

            regards, tom lane

Re: "critical mass" reached?

From
Tom Lane
Date:
Alex Howansky <alex@wankwood.com> writes:
>> The postmaster log would be a good place to look for more info (if
>> you're not keeping one, turn it on).

> I have a debug level 2 log of almost the entire day's activity.

Oh good.  Hang onto that.

> I scanned it
> briefly but found nothing (it's 180 meg). Is there anything in particular I
> should be looking for?

Unusual errors, reports of subprocess crashes ...
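
A rough first pass might look something like this (the exact message wording
varies):

# Rough first pass over the debug log; refine the pattern as needed.
grep -inE 'error|fatal|abnormal|signal|core' server.log | less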

            regards, tom lane