Thread: postgres bogged down beyond tolerance

postgres bogged down beyond tolerance

From

"Tena Sakai"

Date:

14 November 2007, 14:57:09

Hi Everybody,

The postgres server I have (on redhat linux 2.6 with recent
Dell hardware (4 cpus)) is running terribly slow.

A job it should have gotten done in less than 1 hour
took 7.5 hours last night.

The job I am referencing above gets run via crontab every
evening and I comb the log file fairly diligently. The
parameters used for the run are indicative as to what orange
I should use to compare to what orange. The comparison is
not unreasonable is what I am asserting.

I checked kernel parameter
shmmax and it was set as 33554432. I "fixed" it as
suggested by the manual:
http://www.postgresql.org/docs/8.2/static/kernel-resources.html

Namely, I shutdown the database, issued two commands:
/sbin/sysctl -w kernel.shmmax=134217728
/sbin/sysctl -w kernel.shmall=2097152
and rebooted the computer.

After it came up, I checked the shmmax and it is set
as 33554432. Which surprised me. Since I used -w
flag, I thought it should've written to /etc/sysctl.conf,
but there is no such entry at all and the data of this
file is from 2006.

Can somebody please give me a tip, insight as to what I
am missing, doing wrong?

Here's tail of serverlog file in my data directory:

[2007-11-14 08:53:48.062 PST] <gadb 2007-11-14 08:53:36 PST idle in transaction>LOG: unexpected EOF on client connection
[2007-11-14 08:53:59.001 PST] <gadb 2007-11-14 08:53:48 PST idle in transaction>LOG: unexpected EOF on client connection
[2007-11-14 08:54:10.782 PST] <gadb 2007-11-14 08:53:59 PST idle in transaction>LOG: unexpected EOF on client connection
[2007-11-14 08:54:22.557 PST] <gadb 2007-11-14 08:54:10 PST idle in transaction>LOG: unexpected EOF on client connection
[2007-11-14 08:54:34.282 PST] <gadb 2007-11-14 08:54:22 PST idle in transaction>LOG: unexpected EOF on client connection
[2007-11-14 09:13:36.444 PST] <gadb 2007-11-14 09:13:24 PST idle in transaction>LOG: unexpected EOF on client connection
[2007-11-14 09:13:43.637 PST] <gadb 2007-11-14 09:13:32 PST idle in transaction>LOG: unexpected EOF on client connection
[2007-11-14 09:17:16.242 PST] <gadb 2007-11-14 09:17:03 PST idle in transaction>LOG: unexpected EOF on client connection
[2007-11-14 09:39:22.841 PST] <gjoslyn 2007-11-14 09:39:22 PST SELECT>ERROR: relation "msysconf" does not exist
[2007-11-14 09:39:22.842 PST] <gjoslyn 2007-11-14 09:39:22 PST SELECT>STATEMENT: SELECT Config, nValue FROM MSysConf

Many thanks in advance.

Regards,

Tena Sakai
tsakai@gallo.ucsf.edu

Re: postgres bogged down beyond tolerance

From

Alvaro Herrera

Date:

14 November 2007, 15:08:03

Tena Sakai wrote:

> I checked kernel parameter
> shmmax and it was set as 33554432.  I "fixed" it as
> suggested by the manual:
> http://www.postgresql.org/docs/8.2/static/kernel-resources.html

Note that just by changing shmmax you're not doing anything to Postgres
itself.  If you want Postgres to use more memory, you need to increase
shared_buffers in postgresql.conf.  Changing shmmax is only required so
that the kernel allows Postgres to allocate all those shared_buffers.

> Namely, I shutdown the database, issued two commands:
>   /sbin/sysctl -w kernel.shmmax=134217728
>   /sbin/sysctl -w kernel.shmall=2097152
> and rebooted the computer.
>
> After it came up, I checked the shmmax and it is set
> as 33554432.  Which surprised me.  Since I used -w
> flag, I thought it should've written to /etc/sysctl.conf,
> but there is no such entry at all and the data of this
> file is from 2006.

No, the -w actually means "write to the kernel", which is state that
doesn't persist by itself.  It's your own responsability to write it to
sysctl.conf.

Anyhow, I don't think any of this has anything to do with a performance
problem.  If you haven't ever vacuumed your database, do so now and try
again.  (Actually, have a read of the "maintenance" chapter in the
manual).

--
Alvaro Herrera       Valdivia, Chile   ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos / con todos los humanos acabaré (Bender)

Re: postgres bogged down beyond tolerance

From

"Scott Marlowe"

Date:

14 November 2007, 15:47:52

On Nov 14, 2007 12:56 PM, Tena Sakai <tsakai@gallo.ucsf.edu> wrote:
>
>  The postgres server I have (on redhat linux 2.6 with recent
>  Dell hardware (4 cpus)) is running terribly slow.
>
>  A job it should have gotten done in less than 1 hour
>  took 7.5 hours last night.

OK, before you try to fix the problem by blindly changing settings,
let's figure out what the problem IS.

Sounds to me like you've got bloat in your storage subsystem.

psql into your database as a superuser (postgres, etc...) and run

vacuum verbose;

and post the last 10 or so lines here.

Re: postgres bogged down beyond tolerance

From

"Tena Sakai"

Date:

14 November 2007, 16:26:45

Hi Scott,

Many thanks for your suggestion. I have issued a
bit heavier command prior to reading your mail.

At psql prompt, I (as superuser) typed:
# vacuum full verbose analyse;
and it spewing many, many lines. The tail end of
the output from it looks like:

INFO: vacuuming "public.allele"
INFO: "allele": found 2518282 removable, 1257262 nonremovable row versions in 31511 pages
DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 64 to 64 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 161940948 bytes.
21011 pages are or will become empty, including 0 at the end of the table.
21042 pages containing 161815320 free bytes are potential move destinations.
CPU 0.28s/0.20u sec elapsed 3.43 sec.
INFO: index "allele_pkey" now contains 1257262 row versions in 14539 pages
DETAIL: 2518282 index row versions were removed.
9648 index pages have been deleted, 9648 are currently reusable.
CPU 1.01s/1.48u sec elapsed 10.05 sec.
INFO: index "allele_markerid_idx" now contains 1257262 row versions in 10385 pages
DETAIL: 2518282 index row versions were removed.
6900 index pages have been deleted, 6900 are currently reusable.
CPU 0.79s/1.22u sec elapsed 8.26 sec.
INFO: "allele": moved 1257262 row versions, truncated 31511 to 10478 pages
DETAIL: CPU 119.89s/34.19u sec elapsed 318.23 sec.
INFO: index "allele_pkey" now contains 1257262 row versions in 14539 pages
DETAIL: 1257262 index row versions were removed.
4849 index pages have been deleted, 4849 are currently reusable.
CPU 0.54s/0.86u sec elapsed 2.98 sec.
INFO: index "allele_markerid_idx" now contains 1257262 row versions in 10385 pages
DETAIL: 1257262 index row versions were removed.
3467 index pages have been deleted, 3467 are currently reusable.
CPU 0.36s/0.81u sec elapsed 2.27 sec.
INFO: vacuuming "pg_toast.pg_toast_16524"
INFO: "pg_toast_16524": found 0 removable, 0 nonremovable row versions in 0 pages
DETAIL: 0 dead row versions cannot be removed yet.
Nonremovable row versions range from 0 to 0 bytes long.
There were 0 unused item pointers.
Total free space (including removable row versions) is 0 bytes.
0 pages are or will become empty, including 0 at the end of the table.
0 pages containing 0 free bytes are potential move destinations.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: index "pg_toast_16524_index" now contains 0 row versions in 1 pages
DETAIL: 0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: analyzing "public.allele"
INFO: "allele": scanned 3000 of 10478 pages, containing 360000 live rows and 0 dead rows; 3000 rows in sample, 1257360 estimated total rows

I would appreciate it if you could give me a bit of
interpretation from psql.

Regards,

Tena Sakai
tsakai@gallo.ucsf.edu

-----Original Message-----
From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
Sent: Wed 11/14/2007 11:47 AM
To: Tena Sakai
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] postgres bogged down beyond tolerance

On Nov 14, 2007 12:56 PM, Tena Sakai <tsakai@gallo.ucsf.edu> wrote:
>
> The postgres server I have (on redhat linux 2.6 with recent
> Dell hardware (4 cpus)) is running terribly slow.
>
> A job it should have gotten done in less than 1 hour
> took 7.5 hours last night.

OK, before you try to fix the problem by blindly changing settings,
let's figure out what the problem IS.

Sounds to me like you've got bloat in your storage subsystem.

psql into your database as a superuser (postgres, etc...) and run

vacuum verbose;

and post the last 10 or so lines here.

Re: postgres bogged down beyond tolerance

From

"Scott Marlowe"

Date:

14 November 2007, 17:16:34

On Nov 14, 2007 2:26 PM, Tena Sakai <tsakai@gallo.ucsf.edu> wrote:
>
> Hi Scott,
>
>  Many thanks for your suggestion.  I have issued a
>  bit heavier command prior to reading your mail.
>
>  At psql prompt, I (as superuser) typed:
>    # vacuum full verbose analyse;

OK, I had just wanted vacuum verbose.

With the full we get a slightly different output than what I was
looking for.  Without the full, you'd get a report at the end telling
you how many fsm pages you'd need to track all your dead tuples.

But we can still use this.

>
>  INFO:  vacuuming "public.allele"
>  INFO:  "allele": found 2518282 removable, 1257262 nonremovable row versions
> in 31511 pages

OK, what this is telling you is that the allele table had 2518282 dead
rows, and 1257262 live ones.  That's about 2 dead rows for every 1
live one, so that's a fair amount of bloat.

>  Total free space (including removable row versions) is 161940948 bytes.

This tells us that vacuum full reclaimed 162Megs or so of free space.

>  I would appreciate it if you could give me a bit of
>  interpretation from psql.

It looks to me like your tables were bloated.  After running vacuum
full your cron should run faster now.

HOWEVER, running vacuum full is more of a recovery procedure than a
normal maintenance operation.  Normally what you do is run the
autovacuum daemon and let it vacuum your tables automagically when
needed.  If the autovacuum daemon isn't running, then you can cron up
some vacuum analyze (no full) jobs to run every x minutes or hours and
that should keep the bloat at bay.

Try running your report again now and see how it does.

After you get pg_autovacuum running, run a vacuum verbose (just
verbose :) ) about a day after the system's been up and read the
bottom 10 or so lines from that (or post them here) to see how many
pages you need for fsm.  For a db with as many rows as you have,
100,000 is a good minimum starting point.  Numbers into the millions
aren't all that uncommon on larger servers.

Re: postgres bogged down beyond tolerance

From

"Kevin Grittner"

Date:

14 November 2007, 17:33:19

>>> On Wed, Nov 14, 2007 at  3:16 PM, in message
<dcc563d10711141316u53338170l5d31d7723c8dc30b@mail.gmail.com>, "Scott Marlowe"
<scott.marlowe@gmail.com> wrote:

> On Nov 14, 2007 2:26 PM, Tena Sakai <tsakai@gallo.ucsf.edu> wrote:
>>  INFO:  vacuuming "public.allele"
>>  INFO:  "allele": found 2518282 removable, 1257262 nonremovable row versions
>> in 31511 pages

>>  Total free space (including removable row versions) is 161940948 bytes.

> It looks to me like your tables were bloated.  After running vacuum
> full your cron should run faster now.

Isn't it usually a good idea to REINDEX after using VACUUM FULL to
recover from this level of bloat?

-Kevin

Re: postgres bogged down beyond tolerance

From

"Scott Marlowe"

Date:

14 November 2007, 17:47:26

On Nov 14, 2007 3:33 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> >>> On Wed, Nov 14, 2007 at  3:16 PM, in message
> <dcc563d10711141316u53338170l5d31d7723c8dc30b@mail.gmail.com>, "Scott Marlowe"
> <scott.marlowe@gmail.com> wrote:
>
> > On Nov 14, 2007 2:26 PM, Tena Sakai <tsakai@gallo.ucsf.edu> wrote:
> >>  INFO:  vacuuming "public.allele"
> >>  INFO:  "allele": found 2518282 removable, 1257262 nonremovable row versions
> >> in 31511 pages
>
> >>  Total free space (including removable row versions) is 161940948 bytes.
>
> > It looks to me like your tables were bloated.  After running vacuum
> > full your cron should run faster now.
>
> Isn't it usually a good idea to REINDEX after using VACUUM FULL to
> recover from this level of bloat?

Depends on your indexes.  These didn't look bloated to me, but that
wasn't the whole output of vacuum full either.  Yes, reindex is a good
idea after a vacuum full though.

Re: postgres bogged down beyond tolerance

From

"Tena Sakai"

Date:

15 November 2007, 01:22:10

Hi Scott, Kevin, Alvaro, and everybody else:

Many thanks for your help and advices. After vacuuming
and reindexing all tables, my nightly run zipped right
through at amazing speed.

Regards,

Tena Sakai
tsakai@gallo.ucsf.edu

-----Original Message-----
From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
Sent: Wed 11/14/2007 1:46 PM
To: Kevin Grittner
Cc: Tena Sakai; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] postgres bogged down beyond tolerance

On Nov 14, 2007 3:33 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> >>> On Wed, Nov 14, 2007 at 3:16 PM, in message
> <dcc563d10711141316u53338170l5d31d7723c8dc30b@mail.gmail.com>, "Scott Marlowe"
> <scott.marlowe@gmail.com> wrote:
>
> > On Nov 14, 2007 2:26 PM, Tena Sakai <tsakai@gallo.ucsf.edu> wrote:
> >> INFO: vacuuming "public.allele"
> >> INFO: "allele": found 2518282 removable, 1257262 nonremovable row versions
> >> in 31511 pages
>
> >> Total free space (including removable row versions) is 161940948 bytes.
>
> > It looks to me like your tables were bloated. After running vacuum
> > full your cron should run faster now.
>
> Isn't it usually a good idea to REINDEX after using VACUUM FULL to
> recover from this level of bloat?

Depends on your indexes. These didn't look bloated to me, but that
wasn't the whole output of vacuum full either. Yes, reindex is a good
idea after a vacuum full though.