Thread: vacuumdb hanging database cluster

vacuumdb hanging database cluster

From
Steve Crawford
Date:
When I run:
vacuumdb --full --all --analyze --quiet
on my database cluster it will complete in < 2 minutes (this cluster
is a few million total rows and ~2GB).

After testing, I set this up as an off-hours cron job and it worked
fine for several days then hung the whole database. After my pager
pulled me from bed I found the vacuumdb process still running but the
vacuum process on the first database (alphabetically) was showing it
was waiting:
postgres: vacuumdb --full --all --analyze --quiet
postgres: postgres firstdb [local] VACUUM waiting

A couple hundred processes were showing as "startup waiting" and one
was "idle in transaction". The process in the "VACUUM waiting" state
was the only one connected to that database - all other connections
were to other databases.

CPU and disk utilization were essentially zero. Suspecting a lock
problem I attempted to use a pre-existing connection to view pg_locks
but it would not respond.

I killed the vacuum process and all the processes in the "waiting"
states cleared within a second or two and system returned to normal.
The pg_locks query also returned but showed no useful info.

I tracked down the process that was "idle in transaction" and it was a
pg_dump process running on another machine. This process does a
periodic dump of one very small table and should complete in a
fraction of a second but was still waiting since the previous day -
apparently without deleterious effects.

There was no useful info in the log.

I've stopped running the vacuum full job via cron till I can trust it.
Any ideas on how to track/prevent this behavior? Server is version
7.4.1 and my web searches have proved futile.

Cheers,
Steve


Re: vacuumdb hanging database cluster

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On Behalf Of
> Steve Crawford
> Sent: Monday, July 26, 2004 1:23 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] vacuumdb hanging database cluster
>
>
> When I run:
> vacuumdb --full --all --analyze --quiet
> on my database cluster it will complete in < 2 minutes (this cluster
> is a few million total rows and ~2GB).
>
> After testing, I set this up as an off-hours cron job and it worked
> fine for several days then hung the whole database. After my pager
> pulled me from bed I found the vacuumdb process still running but the
> vacuum process on the first database (alphabetically) was showing it
> was waiting:
> postgres: vacuumdb --full --all --analyze --quiet
> postgres: postgres firstdb [local] VACUUM waiting
>
> A couple hundred processes were showing as "startup waiting" and one
> was "idle in transaction". The process in the "VACUUM waiting" state
> was the only one connected to that database - all other connections
> were to other databases.
>
> CPU and disk utilization were essentially zero. Suspecting a lock
> problem I attempted to use a pre-existing connection to view pg_locks
> but it would not respond.
>
> I killed the vacuum process and all the processes in the "waiting"
> states cleared within a second or two and system returned to normal.
> The pg_locks query also returned but showed no useful info.
>
> I tracked down the process that was "idle in transaction" and
> it was a
> pg_dump process running on another machine. This process does a
> periodic dump of one very small table and should complete in a
> fraction of a second but was still waiting since the previous day -
> apparently without deleterious effects.
>
> There was no useful info in the log.
>
> I've stopped running the vacuum full job via cron till I can
> trust it.
> Any ideas on how to track/prevent this behavior? Server is version
> 7.4.1 and my web searches have proved futile.

I have seen problematic behavior when one vacuum starts after another is
already running.
It might be a good idea to semaphore vacuum operations.
But my experience is with an older version of PostgreSQL, so the
problems you are seeing might be totally unrelated.

Re: vacuumdb hanging database cluster

From
Tom Lane
Date:
Steve Crawford <scrawford@pinpointresearch.com> writes:
> A couple hundred processes were showing as "startup waiting" and one
> was "idle in transaction". The process in the "VACUUM waiting" state
> was the only one connected to that database - all other connections
> were to other databases.

I suspect what must have happened is that the vacuum process was trying
to vacuum one of the shared catalogs (pg_database or pg_shadow), and was
blocked trying to get exclusive lock because someone else (the "idle in
transaction" guy) was holding some lock on that table.  At this point
all incoming connections, to any database, will block behind the VACUUM
until the idle guy closes his transaction and thereby releases his lock.

> I tracked down the process that was "idle in transaction" and it was a
> pg_dump process running on another machine. This process does a
> periodic dump of one very small table and should complete in a
> fraction of a second but was still waiting since the previous day -
> apparently without deleterious effects.

What was it waiting on?  Since it was idle instead of waiting, the
problem must have been on the client side.  I've not heard of pg_dump
just going to sleep for no reason...

> I've stopped running the vacuum full job via cron till I can trust it.
> Any ideas on how to track/prevent this behavior? Server is version
> 7.4.1 and my web searches have proved futile.

My recommendation would be to lose the --full.  If you're doing
sufficiently frequent vacuuming you have no need for that, and getting
rid of it means vacuum doesn't take exclusive table locks.  That means
it will neither block nor be blocked by ordinary readers and writers.

            regards, tom lane

Re: vacuumdb hanging database cluster

From
Steve Crawford
Date:
On Monday 26 July 2004 2:18 pm, Tom Lane wrote:
> Steve Crawford <scrawford@pinpointresearch.com> writes:
> > A couple hundred processes were showing as "startup waiting" and
> > one was "idle in transaction". The process in the "VACUUM
> > waiting" state was the only one connected to that database - all
> > other connections were to other databases.
>
> I suspect what must have happened is that the vacuum process was
> trying to vacuum one of the shared catalogs (pg_database or
> pg_shadow), and was blocked trying to get exclusive lock because
> someone else (the "idle in transaction" guy) was holding some lock
> on that table.  At this point all incoming connections, to any
> database, will block behind the VACUUM until the idle guy closes
> his transaction and thereby releases his lock.

I agree that it has the "smell" of a system-level lock - I just
couldn't get any info from pg_locks till it was cleared.

> > I tracked down the process that was "idle in transaction" and it
> > was a pg_dump process running on another machine. This process
> > does a periodic dump of one very small table and should complete
> > in a fraction of a second but was still waiting since the
> > previous day - apparently without deleterious effects.
>
> What was it waiting on?  Since it was idle instead of waiting, the
> problem must have been on the client side.  I've not heard of
> pg_dump just going to sleep for no reason...

Beats the heck out of me. We periodically dump some selected small
tables via a script using:
pg_dump -i -h $dbhost -U $dbuser -t $dumptable > dumpfile

It's very vanilla and generally works fine but sometimes (perhaps 1
per 1000+ runs) ends up idle in transaction. I'm going to take a much
closer look at pg_locks next time it happens.

The -i is because pg_dump on the client machine is 7.4.2 and the
server is 7.4.1 but that doesn't seem to be a problem.

> > I've stopped running the vacuum full job via cron till I can
> > trust it. Any ideas on how to track/prevent this behavior? Server
> > is version 7.4.1 and my web searches have proved futile.
>
> My recommendation would be to lose the --full.  If you're doing
> sufficiently frequent vacuuming you have no need for that, and
> getting rid of it means vacuum doesn't take exclusive table locks.
> That means it will neither block nor be blocked by ordinary readers
> and writers.

Yes, that's where I've headed. I'll save the occasional full vacuum
for manual running when I can watch it.

Thanks,
Steve


Re: vacuumdb hanging database cluster

From
Tom Lane
Date:
Steve Crawford <scrawford@pinpointresearch.com> writes:
>>> I tracked down the process that was "idle in transaction" and it
>>> was a pg_dump process running on another machine.
>>
>> What was it waiting on?

> Beats the heck out of me. We periodically dump some selected small
> tables via a script using:
> pg_dump -i -h $dbhost -U $dbuser -t $dumptable > dumpfile

> It's very vanilla and generally works fine but sometimes (perhaps 1
> per 1000+ runs) ends up idle in transaction. I'm going to take a much
> closer look at pg_locks next time it happens.

If it is "idle in transaction" and not "<something> waiting" then it is
not blocked waiting for someone's lock, so pg_locks is unlikely to yield
the answer.  I think it's got to be something funny on the pg_dump side.
Or maybe a communications problem?

            regards, tom lane