Vacuum going -D; crash or just impatience?

From
Josh Berkus
Date:
Folks,

I've a 7.2.4 report-generation database that has been growing for some time,
resulting in the nightly VACUUM FULL ANALYZE taking longer and longer.  (Most
of the data is copied in nightly from other systems, so the FSM is not very
effective.)

The problem is that the nightly admin scripts are programmed to check for a
locked-up nightly maintenance run and to "pg_ctl -m fast stop" it.  As the
VACUUM FULL now takes over an hour, the scripts falsely detected a lockup and
shut down the database in the middle of the VACUUM.

On restarting the database, I manually ran VACUUM FULL, and the VACUUM would
speed through until hitting the spot where the database had been shut down, at
which point the VACUUM process went to "D" state and apparently locked up for 10
minutes.  No error messages were written to the logs.  Unfortunately, I could
not give it longer to see if it recovered because this is a production system
and I had to get it up and running from backup by 9am.

Does this sound like a crash during VACUUM, or just like it needed more time?

If anyone wants to analyze, I have a complete backup of the post-problem
PGDATA directory.  The host system is RH Linux 8.0.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Vacuum going -D; crash or just impatience?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> Does this sound like a crash during VACUUM, or just like it needed more time?

I think it just needed more time.  VACUUM goes to great lengths to be
crash-safe.  I doubt that a "fast stop" could have left the database
in a corrupted state.

It is entirely likely that an interrupted VACUUM FULL would leave the
next VACUUM FULL with more, not less, work to do --- all of the tuples
the first one tried to move before being killed would now be dead and
need to be cleaned up.

> I've a 7.2.4 report-generation database that has been growing for some time,
> resulting in the nightly VACUUM FULL ANALYZE taking longer and longer.  (Most
> of the data is copied in nightly from other systems, so the FSM is not very
> effective.)

Are you saying that you delete most or all of the rows, then vacuum?
You might consider TRUNCATE if you delete all the rows, or CLUSTER
if you delete most, as a substitute for VACUUM FULL.  (You'd still want
to run ANALYZE, after you load fresh data.)  VACUUM FULL is really
designed for the case where there are not a huge number of dead rows
--- it gets awfully slow if it has to move lots of data.
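A sketch of both alternatives (table, index, and file names here are made up
for illustration; note that in this era TRUNCATE cannot run inside a
transaction block, and CLUSTER uses the old "CLUSTER indexname ON tablename"
syntax):

```sql
-- If the nightly load replaces every row, TRUNCATE reclaims all space at once:
TRUNCATE TABLE staging_data;
COPY staging_data FROM '/tmp/nightly_load.copy';
ANALYZE staging_data;

-- If most (but not all) rows are replaced, CLUSTER rewrites the table
-- compactly, leaving the dead tuples behind:
CLUSTER staging_data_pkey ON staging_data;
ANALYZE staging_data;
```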

Also, I think you have probably not given the FSM enough chance.
If the FSM settings are adequate then it should work fine to do
    -- delete unwanted rows
    VACUUM (not FULL, not ANALYZE)
    -- load new rows
    ANALYZE
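Concretely, that cycle might look like this (the table name, retention
predicate, and COPY source are hypothetical):

```sql
DELETE FROM report_facts WHERE load_date < CURRENT_DATE - 7; -- delete unwanted rows
VACUUM report_facts;         -- plain VACUUM: records the freed space in the FSM
COPY report_facts FROM '/tmp/new_batch.copy';                -- load new rows
ANALYZE report_facts;        -- refresh planner statistics
```

With adequate FSM settings, the freed pages get reused by the load instead of
the table growing, and no VACUUM FULL is needed.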

            regards, tom lane

Re: Vacuum going -D; crash or just impatience?

From
Josh Berkus
Date:
Tom,

> I think it just needed more time.  VACUUM goes to great lengths to be
> crash-safe.  I doubt that a "fast stop" could have left the database
> in a corrupted state.

OK, that's reassuring.  I would have liked to give the process more time, but
with users waiting ....

One thing I am puzzled by is the "D" status on the VACUUM process.  That would
seem to indicate that VACUUM was waiting for some other process ... but I
can't imagine what it could be.  Suggestions?

> Are you saying that you delete most or all of the rows, then vacuum?
> You might consider TRUNCATE if you delete all the rows, or CLUSTER
> if you delete most, as a substitute for VACUUM FULL.  (You'd still want
> to run ANALYZE, after you load fresh data.)  VACUUM FULL is really
> designed for the case where there are not a huge number of dead rows
> --- it gets awfully slow if it has to move lots of data.

There are several "holding" tables which are truncated and then rebuilt.  But
the tables that are holding up VACUUM are the permanent ones, which are
experiencing up to 900,000 updates every night.

> Also, I think you have probably not given the FSM enough chance.
> If the FSM settings are adequate then it should work fine to do

Well, the holdup is the indexes, which are recycling about 500,000 pages, and
in 7.2.4 the FSM doesn't help with index pages.  Unfortunately, dropping the
indexes during the data transformation isn't really an option, because the
indexes support some of the data transform steps.

I'm wondering if I need to REINDEX more often; I think I'll try that next.
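A minimal sketch of that, assuming a hypothetical table and index name; REINDEX
holds an exclusive lock on the table while it rebuilds, so it belongs in the
maintenance window:

```sql
REINDEX TABLE report_facts;     -- rebuilds every index on the table from scratch

-- Equivalent for a single badly bloated index:
DROP INDEX report_facts_date_idx;
CREATE INDEX report_facts_date_idx ON report_facts (load_date);
```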

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco

Re: Vacuum going -D; crash or just impatience?

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> One thing I am puzzled by is the "D" status on the VACUUM process.

Disk I/O wait state, no?  Pretty much what I'd expect for VACUUM ...

            regards, tom lane