
From: Craig Ruff
Subject: Deadlock or other hang while vacuuming?
Date:
Msg-id: 20041108182615.GA5121@ucar.edu
List: pgsql-bugs
I have an annoying problem with some kind of hang or deadlock that is
sometimes triggered when running a vacuum on a table while a pair of
read-only cursors is enumerating different subsets of rows of that same
table.  (Originally, I could also have other queries that modified the
table running concurrently, but I prevented them from starting while the
vacuum query was running in order to narrow down the scope of the problem.)

I'm running PostgreSQL-7.4.5 on an SGI MIPS system running IRIX 6.5.24f,
compiled with the MIPSpro 7.4.2m C compiler.  The application driving
the database is multithreaded and can have numerous sessions open to
backends.  I verified that I had compiled PostgreSQL with threading
enabled.

The table contains approximately 570,000 to 600,000 entries and is defined
as follows:

    CREATE TABLE seg (
        id
            serial8
            PRIMARY KEY,
        name
            varchar(20)
            NOT NULL,
        lv_id
            int4
            NOT NULL
            REFERENCES lv(id),
        size
            int8
            NOT NULL
            CHECK (size >= 0),
        creation_time
            timestamp
            NOT NULL,
        last_use_time
            timestamp
            DEFAULT timestamp 'epoch'
            NOT NULL,
        UNIQUE(lv_id, name)
    ) WITHOUT OIDS;
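
The lv table referenced above isn't central to the problem.  For a
self-contained picture, a minimal stand-in that satisfies the foreign
key might look like this (the columns other than id are purely
illustrative, not my actual definition):

    -- Hypothetical stand-in: only lv.id matters for the seg foreign key.
    CREATE TABLE lv (
        id
            serial4
            PRIMARY KEY,
        name
            varchar(20)
            NOT NULL
    ) WITHOUT OIDS;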

The enumeration sessions take a while, as the client system driving them
is slow.  Each enumeration session has an exclusive backend connection,
and takes place inside a transaction.  An example sequence of events
looks like this:

    BEGIN;
    DECLARE lsess CURSOR FOR
      SELECT name, size, to_char(creation_time, 'YY.DDD'),
             to_char(last_use_time, 'YY.DDD')
            FROM seg WHERE lv_id = 12 AND name ~ '^M*';

(wait for a request for the next batch):

    FETCH 60 FROM lsess;

(repeat as necessary)

    CLOSE lsess;
    COMMIT;

I have a periodic task which kicks off vacuums of all of the tables in
the database every 20 minutes.  It vacuums the other tables, then runs
this query:

    VACUUM ANALYZE seg;


I'm not yet certain about the relative timing of the vacuum and the
declaration of the cursors; the vacuum may start first, or it may not.
I haven't figured that out yet (some additional debug output may be
necessary).
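
One thing I plan to try is snapshotting pg_stat_activity around the time
of the hang.  Assuming stats_command_string is enabled and that the
query_start column is available in this version, something along these
lines should show which session started its query first and what each
backend is currently running:

    -- Per-backend activity; current_query is only populated when
    -- stats_command_string = true.  (Not yet run against the hung state.)
    SELECT procpid, usename, query_start, current_query
      FROM pg_stat_activity
     ORDER BY query_start;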

What happens is that the application grinds to a halt.  Looking at
core files (generated with kill -ILL <app>) shows that the vacuum query
is waiting for the result; the stack backtrace looks like this:

    pqSocketPoll
    pqSocketCheck
    pqWaitTimed
    pqWait
    PQgetResult
    PQexecFinish
    PQexec("VACUUM ANALYZE seg;")

(When I allowed the other concurrent table-modifying queries, many would
also block in pqSocketPoll waiting for results.)  This table is normally
vacuumed in less than 1 minute, but even waiting for 1.5 hours does not
change things.  No backend appears to be active at that point.

Gathering information from the pg_locks table produces this:

    relname    |  pid   |           mode           | granted
---------------+--------+--------------------------+---------
 seg           | 678547 | ShareUpdateExclusiveLock | t         (VACUUM)
 seg           | 678547 | ShareUpdateExclusiveLock | t
 seg_lv_id_key | 703519 | AccessShareLock          | t         (CURSOR lsess #1)
 seg           | 703519 | AccessShareLock          | t
 seg_lv_id_key | 703567 | AccessShareLock          | t         (CURSOR lsess #2)
 seg           | 703567 | AccessShareLock          | t
 pg_class      | 777441 | AccessShareLock          | t
 pg_locks      | 777441 | AccessShareLock          | t
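
For reference, the listing above came from a query along these lines (the
exact statement I ran may have differed slightly; it joins pg_locks to
pg_class to resolve the relation names, and the parenthesized annotations
were added by hand):

    -- Resolve locked relation OIDs to names.
    SELECT c.relname, l.pid, l.mode, l.granted
      FROM pg_locks l
      JOIN pg_class c ON l.relation = c.oid
     ORDER BY c.relname, l.pid;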

I tried killing one of the backends handling one of the CURSORs to
see what its state looked like, but the core file was overwritten by
one from my app when it threw an exception while cleaning up the
aftermath. :-(  Nothing shows up in the serverlog output, other than
the normal connection and transaction log messages.

At this point, I'm ready to exclude the enumeration sessions from starting
when the vacuum is active, but I thought I'd try to gather information
just in case it is a problem in PostgreSQL.

Does anyone have any suggestions for tracking this down?

--

Craig Ruff          NCAR            cruff@ucar.edu
(303) 497-1211      P.O. Box 3000
            Boulder, CO  80307
