Deadlock or other hang while vacuuming? - Mailing list pgsql-bugs
From:      Craig Ruff
Subject:   Deadlock or other hang while vacuuming?
Date:
Msg-id:    20041108182615.GA5121@ucar.edu
Responses: Re: Deadlock or other hang while vacuuming?
List:      pgsql-bugs
I have an annoying problem with some kind of a hang or deadlock that is sometimes triggered when running a vacuum on a table while a pair of read-only cursors are enumerating different subsets of rows of the same table. (Originally I could also have other queries that modified the table running concurrently, but I prevented them from starting while the vacuum query was running, to narrow down the scope of the problem.)

I'm running PostgreSQL 7.4.5 on an SGI MIPS system running IRIX 6.5.24f, compiled with the MIPSpro 7.4.2m C compiler. The application driving the database is multithreaded and can have numerous sessions open to backends. I verified that PostgreSQL was compiled with threading enabled.

The table contains approximately 570,000 to 600,000 entries, and is defined thusly:

    CREATE TABLE seg (
        id             serial8 PRIMARY KEY,
        name           varchar(20) NOT NULL,
        lv_id          int4 NOT NULL REFERENCES lv(id),
        size           int8 NOT NULL CHECK (size >= 0),
        creation_time  timestamp NOT NULL,
        last_use_time  timestamp DEFAULT timestamp 'epoch' NOT NULL,
        UNIQUE(lv_id, name)
    ) WITHOUT OIDS;

The enumeration sessions take a while, as the client system driving them is slow. Each enumeration session has an exclusive backend connection and takes place inside a transaction. An example sequence of events looks like this:

    BEGIN;
    DECLARE lsess CURSOR FOR
        SELECT name, size, to_char(creation_time, 'YY.DDD'),
               to_char(last_use_time, 'YY.DDD')
        FROM seg
        WHERE lv_id = 12 AND name ~ '^M*';

    (wait for a request for the next batch)
    FETCH 60 FROM lsess;
    (repeat as necessary)

    CLOSE lsess;
    COMMIT;

I have a periodic task which kicks off vacuums of all of the tables in the database every 20 minutes. It vacuums the other tables, then runs this query:

    VACUUM ANALYZE seg;

I'm not yet certain about the relative timing of the vacuum and the declaration of the cursors. It may be that the vacuum starts first, or not; I haven't figured that out yet (some additional debug output may be necessary).

What happens is that the application grinds to a halt. Looking at core files (generated with kill -ILL <app>) shows that the vacuum query is waiting for the result; the stack backtrace looks like this:

    pqSocketPoll
    pqSocketCheck
    pqWaitTimed
    pqWait
    PQgetResult
    PQexecFinish
    PQexec("VACUUM ANALYZE seg;")

(When I allowed the other concurrent table-modifying queries, many of them would also block in pqSocketPoll waiting for results.)

This table is normally vacuumed in less than 1 minute, but even waiting for 1.5 hours does not change things. No backend appears to be active at that point. Gathering information from the pg_locks table produces this (a sketch of the query appears further below):

        relname    |  pid   |           mode           | granted
    ---------------+--------+--------------------------+---------
     seg           | 678547 | ShareUpdateExclusiveLock | t        (VACUUM)
     seg           | 678547 | ShareUpdateExclusiveLock | t
     seg_lv_id_key | 703519 | AccessShareLock          | t        (CURSOR lsess #1)
     seg           | 703519 | AccessShareLock          | t
     seg_lv_id_key | 703567 | AccessShareLock          | t        (CURSOR lsess #2)
     seg           | 703567 | AccessShareLock          | t
     pg_class      | 777441 | AccessShareLock          | t
     pg_locks      | 777441 | AccessShareLock          | t

I tried killing one of the backends handling one of the CURSORs to see what its state looked like, but the core file was overwritten by one from my app when it threw an exception while cleaning up the aftermath. :-(

Nothing shows up in the serverlog output, other than the normal connection and transaction log messages.
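For reference, the lock listing above can be produced with a query along these lines, joining pg_locks to pg_class; this is only a sketch and may not be exactly the query I ran:

    -- Sketch: show which backends hold which locks on the relations of interest.
    -- pg_locks.relation joins to pg_class.oid (the pg_locks view is available in 7.3 and later).
    SELECT c.relname, l.pid, l.mode, l.granted
      FROM pg_locks l
      JOIN pg_class c ON c.oid = l.relation
     WHERE c.relname IN ('seg', 'seg_lv_id_key', 'pg_class', 'pg_locks')
     ORDER BY c.relname, l.pid;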
At this point, I'm ready to exclude the enumeration sessions from starting when the vacuum is active, but I thought I'd try to gather information first, just in case it is a problem in PostgreSQL. Does anyone have any suggestions for tracking this down?

--
Craig Ruff          NCAR            cruff@ucar.edu
(303) 497-1211      P.O. Box 3000   Boulder, CO 80307