Thread: COMMIT stuck for days after bulk delete
We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
which seems to be stuck at COMMIT for 2 days now:

mydb=# SELECT procpid, waiting, current_query,
           CURRENT_TIMESTAMP - query_start AS query_elapsed,
           CURRENT_TIMESTAMP - xact_start AS xact_elapsed
       FROM pg_stat_activity
       WHERE procpid != pg_backend_pid() AND current_query != '<IDLE>';
-[ RECORD 1 ]-+-----------------------
procpid       | 6061
waiting       | f
current_query | COMMIT;
query_elapsed | 2 days 08:59:17.619142
xact_elapsed  | 3 days 15:48:10.739912

The transaction behind that COMMIT has been the only thing running on
this Postgres instance for the past 3 days or so, since Postgres was
started on that machine. I spun up the EC2 instance for this database
solely to test a database subsetting process, which is what the
transaction was doing before it got stuck at COMMIT -- using a bunch of
DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or so of
our data in order to be able to pg_dump a slimmed-down development copy.

The EC2 instances we use have separate EBS-backed volumes for the
PostgreSQL data and WAL directories. The backend in question seems to be
stuck reading a ton of data from the data partition: the monitoring for
those EBS volumes shows they have been hammered at a constant aggregate
90 MB/sec of reads since that COMMIT started. The write bandwidth to the
postgresql-data partition has been almost nil since the COMMIT, and
there has been no read/write activity on the WAL volumes. Here, we can
see that the backend has managed to read 22 TB, even though the entire
database is only 228 GB on disk:
$ sudo cat /proc/6061/io
rchar: 24505414843923
wchar: 23516159014
syscr: 2991395854
syscw: 2874613
read_bytes: 24791719338496
write_bytes: 22417580032
cancelled_write_bytes: 221208576

$ df -h /dev/md0 /dev/md1
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        480G  228G  253G  48% /mnt/ebs/postgresql-data
/dev/md1         32G   20G   13G  61% /mnt/ebs/postgresql-wal

Running strace on the backend shows a whole ton of read() calls and the
occasional lseek(). I grabbed a backtrace of the backend with gdb,
attached. Also attached are the non-default pg_settings for this
instance. You'll notice that fsync, full_page_writes, and autovacuum
are all off: this is intentional, since this instance is transient and
has nothing important on it.

There are no interesting errors in the Postgres log files since the
instance was spun up. Any ideas on how to further diagnose or avoid
this problem?

Josh
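(A quick sanity check on the arithmetic: the read_bytes figure from /proc/6061/io above does work out to roughly the 22 TB mentioned.)

```shell
# Convert the read_bytes value reported in /proc/6061/io to TiB:
# 24791719338496 bytes against a database that is only 228 GB on disk.
awk 'BEGIN { printf "%.1f TiB\n", 24791719338496 / (1024 ^ 4) }'
# -> 22.5 TiB
```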
Josh Kupershmidt <schmiddy@gmail.com> writes:
> We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
> which seems to be stuck at COMMIT for 2 days now:
> ...
> The transaction behind that COMMIT has been the only thing running on
> this Postgres instance for the past 3 days or so, since Postgres was
> started on that machine. I spun the EC2 instance for this database up
> solely to test a database subsetting process, which is what the
> transaction was doing before it got stuck at COMMIT -- using a bunch
> of DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or
> so of our data in order to be able to pg_dump a slimmed-down
> development copy.

A plausible guess is that the backend is running around trying to verify
that some deferred foreign key constraints still hold.  But without
knowing what your schema is, that's only a guess.

If that is it, a likely solution is to drop *all* the FK constraints
before doing the bulk delete, then (in a new transaction, probably)
recreate the ones you still want.

			regards, tom lane
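A minimal sketch of that drop-then-recreate approach (the table, column, and constraint names here are hypothetical stand-ins, not from the actual schema):

```sql
-- Hypothetical example: drop the FK up front so COMMIT has no queued
-- deferred per-row checks to run, then recreate it afterwards in its
-- own transaction.
BEGIN;
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
-- ... drop the remaining FK constraints, run the bulk DELETEs ...
COMMIT;

BEGIN;
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id)
    DEFERRABLE INITIALLY DEFERRED;
COMMIT;
```

The point of the split: ADD CONSTRAINT revalidates the surviving rows with a single scan per constraint, instead of replaying one deferred trigger event for every row touched by the bulk delete.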
On Tue, Jan 14, 2014 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Kupershmidt <schmiddy@gmail.com> writes:
>> We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
>> which seems to be stuck at COMMIT for 2 days now:
>> ...
>> The transaction behind that COMMIT has been the only thing running on
>> this Postgres instance for the past 3 days or so, since Postgres was
>> started on that machine. I spun the EC2 instance for this database up
>> solely to test a database subsetting process, which is what the
>> transaction was doing before it got stuck at COMMIT -- using a bunch
>> of DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or
>> so of our data in order to be able to pg_dump a slimmed-down
>> development copy.
>
> A plausible guess is that the backend is running around trying to verify
> that some deferred foreign key constraints still hold.  But without
> knowing what your schema is, that's only a guess.

Yeah, that's a good guess. A bunch of the FK constraints I am dropping
and re-adding are marked DEFERRABLE INITIALLY DEFERRED; there are 167
counted by:

SELECT COUNT(*)
FROM pg_catalog.pg_constraint c
WHERE contype = 'f'
  AND condeferrable AND condeferred
  AND connamespace = (SELECT oid FROM pg_catalog.pg_namespace
                      WHERE nspname = 'public');

> If that is it, a likely solution is to drop *all* the FK constraints
> before doing the bulk delete, then (in a new transaction, probably)
> recreate the ones you still want.

Will try that, thanks for the suggestion.

Josh
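One way to script that suggestion against all 167 constraints at once is to generate the DDL from the catalogs; this is an untested sketch (pg_get_constraintdef() preserves each original definition for the re-ADD step):

```sql
-- Sketch: emit paired DROP/ADD statements for every FK constraint in
-- the public schema, to be saved to a file and replayed around the
-- bulk DELETE. Untested; shown for illustration only.
SELECT 'ALTER TABLE ' || conrelid::regclass ||
       ' DROP CONSTRAINT ' || quote_ident(conname) || ';' AS drop_sql,
       'ALTER TABLE ' || conrelid::regclass ||
       ' ADD CONSTRAINT ' || quote_ident(conname) || ' ' ||
       pg_get_constraintdef(c.oid) || ';' AS add_sql
FROM pg_catalog.pg_constraint c
WHERE contype = 'f'
  AND connamespace = (SELECT oid FROM pg_catalog.pg_namespace
                      WHERE nspname = 'public');
```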