COMMIT stuck for days after bulk delete - Mailing list pgsql-performance

From: Josh Kupershmidt
Subject: COMMIT stuck for days after bulk delete
Date:
Msg-id: CAK3UJRHFhwnVyK-U6=aNmp=XU_ro-FtDygH6mo0NJusQffzxTA@mail.gmail.com
List: pgsql-performance
We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
which seems to be stuck at COMMIT for 2 days now:

mydb=# SELECT procpid, waiting, current_query,
CURRENT_TIMESTAMP - query_start AS query_elapsed,
CURRENT_TIMESTAMP - xact_start AS xact_elapsed
FROM pg_stat_activity WHERE procpid != pg_backend_pid() AND
current_query != '<IDLE>';
-[ RECORD 1 ]-+-----------------------
procpid       | 6061
waiting       | f
current_query | COMMIT;
query_elapsed | 2 days 08:59:17.619142
xact_elapsed  | 3 days 15:48:10.739912


The transaction behind that COMMIT has been the only thing running on
this Postgres instance for the past 3 days or so, since Postgres was
started on that machine. I spun up the EC2 instance for this database
solely to test a database-subsetting process, which is what the
transaction was doing before it got stuck at COMMIT: a bunch of DELETEs
and ALTER TABLE ... DROP|ADD CONSTRAINT statements to delete 90% or so
of our data, so that we can pg_dump a slimmed-down development copy.

The EC2 instances we use have separate EBS-backed volumes for the
PostgreSQL data and WAL directories. The backend in question seems to be
stuck reading a ton of data from the data partition: the EBS monitoring
shows the data volumes sustaining a constant aggregate of roughly
90 MB/sec of reads since the COMMIT started. Write bandwidth to the
postgresql-data partition has been almost nil since the COMMIT, and
there has been no read/write activity at all on the WAL volumes.
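(To cross-check this from the box itself rather than from the EBS
monitoring, a plain iostat from the sysstat package should tell the same
story:

$ iostat -xm 5
# expect ~90 MB/s of sustained reads on md0 (the data array) and an idle
# md1 (the WAL array), matching the EBS graphs described above.)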

Here we can see that the backend has managed to read roughly 22 TiB
(read_bytes below, about 24.8 * 10^12 bytes) despite the fact that the
entire database is only 228 GB on disk.

$ sudo cat /proc/6061/io
rchar: 24505414843923
wchar: 23516159014
syscr: 2991395854
syscw: 2874613
read_bytes: 24791719338496
write_bytes: 22417580032
cancelled_write_bytes: 221208576

$ df -h /dev/md0 /dev/md1
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        480G  228G  253G  48% /mnt/ebs/postgresql-data
/dev/md1         32G   20G   13G  61% /mnt/ebs/postgresql-wal

Running strace on the backend shows a whole ton of read() calls and the
occasional lseek(). I grabbed a backtrace of the backend with gdb; it is
attached.
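For reference, the commands were along these lines (6061 being the stuck
backend's PID):

$ sudo strace -p 6061                # stream the backend's system calls
$ sudo gdb -p 6061 -batch -ex bt     # one-shot backtrace, then detach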

Attached also are the non-default pg_settings for this instance.
You'll notice that fsync, full_page_writes, and autovacuum are all
off: this is intentional, since this instance is transient and has
nothing important on it. There are no interesting errors in the
Postgres log files since it was spun up.
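In postgresql.conf terms, the three settings called out above amount to
the following (the full list of non-default settings is in the
attachment):

fsync = off             # instance is disposable, so crash safety is moot
full_page_writes = off
autovacuum = off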

Any ideas on how to further diagnose or avoid this problem?

Josh

