Thread: COMMIT stuck for days after bulk delete

COMMIT stuck for days after bulk delete

From
Josh Kupershmidt
Date:
We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
which seems to be stuck at COMMIT for 2 days now:

mydb=# SELECT procpid, waiting, current_query,
CURRENT_TIMESTAMP - query_start AS query_elapsed,
CURRENT_TIMESTAMP - xact_start AS xact_elapsed
FROM pg_stat_activity WHERE procpid != pg_backend_pid() AND
current_query != '<IDLE>';
-[ RECORD 1 ]-+-----------------------
procpid       | 6061
waiting       | f
current_query | COMMIT;
query_elapsed | 2 days 08:59:17.619142
xact_elapsed  | 3 days 15:48:10.739912


The transaction behind that COMMIT has been the only thing running on
this Postgres instance for the past 3 days or so, since Postgres was
started on that machine. I spun the EC2 instance for this database up
solely to test a database subsetting process, which is what the
transaction was doing before it got stuck at COMMIT -- using a bunch
of DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or
so of our data in order to be able to pg_dump a slimmed-down
development copy.

The EC2 instances we use have separate EBS-backed volumes for the
PostgreSQL data and WAL directories. The backend in question seems to
be stuck reading a ton of data from the data partition: the monitoring
for those EBS volumes shows those volumes have been hammered reading a
constant aggregate 90MB/sec since that COMMIT started. The write
bandwidth to the postgresql-data partition has been almost nil since
the COMMIT, and there has been no read/write activity on the WAL
volumes.

Here, we can see that backend has managed to read 22 TB despite the
fact that the entire database is only 228 GB on disk.

$ sudo cat /proc/6061/io
rchar: 24505414843923
wchar: 23516159014
syscr: 2991395854
syscw: 2874613
read_bytes: 24791719338496
write_bytes: 22417580032
cancelled_write_bytes: 221208576

$ df -h /dev/md0 /dev/md1
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        480G  228G  253G  48% /mnt/ebs/postgresql-data
/dev/md1         32G   20G   13G  61% /mnt/ebs/postgresql-wal

Running an strace on the backend shows a whole ton of read() calls and
the occasional lseek(). I grabbed a backtrace of the backend with gdb,
attached.

Attached also are the non-default pg_settings for this instance.
You'll notice that fsync, full_page_writes, and autovacuum are all
off: this is intentional, since this instance is transient and has
nothing important on it. There are no interesting errors in the
Postgres log files since it was spun up.

Any ideas on how to further diagnose or avoid this problem?

Josh

Attachment

Re: COMMIT stuck for days after bulk delete

From
Tom Lane
Date:
Josh Kupershmidt <schmiddy@gmail.com> writes:
> We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
> which seems to be stuck at COMMIT for 2 days now:
> ...
> The transaction behind that COMMIT has been the only thing running on
> this Postgres instance for the past 3 days or so, since Postgres was
> started on that machine. I spun the EC2 instance for this database up
> solely to test a database subsetting process, which is what the
> transaction was doing before it got stuck at COMMIT -- using a bunch
> of DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or
> so of our data in order to be able to pg_dump a slimmed-down
> development copy.

A plausible guess is that the backend is running around trying to verify
that some deferred foreign key constraints still hold.  But without
knowing what your schema is, that's only a guess.

If that is it, a likely solution is to drop *all* the FK constraints
before doing the bulk delete, then (in a new transaction, probably)
recreate the ones you still want.

            regards, tom lane


Re: COMMIT stuck for days after bulk delete

From
Josh Kupershmidt
Date:
On Tue, Jan 14, 2014 at 12:36 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Kupershmidt <schmiddy@gmail.com> writes:
>> We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
>> which seems to be stuck at COMMIT for 2 days now:
>> ...
>> The transaction behind that COMMIT has been the only thing running on
>> this Postgres instance for the past 3 days or so, since Postgres was
>> started on that machine. I spun the EC2 instance for this database up
>> solely to test a database subsetting process, which is what the
>> transaction was doing before it got stuck at COMMIT -- using a bunch
>> of DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or
>> so of our data in order to be able to pg_dump a slimmed-down
>> development copy.
>
> A plausible guess is that the backend is running around trying to verify
> that some deferred foreign key constraints still hold.  But without
> knowing what your schema is, that's only a guess.

Yeah, that's a good guess. A bunch of the FK constraints I am dropping
and re-adding are marked DEFERRABLE INITIALLY DEFERRED; there are 167
counted by:

SELECT COUNT(*)
  FROM pg_catalog.pg_constraint c
  WHERE contype = 'f' AND condeferrable AND condeferred AND
  connamespace =
    (SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'public') ;

> If that is it, a likely solution is to drop *all* the FK constraints
> before doing the bulk delete, then (in a new transaction, probably)
> recreate the ones you still want.

Will try that, thanks for the suggestion.

Josh