Thread: cannot abort transaction 2737414167, it was already committed

cannot abort transaction 2737414167, it was already committed

From: Justin Pryzby
We had this:

< 2023-12-25 04:06:20.062 MST telsasoft >ERROR:  could not open file "pg_tblspc/16395/PG_16_202307071/16384/121010871": Input/output error
< 2023-12-25 04:06:20.062 MST telsasoft >STATEMENT:  commit
< 2023-12-25 04:06:20.062 MST telsasoft >WARNING:  AbortTransaction while in COMMIT state
< 2023-12-25 04:06:20.062 MST telsasoft >PANIC:  cannot abort transaction 2737414167, it was already committed
< 2023-12-25 04:06:20.473 MST  >LOG:  server process (PID 14678) was terminated by signal 6: Aborted

The application is a daily cronjob which would've just done:

begin;
lo_unlink(); -- the client-side function called from pygresql;
DELETE FROM tbl WHERE col=%s;
commit;

The table being removed would've been a transient (but not "temporary")
table created ~1 day prior.

It's possible that the filesystem had an IO error, but I can't find any
evidence of that.  Postgres is running entirely on zfs, which says:

scan: scrub repaired 0B in 00:07:03 with 0 errors on Mon Dec 25 04:49:07 2023
errors: No known data errors

My main question is why an IO error would cause the DB to abort, rather
than raising an ERROR.

This is pg16 compiled at efa8f6064, running under centos7.  ZFS is 2.2.2,
but the pool hasn't been upgraded to use the features new since 2.1.

(gdb) bt
#0  0x00007fc961089387 in raise () from /lib64/libc.so.6
#1  0x00007fc96108aa78 in abort () from /lib64/libc.so.6
#2  0x00000000009438b7 in errfinish (filename=filename@entry=0xac8e20 "xact.c", lineno=lineno@entry=1742,
    funcname=funcname@entry=0x9a6600 <__func__.32495> "RecordTransactionAbort") at elog.c:604
#3  0x000000000054d6ab in RecordTransactionAbort (isSubXact=isSubXact@entry=false) at xact.c:1741
#4  0x000000000054d7bd in AbortTransaction () at xact.c:2814
#5  0x000000000054e015 in AbortCurrentTransaction () at xact.c:3415
#6  0x0000000000804e4e in PostgresMain (dbname=0x12ea840 "ts", username=0x12ea828 "telsasoft") at postgres.c:4354
#7  0x000000000077bdd6 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4465
#8  BackendStartup (port=0x12e44c0) at postmaster.c:4193
#9  ServerLoop () at postmaster.c:1783
#10 0x000000000077ce9a in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x12ad280) at postmaster.c:1467
#11 0x00000000004ba8b8 in main (argc=3, argv=0x12ad280) at main.c:198

#3  0x000000000054d6ab in RecordTransactionAbort (isSubXact=isSubXact@entry=false) at xact.c:1741
        xid = 2737414167
        rels = 0x94f549 <hash_seq_init+73>
        ndroppedstats = 0
        droppedstats = 0x0

#4  0x000000000054d7bd in AbortTransaction () at xact.c:2814
        is_parallel_worker = false

-- 
Justin



Re: cannot abort transaction 2737414167, it was already committed

From: Thomas Munro
On Thu, Dec 28, 2023 at 4:02 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> My main question is why an IO error would cause the DB to abort, rather
> than raising an ERROR.

In CommitTransaction() there is a stretch of code beginning s->state =
TRANS_COMMIT and ending s->state = TRANS_DEFAULT, from which we call
out to various subsystems' AtEOXact_XXX() functions.  There is no way
to roll back in that state, so anything that throws ERROR from those
routines is going to get something much like $SUBJECT.  Hmm, we'd know
which exact code path got that EIO from your smoldering core if we'd
put an explicit critical section there (if we're going to PANIC
anyway, it might as well not be from a different stack after
longjmp()...).
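
Schematically, the stretch looks something like this (heavily condensed
pseudo-code, not the real xact.c):

/* Condensed sketch of CommitTransaction(), not the actual source */
static void
CommitTransaction(void)
{
    TransactionState s = CurrentTransactionState;

    /* ... pre-commit steps, including AtEOXact_RelationMap() ... */

    s->state = TRANS_COMMIT;

    RecordTransactionCommit();  /* the commit is now durable */

    /*
     * From here until TRANS_DEFAULT we call out to per-subsystem cleanup:
     * smgrDoPendingDeletes(), DropRelationsAllBuffers(), AtEOXact_*(), etc.
     * There is no way to roll back any more, so an ERROR thrown in this
     * stretch longjmp()s to AbortCurrentTransaction(), which then PANICs
     * with $SUBJECT from an unhelpful stack.  An explicit
     * START_CRIT_SECTION() here would instead promote the original
     * ereport(ERROR) to PANIC at the point of failure.
     */

    s->state = TRANS_DEFAULT;
}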

I guess the large object usage isn't directly relevant (that module's
EOXact stuff seems to be finished before TRANS_COMMIT, but I don't
know that code well).  Everything later is supposed to be about
closing/releasing/cleaning up, and for example smgrDoPendingDeletes()
reaches code with this relevant comment:

     * Note: smgr_unlink must treat deletion failure as a WARNING, not an
     * ERROR, because we've already decided to commit or abort the current
     * xact.

We don't really have a general ban on ereporting on system call
failure, though.  We've just singled unlink() out.  Only a few lines
above that we call DropRelationsAllBuffers(rels, nrels), and that
calls smgrnblocks(), and that might need to re-open() the
relation file to do lseek(SEEK_END), because PostgreSQL itself has no
tracking of relation size.  Hard to say but my best guess is that's
where you might have got your EIO, assuming you dropped the relation
in this transaction?
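
To make that concrete: the size probe boils down to a plain open() plus
lseek(SEEK_END) on the segment file, so generic POSIX code along these
lines (an illustration only, not md.c) can hit EIO at exactly that point:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Illustration only, not PostgreSQL's md.c: asking "how many blocks does
 * this relation have?" means opening the segment file and seeking to its
 * end, so a failing device can hand back EIO during post-commit cleanup.
 */
static long
probe_file_size(const char *path)
{
    int     fd;
    off_t   end;

    fd = open(path, O_RDONLY);      /* the open() is what got EIO in the report */
    if (fd < 0)
        return -1;

    end = lseek(fd, 0, SEEK_END);   /* size in bytes -> number of blocks */
    if (end < 0)
    {
        int     save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    close(fd);
    return (long) end;
}

int
main(void)
{
    const char *path = "pg_tblspc/16395/PG_16_202307071/16384/121010871";

    if (probe_file_size(path) < 0)
        fprintf(stderr, "could not open file \"%s\": %s\n",
                path, strerror(errno));
    return 0;
}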

> This is pg16 compiled at efa8f6064, running under centos7.  ZFS is 2.2.2,
> but the pool hasn't been upgraded to use the features new since 2.1.

I've been following recent ZFS stuff from a safe distance as a user.
AFAIK the extremely hard to hit bug fixed in that very recent release
didn't technically require the interesting new feature (namely block
cloning, though I think that helped people find the root cause after a
phase of false blame?).  Anyway, its symptom was some bogus zero bytes
on read, not a spurious EIO.



Re: cannot abort transaction 2737414167, it was already committed

From: Tom Lane
Thomas Munro <thomas.munro@gmail.com> writes:
> In CommitTransaction() there is a stretch of code beginning s->state =
> TRANS_COMMIT and ending s->state = TRANS_DEFAULT, from which we call
> out to various subsystems' AtEOXact_XXX() functions.  There is no way
> to roll back in that state, so anything that throws ERROR from those
> routines is going to get something much like $SUBJECT.  Hmm, we'd know
> which exact code path got that EIO from your smoldering core if we'd
> put an explicit critical section there (if we're going to PANIC
> anyway, it might as well not be from a different stack after
> longjmp()...).

+1, there's basically no hope of debugging this sort of problem
as things stand.

            regards, tom lane



Re: cannot abort transaction 2737414167, it was already committed

From: Justin Pryzby
On Thu, Dec 28, 2023 at 11:33:16AM +1300, Thomas Munro wrote:
> I guess the large object usage isn't directly relevant (that module's
> EOXact stuff seems to be finished before TRANS_COMMIT, but I don't
> know that code well).  Everything later is supposed to be about
> closing/releasing/cleaning up, and for example smgrDoPendingDeletes()
> reaches code with this relevant comment:
> 
>      * Note: smgr_unlink must treat deletion failure as a WARNING, not an
>      * ERROR, because we've already decided to commit or abort the current
>      * xact.
> 
> We don't really have a general ban on ereporting on system call
> failure, though.  We've just singled unlink() out.  Only a few lines
> above that we call DropRelationsAllBuffers(rels, nrels), and that
> calls smgrnblocks(), and that might need to re-open() the
> relation file to do lseek(SEEK_END), because PostgreSQL itself has no
> tracking of relation size.  Hard to say but my best guess is that's
> where you might have got your EIO, assuming you dropped the relation
> in this transaction?

Yeah.  In fact I was confused - this was not lo_unlink().
This uses normal tables, so would've done:

"begin;"
"DROP TABLE IF EXISTS %s", tablename
"DELETE FROM cached_objects WHERE cache_name=%s", tablename
"commit;"

> > This is pg16 compiled at efa8f6064, running under centos7.  ZFS is 2.2.2,
> > but the pool hasn't been upgraded to use the features new since 2.1.
> 
> I've been following recent ZFS stuff from a safe distance as a user.
> AFAIK the extremely hard to hit bug fixed in that very recent release
> didn't technically require the interesting new feature (namely block
> cloning, though I think that helped people find the root cause after a
> phase of false blame?).  Anyway, its symptom was some bogus zero bytes
> on read, not a spurious EIO.

The ZFS bug had to do with bogus bytes which may-or-may-not-be-zero, as
I understand it.  The understanding is that the bug was pre-existing but
became easier to hit in 2.2, and is fixed in 2.2.2 and 2.1.14.

-- 
Justin



Re: cannot abort transaction 2737414167, it was already committed

From: Justin Pryzby
On Wed, Dec 27, 2023 at 09:02:25AM -0600, Justin Pryzby wrote:
> We had this:
> 
> < 2023-12-25 04:06:20.062 MST telsasoft >ERROR:  could not open file "pg_tblspc/16395/PG_16_202307071/16384/121010871": Input/output error
> < 2023-12-25 04:06:20.062 MST telsasoft >STATEMENT:  commit
> < 2023-12-25 04:06:20.062 MST telsasoft >WARNING:  AbortTransaction while in COMMIT state
> < 2023-12-25 04:06:20.062 MST telsasoft >PANIC:  cannot abort transaction 2737414167, it was already committed
> < 2023-12-25 04:06:20.473 MST  >LOG:  server process (PID 14678) was terminated by signal 6: Aborted

> It's possible that the filesystem had an IO error, but I can't find any
> evidence of that.  Postgres is running entirely on zfs, which says:

FYI: the VM which hit this error also just hit:

log_time               | 2024-01-07 05:19:11.611-07
error_severity         | ERROR
message                | could not open file "pg_tblspc/16395/PG_16_202307071/16384/123270438_vm": Input/output error
query                  | commit
location               | mdopenfork, md.c:663

Since I haven't seen this anywhere else, that's good evidence it's an
issue with the backing storage (even though the FS/kernel aren't nice
enough to say so).

-- 
Justin



Re: cannot abort transaction 2737414167, it was already committed

From: Thomas Munro

On Thu, Dec 28, 2023 at 11:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > In CommitTransaction() there is a stretch of code beginning s->state =
> > TRANS_COMMIT and ending s->state = TRANS_DEFAULT, from which we call
> > out to various subsystems' AtEOXact_XXX() functions.  There is no way
> > to roll back in that state, so anything that throws ERROR from those
> > routines is going to get something much like $SUBJECT.  Hmm, we'd know
> > which exact code path got that EIO from your smoldering core if we'd
> > put an explicit critical section there (if we're going to PANIC
> > anyway, it might as well not be from a different stack after
> > longjmp()...).
>
> +1, there's basically no hope of debugging this sort of problem
> as things stand.

I was reminded of this thread by Justin's other file system snafu thread.

Naively defining a critical section to match the extent of the
TRANS_COMMIT state doesn't work, as a bunch of code under there uses
palloc().  That reminds me of the nearby RelationTruncate() thread,
and there is possibly even some overlap, plus more in this case...
ugh.
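
(Background for anyone following along: allocating in a critical section
is normally disallowed, because an out-of-memory ERROR there would be
escalated to PANIC.  Paraphrasing the guard on the allocation paths in
mcxt.c -- not the exact source:)

/* Paraphrase of the check guarding mcxt.c's allocation paths */
void *
palloc(Size size)
{
    /*
     * Refuse to allocate inside a critical section unless the current
     * memory context was explicitly flagged with
     * MemoryContextAllowInCriticalSection().
     */
    Assert(CritSectionCount == 0 ||
           CurrentMemoryContext->allowInCritSection);

    /* ... actual allocation ... */
}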

Hmm, AtEOXact_RelationMap() is one of those steps, but lives just
outside the crypto-critical-section created by TRANS_COMMIT, though
has its own normal CS for logging.  I wonder, given that "updating the
map file is effectively commit of the relocation", why wouldn't it
have a variant of the problem solved by DELAY_CHKPT_START for normal
commit records, under diabolical scheduling?  It's a stretch, but: You
log XLOG_RELMAP_UPDATE, a concurrent checkpoint runs with REDO after
that record, you crash before/during durable_rename(), and then you
perform crash recovery.  Now your catalog is still using the old
relfilenode on the primary, but any replica following along replays
XLOG_RELMAP_UPDATE and is using the new relfilenode, frozen in time,
for queries, while replaying changes to the old relfilenode.  Right?
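
For comparison, the interlock for ordinary commit records is roughly this
(condensed from RecordTransactionCommit(), not the exact source):

/* Condensed sketch of RecordTransactionCommit(), not the exact source */
START_CRIT_SECTION();

/*
 * Stop any concurrent checkpoint from placing its REDO pointer between
 * our commit record and the pg_xact update: otherwise a crash shortly
 * after such a checkpoint could lose the commit, because redo would start
 * past the commit record while pg_xact still says "in progress".
 */
MyProc->delayChkptFlags |= DELAY_CHKPT_START;

XactLogCommitRecord(...);                            /* WAL commit record */
TransactionIdCommitTree(xid, nchildren, children);   /* update pg_xact */

MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
END_CRIT_SECTION();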



Re: cannot abort transaction 2737414167, it was already committed

On Thu, May 09, 2024 at 05:19:47PM +1200, Thomas Munro wrote:
> On Thu, Dec 28, 2023 at 11:42 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > > In CommitTransaction() there is a stretch of code beginning s->state =
> > > TRANS_COMMIT and ending s->state = TRANS_DEFAULT, from which we call
> > > out to various subsystems' AtEOXact_XXX() functions.  There is no way
> > > to roll back in that state, so anything that throws ERROR from those
> > > routines is going to get something much like $SUBJECT.  Hmm, we'd know
> > > which exact code path got that EIO from your smoldering core if we'd
> > > put an explicit critical section there (if we're going to PANIC
> > > anyway, it might as well not be from a different stack after
> > > longjmp()...).
> >
> > +1, there's basically no hope of debugging this sort of problem
> > as things stand.
> 
> I was reminded of this thread by Justin's other file system snafu thread.
> 
> Naively defining a critical section to match the extent of the
> TRANS_COMMIT state doesn't work, as a bunch of code under there uses
> palloc().  That reminds me of the nearby RelationTruncate() thread,
> and there is possibly even some overlap, plus more in this case...
> ugh.
> 
> Hmm, AtEOXact_RelationMap() is one of those steps, but lives just
> outside the crypto-critical-section created by TRANS_COMMIT, though
> has its own normal CS for logging.  I wonder, given that "updating the
> map file is effectively commit of the relocation", why wouldn't it
> have a variant of the problem solved by DELAY_CHKPT_START for normal
> commit records, under diabolical scheduling?  It's a stretch, but: You
> log XLOG_RELMAP_UPDATE, a concurrent checkpoint runs with REDO after
> that record, you crash before/during durable_rename(), and then you
> perform crash recovery.

See the CheckPointRelationMap() header comment for how relmapper behaves like
DELAY_CHKPT_START without using that flag.  I think its mechanism suffices.