Thread: BUG #18354: Aborted transaction aborted during cleanup when temp_file_limit exceeded

BUG #18354: Aborted transaction aborted during cleanup when temp_file_limit exceeded

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18354
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.2
Operating system:   Ubuntu 22.04
Description:

The following script:
SET temp_file_limit = 100;
CREATE TABLE t (a text);
INSERT INTO t VALUES ('test');
SELECT 'INSERT INTO t SELECT * FROM t RETURNING *' FROM generate_series(1,
18)
\gexec

triggers two errors, a warning, and an assertion failure:
ERROR:  temporary file size exceeds temp_file_limit (100kB)
WARNING:  AbortTransaction while in ABORT state
ERROR:  temporary file size exceeds temp_file_limit (100kB)
server closed the connection unexpectedly

The backtrace for the first error is:
2024-02-21 11:40:06.947 UTC|law|regression|65d5e116.13cfcb|LOG:  statement:
INSERT INTO t SELECT * FROM t RETURNING *
2024-02-21 11:40:06.999 UTC|law|regression|65d5e116.13cfcb|ERROR:  temporary
file size exceeds temp_file_limit (100kB)
2024-02-21 11:40:06.999 UTC|law|regression|65d5e116.13cfcb|BACKTRACE:  
FileWrite at fd.c:2183:5
BufFileDumpBuffer at buffile.c:537:18
BufFileWrite at buffile.c:698:28
writetup_heap at tuplestore.c:1513:5
dumptuples at tuplestore.c:1210:35
tuplestore_puttuple_common at tuplestore.c:833:4
MemoryContextSwitchTo at palloc.h:142:23
 (inlined by) tuplestore_puttupleslot at tuplestore.c:722:2
tstoreReceiveSlot_notoast at tstoreReceiver.c:129:1
ExecutePlan at execMain.c:1701:7
standard_ExecutorRun at execMain.c:365:3
ExecutorRun at execMain.c:310:1
ProcessQuery at pquery.c:165:5
PortalRunMulti at pquery.c:1293:7
FillPortalStore at pquery.c:1042:8
PortalRun at pquery.c:763:6
exec_simple_query at postgres.c:1282:3
PostgresMain at postgres.c:4641:27
report_fork_failure_to_client at postmaster.c:4242:1
BackendStartup at postmaster.c:4199:22
ServerLoop at postmaster.c:1788:6
main at main.c:185:3

For the second error:
2024-02-21 11:40:07.001 UTC|law|regression|65d5e116.13cfcb|ERROR:  temporary
file size exceeds temp_file_limit (100kB)
2024-02-21 11:40:07.001 UTC|law|regression|65d5e116.13cfcb|BACKTRACE:  
FileWrite at fd.c:2183:5
BufFileDumpBuffer at buffile.c:537:18
BufFileFlush at buffile.c:723:3
BufFileClose at buffile.c:419:9
tuplestore_end at tuplestore.c:459:5
MemoryContextSwitchTo at palloc.h:142:23
 (inlined by) PortalDrop at portalmem.c:587:3
AtCleanup_Portals at portalmem.c:907:3
CleanupTransaction at xact.c:2903:2
AbortCurrentTransaction at xact.c:3341:18
PostgresMain at postgres.c:4356:6
report_fork_failure_to_client at postmaster.c:4242:1
BackendStartup at postmaster.c:4199:22
ServerLoop at postmaster.c:1788:6
main at main.c:185:3

For the failed assertion:
2024-02-21 11:40:07.001 UTC|law|regression|65d5e116.13cfcb|WARNING:
AbortTransaction while in ABORT state
TRAP: failed Assert("TransactionIdIsValid(proc->xid)"), File: "procarray.c",
Line: 680, PID: 1298379
ExceptionalCondition at assert.c:52:13
ProcArrayEndTransactionInternal at procarray.c:742:2
 (inlined by) ProcArrayEndTransaction at procarray.c:689:4
AbortTransaction at xact.c:2841:5
AbortCurrentTransaction at xact.c:3340:4
PostgresMain at postgres.c:4356:6
report_fork_failure_to_client at postmaster.c:4242:1
BackendStartup at postmaster.c:4199:22
ServerLoop at postmaster.c:1788:6
BackgroundWorkerInitializeConnection at postmaster.c:5604:1
main at main.c:185:3

Reproduced on REL_12_STABLE .. master.


At Wed, 21 Feb 2024 12:00:01 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> triggers two errors, a warning, and an assertion failure:
> ERROR:  temporary file size exceeds temp_file_limit (100kB)
> WARNING:  AbortTransaction while in ABORT state
> ERROR:  temporary file size exceeds temp_file_limit (100kB)
> server closed the connection unexpectedly
...
> For the second error:
> 2024-02-21 11:40:07.001 UTC|law|regression|65d5e116.13cfcb|ERROR:  temporary
> file size exceeds temp_file_limit (100kB)
> 2024-02-21 11:40:07.001 UTC|law|regression|65d5e116.13cfcb|BACKTRACE:  
> FileWrite at fd.c:2183:5
> BufFileDumpBuffer at buffile.c:537:18
> BufFileFlush at buffile.c:723:3
> BufFileClose at buffile.c:419:9
> tuplestore_end at tuplestore.c:459:5
> MemoryContextSwitchTo at palloc.h:142:23
>  (inlined by) PortalDrop at portalmem.c:587:3
> AtCleanup_Portals at portalmem.c:907:3

Therefore, BufFileClose should not flush the content during error
handling.  In the first place, tuplestore doesn't need to flush the
underlying files in _end and _clear. In this case, I would choose to
change the general behavior of tuplestore. The attached PoC patch
fixes the issue for me. It introduces a new "extended" function to
control flushing, avoiding the addition of an unnatural parameter to
BufFileClose. I suspect that it is usable in some other places, but I
haven't checked that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment
When the first error was reported, we entered AbortTransaction(),  proc->xid was changed to InvalidTransactionId in ProcArrayEndTransactionInternal().
After AbortTransaction() is done, we enter  into CleanupTransaction(), and error would be reported again. So we enter AbortTransaction() again because
the transaction block state is still TBLOCK_STARTED. MyProc->xid now is InvalidTransactionId, so the assert failed. 

Changing the behavior of tuplestore can work for this issue,  but I'm not sure if this change will affect other components which depend on BufFile(like nodeMaterial)

Kyotaro Horiguchi <horikyota.ntt@gmail.com> 于2024年2月22日周四 10:46写道:
At Wed, 21 Feb 2024 12:00:01 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> triggers two errors, a warning, and an assertion failure:
> ERROR:  temporary file size exceeds temp_file_limit (100kB)
> WARNING:  AbortTransaction while in ABORT state
> ERROR:  temporary file size exceeds temp_file_limit (100kB)
> server closed the connection unexpectedly
...
> For the second error:
> 2024-02-21 11:40:07.001 UTC|law|regression|65d5e116.13cfcb|ERROR:  temporary
> file size exceeds temp_file_limit (100kB)
> 2024-02-21 11:40:07.001 UTC|law|regression|65d5e116.13cfcb|BACKTRACE: 
> FileWrite at fd.c:2183:5
> BufFileDumpBuffer at buffile.c:537:18
> BufFileFlush at buffile.c:723:3
> BufFileClose at buffile.c:419:9
> tuplestore_end at tuplestore.c:459:5
> MemoryContextSwitchTo at palloc.h:142:23
>  (inlined by) PortalDrop at portalmem.c:587:3
> AtCleanup_Portals at portalmem.c:907:3

Therefore, BufFileClose should not flush the content during error
handling.  In the first place, tuplestore doesn't need to flush the
underlying files in _end and _clear. In this case, I would choose to
change the general behavior of tuplestore. The attached PoC patch
fixes the issue for me. It introduces a new "extended" function to
control flushing, avoiding the addition of an unnatural parameter to
BufFileClose. I suspect that it is usable in some other places, but I
haven't checked that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


--
Tender Wang
OpenPie:  https://en.openpie.com/
At Thu, 22 Feb 2024 11:49:29 +0800, Tender Wang <tndrwang@gmail.com> wrote in 
> Changing the behavior of tuplestore can work for this issue,  but I'm not
> sure if this change will affect other components which depend on
> BufFile(like nodeMaterial)

Right. Regarding this type of error, the best we can do is to be
careful not to cause exceptions (elog/ereport(>=ERROR)) within the
abort code path. This is somewhat similar to the case of the
interaction between malloc() and signal handler.

First I tried to skip flushing in BufFileClose based on transaction
state and isInterXact state. However, that didn't seem that simple. We
could provide some flag to signal we're in the state where we cannot
error out, which BufFileClose refers to. That in turn can cause other
issues.

I mentioned that I haven't checked for the similar potential issues
BufFileClose can cause, but if nodeMaterial is that, the new function
could be usable for that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



At Thu, 22 Feb 2024 14:38:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> At Thu, 22 Feb 2024 11:49:29 +0800, Tender Wang <tndrwang@gmail.com> wrote in 
> > Changing the behavior of tuplestore can work for this issue,  but I'm not
> > sure if this change will affect other components which depend on
> > BufFile(like nodeMaterial)

In the case of nodeMaterial, tuplestore is simply discarded and
underlying files are removed without flushing by the resource owner
mechanism. On the other hand, in the case of this bug report, the
files in the problem tuplestore are not managed by a resource
owner. As a result, PortalDrop have to explicitly call tuplestore_end
to release resources used by the tuplestore.

>     /*
>      * Delete tuplestore if present.  We should do this even under error
>      * conditions; since the tuplestore would have been using cross-
>      * transaction storage, its temp files need to be explicitly deleted.
>      */
>     if (portal->holdStore)
>     {
>         MemoryContext oldcontext;
> 
>         oldcontext = MemoryContextSwitchTo(portal->holdContext);
>         tuplestore_end(portal->holdStore);

As far as I can see, there are no caller sites to tuplestore_end in
the abort path or any error handling path.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center