Re: shared tempfile was not removed on statement_timeout (unreproducible) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: shared tempfile was not removed on statement_timeout (unreproducible)
Date
Msg-id CA+hUKGJStr-3B6qNnFEOpES8HHc3Wwe3wSrYYQJcQhHuTB9SdQ@mail.gmail.com
In response to shared tempfile was not removed on statement_timeout (unreproducible)  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: shared tempfile was not removed on statement_timeout (unreproducible)
Re: shared tempfile was not removed on statement_timeout
List pgsql-hackers
On Fri, Dec 13, 2019 at 7:05 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> I have a nagios check on ancient tempfiles, intended to catch debris left by
> crashed processes.  But triggered on this file:
>
> $ sudo find /var/lib/pgsql/12/data/base/pgsql_tmp -ls
> 142977    4 drwxr-x---   3 postgres postgres     4096 Dec 12 11:32 /var/lib/pgsql/12/data/base/pgsql_tmp
> 169868    4 drwxr-x---   2 postgres postgres     4096 Dec  7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset
> 169347 5492 -rw-r-----   1 postgres postgres  5619712 Dec  7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/0.0
> 169346 5380 -rw-r-----   1 postgres postgres  5505024 Dec  7 01:35 /var/lib/pgsql/12/data/base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/1.0
>
> I found:
>  2019-12-07 01:35:56 | 11025 | postgres | canceling statement due to statement timeout | CLUSTER pg_stat_database_snap USI
>  2019-12-07 01:35:56 | 11025 | postgres | temporary file: path "base/pgsql_tmp/pgsql_tmp11025.0.sharedfileset/2.0", size 5455872 | CLUSTER pg_stat_database_snap USI

Hmm.  I played around with this and couldn't reproduce it, but I
thought of something.  What if the statement timeout is reached while
we're in here:

#0  PathNameDeleteTemporaryDir (dirname=0x7fffffffd010 "base/pgsql_tmp/pgsql_tmp28884.31.sharedfileset") at fd.c:1471
#1  0x0000000000a32c77 in SharedFileSetDeleteAll (fileset=0x80182e2cc) at sharedfileset.c:177
#2  0x0000000000a327e1 in SharedFileSetOnDetach (segment=0x80a6e62d8, datum=34385093324) at sharedfileset.c:206
#3  0x0000000000a365ca in dsm_detach (seg=0x80a6e62d8) at dsm.c:684
#4  0x000000000061621b in DestroyParallelContext (pcxt=0x80a708f20) at parallel.c:904
#5  0x00000000005d97b3 in _bt_end_parallel (btleader=0x80fe9b4b0) at nbtsort.c:1473
#6  0x00000000005d92f0 in btbuild (heap=0x80a7bc4c8, index=0x80a850a50, indexInfo=0x80fec1ab0) at nbtsort.c:340
#7  0x000000000067445b in index_build (heapRelation=0x80a7bc4c8, indexRelation=0x80a850a50, indexInfo=0x80fec1ab0, isreindex=true, parallel=true) at index.c:2963
#8  0x0000000000677bd3 in reindex_index (indexId=16532, skip_constraint_checks=true, persistence=112 'p', options=0) at index.c:3591
#9  0x0000000000678402 in reindex_relation (relid=16508, flags=18, options=0) at index.c:3807
#10 0x000000000073928f in finish_heap_swap (OIDOldHeap=16508, OIDNewHeap=16573, is_system_catalog=false, swap_toast_by_content=false, check_constraints=false, is_internal=true, frozenXid=604, cutoffMulti=1, newrelpersistence=112 'p') at cluster.c:1409
#11 0x00000000007389ab in rebuild_relation (OldHeap=0x80a7bc4c8, indexOid=16532, verbose=false) at cluster.c:622
#12 0x000000000073849e in cluster_rel (tableOid=16508, indexOid=16532, options=0) at cluster.c:428
#13 0x0000000000737f22 in cluster (stmt=0x800cfcbf0, isTopLevel=true) at cluster.c:185
#14 0x0000000000a7cc5c in standard_ProcessUtility (pstmt=0x800cfcf40, queryString=0x800cfc120 "cluster t USING t_i_idx ;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x800cfd030, completionTag=0x7fffffffe0b0 "") at utility.c:654

The CHECK_FOR_INTERRUPTS() inside the walkdir() loop could ereport()
out of there after deleting some but not all of your files, but the
code in dsm_detach() has already popped the callback (which it does
"to avoid infinite error recursion"), so it won't run again on error
cleanup.  Hmm.  But then, if that were what happened, the two log lines
you quoted would presumably appear in the opposite order.
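
To make the suspected failure mode concrete, here is a toy model of it in
plain C.  It is only a sketch and not PostgreSQL code: setjmp()/longjmp()
stand in for ereport(ERROR) unwinding, a bool array stands in for the files
in the shared fileset, and every name in it (delete_all_files, detach,
leftover_files, and so on) is made up for illustration.

```c
/* Toy model, NOT PostgreSQL code: a detach callback that is popped before
 * it runs cannot be retried if it errors out partway through. */
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

#define NFILES 3
#define MAX_CALLBACKS 4

typedef void (*detach_callback)(void);

static jmp_buf error_context;               /* stands in for error unwinding */
static bool file_exists[NFILES];            /* stands in for the temp files */
static detach_callback on_detach[MAX_CALLBACKS];
static int n_callbacks;
static int interrupt_after;                 /* "timeout" fires after N deletions */

static void register_callback(detach_callback cb)
{
    assert(n_callbacks < MAX_CALLBACKS);
    on_detach[n_callbacks++] = cb;
}

/* Cleanup callback: delete files one by one, checking for interrupts. */
static void delete_all_files(void)
{
    int deleted = 0;

    for (int i = 0; i < NFILES; i++)
    {
        if (deleted == interrupt_after)
            longjmp(error_context, 1);      /* the interrupt check raises */
        file_exists[i] = false;
        deleted++;
    }
}

/* Detach: pop each callback first, then call it. */
static void detach(void)
{
    while (n_callbacks > 0)
    {
        detach_callback cb = on_detach[--n_callbacks];

        cb();
    }
}

/* Returns how many files remain after detach plus error cleanup. */
static int leftover_files(int timeout_after_n_deletions)
{
    for (int i = 0; i < NFILES; i++)
        file_exists[i] = true;
    n_callbacks = 0;
    register_callback(delete_all_files);
    interrupt_after = timeout_after_n_deletions;

    if (setjmp(error_context) == 0)
        detach();                           /* normal path */
    else
    {
        interrupt_after = -1;               /* no further interrupts */
        detach();                           /* error cleanup: list is already empty */
    }

    int leftover = 0;

    for (int i = 0; i < NFILES; i++)
        if (file_exists[i])
            leftover++;
    return leftover;
}
```

Calling leftover_files(1) models a timeout arriving after the first file has
been removed: the callback was already popped, so the second detach() has
nothing to run and the remaining files stay behind.  Calling
leftover_files(NFILES) models the case where no interrupt arrives in time and
everything is removed.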

> Actually, I tried using pg_ls_tmpdir(), but it unconditionally masks
> non-regular files and thus shared filesets.  Maybe that's worth discussion on a
> new thread ?
>
> src/backend/utils/adt/genfile.c
>                 /* Ignore anything but regular files */
>                 if (!S_ISREG(attrib.st_mode))
>                         continue;

+1, that's worth fixing.


