Thread: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

[CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Neha Sharma
Date:
Hello,

While executing the test case below, the server crashed with segfault 11 on the master branch.
I have enabled CLOBBER_CACHE_ALWAYS in src/include/pg_config_manual.h.

The issue reproduces only on the master branch.

Test Case:
CREATE TABLE sm_5_323_table (col1 numeric);
CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);

CLUSTER sm_5_323_table USING sm_5_323_idx;

\! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb -h localhost -p 5432 -d postgres

Test case output:
edb@edb:~/PGClobber_build/postgresql/inst/bin$ ./psql postgres
psql (14devel)
Type "help" for help.

postgres=# CREATE TABLE sm_5_323_table (col1 numeric);
CREATE TABLE
postgres=# CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
CREATE INDEX
postgres=# CLUSTER sm_5_323_table USING sm_5_323_idx;
CLUSTER
postgres=# \! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb -h localhost -p 5432 -d postgres
clusterdb: error: clustering of table "sm_5_323_table" in database "postgres" failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

Stack Trace:
Core was generated by `postgres: edb postgres 127.0.0.1(50978) CLUSTER         '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM, behavior=1) at md.c:485
485 if (reln->md_num_open_segs[forknum] > 0)
(gdb) bt
#0  0x000055e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM, behavior=1) at md.c:485
#1  0x000055e5c85eb2f0 in mdnblocks (reln=0x0, forknum=MAIN_FORKNUM) at md.c:768
#2  0x000055e5c85eb61b in mdimmedsync (reln=0x0, forknum=forknum@entry=MAIN_FORKNUM) at md.c:930
#3  0x000055e5c85ec6e5 in smgrimmedsync (reln=<optimized out>, forknum=forknum@entry=MAIN_FORKNUM) at smgr.c:662
#4  0x000055e5c81ae28b in end_heap_rewrite (state=state@entry=0x55e5ca5d1d70) at rewriteheap.c:342
#5  0x000055e5c81a32ea in heapam_relation_copy_for_cluster (OldHeap=0x7f212ce41ba0, NewHeap=0x7f212ce41058, OldIndex=<optimized out>, use_sort=<optimized out>, OldestXmin=<optimized out>,
    xid_cutoff=<optimized out>, multi_cutoff=0x7ffcba6ebe64, num_tuples=0x7ffcba6ebe68, tups_vacuumed=0x7ffcba6ebe70, tups_recently_dead=0x7ffcba6ebe78) at heapam_handler.c:984
#6  0x000055e5c82f218a in table_relation_copy_for_cluster (tups_recently_dead=0x7ffcba6ebe78, tups_vacuumed=0x7ffcba6ebe70, num_tuples=0x7ffcba6ebe68, multi_cutoff=0x7ffcba6ebe64,
    xid_cutoff=0x7ffcba6ebe60, OldestXmin=<optimized out>, use_sort=<optimized out>, OldIndex=0x7f212ce40670, NewTable=0x7f212ce41058, OldTable=0x7f212ce41ba0)
    at ../../../src/include/access/tableam.h:1656
#7  copy_table_data (pCutoffMulti=<synthetic pointer>, pFreezeXid=<synthetic pointer>, pSwapToastByContent=<synthetic pointer>, verbose=<optimized out>, OIDOldIndex=<optimized out>,
    OIDOldHeap=16384, OIDNewHeap=<optimized out>) at cluster.c:908
#8  rebuild_relation (verbose=<optimized out>, indexOid=<optimized out>, OldHeap=<optimized out>) at cluster.c:604
#9  cluster_rel (tableOid=<optimized out>, indexOid=<optimized out>, params=<optimized out>) at cluster.c:427
#10 0x000055e5c82f2b7f in cluster (pstate=pstate@entry=0x55e5ca5315c0, stmt=stmt@entry=0x55e5ca510368, isTopLevel=isTopLevel@entry=true) at cluster.c:195
#11 0x000055e5c85fcbc6 in standard_ProcessUtility (pstmt=0x55e5ca510430, queryString=0x55e5ca50f850 "CLUSTER public.sm_5_323_table;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
    queryEnv=0x0, dest=0x55e5ca510710, qc=0x7ffcba6ec340) at utility.c:822
#12 0x000055e5c85fd436 in ProcessUtility (pstmt=pstmt@entry=0x55e5ca510430, queryString=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
    queryEnv=<optimized out>, dest=dest@entry=0x55e5ca510710, qc=0x7ffcba6ec340) at utility.c:525
#13 0x000055e5c85f6148 in PortalRunUtility (portal=portal@entry=0x55e5ca570d70, pstmt=pstmt@entry=0x55e5ca510430, isTopLevel=isTopLevel@entry=true,
    setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55e5ca510710, qc=qc@entry=0x7ffcba6ec340) at pquery.c:1159
#14 0x000055e5c85f71a4 in PortalRunMulti (portal=portal@entry=0x55e5ca570d70, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
    dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710, qc=qc@entry=0x7ffcba6ec340) at pquery.c:1305
#15 0x000055e5c85f8823 in PortalRun (portal=portal@entry=0x55e5ca570d70, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
    dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710, qc=0x7ffcba6ec340) at pquery.c:779
#16 0x000055e5c85f389e in exec_simple_query (query_string=0x55e5ca50f850 "CLUSTER public.sm_5_323_table;") at postgres.c:1185
#17 0x000055e5c85f51cf in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffcba6ec670, dbname=<optimized out>, username=<optimized out>) at postgres.c:4415
#18 0x000055e5c8522240 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4470
#19 BackendStartup (port=<optimized out>) at postmaster.c:4192
#20 ServerLoop () at postmaster.c:1737
#21 0x000055e5c85237ec in PostmasterMain (argc=<optimized out>, argv=0x55e5ca508fe0) at postmaster.c:1409
#22 0x000055e5c811a2cf in main (argc=3, argv=0x55e5ca508fe0) at main.c:209

Thanks.
--
Regards,
Neha Sharma
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Amul Sul
Date:
In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
called next, which triggers a system cache invalidation; under the CCA
setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not restored
for the subsequent operations, causing the segmentation fault.

Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
fixes the failure; I did the same in the attached patch.

Regards,
Amul



On Mon, Mar 22, 2021 at 11:53 AM Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
> Hello,
>
> While executing the below test case server crashed with Segfault 11 on master branch.
>
> [... test case and stack trace trimmed ...]
>
> Regards,
> Neha Sharma

Attachment

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Amit Langote
Date:
On Mon, Mar 22, 2021 at 5:26 PM Amul Sul <sulamul@gmail.com> wrote:
> In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> rwstate->rs_new_rel->rd_smgr correctly but next line tuplesort_begin_cluster()
> get called which cause the system cache invalidation and due to CCA setting,
> wipe out rwstate->rs_new_rel->rd_smgr which wasn't restored for the subsequent
> operations and causes segmentation fault.
>
> By calling RelationOpenSmgr() before calling smgrimmedsync() in
> end_heap_rewrite() would fix the failure. Did the same in the attached patch.

That makes sense.  I see a few commits in the git history adding
RelationOpenSmgr() before a smgr* operation, whenever such a problem
would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
for example.

I do wonder if there are still other smgr* operations in the source
code that are preceded by operations that would invalidate the
SMgrRelation they are called with.  For example, the smgrnblocks() in
gistBuildCallback() may happen too long after the corresponding
RelationOpenSmgr() call on the index relation.

--
Amit Langote
EDB: http://www.enterprisedb.com



On Mon, Mar 22, 2021 at 3:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 5:26 PM Amul Sul <sulamul@gmail.com> wrote:
> > In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> > rwstate->rs_new_rel->rd_smgr correctly but next line tuplesort_begin_cluster()
> > get called which cause the system cache invalidation and due to CCA setting,
> > wipe out rwstate->rs_new_rel->rd_smgr which wasn't restored for the subsequent
> > operations and causes segmentation fault.
> >
> > By calling RelationOpenSmgr() before calling smgrimmedsync() in
> > end_heap_rewrite() would fix the failure. Did the same in the attached patch.
>
> That makes sense.  I see a few commits in the git history adding
> RelationOpenSmgr() before a smgr* operation, whenever such a problem
> would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
> for example.
>

Thanks for the confirmation.

> I do wonder if there are still other smgr* operations in the source
> code that are preceded by operations that would invalidate the
> SMgrRelation that those smgr* operations would be called with.  For
> example, the smgrnblocks() in gistBuildCallback() may get done too
> late than a corresponding RelationOpenSmgr() on the index relation.
>

I checked gistBuildCallback() by adding Assert(index->rd_smgr) before
smgrnblocks() in a CCA build and didn't see any problem there.

I think the easiest way to find such cases is to run the regression suite
with a CCA build; there is no guarantee that the regression tests will hit
all smgr* operations, but they should hit most of them.

Regards,
Amul






Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Neha Sharma
Date:
On Tue, Mar 23, 2021 at 10:08 AM Amul Sul <sulamul@gmail.com> wrote:
> I think the easiest way to find that is to run a regression suite with CCA
> build, perhaps, there is no guarantee that regression will hit all smgr*
> operations, but that might hit most of them.

Sure, will give a regression run with CCA enabled.

Regards,
Neha Sharma

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Michael Paquier
Date:
On Tue, Mar 23, 2021 at 10:52:09AM +0530, Neha Sharma wrote:
> Sure, will give a regression run with CCA enabled.

I can confirm the regression between 13 and HEAD, so I have added an
open item.  It would be good to figure out the root issue here, and I
am ready to bet that the problem is deeper than it looks and that more
code paths could be involved.

It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce.  It would be good to bisect
the origin point here as a first step.
--
Michael

Attachment

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Michael Paquier
Date:
On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:
> It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
> but the test is quick enough to reproduce.  It would be good to bisect
> the origin point here as a first step.

One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

I am too tired to poke at that today, so I'll try tomorrow.  Tom may
beat me at that though.
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:
>> It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
>> but the test is quick enough to reproduce.  It would be good to bisect
>> the origin point here as a first step.

> One bisect later, the winner is:
> commit: 3d351d916b20534f973eda760cde17d96545d4c4
> author: Tom Lane <tgl@sss.pgh.pa.us>
> date: Sun, 30 Aug 2020 12:21:51 -0400
> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

> I am too tired to poke at that today, so I'll try tomorrow.  Tom may
> beat me at that though.

I think that's an artifact.  That commit didn't touch anything related to
relation opening or closing.  What it could have done, though, is change
CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
thus causing us to follow the buggy code path where before we didn't.

The interesting question here seems to be "why didn't the existing
CLOBBER_CACHE_ALWAYS buildfarm testing catch this?".  It looks to me like
the answer is that it only happens for an empty table (or at least one
where the data pattern is such that we skip the RelationOpenSmgr call
earlier in end_heap_rewrite) and we don't happen to be exercising that
exact scenario in the regression tests.

            regards, tom lane



I wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> One bisect later, the winner is:
>> commit: 3d351d916b20534f973eda760cde17d96545d4c4
>> author: Tom Lane <tgl@sss.pgh.pa.us>
>> date: Sun, 30 Aug 2020 12:21:51 -0400
>> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

> I think that's an artifact.  That commit didn't touch anything related to
> relation opening or closing.  What it could have done, though, is change
> CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
> thus causing us to follow the buggy code path where before we didn't.

On closer inspection, I believe the true culprit is c6b92041d,
which did this:

      */
     if (RelationNeedsWAL(state->rs_new_rel))
-        heap_sync(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
 
     logical_end_heap_rewrite(state);

heap_sync was careful about opening rd_smgr, the new code not so much.

I read the rest of that commit and didn't see any other equivalent
bugs, but I might've missed something.

            regards, tom lane



On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Michael Paquier <michael@paquier.xyz> writes:
> >> One bisect later, the winner is:
> >> commit: 3d351d916b20534f973eda760cde17d96545d4c4
> >> author: Tom Lane <tgl@sss.pgh.pa.us>
> >> date: Sun, 30 Aug 2020 12:21:51 -0400
> >> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
>
> > I think that's an artifact.  That commit didn't touch anything related to
> > relation opening or closing.  What it could have done, though, is change
> > CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
> > thus causing us to follow the buggy code path where before we didn't.
>
> On closer inspection, I believe the true culprit is c6b92041d,
> which did this:
>
>          */
>         if (RelationNeedsWAL(state->rs_new_rel))
> -               heap_sync(state->rs_new_rel);
> +               smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
>
>         logical_end_heap_rewrite(state);
>
> heap_sync was careful about opening rd_smgr, the new code not so much.
>
> I read the rest of that commit and didn't see any other equivalent
> bugs, but I might've missed something.
>

I too didn't find any other place in this commit where a smgr* operation
is reached without the necessary precautions.  heap_sync() used to call
RelationOpenSmgr() via FlushRelationBuffers() before it reached
smgrimmedsync(), so we also need to ensure a RelationOpenSmgr() call
before smgrimmedsync(), as proposed previously.
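In patch form, the minimal fix would look roughly like this in end_heap_rewrite() (a sketch against the hunk quoted above, not necessarily the committed patch):

     */
    if (RelationNeedsWAL(state->rs_new_rel))
-        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
+    {
+        /* rd_smgr may have been closed by a relcache invalidation */
+        RelationOpenSmgr(state->rs_new_rel);
+        smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
+    }

    logical_end_heap_rewrite(state);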

Regards,
Amul



Amul Sul <sulamul@gmail.com> writes:
> On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> On closer inspection, I believe the true culprit is c6b92041d,
>> which did this:
>> -               heap_sync(state->rs_new_rel);
>> +               smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
>> heap_sync was careful about opening rd_smgr, the new code not so much.

> So we also need to make sure of the
> RelationOpenSmgr() call before smgrimmedsync() as proposed previously.

I wonder if we should try to get rid of this sort of bug by banning
direct references to rd_smgr?  That is, write the above and all
similar code like

    smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);

where we provide something like

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
    if (unlikely(rel->rd_smgr == NULL))
        RelationOpenSmgr(rel);
    return rel->rd_smgr;
}

and then we could get rid of most or all other RelationOpenSmgr calls.

This might create more code bloat than it's really worth, but
it'd be a simple and mechanically-checkable scheme.

            regards, tom lane



On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amul Sul <sulamul@gmail.com> writes:
> > On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> On closer inspection, I believe the true culprit is c6b92041d,
> >> which did this:
> >> -               heap_sync(state->rs_new_rel);
> >> +               smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
> >> heap_sync was careful about opening rd_smgr, the new code not so much.
>
> > So we also need to make sure of the
> > RelationOpenSmgr() call before smgrimmedsync() as proposed previously.
>
> I wonder if we should try to get rid of this sort of bug by banning
> direct references to rd_smgr?  That is, write the above and all
> similar code like
>
>         smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
>
> where we provide something like
>
> static inline struct SMgrRelationData *
> RelationGetSmgr(Relation rel)
> {
>         if (unlikely(rel->rd_smgr == NULL))
>                 RelationOpenSmgr(rel);
>         return rel->rd_smgr;
> }
>
> and then we could get rid of most or all other RelationOpenSmgr calls.
>

+1

> This might create more code bloat than it's really worth, but
> it'd be a simple and mechanically-checkable scheme.

I think that will be fine; it's a one-time pain. If you want, I will make those changes.

A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?

Regards,
Amul



Amul Sul <sulamul@gmail.com> writes:
> On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> static inline struct SMgrRelationData *
>> RelationGetSmgr(Relation rel)
>> {
>>     if (unlikely(rel->rd_smgr == NULL))
>>         RelationOpenSmgr(rel);
>>     return rel->rd_smgr;
>> }

> A quick question: Can't it be a macro instead of an inline function
> like other macros we have in rel.h?

The multiple-evaluation hazard seems like an issue.  We've tolerated
such hazards in the past, but mostly just because we weren't relying
on static inlines being available, so there wasn't a good way around
it.

Also, the conditional evaluation here would look rather ugly
in a macro, I think, if indeed you could do it at all without
provoking compiler warnings.

            regards, tom lane



Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Kyotaro Horiguchi
Date:
Sorry for the bug.

At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Amul Sul <sulamul@gmail.com> writes:
> > On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> static inline struct SMgrRelationData *
> >> RelationGetSmgr(Relation rel)
> >> {
> >>     if (unlikely(rel->rd_smgr == NULL))
> >>         RelationOpenSmgr(rel);
> >>     return rel->rd_smgr;
> >> }
> 
> > A quick question: Can't it be a macro instead of an inline function
> > like other macros we have in rel.h?
> 
> The multiple-evaluation hazard seems like an issue.  We've tolerated
> such hazards in the past, but mostly just because we weren't relying
> on static inlines being available, so there wasn't a good way around
> it.
> 
> Also, the conditional evaluation here would look rather ugly
> in a macro, I think, if indeed you could do it at all without
> provoking compiler warnings.

FWIW, +1 for the function as is.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Thu, Mar 25, 2021 at 12:10 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Sorry for the bug.
>
> At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> > Amul Sul <sulamul@gmail.com> writes:
> > > On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> static inline struct SMgrRelationData *
> > >> RelationGetSmgr(Relation rel)
> > >> {
> > >>     if (unlikely(rel->rd_smgr == NULL))
> > >>         RelationOpenSmgr(rel);
> > >>     return rel->rd_smgr;
> > >> }
> >
> > > A quick question: Can't it be a macro instead of an inline function
> > > like other macros we have in rel.h?
> >
> > The multiple-evaluation hazard seems like an issue.  We've tolerated
> > such hazards in the past, but mostly just because we weren't relying
> > on static inlines being available, so there wasn't a good way around
> > it.
> >
> > Also, the conditional evaluation here would look rather ugly
> > in a macro, I think, if indeed you could do it at all without
> > provoking compiler warnings.
>
> FWIW, +1 for the function as is.
>

Ok, in the attached patch I have added the inline function to rel.h; for
that, I ended up including smgr.h in rel.h.  I replaced all rel->rd_smgr
accesses with RelationGetSmgr() and removed the nearby RelationOpenSmgr()
calls, which I don't think are needed at all.

Regards,
Amul

Attachment

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Alvaro Herrera
Date:
On 2021-Mar-25, Amul Sul wrote:

> Ok, in the attached patch, I have added the inline function to rel.h, and for
> that, I end up including smgr.h to rel.h. I tried to replace all rel->rd_smgr
> by RelationGetSmgr() function and removed the RelationOpenSmgr() call from
> the nearby to it which I don't think needed at all.

We forgot this patch earlier in the commitfest.  Do people think we
should still get it in on this cycle?  I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.

-- 
Álvaro Herrera       Valdivia, Chile



Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Michael Paquier
Date:
On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> We forgot this patch earlier in the commitfest.  Do people think we
> should still get it in on this cycle?  I'm +1 on that, since it's a
> safety feature poised to prevent more bugs than it's likely to
> introduce.

No objections from here to do that now even after feature freeze.  I
also wonder, while looking at that, why you don't just remove the last
call within src/backend/catalog/heap.c.  This way, nobody is tempted
to use RelationOpenSmgr() anymore, and it could just be removed from
rel.h.
--
Michael

Attachment
On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > We forgot this patch earlier in the commitfest.  Do people think we
> > should still get it in on this cycle?  I'm +1 on that, since it's a
> > safety feature poised to prevent more bugs than it's likely to
> > introduce.
>
> No objections from here to do that now even after feature freeze.  I
> also wonder, while looking at that, why you don't just remove the last
> call within src/backend/catalog/heap.c.  This way, nobody is tempted
> to use RelationOpenSmgr() anymore, and it could just be removed from
> rel.h.

Agree, did the same in the attached version, thanks.

Regards,
Amul

P.S. commitfest entry https://commitfest.postgresql.org/33/3084/

Attachment

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Kyotaro Horiguchi
Date:
At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul <sulamul@gmail.com> wrote in 
> On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > > We forgot this patch earlier in the commitfest.  Do people think we
> > > should still get it in on this cycle?  I'm +1 on that, since it's a
> > > safety feature poised to prevent more bugs than it's likely to
> > > introduce.
> >
> > No objections from here to do that now even after feature freeze.  I
> > also wonder, while looking at that, why you don't just remove the last
> > call within src/backend/catalog/heap.c.  This way, nobody is tempted
> > to use RelationOpenSmgr() anymore, and it could just be removed from
> > rel.h.
> 
> Agree, did the same in the attached version, thanks.

+    smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
               (char *) metapage, true);
-    log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+    log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,

At the log_newpage call, index is guaranteed to have rd_smgr, so I would
prefer to leave that line alone.  I don't mind the other successive calls,
if any; what I don't like is the notation there.

> P.S. commitfest entry https://commitfest.postgresql.org/33/3084/

Isn't this a kind of open item?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul <sulamul@gmail.com> wrote in
> > On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > > > We forgot this patch earlier in the commitfest.  Do people think we
> > > > should still get it in on this cycle?  I'm +1 on that, since it's a
> > > > safety feature poised to prevent more bugs than it's likely to
> > > > introduce.
> > >
> > > No objections from here to do that now even after feature freeze.  I
> > > also wonder, while looking at that, why you don't just remove the last
> > > call within src/backend/catalog/heap.c.  This way, nobody is tempted
> > > to use RelationOpenSmgr() anymore, and it could just be removed from
> > > rel.h.
> >
> > Agree, did the same in the attached version, thanks.
>
> +       smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
>                           (char *) metapage, true);
> -       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> +       log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
>
> At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> to leave the line alone..  I don't mind other sccessive calls if any
> since what I don't like is the notation there.
>

Perhaps it isn't that bad. It is good to follow the practice of using
RelationGetSmgr() for rd_smgr access, IMHO.

> > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
>
> Isn't this a kind of open item?
>

Sorry, I didn't get you. Do I need to move this to some other bucket?

Regards,
Amul



Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Justin Pryzby
Date:
On Mon, Apr 19, 2021 at 04:27:25PM +0530, Amul Sul wrote:
> On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul <sulamul@gmail.com> wrote in
> > > On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:
> > > >
> > > > On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > > > > We forgot this patch earlier in the commitfest.  Do people think we
> > > > > should still get it in on this cycle?  I'm +1 on that, since it's a
> > > > > safety feature poised to prevent more bugs than it's likely to
> > > > > introduce.
> > > >
> > > > No objections from here to do that now even after feature freeze.  I
> > > > also wonder, while looking at that, why you don't just remove the last
> > > > call within src/backend/catalog/heap.c.  This way, nobody is tempted
> > > > to use RelationOpenSmgr() anymore, and it could just be removed from
> > > > rel.h.
> > >
> > > Agree, did the same in the attached version, thanks.
> >
> > +       smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> >                           (char *) metapage, true);
> > -       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > +       log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
> >
> > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > to leave the line alone..  I don't mind other sccessive calls if any
> > since what I don't like is the notation there.
> >
> 
> Perhaps, isn't that bad. It is good to follow the practice of using
> RelationGetSmgr() for rd_smgr access, IMHO.
> 
> > > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
> >
> > Isn't this a kind of open item?
> >
> 
> Sorry, I didn't get you. Do I need to move this to some other bucket?

It's not a new feature, and it shouldn't wait for July's CF since it's
targeting v14.

The original crash was fixed by Tom by commit 9d523119f.

So it's not exactly an "open item" for v14, but there's probably no better
place for it, so you could add it if you think it's at risk of being forgotten
(again).

https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items

-- 
Justin



Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Michael Paquier
Date:
On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
> Isn't this a kind of open item?

This does not qualify as an open item because it is not an actual bug
IMO, neither is it a defect of the existing code, so it seems
appropriate to me to not list it.
--
Michael

Attachment
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
>> Isn't this a kind of open item?

> This does not qualify as an open item because it is not an actual bug
> IMO, neither is it a defect of the existing code, so it seems
> appropriate to me to not list it.

Agreed, but by the same token, rushing it into v14 doesn't have any
clear benefit.  I'd be inclined to leave it for v15 at this point,
especially since we don't seem to have 100% consensus on the details.

            regards, tom lane



Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Kyotaro Horiguchi
Date:
At Mon, 19 Apr 2021 09:19:36 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Michael Paquier <michael@paquier.xyz> writes:
> > On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
> >> Isn't this a kind of open item?
> 
> > This does not qualify as an open item because it is not an actual bug
> > IMO, neither is it a defect of the existing code, so it seems
> > appropriate to me to not list it.
> 
> Agreed, but by the same token, rushing it into v14 doesn't have any
> clear benefit.  I'd be inclined to leave it for v15 at this point,
> especially since we don't seem to have 100% consensus on the details.

Thanks. Seems reasonable.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Kyotaro Horiguchi
Date:
At Mon, 19 Apr 2021 16:27:25 +0530, Amul Sul <sulamul@gmail.com> wrote in 
> On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > +       smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> >                           (char *) metapage, true);
> > -       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > +       log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
> >
> > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > to leave the line alone..  I don't mind other sccessive calls if any
> > since what I don't like is the notation there.
> >
> 
> Perhaps, isn't that bad. It is good to follow the practice of using
> RelationGetSmgr() for rd_smgr access, IMHO.

I don't mind RelationGetSmgr(index)->smgr_rnode alone, or
&variable->member alone where there's no previous call to
RelationGetSmgr just above. How about using a temporary variable?

  SMgrRelation srel = RelationGetSmgr(index);
  smgrwrite(srel, ...);
  log_newpage(srel->..);


> > > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
> >
> > Isn't this a kind of open item?
> >
> 
> Sorry, I didn't get you. Do I need to move this to some other bucket?

As discussed in the other subthread, I agree that it should be registered
in the next CF, not as an open item for this cycle.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 19 Apr 2021 16:27:25 +0530, Amul Sul <sulamul@gmail.com> wrote in
> > On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > +       smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> > >                           (char *) metapage, true);
> > > -       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > > +       log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
> > >
> > > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > > to leave the line alone..  I don't mind other sccessive calls if any
> > > since what I don't like is the notation there.
> > >
> >
> > Perhaps, isn't that bad. It is good to follow the practice of using
> > RelationGetSmgr() for rd_smgr access, IMHO.
>
> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
> &variable->member alone and there's not the previous call to
> RelationGetSmgr just above. How about using a temporary variable?
>
>   SMgrRelation srel = RelationGetSmgr(index);
>   smgrwrite(srel, ...);
>   log_newpage(srel->..);
>

Understood.  I used a temporary variable in the places where
RelationGetSmgr() calls were too close together or inside a loop.

Please have a look at the attached version, thanks for the review.

Regards,
Amul

Attachment

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Anastasia Lubennikova
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I looked through the patch. Looks good to me. 

CFbot tests are passing and, as I gather from the thread, nobody opposes this refactoring, so I am moving it to RFC status.

The new status of this patch is: Ready for Committer

Amul Sul <sulamul@gmail.com> writes:
> On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
>> &variable->member alone and there's not the previous call to
>> RelationGetSmgr just above. How about using a temporary variable?
>> 
>> SMgrRelation srel = RelationGetSmgr(index);
>> smgrwrite(srel, ...);
>> log_newpage(srel->..);

> Understood.  Used a temporary variable for the place where
> RelationGetSmgr() calls are placed too close or in a loop.

[ squint... ]  Doesn't this risk introducing exactly the sort of
cache-clobber hazard we're trying to prevent?  That is, the above is
not safe unless you are *entirely* certain that there is not and never
will be any possibility of a relcache flush before you are done using
the temporary variable.  Otherwise it can become a dangling pointer.

The point of the static-inline function idea was to be cheap enough
that it isn't worth worrying about this sort of risky optimization.
Given that an smgr function is sure to involve some kernel calls,
I doubt it's worth sweating over an extra test-and-branch beforehand.
So where I was hoping to get to is that smgr objects are *only*
referenced by RelationGetSmgr() calls and nobody ever keeps any
other pointers to them across any non-smgr operations.

            regards, tom lane



On Tue, Jul 6, 2021 at 11:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amul Sul <sulamul@gmail.com> writes:
> > On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> >> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
> >> &variable->member alone and there's not the previous call to
> >> RelationGetSmgr just above. How about using a temporary variable?
> >>
> >> SMgrRelation srel = RelationGetSmgr(index);
> >> smgrwrite(srel, ...);
> >> log_newpage(srel->..);
>
> > Understood.  Used a temporary variable for the place where
> > RelationGetSmgr() calls are placed too close or in a loop.
>
> [ squint... ]  Doesn't this risk introducing exactly the sort of
> cache-clobber hazard we're trying to prevent?  That is, the above is
> not safe unless you are *entirely* certain that there is not and never
> will be any possibility of a relcache flush before you are done using
> the temporary variable.  Otherwise it can become a dangling pointer.
>

Yeah, there will be a hazard: even if we are sure it is right today, we
cannot guarantee against future changes in any subroutine that could get
called in between.

> The point of the static-inline function idea was to be cheap enough
> that it isn't worth worrying about this sort of risky optimization.
> Given that an smgr function is sure to involve some kernel calls,
> I doubt it's worth sweating over an extra test-and-branch beforehand.
> So where I was hoping to get to is that smgr objects are *only*
> referenced by RelationGetSmgr() calls and nobody ever keeps any
> other pointers to them across any non-smgr operations.
>

Ok, I will revert the changes added in the previous version, thanks.

Regards,
Amul



On Wed, Jul 7, 2021 at 9:44 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, Jul 6, 2021 at 11:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Amul Sul <sulamul@gmail.com> writes:
> > > On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
> > > <horikyota.ntt@gmail.com> wrote:
> > >> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
> > >> &variable->member alone and there's not the previous call to
> > >> RelationGetSmgr just above. How about using a temporary variable?
> > >>
> > >> SMgrRelation srel = RelationGetSmgr(index);
> > >> smgrwrite(srel, ...);
> > >> log_newpage(srel->..);
> >
> > > Understood.  Used a temporary variable for the place where
> > > RelationGetSmgr() calls are placed too close or in a loop.
> >
> > [ squint... ]  Doesn't this risk introducing exactly the sort of
> > cache-clobber hazard we're trying to prevent?  That is, the above is
> > not safe unless you are *entirely* certain that there is not and never
> > will be any possibility of a relcache flush before you are done using
> > the temporary variable.  Otherwise it can become a dangling pointer.
> >
>
> Yeah, there will a hazard, even if we sure right but cannot guarantee future
> changes in any subroutine that could get call in between.
>
> > The point of the static-inline function idea was to be cheap enough
> > that it isn't worth worrying about this sort of risky optimization.
> > Given that an smgr function is sure to involve some kernel calls,
> > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > So where I was hoping to get to is that smgr objects are *only*
> > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > other pointers to them across any non-smgr operations.
> >
>
> Ok, will revert changes added in  the previous version, thanks.
>

The version attached herewith does the same, thanks.

Regards,
Amul

Attachment

Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb

From
Alvaro Herrera
Date:
On 2021-Jul-09, Amul Sul wrote:

> > On Tue, Jul 6, 2021 at 11:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

> > > The point of the static-inline function idea was to be cheap enough
> > > that it isn't worth worrying about this sort of risky optimization.
> > > Given that an smgr function is sure to involve some kernel calls,
> > > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > > So where I was hoping to get to is that smgr objects are *only*
> > > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > > other pointers to them across any non-smgr operations.

> Herewith attached version did the same, thanks.

I think it would be valuable to have a comment in that function to point
out what is the function there for.

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."                                      (E. Dijkstra)



On Fri, Jul 9, 2021 at 7:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Jul-09, Amul Sul wrote:
>
> > > On Tue, Jul 6, 2021 at 11:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > > > The point of the static-inline function idea was to be cheap enough
> > > > that it isn't worth worrying about this sort of risky optimization.
> > > > Given that an smgr function is sure to involve some kernel calls,
> > > > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > > > So where I was hoping to get to is that smgr objects are *only*
> > > > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > > > other pointers to them across any non-smgr operations.
>
> > Herewith attached version did the same, thanks.
>
> I think it would be valuable to have a comment in that function to point
> out what is the function there for.

Thanks for the suggestion, added the same in the attached version.

Regards,
Amul

Attachment
Amul Sul <sulamul@gmail.com> writes:
> [ v5_Add-RelationGetSmgr-inline-function.patch ]

Pushed with minor cosmetic adjustments.

RelationCopyStorage() kind of gives me the willies.
It's not really an smgr-level function, but we call it
everywhere with smgr pointers that belong to relcache entries:

     /* copy main fork */
-    RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
+    RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
                         rel->rd_rel->relpersistence);

So that would fail hard if a relcache flush could occur inside
that function.  It seems impossible today, so I settled for
just annotating the function to that effect.  But it won't
surprise me a bit if somebody breaks it in future due to not
having read/understood the comment.

            regards, tom lane



Thanks a lot Tom.

On Tue, Jul 13, 2021 at 2:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amul Sul <sulamul@gmail.com> writes:
> > [ v5_Add-RelationGetSmgr-inline-function.patch ]
>
> Pushed with minor cosmetic adjustments.
>
> RelationCopyStorage() kind of gives me the willies.
> It's not really an smgr-level function, but we call it
> everywhere with smgr pointers that belong to relcache entries:
>
>         /* copy main fork */
> -       RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
> +       RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
>                                                 rel->rd_rel->relpersistence);
>
> So that would fail hard if a relcache flush could occur inside
> that function.  It seems impossible today, so I settled for
> just annotating the function to that effect.  But it won't
> surprise me a bit if somebody breaks it in future due to not
> having read/understood the comment.
>
>                         regards, tom lane