Thread: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
Hello,
While executing the test case below, the server crashed with segfault 11 on the master branch.
I have enabled CLOBBER_CACHE_ALWAYS in src/include/pg_config_manual.h.
The issue reproduces only on the master branch.
Test Case:
CREATE TABLE sm_5_323_table (col1 numeric);
CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
CLUSTER sm_5_323_table USING sm_5_323_idx;
\! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb -h localhost -p 5432 -d postgres
Test case output:
edb@edb:~/PGClobber_build/postgresql/inst/bin$ ./psql postgres
psql (14devel)
Type "help" for help.
postgres=# CREATE TABLE sm_5_323_table (col1 numeric);
CREATE TABLE
postgres=# CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
CREATE INDEX
postgres=# CLUSTER sm_5_323_table USING sm_5_323_idx;
CLUSTER
postgres=# \! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb -h localhost -p 5432 -d postgres
clusterdb: error: clustering of table "sm_5_323_table" in database "postgres" failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
Stack Trace:
Core was generated by `postgres: edb postgres 127.0.0.1(50978) CLUSTER '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM, behavior=1) at md.c:485
485 if (reln->md_num_open_segs[forknum] > 0)
(gdb) bt
#0 0x000055e5c85ea0b4 in mdopenfork (reln=0x0, forknum=MAIN_FORKNUM, behavior=1) at md.c:485
#1 0x000055e5c85eb2f0 in mdnblocks (reln=0x0, forknum=MAIN_FORKNUM) at md.c:768
#2 0x000055e5c85eb61b in mdimmedsync (reln=0x0, forknum=forknum@entry=MAIN_FORKNUM) at md.c:930
#3 0x000055e5c85ec6e5 in smgrimmedsync (reln=<optimized out>, forknum=forknum@entry=MAIN_FORKNUM) at smgr.c:662
#4 0x000055e5c81ae28b in end_heap_rewrite (state=state@entry=0x55e5ca5d1d70) at rewriteheap.c:342
#5 0x000055e5c81a32ea in heapam_relation_copy_for_cluster (OldHeap=0x7f212ce41ba0, NewHeap=0x7f212ce41058, OldIndex=<optimized out>, use_sort=<optimized out>, OldestXmin=<optimized out>,
xid_cutoff=<optimized out>, multi_cutoff=0x7ffcba6ebe64, num_tuples=0x7ffcba6ebe68, tups_vacuumed=0x7ffcba6ebe70, tups_recently_dead=0x7ffcba6ebe78) at heapam_handler.c:984
#6 0x000055e5c82f218a in table_relation_copy_for_cluster (tups_recently_dead=0x7ffcba6ebe78, tups_vacuumed=0x7ffcba6ebe70, num_tuples=0x7ffcba6ebe68, multi_cutoff=0x7ffcba6ebe64,
xid_cutoff=0x7ffcba6ebe60, OldestXmin=<optimized out>, use_sort=<optimized out>, OldIndex=0x7f212ce40670, NewTable=0x7f212ce41058, OldTable=0x7f212ce41ba0)
at ../../../src/include/access/tableam.h:1656
#7 copy_table_data (pCutoffMulti=<synthetic pointer>, pFreezeXid=<synthetic pointer>, pSwapToastByContent=<synthetic pointer>, verbose=<optimized out>, OIDOldIndex=<optimized out>,
OIDOldHeap=16384, OIDNewHeap=<optimized out>) at cluster.c:908
#8 rebuild_relation (verbose=<optimized out>, indexOid=<optimized out>, OldHeap=<optimized out>) at cluster.c:604
#9 cluster_rel (tableOid=<optimized out>, indexOid=<optimized out>, params=<optimized out>) at cluster.c:427
#10 0x000055e5c82f2b7f in cluster (pstate=pstate@entry=0x55e5ca5315c0, stmt=stmt@entry=0x55e5ca510368, isTopLevel=isTopLevel@entry=true) at cluster.c:195
#11 0x000055e5c85fcbc6 in standard_ProcessUtility (pstmt=0x55e5ca510430, queryString=0x55e5ca50f850 "CLUSTER public.sm_5_323_table;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x55e5ca510710, qc=0x7ffcba6ec340) at utility.c:822
#12 0x000055e5c85fd436 in ProcessUtility (pstmt=pstmt@entry=0x55e5ca510430, queryString=<optimized out>, context=context@entry=PROCESS_UTILITY_TOPLEVEL, params=<optimized out>,
queryEnv=<optimized out>, dest=dest@entry=0x55e5ca510710, qc=0x7ffcba6ec340) at utility.c:525
#13 0x000055e5c85f6148 in PortalRunUtility (portal=portal@entry=0x55e5ca570d70, pstmt=pstmt@entry=0x55e5ca510430, isTopLevel=isTopLevel@entry=true,
setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x55e5ca510710, qc=qc@entry=0x7ffcba6ec340) at pquery.c:1159
#14 0x000055e5c85f71a4 in PortalRunMulti (portal=portal@entry=0x55e5ca570d70, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false,
dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710, qc=qc@entry=0x7ffcba6ec340) at pquery.c:1305
#15 0x000055e5c85f8823 in PortalRun (portal=portal@entry=0x55e5ca570d70, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x55e5ca510710, altdest=altdest@entry=0x55e5ca510710, qc=0x7ffcba6ec340) at pquery.c:779
#16 0x000055e5c85f389e in exec_simple_query (query_string=0x55e5ca50f850 "CLUSTER public.sm_5_323_table;") at postgres.c:1185
#17 0x000055e5c85f51cf in PostgresMain (argc=argc@entry=1, argv=argv@entry=0x7ffcba6ec670, dbname=<optimized out>, username=<optimized out>) at postgres.c:4415
#18 0x000055e5c8522240 in BackendRun (port=<optimized out>, port=<optimized out>) at postmaster.c:4470
#19 BackendStartup (port=<optimized out>) at postmaster.c:4192
#20 ServerLoop () at postmaster.c:1737
#21 0x000055e5c85237ec in PostmasterMain (argc=<optimized out>, argv=0x55e5ca508fe0) at postmaster.c:1409
#22 0x000055e5c811a2cf in main (argc=3, argv=0x55e5ca508fe0) at main.c:209
Thanks.

--
Regards,
Neha Sharma
In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
called on the next line, which triggers system cache invalidation; under
the CCA setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not
restored for the subsequent operations, causing the segmentation fault.

Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
fixes the failure. Did the same in the attached patch.

Regards,
Amul

On Mon, Mar 22, 2021 at 11:53 AM Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
>
> Hello,
>
> While executing the test case below, the server crashed with segfault 11 on the master branch.
> I have enabled CLOBBER_CACHE_ALWAYS in src/include/pg_config_manual.h.
>
> The issue reproduces only on the master branch.
>
> Test Case:
> CREATE TABLE sm_5_323_table (col1 numeric);
> CREATE INDEX sm_5_323_idx ON sm_5_323_table(col1);
> CLUSTER sm_5_323_table USING sm_5_323_idx;
> \! /PGClobber_build/postgresql/inst/bin/clusterdb -t sm_5_323_table -U edb -h localhost -p 5432 -d postgres
>
> clusterdb: error: clustering of table "sm_5_323_table" in database "postgres" failed: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.
>
> [...]
>
> Thanks.
> --
> Regards,
> Neha Sharma
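For reference, a minimal sketch of the change the patch describes, in
end_heap_rewrite() (rewriteheap.c); the surrounding code is abbreviated,
and only the RelationOpenSmgr() call is new:

	if (RelationNeedsWAL(state->rs_new_rel))
	{
		/*
		 * A relcache flush since begin_heap_rewrite() (forced on every
		 * cache lookup under CLOBBER_CACHE_ALWAYS) may have closed
		 * rd_smgr; re-open it before dereferencing.
		 */
		RelationOpenSmgr(state->rs_new_rel);
		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
	}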
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Amit Langote
Date:
On Mon, Mar 22, 2021 at 5:26 PM Amul Sul <sulamul@gmail.com> wrote:
> In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
> called on the next line, which triggers system cache invalidation; under
> the CCA setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not
> restored for the subsequent operations, causing the segmentation fault.
>
> Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
> fixes the failure. Did the same in the attached patch.

That makes sense. I see a few commits in the git history adding
RelationOpenSmgr() before a smgr* operation, whenever such a problem
would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
for example.

I do wonder if there are still other smgr* operations in the source
code that are preceded by operations that would invalidate the
SMgrRelation that those smgr* operations would be called with. For
example, the smgrnblocks() in gistBuildCallback() may happen too late
relative to the corresponding RelationOpenSmgr() on the index relation.

--
Amit Langote
EDB: http://www.enterprisedb.com
On Mon, Mar 22, 2021 at 3:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 5:26 PM Amul Sul <sulamul@gmail.com> wrote:
> > In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> > rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
> > called on the next line, which triggers system cache invalidation; under
> > the CCA setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not
> > restored for the subsequent operations, causing the segmentation fault.
> >
> > Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
> > fixes the failure. Did the same in the attached patch.
>
> That makes sense. I see a few commits in the git history adding
> RelationOpenSmgr() before a smgr* operation, whenever such a problem
> would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
> for example.
>
Thanks for the confirmation.

> I do wonder if there are still other smgr* operations in the source
> code that are preceded by operations that would invalidate the
> SMgrRelation that those smgr* operations would be called with. For
> example, the smgrnblocks() in gistBuildCallback() may happen too late
> relative to the corresponding RelationOpenSmgr() on the index relation.
>
I checked gistBuildCallback() by adding Assert(index->rd_smgr) before
smgrnblocks() in a CCA build and didn't see any problem there.

I think the easiest way to find such places is to run the regression suite
with a CCA build; there is no guarantee that the regression tests hit all
smgr* operations, but they should hit most of them.

Regards,
Amul
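For illustration, a sketch of the ad-hoc check described above, placed
just before the existing smgrnblocks() call in gistBuildCallback()
(gistbuild.c); nblocks is an illustrative local variable, and the
surrounding buffering-mode logic is omitted:

	/*
	 * Hypothetical CCA sanity check: assert that no relcache flush has
	 * closed the smgr handle since it was last opened.
	 */
	Assert(index->rd_smgr != NULL);
	nblocks = smgrnblocks(index->rd_smgr, MAIN_FORKNUM);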
On Tue, Mar 23, 2021 at 10:08 AM Amul Sul <sulamul@gmail.com> wrote:
> On Mon, Mar 22, 2021 at 3:03 PM Amit Langote <amitlangote09@gmail.com> wrote:
> >
> > On Mon, Mar 22, 2021 at 5:26 PM Amul Sul <sulamul@gmail.com> wrote:
> > > In heapam_relation_copy_for_cluster(), begin_heap_rewrite() sets
> > > rwstate->rs_new_rel->rd_smgr correctly, but tuplesort_begin_cluster() is
> > > called on the next line, which triggers system cache invalidation; under
> > > the CCA setting this wipes out rwstate->rs_new_rel->rd_smgr, which is not
> > > restored for the subsequent operations, causing the segmentation fault.
> > >
> > > Calling RelationOpenSmgr() before smgrimmedsync() in end_heap_rewrite()
> > > fixes the failure. Did the same in the attached patch.
> >
> > That makes sense. I see a few commits in the git history adding
> > RelationOpenSmgr() before a smgr* operation, whenever such a problem
> > would have been discovered: 4942ee656ac, afa8f1971ae, bf347c60bdd7,
> > for example.
> >
> Thanks for the confirmation.
>
> > I do wonder if there are still other smgr* operations in the source
> > code that are preceded by operations that would invalidate the
> > SMgrRelation that those smgr* operations would be called with. For
> > example, the smgrnblocks() in gistBuildCallback() may happen too late
> > relative to the corresponding RelationOpenSmgr() on the index relation.
> >
> I checked gistBuildCallback() by adding Assert(index->rd_smgr) before
> smgrnblocks() in a CCA build and didn't see any problem there.
>
> I think the easiest way to find such places is to run the regression suite
> with a CCA build; there is no guarantee that the regression tests hit all
> smgr* operations, but they should hit most of them.

Sure, will give a regression run with CCA enabled.

> Regards,
> Amul

Regards,
Neha Sharma
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Michael Paquier
Date:
On Tue, Mar 23, 2021 at 10:52:09AM +0530, Neha Sharma wrote:
> Sure, will give a regression run with CCA enabled.

I can confirm the regression between 13 and HEAD, so I have added an
open item. It would be good to figure out the root issue here, and I
am ready to bet that the problem is deeper than it looks and that more
code paths could be involved.

It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
but the test is quick enough to reproduce. It would be good to bisect
the origin point here as a first step.
--
Michael
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Michael Paquier
Date:
On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:
> It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
> but the test is quick enough to reproduce. It would be good to bisect
> the origin point here as a first step.

One bisect later, the winner is:
commit: 3d351d916b20534f973eda760cde17d96545d4c4
author: Tom Lane <tgl@sss.pgh.pa.us>
date: Sun, 30 Aug 2020 12:21:51 -0400
Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

I am too tired to poke at that today, so I'll try tomorrow. Tom may
beat me at that though.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Tue, Mar 23, 2021 at 04:12:01PM +0900, Michael Paquier wrote:
>> It takes some time to initialize a cluster under CLOBBER_CACHE_ALWAYS,
>> but the test is quick enough to reproduce. It would be good to bisect
>> the origin point here as a first step.

> One bisect later, the winner is:
> commit: 3d351d916b20534f973eda760cde17d96545d4c4
> author: Tom Lane <tgl@sss.pgh.pa.us>
> date: Sun, 30 Aug 2020 12:21:51 -0400
> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

> I am too tired to poke at that today, so I'll try tomorrow. Tom may
> beat me at that though.

I think that's an artifact. That commit didn't touch anything related to
relation opening or closing. What it could have done, though, is change
CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
thus causing us to follow the buggy code path where before we didn't.

The interesting question here seems to be "why didn't the existing
CLOBBER_CACHE_ALWAYS buildfarm testing catch this?". It looks to me like
the answer is that it only happens for an empty table (or at least one
where the data pattern is such that we skip the RelationOpenSmgr call
earlier in end_heap_rewrite) and we don't happen to be exercising that
exact scenario in the regression tests.

			regards, tom lane
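To make the empty-table condition concrete, here is a simplified sketch
of the tail of end_heap_rewrite() as it stood in the buggy code, abridged
and with comments added:

	/* Write the last page, if any */
	if (state->rs_buffer_valid)
	{
		...
		RelationOpenSmgr(state->rs_new_rel);	/* re-validates rd_smgr */
		smgrextend(state->rs_new_rel->rd_smgr, MAIN_FORKNUM,
				   state->rs_blockno, (char *) state->rs_buffer, true);
	}

	/*
	 * For an empty table, rs_buffer_valid is false and the branch above
	 * is skipped, so under CLOBBER_CACHE_ALWAYS rd_smgr can still be
	 * NULL when we get here.
	 */
	if (RelationNeedsWAL(state->rs_new_rel))
		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);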
I wrote:
> Michael Paquier <michael@paquier.xyz> writes:
>> One bisect later, the winner is:
>> commit: 3d351d916b20534f973eda760cde17d96545d4c4
>> author: Tom Lane <tgl@sss.pgh.pa.us>
>> date: Sun, 30 Aug 2020 12:21:51 -0400
>> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.

> I think that's an artifact. That commit didn't touch anything related to
> relation opening or closing. What it could have done, though, is change
> CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
> thus causing us to follow the buggy code path where before we didn't.

On closer inspection, I believe the true culprit is c6b92041d,
which did this:

	 */
	if (RelationNeedsWAL(state->rs_new_rel))
-		heap_sync(state->rs_new_rel);
+		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);

	logical_end_heap_rewrite(state);

heap_sync was careful about opening rd_smgr, the new code not so much.

I read the rest of that commit and didn't see any other equivalent
bugs, but I might've missed something.

			regards, tom lane
On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
> > Michael Paquier <michael@paquier.xyz> writes:
> >> One bisect later, the winner is:
> >> commit: 3d351d916b20534f973eda760cde17d96545d4c4
> >> author: Tom Lane <tgl@sss.pgh.pa.us>
> >> date: Sun, 30 Aug 2020 12:21:51 -0400
> >> Redefine pg_class.reltuples to be -1 before the first VACUUM or ANALYZE.
>
> > I think that's an artifact. That commit didn't touch anything related to
> > relation opening or closing. What it could have done, though, is change
> > CLUSTER's behavior on this empty table from use-an-index to use-a-seqscan,
> > thus causing us to follow the buggy code path where before we didn't.
>
> On closer inspection, I believe the true culprit is c6b92041d,
> which did this:
>
> 	 */
> 	if (RelationNeedsWAL(state->rs_new_rel))
> -		heap_sync(state->rs_new_rel);
> +		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
>
> 	logical_end_heap_rewrite(state);
>
> heap_sync was careful about opening rd_smgr, the new code not so much.
>
> I read the rest of that commit and didn't see any other equivalent
> bugs, but I might've missed something.

I too didn't find any other place in this commit where heap_sync() or an
equivalent was replaced such that an smgr* operation is now reached without
the necessary precautions. heap_sync() called RelationOpenSmgr() through
FlushRelationBuffers() before it reached smgrimmedsync(). So we also need
to make sure of the RelationOpenSmgr() call before smgrimmedsync() as
proposed previously.

Regards,
Amul
Amul Sul <sulamul@gmail.com> writes:
> On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> On closer inspection, I believe the true culprit is c6b92041d,
>> which did this:
>> -		heap_sync(state->rs_new_rel);
>> +		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
>> heap_sync was careful about opening rd_smgr, the new code not so much.

> So we also need to make sure of the
> RelationOpenSmgr() call before smgrimmedsync() as proposed previously.

I wonder if we should try to get rid of this sort of bug by banning
direct references to rd_smgr? That is, write the above and all
similar code like

	smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);

where we provide something like

static inline struct SMgrRelationData *
RelationGetSmgr(Relation rel)
{
	if (unlikely(rel->rd_smgr == NULL))
		RelationOpenSmgr(rel);
	return rel->rd_smgr;
}

and then we could get rid of most or all other RelationOpenSmgr calls.

This might create more code bloat than it's really worth, but
it'd be a simple and mechanically-checkable scheme.

			regards, tom lane
On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amul Sul <sulamul@gmail.com> writes:
> > On Tue, Mar 23, 2021 at 8:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> On closer inspection, I believe the true culprit is c6b92041d,
> >> which did this:
> >> -		heap_sync(state->rs_new_rel);
> >> +		smgrimmedsync(state->rs_new_rel->rd_smgr, MAIN_FORKNUM);
> >> heap_sync was careful about opening rd_smgr, the new code not so much.
>
> > So we also need to make sure of the
> > RelationOpenSmgr() call before smgrimmedsync() as proposed previously.
>
> I wonder if we should try to get rid of this sort of bug by banning
> direct references to rd_smgr? That is, write the above and all
> similar code like
>
> 	smgrimmedsync(RelationGetSmgr(state->rs_new_rel), MAIN_FORKNUM);
>
> where we provide something like
>
> static inline struct SMgrRelationData *
> RelationGetSmgr(Relation rel)
> {
> 	if (unlikely(rel->rd_smgr == NULL))
> 		RelationOpenSmgr(rel);
> 	return rel->rd_smgr;
> }
>
> and then we could get rid of most or all other RelationOpenSmgr calls.

+1

> This might create more code bloat than it's really worth, but
> it'd be a simple and mechanically-checkable scheme.

I think that will be fine, one-time pain. If you want I will do those
changes.

A quick question: Can't it be a macro instead of an inline function
like other macros we have in rel.h?

Regards,
Amul
Amul Sul <sulamul@gmail.com> writes:
> On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> static inline struct SMgrRelationData *
>> RelationGetSmgr(Relation rel)
>> {
>> 	if (unlikely(rel->rd_smgr == NULL))
>> 		RelationOpenSmgr(rel);
>> 	return rel->rd_smgr;
>> }

> A quick question: Can't it be a macro instead of an inline function
> like other macros we have in rel.h?

The multiple-evaluation hazard seems like an issue. We've tolerated
such hazards in the past, but mostly just because we weren't relying
on static inlines being available, so there wasn't a good way around
it.

Also, the conditional evaluation here would look rather ugly
in a macro, I think, if indeed you could do it at all without
provoking compiler warnings.

			regards, tom lane
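For illustration, a macro version would have to repeat its argument; the
sketch below shows the multiple-evaluation hazard, where open_rel_smgr()
is a hypothetical expression-form stand-in for RelationOpenSmgr() (which,
being a do-while macro, could not appear inside an expression like this):

	#define RelationGetSmgr(rel) \
		((rel)->rd_smgr != NULL ? (rel)->rd_smgr \
								: (open_rel_smgr(rel), (rel)->rd_smgr))

	/*
	 * (rel) expands three times above, so an argument with side effects,
	 * e.g. rel_array[i++], would be evaluated repeatedly; the static
	 * inline function evaluates its argument exactly once.
	 */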
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Kyotaro Horiguchi
Date:
Sorry for the bug.

At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> Amul Sul <sulamul@gmail.com> writes:
> > On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> static inline struct SMgrRelationData *
> >> RelationGetSmgr(Relation rel)
> >> {
> >> 	if (unlikely(rel->rd_smgr == NULL))
> >> 		RelationOpenSmgr(rel);
> >> 	return rel->rd_smgr;
> >> }
>
> > A quick question: Can't it be a macro instead of an inline function
> > like other macros we have in rel.h?
>
> The multiple-evaluation hazard seems like an issue. We've tolerated
> such hazards in the past, but mostly just because we weren't relying
> on static inlines being available, so there wasn't a good way around
> it.
>
> Also, the conditional evaluation here would look rather ugly
> in a macro, I think, if indeed you could do it at all without
> provoking compiler warnings.

FWIW, +1 for the function as is.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Mar 25, 2021 at 12:10 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> Sorry for the bug.
>
> At Thu, 25 Mar 2021 01:50:29 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> > Amul Sul <sulamul@gmail.com> writes:
> > > On Wed, Mar 24, 2021 at 8:09 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >> static inline struct SMgrRelationData *
> > >> RelationGetSmgr(Relation rel)
> > >> {
> > >> 	if (unlikely(rel->rd_smgr == NULL))
> > >> 		RelationOpenSmgr(rel);
> > >> 	return rel->rd_smgr;
> > >> }
> >
> > > A quick question: Can't it be a macro instead of an inline function
> > > like other macros we have in rel.h?
> >
> > The multiple-evaluation hazard seems like an issue. We've tolerated
> > such hazards in the past, but mostly just because we weren't relying
> > on static inlines being available, so there wasn't a good way around
> > it.
> >
> > Also, the conditional evaluation here would look rather ugly
> > in a macro, I think, if indeed you could do it at all without
> > provoking compiler warnings.
>
> FWIW, +1 for the function as is.

Ok, in the attached patch, I have added the inline function to rel.h, and for
that, I ended up including smgr.h in rel.h. I tried to replace all rel->rd_smgr
references with the RelationGetSmgr() function and removed the nearby
RelationOpenSmgr() calls, which I don't think are needed at all.

Regards,
Amul
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Alvaro Herrera
Date:
On 2021-Mar-25, Amul Sul wrote:

> Ok, in the attached patch, I have added the inline function to rel.h, and for
> that, I ended up including smgr.h in rel.h. I tried to replace all rel->rd_smgr
> references with the RelationGetSmgr() function and removed the nearby
> RelationOpenSmgr() calls, which I don't think are needed at all.

We forgot this patch earlier in the commitfest. Do people think we
should still get it in on this cycle? I'm +1 on that, since it's a
safety feature poised to prevent more bugs than it's likely to
introduce.

--
Álvaro Herrera		Valdivia, Chile
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Michael Paquier
Date:
On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> We forgot this patch earlier in the commitfest. Do people think we
> should still get it in on this cycle? I'm +1 on that, since it's a
> safety feature poised to prevent more bugs than it's likely to
> introduce.

No objections from here to do that now even after feature freeze. I
also wonder, while looking at that, why you don't just remove the last
call within src/backend/catalog/heap.c. This way, nobody is tempted
to use RelationOpenSmgr() anymore, and it could just be removed from
rel.h.
--
Michael
On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > We forgot this patch earlier in the commitfest. Do people think we
> > should still get it in on this cycle? I'm +1 on that, since it's a
> > safety feature poised to prevent more bugs than it's likely to
> > introduce.
>
> No objections from here to do that now even after feature freeze. I
> also wonder, while looking at that, why you don't just remove the last
> call within src/backend/catalog/heap.c. This way, nobody is tempted
> to use RelationOpenSmgr() anymore, and it could just be removed from
> rel.h.

Agree, did the same in the attached version, thanks.

Regards,
Amul

P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Kyotaro Horiguchi
Date:
At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul <sulamul@gmail.com> wrote in
> On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > > We forgot this patch earlier in the commitfest. Do people think we
> > > should still get it in on this cycle? I'm +1 on that, since it's a
> > > safety feature poised to prevent more bugs than it's likely to
> > > introduce.
> >
> > No objections from here to do that now even after feature freeze. I
> > also wonder, while looking at that, why you don't just remove the last
> > call within src/backend/catalog/heap.c. This way, nobody is tempted
> > to use RelationOpenSmgr() anymore, and it could just be removed from
> > rel.h.
>
> Agree, did the same in the attached version, thanks.

+	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
			  (char *) metapage, true);
-	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,

At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
to leave the line alone.. I don't mind other successive calls if any
since what I don't like is the notation there.

> P.S. commitfest entry https://commitfest.postgresql.org/33/3084/

Isn't this a kind of open item?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> +	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> 			  (char *) metapage, true);
> -	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> +	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
>
> At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> to leave the line alone.. I don't mind other successive calls if any
> since what I don't like is the notation there.

Perhaps, but it isn't that bad. It is good to follow the practice of using
RelationGetSmgr() for rd_smgr access, IMHO.

> > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
>
> Isn't this a kind of open item?

Sorry, I didn't get you. Do I need to move this to some other bucket?

Regards,
Amul
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Justin Pryzby
Date:
On Mon, Apr 19, 2021 at 04:27:25PM +0530, Amul Sul wrote:
> On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > At Mon, 19 Apr 2021 12:56:18 +0530, Amul Sul <sulamul@gmail.com> wrote in
> > > On Mon, Apr 19, 2021 at 12:25 PM Michael Paquier <michael@paquier.xyz> wrote:
> > > >
> > > > On Fri, Apr 09, 2021 at 06:45:45PM -0400, Alvaro Herrera wrote:
> > > > > We forgot this patch earlier in the commitfest. Do people think we
> > > > > should still get it in on this cycle? I'm +1 on that, since it's a
> > > > > safety feature poised to prevent more bugs than it's likely to
> > > > > introduce.
> > > >
> > > > No objections from here to do that now even after feature freeze. I
> > > > also wonder, while looking at that, why you don't just remove the last
> > > > call within src/backend/catalog/heap.c. This way, nobody is tempted
> > > > to use RelationOpenSmgr() anymore, and it could just be removed from
> > > > rel.h.
> > >
> > > Agree, did the same in the attached version, thanks.
> >
> > +	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> > 			  (char *) metapage, true);
> > -	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > +	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
> >
> > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > to leave the line alone.. I don't mind other successive calls if any
> > since what I don't like is the notation there.
>
> Perhaps, but it isn't that bad. It is good to follow the practice of using
> RelationGetSmgr() for rd_smgr access, IMHO.
>
> > > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
> >
> > Isn't this a kind of open item?
>
> Sorry, I didn't get you. Do I need to move this to some other bucket?

It's not a new feature, and shouldn't wait for July's CF since it's
targeting v14.

The original crash was fixed by Tom by commit 9d523119f. So it's not
exactly an "open item" for v14, but there's probably no better place for
it, so you could add it if you think it's at risk of being forgotten
(again).
https://wiki.postgresql.org/wiki/PostgreSQL_14_Open_Items

--
Justin
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Michael Paquier
Date:
On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
> Isn't this a kind of open item?

This does not qualify as an open item because it is not an actual bug
IMO, neither is it a defect of the existing code, so it seems
appropriate to me to not list it.
--
Michael
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
>> Isn't this a kind of open item?

> This does not qualify as an open item because it is not an actual bug
> IMO, neither is it a defect of the existing code, so it seems
> appropriate to me to not list it.

Agreed, but by the same token, rushing it into v14 doesn't have any
clear benefit. I'd be inclined to leave it for v15 at this point,
especially since we don't seem to have 100% consensus on the details.

			regards, tom lane
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Kyotaro Horiguchi
Date:
At Mon, 19 Apr 2021 09:19:36 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in
> Michael Paquier <michael@paquier.xyz> writes:
> > On Mon, Apr 19, 2021 at 05:35:52PM +0900, Kyotaro Horiguchi wrote:
> >> Isn't this a kind of open item?
>
> > This does not qualify as an open item because it is not an actual bug
> > IMO, neither is it a defect of the existing code, so it seems
> > appropriate to me to not list it.
>
> Agreed, but by the same token, rushing it into v14 doesn't have any
> clear benefit. I'd be inclined to leave it for v15 at this point,
> especially since we don't seem to have 100% consensus on the details.

Thanks. Seems reasonable.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Kyotaro Horiguchi
Date:
At Mon, 19 Apr 2021 16:27:25 +0530, Amul Sul <sulamul@gmail.com> wrote in
> On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > +	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> > 			  (char *) metapage, true);
> > -	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > +	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
> >
> > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > to leave the line alone.. I don't mind other successive calls if any
> > since what I don't like is the notation there.
>
> Perhaps, but it isn't that bad. It is good to follow the practice of using
> RelationGetSmgr() for rd_smgr access, IMHO.

I don't mind RelationGetSmgr(index)->smgr_rnode alone or
&variable->member alone and there's not the previous call to
RelationGetSmgr just above. How about using a temporary variable?

	SMgrRelation srel = RelationGetSmgr(index);
	smgrwrite(srel, ...);
	log_newpage(srel->..);

> > > P.S. commitfest entry https://commitfest.postgresql.org/33/3084/
> >
> > Isn't this a kind of open item?
>
> Sorry, I didn't get you. Do I need to move this to some other bucket?

As discussed in the other branch, I agree that it is registered to the
next CF, not registered as an open item of this cycle.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Mon, 19 Apr 2021 16:27:25 +0530, Amul Sul <sulamul@gmail.com> wrote in
> > On Mon, Apr 19, 2021 at 2:05 PM Kyotaro Horiguchi
> > <horikyota.ntt@gmail.com> wrote:
> > > +	smgrwrite(RelationGetSmgr(index), INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
> > > 			  (char *) metapage, true);
> > > -	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
> > > +	log_newpage(&(RelationGetSmgr(index))->smgr_rnode.node, INIT_FORKNUM,
> > >
> > > At the log_newpage, index is guaranteed to have rd_smgr. So I prefer
> > > to leave the line alone.. I don't mind other successive calls if any
> > > since what I don't like is the notation there.
> >
> > Perhaps, but it isn't that bad. It is good to follow the practice of using
> > RelationGetSmgr() for rd_smgr access, IMHO.
>
> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
> &variable->member alone and there's not the previous call to
> RelationGetSmgr just above. How about using a temporary variable?
>
> 	SMgrRelation srel = RelationGetSmgr(index);
> 	smgrwrite(srel, ...);
> 	log_newpage(srel->..);

Understood. Used a temporary variable for the places where
RelationGetSmgr() calls are placed too close together or in a loop.

Please have a look at the attached version, thanks for the review.

Regards,
Amul
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Anastasia Lubennikova
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested
Documentation:            not tested

I looked through the patch. Looks good to me. CFbot tests are passing and,
as I gather from the thread, nobody opposes this refactoring, so I'm moving
it to RFC status.

The new status of this patch is: Ready for Committer
Amul Sul <sulamul@gmail.com> writes:
> On Tue, Apr 20, 2021 at 6:59 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
>> I don't mind RelationGetSmgr(index)->smgr_rnode alone or
>> &variable->member alone and there's not the previous call to
>> RelationGetSmgr just above. How about using a temporary variable?
>>
>> 	SMgrRelation srel = RelationGetSmgr(index);
>> 	smgrwrite(srel, ...);
>> 	log_newpage(srel->..);

> Understood. Used a temporary variable for the places where
> RelationGetSmgr() calls are placed too close together or in a loop.

[ squint... ] Doesn't this risk introducing exactly the sort of
cache-clobber hazard we're trying to prevent? That is, the above is
not safe unless you are *entirely* certain that there is not and never
will be any possibility of a relcache flush before you are done using
the temporary variable. Otherwise it can become a dangling pointer.

The point of the static-inline function idea was to be cheap enough
that it isn't worth worrying about this sort of risky optimization.
Given that an smgr function is sure to involve some kernel calls,
I doubt it's worth sweating over an extra test-and-branch beforehand.
So where I was hoping to get to is that smgr objects are *only*
referenced by RelationGetSmgr() calls and nobody ever keeps any
other pointers to them across any non-smgr operations.

			regards, tom lane
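To illustrate with the bloom-index snippet discussed upthread (a sketch
of the hazard, not committed code):

	SMgrRelation srel = RelationGetSmgr(index);

	smgrwrite(srel, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
			  (char *) metapage, true);

	/*
	 * If anything here could trigger a relcache flush, the flush would
	 * close the SMgrRelation and leave srel dangling ...
	 */
	log_newpage(&srel->smgr_rnode.node, INIT_FORKNUM,
				BLOOM_METAPAGE_BLKNO, metapage, true);

	/*
	 * ... whereas writing RelationGetSmgr(index) at each use site would
	 * re-validate the handle every time.
	 */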
On Tue, Jul 6, 2021 at 11:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amul Sul <sulamul@gmail.com> writes:
> > Understood. Used a temporary variable for the places where
> > RelationGetSmgr() calls are placed too close together or in a loop.
>
> [ squint... ] Doesn't this risk introducing exactly the sort of
> cache-clobber hazard we're trying to prevent? That is, the above is
> not safe unless you are *entirely* certain that there is not and never
> will be any possibility of a relcache flush before you are done using
> the temporary variable. Otherwise it can become a dangling pointer.

Yeah, there will be a hazard; even if we are sure it is safe now, we
cannot guarantee that future changes in any subroutine called in between
won't break it.

> The point of the static-inline function idea was to be cheap enough
> that it isn't worth worrying about this sort of risky optimization.
> Given that an smgr function is sure to involve some kernel calls,
> I doubt it's worth sweating over an extra test-and-branch beforehand.
> So where I was hoping to get to is that smgr objects are *only*
> referenced by RelationGetSmgr() calls and nobody ever keeps any
> other pointers to them across any non-smgr operations.

Ok, will revert changes added in the previous version, thanks.

Regards,
Amul
On Wed, Jul 7, 2021 at 9:44 AM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, Jul 6, 2021 at 11:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > [ squint... ] Doesn't this risk introducing exactly the sort of
> > cache-clobber hazard we're trying to prevent? That is, the above is
> > not safe unless you are *entirely* certain that there is not and never
> > will be any possibility of a relcache flush before you are done using
> > the temporary variable. Otherwise it can become a dangling pointer.
>
> Yeah, there will be a hazard; even if we are sure it is safe now, we
> cannot guarantee that future changes in any subroutine called in between
> won't break it.
>
> > The point of the static-inline function idea was to be cheap enough
> > that it isn't worth worrying about this sort of risky optimization.
> > Given that an smgr function is sure to involve some kernel calls,
> > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > So where I was hoping to get to is that smgr objects are *only*
> > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > other pointers to them across any non-smgr operations.
>
> Ok, will revert changes added in the previous version, thanks.

Herewith the attached version, which does the same, thanks.

Regards,
Amul
Re: [CLOBBER_CACHE]Server crashed with segfault 11 while executing clusterdb
From
Alvaro Herrera
Date:
On 2021-Jul-09, Amul Sul wrote:

> > On Tue, Jul 6, 2021 at 11:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > The point of the static-inline function idea was to be cheap enough
> > > that it isn't worth worrying about this sort of risky optimization.
> > > Given that an smgr function is sure to involve some kernel calls,
> > > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > > So where I was hoping to get to is that smgr objects are *only*
> > > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > > other pointers to them across any non-smgr operations.

> Herewith the attached version, which does the same, thanks.

I think it would be valuable to have a comment in that function to point
out what the function is there for.

--
Álvaro Herrera		PostgreSQL Developer — https://www.EnterpriseDB.com/
"Industry suffers from the managerial dogma that for the sake of stability
and continuity, the company should be independent of the competence of
individual employees."				(E. Dijkstra)
On Fri, Jul 9, 2021 at 7:30 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Jul-09, Amul Sul wrote:
> > > On Tue, Jul 6, 2021 at 11:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > > The point of the static-inline function idea was to be cheap enough
> > > > that it isn't worth worrying about this sort of risky optimization.
> > > > Given that an smgr function is sure to involve some kernel calls,
> > > > I doubt it's worth sweating over an extra test-and-branch beforehand.
> > > > So where I was hoping to get to is that smgr objects are *only*
> > > > referenced by RelationGetSmgr() calls and nobody ever keeps any
> > > > other pointers to them across any non-smgr operations.
>
> > Herewith the attached version, which does the same, thanks.
>
> I think it would be valuable to have a comment in that function to point
> out what the function is there for.

Thanks for the suggestion, added the same in the attached version.

Regards,
Amul
Amul Sul <sulamul@gmail.com> writes:
> [ v5_Add-RelationGetSmgr-inline-function.patch ]

Pushed with minor cosmetic adjustments.

RelationCopyStorage() kind of gives me the willies.
It's not really an smgr-level function, but we call it
everywhere with smgr pointers that belong to relcache entries:

	/* copy main fork */
-	RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
+	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
						rel->rd_rel->relpersistence);

So that would fail hard if a relcache flush could occur inside
that function. It seems impossible today, so I settled for
just annotating the function to that effect. But it won't
surprise me a bit if somebody breaks it in future due to not
having read/understood the comment.

			regards, tom lane
Thanks a lot Tom.

On Tue, Jul 13, 2021 at 2:37 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Amul Sul <sulamul@gmail.com> writes:
> > [ v5_Add-RelationGetSmgr-inline-function.patch ]
>
> Pushed with minor cosmetic adjustments.
>
> RelationCopyStorage() kind of gives me the willies.
> It's not really an smgr-level function, but we call it
> everywhere with smgr pointers that belong to relcache entries:
>
> 	/* copy main fork */
> -	RelationCopyStorage(rel->rd_smgr, dstrel, MAIN_FORKNUM,
> +	RelationCopyStorage(RelationGetSmgr(rel), dstrel, MAIN_FORKNUM,
> 						rel->rd_rel->relpersistence);
>
> So that would fail hard if a relcache flush could occur inside
> that function. It seems impossible today, so I settled for
> just annotating the function to that effect. But it won't
> surprise me a bit if somebody breaks it in future due to not
> having read/understood the comment.
>
> 			regards, tom lane