Thread: Segfault on ANALYZE in SERIALIZABLE isolation
Hackers, Head of master is giving me a segfault on running ANALYZE when isolation mode is SERIALIZABLE. My configure: export CFLAGS="-g" export LDFLAGS="-L/usr/local/opt/readline/lib" export CPPFLAGS="-I/usr/local/opt/readline/include" ./configure \ --prefix=/Users/joe/Development/tmp/pg \ --enable-cassert \ --enable-debug \ --with-readline To reproduce: [joe@oberon pg]$ ./bin/initdb -D $(pwd)/data The files belonging to this database system will be owned by user "joe". This user must also own the server process. The database cluster will be initialized with locale "en_GB.UTF-8". The default database encoding has accordingly been set to "UTF8". The default text search configuration will be set to "english". Data page checksums are disabled. creating directory /Users/joe/Development/tmp/pg/data ... ok creating subdirectories ... ok selecting dynamic shared memory implementation ... posix selecting default max_connections ... 100 selecting default shared_buffers ... 128MB selecting default timezone ... Europe/London creating configuration files ... ok running bootstrap script ... ok performing post-bootstrap initialization ... ok syncing data to disk ... ok initdb: warning: enabling "trust" authentication for local connections You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb. Success. You can now start the database server using: ./bin/pg_ctl -D /Users/joe/Development/tmp/pg/data -l logfile start [joe@oberon pg]$ ./bin/pg_ctl -D /Users/joe/Development/tmp/pg/data -l logfile start && ./bin/psql -d postgres waiting for server to start.... done server started psql (12devel) Type "help" for help. [local] joe@postgres=# ALTER SYSTEM SET DEFAULT_TRANSACTION_ISOLATION TO 'serializable'; ALTER SYSTEM [local] joe@postgres=# \q [joe@oberon pg]$ ./bin/pg_ctl -D /Users/joe/Development/tmp/pg/data -l logfile restart && ./bin/psql -d postgres waiting for server to shut down.... done server stopped waiting for server to start.... done server started psql (12devel) Type "help" for help. [local] joe@postgres=# ANALYZE; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. @!> Logline: 2019-05-18 15:10:06.831 BST [45177] LOG: server process (PID 45186) was terminated by signal 11: Segmentation fault: 11 Cheers, -Joe
Hi I can reproduce with: set default_transaction_isolation TO serializable ; analyze ; Here is backtrace: #0 SerializationNeededForRead (snapshot=0x0, relation=0x7f53e9a525f8) at predicate.c:530 #1 PredicateLockRelation (relation=relation@entry=0x7f53e9a525f8, snapshot=snapshot@entry=0x0) at predicate.c:2507 #2 0x0000562395b78a14 in heap_beginscan (relation=0x7f53e9a525f8, snapshot=0x0, nkeys=0, key=0x0, parallel_scan=0x0, allow_strat=<optimizedout>, allow_sync=false, allow_pagemode=true, is_bitmapscan=false, is_samplescan=true, temp_snap=false) at heapam.c:1180 #3 0x0000562395c782d7 in table_beginscan_analyze (rel=0x7f53e9a525f8) at ../../../src/include/access/tableam.h:786 #4 acquire_sample_rows (onerel=onerel@entry=0x7f53e9a525f8, elevel=elevel@entry=13, rows=rows@entry=0x562396f01dd0, targrows=targrows@entry=30000, totalrows=totalrows@entry=0x7ffd0603e498, totaldeadrows=totaldeadrows@entry=0x7ffd0603e490) at analyze.c:1032 #5 0x0000562395c790f2 in do_analyze_rel (onerel=onerel@entry=0x7f53e9a525f8, params=params@entry=0x7ffd0603e6a0, va_cols=va_cols@entry=0x0, acquirefunc=0x562395c781fa <acquire_sample_rows>, relpages=0, inh=inh@entry=false, in_outer_xact=false, elevel=13) atanalyze.c:502 #6 0x0000562395c79930 in analyze_rel (relid=<optimized out>, relation=0x0, params=params@entry=0x7ffd0603e6a0, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:260 #7 0x0000562395cf6f90 in vacuum (relations=0x562396ecbf80, params=params@entry=0x7ffd0603e6a0, bstrategy=<optimized out>,bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:413 #8 0x0000562395cf759d in ExecVacuum (pstate=pstate@entry=0x562396df69f8, vacstmt=vacstmt@entry=0x562396dd54c0, isTopLevel=isTopLevel@entry=true) at vacuum.c:199 #9 0x0000562395e84863 in standard_ProcessUtility (pstmt=0x562396dd5820, queryString=0x562396dd4ad8 "analyze ;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x562396dd5918, completionTag=0x7ffd0603ea10 "") at utility.c:670 #10 0x0000562395e84dba in ProcessUtility (pstmt=pstmt@entry=0x562396dd5820, queryString=<optimized out>, context=<optimizedout>, params=<optimized out>, queryEnv=<optimized out>, dest=dest@entry=0x562396dd5918, completionTag=0x7ffd0603ea10 "") atutility.c:360 #11 0x0000562395e811a1 in PortalRunUtility (portal=portal@entry=0x562396e3a178, pstmt=pstmt@entry=0x562396dd5820, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x562396dd5918, completionTag=completionTag@entry=0x7ffd0603ea10"") at pquery.c:1175 #12 0x0000562395e81e0e in PortalRunMulti (portal=portal@entry=0x562396e3a178, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x562396dd5918, altdest=altdest@entry=0x562396dd5918, completionTag=completionTag@entry=0x7ffd0603ea10 "") at pquery.c:1321 #13 0x0000562395e82b99 in PortalRun (portal=portal@entry=0x562396e3a178, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x562396dd5918, altdest=altdest@entry=0x562396dd5918, completionTag=0x7ffd0603ea10"") at pquery.c:796 #14 0x0000562395e7ee14 in exec_simple_query (query_string=query_string@entry=0x562396dd4ad8 "analyze ;") at postgres.c:1215 #15 0x0000562395e80cfc in PostgresMain (argc=<optimized out>, argv=argv@entry=0x562396e00320, dbname=<optimized out>, username=<optimizedout>) at postgres.c:4249 #16 0x0000562395df6358 in BackendRun (port=port@entry=0x562396df7d30) at postmaster.c:4431 #17 0x0000562395df9477 in BackendStartup (port=port@entry=0x562396df7d30) at postmaster.c:4122 #18 0x0000562395df969a in ServerLoop () at postmaster.c:1704 #19 0x0000562395dfabdb in PostmasterMain (argc=3, argv=<optimized out>) at postmaster.c:1377 #20 0x0000562395d59083 in main (argc=3, argv=0x562396dcf200) at main.c:228 regards, Sergei
Hello Seems table_beginscan_analyze (src/include/access/tableam.h) should not pass second argument as NULL. CC'ing Andres Freund PS: also I noticed in src/include/utils/snapshot.h exactly same comment for both SNAPSHOT_SELF and SNAPSHOT_DIRTY - theyhave no difference? regards, Sergei
Sergei Kornilov <sk@zsrv.org> writes: > I can reproduce with: > set default_transaction_isolation TO serializable ; > analyze ; > Here is backtrace: > #0 SerializationNeededForRead (snapshot=0x0, relation=0x7f53e9a525f8) at predicate.c:530 > #1 PredicateLockRelation (relation=relation@entry=0x7f53e9a525f8, snapshot=snapshot@entry=0x0) at predicate.c:2507 > #2 0x0000562395b78a14 in heap_beginscan (relation=0x7f53e9a525f8, snapshot=0x0, nkeys=0, key=0x0, parallel_scan=0x0, allow_strat=<optimizedout>, > allow_sync=false, allow_pagemode=true, is_bitmapscan=false, is_samplescan=true, temp_snap=false) at heapam.c:1180 > #3 0x0000562395c782d7 in table_beginscan_analyze (rel=0x7f53e9a525f8) at ../../../src/include/access/tableam.h:786 > #4 acquire_sample_rows (onerel=onerel@entry=0x7f53e9a525f8, elevel=elevel@entry=13, rows=rows@entry=0x562396f01dd0, targrows=targrows@entry=30000, > totalrows=totalrows@entry=0x7ffd0603e498, totaldeadrows=totaldeadrows@entry=0x7ffd0603e490) at analyze.c:1032 So the problem is that something is passing a null snapshot to something that isn't expecting that. This seems closely related to the tableam API issue that was being debated a day or two back about whether it's ever valid to hand a null snapshot to an AM. Andres, which layer do you think is at fault here? regards, tom lane
I wrote: > Sergei Kornilov <sk@zsrv.org> writes: >> I can reproduce with: >> set default_transaction_isolation TO serializable ; >> analyze ; > So the problem is that something is passing a null snapshot to something > that isn't expecting that. This seems closely related to the tableam > API issue that was being debated a day or two back about whether it's > ever valid to hand a null snapshot to an AM. Andres, which layer do > you think is at fault here? Bisecting confirms that this broke at commit 737a292b5de296615a715ddce2b2d83d1ee245c5 Author: Andres Freund <andres@anarazel.de> Date: Sat Mar 30 16:21:09 2019 -0700 tableam: VACUUM and ANALYZE support. I'd thought possibly this had something to do with bb16aba50 (Enable parallel query with SERIALIZABLE isolation) but the bisection result makes it pretty clear that it's just a tableam API screwup. regards, tom lane
Hi, On May 18, 2019 11:55:01 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >I wrote: >> Sergei Kornilov <sk@zsrv.org> writes: >>> I can reproduce with: >>> set default_transaction_isolation TO serializable ; >>> analyze ; > >> So the problem is that something is passing a null snapshot to >something >> that isn't expecting that. This seems closely related to the tableam >> API issue that was being debated a day or two back about whether it's >> ever valid to hand a null snapshot to an AM. Andres, which layer do >> you think is at fault here? Not quite - that was about the DML callbacks, this is about the scan itself. And while we have a snapshot allocated, theanalyze version of the beginscan intentionally doesn't take a snapshot. >Bisecting confirms that this broke at > >commit 737a292b5de296615a715ddce2b2d83d1ee245c5 >Author: Andres Freund <andres@anarazel.de> >Date: Sat Mar 30 16:21:09 2019 -0700 > > tableam: VACUUM and ANALYZE support. > >I'd thought possibly this had something to do with bb16aba50 (Enable >parallel query with SERIALIZABLE isolation) but the bisection result >makes it pretty clear that it's just a tableam API screwup. I'm not yet at my computer, but I think all that's needed is to expand the check that prevents the predicate lock to be acquiredfor heap type scans to the analyze case. I'll check it in a few. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes: > Not quite - that was about the DML callbacks, this is about the scan itself. And while we have a snapshot allocated, theanalyze version of the beginscan intentionally doesn't take a snapshot. Uh, what? That's a *huge* regression. See, eg, 7170268ef. We really want ANALYZE to act as though it's reading a normal MVCC snapshot. regards, tom lane
Hi, On 2019-05-18 15:48:47 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > Not quite - that was about the DML callbacks, this is about the scan itself. And while we have a snapshot allocated,the analyze version of the beginscan intentionally doesn't take a snapshot. > > Uh, what? That's a *huge* regression. See, eg, 7170268ef. We > really want ANALYZE to act as though it's reading a normal MVCC > snapshot. Hm? That's not new at all? In 11 we just do: switch (HeapTupleSatisfiesVacuum(&targtuple, OldestXmin, targbuffer)) I.e. unrelated to the tableam changes there's no mvcc snapshot in for visibility determinations. And that's not changed, heap's implementation still uses HTSV. We do *hold* a snapshot, but that's all outside of analyze.c afaik. I've complained before that we're using a snapshot for analyze's sampling scan in a lot of cases where it's not necessary, and that's a very significant problem in production (where autvacuum doesn't cause slowdowns, but autoanalyze does quite substantially). But I'd not change it just as an aside. Greetings, Andres Freund
Hi, Thanks for the report Joe! I've pushed a fix for this. I ended up going down the path of making scan_begin's arguments a bitmask. Given that several people expressed desire for that, and that recognizing analyze scans would have required a new argument, that seemed the most reasonable course. I think the code handling sync/strat in heapam's initscan() could be simplified so we don't set/unset the flags (and previously the booleans) multiple times. But that seems like it ought to be done separately. I'd normally have asked for a round of feedback for the changes, but it seems more urgent to get something out for beta1. As the changes are all below tableam, we can adjust this later without causing much trouble. Regards, Andres On 2019-05-18 13:12:41 -0700, Andres Freund wrote: > Hi, > > On 2019-05-18 15:48:47 -0400, Tom Lane wrote: > > Andres Freund <andres@anarazel.de> writes: > > > Not quite - that was about the DML callbacks, this is about the scan itself. And while we have a snapshot allocated,the analyze version of the beginscan intentionally doesn't take a snapshot. > > > > Uh, what? That's a *huge* regression. See, eg, 7170268ef. We > > really want ANALYZE to act as though it's reading a normal MVCC > > snapshot. > > Hm? That's not new at all? In 11 we just do: > > switch (HeapTupleSatisfiesVacuum(&targtuple, > OldestXmin, > targbuffer)) > > I.e. unrelated to the tableam changes there's no mvcc snapshot in for > visibility determinations. And that's not changed, heap's implementation > still uses HTSV. We do *hold* a snapshot, but that's all outside of > analyze.c afaik. Greetings, Andres Freund
Hi, On 2019-05-18 18:12:31 +0300, Sergei Kornilov wrote: > Seems table_beginscan_analyze (src/include/access/tableam.h) should not pass second argument as NULL. As hopefully explained downthread, and in the commit message, that's not really the concern. We shouldn't use the snapshot in the first place. > PS: also I noticed in src/include/utils/snapshot.h exactly same > comment for both SNAPSHOT_SELF and SNAPSHOT_DIRTY - they have no > difference? That was copy & paste mistake. Fixed. Also expanded the comments a bit. Thanks for noticing! Greetings, Andres Freund