Thread: PG 12 beta 1 segfault during analyze
I encountered the following segfault when running against a PG 12 beta1 during a analyze against a table. #0 0x000056008ad0c826 in update_attstats (vacattrstats=0x0, natts=2139062143, inh=false, relid=<error reading variable: Cannot access memory at address 0x40>) at analyze.c:572 #1 do_analyze_rel (onerel=onerel@entry=0x7f0bc59a7a38, params=params@entry=0x7ffe06aeabb0, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=8, inh=inh@entry=false, in_outer_xact=false, elevel=13) at analyze.c:572 #2 0x000056008ad0d2e0 in analyze_rel (relid=<optimized out>, relation=<optimized out>, params=params@entry=0x7ffe06aeabb0, va_cols=0x0, in_outer_xact=<optimized out>, bstrategy=<optimized out>) at analyze.c:260 #3 0x000056008ad81300 in vacuum (relations=0x56008c4d1110, params=params@entry=0x7ffe06aeabb0, bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=isTopLevel@entry=true) at vacuum.c:413 #4 0x000056008ad8197f in ExecVacuum (pstate=pstate@entry=0x56008c5c2688, vacstmt=vacstmt@entry=0x56008c3e0428, isTopLevel=isTopLevel@entry=true) at vacuum.c:199 #5 0x000056008af0133b in standard_ProcessUtility (pstmt=0x56008c982e50, queryString=0x56008c3df368 "select \"_disorder_replica\".finishTableAfterCopy(3); analyze \"disorder\".\"do_inventory\"; ", context=<optimized out>, params=0x0, queryEnv=0x0, dest=0x56008c9831d8, completionTag=0x7ffe06aeaef0 "") at utility.c:670 #6 0x000056008aefe112 in PortalRunUtility (portal=0x56008c4515f8, pstmt=0x56008c982e50, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=<optimized out>, completionTag=0x7ffe06aeaef0 "") at pquery.c:1175 #7 0x000056008aefec91 in PortalRunMulti (portal=portal@entry=0x56008c4515f8, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x56008c9831d8, altdest=altdest@entry=0x56008c9831d8, completionTag=completionTag@entry=0x7ffe06aeaef0 "") at pquery.c:1328 #8 0x000056008aeff9e9 in PortalRun (portal=portal@entry=0x56008c4515f8, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x56008c9831d8, altdest=altdest@entry=0x56008c9831d8, completionTag=0x7ffe06aeaef0 "") at pquery.c:796 #9 0x000056008aefb6bb in exec_simple_query ( query_string=0x56008c3df368 "select \"_disorder_replica\".finishTableAfterCopy(3); analyze \"disorder\".\"do_inventory\"; ") at postgres.c:1215 With master from today(aa087ec64f703a52f3c48c) I still get segfaults under do_analyze_rel compute_index_stats (onerel=0x7f84bf1436a8, col_context=0x55a5d3d56640, numrows=<optimized out>, rows=0x55a5d4039520, nindexes=<optimized out>, indexdata=0x3ff0000000000000, totalrows=500) at analyze.c:711 #1 do_analyze_rel (onerel=onerel@entry=0x7f84bf1436a8, params=0x7ffdde2b5c40, params@entry=0x3ff0000000000000, va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, relpages=11, inh=inh@entry=false, in_outer_xact=true, elevel=13) at analyze.c:552
Steve Singer <steve@ssinger.info> writes: > I encountered the following segfault when running against a PG 12 beta1 > during a analyze against a table. Nobody else has reported this, so you're going to have to work on producing a self-contained test case, or else debugging it yourself. regards, tom lane
On 6/15/19 10:18 PM, Tom Lane wrote: > Steve Singer <steve@ssinger.info> writes: >> I encountered the following segfault when running against a PG 12 beta1 >> during a analyze against a table. > Nobody else has reported this, so you're going to have to work on > producing a self-contained test case, or else debugging it yourself. > > regards, tom lane > > > The attached patch fixes the issue. Steve
Attachment
Steve Singer <steve@ssinger.info> writes: > The attached patch fixes the issue. Hmm, that's a pretty obvious mistake :-( but after some fooling around I've not been able to cause a crash with it. I wonder what test case you were using, on what platform? regards, tom lane
Hi, On 2019-06-18 00:32:17 -0400, Tom Lane wrote: > Steve Singer <steve@ssinger.info> writes: > > The attached patch fixes the issue. > > Hmm, that's a pretty obvious mistake :-( but after some fooling around > I've not been able to cause a crash with it. I wonder what test case > you were using, on what platform? I suspect that's because the bug is "only" in the HEAPTUPLE_DELETE_IN_PROGRESS case. And it's "harmless" as far as crashing goes in the !TransactionIdIsCurrentTransactionId() case, because as the tuple is sampled, we just return. And then there still needs to be an actually dead row afterwards, to actually trigger dereferencing the modified deadrows. And then acquire_sample_rows()'s deadrows actually needs to point to something that causes crashes when modified. I can definitely get it to do a "wild" pointer write: Breakpoint 2, heapam_scan_analyze_next_tuple (scan=0x55f8fcb92728, OldestXmin=512, liverows=0x7fff56159850, deadrows=0x7fff56159f50, slot=0x55f8fcb92b40) at /home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:1061 1061 *deadrows += 1; (gdb) p deadrows $9 = (double *) 0x7fff56159f50 (gdb) up #1 0x000055f8fad922c5 in table_scan_analyze_next_tuple (scan=0x55f8fcb92728, OldestXmin=512, liverows=0x7fff56159850, deadrows=0x7fff56159848, slot=0x55f8fcb92b40) at /home/andres/src/postgresql/src/include/access/tableam.h:1467 1467 return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin, (gdb) p deadrows $10 = (double *) 0x7fff56159848 making a question of a crash just a question of the exact stack layout and the number of deleted tuples. Will fix tomorrow morning. Greetings, Andres Freund
On 6/18/19 12:32 AM, Tom Lane wrote: > Steve Singer <steve@ssinger.info> writes: >> The attached patch fixes the issue. > Hmm, that's a pretty obvious mistake :-( but after some fooling around > I've not been able to cause a crash with it. I wonder what test case > you were using, on what platform? > > regards, tom lane > > I was running the slony regression tests. The crash happened when it tries to replicate a particular table that already has data in it on the replica. It doesn't happen with every table and I haven't been able to replicate the crash in as self contained test by manually doing similar steps to just that table with psql. This is on x64.
Andres Freund <andres@anarazel.de> writes: > On 2019-06-18 00:32:17 -0400, Tom Lane wrote: >> Hmm, that's a pretty obvious mistake :-( but after some fooling around >> I've not been able to cause a crash with it. I wonder what test case >> you were using, on what platform? > I suspect that's because the bug is "only" in the > HEAPTUPLE_DELETE_IN_PROGRESS case. And it's "harmless" as far as > crashing goes in the !TransactionIdIsCurrentTransactionId() case, > because as the tuple is sampled, we just return. And then there still > needs to be an actually dead row afterwards, to actually trigger > dereferencing the modified deadrows. And then acquire_sample_rows()'s > deadrows actually needs to point to something that causes crashes when > modified. Right, I'd come to the same conclusions last night, but failed to make a crasher example. Not sure why though, because my first try today blew up real good: --- \set N 10 create table bug as select generate_series(1,:N) f1; delete from bug where f1 = :N; begin; delete from bug; analyze verbose bug; rollback; drop table bug; --- On my machine, N smaller than 10 doesn't do it, but of course that will be very platform-specific. > Will fix tomorrow morning. OK. To save you the trouble of "git blame", it looks like 737a292b5de296615a715ddce2b2d83d1ee245c5 introduced this. regards, tom lane
Hi, On 2019-06-17 21:46:02 -0400, Steve Singer wrote: > On 6/15/19 10:18 PM, Tom Lane wrote: > > Steve Singer <steve@ssinger.info> writes: > > > I encountered the following segfault when running against a PG 12 beta1 > > > during a analyze against a table. > > Nobody else has reported this, so you're going to have to work on > > producing a self-contained test case, or else debugging it yourself. > > > > regards, tom lane > > > > > > > The attached patch fixes the issue. Thanks for the bug report, diagnosis and patch. Pushed. I included a small testcase for ANALYZE running in a transaction that also modified a few rows, after going back and forth on it for a while. Seems unlikely that we'll reintroduce this specific bug, but it seems good to have test coverage a least some of the HEAPTUPLE_DELETE_IN_PROGRESS path. We currently have none... I think the testcase would catch the issue at hand on most machines, by mixing live/dead/deleted-by-current-transaction rows. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > Hi, > > On 2019-06-17 21:46:02 -0400, Steve Singer wrote: >> On 6/15/19 10:18 PM, Tom Lane wrote: >> > Steve Singer <steve@ssinger.info> writes: >> > > I encountered the following segfault when running against a PG 12 beta1 >> > > during a analyze against a table. >> > Nobody else has reported this, so you're going to have to work on >> > producing a self-contained test case, or else debugging it yourself. >> > >> > regards, tom lane >> > >> > >> > >> The attached patch fixes the issue. > > Thanks for the bug report, diagnosis and patch. Pushed. I was going to suggest trying to prevent similar bugs by declaring these and other output parameters as `double *const foo` in tableam.h, but doing that without adding the corresponding `const` in heapam_handler.c doesn't even raise a warning. Still, declaring them as *const in both places might serve as an example/reminder for people writing their own table AMs. - ilmari -- "I use RMS as a guide in the same way that a boat captain would use a lighthouse. It's good to know where it is, but you generally don't want to find yourself in the same spot." - Tollef Fog Heen