Thread: PG 12 beta 1 segfault during analyze

PG 12 beta 1 segfault during analyze

From
Steve Singer
Date:
I encountered the following segfault when running against a  PG 12 beta1

during a analyze against a table.


#0  0x000056008ad0c826 in update_attstats (vacattrstats=0x0, 
natts=2139062143, inh=false,
     relid=<error reading variable: Cannot access memory at address 
0x40>) at analyze.c:572
#1  do_analyze_rel (onerel=onerel@entry=0x7f0bc59a7a38, 
params=params@entry=0x7ffe06aeabb0, va_cols=va_cols@entry=0x0,
     acquirefunc=<optimized out>, relpages=8, inh=inh@entry=false, 
in_outer_xact=false, elevel=13) at analyze.c:572
#2  0x000056008ad0d2e0 in analyze_rel (relid=<optimized out>, 
relation=<optimized out>,
     params=params@entry=0x7ffe06aeabb0, va_cols=0x0, 
in_outer_xact=<optimized out>, bstrategy=<optimized out>)
     at analyze.c:260
#3  0x000056008ad81300 in vacuum (relations=0x56008c4d1110, 
params=params@entry=0x7ffe06aeabb0,
     bstrategy=<optimized out>, bstrategy@entry=0x0, 
isTopLevel=isTopLevel@entry=true) at vacuum.c:413
#4  0x000056008ad8197f in ExecVacuum 
(pstate=pstate@entry=0x56008c5c2688, vacstmt=vacstmt@entry=0x56008c3e0428,
     isTopLevel=isTopLevel@entry=true) at vacuum.c:199
#5  0x000056008af0133b in standard_ProcessUtility (pstmt=0x56008c982e50,
     queryString=0x56008c3df368 "select 
\"_disorder_replica\".finishTableAfterCopy(3); analyze 
\"disorder\".\"do_inventory\"; ", context=<optimized out>, params=0x0, 
queryEnv=0x0, dest=0x56008c9831d8, completionTag=0x7ffe06aeaef0 "")
     at utility.c:670
#6  0x000056008aefe112 in PortalRunUtility (portal=0x56008c4515f8, 
pstmt=0x56008c982e50, isTopLevel=<optimized out>,
     setHoldSnapshot=<optimized out>, dest=<optimized out>, 
completionTag=0x7ffe06aeaef0 "") at pquery.c:1175
#7  0x000056008aefec91 in PortalRunMulti 
(portal=portal@entry=0x56008c4515f8, isTopLevel=isTopLevel@entry=true,
     setHoldSnapshot=setHoldSnapshot@entry=false, 
dest=dest@entry=0x56008c9831d8, altdest=altdest@entry=0x56008c9831d8,
     completionTag=completionTag@entry=0x7ffe06aeaef0 "") at pquery.c:1328
#8  0x000056008aeff9e9 in PortalRun (portal=portal@entry=0x56008c4515f8, 
count=count@entry=9223372036854775807,
     isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, 
dest=dest@entry=0x56008c9831d8,
     altdest=altdest@entry=0x56008c9831d8, completionTag=0x7ffe06aeaef0 
"") at pquery.c:796
#9  0x000056008aefb6bb in exec_simple_query (
     query_string=0x56008c3df368 "select 
\"_disorder_replica\".finishTableAfterCopy(3); analyze 
\"disorder\".\"do_inventory\"; ") at postgres.c:1215


With master from today(aa087ec64f703a52f3c48c) I still get segfaults 
under do_analyze_rel

compute_index_stats (onerel=0x7f84bf1436a8, col_context=0x55a5d3d56640, 
numrows=<optimized out>, rows=0x55a5d4039520,
     nindexes=<optimized out>, indexdata=0x3ff0000000000000, 
totalrows=500) at analyze.c:711
#1  do_analyze_rel (onerel=onerel@entry=0x7f84bf1436a8, 
params=0x7ffdde2b5c40, params@entry=0x3ff0000000000000,
     va_cols=va_cols@entry=0x0, acquirefunc=<optimized out>, 
relpages=11, inh=inh@entry=false, in_outer_xact=true,
     elevel=13) at analyze.c:552





Re: PG 12 beta 1 segfault during analyze

From
Tom Lane
Date:
Steve Singer <steve@ssinger.info> writes:
> I encountered the following segfault when running against a  PG 12 beta1
> during a analyze against a table.

Nobody else has reported this, so you're going to have to work on
producing a self-contained test case, or else debugging it yourself.

            regards, tom lane



Re: PG 12 beta 1 segfault during analyze

From
Steve Singer
Date:
On 6/15/19 10:18 PM, Tom Lane wrote:
> Steve Singer <steve@ssinger.info> writes:
>> I encountered the following segfault when running against a  PG 12 beta1
>> during a analyze against a table.
> Nobody else has reported this, so you're going to have to work on
> producing a self-contained test case, or else debugging it yourself.
>
>             regards, tom lane
>
>
>
The attached patch fixes the issue.


Steve



Attachment

Re: PG 12 beta 1 segfault during analyze

From
Tom Lane
Date:
Steve Singer <steve@ssinger.info> writes:
> The attached patch fixes the issue.

Hmm, that's a pretty obvious mistake :-( but after some fooling around
I've not been able to cause a crash with it.  I wonder what test case
you were using, on what platform?

            regards, tom lane



Re: PG 12 beta 1 segfault during analyze

From
Andres Freund
Date:
Hi,

On 2019-06-18 00:32:17 -0400, Tom Lane wrote:
> Steve Singer <steve@ssinger.info> writes:
> > The attached patch fixes the issue.
>
> Hmm, that's a pretty obvious mistake :-( but after some fooling around
> I've not been able to cause a crash with it.  I wonder what test case
> you were using, on what platform?

I suspect that's because the bug is "only" in the
HEAPTUPLE_DELETE_IN_PROGRESS case. And it's "harmless" as far as
crashing goes in the !TransactionIdIsCurrentTransactionId() case,
because as the tuple is sampled, we just return. And then there still
needs to be an actually dead row afterwards, to actually trigger
dereferencing the modified deadrows. And then acquire_sample_rows()'s
deadrows actually needs to point to something that causes crashes when
modified.

I can definitely get it to do a "wild" pointer write:

Breakpoint 2, heapam_scan_analyze_next_tuple (scan=0x55f8fcb92728, OldestXmin=512, liverows=0x7fff56159850,
    deadrows=0x7fff56159f50, slot=0x55f8fcb92b40) at
/home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:1061
1061                    *deadrows += 1;
(gdb) p deadrows
$9 = (double *) 0x7fff56159f50
(gdb) up
#1  0x000055f8fad922c5 in table_scan_analyze_next_tuple (scan=0x55f8fcb92728, OldestXmin=512, liverows=0x7fff56159850,
    deadrows=0x7fff56159848, slot=0x55f8fcb92b40) at /home/andres/src/postgresql/src/include/access/tableam.h:1467
1467        return scan->rs_rd->rd_tableam->scan_analyze_next_tuple(scan, OldestXmin,
(gdb) p deadrows
$10 = (double *) 0x7fff56159848

making a question of a crash just a question of the exact stack layout
and the number of deleted tuples.

Will fix tomorrow morning.

Greetings,

Andres Freund



Re: PG 12 beta 1 segfault during analyze

From
Steve Singer
Date:
On 6/18/19 12:32 AM, Tom Lane wrote:
> Steve Singer <steve@ssinger.info> writes:
>> The attached patch fixes the issue.
> Hmm, that's a pretty obvious mistake :-( but after some fooling around
> I've not been able to cause a crash with it.  I wonder what test case
> you were using, on what platform?
>
>             regards, tom lane
>
>

I was running the slony regression tests.  The crash happened when it 
tries to replicate a particular table that already has data in it on the 
replica.  It doesn't happen with every table and  I haven't been able to 
replicate the crash in as self contained test by manually doing similar 
steps to just that table with psql.

This is on x64.





Re: PG 12 beta 1 segfault during analyze

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-06-18 00:32:17 -0400, Tom Lane wrote:
>> Hmm, that's a pretty obvious mistake :-( but after some fooling around
>> I've not been able to cause a crash with it.  I wonder what test case
>> you were using, on what platform?

> I suspect that's because the bug is "only" in the
> HEAPTUPLE_DELETE_IN_PROGRESS case. And it's "harmless" as far as
> crashing goes in the !TransactionIdIsCurrentTransactionId() case,
> because as the tuple is sampled, we just return. And then there still
> needs to be an actually dead row afterwards, to actually trigger
> dereferencing the modified deadrows. And then acquire_sample_rows()'s
> deadrows actually needs to point to something that causes crashes when
> modified.

Right, I'd come to the same conclusions last night, but failed to make
a crasher example.  Not sure why though, because my first try today
blew up real good:

---
\set N 10

create table bug as select generate_series(1,:N) f1;
delete from bug where f1 = :N;

begin;
delete from bug;
analyze verbose bug;
rollback;

drop table bug;
---

On my machine, N smaller than 10 doesn't do it, but of course that
will be very platform-specific.

> Will fix tomorrow morning.

OK.  To save you the trouble of "git blame", it looks like
737a292b5de296615a715ddce2b2d83d1ee245c5 introduced this.

            regards, tom lane



Re: PG 12 beta 1 segfault during analyze

From
Andres Freund
Date:
Hi,

On 2019-06-17 21:46:02 -0400, Steve Singer wrote:
> On 6/15/19 10:18 PM, Tom Lane wrote:
> > Steve Singer <steve@ssinger.info> writes:
> > > I encountered the following segfault when running against a  PG 12 beta1
> > > during a analyze against a table.
> > Nobody else has reported this, so you're going to have to work on
> > producing a self-contained test case, or else debugging it yourself.
> > 
> >             regards, tom lane
> > 
> > 
> > 
> The attached patch fixes the issue.

Thanks for the bug report, diagnosis and patch. Pushed.

I included a small testcase for ANALYZE running in a transaction that
also modified a few rows, after going back and forth on it for a
while. Seems unlikely that we'll reintroduce this specific bug, but it
seems good to have test coverage a least some of the
HEAPTUPLE_DELETE_IN_PROGRESS path. We currently have none...

I think the testcase would catch the issue at hand on most machines, by
mixing live/dead/deleted-by-current-transaction rows.

Greetings,

Andres Freund



Re: PG 12 beta 1 segfault during analyze

From
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Date:
Andres Freund <andres@anarazel.de> writes:

> Hi,
>
> On 2019-06-17 21:46:02 -0400, Steve Singer wrote:
>> On 6/15/19 10:18 PM, Tom Lane wrote:
>> > Steve Singer <steve@ssinger.info> writes:
>> > > I encountered the following segfault when running against a  PG 12 beta1
>> > > during a analyze against a table.
>> > Nobody else has reported this, so you're going to have to work on
>> > producing a self-contained test case, or else debugging it yourself.
>> > 
>> >             regards, tom lane
>> > 
>> > 
>> > 
>> The attached patch fixes the issue.
>
> Thanks for the bug report, diagnosis and patch. Pushed.

I was going to suggest trying to prevent similar bugs by declaring these
and other output parameters as `double *const foo` in tableam.h, but
doing that without adding the corresponding `const` in heapam_handler.c
doesn't even raise a warning.

Still, declaring them as *const in both places might serve as an
example/reminder for people writing their own table AMs.

- ilmari
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen