Re: BUG #6425: Bus error in slot_deform_tuple - Mailing list pgsql-bugs

From Bridget Frey
Subject Re: BUG #6425: Bus error in slot_deform_tuple
Date
Msg-id CAHOc93kziir_R4+u3ybF8kidKSkpXDNJ5KJ9ynUHfEYuiX7Z-Q@mail.gmail.com
Whole thread Raw
In response to Re: BUG #6425: Bus error in slot_deform_tuple  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
I just wanted to say thanks to everyone who has been working so hard on
this issue.  I realize it's not certain that this would fix the issues
we're seeing, but we'd be willing to try it out and report back.  The only
caveat is we would need to deploy it to production, so if someone could let
us know what the risk factor is here (e.g. the potential to make things
worse), that would help us plan out how and when we would want to try it.

Thanks again, I'm really hopeful that this will fix the issues we're seeing
- and, if not, at least there seems to be good momentum towards getting to
the root of the problem.
-B

On Thu, Feb 2, 2012 at 10:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wrote:
> > I have not gotten very far with the coredump, except to observe that
> > gdb says the Assert ought to have passed: ...
> > This suggests very strongly that indeed the buffer was changing under
> > us.
>
> I probably ought to let the test case run overnight before concluding
> anything, but at this point it's run for two-plus hours with no errors
> after applying this patch:
>
> diff --git a/src/backend/access/transam/xlog.c
> b/src/backend/access/transam/xlog.c
> index cce87a3..b128bfd 100644
> *** a/src/backend/access/transam/xlog.c
> --- b/src/backend/access/transam/xlog.c
> *************** RestoreBkpBlocks(XLogRecPtr lsn, XLogRec
> *** 3716,3724 ****
>                }
>                else
>                {
> -                       /* must zero-fill the hole */
> -                       MemSet((char *) page, 0, BLCKSZ);
>                        memcpy((char *) page, blk, bkpb.hole_offset);
>                        memcpy((char *) page + (bkpb.hole_offset +
> bkpb.hole_length),
>                                   blk + bkpb.hole_offset,
>                                   BLCKSZ - (bkpb.hole_offset +
> bkpb.hole_length));
> --- 3716,3724 ----
>                }
>                else
>                {
>                        memcpy((char *) page, blk, bkpb.hole_offset);
> +                       /* must zero-fill the hole */
> +                       MemSet((char *) page + bkpb.hole_offset, 0,
> bkpb.hole_length);
>                        memcpy((char *) page + (bkpb.hole_offset +
> bkpb.hole_length),
>                                   blk + bkpb.hole_offset,
>                                   BLCKSZ - (bkpb.hole_offset +
> bkpb.hole_length));
>
>
> The existing code makes the page state transiently invalid (all zeroes)
> for no particularly good reason, and consumes useless cycles to do so,
> so this would be a good change in any case.  The reason it is relevant
> to our current problem is that even though RestoreBkpBlocks faithfully
> takes exclusive lock on the buffer, *that is not enough to guarantee
> that no one else is touching that buffer*.  Another backend that has
> already located a visible tuple on a page is entitled to keep accessing
> that tuple with only a buffer pin.  So the existing code transiently
> wipes the data from underneath the other backend's pin.
>
> It's clear how this explains the symptoms I saw (Assert reporting wrong
> value of lp_flags even though the backend must previously have seen the
> right value, and the eventual coredump captured the right value too).
> It's less clear though whether this explains the symptoms seen by Duncan
> and Bridget.  They presumably are running without asserts enabled, so
> it's unsurprising that they don't see the Assert failure, but what
> happens if control gets past that?  There seem to be several possible
> failure modes:
>
> * Reader picks up zero lp_off/lp_len from the line pointer, and then
> tries to interpret the page header as a tuple.  The results would be
> predictable only until RestoreBkpBlocks puts back nonzero data there,
> and then it's a bit of a mess.  (In particular, t_hoff would be read out
> of the pd_prune_xid field if I counted right, and so would have a rather
> unpredictable value.)
>
> * Reader finds correct location of tuple, but sees t_hoff and/or
> t_infomask as zeroes (the latter possibly causing it to not check for
> nulls, if it doesn't think HEAP_HASNULL is set).  Until RestoreBkpBlocks
> puts back the data, this would devolve to the next case, but after that
> it's a bit unpredictable again.
>
> * Reader finds correct location of data, but sees zeroes there.
>
> I believe that the reported failures involving palloc(-3) in
> text_to_cstring can be explained as instances of seeing zeroes where a
> text or varchar value is expected.  Zeroes would look like a long-format
> varlena header with value zero, and the code would subtract 4 to get the
> data length, then add 1 for the trailing NUL byte that's needed in
> cstring representation, and thus ask for -3 bytes for the cstring
> equivalent.  Furthermore, all three of the above cases end up doing that
> as long as the page stays all-zero for long enough.  If some nonzero
> data gets restored before we're done examining the page, you get some
> other behavior of doubtful predictability.  Maybe the other reported
> symptoms always fall out of that, or maybe not --- it seems surprising
> that we don't have a wider set of visible misbehaviors.
>
> Whether or not this single bug explains all the cases reported so far,
> it certainly seems possible that there are other mistakes of the same
> sort in individual WAL replay routines.  I think we'd better go over
> all of them with a fine-tooth comb.  In general, a WAL replay routine
> can no longer be allowed to create transiently invalid page states
> that would not have occurred in the "live" version of the page change.
>
> I am even more troubled than I was before about what this says about
> the amount of testing Hot Standby has gotten, because AFAICS absolutely
> any use of Hot Standby, no matter the particulars, ought to be heavily
> exposed to this bug.
>
>                        regards, tom lane
>



--
Bridget Frey  Director, Data & Analytics Engineering | Redfin

bridget.frey@redfin.com | tel: 206.576.5894

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #6425: Bus error in slot_deform_tuple
Next
From: giri.anamika0@gmail.com
Date:
Subject: BUG #6430: Wal Segment Size is 128GB