Thread: BUG #10432: failed to re-find parent key in index

BUG #10432: failed to re-find parent key in index

From
m.sakrejda@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      10432
Logged by:          Maciek Sakrejda
Email address:      m.sakrejda@gmail.com
PostgreSQL version: 9.3.4
Operating system:   Ubuntu 12.04 LTS 64-bit
Description:

Hi,

I just got the error `failed to re-find parent key in index "1665279" for
split pages 175193/193740`. From the list archives [1], it looks like
there's no easy recovery, but this happened on a replica right after the
recovery trigger file was touched and recovery completed, so that's not a
huge concern (unless it also happens to the next replica). Is there
something to be learned from the failed cluster? The only odd thing I see in
the logs is that it looks like we tried to take a base backup before
recovery actually completed, though the attempt appears to have failed
cleanly. Here are the logs from around the time of the failure, including
WAL-E logs:

May 23 16:21:28 postgres[10]: [1244-1]  LOG:  trigger file found:
/etc/postgresql/wal-e.d/pull-env/STANDBY_OFF
May 23 16:21:28 postgres[7545]: [5-1]  FATAL:  terminating walreceiver
process due to administrator command
May 23 16:21:28 wal_e.operator.backup INFO     MSG: begin wal restore#012
    STRUCTURED: time=2014-05-23T16:21:28.169498-00 pid=7757 action=wal-fetch
key=s3://.../wal_005/000000020000033900000066.lzo prefix=.../
seg=000000020000033900000066 state=begin
May 23 16:21:28 wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer
locate object while performing wal restore#012        DETAIL: The absolute
URI that could not be located is
s3://.../wal_005/000000020000033900000066.lzo.#012        HINT: This can be
normal when Postgres is trying to detect what timelines are available during
restoration.#012        STRUCTURED: time=2014-05-23T16:21:28.364020-00
pid=7757
May 23 16:21:28 wal_e.operator.backup INFO     MSG: complete wal restore#012
       STRUCTURED: time=2014-05-23T16:21:28.365570-00 pid=7757
action=wal-fetch key=s3://.../wal_005/000000020000033900000066.lzo
prefix=.../ seg=000000020000033900000066 state=complete
May 23 16:21:28 postgres[10]: [1245-1]  LOG:  redo done at 339/650000C8
May 23 16:21:28 postgres[10]: [1246-1]  LOG:  last completed transaction was
at log time 2014-05-23 14:43:31.251167+00
May 23 16:21:28 wal_e.operator.backup INFO     MSG: begin wal restore#012
    STRUCTURED: time=2014-05-23T16:21:28.535075-00 pid=7763 action=wal-fetch
key=s3://.../wal_005/000000020000033900000065.lzo prefix=.../
seg=000000020000033900000065 state=begin
May 23 16:21:29 wal_e.blobstore.s3.s3_util INFO     MSG: completed download
and decompression#012        DETAIL: Downloaded and decompressed
"s3://.../wal_005/000000020000033900000065.lzo" to
"pg_xlog/RECOVERYXLOG"#012        STRUCTURED:
time=2014-05-23T16:21:29.094826-00 pid=7763
May 23 16:21:29 wal_e.operator.backup INFO     MSG: complete wal restore#012
       STRUCTURED: time=2014-05-23T16:21:29.096962-00 pid=7763
action=wal-fetch key=s3://.../wal_005/000000020000033900000065.lzo
prefix=.../ seg=000000020000033900000065 state=complete
May 23 16:21:29 postgres[10]: [1247-1]  LOG:  restored log file
"000000020000033900000065" from archive
May 23 16:21:29 wal_e.operator.backup INFO     MSG: begin wal restore#012
    STRUCTURED: time=2014-05-23T16:21:29.269909-00 pid=7769 action=wal-fetch
key=s3://.../wal_005/00000003.history.lzo prefix=.../ seg=00000003.history
state=begin
May 23 16:21:29 wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer
locate object while performing wal restore#012        DETAIL: The absolute
URI that could not be located is s3://.../wal_005/00000003.history.lzo.#012
      HINT: This can be normal when Postgres is trying to detect what
timelines are available during restoration.#012        STRUCTURED:
time=2014-05-23T16:21:29.424936-00 pid=7769
May 23 16:21:29 wal_e.operator.backup INFO     MSG: complete wal restore#012
       STRUCTURED: time=2014-05-23T16:21:29.426295-00 pid=7769
action=wal-fetch key=s3://.../wal_005/00000003.history.lzo prefix=.../
seg=00000003.history state=complete
May 23 16:21:29 postgres[10]: [1248-1]  LOG:  selected new timeline ID: 3
May 23 16:21:29 wal_e.operator.backup INFO     MSG: begin wal restore#012
    STRUCTURED: time=2014-05-23T16:21:29.593188-00 pid=7775 action=wal-fetch
key=s3://.../wal_005/00000002.history.lzo prefix=.../ seg=00000002.history
state=begin
May 23 16:21:29 postgres[7779]: [4-1] [unknown] LOG:  connection received:
host=[local]
May 23 16:21:29 postgres[7779]: [5-1] postgres LOG:  connection authorized:
user=postgres database=postgres
May 23 16:21:29 postgres[7781]: [4-1] [unknown] LOG:  connection received:
host=[local]
May 23 16:21:29 postgres[7781]: [5-1] postgres LOG:  connection authorized:
user=postgres database=postgres
May 23 16:21:29 postgres[7781]: [6-1] postgres ERROR:  recovery is in
progress
May 23 16:21:29 postgres[7781]: [6-2] postgres HINT:  WAL control functions
cannot be executed during recovery.
May 23 16:21:29 postgres[7781]: [6-3] postgres STATEMENT:  COPY (SELECT
file_name,   lpad(file_offset::text, 8, '0') AS file_offset FROM
pg_xlogfile_name_offset(
pg_start_backup('freeze_start_2014-05-23T16:21:29.681458+00:00'))) TO STDOUT
WITH CSV HEADER;
May 23 16:21:29 wal_e.operator.backup WARNING  MSG: blocking on sending WAL
segments#012        DETAIL: The backup was not completed successfully, but
we have to wait anyway.  See README: TODO about pg_cancel_backup#012
STRUCTURED: time=2014-05-23T16:21:29.764625-00 pid=18790
May 23 16:21:29 postgres[7782]: [4-1] [unknown] LOG:  connection received:
host=[local]
May 23 16:21:29 postgres[7782]: [5-1] postgres LOG:  connection authorized:
user=postgres database=postgres
May 23 16:21:29 postgres[7782]: [6-1] postgres ERROR:  recovery is in
progress
May 23 16:21:29 postgres[7782]: [6-2] postgres HINT:  WAL control functions
cannot be executed during recovery.
May 23 16:21:29 postgres[7782]: [6-3] postgres STATEMENT:  COPY (SELECT
file_name,   lpad(file_offset::text, 8, '0') AS file_offset FROM
pg_xlogfile_name_offset(  pg_stop_backup())) TO STDOUT WITH CSV HEADER;
May 23 16:21:29 wal_e.main   ERROR    MSG: Could not stop hot backup#012
   STRUCTURED: time=2014-05-23T16:21:29.844455-00 pid=18790
May 23 16:21:29 wal_e.blobstore.s3.s3_util INFO     MSG: completed download
and decompression#012        DETAIL: Downloaded and decompressed
"s3://.../wal_005/00000002.history.lzo" to "pg_xlog/RECOVERYHISTORY"#012
   STRUCTURED: time=2014-05-23T16:21:29.866686-00 pid=7775
May 23 16:21:29 wal_e.operator.backup INFO     MSG: complete wal restore#012
       STRUCTURED: time=2014-05-23T16:21:29.868954-00 pid=7775
action=wal-fetch key=s3://.../wal_005/00000002.history.lzo prefix=.../
seg=00000002.history state=complete
May 23 16:21:29 postgres[10]: [1249-1]  LOG:  restored log file
"00000002.history" from archive
May 23 16:21:30 postgres[10]: [1250-1]  LOG:  archive recovery complete
May 23 16:21:30 postgres[10]: [1251-1]  FATAL:  failed to re-find parent key
in index "1665279" for split pages 175193/193740
May 23 16:21:31 postgres[8]: [4-1]  LOG:  startup process (PID 10) exited
with exit code 1
May 23 16:21:31 postgres[8]: [5-1]  LOG:  terminating any other active
server processes


[1]: http://www.postgresql.org/message-id/7753.1291075332@sss.pgh.pa.us

Re: BUG #10432: failed to re-find parent key in index

From
Heikki Linnakangas
Date:
On 05/23/2014 08:52 PM, m.sakrejda@gmail.com wrote:
> I just got the error `failed to re-find parent key in index "1665279" for
> split pages 175193/193740`. From the list archives [1], it looks like
> there's no easy recovery, but this happened on a replica right after the
> recovery trigger file was touched and recovery completed, so that's not a
> huge concern (unless it also happens to the next replica). Is there
> something to be learned from the failed cluster?

I would be interested in seeing the structure of the index, if there is
anything else corrupt in there. Also, what WAL actions led to the error?
Try something like:

  pg_xlogdump -r btree  -p $PGDATA -s 339/65000000 | grep 1665279

and search that for any records related to the failed split, e.g.
grepping further for the block numbers in the error message.

- Heikki

Re: BUG #10432: failed to re-find parent key in index

From
Maciek Sakrejda
Date:
On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

> I would be interested in seeing the structure of the index, if there is
> anything else corrupt in there.


It's an index on (integer, timestamp without time zone). Unfortunately,
it's a customer DB, so getting more direct access may be problematic. Is
there metadata we can gather about it that could be useful?

Also, what WAL actions led to the error? Try something like:
>
>  pg_xlogdump -r btree  -p $PGDATA -s 339/65000000 | grep 1665279
>
> and search that for any records related to the failed split, e.g. grepping
> further for the block numbers in the error message.


That gives me no output even without the grep (except to complain that the
next segment is missing unless I pass '-e 339/65FFFFFF', which is
reasonable, right?). I also had to change the timeline with `-t 2`, but I
don't imagine that makes a difference here. If I go back further, I do get
output for that index, but nothing that mentions 175193 or 193740.

Also, it turned out that this was a persistent problem--a couple of
replicas failed the same way. I then worked with the customer and had them
re-create the index, and that seems to have resolved the issue. My
colleague Greg Stark has taken over the forensic investigation here--he may
have more to add.

Re: BUG #10432: failed to re-find parent key in index

From
Andres Freund
Date:
Hi,

On 2014-05-29 08:56:10 -0700, Maciek Sakrejda wrote:
> On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <
> hlinnakangas@vmware.com> wrote:
>
> > I would be interested in seeing the structure of the index, if there is
> > anything else corrupt in there.
>
>
> It's an index on (integer, timestamp without time zone). Unfortunately,
> it's a customer DB, so getting more direct access may be problematic. Is
> there metadata we can gather about it that could be useful?
>
> Also, what WAL actions led to the error? Try something like:
> >
> >  pg_xlogdump -r btree  -p $PGDATA -s 339/65000000 | grep 1665279
> >
> > and search that for any records related to the failed split, e.g. grepping
> > further for the block numbers in the error message.

I wonder why the failure didn't show the record that triggered the
error? This is on a primary? If so, my guess is that this happened
during the 'incomplete action' cleanup. Do you have a better explanation
Heikki?

> Also, it turned out that this was a persistent problem--a couple of
> replicas failed the same way. I then worked with the customer and had them
> re-create the index, and that seems to have resolved the issue. My
> colleague Greg Stark has taken over the forensic investigation here--he may
> have more to add.

My hope^Wguess is that this is a symptom of
1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even
9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the hot chain is
corrupted such errors could occur
When were those standbys made? Did the issue occur on the primary as
well?

Greetings,

Andres Freund

PS: wal-e's intersperesed output is rather annoying...

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10432: failed to re-find parent key in index

From
Maciek Sakrejda
Date:
On Thu, May 29, 2014 at 9:08 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

> Hi,
>
> On 2014-05-29 08:56:10 -0700, Maciek Sakrejda wrote:
> > On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <
> > hlinnakangas@vmware.com> wrote:
> >
> > > I would be interested in seeing the structure of the index, if there is
> > > anything else corrupt in there.
> >
> >
> > It's an index on (integer, timestamp without time zone). Unfortunately,
> > it's a customer DB, so getting more direct access may be problematic. Is
> > there metadata we can gather about it that could be useful?
> >
> > Also, what WAL actions led to the error? Try something like:
> > >
> > >  pg_xlogdump -r btree  -p $PGDATA -s 339/65000000 | grep 1665279
> > >
> > > and search that for any records related to the failed split, e.g.
> grepping
> > > further for the block numbers in the error message.
>
> I wonder why the failure didn't show the record that triggered the
> error? This is on a primary?


No, I ran pg_xlogdump on the failed replica--I thought that's what Heikki
was suggesting (and it seemed to me like the source of the problem would be
there).

My hope^Wguess is that this is a symptom of
> 1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even
> 9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the hot chain is
> corrupted such errors could occur
> When were those standbys made? Did the issue occur on the primary as
> well?
>

The original ancestor was a 9.3.2. No problems on the primary.

PS: wal-e's intersperesed output is rather annoying...
>

I thought it might be relevant. I'll exclude it in the future.

Re: BUG #10432: failed to re-find parent key in index

From
Andres Freund
Date:
Hi,

On 2014-05-29 10:14:25 -0700, Maciek Sakrejda wrote:
> > I wonder why the failure didn't show the record that triggered the
> > error? This is on a primary?

> No, I ran pg_xlogdump on the failed replica--I thought that's what Heikki
> was suggesting (and it seemed to me like the source of the problem would be
> there).

The WAL should be the same everywhere... But what I was wondering about
was less about the xlogdump but more about the lack of a message that
tells us the record that triggered the error.

> My hope^Wguess is that this is a symptom of
> > 1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even
> > 9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the hot chain is
> > corrupted such errors could occur
> > When were those standbys made? Did the issue occur on the primary as
> > well?
> >

> The original ancestor was a 9.3.2. No problems on the primary.

So, this is quite possibly just a 'delayed' consequence from the earlier
bugs.

> PS: wal-e's intersperesed output is rather annoying...

> I thought it might be relevant. I'll exclude it in the future.

Wasn't really related to this bug. More of a general observation that it
frequently is a bit verbose...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10432: failed to re-find parent key in index

From
Maciek Sakrejda
Date:
On Thu, May 29, 2014 at 10:19 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

> The WAL should be the same everywhere... But what I was wondering about
> was less about the xlogdump but more about the lack of a message that
> tells us the record that triggered the error.
>

No idea. I'm new to pg_xlogdump, but I did get output if I went further
back, so I'm pretty sure I was using it correctly.


> So, this is quite possibly just a 'delayed' consequence from the earlier
> bugs.
>

Yeah, makes sense. I don't suppose there's a simple way to periodically
"audit" index integrity?


> Wasn't really related to this bug. More of a general observation that it
> frequently is a bit verbose...
>

No argument there. I think it does have a --terse flag, but we find the
default verbose output useful occasionally.

Re: BUG #10432: failed to re-find parent key in index

From
Greg Stark
Date:
On Thu, May 29, 2014 at 6:44 PM, Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> Yeah, makes sense. I don't suppose there's a simple way to periodically
> "audit" index integrity?

Heikki and Peter were starting work on a btree checking program at
PGCon. But there would have to be a separate program for each type of
index and the list of possible invalid states to check for is pretty
long. I think it would actually be a pretty handy program to have if
only to document the invariants that we believe btrees should hold
which right now is spread throughout a lot of code.

I pulled down WAL going back 11 days and haven't turned up anything
referring to either of those blocks in this relation. Is it possible
there's an off-by-one error in the error message (or xlogdump) though?
I do find references to 175194 and 193741...

# grep -h ' \(19374[01]\|17519[234]\)' *.1665280
rmgr: Btree       len (rec/tot):     18/  5434, tx:   85039158, lsn:
307/9E007260, prev 307/9E005E08, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 175194/30
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
175194; hole: offset: 304, length: 2832
rmgr: Btree       len (rec/tot):     18/  5510, tx:   85042421, lsn:
307/A90CC220, prev 307/A90CC1D0, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 175194/30
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
175194; hole: offset: 308, length: 2756
rmgr: Btree       len (rec/tot):     18/  6234, tx:   85624987, lsn:
30E/ED2F9870, prev 30E/ED2F8138, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 193741/45
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
193741; hole: offset: 432, length: 2032
rmgr: Btree       len (rec/tot):     18/  6294, tx:   85668302, lsn:
30F/4400F888, prev 30F/4400E138, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 193741/45
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
193741; hole: offset: 436, length: 1972
rmgr: Btree       len (rec/tot):     18/  5586, tx:   86227110, lsn:
315/D74C3D00, prev 315/D74C3CB0, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 175194/55
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
175194; hole: offset: 312, length: 2680
rmgr: Btree       len (rec/tot):     18/  5662, tx:   86235168, lsn:
315/ED01FFC0, prev 315/ED01FF70, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 175194/55
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
175194; hole: offset: 316, length: 2604
rmgr: Btree       len (rec/tot):   4054/ 14306, tx:   86349912, lsn:
317/BC070968, prev 317/BC070918, bkp: 1100, desc: split_r: rel
1663/16385/1665280 left 20842, right 306675, next 193741, level 0,
firstright 67
        backup bkp #1; rel 1663/16385/1665280; fork: main; block:
193741; hole: offset: 436, length: 1972
rmgr: Btree       len (rec/tot):     18/  6354, tx:   86461004, lsn:
319/011987A8, prev 319/01198758, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 193741/103
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
193741; hole: offset: 440, length: 1912
rmgr: Btree       len (rec/tot):     18/  6414, tx:   86462619, lsn:
319/0919D338, prev 319/0919D2E8, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 193741/103
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
193741; hole: offset: 444, length: 1852
rmgr: Btree       len (rec/tot):     74/   106, tx:   86462626, lsn:
319/091B24B0, prev 319/091B2460, bkp: 0000, desc: insert: rel
1663/16385/1665280; tid 193741/103
rmgr: Btree       len (rec/tot):     18/  5738, tx:   98148324, lsn:
33B/C7E1CA08, prev 33B/C7E1C9B8, bkp: 1000, desc: insert: rel
1663/16385/1665280; tid 175194/48
        backup bkp #0; rel 1663/16385/1665280; fork: main; block:
175194; hole: offset: 320, length: 2528





--
greg

Re: BUG #10432: failed to re-find parent key in index

From
Heikki Linnakangas
Date:
> I pulled down WAL going back 11 days and haven't turned up anything
> referring to either of those blocks in this relation.

Can you go further back? Can you look at the LSNs on the pages, both
after the split and also in an old base backup? That would give us an
idea of when the error might've happened.

> Is it possible
> there's an off-by-one error in the error message (or xlogdump) though?

Don't think so, the elog and the pg_xlogdump code both look correct to me.

- Heikki

Re: BUG #10432: failed to re-find parent key in index

From
Greg Stark
Date:
On Tue, May 27, 2014 at 7:06 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I would be interested in seeing the structure of the index, if there is
> anything else corrupt in there.

So now let me check my arithmetic. block 175193 is the block starting
at byte offset 361439232 of 1665279.1 ?

And block 193740 is at byte offset 513376256 of 1665279.1 (which
appears to be off the end actually, but I haven't recovered the
database yet)

My math is:

8192 * ( 175193 - 131072 )
361439232

8192 * ( 193740 - 131072 )
513376256


--
greg

Re: BUG #10432: failed to re-find parent key in index

From
Greg Stark
Date:
Ok, not sure why my first attempt didn't turn this up. I found the
split in segment 334/91:

rmgr: Btree       len (rec/tot):   3776/  9220, tx:   95765459, lsn:
334/91455AB8, prev 334/91455A70, bkp: 0100, desc: split_l: rel
1663/16385/1665279 left 175193, right 193740, next 182402, level 0,
firstright 138

I've attached all the xlog records pertaining to this relation
grepping for either of these two blocks.

Now interestingly the hot backup was taken starting at 334/90 and
replay didn't finish until 339/65 so it is entirely possible, even
likely, that the backup caught this split in an inconsistent state.

How should I go about dumping the two blocks? I have the backup prior
to WAL replay as well as all the WAL for this time period. I can't
connect to the database so I'm guessing this will look like replay
until it hits a record for these block, use dd to extract the block,
rinse lather repeat. Then dump each of those extracted pages using
pageinspect on byteas. This sounds pretty laborious :(

Attachment

Re: BUG #10432: failed to re-find parent key in index

From
Andres Freund
Date:
On 2014-06-02 18:17:46 +0100, Greg Stark wrote:
> Ok, not sure why my first attempt didn't turn this up. I found the
> split in segment 334/91:
>
> rmgr: Btree       len (rec/tot):   3776/  9220, tx:   95765459, lsn:
> 334/91455AB8, prev 334/91455A70, bkp: 0100, desc: split_l: rel
> 1663/16385/1665279 left 175193, right 193740, next 182402, level 0,
> firstright 138
>
> I've attached all the xlog records pertaining to this relation
> grepping for either of these two blocks.
>
> Now interestingly the hot backup was taken starting at 334/90 and
> replay didn't finish until 339/65 so it is entirely possible, even
> likely, that the backup caught this split in an inconsistent state.

Did you check whether all the necessary FPIs were generated? That'd be
my very first suspect.

How many checkpoints are inbetween 334/90 and 339/65?

> How should I go about dumping the two blocks? I have the backup prior
> to WAL replay as well as all the WAL for this time period. I can't
> connect to the database so I'm guessing this will look like replay
> until it hits a record for these block, use dd to extract the block,
> rinse lather repeat. Then dump each of those extracted pages using
> pageinspect on byteas. This sounds pretty laborious :(

I guess you could make xlogdump dump the data from the backup blocks...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10432: failed to re-find parent key in index

From
Greg Stark
Date:
On Mon, Jun 2, 2014 at 6:40 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> Did you check whether all the necessary FPIs were generated? That'd be
> my very first suspect.

Really? Shouldn't only the last one matter? All the other ones will be
overwritten later by later full page writes anywys, no? Also, i
thought this was pretty much underlying infrastructure that would be
pretty hard to get wrong in just one call site.

I see a relatively few inserts that don't have backup blocks. The few
that do are preceded by another insert for the same block since the
last checkpoint redo pointer location. There is precisely one split
and it's shortly after the backup started but is a split to a
different right block from the one in the error.

> How many checkpoints are inbetween 334/90 and 339/65?

68 (though 13 of those are basically empty as the system was idle for a while)

> I guess you could make xlogdump dump the data from the backup blocks...

That's a clever idea. But it proved to be a little less useful than I
expected. Because it's a btree, not a heap I can't actually decipher
the items using bt_page_items() on plain byteas.

I am looking at the page headers and I'm a bit surprised to find the
LSN on the backup block images doesn't match the LSN of the record the
backup blocks are found in. I guess that makes sense since some AMs
will be generating records prior to even having touched the page in
question? Other than that there's not much to see in the page headers
:(

--
greg

Attachment

Re: BUG #10432: failed to re-find parent key in index

From
Andres Freund
Date:
On 2014-06-03 01:36:40 +0100, Greg Stark wrote:
> On Mon, Jun 2, 2014 at 6:40 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > Did you check whether all the necessary FPIs were generated? That'd be
> > my very first suspect.
>
> Really? Shouldn't only the last one matter? All the other ones will be
> overwritten later by later full page writes anywys, no? Also, i
> thought this was pretty much underlying infrastructure that would be
> pretty hard to get wrong in just one call site.

Well, if we missed a single FPI somewhere - e.g. by accidentally not
filling XLogRecData->buffer or by confusing which bkp block numbers
refer to what (both happened during 9.4 development) you'd potentially
get a torn page. And that'd very well explain such an error message.

Your split record had only one backup block. I'd manually make sure all
the other ones previously had some. You probably need to look in the nbt
code to see which bkp block refers to what.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10432: failed to re-find parent key in index

From
Greg Stark
Date:
Ok, I made some progress. It turns out this was a pre-existing problem
in the master. They've been getting "failed to re-find parent" errors
for weeks. Far longer than I have any WAL or backups for.

What I did find that was interesting is that this error basically made
the backups worthless. I could build a hot standby and connect and
query it. But as soon as recovery finished it would try to clean up
the incomplete split and fail. Because it had noticed the incomplete
split it had skipped every restartpoint and the next time I tried to
start it it insisted on restarting recovery from the beginning. If we
had been lucky enough not to do any page splits in the broken index
while the backup was being taken all would have been fine. But that
doesn't seem to have happened so all the backups were unrecoverable.

So a few thoughts on how to improve things:

1) Failed to re-find parent should perhaps not be FATAL to recovery.
In fact any index replay error would really be nice not to have to
crash on. All crashing does is prevent the user from being able to
bring up their database and REINDEX the btree. This may be another use
case for the machinery that would protect against corrupt hash indexes
or user-defined indexes -- if we could mark the index invalid and
proceed (perhaps ignoring subsequent records for it) that would be
great.

2) When we see an abort record we could check for any cleanup actions
triggered by that transaction and run them right away. I think the
checkpoints (and maybe hot standby snapshots or vacuum cleanup
records?) also include information about the oldest xid running, they
would also let us prune the cleanup actions sooner. That would at
least find the error sooner. In conjunction with (1) it would also
mean subsequent restartpoints would be effective instead of
suppressing restartpoints right to the end of recovery.

3) The lack of logs around an error during recovery makes it hard to
decipher what's going on. It would be nice to see "Beginning Xlog
cleanup (1 incomplete splits to replay)" and when it crashed "Last
safe point to restart recovery is 324/ABCDEF". As it was it was a
pretty big mystery why the database crashed, the logs made it appear
as if it had started up fine.  And it was unclear why restarting it
caused it to replay from the beginning, I thought maybe something was
wrong with our scripts.

Re: BUG #10432: failed to re-find parent key in index

From
Greg Stark
Date:
On Wed, Jun 4, 2014 at 12:14 PM, Greg Stark <stark@mit.edu> wrote:
> Because it had noticed the incomplete
> split it had skipped every restartpoint and the next time I tried to
> start it it insisted on restarting recovery from the beginning.

Incidentally this also caused operational problems. Suppressing all
restartpoints also meant Postgres kept around all the WAL from the
entire recovery. This caused the pg_xlog volume to fill up.



--
greg

Re: BUG #10432: failed to re-find parent key in index

From
Andres Freund
Date:
Hi,

On 2014-06-04 12:14:27 +0100, Greg Stark wrote:
> Ok, I made some progress. It turns out this was a pre-existing problem
> in the master. They've been getting "failed to re-find parent" errors
> for weeks. Far longer than I have any WAL or backups for.

Ok.

> 1) Failed to re-find parent should perhaps not be FATAL to recovery.
> In fact any index replay error would really be nice not to have to
> crash on.

I think that's not really realistic. We'd need to put a significant
amount of machinery for this in to be workable. Suddenly a crash restart
doesn't guarantee that you're indexes are there anymore? Not nice.

> All crashing does is prevent the user from being able to
> bring up their database and REINDEX the btree. This may be another use
> case for the machinery that would protect against corrupt hash indexes
> or user-defined indexes -- if we could mark the index invalid and
> proceed (perhaps ignoring subsequent records for it) that would be
> great.
>
> 2) When we see an abort record we could check for any cleanup actions
> triggered by that transaction and run them right away. I think the
> checkpoints (and maybe hot standby snapshots or vacuum cleanup
> records?) also include information about the oldest xid running, they
> would also let us prune the cleanup actions sooner. That would at
> least find the error sooner. In conjunction with (1) it would also
> mean subsequent restartpoints would be effective instead of
> suppressing restartpoints right to the end of recovery.

Heikki removed restartpoints from 9.4 alltogether so most of these are
gone. As all these -even if they were doable - sound far too large for
backpatching I think it's luckily mostly done.

> 3) The lack of logs around an error during recovery makes it hard to
> decipher what's going on. It would be nice to see "Beginning Xlog
> cleanup (1 incomplete splits to replay)" and when it crashed "Last
> safe point to restart recovery is 324/ABCDEF". As it was it was a
> pretty big mystery why the database crashed, the logs made it appear
> as if it had started up fine.  And it was unclear why restarting it
> caused it to replay from the beginning, I thought maybe something was
> wrong with our scripts.

I think this should be fixed by setting up error context stack support
in two places. a) in StartupXLOG() before the rm_cleanup() calls b) in <
9.4 inside the individual cleanup routines.
We do all that around redo routines, but, as evidenced here, that's not
always enough.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #10432: failed to re-find parent key in index

From
Heikki Linnakangas
Date:
On 06/04/2014 02:14 PM, Greg Stark wrote:
> Ok, I made some progress. It turns out this was a pre-existing problem
> in the master. They've been getting "failed to re-find parent" errors
> for weeks. Far longer than I have any WAL or backups for.

Bummer. You don't have logs reaching back to when the first error
happened either, I presume.

The most likely cause for originally failing to insert a downlink is
running out of disk space when trying to split the parent page. (Even
that's pretty unlikely, though.)

> What I did find that was interesting is that this error basically made
> the backups worthless. I could build a hot standby and connect and
> query it. But as soon as recovery finished it would try to clean up
> the incomplete split and fail. Because it had noticed the incomplete
> split it had skipped every restartpoint and the next time I tried to
> start it it insisted on restarting recovery from the beginning. If we
> had been lucky enough not to do any page splits in the broken index
> while the backup was being taken all would have been fine. But that
> doesn't seem to have happened so all the backups were unrecoverable.

It's only *incomplete* splits that happen while the backup is taken that
matter. i.e. a page is split but inserting the downlink to the parent
fails for some reason. If a page is split and the downlink is inserted
successfully, that's OK.

> So a few thoughts on how to improve things:
>
> 1) Failed to re-find parent should perhaps not be FATAL to recovery.
> In fact any index replay error would really be nice not to have to
> crash on. All crashing does is prevent the user from being able to
> bring up their database and REINDEX the btree. This may be another use
> case for the machinery that would protect against corrupt hash indexes
> or user-defined indexes -- if we could mark the index invalid and
> proceed (perhaps ignoring subsequent records for it) that would be
> great.

Yeah, that would be nice as an option.

> 2) When we see an abort record we could check for any cleanup actions
> triggered by that transaction and run them right away. I think the
> checkpoints (and maybe hot standby snapshots or vacuum cleanup
> records?) also include information about the oldest xid running, they
> would also let us prune the cleanup actions sooner. That would at
> least find the error sooner. In conjunction with (1) it would also
> mean subsequent restartpoints would be effective instead of
> suppressing restartpoints right to the end of recovery.

You can't do any cleanup actions until you've recovered all the WAL. A
cleanup action means inserting a tuple to the parent page, and you can't
do that in the middle of recovery.

But we could detect the case and give a warning.

> 3) The lack of logs around an error during recovery makes it hard to
> decipher what's going on. It would be nice to see "Beginning Xlog
> cleanup (1 incomplete splits to replay)" and when it crashed "Last
> safe point to restart recovery is 324/ABCDEF". As it was it was a
> pretty big mystery why the database crashed, the logs made it appear
> as if it had started up fine.  And it was unclear why restarting it
> caused it to replay from the beginning, I thought maybe something was
> wrong with our scripts

Yeah. All of this is gone in 9.4 anyway, but if you can come up with
something non-invasive that can be back-patched...

- Heikki