Thread: BUG #10432: failed to re-find parent key in index
The following bug has been logged on the website:

Bug reference:      10432
Logged by:          Maciek Sakrejda
Email address:      m.sakrejda@gmail.com
PostgreSQL version: 9.3.4
Operating system:   Ubuntu 12.04 LTS 64-bit
Description:

Hi,

I just got the error `failed to re-find parent key in index "1665279" for split pages 175193/193740`. From the list archives [1], it looks like there's no easy recovery, but this happened on a replica right after the recovery trigger file was touched and recovery completed, so that's not a huge concern (unless it also happens to the next replica). Is there something to be learned from the failed cluster?

The only odd thing I see in the logs is that it looks like we tried to take a base backup before recovery actually completed, though the attempt appears to have failed cleanly. Here are the logs from around the time of the failure, including WAL-E logs:

May 23 16:21:28 postgres[10]: [1244-1] LOG: trigger file found: /etc/postgresql/wal-e.d/pull-env/STANDBY_OFF
May 23 16:21:28 postgres[7545]: [5-1] FATAL: terminating walreceiver process due to administrator command
May 23 16:21:28 wal_e.operator.backup INFO MSG: begin wal restore#012 STRUCTURED: time=2014-05-23T16:21:28.169498-00 pid=7757 action=wal-fetch key=s3://.../wal_005/000000020000033900000066.lzo prefix=.../ seg=000000020000033900000066 state=begin
May 23 16:21:28 wal_e.blobstore.s3.s3_util WARNING MSG: could no longer locate object while performing wal restore#012 DETAIL: The absolute URI that could not be located is s3://.../wal_005/000000020000033900000066.lzo.#012 HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.#012 STRUCTURED: time=2014-05-23T16:21:28.364020-00 pid=7757
May 23 16:21:28 wal_e.operator.backup INFO MSG: complete wal restore#012 STRUCTURED: time=2014-05-23T16:21:28.365570-00 pid=7757 action=wal-fetch key=s3://.../wal_005/000000020000033900000066.lzo prefix=.../ seg=000000020000033900000066 state=complete
May 23 16:21:28 postgres[10]: [1245-1] LOG: redo done at 339/650000C8
May 23 16:21:28 postgres[10]: [1246-1] LOG: last completed transaction was at log time 2014-05-23 14:43:31.251167+00
May 23 16:21:28 wal_e.operator.backup INFO MSG: begin wal restore#012 STRUCTURED: time=2014-05-23T16:21:28.535075-00 pid=7763 action=wal-fetch key=s3://.../wal_005/000000020000033900000065.lzo prefix=.../ seg=000000020000033900000065 state=begin
May 23 16:21:29 wal_e.blobstore.s3.s3_util INFO MSG: completed download and decompression#012 DETAIL: Downloaded and decompressed "s3://.../wal_005/000000020000033900000065.lzo" to "pg_xlog/RECOVERYXLOG"#012 STRUCTURED: time=2014-05-23T16:21:29.094826-00 pid=7763
May 23 16:21:29 wal_e.operator.backup INFO MSG: complete wal restore#012 STRUCTURED: time=2014-05-23T16:21:29.096962-00 pid=7763 action=wal-fetch key=s3://.../wal_005/000000020000033900000065.lzo prefix=.../ seg=000000020000033900000065 state=complete
May 23 16:21:29 postgres[10]: [1247-1] LOG: restored log file "000000020000033900000065" from archive
May 23 16:21:29 wal_e.operator.backup INFO MSG: begin wal restore#012 STRUCTURED: time=2014-05-23T16:21:29.269909-00 pid=7769 action=wal-fetch key=s3://.../wal_005/00000003.history.lzo prefix=.../ seg=00000003.history state=begin
May 23 16:21:29 wal_e.blobstore.s3.s3_util WARNING MSG: could no longer locate object while performing wal restore#012 DETAIL: The absolute URI that could not be located is s3://.../wal_005/00000003.history.lzo.#012 HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.#012 STRUCTURED: time=2014-05-23T16:21:29.424936-00 pid=7769
May 23 16:21:29 wal_e.operator.backup INFO MSG: complete wal restore#012 STRUCTURED: time=2014-05-23T16:21:29.426295-00 pid=7769 action=wal-fetch key=s3://.../wal_005/00000003.history.lzo prefix=.../ seg=00000003.history state=complete
May 23 16:21:29 postgres[10]: [1248-1] LOG: selected new timeline ID: 3
May 23 16:21:29 wal_e.operator.backup INFO MSG: begin wal restore#012 STRUCTURED: time=2014-05-23T16:21:29.593188-00 pid=7775 action=wal-fetch key=s3://.../wal_005/00000002.history.lzo prefix=.../ seg=00000002.history state=begin
May 23 16:21:29 postgres[7779]: [4-1] [unknown] LOG: connection received: host=[local]
May 23 16:21:29 postgres[7779]: [5-1] postgres LOG: connection authorized: user=postgres database=postgres
May 23 16:21:29 postgres[7781]: [4-1] [unknown] LOG: connection received: host=[local]
May 23 16:21:29 postgres[7781]: [5-1] postgres LOG: connection authorized: user=postgres database=postgres
May 23 16:21:29 postgres[7781]: [6-1] postgres ERROR: recovery is in progress
May 23 16:21:29 postgres[7781]: [6-2] postgres HINT: WAL control functions cannot be executed during recovery.
May 23 16:21:29 postgres[7781]: [6-3] postgres STATEMENT: COPY (SELECT file_name, lpad(file_offset::text, 8, '0') AS file_offset FROM pg_xlogfile_name_offset( pg_start_backup('freeze_start_2014-05-23T16:21:29.681458+00:00'))) TO STDOUT WITH CSV HEADER;
May 23 16:21:29 wal_e.operator.backup WARNING MSG: blocking on sending WAL segments#012 DETAIL: The backup was not completed successfully, but we have to wait anyway. See README: TODO about pg_cancel_backup#012 STRUCTURED: time=2014-05-23T16:21:29.764625-00 pid=18790
May 23 16:21:29 postgres[7782]: [4-1] [unknown] LOG: connection received: host=[local]
May 23 16:21:29 postgres[7782]: [5-1] postgres LOG: connection authorized: user=postgres database=postgres
May 23 16:21:29 postgres[7782]: [6-1] postgres ERROR: recovery is in progress
May 23 16:21:29 postgres[7782]: [6-2] postgres HINT: WAL control functions cannot be executed during recovery.
May 23 16:21:29 postgres[7782]: [6-3] postgres STATEMENT: COPY (SELECT file_name, lpad(file_offset::text, 8, '0') AS file_offset FROM pg_xlogfile_name_offset( pg_stop_backup())) TO STDOUT WITH CSV HEADER;
May 23 16:21:29 wal_e.main ERROR MSG: Could not stop hot backup#012 STRUCTURED: time=2014-05-23T16:21:29.844455-00 pid=18790
May 23 16:21:29 wal_e.blobstore.s3.s3_util INFO MSG: completed download and decompression#012 DETAIL: Downloaded and decompressed "s3://.../wal_005/00000002.history.lzo" to "pg_xlog/RECOVERYHISTORY"#012 STRUCTURED: time=2014-05-23T16:21:29.866686-00 pid=7775
May 23 16:21:29 wal_e.operator.backup INFO MSG: complete wal restore#012 STRUCTURED: time=2014-05-23T16:21:29.868954-00 pid=7775 action=wal-fetch key=s3://.../wal_005/00000002.history.lzo prefix=.../ seg=00000002.history state=complete
May 23 16:21:29 postgres[10]: [1249-1] LOG: restored log file "00000002.history" from archive
May 23 16:21:30 postgres[10]: [1250-1] LOG: archive recovery complete
May 23 16:21:30 postgres[10]: [1251-1] FATAL: failed to re-find parent key in index "1665279" for split pages 175193/193740
May 23 16:21:31 postgres[8]: [4-1] LOG: startup process (PID 10) exited with exit code 1
May 23 16:21:31 postgres[8]: [5-1] LOG: terminating any other active server processes

[1]: http://www.postgresql.org/message-id/7753.1291075332@sss.pgh.pa.us
On 05/23/2014 08:52 PM, m.sakrejda@gmail.com wrote:
> I just got the error `failed to re-find parent key in index "1665279" for
> split pages 175193/193740`. From the list archives [1], it looks like
> there's no easy recovery, but this happened on a replica right after the
> recovery trigger file was touched and recovery completed, so that's not a
> huge concern (unless it also happens to the next replica). Is there
> something to be learned from the failed cluster?

I would be interested in seeing the structure of the index, if there is anything else corrupt in there.

Also, what WAL actions led to the error? Try something like:

pg_xlogdump -r btree -p $PGDATA -s 339/65000000 | grep 1665279

and search that for any records related to the failed split, e.g. grepping further for the block numbers in the error message.

- Heikki
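(Concretely, that further grep might look something like this -- just a sketch, with the block numbers taken from the error message and the start LSN as above:)

pg_xlogdump -r btree -p $PGDATA -s 339/65000000 | grep 1665279 | grep -E '175193|193740'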
On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> I would be interested in seeing the structure of the index, if there is
> anything else corrupt in there.

It's an index on (integer, timestamp without time zone). Unfortunately, it's a customer DB, so getting more direct access may be problematic. Is there metadata we can gather about it that could be useful?

> Also, what WAL actions led to the error? Try something like:
>
> pg_xlogdump -r btree -p $PGDATA -s 339/65000000 | grep 1665279
>
> and search that for any records related to the failed split, e.g. grepping
> further for the block numbers in the error message.

That gives me no output even without the grep (except to complain that the next segment is missing unless I pass '-e 339/65FFFFFF', which is reasonable, right?). I also had to change the timeline with `-t 2`, but I don't imagine that makes a difference here. If I go back further, I do get output for that index, but nothing that mentions 175193 or 193740.

Also, it turned out that this was a persistent problem--a couple of replicas failed the same way. I then worked with the customer and had them re-create the index, and that seems to have resolved the issue. My colleague Greg Stark has taken over the forensic investigation here--he may have more to add.
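(Spelled out, the invocation described here would presumably be something along these lines, with the timeline and end position as mentioned above:)

pg_xlogdump -r btree -p $PGDATA -t 2 -s 339/65000000 -e 339/65FFFFFF | grep 1665279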
Hi,

On 2014-05-29 08:56:10 -0700, Maciek Sakrejda wrote:
> On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>
> > I would be interested in seeing the structure of the index, if there is
> > anything else corrupt in there.
>
> It's an index on (integer, timestamp without time zone). Unfortunately,
> it's a customer DB, so getting more direct access may be problematic. Is
> there metadata we can gather about it that could be useful?
>
> > Also, what WAL actions led to the error? Try something like:
> >
> > pg_xlogdump -r btree -p $PGDATA -s 339/65000000 | grep 1665279
> >
> > and search that for any records related to the failed split, e.g. grepping
> > further for the block numbers in the error message.

I wonder why the failure didn't show the record that triggered the error? This is on a primary? If so, my guess is that this happened during the 'incomplete action' cleanup. Do you have a better explanation, Heikki?

> Also, it turned out that this was a persistent problem--a couple of
> replicas failed the same way. I then worked with the customer and had them
> re-create the index, and that seems to have resolved the issue. My
> colleague Greg Stark has taken over the forensic investigation here--he may
> have more to add.

My hope^Wguess is that this is a symptom of 1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even 9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the HOT chain is corrupted, such errors could occur.

When were those standbys made? Did the issue occur on the primary as well?

Greetings,

Andres Freund

PS: wal-e's interspersed output is rather annoying...

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, May 29, 2014 at 9:08 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2014-05-29 08:56:10 -0700, Maciek Sakrejda wrote:
> > On Tue, May 27, 2014 at 11:06 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> >
> > > I would be interested in seeing the structure of the index, if there is
> > > anything else corrupt in there.
> >
> > It's an index on (integer, timestamp without time zone). Unfortunately,
> > it's a customer DB, so getting more direct access may be problematic. Is
> > there metadata we can gather about it that could be useful?
> >
> > > Also, what WAL actions led to the error? Try something like:
> > >
> > > pg_xlogdump -r btree -p $PGDATA -s 339/65000000 | grep 1665279
> > >
> > > and search that for any records related to the failed split, e.g. grepping
> > > further for the block numbers in the error message.
>
> I wonder why the failure didn't show the record that triggered the
> error? This is on a primary?

No, I ran pg_xlogdump on the failed replica--I thought that's what Heikki was suggesting (and it seemed to me like the source of the problem would be there).

> My hope^Wguess is that this is a symptom of
> 1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even
> 9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the HOT chain is
> corrupted, such errors could occur.
>
> When were those standbys made? Did the issue occur on the primary as
> well?

The original ancestor was a 9.3.2. No problems on the primary.

> PS: wal-e's interspersed output is rather annoying...

I thought it might be relevant. I'll exclude it in the future.
Hi,

On 2014-05-29 10:14:25 -0700, Maciek Sakrejda wrote:
> > I wonder why the failure didn't show the record that triggered the
> > error? This is on a primary?
>
> No, I ran pg_xlogdump on the failed replica--I thought that's what Heikki
> was suggesting (and it seemed to me like the source of the problem would be
> there).

The WAL should be the same everywhere... But what I was wondering about was less about the xlogdump and more about the lack of a message that tells us the record that triggered the error.

> > My hope^Wguess is that this is a symptom of
> > 1a917ae8610d44985fd2027da0cfe60ccece9104 (not released) or even
> > 9a57858f1103b89a5674f0d50c5fe1f756411df6 (9.3.4). Once the HOT chain is
> > corrupted, such errors could occur.
> >
> > When were those standbys made? Did the issue occur on the primary as
> > well?
>
> The original ancestor was a 9.3.2. No problems on the primary.

So, this is quite possibly just a 'delayed' consequence from the earlier bugs.

> > PS: wal-e's interspersed output is rather annoying...
>
> I thought it might be relevant. I'll exclude it in the future.

Wasn't really related to this bug. More of a general observation that it frequently is a bit verbose...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, May 29, 2014 at 10:19 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> The WAL should be the same everywhere... But what I was wondering about
> was less about the xlogdump and more about the lack of a message that
> tells us the record that triggered the error.

No idea. I'm new to pg_xlogdump, but I did get output if I went further back, so I'm pretty sure I was using it correctly.

> So, this is quite possibly just a 'delayed' consequence from the earlier
> bugs.

Yeah, makes sense. I don't suppose there's a simple way to periodically "audit" index integrity?

> Wasn't really related to this bug. More of a general observation that it
> frequently is a bit verbose...

No argument there. I think it does have a --terse flag, but we find the default verbose output useful occasionally.
On Thu, May 29, 2014 at 6:44 PM, Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> Yeah, makes sense. I don't suppose there's a simple way to periodically
> "audit" index integrity?

Heikki and Peter were starting work on a btree checking program at PGCon. But there would have to be a separate program for each type of index, and the list of possible invalid states to check for is pretty long. I think it would actually be a pretty handy program to have, if only to document the invariants that we believe btrees should hold, which right now are spread throughout a lot of code.

I pulled down WAL going back 11 days and haven't turned up anything referring to either of those blocks in this relation. Is it possible there's an off-by-one error in the error message (or xlogdump) though? I do find references to 175194 and 193741...

# grep -h ' \(19374[01]\|17519[234]\)' *.1665280

rmgr: Btree len (rec/tot): 18/ 5434, tx: 85039158, lsn: 307/9E007260, prev 307/9E005E08, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/30
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 304, length: 2832
rmgr: Btree len (rec/tot): 18/ 5510, tx: 85042421, lsn: 307/A90CC220, prev 307/A90CC1D0, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/30
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 308, length: 2756
rmgr: Btree len (rec/tot): 18/ 6234, tx: 85624987, lsn: 30E/ED2F9870, prev 30E/ED2F8138, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 193741/45
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 432, length: 2032
rmgr: Btree len (rec/tot): 18/ 6294, tx: 85668302, lsn: 30F/4400F888, prev 30F/4400E138, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 193741/45
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 436, length: 1972
rmgr: Btree len (rec/tot): 18/ 5586, tx: 86227110, lsn: 315/D74C3D00, prev 315/D74C3CB0, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/55
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 312, length: 2680
rmgr: Btree len (rec/tot): 18/ 5662, tx: 86235168, lsn: 315/ED01FFC0, prev 315/ED01FF70, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/55
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 316, length: 2604
rmgr: Btree len (rec/tot): 4054/ 14306, tx: 86349912, lsn: 317/BC070968, prev 317/BC070918, bkp: 1100, desc: split_r: rel 1663/16385/1665280 left 20842, right 306675, next 193741, level 0, firstright 67
        backup bkp #1; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 436, length: 1972
rmgr: Btree len (rec/tot): 18/ 6354, tx: 86461004, lsn: 319/011987A8, prev 319/01198758, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 193741/103
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 440, length: 1912
rmgr: Btree len (rec/tot): 18/ 6414, tx: 86462619, lsn: 319/0919D338, prev 319/0919D2E8, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 193741/103
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 193741; hole: offset: 444, length: 1852
rmgr: Btree len (rec/tot): 74/ 106, tx: 86462626, lsn: 319/091B24B0, prev 319/091B2460, bkp: 0000, desc: insert: rel 1663/16385/1665280; tid 193741/103
rmgr: Btree len (rec/tot): 18/ 5738, tx: 98148324, lsn: 33B/C7E1CA08, prev 33B/C7E1C9B8, bkp: 1000, desc: insert: rel 1663/16385/1665280; tid 175194/48
        backup bkp #0; rel 1663/16385/1665280; fork: main; block: 175194; hole: offset: 320, length: 2528

--
greg
> I pulled down WAL going back 11 days and haven't turned up anything
> referring to either of those blocks in this relation.

Can you go further back? Can you look at the LSNs on the pages, both after the split and also in an old base backup? That would give us an idea of when the error might've happened.

> Is it possible
> there's an off-by-one error in the error message (or xlogdump) though?

Don't think so, the elog and the pg_xlogdump code both look correct to me.

- Heikki
On Tue, May 27, 2014 at 7:06 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> I would be interested in seeing the structure of the index, if there is
> anything else corrupt in there.

So now let me check my arithmetic. Block 175193 is the block starting at byte offset 361439232 of 1665279.1? And block 193740 is at byte offset 513376256 of 1665279.1 (which appears to be off the end actually, but I haven't recovered the database yet).

My math is:

  8192 * (175193 - 131072) = 361439232
  8192 * (193740 - 131072) = 513376256

--
greg
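(That arithmetic looks right for a default build, where each 1 GB segment file holds 131072 blocks of 8 kB, so ".1" covers blocks 131072..262143. A rough sketch of checking it, and of peeking at the page LSN Heikki asked about -- the database OID 16385 and relfilenode 1665279 are taken from the WAL dumps above; the dd/xxd usage is only illustrative:)

echo $(( 8192 * (175193 - 131072) ))   # 361439232, offset of block 175193 within 1665279.1
echo $(( 8192 * (193740 - 131072) ))   # 513376256, offset of block 193740 within 1665279.1

# pd_lsn is stored in the first 8 bytes of the page header
dd if=$PGDATA/base/16385/1665279.1 bs=8192 skip=$((175193 - 131072)) count=1 2>/dev/null | xxd -l 8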
Ok, not sure why my first attempt didn't turn this up. I found the split in segment 334/91:

rmgr: Btree len (rec/tot): 3776/ 9220, tx: 95765459, lsn: 334/91455AB8, prev 334/91455A70, bkp: 0100, desc: split_l: rel 1663/16385/1665279 left 175193, right 193740, next 182402, level 0, firstright 138

I've attached all the xlog records pertaining to this relation, grepping for either of these two blocks.

Now interestingly, the hot backup was taken starting at 334/90 and replay didn't finish until 339/65, so it is entirely possible, even likely, that the backup caught this split in an inconsistent state.

How should I go about dumping the two blocks? I have the backup prior to WAL replay as well as all the WAL for this time period. I can't connect to the database, so I'm guessing this will look like: replay until it hits a record for these blocks, use dd to extract the block, rinse, lather, repeat. Then dump each of those extracted pages using pageinspect on byteas. This sounds pretty laborious :(
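(One iteration of that loop might look roughly like this, under the same assumptions as before -- block 175193, database OID 16385; the scratch database name and page file name are made up for illustration:)

# pull the raw 8 kB page out of the partially replayed cluster
dd if=$PGDATA/base/16385/1665279.1 bs=8192 skip=$((175193 - 131072)) count=1 of=block_175193.page

# dump the page header from any scratch instance with pageinspect installed;
# page_header() accepts a raw page as bytea, and pg_read_binary_file() can read
# the file if it is copied into that instance's data directory
psql -d scratch <<'SQL'
CREATE EXTENSION IF NOT EXISTS pageinspect;
SELECT * FROM page_header(pg_read_binary_file('block_175193.page', 0, 8192));
SQL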
On 2014-06-02 18:17:46 +0100, Greg Stark wrote:
> Ok, not sure why my first attempt didn't turn this up. I found the split in
> segment 334/91:
>
> rmgr: Btree len (rec/tot): 3776/ 9220, tx: 95765459, lsn: 334/91455AB8,
> prev 334/91455A70, bkp: 0100, desc: split_l: rel 1663/16385/1665279
> left 175193, right 193740, next 182402, level 0, firstright 138
>
> I've attached all the xlog records pertaining to this relation, grepping
> for either of these two blocks.
>
> Now interestingly, the hot backup was taken starting at 334/90 and replay
> didn't finish until 339/65, so it is entirely possible, even likely, that
> the backup caught this split in an inconsistent state.

Did you check whether all the necessary FPIs were generated? That'd be my very first suspect.

How many checkpoints are in between 334/90 and 339/65?

> How should I go about dumping the two blocks? I have the backup prior to
> WAL replay as well as all the WAL for this time period. I can't connect to
> the database, so I'm guessing this will look like: replay until it hits a
> record for these blocks, use dd to extract the block, rinse, lather,
> repeat. Then dump each of those extracted pages using pageinspect on
> byteas. This sounds pretty laborious :(

I guess you could make xlogdump dump the data from the backup blocks...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
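(A first pass at that check could stick to the pg_xlogdump output format shown earlier and list records for the index that carry no full-page image at all; each hit would then still need to be compared against the redo pointer of the preceding checkpoint. The LSN range below is just the backup window quoted above, and the flags mirror the earlier invocations:)

pg_xlogdump -r btree -p $PGDATA -t 2 -s 334/90000000 -e 339/65FFFFFF | grep 1665279 | grep 'bkp: 0000'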
On Mon, Jun 2, 2014 at 6:40 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Did you check whether all the necessary FPIs were generated? That'd be
> my very first suspect.

Really? Shouldn't only the last one matter? All the other ones will be overwritten by later full page writes anyway, no? Also, I thought this was pretty much underlying infrastructure that would be pretty hard to get wrong in just one call site.

I see relatively few inserts that don't have backup blocks, and those few are preceded by another insert for the same block since the last checkpoint redo pointer location. There is precisely one split, and it's shortly after the backup started, but it is a split to a different right block from the one in the error.

> How many checkpoints are in between 334/90 and 339/65?

68 (though 13 of those are basically empty, as the system was idle for a while).

> I guess you could make xlogdump dump the data from the backup blocks...

That's a clever idea. But it proved to be a little less useful than I expected. Because it's a btree, not a heap, I can't actually decipher the items using bt_page_items() on plain byteas. I am looking at the page headers, and I'm a bit surprised to find the LSN on the backup block images doesn't match the LSN of the record the backup blocks are found in. I guess that makes sense, since some AMs will be generating records prior to even having touched the page in question?

Other than that there's not much to see in the page headers :(

--
greg
On 2014-06-03 01:36:40 +0100, Greg Stark wrote:
> On Mon, Jun 2, 2014 at 6:40 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> > Did you check whether all the necessary FPIs were generated? That'd be
> > my very first suspect.
>
> Really? Shouldn't only the last one matter? All the other ones will be
> overwritten by later full page writes anyway, no? Also, I thought this was
> pretty much underlying infrastructure that would be pretty hard to get
> wrong in just one call site.

Well, if we missed a single FPI somewhere - e.g. by accidentally not filling XLogRecData->buffer or by confusing which bkp block numbers refer to what (both happened during 9.4 development) - you'd potentially get a torn page. And that'd very well explain such an error message.

Your split record had only one backup block. I'd manually make sure all the other ones previously had some. You probably need to look in the nbtree code to see which bkp block refers to what.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Ok, I made some progress. It turns out this was a pre-existing problem in the master. They've been getting "failed to re-find parent" errors for weeks. Far longer than I have any WAL or backups for.

What I did find that was interesting is that this error basically made the backups worthless. I could build a hot standby and connect and query it. But as soon as recovery finished it would try to clean up the incomplete split and fail. Because it had noticed the incomplete split it had skipped every restartpoint and the next time I tried to start it it insisted on restarting recovery from the beginning. If we had been lucky enough not to do any page splits in the broken index while the backup was being taken all would have been fine. But that doesn't seem to have happened so all the backups were unrecoverable.

So a few thoughts on how to improve things:

1) Failed to re-find parent should perhaps not be FATAL to recovery. In fact any index replay error would really be nice not to have to crash on. All crashing does is prevent the user from being able to bring up their database and REINDEX the btree. This may be another use case for the machinery that would protect against corrupt hash indexes or user-defined indexes -- if we could mark the index invalid and proceed (perhaps ignoring subsequent records for it) that would be great.

2) When we see an abort record we could check for any cleanup actions triggered by that transaction and run them right away. I think the checkpoints (and maybe hot standby snapshots or vacuum cleanup records?) also include information about the oldest xid running; they would also let us prune the cleanup actions sooner. That would at least find the error sooner. In conjunction with (1) it would also mean subsequent restartpoints would be effective instead of suppressing restartpoints right to the end of recovery.

3) The lack of logs around an error during recovery makes it hard to decipher what's going on. It would be nice to see "Beginning Xlog cleanup (1 incomplete split to replay)" and, when it crashed, "Last safe point to restart recovery is 324/ABCDEF". As it was it was a pretty big mystery why the database crashed; the logs made it appear as if it had started up fine. And it was unclear why restarting it caused it to replay from the beginning -- I thought maybe something was wrong with our scripts.
On Wed, Jun 4, 2014 at 12:14 PM, Greg Stark <stark@mit.edu> wrote:
> Because it had noticed the incomplete split it had skipped every
> restartpoint and the next time I tried to start it it insisted on
> restarting recovery from the beginning.

Incidentally, this also caused operational problems. Suppressing all restartpoints also meant Postgres kept around all the WAL from the entire recovery. This caused the pg_xlog volume to fill up.

--
greg
Hi,

On 2014-06-04 12:14:27 +0100, Greg Stark wrote:
> Ok, I made some progress. It turns out this was a pre-existing problem
> in the master. They've been getting "failed to re-find parent" errors
> for weeks. Far longer than I have any WAL or backups for.

Ok.

> 1) Failed to re-find parent should perhaps not be FATAL to recovery.
> In fact any index replay error would really be nice not to have to
> crash on.

I think that's not really realistic. We'd need to put in a significant amount of machinery for this to be workable. Suddenly a crash restart doesn't guarantee that your indexes are there anymore? Not nice.

> All crashing does is prevent the user from being able to
> bring up their database and REINDEX the btree. This may be another use
> case for the machinery that would protect against corrupt hash indexes
> or user-defined indexes -- if we could mark the index invalid and
> proceed (perhaps ignoring subsequent records for it) that would be
> great.
>
> 2) When we see an abort record we could check for any cleanup actions
> triggered by that transaction and run them right away. I think the
> checkpoints (and maybe hot standby snapshots or vacuum cleanup
> records?) also include information about the oldest xid running; they
> would also let us prune the cleanup actions sooner. That would at
> least find the error sooner. In conjunction with (1) it would also
> mean subsequent restartpoints would be effective instead of
> suppressing restartpoints right to the end of recovery.

Heikki removed the incomplete-action cleanup machinery (and the restartpoint suppression that goes with it) from 9.4 altogether, so most of these are gone. As all of these - even if they were doable - sound far too large for backpatching, I think it's luckily mostly done.

> 3) The lack of logs around an error during recovery makes it hard to
> decipher what's going on. It would be nice to see "Beginning Xlog
> cleanup (1 incomplete split to replay)" and, when it crashed, "Last
> safe point to restart recovery is 324/ABCDEF". As it was it was a
> pretty big mystery why the database crashed; the logs made it appear
> as if it had started up fine. And it was unclear why restarting it
> caused it to replay from the beginning -- I thought maybe something was
> wrong with our scripts.

I think this should be fixed by setting up error context stack support in two places: a) in StartupXLOG() before the rm_cleanup() calls, and b) in < 9.4, inside the individual cleanup routines. We do all that around redo routines, but, as evidenced here, that's not always enough.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 06/04/2014 02:14 PM, Greg Stark wrote:
> Ok, I made some progress. It turns out this was a pre-existing problem
> in the master. They've been getting "failed to re-find parent" errors
> for weeks. Far longer than I have any WAL or backups for.

Bummer. You don't have logs reaching back to when the first error happened either, I presume. The most likely cause for originally failing to insert a downlink is running out of disk space when trying to split the parent page. (Even that's pretty unlikely, though.)

> What I did find that was interesting is that this error basically made
> the backups worthless. I could build a hot standby and connect and
> query it. But as soon as recovery finished it would try to clean up
> the incomplete split and fail. Because it had noticed the incomplete
> split it had skipped every restartpoint and the next time I tried to
> start it it insisted on restarting recovery from the beginning. If we
> had been lucky enough not to do any page splits in the broken index
> while the backup was being taken all would have been fine. But that
> doesn't seem to have happened so all the backups were unrecoverable.

It's only *incomplete* splits that happen while the backup is taken that matter, i.e. a page is split but inserting the downlink to the parent fails for some reason. If a page is split and the downlink is inserted successfully, that's OK.

> So a few thoughts on how to improve things:
>
> 1) Failed to re-find parent should perhaps not be FATAL to recovery.
> In fact any index replay error would really be nice not to have to
> crash on. All crashing does is prevent the user from being able to
> bring up their database and REINDEX the btree. This may be another use
> case for the machinery that would protect against corrupt hash indexes
> or user-defined indexes -- if we could mark the index invalid and
> proceed (perhaps ignoring subsequent records for it) that would be
> great.

Yeah, that would be nice as an option.

> 2) When we see an abort record we could check for any cleanup actions
> triggered by that transaction and run them right away. I think the
> checkpoints (and maybe hot standby snapshots or vacuum cleanup
> records?) also include information about the oldest xid running; they
> would also let us prune the cleanup actions sooner. That would at
> least find the error sooner. In conjunction with (1) it would also
> mean subsequent restartpoints would be effective instead of
> suppressing restartpoints right to the end of recovery.

You can't do any cleanup actions until you've recovered all the WAL. A cleanup action means inserting a tuple to the parent page, and you can't do that in the middle of recovery. But we could detect the case and give a warning.

> 3) The lack of logs around an error during recovery makes it hard to
> decipher what's going on. It would be nice to see "Beginning Xlog
> cleanup (1 incomplete split to replay)" and, when it crashed, "Last
> safe point to restart recovery is 324/ABCDEF". As it was it was a
> pretty big mystery why the database crashed; the logs made it appear
> as if it had started up fine. And it was unclear why restarting it
> caused it to replay from the beginning -- I thought maybe something was
> wrong with our scripts.

Yeah. All of this is gone in 9.4 anyway, but if you can come up with something non-invasive that can be back-patched...

- Heikki