Thread: Recovery inconsistencies, standby much larger than primary
Since the point release we've run into a number of databases that, when we restore from a base backup, end up being larger than the primary database was, sometimes by a large factor. The data below is from 9.1.11 (both primary and standby) but we've seen the same thing on 9.2.6.

primary$ for i in 1261982 1364767 1366221 473158 ; do echo -n "$i " ; du -shc $i* | tail -1 ; done
1261982 29G total
1364767 23G total
1366221 12G total
473158 76G total

standby$ for i in 1261982 1364767 1366221 473158 ; do echo -n "$i " ; du -shc $i* | tail -1 ; done
1261982 55G total
1364767 28G total
1366221 17G total
473158 139G total

I've run the snaga xlogdump on the WAL records replayed before reaching a consistent point (we confirmed the extra storage had already appeared by then) and grepped for the above relfilenodes, but the extracts are still quite large. I believe these dumps don't contain any sensitive data; once I verify that I can upload one of them for inspection.

$ ls -lh [14]*
-rw-rw-r-- 1 heroku heroku 325M Jan 24 04:13 1261982
-rw-r--r-- 1 root   root   352M Jan 25 00:04 1364767
-rw-r--r-- 1 root   root   123M Jan 25 00:04 1366221
-rw-r--r-- 1 root   root   357M Jan 25 00:04 473158

The first three are btrees and the fourth is a heap, btw.

We're also seeing log entries about "wal contains reference to invalid pages" but these errors seem only vaguely correlated. Sometimes we get the errors but the tables don't grow noticeably, and sometimes we don't get the errors and the tables are much larger.

Much of the added space is uninitialized pages, as you might expect, but what I don't understand is how the database can start up without consistently running into the "reference to invalid pages" panic. We check both that there are no references after consistency is reached *and* that any references before consistency are resolved by a truncate or unlink before consistency.

The primary was never this large btw, so it's not just a case of leftover files from drops or truncates that might have failed on the standby.

I'm assuming this is somehow related to the multixact or transaction wraparound problems, but I don't really understand how they could be hitting when both the primary and standby are post-upgrade to the most recent point release, which has the fixes.

--
greg
Hi,

On 2014-01-24 19:23:28 -0500, Greg Stark wrote:
> Since the point release we've run into a number of databases that when
> we restore from a base backup end up being larger than the primary
> database was. Sometimes by a large factor. The data below is from
> 9.1.11 (both primary and standby) but we've seen the same thing on
> 9.2.6.

What's the procedure for creating those standbys? Were they of similar size after being cloned?

> primary$ for i in 1261982 1364767 1366221 473158 ; do echo -n "$i " ;
> du -shc $i* | tail -1 ; done
> 1261982 29G total
> 1364767 23G total
> 1366221 12G total
> 473158 76G total
>
> standby$ for i in 1261982 1364767 1366221 473158 ; do echo -n "$i " ;
> du -shc $i* | tail -1 ; done
> 1261982 55G total
> 1364767 28G total
> 1366221 17G total
> 473158 139G total
> ...
> The first three are btrees and the fourth is a heap btw.

Are those all of the same underlying heap relation?

> We're also seeing log entries about "wal contains reference to invalid
> pages" but these errors seem only vaguely correlated. Sometimes we get
> the errors but the tables don't grow noticeably and sometimes we don't
> get the errors and the tables are much larger.

Uhm. I am a bit confused. You see those in the standby's log? At !debug log levels? That'd imply that the standby is dead and needs to be recloned, no? How do you continue after that?

> Much of the added space is uninitialized pages as you might expect but
> what I don't understand is how the database can start up without running
> into the "reference to invalid pages" panic consistently. We check
> both that there are no references after consistency is reached *and*
> that any references before consistency are resolved by a truncate or
> unlink before consistency.

Well, it's pretty easy to get into a situation with lots of new pages: lots of concurrent inserts that all fail before logging WAL. The next insert will extend the relation and only initialize that last page.

It'd be interesting to look for TRUNCATE records using xlogdump. Could you show those for starters?

> I'm assuming this is somehow related to the multixact or transaction
> wraparound problems but I don't really understand how they could be
> hitting when both the primary and standby are post-upgrade to the most
> recent point release which have the fixes

That doesn't sound likely. For one the symptoms don't fit, for another, those problems are mostly 9.3+.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jan 26, 2014 at 9:45 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2014-01-24 19:23:28 -0500, Greg Stark wrote:
>> Since the point release we've run into a number of databases that when
>> we restore from a base backup end up being larger than the primary
>> database was. Sometimes by a large factor. The data below is from
>> 9.1.11 (both primary and standby) but we've seen the same thing on
>> 9.2.6.
>
> What's the procedure for creating those standbys? Were they of similar
> size after being cloned?

These are restored from base backup using WAL-E and then started in standby mode. The logs are retrieved using archive_command (which is WAL-E); after it retrieves lots of archived WAL the database switches to streaming. We confirmed from size monitoring that the standby database grew substantially before the time it reported reaching a consistent state, so I only downloaded the WAL from that range for analysis.

>> primary$ for i in 1261982 1364767 1366221 473158 ; do echo -n "$i " ;
>> du -shc $i* | tail -1 ; done
>> 1261982 29G total
>> 1364767 23G total
>> 1366221 12G total
>> 473158 76G total
>>
>> standby$ for i in 1261982 1364767 1366221 473158 ; do echo -n "$i " ;
>> du -shc $i* | tail -1 ; done
>> 1261982 55G total
>> 1364767 28G total
>> 1366221 17G total
>> 473158 139G total
>> ...
>> The first three are btrees and the fourth is a heap btw.
>
> Are those all of the same underlying heap relation?

Are you asking whether the relfilenode was reused for a different relation? I doubt it. Or are you asking if the first three indexes are for the same heap (presumably the fourth one)? I don't think so, but I can check.

>> We're also seeing log entries about "wal contains reference to invalid
>> pages" but these errors seem only vaguely correlated. Sometimes we get
>> the errors but the tables don't grow noticeably and sometimes we don't
>> get the errors and the tables are much larger.
>
> Uhm. I am a bit confused. You see those in the standby's log? At !debug
> log levels? That'd imply that the standby is dead and needs to be
> recloned, no? How do you continue after that?

It's possible I'm confusing symptoms from an unrelated problem. But the symptom we saw was that it got this error, recovery crashed, then recovery started again and it replayed fine. I agree that doesn't jive with the code I see in 9.3; I didn't check how long the code has been this way though.

>> Much of the added space is uninitialized pages as you might expect but
>> what I don't understand is how the database can start up without running
>> into the "reference to invalid pages" panic consistently. We check
>> both that there are no references after consistency is reached *and*
>> that any references before consistency are resolved by a truncate or
>> unlink before consistency.
>
> Well, it's pretty easy to get into a situation with lots of new
> pages. Lots of concurrent inserts that all fail before logging WAL. The
> next insert will extend the relation and only initialize that last
> page.
>
> It'd be interesting to look for TRUNCATE records using xlogdump. Could
> you show those for starters?

There are no records matching grep -i truncate in any of those extracts for those relfilenodes. I'm grepping the whole xlogdump now but it'll take a while. So far no truncates anywhere.
>> I'm assuming this is somehow related to the multixact or transaction
>> wraparound problems but I don't really understand how they could be
>> hitting when both the primary and standby are post-upgrade to the most
>> recent point release which have the fixes
>
> That doesn't sound likely. For one the symptoms don't fit, for another,
> those problems are mostly 9.3+.

These problems all started to appear after the latest point release btw. That could just be a coincidence of course.

--
greg
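As an aside, both of Andres's questions -- whether the relfilenodes still belong to the relations we think they do, and whether the three btrees are all on the same heap -- can be answered from the catalogs in the affected database. A sketch, not output from the cluster in question:

    SELECT c.relfilenode, c.oid::regclass AS rel, c.relkind,
           i.indrelid::regclass AS heap
    FROM pg_class c
         LEFT JOIN pg_index i ON i.indexrelid = c.oid
    WHERE c.relfilenode IN (1261982, 1364767, 1366221, 473158);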
On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
>> We're also seeing log entries about "wal contains reference to invalid
>> pages" but these errors seem only vaguely correlated. Sometimes we get
>> the errors but the tables don't grow noticeably and sometimes we don't
>> get the errors and the tables are much larger.
>
> Uhm. I am a bit confused. You see those in the standby's log? At !debug
> log levels? That'd imply that the standby is dead and needs to be
> recloned, no? How do you continue after that?

So in chatting with Heikki last night we came up with a scenario where this check is insufficient. If you have multiple checkpoints during the base backup then there will be restartpoints during recovery. If the reference to the invalid page is before the restartpoint, then after crashing recovery and coming back up, the recovery will go forward fine.

Fixing this check doesn't look trivial. I'm inclined to say to suppress any restartpoints while there are references to invalid pages in the hash. The problem with this is that it will prevent trimming the xlog during recovery. It seems frightening that most days recovery will take little extra space, but if you happen to have a drop table or truncate during the base backup then your recovery might require a lot of extra space.

The alternative of spilling the hash table to disk at every restartpoint seems kind of hokey. Then we need to worry about fsyncing this file, cleaning it up, dealing with the file after crashes, etc.

--
greg
On 2014-01-31 11:09:14 +0000, Greg Stark wrote:
> On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> >> We're also seeing log entries about "wal contains reference to invalid
> >> pages" but these errors seem only vaguely correlated. Sometimes we get
> >> the errors but the tables don't grow noticeably and sometimes we don't
> >> get the errors and the tables are much larger.
> >
> > Uhm. I am a bit confused. You see those in the standby's log? At !debug
> > log levels? That'd imply that the standby is dead and needs to be
> > recloned, no? How do you continue after that?
>
> So in chatting with Heikki last night we came up with a scenario where
> this check is insufficient.

But that seems unrelated to the issue at hand, right?

> If you have multiple checkpoints during the base backup then there
> will be restartpoints during recovery. If the reference to the invalid
> page is before the restartpoint then after crashing recovery and coming
> back up the recovery will go forward fine.

We don't perform restartpoints if there are invalid pages registered. Check the XLogHaveInvalidPages() call in xlog.c.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-31 11:09:14 +0000, Greg Stark wrote:
> On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> >> We're also seeing log entries about "wal contains reference to invalid
> >> pages" but these errors seem only vaguely correlated. Sometimes we get
> >> the errors but the tables don't grow noticeably and sometimes we don't
> >> get the errors and the tables are much larger.
> >
> > Uhm. I am a bit confused. You see those in the standby's log? At !debug
> > log levels? That'd imply that the standby is dead and needs to be
> > recloned, no? How do you continue after that?
>
> So in chatting with Heikki last night we came up with a scenario where
> this check is insufficient.

The slightly more likely explanation for transient errors is that you hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That only takes effect if HS has already assembled a snapshot, which can make such an error vanish after a restart...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 31, 2014 at 11:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> The slightly more likely explanation for transient errors is that you
> hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That
> only takes effect if HS has already assembled a snapshot, which can make
> such an error vanish after a restart...

Which one, there seem to be several....

So this seems like it's more likely to be a symptom of whatever is causing the table to grow than a cause? That is, there's some bug causing the standby to extend the btree dramatically resulting in lots of uninitialized pages, and touching those pages triggers this bug. But this doesn't explain why the btree is being extended, I don't think.

--
greg
On 2014-01-31 11:46:09 +0000, Greg Stark wrote:
> On Fri, Jan 31, 2014 at 11:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The slightly more likely explanation for transient errors is that you
> > hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That
> > only takes effect if HS has already assembled a snapshot, which can make
> > such an error vanish after a restart...
>
> Which one, there seem to be several....
>
> So this seems like it's more likely to be a symptom of whatever is
> causing the table to grow than a cause? That is, there's some bug
> causing the standby to extend the btree dramatically resulting in lots
> of uninitialized pages and touching those pages triggers this bug. But
> this doesn't explain why the btree is being extended I don't think.

I don't think anything we've talked about so far is likely to explain the issue. I don't have time atm to look closer, but what I'd do is try to look if there are any pages with valid LSNs on the standby in the bloated area... That then might give you a hint where to look.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
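For reference, that scan can be done with pageinspect on the standby. A sketch -- the index name and block range are placeholders, and a page that has never been initialized will show an lsn of 0/0:

    SELECT blkno,
           (page_header(get_raw_page('some_bloated_index', 'main', blkno))).lsn AS lsn
    FROM generate_series(3650000, 3650999) AS blkno;

Any block in the bloated tail whose lsn is not 0/0 is then worth cross-checking against the xlogdump output.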
1261982.53 is entirely NULs. I think that's true for most if not all of the intervening files, still investigating.

The 54th segment is NUL up to offset 1f0c0000, after which it has valid looking blocks:

# hexdump 1261982.54 | head -100
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1f0c0000 0ea1 0000 8988 0063 0006 0000 04d8 0cf0

However when I grep xlogdump for any records mentioning this block I get nothing. In fact the largest block I find in the xlog is 3646630:

# grep 'tid 3646631/' 1261982 | wc -l
0
# grep 'tid 3646630/' 1261982 | wc -l
177

Looking at the block above, it looks like the LSN is EA1/638988, which I find in the logs, but it's a btree insert on a different btree:

[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194, info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982 blk:3634978 hole_off/len:1240/2072
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid 2746914/219
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/637140] bkpblock[1]: s/d/r:1663/16385/1364767 blk:2746914 hole_off/len:1180/2372
[cur:EA1/63A0A8, xid:1418089147, rmid:1(Transaction), len/tot_len:32/64, info:0, prev:EA1/638988] d/s:16385/1663 commit at 2014-01-21 05:41:11 UTC
On 2014-01-31 14:39:47 +0000, Greg Stark wrote:
> 1261982.53 is entirely NULs. I think that's true for most if not all
> of the intervening files, still investigating.
>
> The 54th segment is NUL up to offset 1f0c0000 after which it has valid
> looking blocks:

It'd be interesting to dump the page header for that using pageinspect.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 31, 2014 at 2:39 PM, Greg Stark <stark@mit.edu> wrote:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072
> [cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
> info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
> 2746914/219

Actually wait, the immediately previous record is indeed on the right filenode. Is the LSN printed in xlogdump the LSN that would be in the pd_lsn, or is the pd_lsn going to be from the following record?

--
greg
On 2014-01-31 14:59:21 +0000, Greg Stark wrote:
> On Fri, Jan 31, 2014 at 2:39 PM, Greg Stark <stark@mit.edu> wrote:
> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> > info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> > blk:3634978 hole_off/len:1240/2072
> > [cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
> > info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
> > 2746914/219
>
> Actually wait, the immediately previous record is indeed on the right
> filenode. Is the LSN printed in xlogdump the LSN that would be in the
> pd_lsn or is the pd_lsn going to be from the following record?

It points to the end of the record (i.e. the beginning of the next). It needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush enough.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 31, 2014 at 3:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> It points to the end of the record (i.e. the beginning of the next). It
> needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
> enough.

Ah, in which case the relevant record is:

[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194, info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid 3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194, info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982 blk:3634978 hole_off/len:1240/2072

It looks like get_raw_page() refuses to read past the end of relpages. I could make a clone of this database to allow experimenting with tweaking relpages but it may or may not reproduce the problem...

=# select pg_relation_size('data_pkey') / 1024 / 1024 / 1024;
 ?column?
----------
       23
(1 row)

=# select get_raw_page('data_pkey', 'main', 11073632) ;
ERROR:  block number 11073632 is out of range for relation "data_pkey"

d9de7pcqls4ib6=# select relpages from pg_class where relname = 'data_pkey';
 relpages
----------
  2889286

--
greg
On 2014-01-31 15:15:24 +0000, Greg Stark wrote:
> On Fri, Jan 31, 2014 at 3:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > It points to the end of the record (i.e. the beginning of the next). It
> > needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
> > enough.
>
> Ah, in which case the relevant record is:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> 3634978/282
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072
>
> It looks like get_raw_page() refuses to read past the end of relpages.
> I could make a clone of this database to allow experimenting with
> tweaking relpages but it may or may not reproduce the problem...

No, it uses smgrnblocks() to get the size.

> =# select get_raw_page('data_pkey', 'main', 11073632) ;
> ERROR:  block number 11073632 is out of range for relation "data_pkey"

Isn't the page 3634978?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> =# select get_raw_page('data_pkey', 'main', 11073632) ;
>> ERROR:  block number 11073632 is out of range for relation "data_pkey"
>
> Isn't the page 3634978?

The page in the record is. But the page on disk is in the 54th segment at offset 1F0C0000. So unless my arithmetic is wrong:

bc -l
ibase=16
400 * 400 * 400 / 2000 * 54 + 1F0C0000 / 2000
11073632

--
greg
On 2014-01-31 15:21:35 +0000, Greg Stark wrote:
> On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> =# select get_raw_page('data_pkey', 'main', 11073632) ;
> >> ERROR:  block number 11073632 is out of range for relation "data_pkey"
> >
> > Isn't the page 3634978?
>
> The page in the record is.

It'd be interesting to look at the referenced page using bt_page_items().

> But the page on disk is in the 54th segment at offset 1F0C0000

It's interesting that the smgr gets this wrong then (as also evidenced by the fact that relation_size does as well). Could you please do a ls -l path/to/relfilenode*?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> It's interesting that the smgr gets this wrong then (as also evidenced
> by the fact that relation_size does as well). Could you please do a ls
> -l path/to/relfilenode*?

IIRC, smgrnblocks will stop as soon as it finds a segment that is not 1GB in size. Could you check the lengths of all segments for that relation?

			regards, tom lane
On 2014-01-31 10:33:16 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It's interesting that the smgr gets this wrong then (as also evidenced
> > by the fact that relation_size does as well). Could you please do a ls
> > -l path/to/relfilenode*?
>
> IIRC, smgrnblocks will stop as soon as it finds a segment that is not
> 1GB in size. Could you check the lengths of all segments for that
> relation?

Yea, that's what I am wondering about. I wanted the full list because there could be an entire file missing, and it's interesting to see at which time they were last touched relative to each other...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Greg Stark <stark@mit.edu> writes:
> On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Isn't the page 3634978?

> The page in the record is.
> But the page on disk is in the 54th segment at offset 1F0C0000
> So unless my arithmetic is wrong:
> bc -l
> ibase=16
> 400 * 400 * 400 / 2000 * 54 + 1F0C0000 / 2000
> 11073632

At least two of us are confused. I get

# select ((2^30) * 54.0 + 'x1F0C0000'::bit(32)::int) / 8192;
 ?column?
----------
  7141472
(1 row)

			regards, tom lane
Sorry guys. I transposed two numbers when looking up the relation. "data_pkey" wasn't the right index.

=# select (page_header(get_raw_page('index_data_id', 'main', 3020854))).* ;
     lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
--------------+-----+-------+-------+-------+---------+----------+---------+-----------
 CF0/2DD67BB8 |   5 |     0 |   792 |  5104 |    8176 |     8192 |       4 |         0
(1 row)
Greg Stark <stark@mit.edu> writes:
> Sorry guys. I transposed two numbers when looking up the relation.
> "data_pkey" wasn't the right index.

> =# select (page_header(get_raw_page('index_data_id', 'main', 3020854))).* ;
>      lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
> --------------+-----+-------+-------+-------+---------+----------+---------+-----------
>  CF0/2DD67BB8 |   5 |     0 |   792 |  5104 |    8176 |     8192 |       4 |         0
> (1 row)

Clearly not the right page --- the LSN isn't what we saw in the hexdump.

			regards, tom lane
On Fri, Jan 31, 2014 at 3:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 400 * 400 * 400 / 2000 * 54 + 1F0C0000 / 2000
>> 11073632

Ooops, it's reading 54 in hex there.

> # select ((2^30) * 54.0 + 'x1F0C0000'::bit(32)::int) / 8192;
>  ?column?
> ----------
>  7141472

ibase=16
400 * 400 * 400 / 2000 * 36 + 1F0C0000 / 2000
7141472.00000000000000000000

So now that I've got the arithmetic right:

=# select (page_header(get_raw_page('index_data_id', 'main', 7141472))).* ;
    lsn     | tli | flags | lower | upper | special | pagesize | version | prune_xid
------------+-----+-------+-------+-------+---------+----------+---------+-----------
 EA1/638988 |   6 |     0 |  1240 |  3312 |    8176 |     8192 |       4 |         0

--
greg
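For anyone following along, the same segment/offset-to-block conversion can be done directly in SQL, assuming the default 1 GB segment size and 8 kB block size. A sketch using the values from 1261982.54 above:

    SELECT ((2^30)::bigint * 54 + x'1F0C0000'::int) / 8192 AS blockno;
    -- returns 7141472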
On 2014-01-31 10:33:16 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It's interesting that the smgr gets this wrong then (as also evidenced
> > by the fact that relation_size does as well). Could you please do a ls
> > -l path/to/relfilenode*?
>
> IIRC, smgrnblocks will stop as soon as it finds a segment that is not
> 1GB in size. Could you check the lengths of all segments for that
> relation?

I am not sure that explains the issue, but I think the redo action for truncation is not safe across crashes. A XLOG_SMGR_TRUNCATE will just do a smgrtruncate() (and then mdtruncate) which will iterate over the segments starting at 0 till mdnblocks()/segment_size and *truncate* but not delete individual segment files that are not needed anymore, right? If we crash in the midst of that, a new mdtruncate() will be issued, but it will get a shorter value back from mdnblocks().

Am I missing something?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
So just to summarize, this xlog record:

[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194, info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid 3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194, info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982 blk:3634978 hole_off/len:1240/2072

Appears to have been written to this location:

d9de7pcqls4ib6=# select (page_header(get_raw_page('index_event_data_on_event_occurrence_id', 'main', 7141472))).* ;
    lsn     | tli | flags | lower | upper | special | pagesize | version | prune_xid
------------+-----+-------+-------+-------+---------+----------+---------+-----------
 EA1/638988 |   6 |     0 |  1240 |  3312 |    8176 |     8192 |       4 |         0
(1 row)

The correct location appears to have been written to by later records (a split where it's the leftsib, to be specific) so any forensic evidence is lost:

d9de7pcqls4ib6=# select (page_header(get_raw_page('index_event_data_on_event_occurrence_id', 'main', 3634978))).* ;
    lsn     | tli | flags | lower | upper | special | pagesize | version | prune_xid
------------+-----+-------+-------+-------+---------+----------+---------+-----------
 EA1/A90CF8 |   6 |     0 |   792 |  5104 |    8176 |     8192 |       4 |         0
(1 row)

But the record prior to that lsn is a split where the leftsib matches that record:

[cur:EA1/A8EDE0, xid:1418089314, rmid:11(Btree), len/tot_len:3502/7938, info:68, prev:EA1/A8D7A0] split_r: s/d/r:1663/16385/1261982 leftsib 3634978
[cur:EA1/A8EDE0, xid:1418089314, rmid:11(Btree), len/tot_len:3502/7938, info:68, prev:EA1/A8D7A0] bkpblock[2]: s/d/r:1663/16385/1261982 blk:1881360 hole_off/len:892/3812

Interestingly there are a bunch of other records that also write to that block but there are no other full page writes. That seems to imply that the index is currently missing the leaf inserted by the EA1/637140 record.
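If any item-level evidence does remain, Andres's earlier suggestion of bt_page_items() would show the entries on the two pages side by side. A sketch, using the same index name as above:

    SELECT * FROM bt_page_items('index_event_data_on_event_occurrence_id', 7141472);
    SELECT * FROM bt_page_items('index_event_data_on_event_occurrence_id', 3634978);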
Greg Stark <stark@mit.edu> writes:
> So just to summarize, this xlog record:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> 3634978/282
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072

> Appears to have been written to [ block 7141472 ]

I've been staring at the code for a bit trying to guess how that could have happened. Since the WAL record has a backup block, btree_xlog_insert would have passed control to RestoreBackupBlock, which would call XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint about writing past the end of the relation. Now, you can imagine some very low-level error causing a write to go to the wrong page due to a seek problem or some such, but it's hard to credit that that would've resulted in creation of all the intervening segment files. Some level of our code had to have thought it was being told to extend the relation.

However, on closer inspection I was a bit surprised to realize that there are two possible candidates for doing that! XLogReadBufferExtended will extend the relation, a block at a time, if told to write a page past the current nominal EOF. And in md.c, _mdfd_getseg will *also* extend the relation if we're InRecovery, even though it normally would not do so when called from mdwrite().

Given the behavior in XLogReadBufferExtended, I rather think that the InRecovery special case in _mdfd_getseg is dead code and should be removed. But for the purpose at hand, it's more interesting to try to confirm which of these code levels did the extension. I notice that _mdfd_getseg only bothers to write the last physical page of each segment, whereas XLogReadBufferExtended knows nothing of segments and will ploddingly write every page. So on a filesystem that supports "holes" in files, I'd expect that the added segments would be fully allocated if XLogReadBufferExtended did the deed, but they'd be quite small if _mdfd_getseg did so. The du results you started with suggest that the former is the case, but could you verify that the filesystem this is on supports holes and that du will report only the actually allocated space when there's a hole?

Assuming that the extension was done in XLogReadBufferExtended, we are forced to the conclusion that XLogReadBufferExtended was passed a bad block number (viz 7141472); and it's pretty hard to see how that could happen. RestoreBackupBlock is just passing the value it got out of the WAL record. I thought about the idea that it was wrong about exactly where the BkpBlock struct was in the record, but that would presumably lead to garbage relnode and fork numbers, not just a bad block number.

So I'm still baffled ...

			regards, tom lane
<p dir="ltr">One thing I keep coming back to is a bad ran chip setting a bit in the block number. But I just can't seem toget it to add up. The difference is not a power of two, it had happened on two different machines, and we don't see otherweirdness on the machine. It seems like a strange coincidence it would happen to the same variable twice and not toother variables.<p dir="ltr">Unless there's some unrelated code writing through a wild pointer, possibly to a stack allocatedobject that just happens to often be that variable?<p dir="ltr">-- <br /> greg<div class="gmail_quote">On 31 Jan2014 20:21, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<br type="attribution" /><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Greg Stark <<ahref="mailto:stark@mit.edu">stark@mit.edu</a>> writes:<br /> > So just to summarize, this xlog record:<br />> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,<br /> > info:8, prev:EA1/635290] insert_leaf:s/d/r:1663/16385/1261982 tid<br /> > 3634978/282<br /> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree),len/tot_len:18/6194,<br /> > info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982<br /> >blk:3634978 hole_off/len:1240/2072<br /><br /> > Appears to have been written to [ block 7141472 ]<br /><br /> I'vebeen staring at the code for a bit trying to guess how that could<br /> have happened. Since the WAL record has a backupblock, btree_xlog_insert<br /> would have passed control to RestoreBackupBlock, which would call<br /> XLogReadBufferExtendedwith mode RBM_ZERO, so there would be no complaint<br /> about writing past the end of the relation. Now, you can imagine some<br /> very low-level error causing a write to go to the wrong page due to a seek<br />problem or some such, but it's hard to credit that that would've resulted<br /> in creation of all the intervening segmentfiles. Some level of our code<br /> had to have thought it was being told to extend the relation.<br /><br /> However,on closer inspection I was a bit surprised to realize that there<br /> are two possible candidates for doing that! XLogReadBufferExtended will<br /> extend the relation, a block at a time, if told to write a page past<br /> the currentnominal EOF. And in md.c, _mdfd_getseg will *also* extend<br /> the relation if we're InRecovery, even though itnormally would not do<br /> so when called from mdwrite().<br /><br /> Given the behavior in XLogReadBufferExtended, Irather think that the<br /> InRecovery special case in _mdfd_getseg is dead code and should be<br /> removed. But for thepurpose at hand, it's more interesting to try to<br /> confirm which of these code levels did the extension. I noticethat<br /> _mdfd_getseg only bothers to write the last physical page of each segment,<br /> whereas XLogReadBufferExtendedknows nothing of segments and will<br /> ploddingly write every page. So on a filesystem that supports"holes"<br /> in files, I'd expect that the added segments would be fully allocated<br /> if XLogReadBufferExtendeddid the deed, but they'd be quite small if<br /> _mdfd_getseg did so. 
The du results you started withsuggest that the<br /> former is the case, but could you verify that the filesystem this is<br /> on supports holes andthat du will report only the actually allocated<br /> space when there's a hole?<br /><br /> Assuming that the extensionwas done in XLogReadBufferExtended, we are<br /> forced to the conclusion that XLogReadBufferExtended was passeda bad<br /> block number (viz 7141472); and it's pretty hard to see how that could<br /> happen. RestoreBackupBlockis just passing the value it got out of the<br /> WAL record. I thought about the idea that it was wrongabout exactly<br /> where the BkpBlock struct was in the record, but that would presumably<br /> lead to garbage relnodeand fork numbers not just a bad block number.<br /><br /> So I'm still baffled ...<br /><br /> regards, tom lane<br /></blockquote></div>
Greg Stark <stark@mit.edu> writes:
> One thing I keep coming back to is a bad RAM chip setting a bit in the
> block number. But I just can't seem to get it to add up. The difference is
> not a power of two, it has happened on two different machines, and we don't
> see other weirdness on the machines. It seems like a strange coincidence it
> would happen to the same variable twice and not to other variables.

I also looked at the bit patterns for the two block numbers, and couldn't detect any relationship.

> Unless there's some unrelated code writing through a wild pointer, possibly
> to a stack-allocated object that just happens to often be that variable?

Yeah, I'd been wondering if the WAL record somehow got corrupted while in memory (presumably after being CRC-checked). It's a bit hard to see how though.

Are all the bloated-on-the-slave relations indexes? I think the most fruitful thing to do at this point is to try to isolate the bloating events for the other affected rels as you've done for this one. Maybe we'll see a pattern.

			regards, tom lane
On 01/31/2014 01:11 PM, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> One thing I keep coming back to is a bad RAM chip setting a bit in the
>> block number. But I just can't seem to get it to add up. The difference is
>> not a power of two, it has happened on two different machines, and we don't
>> see other weirdness on the machines. It seems like a strange coincidence it
>> would happen to the same variable twice and not to other variables.
>
> I also looked at the bit patterns for the two block numbers, and couldn't
> detect any relationship.
>
>> Unless there's some unrelated code writing through a wild pointer, possibly
>> to a stack-allocated object that just happens to often be that variable?
>
> Yeah, I'd been wondering if the WAL record somehow got corrupted while
> in memory (presumably after being CRC-checked). It's a bit hard to see
> how though.
>
> Are all the bloated-on-the-slave relations indexes? I think the most
> fruitful thing to do at this point is to try to isolate the bloating
> events for the other affected rels as you've done for this one.
> Maybe we'll see a pattern.

FWIW, we've periodically seen reports from our clients of replica databases being slightly larger than the master. Nothing reproducible or as severe as Greg's issue, or we'd have reported it. But this could be a more widespread issue, just that it affects most users in the +5% ballpark, so they don't notice.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
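As an aside, a crude way to screen for this on a replica is to compare each relation's on-disk size against the relpages estimate shipped over from the master. A sketch, with the caveat that relpages is only as fresh as the last VACUUM/ANALYZE, so small discrepancies are expected:

    SELECT oid::regclass AS rel, relpages,
           pg_relation_size(oid) / 8192 AS blocks_on_disk
    FROM pg_class
    WHERE relpages > 0
      AND pg_relation_size(oid) / 8192 > relpages * 1.05
    ORDER BY pg_relation_size(oid) DESC
    LIMIT 20;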
On Fri, Jan 31, 2014 at 10:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I'd been wondering if the WAL record somehow got corrupted while
> in memory (presumably after being CRC-checked). It's a bit hard to see
> how though.

One thing I mentioned early on but bears repeating is that this instance is 9.1.11.

Also something that occurred to me at 3am -- the "reference to invalid pages" recovery errors that replayed correctly after the panic might also explain why the slave seems to operate correctly. It's possible after the panic it replayed those same records correctly.

> Are all the bloated-on-the-slave relations indexes? I think the most
> fruitful thing to do at this point is to try to isolate the bloating
> events for the other affected rels as you've done for this one.
> Maybe we'll see a pattern.

I'll poke at those others tomorrow/today. I can also try to bring up a new standby from the same base backup but it'll take time. It's a large database. Also the fear I have is that if I set a recovery target I might make it miss the bug.

--
greg
Josh Berkus <josh@agliodbs.com> writes:
> FWIW, we've periodically seen reports from our clients of replica
> databases being slightly larger than the master. Nothing reproducible
> or as severe as Greg's issue, or we'd have reported it. But this could
> be a more widespread issue, just that it affects most users in the +5%
> ballpark, so they don't notice.

Well, the "bloat" aspect of it is really the least of our problems. AFAICT what is happening is flat-out index corruption, that is, an intended update isn't applied where it should be but to some seemingly-random other page. If this is common then it seems like we ought to be hearing a lot of reports of query misbehavior on slaves.

			regards, tom lane
On Fri, Jan 31, 2014 at 8:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So on a filesystem that supports "holes"
> in files, I'd expect that the added segments would be fully allocated
> if XLogReadBufferExtended did the deed, but they'd be quite small if
> _mdfd_getseg did so. The du results you started with suggest that the
> former is the case, but could you verify that the filesystem this is
> on supports holes and that du will report only the actually allocated
> space when there's a hole?

Yup, it's xfs:

# dd if=/dev/zero seek=1M count=1 bs=1kB of=hole
1+0 records in
1+0 records out
1000 bytes (1.0 kB) copied, 5.7286e-05 s, 17.5 MB/s
# ls -ls hole
4 -rw-r--r-- 1 root root 1048577000 Feb  1 09:35 hole

# ls -ls 1261982.5*
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 14 12:51 1261982.5
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:05 1261982.50
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:07 1261982.51
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:09 1261982.52
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:10 1261982.53
 508680 -rw------- 1 ua8157t9mbut8r postgres  520888320 Jan 23 02:12 1261982.54

--
greg
The plot thickens...

Looking at the next relation I see a similar symptom of a single valid block at the end of several segments of NULs. This relation is also a btree on the same table, and its final block's header LSN is in the near vicinity in the xlog:

d9de7pcqls4ib6=# select (page_header(get_raw_page('event_data_event_id_person_id', 'main', 3631161))).* ;
    lsn     | tli | flags | lower | upper | special | pagesize | version | prune_xid
------------+-----+-------+-------+-------+---------+----------+---------+-----------
 EA1/63A0A8 |   6 |     0 |  1180 |  3552 |    8176 |     8192 |       4 |         0
(1 row)

And indeed it's the very next xlog record:

[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid 2746914/219
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/637140] bkpblock[1]: s/d/r:1663/16385/1364767 blk:2746914 hole_off/len:1180/2372
[cur:EA1/63A0A8, xid:1418089147, rmid:1(Transaction), len/tot_len:32/64, info:0, prev:EA1/638988] d/s:16385/1663 commit at 2014-01-21 05:41:11 UTC
I've poked at this a bit more. There are at least 10 relations where the last block doesn't match the block mentioned in the xlog record that its LSN indicates -- at least it looks like that from the info xlogdump prints. Including two blocks where the "correct" block has the same LSN, which maybe means the record was correctly replayed the second time. Or perhaps it just means I'm underestimating the complexity of btree splits.

[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree), len/tot_len:3750/12786, info:77, prev:EAD/51142490] split_r: s/d/r:1663/16385/1521566 leftsib 1697585
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree), len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[1]: s/d/r:1663/16385/1521566 blk:1697585 hole_off/len:576/4288
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree), len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[2]: s/d/r:1663/16385/1521566 blk:786380 hole_off/len:740/3140

 relfilenode | blockno |     lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+---------+--------------+-----+-------+-------+-------+---------+----------+---------+-----------
     1521566 | 1697585 | EAD/511456E8 |   6 |     0 |   576 |  4864 |    8176 |     8192 |       4 |         0
     1521566 | 1704143 | EAD/511456E8 |   6 |     0 |   644 |  4456 |    8176 |     8192 |       4 |         0

[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree), len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] split_r: s/d/r:1663/16385/4995658 leftsib 139569
[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree), len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] bkpblock[2]: s/d/r:1663/16385/4995658 blk:18152 hole_off/len:28/8028

 relfilenode | blockno |     lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+---------+--------------+-----+-------+-------+-------+---------+----------+---------+-----------
     4995658 |  139569 | EAD/520F2058 |   6 |     0 |   152 |  4336 |    8176 |     8192 |       4 |         0
     4995658 |  139584 | EAD/520F2058 |   6 |     0 |   164 |  3976 |    8176 |     8192 |       4 |         0
Hm, I'm not entirely convinced those are erroneous replays to wrong blocks. They don't look right but there are no blocks of NULs preceding them. So if they're wrong then they only extended the relations by a single block.

The relfilenodes that have NUL blocks before the last block are:

 relfilenode | blockno  | segnum |  offset  | lsn | tli | flags | lower | upper | special | pagesize | version | prune_xid
-------------+----------+--------+----------+-----+-----+-------+-------+-------+---------+----------+---------+-----------
     1261982 |  7141472 |     54 |  7141472 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
      473158 | 18090059 |    138 | 18090059 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1366221 |  2208159 |     16 |  2208159 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1364767 |  3631161 |     27 |  3631161 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
     1519286 |   311808 |      2 |   311808 | 0/0 |   0 |     0 |     0 |     0 |       0 |        0 |       0 |         0
Greg Stark <stark@mit.edu> writes:
> The relfilenodes that have nul blocks before the last block are:

Can we see the associated WAL records (ie, the ones matching the LSNs in the last blocks of these files)?

			regards, tom lane
On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> The relfilenodes that have nul blocks before the last block are:
>
> Can we see the associated WAL records (ie, the ones matching the LSNs
> in the last blocks of these files)?

Sorry, I've lost track of what information I already shared or didn't; I've been staring at these records all day. It would be so much easier if xlogdump was a fdw or set returning function so I could do joins with pageinspect data.

This is the information from the final block headers:

 relfilenode | blockno  | segnum |  offset  |    lsn     | tli | flags | lower | upper | special | pagesize | version |  prune_xid
-------------+----------+--------+----------+------------+-----+-------+-------+-------+---------+----------+---------+------------
      473158 | 18090059 |    138 | 18090059 | EA1/625E28 |   6 |     0 |   144 |   896 |    8192 |     8192 |       4 | 1401029863
     1366221 |  2208159 |     16 |  2208159 | EA1/62DDF0 |   6 |     0 |  1180 |  3552 |    8176 |     8192 |       4 |          0
     1261982 |  7141472 |     54 |  7141472 | EA1/638988 |   6 |     0 |  1240 |  3312 |    8176 |     8192 |       4 |          0
     1364767 |  3631161 |     27 |  3631161 | EA1/63A0A8 |   6 |     0 |  1180 |  3552 |    8176 |     8192 |       4 |          0
     1519286 |   311808 |      2 |   311808 | EA1/708B08 |   6 |     1 |   312 |  3040 |    8192 |     8192 |       4 |          0

These are all the records from the tx that corresponds to the lsn in the first relfilenode:

[cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524, info:72, prev:EA1/624080] hot_update: s/d/r:1663/16385/473158 block 9386399 off 29 to block 9386399 off 30
[cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524, info:72, prev:EA1/624080] bkpblock[1]: s/d/r:1663/16385/473158 blk:9386399 hole_off/len:144/752
[cur:EA1/625E28, xid:1418089146, rmid:1(Transaction), len/tot_len:32/64, info:0, prev:EA1/6240C0] d/s:16385/1663 commit at 2014-01-21 05:41:11 UTC

The middle three blocks have LSNs that are all part of the same tx; here are all the records for this tx:

[cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441, info:8, prev:EA1/625E68] insert: s/d/r:1663/16385/473158 blk/off:9386398/33 header: none
[cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441, info:8, prev:EA1/625E68] bkpblock[1]: s/d/r:1663/16385/473158 blk:9386398 hole_off/len:156/1828
[cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214, info:8, prev:EA1/625F28] insert_leaf: s/d/r:1663/16385/473176 tid 1319804/405
[cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214, info:8, prev:EA1/625F28] bkpblock[1]: s/d/r:1663/16385/473176 blk:1319804 hole_off/len:1644/52
[cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494, info:8, prev:EA1/627868] insert_leaf: s/d/r:1663/16385/473182 tid 1186167/147
[cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494, info:8, prev:EA1/627868] bkpblock[1]: s/d/r:1663/16385/473182 blk:1186167 hole_off/len:1300/1772
[cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314, info:8, prev:EA1/629898] insert_leaf: s/d/r:1663/16385/1270734 tid 1294137/2
[cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314, info:8, prev:EA1/629898] bkpblock[1]: s/d/r:1663/16385/1270734 blk:1294137 hole_off/len:1064/2952
[cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/62B210] insert_leaf: s/d/r:1663/16385/1366221 tid 1364573/66
[cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/62B210] bkpblock[1]: s/d/r:1663/16385/1366221 blk:1364573 hole_off/len:1180/2372
[cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814, info:8, prev:EA1/62C6E8] insert_leaf: s/d/r:1663/16385/1404440 tid 1195953/51
[cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814, info:8, prev:EA1/62C6E8] bkpblock[1]: s/d/r:1663/16385/1404440 blk:1195953 hole_off/len:964/3452
[cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862, info:8, prev:EA1/62DDF0] insert_leaf: s/d/r:1663/16385/1410405 tid 1894183/39
[cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862, info:8, prev:EA1/62DDF0] bkpblock[1]: s/d/r:1663/16385/1410405 blk:1894183 hole_off/len:988/1404
[cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254, info:8, prev:EA1/62F0D8] insert_leaf: s/d/r:1663/16385/1521566 tid 1691110/132
[cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254, info:8, prev:EA1/62F0D8] bkpblock[1]: s/d/r:1663/16385/1521566 blk:1691110 hole_off/len:1044/1012
[cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174, info:8, prev:EA1/630BC0] insert_leaf: s/d/r:1663/16385/5285587 tid 386419/102
[cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174, info:8, prev:EA1/630BC0] bkpblock[1]: s/d/r:1663/16385/5285587 blk:386419 hole_off/len:1036/3092
[cur:EA1/633C68, xid:1418089147, rmid:15(Sequence), len/tot_len:158/190, info:0, prev:EA1/632830] seq
[cur:EA1/633D28, xid:1418089147, rmid:10(Heap), len/tot_len:188/220, info:128, prev:EA1/633C68] insert(init): s/d/r:1663/16385/16523 blk/off:12933288/1 header: t_infomask2 11 t_infomask 2051 t_hoff 32
[cur:EA1/633E08, xid:1418089147, rmid:11(Btree), len/tot_len:18/5234, info:8, prev:EA1/633D28] insert_leaf: s/d/r:1663/16385/1261892 tid 3008033/256
[cur:EA1/633E08, xid:1418089147, rmid:11(Btree), len/tot_len:18/5234, info:8, prev:EA1/633D28] bkpblock[1]: s/d/r:1663/16385/1261892 blk:3008033 hole_off/len:1048/3032
[cur:EA1/635290, xid:1418089147, rmid:11(Btree), len/tot_len:18/7834, info:8, prev:EA1/633E08] insert_leaf: s/d/r:1663/16385/184672 tid 1704550/5
[cur:EA1/635290, xid:1418089147, rmid:11(Btree), len/tot_len:18/7834, info:8, prev:EA1/633E08] bkpblock[1]: s/d/r:1663/16385/184672 blk:1704550 hole_off/len:1568/432
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194, info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid 3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194, info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982 blk:3634978 hole_off/len:1240/2072
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid 2746914/219
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894, info:8, prev:EA1/637140] bkpblock[1]: s/d/r:1663/16385/1364767 blk:2746914 hole_off/len:1180/2372
[cur:EA1/63A0A8, xid:1418089147, rmid:1(Transaction), len/tot_len:32/64, info:0, prev:EA1/638988] d/s:16385/1663 commit at 2014-01-21 05:41:11 UTC

The last block gets a fair amount of traffic; these are the first four records, and the first two are the only ones that touch this relfilenode:

[cur:EA1/707548, xid:1418089185, rmid:10(Heap), len/tot_len:21/5541, info:8, prev:EA1/707488] insert: s/d/r:1663/16385/1519286 blk/off:281032/13 header: none
[cur:EA1/707548, xid:1418089185, rmid:10(Heap), len/tot_len:21/5541, info:8, prev:EA1/707488] bkpblock[1]: s/d/r:1663/16385/1519286 blk:281032 hole_off/len:312/2728
[cur:EA1/708B08, xid:1418089185, rmid:11(Btree), len/tot_len:18/6814, info:8, prev:EA1/707548] insert_leaf: s/d/r:1663/16385/1519292 tid 110564/335
[cur:EA1/708B08, xid:1418089185, rmid:11(Btree), len/tot_len:18/6814, info:8, prev:EA1/707548] bkpblock[1]: s/d/r:1663/16385/1519292 blk:110564 hole_off/len:1364/1452

--
greg
Greg Stark <stark@mit.edu> writes:
> On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Can we see the associated WAL records (ie, the ones matching the LSNs
>> in the last blocks of these files)?

> Sorry, I've lost track of what information I already shared or didn't,

Hm. So one of these is a heap update, not an index update, which lets out the theory that it's something specific to indexes. But they are all full-page-image updates, so the WAL replay code path for full-page images still seems to be the suspect.

What version were you running before 9.1.11 exactly? I took a look through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any changes that seemed even vaguely related to this. There are some changes in known-transaction tracking, but it's hard to see a connection there. Most of the other diffs are in code that wouldn't execute during WAL replay at all.

			regards, tom lane
On Mon, Feb 3, 2014 at 12:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What version were you running before 9.1.11 exactly? I took a look
> through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
> changes that seemed even vaguely related to this. There are some
> changes in known-transaction tracking, but it's hard to see a connection
> there. Most of the other diffs are in code that wouldn't execute during
> WAL replay at all.

Both the primary and the standby were 9.1.11 from the get-go. The database the primary was forked off of was 9.1.10, but as far as I can tell the primary in the current pair has no problems.

What's worse is we created a new standby from the same base backup and replayed the same records and it didn't reproduce the problem. This means either it's a hardware problem -- but we've seen it on multiple standbys on this database and at least one other database which is in a different data centre -- or it's a race condition -- but that's hard to credit in the recovery code which is basically single-threaded.

And these records are from before the standby reaches consistency, so it's hard to see how a connection from a hot standby client could cause any kind of race condition. The only other thread that could conceivably cause a heisenbug is the bgwriter. It's hard to imagine how a race condition in there could be so easy to hit that it would happen four times on one restore but otherwise go mostly unnoticed.

--
greg
Greg Stark <stark@mit.edu> writes:
> Both the primary and the standby were 9.1.11 from the get-go. The
> database the primary was forked off of was 9.1.10 but as far as I can
> tell the primary in the current pair has no problems.

> What's worse is we created a new standby from the same base backup and
> replayed the same records and it didn't reproduce the problem. This
> means either it's a hardware problem -- but we've seen it on multiple
> standbys on this database and at least one other database which is in
> a different data centre -- or it's a race condition -- but that's hard
> to credit in the recovery code which is basically single-threaded.

> And these records are from before the standby reaches consistency so
> it's hard to see how a connection from a hot standby client could
> cause any kind of race condition. The only other thread that could
> conceivably cause a heisenbug is the bgwriter. It's hard to imagine
> how a race condition in there could be so easy to hit that it would
> happen four times on one restore but otherwise go mostly unnoticed.

I had noticed that the WAL records that were mis-replayed seemed to be bunched pretty close together (two of them even adjacent). Could you confirm that? If so, it seems like we're looking for some condition that makes mis-replay fairly probable for a period of time, but in itself might be quite improbable. Not that that helps much at nailing it down.

You might well be on to something with the bgwriter idea, considering that none of the WAL replay code was originally written with any concurrent execution in mind. We might've missed some place where additional locking is needed.

			regards, tom lane
On Thu, Feb 6, 2014 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I had noticed that the WAL records that were mis-replayed seemed to
> be bunched pretty close together (two of them even adjacent). Could
> you confirm that? If so, it seems like we're looking for some condition
> that makes mis-replay fairly probable for a period of time, but in
> itself might be quite improbable. Not that that helps much at
> nailing it down.

Well, one thing that argues for is hardware problems. It could be that the right memory mapping just happened to line up with that variable on the stack for that short time and then it was mapped to something else entirely. Or the machine was overheating for a short time and then the temperature became more reasonable. Or the person with the x-ray source walked by in that short time window.

That doesn't explain the other instance or the other copies of this database. I think the most productive thing I can do is switch my attention to the other database to see if it really looks like the same problem.

> You might well be on to something with the bgwriter idea, considering
> that none of the WAL replay code was originally written with any
> concurrent execution in mind. We might've missed some place where
> additional locking is needed.

Except that the bgwriter has been in there for a few years already. Unless there's been some other change, possibly involving copying some code that was safe in some context but not where it was copied to.

The problem with the bgwriter being at fault is that from what I can see the bgwriter will never extend a file. That means the xlog recovery code must have done it. That means even if the bgwriter came along and looked at the buffer we just read in it would already be too late to cause mischief. The xlog code extends the file *first* then reads in the backup block into a buffer. I can't see how it could corrupt the stack or the wal recovery buffer in the window between reading in the wal buffer and deciding to extend the relation.

--
greg
On 2014-02-06 23:41:19 +0100, Greg Stark wrote: > The problem with the bgwriter being at fault is that from what I can > see the bgwriter will never extend a file. That means the xlog > recovery code must have done it. That means even if the bgwriter came > along and looked at the buffer we just read in it would already be too > late to cause mischief. The xlog code extends the file *first* then > reads in the backup block into a buffer. I can't see how it could > corrupt the stack or the wal recovery buffer in the window between > reading in the wal buffer and deciding to extend the relation. That's not necessarily true. If e.g. the buffer mapping would change racily, the result write from the bgwriter could very well end up increasing the file size, leaving a hole inbetween its write and the original size. Are there any other relations that are as big as the corrupted relations got extended to? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
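To see the failure shape Andres describes in isolation: a single write aimed past the current end of file silently grows the file and leaves a zero-reading hole behind it. A minimal standalone C sketch (plain POSIX, not PostgreSQL code; the file name is invented for the example):

    /*
     * A write positioned past EOF extends the file; the skipped-over range
     * becomes a hole that reads back as zeroes.  This is the shape a stray
     * write to the wrong block number would leave behind.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int
    main(void)
    {
        char        page[8192];
        struct stat st;
        int         fd = open("demo.seg", O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0)
            return 1;
        memset(page, 'x', sizeof(page));
        (void) pwrite(fd, page, sizeof(page), 0);   /* one ordinary block */
        /* stray write ~1 GB past EOF: file grows, the gap becomes a hole */
        (void) pwrite(fd, page, sizeof(page), (off_t) 1024 * 1024 * 1024);
        if (fstat(fd, &st) == 0)
            printf("logical size=%lld  allocated=%lld\n",
                   (long long) st.st_size, (long long) st.st_blocks * 512);
        close(fd);
        return 0;
    }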
On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > That's not necessarily true. If e.g. the buffer mapping would change > racily, the result write from the bgwriter could very well end up > increasing the file size, leaving a hole inbetween its write and the > original size. a) the segment isn't sparse and b) there were whole segments full of nuls between the end of the tables and the final blocks. So the file was definitely extended by Postgres, not the OS and the bgwriter passes EXTENSION_FAIL which means it wouldn't create those intervening segments. > Are there any other relations that are as big as the corrupted relations > got extended to? I was wondering the same thing. But no, the extended file is 39GB larger than the next largest relation. Also, the final block in the first relation is clearly a version of the same block from the correct location. Look at the ctids and the xids on the rows for lp 3 in the attached file for example. That second copy is from the location in the xlog record. -- greg
Attachment
Greg Stark <stark@mit.edu> writes: > On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> That's not necessarily true. If e.g. the buffer mapping would change >> racily, the result write from the bgwriter could very well end up >> increasing the file size, leaving a hole inbetween its write and the >> original size. > a) the segment isn't sparse and b) there were whole segments full of > nuls between the end of the tables and the final blocks. > So the file was definitely extended by Postgres, not the OS and the > bgwriter passes EXTENSION_FAIL which means it wouldn't create those > intervening segments. But ... when InRecovery, md.c will create such segments too. We had dismissed that on the grounds that the files would be sparse because of the way md.c creates them. However, it is real damn hard to see how the loop in XLogReadBufferExtended could've accessed a bogus block, other than hardware misfeasance which I don't believe any more than you do. The blkno that's passed to that function came directly out of a WAL record that's in the private memory of the startup process and recently passed a CRC check. You'd have to believe some sort of asynchronous memory clobber inside the startup process. On the other hand, if _mdfd_getseg did the deed, there's a whole lot more space for something funny to have happened, because now we're talking about a buffer being written in preparation for eviction from shared buffers, long after WAL replay filled it. So I'm wondering if there's something wrong with our deduction from file non-sparseness. In this connection, google quickly found me a report of XFS "losing" the sparse state of a file across multiple writes: http://oss.sgi.com/archives/xfs/2011-06/msg00225.html I wonder whether that bug or a similar one exists in your production kernel. regards, tom lane
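The non-sparseness deduction can at least be checked mechanically. A small hypothetical helper (not from the PostgreSQL tree; the function name is made up) that flags a segment as sparse when the blocks the filesystem actually allocated fall short of the logical size -- exactly the signal an XFS bug like the one linked above could falsify:

    #include <stdio.h>
    #include <sys/stat.h>

    /* Returns 1 if the file has holes, 0 if fully allocated, -1 on error. */
    static int
    segment_is_sparse(const char *path)
    {
        struct stat st;

        if (stat(path, &st) != 0)
            return -1;
        /* st_blocks is counted in 512-byte units on POSIX systems */
        return (long long) st.st_blocks * 512 < (long long) st.st_size;
    }

    int
    main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
        {
            int         r = segment_is_sparse(argv[i]);

            printf("%s: %s\n", argv[i],
                   r < 0 ? "stat failed" : r ? "sparse" : "not sparse");
        }
        return 0;
    }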
On 2014-02-06 18:42:05 -0500, Tom Lane wrote: > Greg Stark <stark@mit.edu> writes: > > On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> That's not necessarily true. If e.g. the buffer mapping would change > >> racily, the result write from the bgwriter could very well end up > >> increasing the file size, leaving a hole inbetween its write and the > >> original size. > > > a) the segment isn't sparse and b) there were whole segments full of > > nuls between the end of the tables and the final blocks. > > > So the file was definitely extended by Postgres, not the OS and the > > bgwriter passes EXTENSION_FAIL which means it wouldn't create those > > intervening segments. > > But ... when InRecovery, md.c will create such segments too. We had > dismissed that on the grounds that the files would be sparse because > of the way md.c creates them. However, it is real damn hard to see > how the loop in XLogReadBufferExtended could've accessed a bogus block, > other than hardware misfeasance which I don't believe any more than > you do. The blkno that's passed to that function came directly out > of a WAL record that's in the private memory of the startup process > and recently passed a CRC check. You'd have to believe some sort > of asynchronous memory clobber inside the startup process. That reminds me, not that I directly see how it could be responsible, there's still 20131029011623.GJ20248@awork2.anarazel.de ff. around. I don't think we came to a agreement in that thread how to fix the problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > That reminds me, not that I directly see how it could be responsible, > there's still 20131029011623.GJ20248@awork2.anarazel.de ff. around. I > don't think we came to a agreement in that thread how to fix the > problem. Hm, yeah. I'm not sure I believe Heikki's argument that we need to avoid closing the smgr relation. If that stuff is being used in any performance-critical paths then we've got trouble already. regards, tom lane
On 2014-02-06 20:06:03 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > That reminds me, not that I directly see how it could be responsible, > > there's still 20131029011623.GJ20248@awork2.anarazel.de ff. around. I > > don't think we came to a agreement in that thread how to fix the > > problem. > > Hm, yeah. I'm not sure I believe Heikki's argument that we need to avoid > closing the smgr relation. If that stuff is being used in any > performance-critical paths then we've got trouble already. Me neither. And I still am hesitant to start doing an unconditional smgropen(rnode, InvalidBackendId), but maybe that's also misplaced. My gut feeling would be either to go with the very simple approach of closing the smgr (which was the case before the current form of the fake relcache infrastructure!) or to add support for unowning the smgr rel (as in 20131105193632.GD14819@awork2.anarazel.de). After being slightly more awake, it's even harder to see how it could be the cause of this thread's problem. True, it's a rogue write into palloc()ed memory that's used by somebody else, but afaics it will only ever write a NULL. While not impossible, it seems a bit of a stretch that it could cause the symptoms. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 6, 2014 at 11:41 PM, Greg Stark <stark@mit.edu> wrote:
> That doesn't explain the other instance or the other copies of this
> database. I think the most productive thing I can do is switch my
> attention to the other database to see if it really looks like the
> same problem.

So here's an instance in the other database, this one is on a different continent from the first one so it's definitely a different physical machine. I've had to copy the blocks over to another machine because the database is down and still in standby mode anyways. I don't have the xlog file yet.

Bad block's page header -- this is in the 56'th relation segment:

=# select (page_header(E'\\x20050000583b05aa050000002800180500200420000000001098e00f2090e00f088d24061885e00f0000000000000000')).*;
     lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
--------------+-----+-------+-------+-------+---------+----------+---------+-----------
 520/AA053B58 |   5 |     0 |    40 |  1304 |    8192 |     8192 |       4 |         0
(1 row)

=# select (heap_page_items(E'\\x20050000583b05aa050000002800180500200420000000001098e00f2090e00f088d24061885e00f000....')).*;
 lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax | t_field3 |   t_ctid    | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+---------+--------+----------+-------------+-------------+------------+--------+--------+-------
  1 |   6160 |        1 |   2032 | 7635393 |      0 |        0 | (4773121,1) |           3 |       2306 |     24 |        |
  2 |   4128 |        1 |   2032 | 7635393 |      0 |        0 | (4773121,2) |           3 |       2306 |     24 |        |
  3 |   3336 |        1 |    786 | 7635393 |      0 |        0 | (4773121,3) |           3 |       2306 |     24 |        |
  4 |   1304 |        1 |   2032 | 7635428 |      0 |        0 | (4773121,4) |           3 |       2306 |     24 |        |
(4 rows)

Looking at the block at offset 4773121 (which is in the 36th segment):

=# select (heap_page_items(E'\\x20050000a00a0bad050000002c00a00200200420000000001098e00f2090e00f088d24061885e00fa082ec040000000.....')).*;
 lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax | t_field3 |   t_ctid    | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+---------+--------+----------+-------------+-------------+------------+--------+--------+-------
  1 |   6160 |        1 |   2032 | 7635393 |      0 |        0 | (4773121,1) |           3 |       2306 |     24 |        |
  2 |   4128 |        1 |   2032 | 7635393 |      0 |        0 | (4773121,2) |           3 |       2306 |     24 |        |
  3 |   3336 |        1 |    786 | 7635393 |      0 |        0 | (4773121,3) |           3 |       2306 |     24 |        |
  4 |   1304 |        1 |   2032 | 7635428 |      0 |        0 | (4773121,4) |           3 |       2306 |     24 |        |
  5 |    672 |        1 |    630 | 7635580 |      0 |        0 | (4773121,5) |           3 |       2306 |     24 |        |
(5 rows)

d9de7pcqls4ib6=# select (page_header(E'\\x20050000a00a0bad050000002c00a00200200420000000001098e00f2090e00f088d24061885e00fa082ec040000000000000000....')).*;
     lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
--------------+-----+-------+-------+-------+---------+----------+---------+-----------
 520/AD0B0AA0 |   5 |     0 |    44 |   672 |    8192 |     8192 |       4 |         0
(1 row)

-- greg
On Sun, Feb 9, 2014 at 2:54 PM, Greg Stark <stark@mit.edu> wrote:
> Bad block's page header -- this is in the 56'th relation segment:
>
> =# select (page_header(E'\\x20050000583b05aa050000002800180500200420000000001098e00f2090e00f088d24061885e00f0000000000000000')).*;
>      lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
> --------------+-----+-------+-------+-------+---------+----------+---------+-----------
>  520/AA053B58 |   5 |     0 |    40 |  1304 |    8192 |     8192 |       4 |         0
>
> Looking at the block at offset 4773121 (which is in the 36th segment):
...
> d9de7pcqls4ib6=# select (page_header(E'\\x20050000a00a0bad050000002c00a00200200420000000001098e00f2090e00f088d24061885e00fa082ec040000000000000000....')).*;
>      lsn      | tli | flags | lower | upper | special | pagesize | version | prune_xid
> --------------+-----+-------+-------+-------+---------+----------+---------+-----------
>  520/AD0B0AA0 |   5 |     0 |    44 |   672 |    8192 |     8192 |       4 |         0
> (1 row)

And I finally tracked down the xlog record for this stray write:

[cur:520/AA051FE0, xid:7635428, rmid:10(Heap), len/tot_len:21/7005, info:8, prev:520/AA051F20] insert: s/d/r:1663/16385/16650 blk/off:4773121/4 header: none
[cur:520/AA051FE0, xid:7635428, rmid:10(Heap), len/tot_len:21/7005, info:8, prev:520/AA051F20] bkpblock[1]: s/d/r:1663/16385/16650 blk:4773121 hole_off/len:40/1264

-- greg
So I think I've come up with a scenario that could cause this. I don't think it's exactly what happened here but maybe something analogous happened with our base backup restore. On the primary you extend a table a bunch, including adding new segments, but crash before committing (or checkpointing). Then some of the blocks but not all may be written to disk. Assume they're all written except for the last block of the first file. So what you have is a .999G file followed by, say, nine 1G files. (Or maybe the hot backup process could just catch the files in this state if a table is rapidly growing and it doesn't take care to avoid picking up new files that appear after it starts?) smgrnblocks() stops at the first < 1GB segment and ignores the rest. This code in xlog uses it to calculate how many blocks to add but it only calls it once and then doesn't recheck where it's at as it extends the relation. As soon as it adds that one missing block the remaining files become visible. P_NEW always recalculates the position based on smgrnblocks each time (which sounds pretty inefficient but anyways....) so it will add the requested blocks to the new end. Now this isn't enough to explain things since surely the extension records would be in the xlog in physical order. But this could have all happened after an earlier vacuum truncated the relation and we could be replaying records that predate that. So in short, if you have a 10G table and want to overwrite the last block but the first segment is one block short then xlog will add 9G to the end and write the block there. That sounds like what we've seen. I think the easy fix is to change the code in xlogutils to be more defensive and stop as soon as it finds BufferGetBlockNumber(buffer) == blkno (which is what it has in the assert already). -- greg
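To see why one short file hides everything behind it, here is a standalone sketch of the size computation being described -- hypothetical helper names, the usual "<relfilenode>.<segno>" path convention, default 8 kB / 1 GB geometry; not the actual 9.1 md.c code:

    #include <stdio.h>
    #include <sys/stat.h>

    #define BLCKSZ      8192
    #define RELSEG_SIZE 131072          /* blocks per 1 GB segment */

    /* Size of one segment file in blocks, or -1 if the file doesn't exist. */
    static long
    blocks_in_segment(const char *relpath, int segno)
    {
        char        path[1024];
        struct stat st;

        if (segno == 0)
            snprintf(path, sizeof(path), "%s", relpath);
        else
            snprintf(path, sizeof(path), "%s.%d", relpath, segno);
        if (stat(path, &st) != 0)
            return -1;
        return (long) (st.st_size / BLCKSZ);
    }

    /*
     * The md layer's view of the relation size: walk the segments in order
     * and treat the first one that isn't exactly RELSEG_SIZE blocks long as
     * the end of the relation.  Blocks sitting in later files are invisible.
     */
    static long
    relation_nblocks_sketch(const char *relpath)
    {
        long        total = 0;

        for (int segno = 0;; segno++)
        {
            long        nblocks = blocks_in_segment(relpath, segno);

            if (nblocks < 0)
                return total;           /* no such file: previous one was last */
            total += nblocks;
            if (nblocks < RELSEG_SIZE)
                return total;           /* short segment: stop looking */
        }
    }

    int
    main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            printf("%s: %ld blocks visible\n", argv[i],
                   relation_nblocks_sketch(argv[i]));
        return 0;
    }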
So here's my attempt to rewrite this logic. I ended up refactoring a bit because I found it unnecessarily confusing having the mode branches in several places. I think it's much clearer just having two separate pieces of logic for RBM_NEW and the extension cases since all they have in common is the ReadBuffer call. I have to say, it scares the hell out of me that there are no regression tests for this code. I'm certainly not comfortable committing it without a few other people reading it, given that I haven't even run the code once. At least I know it compiles...
Attachment
Greg Stark <stark@mit.edu> writes:
> So here's my attempt to rewrite this logic. I ended up refactoring a
> bit because I found it unnecessarily confusing having the mode
> branches in several places. I think it's much clearer just having two
> separate pieces of logic for RBM_NEW and the extension cases since all
> they have in common is the ReadBuffer call.

I don't like that at all.  It's a lot of unnecessary churn in what is
indeed hard-to-test code, and personally I don't find it clearer.
Nor, if you're going to complain about the cost of smgrnblocks, does
it seem like a great idea to be doing that *twice* per page rather
than once.

How about the attached instead?

			regards, tom lane

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4cd82df..f1918f6 100644
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
*************** XLogReadBufferExtended(RelFileNode rnode
*** 338,352 ****
  		/* we do this in recovery only - no rel-extension lock needed */
  		Assert(InRecovery);
  		buffer = InvalidBuffer;
! 		while (blkno >= lastblock)
  		{
  			if (buffer != InvalidBuffer)
  				ReleaseBuffer(buffer);
  			buffer = ReadBufferWithoutRelcache(rnode, forknum,
  											   P_NEW, mode, NULL);
- 			lastblock++;
  		}
! 		Assert(BufferGetBlockNumber(buffer) == blkno);
  	}
  
  	if (mode == RBM_NORMAL)
--- 338,358 ----
  		/* we do this in recovery only - no rel-extension lock needed */
  		Assert(InRecovery);
  		buffer = InvalidBuffer;
! 		do
  		{
  			if (buffer != InvalidBuffer)
  				ReleaseBuffer(buffer);
  			buffer = ReadBufferWithoutRelcache(rnode, forknum,
  											   P_NEW, mode, NULL);
  		}
! 		while (BufferGetBlockNumber(buffer) < blkno);
! 		/* Handle the corner case that P_NEW returns non-consecutive pages */
! 		if (BufferGetBlockNumber(buffer) != blkno)
! 		{
! 			ReleaseBuffer(buffer);
! 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
! 											   mode, NULL);
! 		}
  	}
  
  	if (mode == RBM_NORMAL)
Greg Stark <stark@mit.edu> writes: > So I think I've come up with a scenario that could cause this. I don't > think it's exactly what happened here but maybe something analogous > happened with our base backup restore. I agree it seems like a good idea for XLogReadBufferExtended to defend itself against successive P_NEW calls possibly not returning consecutive pages. However, unless you had an operating-system-level crash on the master, this isn't sufficient to explain the problem. We'd need also a plausible theory about how a base backup could've left us with short segments in a large relation. > (Or maybe the hot backup > process could just catch the files in this state if a table is rapidly > growing and it doesn't take care to avoid picking up new files that > appear after it starts?) That's a possible explanation I guess, but it doesn't seem terribly probable from a timing standpoint. Also, you should be able to gauge the probability of this theory from knowledge of the application --- are the bloated relations all ones that would've been growing *very* rapidly during the base backup? regards, tom lane
On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > How about the attached instead? This does possibly allocate an extra block past the target block. I'm not sure how surprising that would be for the rest of the code. For what it's worth I've confirmed the bug in wal-e caused the initial problem. But I think it's possible to occur without that bug so I think it still needs to be addressed. -- greg
I wrote: > Greg Stark <stark@mit.edu> writes: >> (Or maybe the hot backup >> process could just catch the files in this state if a table is rapidly >> growing and it doesn't take care to avoid picking up new files that >> appear after it starts?) > That's a possible explanation I guess, but it doesn't seem terribly > probable from a timing standpoint. I did a bit of arithmetic using the cases you posted previously. In the first case, where block 3634978 got written to 7141472, you can make the numbers come out right if you assume that a page was missing at the end of segment 1 --- that leads to the conclusion that EOF exclusive of that missing page had been around 28.75 GB, which squares well with the relation's size on master. However, it's fairly hard to credit that the base backup would have collected full-size or nearly full-size images of segments 2 through 28 while not seeing segment 1 at full size. You'd have to assume that the rel grew by a factor of ~14 while the base backup was in progress --- and then didn't grow very much more afterwards. (What state exactly did you measure the primary rel sizes in? Was it long after the backup/restore, or did you rewind things somehow?) The other examples seem to fit the theory a bit better, but this one is hard to explain this way. The other big problem for this theory is that you said in http://www.postgresql.org/message-id/CAM-w4HPvJCBRVV3dXg8aj0WzkU08dHuX-XYbfDYQhNrn5bnTQg@mail.gmail.com > What's worse is we created a new standby from the same base backup and > replayed the same records and it didn't reproduce the problem. If this were the explanation, it oughta be reproducible that way. I still agree that XLogReadBufferExtended shouldn't be assuming that P_NEW will not skip pages. But I think we have another bug in here somewhere. regards, tom lane
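Spelling the arithmetic out under the default build geometry (8 kB blocks, 1 GB segments of 131072 blocks) -- the intermediate figures below are reconstructions, not numbers from the original report:

    one page missing at the end of segment 1
        => smgrnblocks() reports 2 * 131072 - 1 = 262143 blocks
    target block in the WAL record: 3634978
        => the old one-block-at-a-time loop runs 3634978 - 262143 + 1 = 3372836 times
    the first extension fills block 262143; the later segments then become
    visible again, so the remaining 3372835 extensions append starting at the
    relation's true end-of-file S, and the last buffer lands at
        S + 3372834 = 7141472  =>  S = 3768638 blocks, i.e. about 28.75 GB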
Greg Stark <stark@mit.edu> writes: > On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> How about the attached instead? > This does possibly allocate an extra block past the target block. I'm > not sure how surprising that would be for the rest of the code. Should be fine; we could end up with an extra block after a failed extension operation in any case. > For what it's worth I've confirmed the bug in wal-e caused the initial > problem. Huh? Bug in wal-e? What bug? regards, tom lane
On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Greg Stark <stark@mit.edu> writes: >> On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> How about the attached instead? > >> This does possibly allocate an extra block past the target block. I'm >> not sure how surprising that would be for the rest of the code. > > Should be fine; we could end up with an extra block after a failed > extension operation in any case. I know it's fine on the active database, I'm not so clear whether it's compatible with the xlog records from the primary. I suppose it'll just see an Initialize Page record and happily see the nul block and initialize it. It's still a bit scary. I of course think my code is vastly clearer than the existing code but yes, I see the "unnecessary churn" argument. I think that's the fundamental problem with lacking tests, it makes the code get more and more complex as we're reluctant to simplify it out of fear. >> For what it's worth I've confirmed the bug in wal-e caused the initial >> problem. > > Huh? Bug in wal-e? What bug? WAL-E actually didn't restore a whole 1GB file due to a transient S3 problem, in fact a bunch of them. It's remarkable that Postgres kept going with that much data missing. But the arithmetic worked out on the case I checked it on, which was the last one that I just sent the xlog record for last night. In that case there was precisely one segment missing and the relation was extended by the number of segments you would expect if it filled in that missing segment and then jumped to the end of the relation. -- greg
Greg Stark <stark@mit.edu> writes: > On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Greg Stark <stark@mit.edu> writes: >>> This does possibly allocate an extra block past the target block. I'm >>> not sure how surprising that would be for the rest of the code. >> Should be fine; we could end up with an extra block after a failed >> extension operation in any case. > I know it's fine on the active database, I'm not so clear whether it's > compatible with the xlog records from the primary. I suppose it'll > just see an Initialize Page record and happily see the nul block and > initialize it. It's still a bit scary. Well, we can easily find uninitialized extra pages on the primary too, so if WAL replay were unable to cope with that, it would be a bug regardless. >> Huh? Bug in wal-e? What bug? > WAL-E actually didn't restore a whole 1GB file due to a transient S3 > problem, in fact a bunch of them. Hah. Okay, I think we can write this issue off as closed then. regards, tom lane
I wrote: > Greg Stark <stark@mit.edu> writes: >> WAL-E actually didn't restore a whole 1GB file due to a transient S3 >> problem, in fact a bunch of them. > Hah. Okay, I think we can write this issue off as closed then. Oh, wait a minute. It's not just a matter of whether we find the right block: we also have to consider whether XLogReadBufferExtended will apply the right "mode" behavior. Currently, it supposes that all pages past the initially observed EOF should be assumed to be uninitialized; but if we're working with an inconsistent database, that seems like an unsafe assumption. It might be that a page is there but we've not (yet) fixed the length of some preceding segment. If we want to not get bogus "WAL contains references to invalid pages" failures in such scenarios, it seems like we need a more invasive change than what I just committed. I think your patch didn't cover this consideration either. What I think we probably want to do is forcibly cause the target page to exist, using a P_NEW loop like what I committed, and then decide on the basis of whether it's all-zeroes whether to consider it invalid or not. This seems sane on the grounds that it's just the extension to the page level of the existing policy of creating the file whether it existed or not. It could only result in a large amount of wasted work if the passed-in target block is insane --- but since we got it out of a CRC-checked WAL record, I think it's safe to not worry too much about that. regards, tom lane
I wrote:
> What I think we probably want to do is forcibly cause the target page
> to exist, using a P_NEW loop like what I committed, and then decide
> on the basis of whether it's all-zeroes whether to consider it invalid
> or not.  This seems sane on the grounds that it's just the extension
> to the page level of the existing policy of creating the file whether
> it existed or not.  It could only result in a large amount of wasted
> work if the passed-in target block is insane --- but since we got it
> out of a CRC-checked WAL record, I think it's safe to not worry too
> much about that.

Like the attached.  A possible complaint is that if the WAL sequence
contains updates against large relations that are later dropped, this
will waste time and disk space replaying those updates as best it can.
Doesn't seem like that's a case to optimize for, however.

			regards, tom lane

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index f1918f6..7a820c0 100644
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
*************** XLogReadBuffer(RelFileNode rnode, BlockN
*** 277,297 ****
   * XLogReadBufferExtended
   *		Read a page during XLOG replay
   *
!  * This is functionally comparable to ReadBufferExtended. There's some
!  * differences in the behavior wrt. the "mode" argument:
   *
!  * In RBM_NORMAL mode, if the page doesn't exist, or contains all-zeroes, we
!  * return InvalidBuffer. In this case the caller should silently skip the
!  * update on this page. (In this situation, we expect that the page was later
!  * dropped or truncated. If we don't see evidence of that later in the WAL
!  * sequence, we'll complain at the end of WAL replay.)
   *
   * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
   * relation is extended with all-zeroes pages up to the given block number.
   *
!  * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't
!  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
!  * to imply that the page should be dropped or truncated later.
   */
  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
--- 277,307 ----
   * XLogReadBufferExtended
   *		Read a page during XLOG replay
   *
!  * This is functionally comparable to ReadBufferExtended, except that we
!  * are willing to create the target page (and indeed the whole relation)
!  * if it doesn't currently exist.  This allows safe replay of WAL sequences
!  * in which a relation was later dropped or truncated.
   *
!  * The "mode" argument provides some control over this behavior.  (See also
!  * ReadBufferExtended's specification of what the modes do.)
   *
   * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
   * relation is extended with all-zeroes pages up to the given block number.
+  * These modes should be used if the caller is going to initialize the page
+  * contents from scratch, and doesn't need it to be valid already.
   *
!  * In RBM_NORMAL mode, we similarly create the page if needed, but if the
!  * page contains all zeroes (including the case where we just created it),
!  * we return InvalidBuffer.  Then the caller should silently skip the update
!  * on this page.  This mode should be used for incremental updates where the
!  * caller needs to see a valid page.  (In this case, we expect that the page
!  * later gets dropped or truncated.  If we don't see evidence of that later in
!  * the WAL sequence, we'll complain at the end of WAL replay.)
!  *
!  * RBM_NORMAL_NO_LOG mode is like RBM_NORMAL except that we will return an
!  * all-zeroes page, and not log it as one that ought to get dropped later.
!  * This mode is for when the caller wants to read a page that might validly
!  * contain zeroes.
   */
  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
*************** XLogReadBufferExtended(RelFileNode rnode
*** 299,304 ****
--- 309,315 ----
  {
  	BlockNumber lastblock;
  	Buffer		buffer;
+ 	bool		present;
  	SMgrRelation smgr;
  
  	Assert(blkno != P_NEW);
*************** XLogReadBufferExtended(RelFileNode rnode
*** 316,342 ****
  	 */
  	smgrcreate(smgr, forknum, true);
  
  	lastblock = smgrnblocks(smgr, forknum);
  
  	if (blkno < lastblock)
  	{
  		/* page exists in file */
  		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
  										   mode, NULL);
  	}
  	else
  	{
! 		/* hm, page doesn't exist in file */
! 		if (mode == RBM_NORMAL)
! 		{
! 			log_invalid_page(rnode, forknum, blkno, false);
! 			return InvalidBuffer;
! 		}
! 		if (mode == RBM_NORMAL_NO_LOG)
! 			return InvalidBuffer;
! 		/* OK to extend the file */
  		/* we do this in recovery only - no rel-extension lock needed */
  		Assert(InRecovery);
  		buffer = InvalidBuffer;
  		do
  		{
--- 327,357 ----
  	 */
  	smgrcreate(smgr, forknum, true);
  
+ 	/*
+ 	 * On the same principle, if the page doesn't already exist in the file,
+ 	 * create it by extending the relation as far as needed.
+ 	 *
+ 	 * When we are working in a not-yet-consistent database, it's possible for
+ 	 * P_NEW to behave somewhat inconsistently as a result of incomplete
+ 	 * segment files.  Don't assume that the returned pages are necessarily
+ 	 * consecutive.  When we're done with this loop, however, any segments
+ 	 * before the target page's segment have been zero-filled until complete.
+ 	 */
  	lastblock = smgrnblocks(smgr, forknum);
  
  	if (blkno < lastblock)
  	{
  		/* page exists in file */
+ 		present = true;
  		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
  										   mode, NULL);
  	}
  	else
  	{
! 		/* must extend the file */
  		/* we do this in recovery only - no rel-extension lock needed */
  		Assert(InRecovery);
+ 		present = false;
  		buffer = InvalidBuffer;
  		do
  		{
*************** XLogReadBufferExtended(RelFileNode rnode
*** 352,357 ****
--- 367,374 ----
  			ReleaseBuffer(buffer);
  			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
  											   mode, NULL);
+ 			/* page was not in fact created by P_NEW extension */
+ 			present = true;
  		}
  	}
  
*************** XLogReadBufferExtended(RelFileNode rnode
*** 368,374 ****
  	if (PageIsNew(page))
  	{
  		ReleaseBuffer(buffer);
! 		log_invalid_page(rnode, forknum, blkno, true);
  		return InvalidBuffer;
  	}
  }
--- 385,391 ----
  	if (PageIsNew(page))
  	{
  		ReleaseBuffer(buffer);
! 		log_invalid_page(rnode, forknum, blkno, present);
  		return InvalidBuffer;
  	}
  }
On Wed, Feb 12, 2014 at 8:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Oh, wait a minute. It's not just a matter of whether we find the right > block: we also have to consider whether XLogReadBufferExtended will > apply the right "mode" behavior. Currently, it supposes that all pages > past the initially observed EOF should be assumed to be uninitialized; > but if we're working with an inconsistent database, that seems like > an unsafe assumption. It might be that a page is there but we've not > (yet) fixed the length of some preceding segment. If we want to not > get bogus "WAL contains references to invalid pages" failures in such > scenarios, it seems like we need a more invasive change than what > I just committed. I think your patch didn't cover this consideration > either. Hm. I *think* those cases would be handled anyways since the table would later be truncated. Arguably any reference after the "short" segment is a "reference to an invalid page" since it means it's a record which predates the records which caused the extension. Obviously it would still give the error in the case we had where files were missing but then probably it should give errors in that case. -- greg
Greg Stark <stark@mit.edu> writes: > On Wed, Feb 12, 2014 at 8:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Oh, wait a minute. It's not just a matter of whether we find the right >> block: we also have to consider whether XLogReadBufferExtended will >> apply the right "mode" behavior. Currently, it supposes that all pages >> past the initially observed EOF should be assumed to be uninitialized; >> but if we're working with an inconsistent database, that seems like >> an unsafe assumption. It might be that a page is there but we've not >> (yet) fixed the length of some preceding segment. If we want to not >> get bogus "WAL contains references to invalid pages" failures in such >> scenarios, it seems like we need a more invasive change than what >> I just committed. I think your patch didn't cover this consideration >> either. > Hm. I *think* those cases would be handled anyways since the table > would later be truncated. Arguably any reference after the "short" > segment is a "reference to an invalid page" since it means it's a > record which predates the records which caused the extension. Well, that would be the case if you assume perfectly sequential filesystem behavior, but I'm not sure the assumption holds if the starting condition is a base backup. We could be looking at a version of segment 1 that predates segment 2's existence, and yet see some data in segment 2 as well, because it's not a perfectly coherent snapshot. I think what you're arguing is that we should see WAL records filling the rest of segment 1 before we see any references to segment 2, but if that's the case then how did we get into the situation you reported? Or is it just that it was a broken base backup to start with? regards, tom lane
Hi all, On 02/12/2014 08:27 PM, Greg Stark wrote: > On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Greg Stark <stark@mit.edu> writes: >>> For what it's worth I've confirmed the bug in wal-e caused the initial >>> problem. >> >> Huh? Bug in wal-e? What bug? > > WAL-E actually didn't restore a whole 1GB file due to a transient S3 > problem, in fact a bunch of them. It's remarkable that Postgres kept > going with that much data missing. But the arithmetic worked out on > the case I checked it on, which was the last one that I just sent the > xlog record for last night. In that case there was precisely one > segment missing and the relation was extended by the number of > segments you would expect if it filled in that missing segment and > then jumped to the end of the relation. sorry for interrupting, but did we already notify wal-e's maintainer? Andrea ps cc:ed Daniel
> I think what you're arguing is that we should see WAL records filling the
> rest of segment 1 before we see any references to segment 2, but if that's
> the case then how did we get into the situation you reported? Or is it
> just that it was a broken base backup to start with?

The scenario I could come up with that didn't require a broken base backup was that there was an earlier truncate or vacuum. So the sequence is high offset reference, truncate, growth, crash. All possibly on a single database.

It's possible we're better off not assuming we've thought of all possible ways this can happen though.
Greg Stark <stark@mit.edu> writes: >> I think what you're arguing is that we should see WAL records filling the >> rest of segment 1 before we see any references to segment 2, but if that's >> the case then how did we get into the situation you reported? Or is it >> just that it was a broken base backup to start with? > The scenario I could come up with that didn't require a broken base backup > was that there was an earlier truncate or vacuum. So the sequence is high > offset reference, truncate, growth, crash. All possibly on a single > database. That's not really an issue, because then it would be OK to discard the high-offset update; we'd recognize that as safe when we replay the truncation. > It's possible we're better off not assuming we've thought of all possible > ways this can happen though. That's what's bothering me, too. On the other hand, if we can't think of a scenario where it'd be necessary to replay the high-offset update, then I'm disinclined to mess with the code further. regards, tom lane
On Thu, Feb 13, 2014 at 7:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The scenario I could come up with that didn't require a broken base backup >> was that there was an earlier truncate or vacuum. So the sequence is high >> offset reference, truncate, growth, crash. All possibly on a single >> database. > > That's not really an issue, because then it would be OK to discard the > high-offset update; we'd recognize that as safe when we replay the > truncation. Yeah, that's my point. >> It's possible we're better off not assuming we've thought of all possible >> ways this can happen though. > > That's what's bothering me, too. On the other hand, if we can't think of > a scenario where it'd be necessary to replay the high-offset update, then > I'm disinclined to mess with the code further. And the whole point of the undefined page error checking is to detect cases like this, so covering them up in the name of possible edge cases we haven't thought of kind of defeats the purpose. In particular I would have liked to get errors rather than soldier on when the database found these missing segments. In that vein, the other possibly open question was how we got past the "undefined pages" errors that we did see. Andres said he thought that was due to the bug where some piece of code was mistakenly using the presence of a snapshot but I'm not clear how that can cause this though. -- greg
Greg Stark <stark@mit.edu> writes: > On Thu, Feb 13, 2014 at 7:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> That's what's bothering me, too. On the other hand, if we can't think of >> a scenario where it'd be necessary to replay the high-offset update, then >> I'm disinclined to mess with the code further. > And the whole point of the undefined page error checking is to detect > cases like this, so covering them up in the name of possible edge > cases we haven't thought of kind of defeats the purpose. Yeah, good point. Let's leave it alone and see if the existing fix helps. regards, tom lane
Going over this I think this is still a potential issue:
On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote:
> I am not sure that explains the issue, but I think the redo action for
> truncation is not safe across crashes. A XLOG_SMGR_TRUNCATE will just
> do a smgrtruncate() (and then mdtruncate) which will iterate over the
> segments starting at 0 till mdnblocks()/segment_size and *truncate* but
> not delete individual segment files that are not needed anymore, right?
> If we crash in the midst of that a new mdtruncate() will be issued, but
> it will get a shorter value back from mdnblocks().
>
> Am I missing something?
I'm not too familiar with md.c but my reading of the code is that we truncate the files in reverse order? In which case I think the code is safe *iff* the filesystem guarantees ordered metadata writes, which I think ext3 does (I think in all the journal modes). Most filesystems' metadata writes are synchronous, so the truncates are safe for them too.
But we don't generally rely on metadata writes being ordered. I think the "correct" thing to do is to record the nblocks prior to the truncate and then have md.c expose a new function that takes that parameter and pokes around looking for any segments it might need to clean up. But that would involve lots of abstraction violations in md.c. I think using nblocks would keep the violations within md.c but that still seems like a pain.
On 2014-02-14 20:46:01 +0000, Greg Stark wrote:
> Going over this I think this is still a potential issue:
>
> On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote:
> >
> > I am not sure that explains the issue, but I think the redo action for
> > truncation is not safe across crashes. A XLOG_SMGR_TRUNCATE will just
> > do a smgrtruncate() (and then mdtruncate) which will iterate over the
> > segments starting at 0 till mdnblocks()/segment_size and *truncate* but
> > not delete individual segment files that are not needed anymore, right?
> > If we crash in the midst of that a new mdtruncate() will be issued, but
> > it will get a shorter value back from mdnblocks().
> >
> > Am I missing something?
>
> I'm not too familiar with md.c but my reading of the code is that we
> truncate the files in reverse order?

That's what I had assumed as well, but it doesn't look that way:

    v = mdopen(reln, forknum, EXTENSION_FAIL);

    priorblocks = 0;            /* <- initial value */
    while (v != NULL)
    {
        MdfdVec    *ov = v;

        if (priorblocks > nblocks)
        {
            /* truncate entire segment */
        }
        else if (priorblocks + ((BlockNumber) RELSEG_SIZE) > nblocks)
        {
            /* truncate this segment to a partial length */
        }
        else
        {
            /* nothing to do, all needed */
        }
        priorblocks += RELSEG_SIZE;
    }

So, according to that we start at segment 0, right?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-02-14 20:46:01 +0000, Greg Stark wrote: >> Going over this I think this is still a potential issue: >> On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote: >>> I am not sure that explains the issue, but I think the redo action for >>> truncation is not safe across crashes. A XLOG_SMGR_TRUNCATE will just >>> do a smgrtruncate() (and then mdtruncate) which will iterate over the >>> segments starting at 0 till mdnblocks()/segment_size and *truncate* but >>> not delete individual segment files that are not needed anymore, right? >>> If we crash in the midst of that a new mdtruncate() will be issued, but >>> it will get a shorter value back from mdnblocks(). >> I'm not too familiar with md.c but my reading of the code is that we >> truncate the files in reverse order? > That's what I had assumed as well, but it doesn't look that way: No, it's deleting forward. We could probably fix things so it deleted backwards; it'd be a tad tedious because the list structure isn't organized that way, but we could do it. Not sure if that's good enough though. If you don't want to assume the filesystem metadata is coherent after a crash, we might have nonzero-size segments after zero-size ones, even if the truncate calls had been issued in the right order. Another possibility is to keep opening and truncating files until we don't find the next segment in sequence, looking directly at the filesystem not at the mdfd chain. I don't think this would be appropriate in normal operation, but we could do it if InRecovery (and maybe even only if we don't think the database is consistent?) regards, tom lane
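To make that idea concrete, a rough sketch -- hypothetical helper with an invented name and path convention, not actual md.c code -- of "keep going by file name until the next segment in sequence is missing":

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * During recovery, chop every segment file at or beyond first_seg_to_zap
     * down to zero length, walking by file name rather than trusting
     * mdnblocks(), and stop only when the next file in sequence is absent.
     */
    static void
    truncate_trailing_segments(const char *relpath, int first_seg_to_zap)
    {
        for (int segno = first_seg_to_zap;; segno++)
        {
            char        path[1024];
            int         fd;

            if (segno == 0)
                snprintf(path, sizeof(path), "%s", relpath);
            else
                snprintf(path, sizeof(path), "%s.%d", relpath, segno);

            fd = open(path, O_RDWR);
            if (fd < 0)
                break;          /* typically ENOENT: no further segments (a
                                 * real version would distinguish and report
                                 * other errors) */
            (void) ftruncate(fd, 0);    /* empty it; don't rely on its length */
            close(fd);
        }
    }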
On 2014-02-14 22:30:45 -0500, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-02-14 20:46:01 +0000, Greg Stark wrote: > >> Going over this I think this is still a potential issue: > >> On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote: > >>> I am not sure that explains the issue, but I think the redo action for > >>> truncation is not safe across crashes. A XLOG_SMGR_TRUNCATE will just > >>> do a smgrtruncate() (and then mdtruncate) which will iterate over the > >>> segments starting at 0 till mdnblocks()/segment_size and *truncate* but > >>> not delete individual segment files that are not needed anymore, right? > >>> If we crash in the midst of that a new mdtruncate() will be issued, but > >>> it will get a shorter value back from mdnblocks(). > We could probably fix things so it deleted backwards; it'd be a tad > tedious because the list structure isn't organized that way, but we > could do it. We could just make the list a doubly linked one, that'd make it simple. > Not sure if that's good enough though. If you don't > want to assume the filesystem metadata is coherent after a crash, > we might have nonzero-size segments after zero-size ones, even if > the truncate calls had been issued in the right order. I don't think that can actually happen on any realistic/interesting FS. Metadata updates better be journaled, so while they might not persist because the journal wasn't flushed, they should be applied in a sane order after a crash. But nonetheless I am not sure we want to rely on that. > Another possibility is to keep opening and truncating files until > we don't find the next segment in sequence, looking directly at the > filesystem not at the mdfd chain. I don't think this would be > appropriate in normal operation, but we could do it if InRecovery > (and maybe even only if we don't think the database is consistent?) Yes, I was thinking of simply having a mdnblocks() variant that looks for the last existing file, disregarding the size. But looking around, it seems mdunlinkfork() has a similar issue, and I don't see how such a trick could be applied there :( I guess the theoretically correct thing would be to make all WAL records about truncation and unlinking contain the current size of the relation, but especially with deletions and forks that will probably turn out to be annoying to do. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Feb 15, 2014 at 11:45 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I guess the theoretically correct thing would be to make all WAL records > about truncation and unlinking contain the current size of the relation, > but especially with deletions and forks that will probably turn out to > be annoying to do. Here's another alternative. In md.c, when extending a file to RELSEG_SIZE, always check whether the next segment is already there and truncate it if it is, to avoid magically slurping in that data. That maintains the invariant that the first short segment will mark the end of the relation. If you have a short or missing segment then you'll ignore all the later segments. I think to make this work you would have to sync the newly truncated segment first, before extending the current segment. And this would cause every relation extension to do an extra filesystem lookup. Perhaps only doing this in recovery (or with assertions enabled?) would mitigate that cost. This makes a mockery of the comment in xlogutils.c that we would rather not lose data in the case of a lost inode. But I feel like the data in the later segments was already lost before the earlier segment was filled up; it hardly helps matters if it can sometimes be unlost when the earlier data happens to get written to in a particular pattern. -- greg
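A rough sketch of that check -- hypothetical function with an invented name and path convention, not actual md.c code -- called at the point where an extension is about to fill the current segment up to RELSEG_SIZE:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Before the current segment reaches RELSEG_SIZE blocks, make sure a
     * stale next segment left over from an earlier crash can't suddenly
     * become visible: truncate it to zero length and sync that away.
     * Returns 0 on success (including "no such file"), -1 on error.
     */
    static int
    clear_stale_next_segment(const char *relpath, int cur_segno)
    {
        char        path[1024];
        int         fd;

        snprintf(path, sizeof(path), "%s.%d", relpath, cur_segno + 1);
        fd = open(path, O_RDWR);
        if (fd < 0)
            return (errno == ENOENT) ? 0 : -1;  /* nothing there is fine */

        /* empty the leftover file and make that durable before extending */
        if (ftruncate(fd, 0) != 0 || fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }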