Thread: Recovery inconsistencies, standby much larger than primary

Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
Since the point release we've run into a number of databases that when
we restore from a base backup end up being larger than the primary
database was. Sometimes by a large factor. The data below is from
9.1.11 (both primary and standby) but we've seen the same thing on
9.2.6.

primary$ for i in  1261982 1364767 1366221 473158 ; do echo -n "$i " ;
du -shc $i* | tail -1 ; done
1261982 29G total
1364767 23G total
1366221 12G total
473158 76G total

standby$ for i in  1261982 1364767 1366221 473158 ; do echo -n "$i " ;
du -shc $i* | tail -1 ; done
1261982 55G total
1364767 28G total
1366221 17G total
473158 139G total

I've run the snaga xlogdump on the WAL records played before reaching
a consistent point (we confirmed the extra storage had already
appeared by then) and grepped for the above relfilenode but they're
quite large. I believe these dumps don't contain any sensitive data,
when I verify that I can upload one of them for inspection.

$ ls -lh [14]*
-rw-rw-r-- 1 heroku heroku 325M Jan 24 04:13 1261982
-rw-r--r-- 1 root   root   352M Jan 25 00:04 1364767
-rw-r--r-- 1 root   root   123M Jan 25 00:04 1366221
-rw-r--r-- 1 root   root   357M Jan 25 00:04 473158

The first three are btrees and the fourth is a haeap btw.

We're also seeing log entries about "wal contains reference to invalid
pages" but these errors seem only vaguely correlated. Sometimes we get
the errors but the tables don't grow noticeably and sometimes we don't
get the errors and the tables are much larger.

Much of the added space is uninitialized pages as you might expect but
I don't understand is how the database can start up without running
into the "reference to invalid pages" panic consistently. We check
both that there are no references after consistency is reached *and*
that any references before consistency are resolved by a truncate or
unlink before consistency.

The primary was never this large btw, so it's not just a case of
leftover files from drops or truncates that might have failed on the
standby.

I'm assuming this is somehow related to the mulixact or transaction
wraparound problems but I don't really understand how they could be
hitting when both the primary and standby are post-upgrade to the most
recent point release which have the fixes

-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
Hi,

On 2014-01-24 19:23:28 -0500, Greg Stark wrote:
> Since the point release we've run into a number of databases that when
> we restore from a base backup end up being larger than the primary
> database was. Sometimes by a large factor. The data below is from
> 9.1.11 (both primary and standby) but we've seen the same thing on
> 9.2.6.

What's the procedure for creating those standbys? Were they of similar
size after being cloned?

> primary$ for i in  1261982 1364767 1366221 473158 ; do echo -n "$i " ;
> du -shc $i* | tail -1 ; done
> 1261982 29G total
> 1364767 23G total
> 1366221 12G total
> 473158 76G total
> 
> standby$ for i in  1261982 1364767 1366221 473158 ; do echo -n "$i " ;
> du -shc $i* | tail -1 ; done
> 1261982 55G total
> 1364767 28G total
> 1366221 17G total
> 473158 139G total
> ...
> The first three are btrees and the fourth is a haeap btw.

Are those all of the same underlying heap relation?

> We're also seeing log entries about "wal contains reference to invalid
> pages" but these errors seem only vaguely correlated. Sometimes we get
> the errors but the tables don't grow noticeably and sometimes we don't
> get the errors and the tables are much larger.

Uhm. I am a bit confused. You see those in the standby's log? At !debug
log levels? That'd imply that the standby is dead and needed to be
recloned, no? How do you continue after that?

> Much of the added space is uninitialized pages as you might expect but
> I don't understand is how the database can start up without running
> into the "reference to invalid pages" panic consistently. We check
> both that there are no references after consistency is reached *and*
> that any references before consistency are resolved by a truncate or
> unlink before consistency.

Well, it's pretty easy to get into a situation with lot's of new
pages. Lots of concurrent inserts that all fail before logging WAL. The
next insert will extend the relation and only initialise that last
value.

It'd be interesting to look for TRUNCATE records using xlogdump. Could
you show those for starters?

> I'm assuming this is somehow related to the mulixact or transaction
> wraparound problems but I don't really understand how they could be
> hitting when both the primary and standby are post-upgrade to the most
> recent point release which have the fixes

That doesn't sound likely. For one the symptoms don't fit, for another,
those problems are mostly 9.3+.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Sun, Jan 26, 2014 at 9:45 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> On 2014-01-24 19:23:28 -0500, Greg Stark wrote:
>> Since the point release we've run into a number of databases that when
>> we restore from a base backup end up being larger than the primary
>> database was. Sometimes by a large factor. The data below is from
>> 9.1.11 (both primary and standby) but we've seen the same thing on
>> 9.2.6.
>
> What's the procedure for creating those standbys? Were they of similar
> size after being cloned?

These are restored from base backup using WAL-E and then started in
standby mode. The logs are retrieved using archive_command (which is
WAL-E) after it retrieves lots of archived wal the database switches
to streaming.

We confirmed from size monitoring that the standby database grew
substantially before the time it reported reaching consistent state,
so I only downloaded the WAL from that range for analysis.

>> primary$ for i in  1261982 1364767 1366221 473158 ; do echo -n "$i " ;
>> du -shc $i* | tail -1 ; done
>> 1261982 29G total
>> 1364767 23G total
>> 1366221 12G total
>> 473158 76G total
>>
>> standby$ for i in  1261982 1364767 1366221 473158 ; do echo -n "$i " ;
>> du -shc $i* | tail -1 ; done
>> 1261982 55G total
>> 1364767 28G total
>> 1366221 17G total
>> 473158 139G total
>> ...
>> The first three are btrees and the fourth is a haeap btw.
>
> Are those all of the same underlying heap relation?

Are you asking whether the relfilenode was reused for a different
relation? I doubt it.

Or are you asking if the first three indexes are for the same heap
(presumably the fourth one)? I don't think so but I can check.

>> We're also seeing log entries about "wal contains reference to invalid
>> pages" but these errors seem only vaguely correlated. Sometimes we get
>> the errors but the tables don't grow noticeably and sometimes we don't
>> get the errors and the tables are much larger.
>
> Uhm. I am a bit confused. You see those in the standby's log? At !debug
> log levels? That'd imply that the standby is dead and needed to be
> recloned, no? How do you continue after that?

It's possible I'm confusing symptoms from an unrelated problem. But
the symptom we saw was that it got this error, recovery crashed, then
recovery started again and it replayed fine. I agree that doesn't jive
with the code I see in 9.3, I didn't check how long the code was this
tense though.

>> Much of the added space is uninitialized pages as you might expect but
>> I don't understand is how the database can start up without running
>> into the "reference to invalid pages" panic consistently. We check
>> both that there are no references after consistency is reached *and*
>> that any references before consistency are resolved by a truncate or
>> unlink before consistency.
>
> Well, it's pretty easy to get into a situation with lot's of new
> pages. Lots of concurrent inserts that all fail before logging WAL. The
> next insert will extend the relation and only initialise that last
> value.
>
> It'd be interesting to look for TRUNCATE records using xlogdump. Could
> you show those for starters?

There are no records matching grep -i truncate in any of those
extracts for those relfilenodes. I'm grepping the whole xlogdump now
but it'll take a while. So far no truncates anywhere.

>> I'm assuming this is somehow related to the mulixact or transaction
>> wraparound problems but I don't really understand how they could be
>> hitting when both the primary and standby are post-upgrade to the most
>> recent point release which have the fixes
>
> That doesn't sound likely. For one the symptoms don't fit, for another,
> those problems are mostly 9.3+.

These problems all started to appear after the latest point release
btw. That could just be a coincidence of course.


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
>> We're also seeing log entries about "wal contains reference to invalid
>> pages" but these errors seem only vaguely correlated. Sometimes we get
>> the errors but the tables don't grow noticeably and sometimes we don't
>> get the errors and the tables are much larger.
>
> Uhm. I am a bit confused. You see those in the standby's log? At !debug
> log levels? That'd imply that the standby is dead and needed to be
> recloned, no? How do you continue after that?


So in chatting with Heikki last night we came up with a scenario where
this check is insufficient.

If you have multiple checkpoints during the base backup then there
will be restartpoints during recovery. If the reference to the invalid
page is before the restartpont then after crashing recovery and coming
back up the recovery will go forward fine.

Fixing this check doesn't look trivial. I'm inclined to say to
suppress any restartpoints while there are references to invalid pages
in the hash. The problem with this is that it will prevent trimming
the xlog during recovery. It seems frightening that most days recovery
will take little extra space but if you happen to have a drop table or
truncate during the base backup then your recovery might require a lot
of extra space.

The alternative of spilling the hash table to disk at every
restartpoint seems kind of hokey. Then we need to worry about fsyncing
this file, cleaning it up, dealing with the file after crashes, etc.

-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 11:09:14 +0000, Greg Stark wrote:
> On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> >> We're also seeing log entries about "wal contains reference to invalid
> >> pages" but these errors seem only vaguely correlated. Sometimes we get
> >> the errors but the tables don't grow noticeably and sometimes we don't
> >> get the errors and the tables are much larger.
> >
> > Uhm. I am a bit confused. You see those in the standby's log? At !debug
> > log levels? That'd imply that the standby is dead and needed to be
> > recloned, no? How do you continue after that?

> So in chatting with Heikki last night we came up with a scenario where
> this check is insufficient.

But that seems unrelated to the issue at hand, right?

> If you have multiple checkpoints during the base backup then there
> will be restartpoints during recovery. If the reference to the invalid
> page is before the restartpont then after crashing recovery and coming
> back up the recovery will go forward fine.

We don't perform restartpoints if there are invalid pages
registered. Check the XLogHaveInvalidPages() call in xlog.c.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 11:09:14 +0000, Greg Stark wrote:
> On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >
> >> We're also seeing log entries about "wal contains reference to invalid
> >> pages" but these errors seem only vaguely correlated. Sometimes we get
> >> the errors but the tables don't grow noticeably and sometimes we don't
> >> get the errors and the tables are much larger.
> >
> > Uhm. I am a bit confused. You see those in the standby's log? At !debug
> > log levels? That'd imply that the standby is dead and needed to be
> > recloned, no? How do you continue after that?

> So in chatting with Heikki last night we came up with a scenario where
> this check is insufficient.

The slightly more likely explanation for transient errors is that you
hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That had
only taken effect if HS has already assembled a snapshot, which can make
such an error vanish after a restart...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Fri, Jan 31, 2014 at 11:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> The slightly more likely explanation for transient errors is that you
> hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That had
> only taken effect if HS has already assembled a snapshot, which can make
> such an error vanish after a restart...

Which one, there seem to be several....

So this seems like it's more likely to be a symptom of whatever is
causing the table to grow than a cause? That is, there's some bug
causing the standby to extend the btree dramatically resulting in lots
of uninitialized pages and touching those pages triggers this bug. But
this doesn't explain why the btree is being extended I don't think.


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 11:46:09 +0000, Greg Stark wrote:
> On Fri, Jan 31, 2014 at 11:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > The slightly more likely explanation for transient errors is that you
> > hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That had
> > only taken effect if HS has already assembled a snapshot, which can make
> > such an error vanish after a restart...
> 
> Which one, there seem to be several....
> 
> So this seems like it's more likely to be a symptom of whatever is
> causing the table to grow than a cause? That is, there's some bug
> causing the standby to extend the btree dramatically resulting in lots
> of uninitialized pages and touching those pages triggers this bug. But
> this doesn't explain why the btree is being extended I don't think.

I don't think anything we've talked about so far is likely to explain
the issue. I don't have time atm to look closer, but what I'd do is try
to look if there are any pages with valid LSNs on the standby in the
bloated area... That then might give you a hint where to look.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
1261982.53 is entirely nuls. I think that's true for most if not all
of the intervening files, still investigating.

The 54th segment is nul up to offset 1f0c0000 after which it has valid
looking blocks:

# hexdump 1261982.54 | head -100
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
1f0c0000 0ea1 0000 8988 0063 0006 0000 04d8 0cf0

However when I grep xlogdump for any records mentioning this block I
get nothing.

In fact the largest block I find in the xlog is 3646630:

# grep  'tid 3646631/' 1261982 | wc -l
0
# grep  'tid 3646630/' 1261982 | wc -l
177

Looking at the block above it looks like the LSN is 0000EA100638988
which I find in the logs but it's a btree insert on a different btree:

[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
blk:3634978 hole_off/len:1240/2072
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
2746914/219
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] bkpblock[1]: s/d/r:1663/16385/1364767
blk:2746914 hole_off/len:1180/2372
[cur:EA1/63A0A8, xid:1418089147, rmid:1(Transaction),
len/tot_len:32/64, info:0, prev:EA1/638988] d/s:16385/1663 commit at
2014-01-21 05:41:11 UTC



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 14:39:47 +0000, Greg Stark wrote:
> 1261982.53 is entirely nuls. I think that's true for most if not all
> of the intervening files, still investigating.
> 
> The 54th segment is nul up to offset 1f0c0000 after which it has valid
> looking blocks:

It'd be interesting to dump the page header for that using pageinspect.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Fri, Jan 31, 2014 at 2:39 PM, Greg Stark <stark@mit.edu> wrote:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072
> [cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
> info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
> 2746914/219

Actually wait, the immediate previous record is indeed on the right
filenode. Is the LSN printed in xlogdump the LSN that would be in the
pd_lsn or is the pd_lsn going to be from the following record?


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 14:59:21 +0000, Greg Stark wrote:
> On Fri, Jan 31, 2014 at 2:39 PM, Greg Stark <stark@mit.edu> wrote:
> > [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> > info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> > blk:3634978 hole_off/len:1240/2072
> > [cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
> > info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
> > 2746914/219
> 
> Actually wait, the immediate previous record is indeed on the right
> filenode. Is the LSN printed in xlogdump the LSN that would be in the
> pd_lsn or is the pd_lsn going to be from the following record?

It points to the end of the record (i.e. the beginning of the next). It
needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
enough.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Fri, Jan 31, 2014 at 3:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> It points to the end of the record (i.e. the beginning of the next). It
> needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
> enough.

Ah, in which case the relevant record is:
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
blk:3634978 hole_off/len:1240/2072


It looks like get_raw_page() refuses to read past the end of relpages.
I could make a clone of this database to allow experimenting with
tweaking relpages but it may or may not reproduce the problem...

=# select pg_relation_size('data_pkey') / 1024 / 1024 / 1024;?column?
----------      23
(1 row)

=# select get_raw_page('data_pkey', 'main', 11073632) ;
ERROR:  block number 11073632 is out of range for relation "data_pkey"

d9de7pcqls4ib6=# select relpages from pg_class where relname = 'data_pkey';relpages
---------- 2889286


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 15:15:24 +0000, Greg Stark wrote:
> On Fri, Jan 31, 2014 at 3:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> 
> > It points to the end of the record (i.e. the beginning of the next). It
> > needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
> > enough.
> 
> Ah, in which case the relevant record is:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> 3634978/282
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072
> 
> 
> It looks like get_raw_page() refuses to read past the end of relpages.
> I could make a clone of this database to allow experimenting with
> tweaking relpages but it may or may not reproduce the problem...

No, it uses smgrnblocks() to get the size.

> =# select get_raw_page('data_pkey', 'main', 11073632) ;
> ERROR:  block number 11073632 is out of range for relation "data_pkey"

Isn't the page 3634978?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> =# select get_raw_page('data_pkey', 'main', 11073632) ;
>> ERROR:  block number 11073632 is out of range for relation "data_pkey"
>
> Isn't the page 3634978?

The page in the record is.

But the page on disk is in the 54th segment at offset 1F0C0000

So unless my arithmetic is wrong:

bc -l
ibase=16
400 * 400 * 400 / 2000 * 54 + 1F0C0000 / 2000
11073632


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 15:21:35 +0000, Greg Stark wrote:
> On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> =# select get_raw_page('data_pkey', 'main', 11073632) ;
> >> ERROR:  block number 11073632 is out of range for relation "data_pkey"
> >
> > Isn't the page 3634978?
> 
> The page in the record is.

It'd be interesting to look at the referenced page using bt_page_items().

> But the page on disk is in the 54th segment at offset 1F0C0000

It's interesting that the smgr gets this wrong then (as also evidenced
by the fact that relation_size does as well). Could you please do a ls
-l path/to/relfilenode*?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> It's interesting that the smgr gets this wrong then (as also evidenced
> by the fact that relation_size does as well). Could you please do a ls
> -l path/to/relfilenode*?

IIRC, smgrnblocks will stop as soon as it finds a segment that is not
1GB in size.  Could you check the lengths of all segments for that
relation?
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 10:33:16 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It's interesting that the smgr gets this wrong then (as also evidenced
> > by the fact that relation_size does as well). Could you please do a ls
> > -l path/to/relfilenode*?
> 
> IIRC, smgrnblocks will stop as soon as it finds a segment that is not
> 1GB in size.  Could you check the lengths of all segments for that
> relation?

Yea, that's what I am wondering about. I wanted the full list because
there could be an entire file missing and it's interesting to see at
which time they were last touched relative to each other...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Isn't the page 3634978?

> The page in the record is.

> But the page on disk is in the 54th segment at offset 1F0C0000

> So unless my arithmetic is wrong:

> bc -l
> ibase=16
> 400 * 400 * 400 / 2000 * 54 + 1F0C0000 / 2000
> 11073632

At least two of us are confused.  I get

# select ((2^30) * 54.0 + 'x1F0C0000'::bit(32)::int) / 8192;?column? 
---------- 7141472
(1 row)
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
Sorry guys. I transposed two numbers when looking up the relation.
"data_pk" wasn't the right index.

=# select (page_header(get_raw_page('index_data_id', 'main', 3020854))).* ;    lsn      | tli | flags | lower | upper |
special| pagesize |
 
version | prune_xid
--------------+-----+-------+-------+-------+---------+----------+---------+-----------CF0/2DD67BB8 |   5 |     0 |
792|  5104 |    8176 |     8192 | 4 |         0
 
(1 row)



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> Sorry guys. I transposed two numbers when looking up the relation.
> "data_pk" wasn't the right index.

> =# select (page_header(get_raw_page('index_data_id', 'main', 3020854))).* ;
>      lsn      | tli | flags | lower | upper | special | pagesize |
> version | prune_xid
> --------------+-----+-------+-------+-------+---------+----------+---------+-----------
>  CF0/2DD67BB8 |   5 |     0 |   792 |  5104 |    8176 |     8192 |
>   4 |         0
> (1 row)

Clearly not the right page --- the LSN isn't what we saw in the hexdump.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Fri, Jan 31, 2014 at 3:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> 400 * 400 * 400 / 2000 * 54 + 1F0C0000 / 2000
>> 11073632

Ooops, it's reading 54 in hex there.
> # select ((2^30) * 54.0 + 'x1F0C0000'::bit(32)::int) / 8192;
>  ?column?
> ----------
>   7141472

ibase=16
400 * 400 * 400 / 2000 * 36 + 1F0C0000 / 2000
7141472.00000000000000000000

So now that I've got the arithmetic right:

=# select (page_header(get_raw_page('index_data_id', 'main', 7141472))).* ;   lsn     | tli | flags | lower | upper |
special| pagesize |
 
version | prune_xid
------------+-----+-------+-------+-------+---------+----------+---------+-----------EA1/638988 |   6 |     0 |  1240 |
3312 |    8176 |     8192 |
 
4 |         0


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-01-31 10:33:16 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > It's interesting that the smgr gets this wrong then (as also evidenced
> > by the fact that relation_size does as well). Could you please do a ls
> > -l path/to/relfilenode*?
> 
> IIRC, smgrnblocks will stop as soon as it finds a segment that is not
> 1GB in size.  Could you check the lengths of all segments for that
> relation?

I am not sure that explains the issue, but I think the redo action for
truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
do a smgrtruncate() (and then mdtruncate) which will iterate over the
segments starting at 0 till mdnblocks()/segment_size and *truncate* but
not delete individual segment files that are not needed anymore, right?
If we crash in the midst of that a new mdtruncate() will be issued, but
it will get a shorter value back from mdnblocks().

Am I missing something?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
So just to summarize, this xlog record:

[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
blk:3634978 hole_off/len:1240/2072

Appears to have been written to this location:

d9de7pcqls4ib6=# select
(page_header(get_raw_page('index_event_data_on_event_occurrence_id',
'main', 7141472))).* ;   lsn     | tli | flags | lower | upper | special | pagesize |
version | prune_xid
------------+-----+-------+-------+-------+---------+----------+---------+-----------EA1/638988 |   6 |     0 |  1240 |
3312 |    8176 |     8192 |
 
4 |         0
(1 row)

The correct location appears to have been written to by later records
(a split where it's the leftsib to be specific) so any forensic
evidence is lost:

d9de7pcqls4ib6=# select
(page_header(get_raw_page('index_event_data_on_event_occurrence_id',
'main', 3634978))).* ;   lsn     | tli | flags | lower | upper | special | pagesize |
version | prune_xid
------------+-----+-------+-------+-------+---------+----------+---------+-----------EA1/A90CF8 |   6 |     0 |   792 |
5104 |    8176 |     8192 |
 
4 |         0
(1 row)

But the record prior to that lsn is a split where the leftsib matches
that record:

[cur:EA1/A8EDE0, xid:1418089314, rmid:11(Btree),
len/tot_len:3502/7938, info:68, prev:EA1/A8D7A0] split_r:
s/d/r:1663/16385/1261982 leftsib 3634978
[cur:EA1/A8EDE0, xid:1418089314, rmid:11(Btree),
len/tot_len:3502/7938, info:68, prev:EA1/A8D7A0] bkpblock[2]:
s/d/r:1663/16385/1261982 blk:1881360 hole_off/len:892/3812

Interestingly there are a bunch of other records that also write to
that block but there are no other full page writes. That seems to
imply that the index is currently missing the leaf inserted by the
EA1/637140 record.



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> So just to summarize, this xlog record:
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
> 3634978/282
> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
> info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
> blk:3634978 hole_off/len:1240/2072

> Appears to have been written to [ block 7141472 ]

I've been staring at the code for a bit trying to guess how that could
have happened.  Since the WAL record has a backup block, btree_xlog_insert
would have passed control to RestoreBackupBlock, which would call
XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint
about writing past the end of the relation.  Now, you can imagine some
very low-level error causing a write to go to the wrong page due to a seek
problem or some such, but it's hard to credit that that would've resulted
in creation of all the intervening segment files.  Some level of our code
had to have thought it was being told to extend the relation.

However, on closer inspection I was a bit surprised to realize that there
are two possible candidates for doing that!  XLogReadBufferExtended will
extend the relation, a block at a time, if told to write a page past
the current nominal EOF.  And in md.c, _mdfd_getseg will *also* extend
the relation if we're InRecovery, even though it normally would not do
so when called from mdwrite().

Given the behavior in XLogReadBufferExtended, I rather think that the
InRecovery special case in _mdfd_getseg is dead code and should be
removed.  But for the purpose at hand, it's more interesting to try to
confirm which of these code levels did the extension.  I notice that
_mdfd_getseg only bothers to write the last physical page of each segment,
whereas XLogReadBufferExtended knows nothing of segments and will
ploddingly write every page.  So on a filesystem that supports "holes"
in files, I'd expect that the added segments would be fully allocated
if XLogReadBufferExtended did the deed, but they'd be quite small if
_mdfd_getseg did so.  The du results you started with suggest that the
former is the case, but could you verify that the filesystem this is
on supports holes and that du will report only the actually allocated
space when there's a hole?

Assuming that the extension was done in XLogReadBufferExtended, we are
forced to the conclusion that XLogReadBufferExtended was passed a bad
block number (viz 7141472); and it's pretty hard to see how that could
happen.  RestoreBackupBlock is just passing the value it got out of the
WAL record.  I thought about the idea that it was wrong about exactly
where the BkpBlock struct was in the record, but that would presumably
lead to garbage relnode and fork numbers not just a bad block number.

So I'm still baffled ...
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
<p dir="ltr">One thing I keep coming back to is a bad ran chip setting a bit in the block number. But I just can't seem
toget it to add up. The difference is not a power of two, it had happened on two different machines, and we don't see
otherweirdness on the machine. It seems like a strange coincidence it would happen to the same variable twice and not
toother variables.<p dir="ltr">Unless there's some unrelated code writing through a wild pointer, possibly to a stack
allocatedobject that just happens to often be that variable?<p dir="ltr">-- <br /> greg<div class="gmail_quote">On 31
Jan2014 20:21, "Tom Lane" <<a href="mailto:tgl@sss.pgh.pa.us">tgl@sss.pgh.pa.us</a>> wrote:<br type="attribution"
/><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Greg Stark
<<ahref="mailto:stark@mit.edu">stark@mit.edu</a>> writes:<br /> > So just to summarize, this xlog record:<br
/>> [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,<br /> > info:8, prev:EA1/635290]
insert_leaf:s/d/r:1663/16385/1261982 tid<br /> > 3634978/282<br /> > [cur:EA1/637140, xid:1418089147,
rmid:11(Btree),len/tot_len:18/6194,<br /> > info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982<br />
>blk:3634978 hole_off/len:1240/2072<br /><br /> > Appears to have been written to [ block 7141472 ]<br /><br />
I'vebeen staring at the code for a bit trying to guess how that could<br /> have happened.  Since the WAL record has a
backupblock, btree_xlog_insert<br /> would have passed control to RestoreBackupBlock, which would call<br />
XLogReadBufferExtendedwith mode RBM_ZERO, so there would be no complaint<br /> about writing past the end of the
relation. Now, you can imagine some<br /> very low-level error causing a write to go to the wrong page due to a seek<br
/>problem or some such, but it's hard to credit that that would've resulted<br /> in creation of all the intervening
segmentfiles.  Some level of our code<br /> had to have thought it was being told to extend the relation.<br /><br />
However,on closer inspection I was a bit surprised to realize that there<br /> are two possible candidates for doing
that! XLogReadBufferExtended will<br /> extend the relation, a block at a time, if told to write a page past<br /> the
currentnominal EOF.  And in md.c, _mdfd_getseg will *also* extend<br /> the relation if we're InRecovery, even though
itnormally would not do<br /> so when called from mdwrite().<br /><br /> Given the behavior in XLogReadBufferExtended,
Irather think that the<br /> InRecovery special case in _mdfd_getseg is dead code and should be<br /> removed.  But for
thepurpose at hand, it's more interesting to try to<br /> confirm which of these code levels did the extension.  I
noticethat<br /> _mdfd_getseg only bothers to write the last physical page of each segment,<br /> whereas
XLogReadBufferExtendedknows nothing of segments and will<br /> ploddingly write every page.  So on a filesystem that
supports"holes"<br /> in files, I'd expect that the added segments would be fully allocated<br /> if
XLogReadBufferExtendeddid the deed, but they'd be quite small if<br /> _mdfd_getseg did so.  The du results you started
withsuggest that the<br /> former is the case, but could you verify that the filesystem this is<br /> on supports holes
andthat du will report only the actually allocated<br /> space when there's a hole?<br /><br /> Assuming that the
extensionwas done in XLogReadBufferExtended, we are<br /> forced to the conclusion that XLogReadBufferExtended was
passeda bad<br /> block number (viz 7141472); and it's pretty hard to see how that could<br /> happen.
 RestoreBackupBlockis just passing the value it got out of the<br /> WAL record.  I thought about the idea that it was
wrongabout exactly<br /> where the BkpBlock struct was in the record, but that would presumably<br /> lead to garbage
relnodeand fork numbers not just a bad block number.<br /><br /> So I'm still baffled ...<br /><br />                  
     regards, tom lane<br /></blockquote></div> 

Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> One thing I keep coming back to is a bad ran chip setting a bit in the
> block number. But I just can't seem to get it to add up. The difference is
> not a power of two, it had happened on two different machines, and we don't
> see other weirdness on the machine. It seems like a strange coincidence it
> would happen to the same variable twice and not to other variables.

I also looked at the bit patterns for the two block numbers, and couldn't
detect any relationship.

> Unless there's some unrelated code writing through a wild pointer, possibly
> to a stack allocated object that just happens to often be that variable?

Yeah, I'd been wondering if the WAL record somehow got corrupted while
in memory (presumably after being CRC-checked).  It's a bit hard to see
how though.

Are all the bloated-on-the-slave relations indexes?  I think the most
fruitful thing to do at this point is to try to isolate the bloating
events for the other affected rels as you've done for this one.
Maybe we'll see a pattern.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Josh Berkus
Date:
On 01/31/2014 01:11 PM, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
>> One thing I keep coming back to is a bad ran chip setting a bit in the
>> block number. But I just can't seem to get it to add up. The difference is
>> not a power of two, it had happened on two different machines, and we don't
>> see other weirdness on the machine. It seems like a strange coincidence it
>> would happen to the same variable twice and not to other variables.
> 
> I also looked at the bit patterns for the two block numbers, and couldn't
> detect any relationship.
> 
>> Unless there's some unrelated code writing through a wild pointer, possibly
>> to a stack allocated object that just happens to often be that variable?
> 
> Yeah, I'd been wondering if the WAL record somehow got corrupted while
> in memory (presumably after being CRC-checked).  It's a bit hard to see
> how though.
> 
> Are all the bloated-on-the-slave relations indexes?  I think the most
> fruitful thing to do at this point is to try to isolate the bloating
> events for the other affected rels as you've done for this one.
> Maybe we'll see a pattern.

FWIW, we've periodically seen reports from our clients of replica
databases being slightly larger than the master.  Nothing reproducable
or as severe as Greg's issue, or we'd have reported it.  But this could
be a more widespread issue, just that it affects most users in the +5%
ballpark, so they don't notice.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Fri, Jan 31, 2014 at 10:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, I'd been wondering if the WAL record somehow got corrupted while
> in memory (presumably after being CRC-checked).  It's a bit hard to see
> how though.

One thing I mentioned early on but bears repeating is that this
instance is 9.1.11.

Also something that occurred to me at 3am -- the "reference to invalid
pages" recovery errors that replayed correctly after the panic might
also explain why the slave seems to operate correctly. It's possible
after the panic it replayed those same records correctly.

> Are all the bloated-on-the-slave relations indexes?  I think the most
> fruitful thing to do at this point is to try to isolate the bloating
> events for the other affected rels as you've done for this one.
> Maybe we'll see a pattern.

I'll poke at those others tomorrow/today. I can also try to bring up a
new standby from the same base backup but it'll take time. It's a
large database. Also the fear I have above is that if I set a recovery
target I might make it miss the bug.


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> FWIW, we've periodically seen reports from our clients of replica
> databases being slightly larger than the master.  Nothing reproducable
> or as severe as Greg's issue, or we'd have reported it.  But this could
> be a more widespread issue, just that it affects most users in the +5%
> ballpark, so they don't notice.

Well, the "bloat" aspect of it is really the least of our problems.
AFAICT what is happening is flat-out index corruption, that is, an
intended update isn't applied where it should be but to some
seemingly-random other page.  If this is common then it seems like we
ought to be hearing a lot of reports of query misbehavior on slaves.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Fri, Jan 31, 2014 at 8:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>  So on a filesystem that supports "holes"
> in files, I'd expect that the added segments would be fully allocated
> if XLogReadBufferExtended did the deed, but they'd be quite small if
> _mdfd_getseg did so.  The du results you started with suggest that the
> former is the case, but could you verify that the filesystem this is
> on supports holes and that du will report only the actually allocated
> space when there's a hole?

Yup, it's xfs:

# dd if=/dev/zero seek=1M count=1 bs=1kB of=hole
1+0 records in
1+0 records out
1000 bytes (1.0 kB) copied, 5.7286e-05 s, 17.5 MB/s

# ls -ls hole
4 -rw-r--r-- 1 root root 1048577000 Feb  1 09:35 hole

# ls -ls 1261982.5*
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 14 12:51 1261982.5
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:05 1261982.50
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:07 1261982.51
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:09 1261982.52
1048576 -rw------- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:10 1261982.53508680 -rw------- 1 ua8157t9mbut8r
postgres 520888320 Jan 23 02:12 1261982.54
 



-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
The plot thickens...

Looking at the next relation I see a similar symptom of a single valid
block at the end of several segments of nuls. This relation is also a
btree on the same table and has a header in the near vicinity of the
xlog:

d9de7pcqls4ib6=# select
(page_header(get_raw_page('event_data_event_id_person_id', 'main',
3631161))).* ;   lsn     | tli | flags | lower | upper | special | pagesize |
version | prune_xid
------------+-----+-------+-------+-------+---------+----------+---------+-----------EA1/63A0A8 |   6 |     0 |  1180 |
3552 |    8176 |     8192 |
 
4 |         0
(1 row)

And indeed it's the very next xlog record:

[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
2746914/219
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] bkpblock[1]: s/d/r:1663/16385/1364767
blk:2746914 hole_off/len:1180/2372
[cur:EA1/63A0A8, xid:1418089147, rmid:1(Transaction),
len/tot_len:32/64, info:0, prev:EA1/638988] d/s:16385/1663 commit at
2014-01-21 05:41:11 UTC



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
I've poked at this a bit more. There are at least 10 relations where
the last block doesn't match the block mentioned in the xlog record
that its LSN indicates. At least it looks like from the info xlogdump
prints.

Including two blocks where the "correct" block has the same LSN which
maybe means the record was correctlly replayed the second time. Or
perhaps it just means I'm underestimating the complexity of btree
splits.

[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] split_r:
s/d/r:1663/16385/1521566 leftsib 1697585
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[1]:
s/d/r:1663/16385/1521566 blk:1697585 hole_off/len:576/4288
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[2]:
s/d/r:1663/16385/1521566 blk:786380 hole_off/len:740/3140
relfilenode | blockno  |     lsn      | tli | flags | lower | upper |
special | pagesize | version | prune_xid
-------------+----------+--------------+-----+-------+-------+-------+---------+----------+---------+------------
1521566|  1697585 | EAD/511456E8 |   6 |     0 |   576 |  4864 |  8176 |     8192 |       4 |          0    1521566 |
1704143| EAD/511456E8 |   6 |     0 |   644 |  4456 |  8176 |     8192 |       4 |          0
 

[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree),
len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] split_r:
s/d/r:1663/16385/4995658 leftsib 139569
[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree),
len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] bkpblock[2]:
s/d/r:1663/16385/4995658 blk:18152 hole_off/len:28/8028relfilenode | blockno  |     lsn      | tli | flags | lower |
upper|
 
special | pagesize | version | prune_xid
-------------+----------+--------------+-----+-------+-------+-------+---------+----------+---------+------------
4995658|   139569 | EAD/520F2058 |   6 |     0 |   152 |  4336 |  8176 |     8192 |       4 |          0    4995658 |
139584| EAD/520F2058 |   6 |     0 |   164 |  3976 |  8176 |     8192 |       4 |          0
 



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
Hm, I'm not entirely convinced those are erroneous replays to wrong
blocks. They don't look right but there are no blocks of NULs
preceding them. So if they're wrong then they only extended the
relations by a single block.

The relfilenodes that have nul blocks before the last block are:
relfilenode | blockno  | segnum |  offset  | lsn | tli | flags |
lower | upper | special | pagesize | version | prune_xid

-------------+----------+--------+----------+-----+-----+-------+-------+-------+---------+----------+---------+-----------
  1261982 |  7141472 |     54 |  7141472 | 0/0 |   0 |     0 |
 
0 |     0 |       0 |        0 |       0 |         0     473158 | 18090059 |    138 | 18090059 | 0/0 |   0 |     0 |
0 |     0 |       0 |        0 |       0 |         0    1366221 |  2208159 |     16 |  2208159 | 0/0 |   0 |     0 |
0 |     0 |       0 |        0 |       0 |         0    1364767 |  3631161 |     27 |  3631161 | 0/0 |   0 |     0 |
0 |     0 |       0 |        0 |       0 |         0    1519286 |   311808 |      2 |   311808 | 0/0 |   0 |     0 |
0 |     0 |       0 |        0 |       0 |         0



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> The relfilenodes that have nul blocks before the last block are:

Can we see the associated WAL records (ie, the ones matching the LSNs
in the last blocks of these files)?
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> The relfilenodes that have nul blocks before the last block are:
>
> Can we see the associated WAL records (ie, the ones matching the LSNs
> in the last blocks of these files)?

Sorry, I've lost track of what information I already shared or didn't,
i've been staring at these records all day. It would be so much easier
if xlogdump was a fdw or set returning function so I could do joins
with pageinspect data:

This is the information from the final block headers:
relfilenode | blockno  | segnum |  offset  |    lsn     | tli | flags
| lower | upper | special | pagesize | version | prune_xid

-------------+----------+--------+----------+------------+-----+-------+-------+-------+---------+----------+---------+------------
   473158 | 18090059 |    138 | 18090059 | EA1/625E28 |   6 |     0
 
|   144 |   896 |    8192 |     8192 |       4 | 1401029863    1366221 |  2208159 |     16 |  2208159 | EA1/62DDF0 |
6|     0
 
|  1180 |  3552 |    8176 |     8192 |       4 |          0    1261982 |  7141472 |     54 |  7141472 | EA1/638988 |
6|     0
 
|  1240 |  3312 |    8176 |     8192 |       4 |          0    1364767 |  3631161 |     27 |  3631161 | EA1/63A0A8 |
6|     0
 
|  1180 |  3552 |    8176 |     8192 |       4 |          0    1519286 |   311808 |      2 |   311808 | EA1/708B08 |
6|     1
 
|   312 |  3040 |    8192 |     8192 |       4 |          0

These are all the records from the tx that corresponds to the lsn in
the first relfilenode:

[cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524,
info:72, prev:EA1/624080] hot_update: s/d/r:1663/16385/473158 block
9386399 off 29 to block 9386399 off 30
[cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524,
info:72, prev:EA1/624080] bkpblock[1]: s/d/r:1663/16385/473158
blk:9386399 hole_off/len:144/752
[cur:EA1/625E28, xid:1418089146, rmid:1(Transaction),
len/tot_len:32/64, info:0, prev:EA1/6240C0] d/s:16385/1663 commit at
2014-01-21 05:41:11 UTC

The middle three blocks have LSNs that are all part of the same tx,
here are all the records for this tx:

[cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441,
info:8, prev:EA1/625E68] insert: s/d/r:1663/16385/473158
blk/off:9386398/33 header: none
[cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441,
info:8, prev:EA1/625E68] bkpblock[1]: s/d/r:1663/16385/473158
blk:9386398 hole_off/len:156/1828
[cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214,
info:8, prev:EA1/625F28] insert_leaf: s/d/r:1663/16385/473176 tid
1319804/405
[cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214,
info:8, prev:EA1/625F28] bkpblock[1]: s/d/r:1663/16385/473176
blk:1319804 hole_off/len:1644/52
[cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494,
info:8, prev:EA1/627868] insert_leaf: s/d/r:1663/16385/473182 tid
1186167/147
[cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494,
info:8, prev:EA1/627868] bkpblock[1]: s/d/r:1663/16385/473182
blk:1186167 hole_off/len:1300/1772
[cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314,
info:8, prev:EA1/629898] insert_leaf: s/d/r:1663/16385/1270734 tid
1294137/2
[cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314,
info:8, prev:EA1/629898] bkpblock[1]: s/d/r:1663/16385/1270734
blk:1294137 hole_off/len:1064/2952
[cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/62B210] insert_leaf: s/d/r:1663/16385/1366221 tid
1364573/66
[cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/62B210] bkpblock[1]: s/d/r:1663/16385/1366221
blk:1364573 hole_off/len:1180/2372
[cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814,
info:8, prev:EA1/62C6E8] insert_leaf: s/d/r:1663/16385/1404440 tid
1195953/51
[cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814,
info:8, prev:EA1/62C6E8] bkpblock[1]: s/d/r:1663/16385/1404440
blk:1195953 hole_off/len:964/3452
[cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862,
info:8, prev:EA1/62DDF0] insert_leaf: s/d/r:1663/16385/1410405 tid
1894183/39
[cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862,
info:8, prev:EA1/62DDF0] bkpblock[1]: s/d/r:1663/16385/1410405
blk:1894183 hole_off/len:988/1404
[cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254,
info:8, prev:EA1/62F0D8] insert_leaf: s/d/r:1663/16385/1521566 tid
1691110/132
[cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254,
info:8, prev:EA1/62F0D8] bkpblock[1]: s/d/r:1663/16385/1521566
blk:1691110 hole_off/len:1044/1012
[cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174,
info:8, prev:EA1/630BC0] insert_leaf: s/d/r:1663/16385/5285587 tid
386419/102
[cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174,
info:8, prev:EA1/630BC0] bkpblock[1]: s/d/r:1663/16385/5285587
blk:386419 hole_off/len:1036/3092
[cur:EA1/633C68, xid:1418089147, rmid:15(Sequence),
len/tot_len:158/190, info:0, prev:EA1/632830] seq
[cur:EA1/633D28, xid:1418089147, rmid:10(Heap), len/tot_len:188/220,
info:128, prev:EA1/633C68] insert(init): s/d/r:1663/16385/16523
blk/off:12933288/1 header: t_infomask2 11 t_infomask 2051 t_hoff 32
[cur:EA1/633E08, xid:1418089147, rmid:11(Btree), len/tot_len:18/5234,
info:8, prev:EA1/633D28] insert_leaf: s/d/r:1663/16385/1261892 tid
3008033/256
[cur:EA1/633E08, xid:1418089147, rmid:11(Btree), len/tot_len:18/5234,
info:8, prev:EA1/633D28] bkpblock[1]: s/d/r:1663/16385/1261892
blk:3008033 hole_off/len:1048/3032
[cur:EA1/635290, xid:1418089147, rmid:11(Btree), len/tot_len:18/7834,
info:8, prev:EA1/633E08] insert_leaf: s/d/r:1663/16385/184672 tid
1704550/5
[cur:EA1/635290, xid:1418089147, rmid:11(Btree), len/tot_len:18/7834,
info:8, prev:EA1/633E08] bkpblock[1]: s/d/r:1663/16385/184672
blk:1704550 hole_off/len:1568/432
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
blk:3634978 hole_off/len:1240/2072
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
2746914/219
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] bkpblock[1]: s/d/r:1663/16385/1364767
blk:2746914 hole_off/len:1180/2372
[cur:EA1/63A0A8, xid:1418089147, rmid:1(Transaction),
len/tot_len:32/64, info:0, prev:EA1/638988] d/s:16385/1663 commit at
2014-01-21 05:41:11 UTC

The last block does a fair amount of traffic, these are the first four
records the first two are the only ones that touch this relfilenode:

[cur:EA1/707548, xid:1418089185, rmid:10(Heap), len/tot_len:21/5541,
info:8, prev:EA1/707488] insert: s/d/r:1663/16385/1519286
blk/off:281032/13 header: none
[cur:EA1/707548, xid:1418089185, rmid:10(Heap), len/tot_len:21/5541,
info:8, prev:EA1/707488] bkpblock[1]: s/d/r:1663/16385/1519286
blk:281032 hole_off/len:312/2728
[cur:EA1/708B08, xid:1418089185, rmid:11(Btree), len/tot_len:18/6814,
info:8, prev:EA1/707548] insert_leaf: s/d/r:1663/16385/1519292 tid
110564/335
[cur:EA1/708B08, xid:1418089185, rmid:11(Btree), len/tot_len:18/6814,
info:8, prev:EA1/707548] bkpblock[1]: s/d/r:1663/16385/1519292
blk:110564 hole_off/len:1364/1452


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Can we see the associated WAL records (ie, the ones matching the LSNs
>> in the last blocks of these files)?

> Sorry, I've lost track of what information I already shared or didn't,

Hm.  So one of these is a heap update, not an index update, which lets
out the theory that it's something specific to indexes.  But they are
all full-page-image updates, so the WAL replay code path for full-page
images still seems to be the suspect.

What version were you running before 9.1.11 exactly?  I took a look
through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
changes that seemed even vaguely related to this.  There are some
changes in known-transaction tracking, but it's hard to see a connection
there.  Most of the other diffs are in code that wouldn't execute during
WAL replay at all.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Mon, Feb 3, 2014 at 12:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What version were you running before 9.1.11 exactly?  I took a look
> through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
> changes that seemed even vaguely related to this.  There are some
> changes in known-transaction tracking, but it's hard to see a connection
> there.  Most of the other diffs are in code that wouldn't execute during
> WAL replay at all.


Both the primary and the standby were 9.1.11 from the get-go. The
database the primary was forked off of was 9.1.10 but as far as I can
tell the primary in the current pair has no problems.

What's worse is we created a new standby from the same base backup and
replayed the same records and it didn't reproduce the problem. This
means either it's a hardware problem -- but we've seen it on multiple
standbys on this database and at least one other database which is in
a different data centre -- or it's a race condition --but that's hard
to credit in the recovery code which is basically single-threaded.

And these records are from before the standby reaches a consistency so
it's hard to see how a connection from a hot standby client could
cause any kind of race condition. The only other thread that could
conceivably cause a heisenbug is the bgwriter. It's hard to imagine
how a race condition in there could be so easy to hit that it would
happen four times on one restore but otherwise go mostly unnoticed.
-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> Both the primary and the standby were 9.1.11 from the get-go. The
> database the primary was forked off of was 9.1.10 but as far as I can
> tell the primary in the current pair has no problems.

> What's worse is we created a new standby from the same base backup and
> replayed the same records and it didn't reproduce the problem. This
> means either it's a hardware problem -- but we've seen it on multiple
> standbys on this database and at least one other database which is in
> a different data centre -- or it's a race condition --but that's hard
> to credit in the recovery code which is basically single-threaded.

> And these records are from before the standby reaches a consistency so
> it's hard to see how a connection from a hot standby client could
> cause any kind of race condition. The only other thread that could
> conceivably cause a heisenbug is the bgwriter. It's hard to imagine
> how a race condition in there could be so easy to hit that it would
> happen four times on one restore but otherwise go mostly unnoticed.

I had noticed that the WAL records that were mis-replayed seemed to
be bunched pretty close together (two of them even adjacent).  Could
you confirm that?  If so, it seems like we're looking for some condition
that makes mis-replay fairly probable for a period of time, but in
itself might be quite improbable.  Not that that helps much at
nailing it down.

You might well be on to something with the bgwriter idea, considering
that none of the WAL replay code was originally written with any
concurrent execution in mind.  We might've missed some place where
additional locking is needed.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Thu, Feb 6, 2014 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I had noticed that the WAL records that were mis-replayed seemed to
> be bunched pretty close together (two of them even adjacent).  Could
> you confirm that?  If so, it seems like we're looking for some condition
> that makes mis-replay fairly probable for a period of time, but in
> itself might be quite improbable.  Not that that helps much at
> nailing it down.

Well one thing that argues for is hardware problems. It could be that
the right memory mapping just happened to line up with that variable
on the stack for that short time and then it was mapped to something
else entirely. Or the machine was overheating for a short time and
then the temperature became more reasonable. Or the person with the
x-ray source walked by in that short time window.

That doesn't explain the other instance or the other copies of this
database. I think the most productive thing I can do is switch my
attention to the other database to see if it really looks like the
same problem.

> You might well be on to something with the bgwriter idea, considering
> that none of the WAL replay code was originally written with any
> concurrent execution in mind.  We might've missed some place where
> additional locking is needed.

Except that the bgwriter has been in there for a few years already.
Unless there's been some other change, possibly involving copying some
code that was safe in some context but not where it was copied to.

The problem with the bgwriter being at fault is that from what I can
see the bgwriter will never extend a file. That means the xlog
recovery code must have done it. That means even if the bgwriter came
along and looked at the buffer we just read in it would already be too
late to cause mischief. The xlog code extends the file *first* then
reads in the backup block into a buffer. I can't see how it could
corrupt the stack or the wal recovery buffer in the window between
reading in the wal buffer and deciding to extend the relation.


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-02-06 23:41:19 +0100, Greg Stark wrote:
> The problem with the bgwriter being at fault is that from what I can
> see the bgwriter will never extend a file. That means the xlog
> recovery code must have done it. That means even if the bgwriter came
> along and looked at the buffer we just read in it would already be too
> late to cause mischief. The xlog code extends the file *first* then
> reads in the backup block into a buffer. I can't see how it could
> corrupt the stack or the wal recovery buffer in the window between
> reading in the wal buffer and deciding to extend the relation.

That's not necessarily true. If e.g. the buffer mapping would change
racily, the result write from the bgwriter could very well end up
increasing the file size, leaving a hole inbetween its write and the
original size.
Are there any other relations that are as big as the corrupted relations
got extended to?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> That's not necessarily true. If e.g. the buffer mapping would change
> racily, the result write from the bgwriter could very well end up
> increasing the file size, leaving a hole inbetween its write and the
> original size.

a) the segment isn't sparse and b) there were whole segments full of
nuls between the end of the tables and the final blocks.

So the file was definitely extended by Postgres, not the OS and the
bgwriter passes EXTENSION_FAIL which means it wouldn't create those
intervening segments.

> Are there any other relations that are as big as the corrupted relations
> got extended to?

I was wondering the same thing. But no, the extended file is 39GB
larger than the next largest relation.

Also, the final block in the first relation is clearly a version of
the same block from the correct location. Look at the ctids and the
xids on the rows for lp 3 in the attached file for example. That
second copy is from the location in the xlog record.


--
greg

Attachment

Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> That's not necessarily true. If e.g. the buffer mapping would change
>> racily, the result write from the bgwriter could very well end up
>> increasing the file size, leaving a hole inbetween its write and the
>> original size.

> a) the segment isn't sparse and b) there were whole segments full of
> nuls between the end of the tables and the final blocks.

> So the file was definitely extended by Postgres, not the OS and the
> bgwriter passes EXTENSION_FAIL which means it wouldn't create those
> intervening segments.

But ... when InRecovery, md.c will create such segments too.  We had
dismissed that on the grounds that the files would be sparse because
of the way md.c creates them.  However, it is real damn hard to see
how the loop in XLogReadBufferExtended could've accessed a bogus block,
other than hardware misfeasance which I don't believe any more than
you do.  The blkno that's passed to that function came directly out
of a WAL record that's in the private memory of the startup process
and recently passed a CRC check.  You'd have to believe some sort
of asynchronous memory clobber inside the startup process.

On the other hand, if _mdfd_getseg did the deed, there's a whole lot
more space for something funny to have happened, because now we're
talking about a buffer being written in preparation for eviction
from shared buffers, long after WAL replay filled it.

So I'm wondering if there's something wrong with our deduction from
file non-sparseness.  In this connection, google quickly found me
a report of XFS "losing" the sparse state of a file across multiple
writes:
http://oss.sgi.com/archives/xfs/2011-06/msg00225.html
I wonder whether that bug or a similar one exists in your production
kernel.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-02-06 18:42:05 -0500, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> That's not necessarily true. If e.g. the buffer mapping would change
> >> racily, the result write from the bgwriter could very well end up
> >> increasing the file size, leaving a hole inbetween its write and the
> >> original size.
> 
> > a) the segment isn't sparse and b) there were whole segments full of
> > nuls between the end of the tables and the final blocks.
> 
> > So the file was definitely extended by Postgres, not the OS and the
> > bgwriter passes EXTENSION_FAIL which means it wouldn't create those
> > intervening segments.
> 
> But ... when InRecovery, md.c will create such segments too.  We had
> dismissed that on the grounds that the files would be sparse because
> of the way md.c creates them.  However, it is real damn hard to see
> how the loop in XLogReadBufferExtended could've accessed a bogus block,
> other than hardware misfeasance which I don't believe any more than
> you do.  The blkno that's passed to that function came directly out
> of a WAL record that's in the private memory of the startup process
> and recently passed a CRC check.  You'd have to believe some sort
> of asynchronous memory clobber inside the startup process.

That reminds me, not that I directly see how it could be responsible,
there's still 20131029011623.GJ20248@awork2.anarazel.de ff. around. I
don't think we came to a agreement in that thread how to fix the
problem.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> That reminds me, not that I directly see how it could be responsible,
> there's still 20131029011623.GJ20248@awork2.anarazel.de ff. around. I
> don't think we came to a agreement in that thread how to fix the
> problem.

Hm, yeah.  I'm not sure I believe Heikki's argument that we need to avoid
closing the smgr relation.  If that stuff is being used in any
performance-critical paths then we've got trouble already.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-02-06 20:06:03 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > That reminds me, not that I directly see how it could be responsible,
> > there's still 20131029011623.GJ20248@awork2.anarazel.de ff. around. I
> > don't think we came to a agreement in that thread how to fix the
> > problem.
> 
> Hm, yeah.  I'm not sure I believe Heikki's argument that we need to avoid
> closing the smgr relation.  If that stuff is being used in any
> performance-critical paths then we've got trouble already.

Me neither. And I still am hesitant to start doing an unconditional
smgropen(rnode, InvalidBackendId), but maybe that's also misplaced. My
gut feeling would either to go with the very simple closing the smgr
(which was the case before the current form of the fake relcache
infrastructure!) or add support of unowning the smgr rel (as in
20131105193632.GD14819@awork2.anarazel.de).

After being slightly more awake it's even harder to see how it could be
the cause for this thread's problem. True, it's a rogue write into
palloc()ed memory that's used by somebody else, but afaics it will only
ever write a NULL. While not impossible it seems a bit of a stretch how
that could cause the symptoms.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Thu, Feb 6, 2014 at 11:41 PM, Greg Stark <stark@mit.edu> wrote:
>
> That doesn't explain the other instance or the other copies of this
> database. I think the most productive thing I can do is switch my
> attention to the other database to see if it really looks like the
> same problem.

So here's an instance in the other database, this one is on a
different continent from the first one so it's definitely a different
physical machine. I've had to copy the blocks over to another machine
because the database is down and still in standby mode anyways. I
don't have the xlog file yet.

Bad block's page header -- this is in the 56'th relation segment:

=# select
(page_header(E'\\x20050000583b05aa050000002800180500200420000000001098e00f2090e00f088d24061885e00f0000000000000000')).*;
  lsn      | tli | flags | lower | upper | special | pagesize |
 
version | prune_xid
--------------+-----+-------+-------+-------+---------+----------+---------+-----------520/AA053B58 |   5 |     0 |
40|  1304 |    8192 |     8192 | 4 |         0
 
(1 row)

=# select
(heap_page_items(E'\\x20050000583b05aa050000002800180500200420000000001098e00f2090e00f088d24061885e00f000....')).*;lp|
lp_off| lp_flags | lp_len | t_xmin  | t_xmax | t_field3 |
 
t_ctid    | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid

----+--------+----------+--------+---------+--------+----------+-------------+-------------+------------+--------+--------+-------
1|   6160 |        1 |   2032 | 7635393 |      0 |        0 |
 
(4773121,1) |           3 |       2306 |     24 |        | 2 |   4128 |        1 |   2032 | 7635393 |      0 |        0
|
(4773121,2) |           3 |       2306 |     24 |        | 3 |   3336 |        1 |    786 | 7635393 |      0 |        0
|
(4773121,3) |           3 |       2306 |     24 |        | 4 |   1304 |        1 |   2032 | 7635428 |      0 |        0
|
(4773121,4) |           3 |       2306 |     24 |        |
(4 rows)

Looking at the block at offset 4773121 (which is in the 36th segment):

=# select
(heap_page_items(E'\\x20050000a00a0bad050000002c00a00200200420000000001098e00f2090e00f088d24061885e00fa082ec040000000.....')).*;lp
|lp_off | lp_flags | lp_len | t_xmin  | t_xmax | t_field3 |
 
t_ctid    | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid

----+--------+----------+--------+---------+--------+----------+-------------+-------------+------------+--------+--------+-------
1|   6160 |        1 |   2032 | 7635393 |      0 |        0 |
 
(4773121,1) |           3 |       2306 |     24 |        | 2 |   4128 |        1 |   2032 | 7635393 |      0 |        0
|
(4773121,2) |           3 |       2306 |     24 |        | 3 |   3336 |        1 |    786 | 7635393 |      0 |        0
|
(4773121,3) |           3 |       2306 |     24 |        | 4 |   1304 |        1 |   2032 | 7635428 |      0 |        0
|
(4773121,4) |           3 |       2306 |     24 |        | 5 |    672 |        1 |    630 | 7635580 |      0 |        0
|
(4773121,5) |           3 |       2306 |     24 |        |
(5 rows)

d9de7pcqls4ib6=# select

(page_header(E'\\x20050000a00a0bad050000002c00a00200200420000000001098e00f2090e00f088d24061885e00fa082ec040000000000000000....')).*;
  lsn      | tli | flags | lower | upper | special | pagesize |
 
version | prune_xid
--------------+-----+-------+-------+-------+---------+----------+---------+-----------520/AD0B0AA0 |   5 |     0 |
44|   672 |    8192 |     8192 | 4 |         0
 
(1 row)



-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Sun, Feb 9, 2014 at 2:54 PM, Greg Stark <stark@mit.edu> wrote:

> Bad block's page header -- this is in the 56'th relation segment:
>
> =# select
(page_header(E'\\x20050000583b05aa050000002800180500200420000000001098e00f2090e00f088d24061885e00f0000000000000000')).*;
>      lsn      | tli | flags | lower | upper | special | pagesize |
> version | prune_xid
> --------------+-----+-------+-------+-------+---------+----------+---------+-----------
>  520/AA053B58 |   5 |     0 |    40 |  1304 |    8192 |     8192 |
>   4 |         0


> Looking at the block at offset 4773121 (which is in the 36th segment):
...
> d9de7pcqls4ib6=# select
>
(page_header(E'\\x20050000a00a0bad050000002c00a00200200420000000001098e00f2090e00f088d24061885e00fa082ec040000000000000000....')).*;
>      lsn      | tli | flags | lower | upper | special | pagesize |
> version | prune_xid
> --------------+-----+-------+-------+-------+---------+----------+---------+-----------
>  520/AD0B0AA0 |   5 |     0 |    44 |   672 |    8192 |     8192 |
>   4 |         0
> (1 row)

And I finally tracked down the xlog record for this stray write:

[cur:520/AA051FE0, xid:7635428, rmid:10(Heap), len/tot_len:21/7005,
info:8, prev:520/AA051F20] insert: s/d/r:1663/16385/16650
blk/off:4773121/4 header: none
[cur:520/AA051FE0, xid:7635428, rmid:10(Heap), len/tot_len:21/7005,
info:8, prev:520/AA051F20] bkpblock[1]: s/d/r:1663/16385/16650
blk:4773121 hole_off/len:40/1264



-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
So I think I've come up with a scenario that could cause this. I don't
think it's exactly what happened here but maybe something analogous
happened with our base backup restore.

On the primary you extend a table a bunch, including adding new
segments, but crash before committing (or checkpointing). Then some of
the blocks but not all may be written to disk. Assume they're all
written except for the last block of the first file. So what you have
is a .999G file followed by, day, 9 1G files. (Or maybe the hot backup
process could just catch the files in this state if a table is rapidly
growing and it doesn't take care to avoid picking up new files that
appear after it starts?)

smgrnblocks() stops at the first < 1GB segment and ignores the rest.
This code in xlog uses it to calculate how many blocks to add but it
only calls it once and then doesn't recheck where it's at as it
extends the relation. As soon as it adds that one missing block the
remaining files become visible. P_NEW always recalculates the position
based on smgrnblocks each time (which sounds pretty  inefficient but
anyways....) so it will add the requested blocks to the new end.

Now this isn't enough to explain things since surely the extensions
records would be in the xlog in physical order. But this could have
all happened after an earlier vacuum truncated the relation and we
could be replaying records that predate that.

So in short, if you have a 10G table and want to overwrite the last
block but the first segment is one block short then xlog will add 9G
to the end and write the block there. That sounds like what we've
seen.

I think the easy fix is to change the code in xlogutils to be more
defensive and stop as soon as it finds BufferGetBlockNumber(buffer) ==
blkno (which is what it has in the assert already).
-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
So here's my attempt to rewrite this logic. I ended up refactoring a
bit because I found it unnecessarily confusing having the mode
branches in several places. I think it's much clearer just having two
separate pieces of logic for RBM_NEW and the extension cases since all
they have in common is the ReadBuffer call.

I have to say, it scares the hell out of me that there are no
regression tests for this code. I'm certainly not comfortable
committing it without a few other people reading it if I haven't even
run the code once. At least I know it compiles...

Attachment

Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> So here's my attempt to rewrite this logic. I ended up refactoring a
> bit because I found it unnecessarily confusing having the mode
> branches in several places. I think it's much clearer just having two
> separate pieces of logic for RBM_NEW and the extension cases since all
> they have in common is the ReadBuffer call.

I don't like that at all.  It's a lot of unnecessary churn in what is
indeed hard-to-test code, and personally I don't find it clearer.
Nor, if you're going to complain about the cost of smgrnblocks, does
it seem like a great idea to be doing that *twice* per page rather
than once.

How about the attached instead?

            regards, tom lane

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4cd82df..f1918f6 100644
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
*************** XLogReadBufferExtended(RelFileNode rnode
*** 338,352 ****
          /* we do this in recovery only - no rel-extension lock needed */
          Assert(InRecovery);
          buffer = InvalidBuffer;
!         while (blkno >= lastblock)
          {
              if (buffer != InvalidBuffer)
                  ReleaseBuffer(buffer);
              buffer = ReadBufferWithoutRelcache(rnode, forknum,
                                                 P_NEW, mode, NULL);
-             lastblock++;
          }
!         Assert(BufferGetBlockNumber(buffer) == blkno);
      }

      if (mode == RBM_NORMAL)
--- 338,358 ----
          /* we do this in recovery only - no rel-extension lock needed */
          Assert(InRecovery);
          buffer = InvalidBuffer;
!         do
          {
              if (buffer != InvalidBuffer)
                  ReleaseBuffer(buffer);
              buffer = ReadBufferWithoutRelcache(rnode, forknum,
                                                 P_NEW, mode, NULL);
          }
!         while (BufferGetBlockNumber(buffer) < blkno);
!         /* Handle the corner case that P_NEW returns non-consecutive pages */
!         if (BufferGetBlockNumber(buffer) != blkno)
!         {
!             ReleaseBuffer(buffer);
!             buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
!                                                mode, NULL);
!         }
      }

      if (mode == RBM_NORMAL)

Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> So I think I've come up with a scenario that could cause this. I don't
> think it's exactly what happened here but maybe something analogous
> happened with our base backup restore.

I agree it seems like a good idea for XLogReadBufferExtended to defend
itself against successive P_NEW calls possibly not returning consecutive
pages.  However, unless you had an operating-system-level crash on the
master, this isn't sufficient to explain the problem.  We'd need also a
plausible theory about how a base backup could've left us with short
segments in a large relation.

> (Or maybe the hot backup
> process could just catch the files in this state if a table is rapidly
> growing and it doesn't take care to avoid picking up new files that
> appear after it starts?)

That's a possible explanation I guess, but it doesn't seem terribly
probable from a timing standpoint.  Also, you should be able to gauge
the probability of this theory from knowledge of the application ---
are the bloated relations all ones that would've been growing *very*
rapidly during the base backup?
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> How about the attached instead?

This does possibly allocate an extra block past the target block. I'm
not sure how surprising that would be for the rest of the code.

For what it's worth I've confirmed the bug in wal-e caused the initial
problem. But I think it's possible to occur without that bug so I
think it still needs to be addressed.


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
I wrote:
> Greg Stark <stark@mit.edu> writes:
>> (Or maybe the hot backup
>> process could just catch the files in this state if a table is rapidly
>> growing and it doesn't take care to avoid picking up new files that
>> appear after it starts?)

> That's a possible explanation I guess, but it doesn't seem terribly
> probable from a timing standpoint.

I did a bit of arithmetic using the cases you posted previously.
In the first case, where block 3634978 got written to 7141472,
you can make the numbers come out right if you assume that a page
was missing at the end of segment 1 --- that leads to the conclusion
that EOF exclusive of that missing page had been around 28.75 GB,
which squares well with the relation's size on master.  However, it's
fairly hard to credit that the base backup would have collected
full-size or nearly full-size images of segments 2 through 28 while
not seeing segment 1 at full size.  You'd have to assume that the
rel grew by a factor of ~14 while the base backup was in progress
--- and then didn't grow very much more afterwards.  (What state
exactly did you measure the primary rel sizes in?  Was it long
after the backup/restore, or did you rewind things somehow?)

The other examples seem to fit the theory a bit better, but this
one is hard to explain this way.

The other big problem for this theory is that you said in
http://www.postgresql.org/message-id/CAM-w4HPvJCBRVV3dXg8aj0WzkU08dHuX-XYbfDYQhNrn5bnTQg@mail.gmail.com

> What's worse is we created a new standby from the same base backup and
> replayed the same records and it didn't reproduce the problem.

If this were the explanation, it oughta be reproducible that way.

I still agree that XLogReadBufferExtended shouldn't be assuming that P_NEW
will not skip pages.  But I think we have another bug in here somewhere.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> How about the attached instead?

> This does possibly allocate an extra block past the target block. I'm
> not sure how surprising that would be for the rest of the code.

Should be fine; we could end up with an extra block after a failed
extension operation in any case.

> For what it's worth I've confirmed the bug in wal-e caused the initial
> problem.

Huh?  Bug in wal-e?  What bug?
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> How about the attached instead?
>
>> This does possibly allocate an extra block past the target block. I'm
>> not sure how surprising that would be for the rest of the code.
>
> Should be fine; we could end up with an extra block after a failed
> extension operation in any case.

I know it's fine on the active database, I'm not so clear whether it's
compatible with the xlog records from the primary. I suppose it'll
just see an Initialize Page record and happily see the nul block and
initialize it. It's still a bit scary.

I of course think my code is vastly clearer than the existing code but
yes, I see the "unnecessary churn" argument. I think that's the
fundamental problem with lacking tests, it makes the code get more and
more complex as we're reluctant to simplify it out of fear.

>> For what it's worth I've confirmed the bug in wal-e caused the initial
>> problem.
>
> Huh?  Bug in wal-e?  What bug?

WAL-E actually didn't restore a whole 1GB file due to a transient S3
problem, in fact a bunch of them. It's remarkable that Postgres kept
going with that much data missing. But the arithmetic worked out on
the case I checked it on, which was the last one that I just sent the
xlog record for last night. In that case there was precisely one
segment missing and the relation was extended by the number of
segments you would expect if it filled in that missing segment and
then jumped to the end of the relation.


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Greg Stark <stark@mit.edu> writes:
>>> This does possibly allocate an extra block past the target block. I'm
>>> not sure how surprising that would be for the rest of the code.

>> Should be fine; we could end up with an extra block after a failed
>> extension operation in any case.

> I know it's fine on the active database, I'm not so clear whether it's
> compatible with the xlog records from the primary. I suppose it'll
> just see an Initialize Page record and happily see the nul block and
> initialize it. It's still a bit scary.

Well, we can easily find uninitialized extra pages on the primary too,
so if WAL replay were unable to cope with that, it would be a bug
regardless.

>> Huh?  Bug in wal-e?  What bug?

> WAL-E actually didn't restore a whole 1GB file due to a transient S3
> problem, in fact a bunch of them.

Hah.  Okay, I think we can write this issue off as closed then.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
I wrote:
> Greg Stark <stark@mit.edu> writes:
>> WAL-E actually didn't restore a whole 1GB file due to a transient S3
>> problem, in fact a bunch of them.

> Hah.  Okay, I think we can write this issue off as closed then.

Oh, wait a minute.  It's not just a matter of whether we find the right
block: we also have to consider whether XLogReadBufferExtended will
apply the right "mode" behavior.  Currently, it supposes that all pages
past the initially observed EOF should be assumed to be uninitialized;
but if we're working with an inconsistent database, that seems like
an unsafe assumption.  It might be that a page is there but we've not
(yet) fixed the length of some preceding segment.  If we want to not
get bogus "WAL contains references to invalid pages" failures in such
scenarios, it seems like we need a more invasive change than what
I just committed.  I think your patch didn't cover this consideration
either.

What I think we probably want to do is forcibly cause the target page
to exist, using a P_NEW loop like what I committed, and then decide
on the basis of whether it's all-zeroes whether to consider it invalid
or not.  This seems sane on the grounds that it's just the extension
to the page level of the existing policy of creating the file whether
it existed or not.  It could only result in a large amount of wasted
work if the passed-in target block is insane --- but since we got it
out of a CRC-checked WAL record, I think it's safe to not worry too
much about that.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
I wrote:
> What I think we probably want to do is forcibly cause the target page
> to exist, using a P_NEW loop like what I committed, and then decide
> on the basis of whether it's all-zeroes whether to consider it invalid
> or not.  This seems sane on the grounds that it's just the extension
> to the page level of the existing policy of creating the file whether
> it existed or not.  It could only result in a large amount of wasted
> work if the passed-in target block is insane --- but since we got it
> out of a CRC-checked WAL record, I think it's safe to not worry too
> much about that.

Like the attached.  A possible complaint is that if the WAL sequence
contains updates against large relations that are later dropped,
this will waste time and disk space replaying those updates as best
it can.  Doesn't seem like that's a case to optimize for, however.

            regards, tom lane

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index f1918f6..7a820c0 100644
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
*************** XLogReadBuffer(RelFileNode rnode, BlockN
*** 277,297 ****
   * XLogReadBufferExtended
   *        Read a page during XLOG replay
   *
!  * This is functionally comparable to ReadBufferExtended. There's some
!  * differences in the behavior wrt. the "mode" argument:
   *
!  * In RBM_NORMAL mode, if the page doesn't exist, or contains all-zeroes, we
!  * return InvalidBuffer. In this case the caller should silently skip the
!  * update on this page. (In this situation, we expect that the page was later
!  * dropped or truncated. If we don't see evidence of that later in the WAL
!  * sequence, we'll complain at the end of WAL replay.)
   *
   * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
   * relation is extended with all-zeroes pages up to the given block number.
   *
!  * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't
!  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
!  * to imply that the page should be dropped or truncated later.
   */
  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
--- 277,307 ----
   * XLogReadBufferExtended
   *        Read a page during XLOG replay
   *
!  * This is functionally comparable to ReadBufferExtended, except that we
!  * are willing to create the target page (and indeed the whole relation)
!  * if it doesn't currently exist.  This allows safe replay of WAL sequences
!  * in which a relation was later dropped or truncated.
   *
!  * The "mode" argument provides some control over this behavior.  (See also
!  * ReadBufferExtended's specification of what the modes do.)
   *
   * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
   * relation is extended with all-zeroes pages up to the given block number.
+  * These modes should be used if the caller is going to initialize the page
+  * contents from scratch, and doesn't need it to be valid already.
   *
!  * In RBM_NORMAL mode, we similarly create the page if needed, but if the
!  * page contains all zeroes (including the case where we just created it),
!  * we return InvalidBuffer.  Then the caller should silently skip the update
!  * on this page.  This mode should be used for incremental updates where the
!  * caller needs to see a valid page.  (In this case, we expect that the page
!  * later gets dropped or truncated. If we don't see evidence of that later in
!  * the WAL sequence, we'll complain at the end of WAL replay.)
!  *
!  * RBM_NORMAL_NO_LOG mode is like RBM_NORMAL except that we will return an
!  * all-zeroes page, and not log it as one that ought to get dropped later.
!  * This mode is for when the caller wants to read a page that might validly
!  * contain zeroes.
   */
  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
*************** XLogReadBufferExtended(RelFileNode rnode
*** 299,304 ****
--- 309,315 ----
  {
      BlockNumber lastblock;
      Buffer        buffer;
+     bool        present;
      SMgrRelation smgr;

      Assert(blkno != P_NEW);
*************** XLogReadBufferExtended(RelFileNode rnode
*** 316,342 ****
       */
      smgrcreate(smgr, forknum, true);

      lastblock = smgrnblocks(smgr, forknum);

      if (blkno < lastblock)
      {
          /* page exists in file */
          buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
                                             mode, NULL);
      }
      else
      {
!         /* hm, page doesn't exist in file */
!         if (mode == RBM_NORMAL)
!         {
!             log_invalid_page(rnode, forknum, blkno, false);
!             return InvalidBuffer;
!         }
!         if (mode == RBM_NORMAL_NO_LOG)
!             return InvalidBuffer;
!         /* OK to extend the file */
          /* we do this in recovery only - no rel-extension lock needed */
          Assert(InRecovery);
          buffer = InvalidBuffer;
          do
          {
--- 327,357 ----
       */
      smgrcreate(smgr, forknum, true);

+     /*
+      * On the same principle, if the page doesn't already exist in the file,
+      * create it by extending the relation as far as needed.
+      *
+      * When we are working in a not-yet-consistent database, it's possible for
+      * P_NEW to behave somewhat inconsistently as a result of incomplete
+      * segment files.  Don't assume that the returned pages are necessarily
+      * consecutive.  When we're done with this loop, however, any segments
+      * before the target page's segment have been zero-filled until complete.
+      */
      lastblock = smgrnblocks(smgr, forknum);

      if (blkno < lastblock)
      {
          /* page exists in file */
+         present = true;
          buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
                                             mode, NULL);
      }
      else
      {
!         /* must extend the file */
          /* we do this in recovery only - no rel-extension lock needed */
          Assert(InRecovery);
+         present = false;
          buffer = InvalidBuffer;
          do
          {
*************** XLogReadBufferExtended(RelFileNode rnode
*** 352,357 ****
--- 367,374 ----
              ReleaseBuffer(buffer);
              buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
                                                 mode, NULL);
+             /* page was not in fact created by P_NEW extension */
+             present = true;
          }
      }

*************** XLogReadBufferExtended(RelFileNode rnode
*** 368,374 ****
          if (PageIsNew(page))
          {
              ReleaseBuffer(buffer);
!             log_invalid_page(rnode, forknum, blkno, true);
              return InvalidBuffer;
          }
      }
--- 385,391 ----
          if (PageIsNew(page))
          {
              ReleaseBuffer(buffer);
!             log_invalid_page(rnode, forknum, blkno, present);
              return InvalidBuffer;
          }
      }

Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Wed, Feb 12, 2014 at 8:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Oh, wait a minute.  It's not just a matter of whether we find the right
> block: we also have to consider whether XLogReadBufferExtended will
> apply the right "mode" behavior.  Currently, it supposes that all pages
> past the initially observed EOF should be assumed to be uninitialized;
> but if we're working with an inconsistent database, that seems like
> an unsafe assumption.  It might be that a page is there but we've not
> (yet) fixed the length of some preceding segment.  If we want to not
> get bogus "WAL contains references to invalid pages" failures in such
> scenarios, it seems like we need a more invasive change than what
> I just committed.  I think your patch didn't cover this consideration
> either.

Hm. I *think* those cases would be handled anyways since the table
would later be truncated. Arguably any reference after the "short"
segment is a "reference to an invalid page" since it means it's a
record which predates the records which caused the extension.
Obviously it would still give the error in the case we had where files
were missing but then probably it should give errors in that case.

-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Wed, Feb 12, 2014 at 8:28 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oh, wait a minute.  It's not just a matter of whether we find the right
>> block: we also have to consider whether XLogReadBufferExtended will
>> apply the right "mode" behavior.  Currently, it supposes that all pages
>> past the initially observed EOF should be assumed to be uninitialized;
>> but if we're working with an inconsistent database, that seems like
>> an unsafe assumption.  It might be that a page is there but we've not
>> (yet) fixed the length of some preceding segment.  If we want to not
>> get bogus "WAL contains references to invalid pages" failures in such
>> scenarios, it seems like we need a more invasive change than what
>> I just committed.  I think your patch didn't cover this consideration
>> either.

> Hm. I *think* those cases would be handled anyways since the table
> would later be truncated. Arguably any reference after the "short"
> segment is a "reference to an invalid page" since it means it's a
> record which predates the records which caused the extension.

Well, that would be the case if you assume perfectly sequential filesystem
behavior, but I'm not sure the assumption holds if the starting condition
is a base backup.  We could be looking at a version of segment 1 that
predates segment 2's existence, and yet see some data in segment 2 as
well, because it's not a perfectly coherent snapshot.

I think what you're arguing is that we should see WAL records filling the
rest of segment 1 before we see any references to segment 2, but if that's
the case then how did we get into the situation you reported?  Or is it
just that it was a broken base backup to start with?
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Andrea Suisani
Date:
Hi all,

On 02/12/2014 08:27 PM, Greg Stark wrote:
> On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Greg Stark <stark@mit.edu> writes:
>>> For what it's worth I've confirmed the bug in wal-e caused the initial
>>> problem.
>>
>> Huh?  Bug in wal-e?  What bug?
>
> WAL-E actually didn't restore a whole 1GB file due to a transient S3
> problem, in fact a bunch of them. It's remarkable that Postgres kept
> going with that much data missing. But the arithmetic worked out on
> the case I checked it on, which was the last one that I just sent the
> xlog record for last night. In that case there was precisely one
> segment missing and the relation was extended by the number of
> segments you would expect if it filled in that missing segment and
> then jumped to the end of the relation.

sorry for interrupting, but did we already notify wal-e's maintainer?

Andrea

ps cc:ed Daniel






Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
<p dir="ltr"><br /><p dir="ltr">> I think what you're arguing is that we should see WAL records filling the<br />
>rest of segment 1 before we see any references to segment 2, but if that's<br /> > the case then how did we get
intothe situation you reported?  Or is it<br /> > just that it was a broken base backup to start with?<p
dir="ltr">Thescenario I could come up with that didn't require a broken base backup was that there was an earlier
truncateor vacuum. So the sequence is high offset reference, truncate, growth, crash. All possibly on a single
database.<pdir="ltr">It's possible we're better off not assuming we've thought of all possible ways this can happen
though.

Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
>> I think what you're arguing is that we should see WAL records filling the
>> rest of segment 1 before we see any references to segment 2, but if that's
>> the case then how did we get into the situation you reported?  Or is it
>> just that it was a broken base backup to start with?

> The scenario I could come up with that didn't require a broken base backup
> was that there was an earlier truncate or vacuum. So the sequence is high
> offset reference, truncate, growth, crash. All possibly on a single
> database.

That's not really an issue, because then it would be OK to discard the
high-offset update; we'd recognize that as safe when we replay the
truncation.

> It's possible we're better off not assuming we've thought of all possible
> ways this can happen though.

That's what's bothering me, too.  On the other hand, if we can't think of
a scenario where it'd be necessary to replay the high-offset update, then
I'm disinclined to mess with the code further.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Thu, Feb 13, 2014 at 7:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The scenario I could come up with that didn't require a broken base backup
>> was that there was an earlier truncate or vacuum. So the sequence is high
>> offset reference, truncate, growth, crash. All possibly on a single
>> database.
>
> That's not really an issue, because then it would be OK to discard the
> high-offset update; we'd recognize that as safe when we replay the
> truncation.

Yeah, that's my point.

>> It's possible we're better off not assuming we've thought of all possible
>> ways this can happen though.
>
> That's what's bothering me, too.  On the other hand, if we can't think of
> a scenario where it'd be necessary to replay the high-offset update, then
> I'm disinclined to mess with the code further.

And the whole point of the undefined page error checking is to detect
cases like this, so covering them up in the name of possible edge
cases we haven't thought of kind of defeats the purpose. In particular
I would have liked to get errors rather than soldier on when the
database found these missing segments.

In that vein, the other possibly open question was how we got past the
"undefined pages" errors that we did see. Andres said he thought that
was due to the bug where some piece of code was mistakenly using the
presence of a snapshot but I'm not clear how that can cause this
though.


-- 
greg



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> On Thu, Feb 13, 2014 at 7:52 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> That's what's bothering me, too.  On the other hand, if we can't think of
>> a scenario where it'd be necessary to replay the high-offset update, then
>> I'm disinclined to mess with the code further.

> And the whole point of the undefined page error checking is to detect
> cases like this, so covering them up in the name of possible edge
> cases we haven't thought of kind of defeats the purpose.

Yeah, good point.  Let's leave it alone and see if the existing fix
helps.
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
Going over this I think this is still a potential issue:

On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote:

I am not sure that explains the issue, but I think the redo action for
truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
do a smgrtruncate() (and then mdtruncate) which will iterate over the
segments starting at 0 till mdnblocks()/segment_size and *truncate* but
not delete individual segment files that are not needed anymore, right?
If we crash in the midst of that a new mdtruncate() will be issued, but
it will get a shorter value back from mdnblocks().

Am I missing something?

I'm not too familiar with md.c but my reading of the code is that we truncate the files in reverse order? In which case I think the code is safe *iff* the filesystem guarantees ordered meta data writes which I tihnk ext3 does (I think in all the journal modes). Most filesystems meta data writes are synchronous so the truncates are safe for them too.

But we don't generally rely on meta data writes being ordered. I think the "correct" thing to do is to record the nblocks prior to the truncate and then have md.c expose a new function that takes that parameter and pokes around looking for any segments it might need to clean up. But that would involve lots of abstraction violations in md.c. I think using nblocks would keep the violations within md.c but that still seems like a pain.

Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-02-14 20:46:01 +0000, Greg Stark wrote:
> Going over this I think this is still a potential issue:
> 
> On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote:
> 
> >
> > I am not sure that explains the issue, but I think the redo action for
> > truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
> > do a smgrtruncate() (and then mdtruncate) which will iterate over the
> > segments starting at 0 till mdnblocks()/segment_size and *truncate* but
> > not delete individual segment files that are not needed anymore, right?
> > If we crash in the midst of that a new mdtruncate() will be issued, but
> > it will get a shorter value back from mdnblocks().
> >
> > Am I missing something?
> >
> 
> I'm not too familiar with md.c but my reading of the code is that we
> truncate the files in reverse order?

That's what I had assumed as well, but it doesn't look that way:
   v = mdopen(reln, forknum, EXTENSION_FAIL);
   priorblocks = 0; /* <- initial value */   while (v != NULL)   {       MdfdVec    *ov = v;
       if (priorblocks > nblocks)       {           /* truncate entire segment */       }       else if (priorblocks +
((BlockNumber)RELSEG_SIZE) > nblocks)       {           /* truncate entire segment */       }       else       {
  /* nothing to do, all needed */       }       priorblocks += RELSEG_SIZE;   }
 

So, according to that we start at segment 0, right?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-02-14 20:46:01 +0000, Greg Stark wrote:
>> Going over this I think this is still a potential issue:
>> On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote:
>>> I am not sure that explains the issue, but I think the redo action for
>>> truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
>>> do a smgrtruncate() (and then mdtruncate) which will iterate over the
>>> segments starting at 0 till mdnblocks()/segment_size and *truncate* but
>>> not delete individual segment files that are not needed anymore, right?
>>> If we crash in the midst of that a new mdtruncate() will be issued, but
>>> it will get a shorter value back from mdnblocks().

>> I'm not too familiar with md.c but my reading of the code is that we
>> truncate the files in reverse order?

> That's what I had assumed as well, but it doesn't look that way:

No, it's deleting forward.

We could probably fix things so it deleted backwards; it'd be a tad
tedious because the list structure isn't organized that way, but we
could do it.  Not sure if that's good enough though.  If you don't
want to assume the filesystem metadata is coherent after a crash,
we might have nonzero-size segments after zero-size ones, even if
the truncate calls had been issued in the right order.

Another possibility is to keep opening and truncating files until
we don't find the next segment in sequence, looking directly at the
filesystem not at the mdfd chain.  I don't think this would be
appropriate in normal operation, but we could do it if InRecovery
(and maybe even only if we don't think the database is consistent?)
        regards, tom lane



Re: Recovery inconsistencies, standby much larger than primary

From
Andres Freund
Date:
On 2014-02-14 22:30:45 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-02-14 20:46:01 +0000, Greg Stark wrote:
> >> Going over this I think this is still a potential issue:
> >> On 31 Jan 2014 15:56, "Andres Freund" <andres@2ndquadrant.com> wrote:
> >>> I am not sure that explains the issue, but I think the redo action for
> >>> truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
> >>> do a smgrtruncate() (and then mdtruncate) which will iterate over the
> >>> segments starting at 0 till mdnblocks()/segment_size and *truncate* but
> >>> not delete individual segment files that are not needed anymore, right?
> >>> If we crash in the midst of that a new mdtruncate() will be issued, but
> >>> it will get a shorter value back from mdnblocks().

> We could probably fix things so it deleted backwards; it'd be a tad
> tedious because the list structure isn't organized that way, but we
> could do it.

We could just make the list a doubly linked one, that'd make it simple.

> Not sure if that's good enough though.  If you don't
> want to assume the filesystem metadata is coherent after a crash,
> we might have nonzero-size segments after zero-size ones, even if
> the truncate calls had been issued in the right order.

I don't think that can actually happen on any realistic/interesting
FS. Metadata updates better be journaled, so while they might not
persist because the journal wasn't flushed, they should be applied in a
sane order after a crash.
But nonetheless I am not sure we want to rely on that.

> Another possibility is to keep opening and truncating files until
> we don't find the next segment in sequence, looking directly at the
> filesystem not at the mdfd chain.  I don't think this would be
> appropriate in normal operation, but we could do it if InRecovery
> (and maybe even only if we don't think the database is consistent?)

Yes, I was thinking of simply having a mdnblocks() variant that looks
for the last existing file, disregarding the size. But looking around,
it seems mdunlinkfork() has a similar issue, and I don't see how such a
trick could be applied there :(

I guess the theoretically correct thing would be to make all WAL records
about truncation and unlinking contain the current size of the relation,
but especially with deletions and forks that will probably turn out to
be annoying to do.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Recovery inconsistencies, standby much larger than primary

From
Greg Stark
Date:
On Sat, Feb 15, 2014 at 11:45 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I guess the theoretically correct thing would be to make all WAL records
> about truncation and unlinking contain the current size of the relation,
> but especially with deletions and forks that will probably turn out to
> be annoying to do.

Here's another alternative.

In md.c when extending a file to RELSEG_SIZE always check if the next
segment is already there and truncate it if it is to avoid magically
slurping in that data. That maintains the invariant that the first
short segment will mark the end of the relation. If you have a short
or missing segment then you'll ignore all the later segments.

I think to make this work you would have to sync the newly truncated
segment first before extending the current segment though. And this
would cause every relation extension to do an extra filesystem lookup.
Perhaps only doing this in recovery or *with assertions enabled?)
would mitigate that cost.

This makes a mockery of the comment in xlogutils.c that we would
rather not lose data in the case of a lost inode. But I feel like the
data in the later segments was already lost before the earlier segment
was filled up, it hardly helps matters if it can sometimes be unlost
if the earlier data happens to get written to in a particular pattern.


-- 
greg