Thread: WAL consistency check facility

WAL consistency check facility

From
Kuntal Ghosh
Date:
Hi,

I've attached a patch to check if the current page is equal with the
FPW after applying WAL on it. This is how the patch works:

1. When a WAL record is inserted, a FPW is done for that operation.
But, a flag is kept to  indicate whether that page needs to be
restored.
2. During recovery, when a redo operation is done, we do a comparison
with the FPW contained in the WAL record with the current page in the
buffer. For this purpose, I've used Michael's patch with minor changes
to check whether two pages are actually equal or not.
3. I've also added a guc variable (wal_consistency_mask) to indicate
the operations (HEAP,BTREE,HASH,GIN etc) for which this feature
(always FPW and consistency check) is to be enabled.

How to use the patch:
1. Apply the patch.
2. In postgresql.conf file, set wal_consistency_mask variable
accordingly. For debug messages, set log_min_messages = debug1.

Michael's patch:
https://www.postgresql.org/message-id/CAB7nPqR4vxdKijP%2BDu82vOcOnGMvutq-gfqiU2dsH4bsM77hYg%40mail.gmail.com

Reference thread:

https://www.postgresql.org/message-id/flat/CAB7nPqR4vxdKijP%2BDu82vOcOnGMvutq-gfqiU2dsH4bsM77hYg%40mail.gmail.com#CAB7nPqR4vxdKijP+Du82vOcOnGMvutq-gfqiU2dsH4bsM77hYg@mail.gmail.com

Please let me know your thoughts on this.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WAL consistency check facility

From
Michael Paquier
Date:
On Mon, Aug 22, 2016 at 9:44 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> Please let me know your thoughts on this.

Since custom AMs have been introduced, I have kept that in a corner of
my mind and thought about it a bit. And while the goal of this patch
is clearly worth it, I don't think that the page masking interface is
clear at all. For example, your patch completely ignores
contrib/bloom, and we surely want to do something about it. The idea
would be to add a page masking routine in IndexAmRoutine and heap to
be able to perform page masking operations directly with that. This
would allow as well one to be able to perform page masking for bloom
or any custom access method, and this will allow this sanity check to
be more generic as well.

Another pin-point is: given a certain page, how do we identify of
which type it is? One possibility would be again to extend the AM
handler with some kind of is_self function with a prototype like that:
bool handler->is_self(Page);
If the page is of the type of the handler, this returns true, and
false otherwise. Still here performance would suck.

At the end, what we want is a clean interface, and more thoughts into it.
-- 
Michael



Re: WAL consistency check facility

From
Robert Haas
Date:
On Mon, Aug 22, 2016 at 9:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Another pin-point is: given a certain page, how do we identify of
> which type it is? One possibility would be again to extend the AM
> handler with some kind of is_self function with a prototype like that:
> bool handler->is_self(Page);
> If the page is of the type of the handler, this returns true, and
> false otherwise. Still here performance would suck.
>
> At the end, what we want is a clean interface, and more thoughts into it.

I think that it makes sense to filter based on the resource manager
ID, but I can't see how we could reasonably filter based on the AM
name.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Simon Riggs
Date:
On 22 August 2016 at 13:44, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

> Please let me know your thoughts on this.

Do the regression tests pass with this option enabled?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Mon, Aug 22, 2016 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 22, 2016 at 9:25 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Another pin-point is: given a certain page, how do we identify of
>> which type it is? One possibility would be again to extend the AM
>> handler with some kind of is_self function with a prototype like that:
>> bool handler->is_self(Page);
>> If the page is of the type of the handler, this returns true, and
>> false otherwise. Still here performance would suck.
>>
>> At the end, what we want is a clean interface, and more thoughts into it.
>
> I think that it makes sense to filter based on the resource manager
> ID
>

+1.

I think the patch currently addresses only a subset of resource
manager id's (mainly Heap and Index resource manager id's).  Do we
want to handle the complete resource manager list as defined in
rmgrlist.h?

Another thing that needs some thoughts is the UI of this patch,
currently it is using integer mask which might not be best way, but
again as it is intended to be mainly used for tests, it might be okay.

Do we want to enable some tests in the regression suite by using this option?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Yes, I've verified the outputs and log contents after running gmake
installcheck and gmake installcheck-world. The status of the test was
marked as pass for all the testcases.


On Mon, Aug 22, 2016 at 9:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 22 August 2016 at 13:44, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
>> Please let me know your thoughts on this.
>
> Do the regression tests pass with this option enabled?
>
> --
> Simon Riggs                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Thanks & Regards,
Kuntal Ghosh



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Tue, Aug 23, 2016 at 1:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Aug 22, 2016 at 9:16 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Aug 22, 2016 at 9:25 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Another pin-point is: given a certain page, how do we identify of
>>> which type it is? One possibility would be again to extend the AM
>>> handler with some kind of is_self function with a prototype like that:
>>> bool handler->is_self(Page);
>>> If the page is of the type of the handler, this returns true, and
>>> false otherwise. Still here performance would suck.
>>>
>>> At the end, what we want is a clean interface, and more thoughts into it.
>>
>> I think that it makes sense to filter based on the resource manager
>> ID
>>
>
> +1.

Yes actually that's better. That's simple enough and removes any need
to looking at pd_special.

> I think the patch currently addresses only a subset of resource
> manager id's (mainly Heap and Index resource manager id's).  Do we
> want to handle the complete resource manager list as defined in
> rmgrlist.h?

Not all of them generate FPWs. I don't think it matters much.

> Another thing that needs some thoughts is the UI of this patch,
> currently it is using integer mask which might not be best way, but
> again as it is intended to be mainly used for tests, it might be okay.

What we'd want to have is a way to visualize easily differences of
pages. Any other ideas than MASK_MARKER would be welcome of course.

> Do we want to enable some tests in the regression suite by using this option?

We could get out a recovery test that sets up a standby/master and
runs the tests of src/test/regress with pg_regress with this parameter
enabled.

+ * bufmask.c
+ *      Routines for buffer masking, used to ensure that buffers used for
+ *      comparison across nodes are in a consistent state.
+ *
+ * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
Copyright notices need to be updated. (It's already been 2 years!!)

Also, what's the use case of allowing only a certain set of rmgrs to
be checked. Wouldn't a simple on/off switch be simpler? As presented,
wal_consistency_mask is also going to be very quite confusing for
users. You should not need to apply some maths to set up this
parameter, a list of rmgr names may be more adapted if this level of
tuning is needed, still it seems to me that we don't need this much.
-- 
Michael



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Tue, Aug 23, 2016 at 10:57 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
> Also, what's the use case of allowing only a certain set of rmgrs to
> be checked. Wouldn't a simple on/off switch be simpler?
>

I think there should be a way to test WAL for one particular resource
manager.  For example, if someone develops a new index or some other
heap storage, only that particular module can be verified.  Generating
WAL for all the resource managers together can also serve the purpose,
but it will be slightly difficult to verify it.

> As presented,
> wal_consistency_mask is also going to be very quite confusing for
> users. You should not need to apply some maths to set up this
> parameter, a list of rmgr names may be more adapted if this level of
> tuning is needed,
>

Yeah, that can be better.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Simon Riggs
Date:
On 22 August 2016 at 16:56, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 22 August 2016 at 13:44, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
>> Please let me know your thoughts on this.
>
> Do the regression tests pass with this option enabled?

Hi,

I'd like to be a reviewer on this. Please can you add this onto the CF
app so we can track the review?

Please supply details of the testing and test coverage.

Thanks

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Hi,

I've added the feature in CP app. Following are the testing details:

1. In master, I've enabled following configurations:

* wal_level = replica
* max_wal_senders = 3
* wal_keep_segments = 4000
* hot_standby = on
* wal_consistency_mask = 511  /* Enable consistency check mask bit*/

2. In slave, I've enabled following configurations:

* standby_mode = on
* wal_consistency_mask = 511 /* Enable consistency check mask bit*/

3. Then, I performed gmake installcheck in master. I didn't get any
warning regarding WAL inconsistency in slave.

I've made following changes to the attached patch:

1. For BRIN pages, I've masked the unused space, PD_PAGE_FULL and
PD_HAS_FREE_LINES flags.
2. For Btree pages, I've masked BTP_HALF_DEAD, BTP_SPLIT_END,
BTP_HAS_GARBAGE and BTP_INCOMPLETE_SPLIT flags.
3. For GIN_DELETED page, I've masked the entire page since the page is
always initialized during recovery.
4. For Speculative Heap tuple insert operation, there was
inconsistency in t_ctid value. So, I've modified the t_ctid value (in
backup page) to current block number and offset number. Need
suggestions!!


What needs to be done:
1. Add support for other Resource Managers.
2. Modify masking techniques for existing Resource Managers (if required).
3. Modify the GUC parameter which will accept a list of rmgr names.
4. Modify the technique for identifying rmgr names for which the
feature should be enabled.
5. Generalize the page type identification technique.


On Wed, Aug 24, 2016 at 2:14 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 22 August 2016 at 16:56, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 22 August 2016 at 13:44, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>
>>> Please let me know your thoughts on this.
>>
>> Do the regression tests pass with this option enabled?
>
> Hi,
>
> I'd like to be a reviewer on this. Please can you add this onto the CF
> app so we can track the review?
>
> Please supply details of the testing and test coverage.
>
> Thanks
>
> --
> Simon Riggs                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Thanks & Regards,
Kuntal Ghosh

Attachment

Re: WAL consistency check facility

From
Alvaro Herrera
Date:
Kuntal Ghosh wrote:

> 4. For Speculative Heap tuple insert operation, there was
> inconsistency in t_ctid value. So, I've modified the t_ctid value (in
> backup page) to current block number and offset number. Need
> suggestions!!

In speculative insertions, t_ctid is used to store the speculative
token.  I think you should just mask that field out in that case (which
you can recognize because ip_posid is set to magic value 0xfffe).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Thanks a lot.

I just want to mention the situation where I was getting the
speculative token related inconsistency.

ItemPointer in backup page from master:
LOG:  ItemPointer BlockNumber: 1 OffsetNumber:65534 Speculative: true
CONTEXT:  xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

ItemPointer in current page from slave after redo:
LOG:  ItemPointer BlockNumber: 0 OffsetNumber:1 Speculative: false
CONTEXT:  xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

As the block numbers are different, I was getting the following warning:
WARNING:  Inconsistent page (at byte 8166) found for record
0/127F4A48, rel 1663/16384/16946, forknum 0, blkno 0, Backup Page
Header : (pd_lower: 28 pd_upper: 8152 pd_special: 8192) Current Page
Header: (pd_lower: 28 pd_upper: 8152 pd_special: 8192)
CONTEXT:  xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1

In heap_xlog_insert, t_ctid is always set to blkno and xlrec->offnum.
I think this is why I was getting the above warning.


On Thu, Aug 25, 2016 at 10:33 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Kuntal Ghosh wrote:
>
>> 4. For Speculative Heap tuple insert operation, there was
>> inconsistency in t_ctid value. So, I've modified the t_ctid value (in
>> backup page) to current block number and offset number. Need
>> suggestions!!
>
> In speculative insertions, t_ctid is used to store the speculative
> token.  I think you should just mask that field out in that case (which
> you can recognize because ip_posid is set to magic value 0xfffe).
>
> --
> Álvaro Herrera                http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Thanks & Regards,
Kuntal Ghosh



Re: WAL consistency check facility

From
Alvaro Herrera
Date:
Kuntal Ghosh wrote:
> Thanks a lot.
> 
> I just want to mention the situation where I was getting the
> speculative token related inconsistency.
> 
> ItemPointer in backup page from master:
> LOG:  ItemPointer BlockNumber: 1 OffsetNumber:65534 Speculative: true
> CONTEXT:  xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1
> 
> ItemPointer in current page from slave after redo:
> LOG:  ItemPointer BlockNumber: 0 OffsetNumber:1 Speculative: false
> CONTEXT:  xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1
> 
> As the block numbers are different, I was getting the following warning:
> WARNING:  Inconsistent page (at byte 8166) found for record
> 0/127F4A48, rel 1663/16384/16946, forknum 0, blkno 0, Backup Page
> Header : (pd_lower: 28 pd_upper: 8152 pd_special: 8192) Current Page
> Header: (pd_lower: 28 pd_upper: 8152 pd_special: 8192)
> CONTEXT:  xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1
> 
> In heap_xlog_insert, t_ctid is always set to blkno and xlrec->offnum.
> I think this is why I was getting the above warning.

Umm, really?  Then perhaps this *is* a bug.  Peter?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Simon Riggs
Date:
Hi Kuntal,

Thanks for the patch.

Current patch has no docs, no tests and no explanatory comments, so
makes review quite hard.

The good news is you might discover a few bugs with it, so its worth
pursuing actively in this CF, though its not near to being
committable.

I think you should add this as part of the default testing for both
check and installcheck. I can't imagine why we'd have it and not use
it during testing.


On 25 August 2016 at 18:41, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/

What does this mean? (No docs)


> What needs to be done:
> 1. Add support for other Resource Managers.

We probably need to have a discussion as to why you think this should
be Rmgr dependent?
Code comments would help there.

If it does, then you should probably do this by extending RmgrTable
with an rm_check, so you can call it like this...

RmgrTable[record->xl_rmid].rm_check

I'm interested in how we handle the new generic WAL format for blocks.
Surely if we can handle that then we won't need an Rmgr dependency?
I'm sure you have reasons, they just need to be explained long hand -
don't assume anything.


> 5. Generalize the page type identification technique.

Why not do this first?

There are some coding guideline stuff to check as well.

Thanks

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Fri, Aug 26, 2016 at 9:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> I think you should add this as part of the default testing for both
> check and installcheck. I can't imagine why we'd have it and not use
> it during testing.
>

The actual consistency checks are done during redo (replay), so not
sure whats in you mind for enabling it with check or installcheck.  I
think we can run few recovery/replay tests with this framework.  Also
running the tests under this framework could be time-consuming as it
logs the entire page for each WAL record we write and then during
replay reads the same.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Simon Riggs
Date:
On 27 August 2016 at 07:36, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Aug 26, 2016 at 9:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>
>> I think you should add this as part of the default testing for both
>> check and installcheck. I can't imagine why we'd have it and not use
>> it during testing.
>>
>
> The actual consistency checks are done during redo (replay), so not
> sure whats in you mind for enabling it with check or installcheck.  I
> think we can run few recovery/replay tests with this framework.  Also
> running the tests under this framework could be time-consuming as it
> logs the entire page for each WAL record we write and then during
> replay reads the same.

I'd like to see an automated test added so we can be certain we don't
add things that break recovery. Don't mind much where or how.

The main use is to maintain that certainty while in production.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Hello Simon,

I'm really sorry for the inconveniences. Next time, I'll attach the
patch with proper documentation, test and comments.

> I think you should add this as part of the default testing for both
> check and installcheck. I can't imagine why we'd have it and not use
> it during testing.

Since, this is redo(replay) feature, we can surely add this in
installcheck. But, as Amit mentioned, it could be time-consuming.

>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>
> What does this mean? (No docs)

I was using this parameter as a masking integer to indicate the
operations(rmgr list) for which we need this feature to be enabled.
Since, this could be confusing, I've changed it accordingly so that it
accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

>> 1. Add support for other Resource Managers.
>
> We probably need to have a discussion as to why you think this should
> be Rmgr dependent?
> Code comments would help there.
>
> If it does, then you should probably do this by extending RmgrTable
> with an rm_check, so you can call it like this...
>
> RmgrTable[record->xl_rmid].rm_check

+1.
I'm modifying it accordingly. I'm calling this function after
RmgrTable[record->xl_rmid].rm_redo.

>> 5. Generalize the page type identification technique.
>
> Why not do this first?
>

At present, I'm using special page size and page ID to identify page
type. But, I've noticed some cases where the entire page is
initialized to zero (Ex: hash_xlog_squeeze_page). RmgrID and info bit
can help us to identify those pages.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Sat, Aug 27, 2016 at 6:16 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 27 August 2016 at 07:36, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, Aug 26, 2016 at 9:26 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>>
>>> I think you should add this as part of the default testing for both
>>> check and installcheck. I can't imagine why we'd have it and not use
>>> it during testing.
>>>
>>
>> The actual consistency checks are done during redo (replay), so not
>> sure whats in you mind for enabling it with check or installcheck.  I
>> think we can run few recovery/replay tests with this framework.  Also
>> running the tests under this framework could be time-consuming as it
>> logs the entire page for each WAL record we write and then during
>> replay reads the same.
>
> I'd like to see an automated test added so we can be certain we don't
> add things that break recovery. Don't mind much where or how.
>
> The main use is to maintain that certainty while in production.

For developers, having extra checks with the new routines in WAL_DEBUG
could also be useful for a code path producing WAL. Let's not forget
that as well.
-- 
Michael



Re: WAL consistency check facility

From
Peter Geoghegan
Date:
On Fri, Aug 26, 2016 at 7:24 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> As the block numbers are different, I was getting the following warning:
>> WARNING:  Inconsistent page (at byte 8166) found for record
>> 0/127F4A48, rel 1663/16384/16946, forknum 0, blkno 0, Backup Page
>> Header : (pd_lower: 28 pd_upper: 8152 pd_special: 8192) Current Page
>> Header: (pd_lower: 28 pd_upper: 8152 pd_special: 8192)
>> CONTEXT:  xlog redo at 0/127F4A48 for Heap/INSERT+INIT: off 1
>>
>> In heap_xlog_insert, t_ctid is always set to blkno and xlrec->offnum.
>> I think this is why I was getting the above warning.
>
> Umm, really?  Then perhaps this *is* a bug.  Peter?

It's a matter of perspective, but probably not. The facts are:

* heap_insert() treats speculative insertions differently. In
particular, it does not set ctid in the caller-passed heap tuple
itself, leaving the ctid field to contain a speculative token value --
a per-backend monotonically increasing identifier. This identifier
represents some particular speculative insertion attempt within a
backend.

* On the redo side, heap_xlog_insert() was only changed mechanically
when upsert went in. So, it doesn't actually care about the stuff that
heap_insert() was made to care about to support speculative insertion.

* Furthermore, heap_insert() does *not* WAL log ctid under any
circumstances (that didn't change, either). Traditionally, the ctid
field was completely redundant anyway (since, of course, we're only
dealing with newly inserted tuples in heap_insert()). With speculative
insertions, there is a token within ctid, whose value represents
actual information that cannot be reconstructed after the fact (the
speculative token value). Even still, that isn't WAL-logged (see
comments above xl_heap_header struct). That's okay, because the
speculative insertion token value is only needed due to obscure issues
with "unprincipled deadlocks". The speculative token value itself is
only of interest to other, conflicting inserters, and only for the
instant in time immediately following physical insertion. The token
doesn't matter in the slightest to crash recovery, nor to Hot Standby
replicas.

While this design had some really nice properties (ask me if you are
unclear on this), it does break tools like the proposed WAL-checker
tool. I would compare this speculative token situation to the
situation with hint bits (when checksums are disabled, and
wal_log_hints = off).

I actually have a lot of sympathy for the idea that, in general, cases
like this should be avoided. But, would it really be worth it to
create a useless special case for speculative insertion (to WAL-log
the virtually useless speculative insertion token value)? I'm certain
that the answer must be "no": This tool ought to deal with speculative
insertion as a special case, and not vice-versa.

-- 
Peter Geoghegan



Re: WAL consistency check facility

From
Peter Geoghegan
Date:
On Thu, Aug 25, 2016 at 9:41 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> 2. For Btree pages, I've masked BTP_HALF_DEAD, BTP_SPLIT_END,
> BTP_HAS_GARBAGE and BTP_INCOMPLETE_SPLIT flags.

Why? I think that you should only perform this kind of masking where
it's clearly strictly necessary.

It is true that nbtree can allow cases where LP_DEAD is set with only
a share lock (by read-only queries), so I can see why BTP_HAS_GARBAGE
might need to be excluded; this is comparable to the heapam's use of
hint bits. However, it is unclear why you need to mask the remaining
btpo_flags that you list, because the other flags have clear-cut roles
in various atomic operations that we WAL-log.

-- 
Peter Geoghegan



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Sun, Aug 28, 2016 at 6:26 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Aug 25, 2016 at 9:41 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> 2. For Btree pages, I've masked BTP_HALF_DEAD, BTP_SPLIT_END,
>> BTP_HAS_GARBAGE and BTP_INCOMPLETE_SPLIT flags.
>
> Why? I think that you should only perform this kind of masking where
> it's clearly strictly necessary.
>
> It is true that nbtree can allow cases where LP_DEAD is set with only
> a share lock (by read-only queries), so I can see why BTP_HAS_GARBAGE
> might need to be excluded; this is comparable to the heapam's use of
> hint bits. However, it is unclear why you need to mask the remaining
> btpo_flags that you list, because the other flags have clear-cut roles
> in various atomic operations that we WAL-log.
>

Right, I think there is no need to mask all the flags.  However apart
from BTP_HAS_GARBAGE, it seems we should mask BTP_SPLIT_END as that is
just used to save some processing for vaccum and won't be set after
crash recovery or on standby after WAL replay.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Peter Geoghegan
Date:
On Sat, Aug 27, 2016 at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Right, I think there is no need to mask all the flags.  However apart
> from BTP_HAS_GARBAGE, it seems we should mask BTP_SPLIT_END as that is
> just used to save some processing for vaccum and won't be set after
> crash recovery or on standby after WAL replay.

Right you are -- while BTP_INCOMPLETE_SPLIT is set during recovery,
BTP_SPLIT_END is not. Still, most of the btpo_flags flags that are
masked in the patch shouldn't be.


-- 
Peter Geoghegan



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Thank you. I've updated it accordingly.

On Sun, Aug 28, 2016 at 11:20 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sat, Aug 27, 2016 at 9:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Right, I think there is no need to mask all the flags.  However apart
>> from BTP_HAS_GARBAGE, it seems we should mask BTP_SPLIT_END as that is
>> just used to save some processing for vaccum and won't be set after
>> crash recovery or on standby after WAL replay.
>
> Right you are -- while BTP_INCOMPLETE_SPLIT is set during recovery,
> BTP_SPLIT_END is not. Still, most of the btpo_flags flags that are
> masked in the patch shouldn't be.
>
>
> --
> Peter Geoghegan



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Simon Riggs
Date:
On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

>>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>
>> What does this mean? (No docs)
>
> I was using this parameter as a masking integer to indicate the
> operations(rmgr list) for which we need this feature to be enabled.
> Since, this could be confusing, I've changed it accordingly so that it
> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

Why would we want that?


>>> 1. Add support for other Resource Managers.
>>
>> We probably need to have a discussion as to why you think this should
>> be Rmgr dependent?
>> Code comments would help there.
>>
>> If it does, then you should probably do this by extending RmgrTable
>> with an rm_check, so you can call it like this...
>>
>> RmgrTable[record->xl_rmid].rm_check
>
> +1.
> I'm modifying it accordingly. I'm calling this function after
> RmgrTable[record->xl_rmid].rm_redo.
>
>>> 5. Generalize the page type identification technique.
>>
>> Why not do this first?
>>
>
> At present, I'm using special page size and page ID to identify page
> type. But, I've noticed some cases where the entire page is
> initialized to zero (Ex: hash_xlog_squeeze_page). RmgrID and info bit
> can help us to identify those pages.

I'd prefer a solution that was not dependent upon RmgrID at all.

If there are various special cases that we need to cater for, ISTM
they would be flaws in the existing WAL implementation rather than
anything we would want to perpetuate. I hope we'll spend time fixing
them rather than add loads of weird code to work around the
imperfections.

Underdocumented special case code is going to be unbelievably
difficult to get right in the long term.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Aug 31, 2016 at 10:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
>>>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>>
>>> What does this mean? (No docs)
>>
>> I was using this parameter as a masking integer to indicate the
>> operations(rmgr list) for which we need this feature to be enabled.
>> Since, this could be confusing, I've changed it accordingly so that it
>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>
> Why would we want that?

I am still in for just an on/off switch instead of this complication.
An all-or-nothing feature is what we are looking at here. Still a list
is an improvement compared to a bitmap.

>>>> 1. Add support for other Resource Managers.
>>>
>>> We probably need to have a discussion as to why you think this should
>>> be Rmgr dependent?
>>> Code comments would help there.
>>>
>>> If it does, then you should probably do this by extending RmgrTable
>>> with an rm_check, so you can call it like this...
>>>
>>> RmgrTable[record->xl_rmid].rm_check
>>
>> +1.
>> I'm modifying it accordingly. I'm calling this function after
>> RmgrTable[record->xl_rmid].rm_redo.
>>
>>>> 5. Generalize the page type identification technique.
>>>
>>> Why not do this first?
>>>
>>
>> At present, I'm using special page size and page ID to identify page
>> type. But, I've noticed some cases where the entire page is
>> initialized to zero (Ex: hash_xlog_squeeze_page). RmgrID and info bit
>> can help us to identify those pages.
>
> I'd prefer a solution that was not dependent upon RmgrID at all.

So you'd rather identify the page types by looking at pd_special? That
seems worse to me but..
-- 
Michael



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
>>>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>>
>>> What does this mean? (No docs)
>>
>> I was using this parameter as a masking integer to indicate the
>> operations(rmgr list) for which we need this feature to be enabled.
>> Since, this could be confusing, I've changed it accordingly so that it
>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>
> Why would we want that?
>

It would be easier to test and develop the various modules separately.
As an example, if we develop a new AM which needs WAL facility or
adding WAL capability to an existing system (say Hash Index), we can
just test that module, rather than whole system.  I think it can help
us in narrowing down the problem, if we have facility to enable it at
RMGR ID level.  Having said that, I think this must have the facility
to enable it for all the RMGR ID's (say ALL) and probably that should
be default.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Sep 1, 2016 at 11:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>
>>>>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>>>
>>>> What does this mean? (No docs)
>>>
>>> I was using this parameter as a masking integer to indicate the
>>> operations(rmgr list) for which we need this feature to be enabled.
>>> Since, this could be confusing, I've changed it accordingly so that it
>>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>>
>> Why would we want that?
>
> It would be easier to test and develop the various modules separately.
> As an example, if we develop a new AM which needs WAL facility or
> adding WAL capability to an existing system (say Hash Index), we can
> just test that module, rather than whole system.  I think it can help
> us in narrowing down the problem, if we have facility to enable it at
> RMGR ID level.  Having said that, I think this must have the facility
> to enable it for all the RMGR ID's (say ALL) and probably that should
> be default.

As far as I am understanding things, we are aiming at something that
could be used on production systems. And, honestly, any people
enabling it would just do it for all RMGRs because that's a
no-brainer. If we are designing something for testing purposes
instead, something is wrong with this patch then.

Doing filtering at RMGR level for testing and development purposes
will be done by somebody who has the skills to filter out which
records he should look at. Or he'll bump into an existing bump. So I'd
rather keep this thing simple.
-- 
Michael



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Thu, Sep 1, 2016 at 8:30 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Sep 1, 2016 at 11:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>>> On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>>
>>>>>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>>>>
>>>>> What does this mean? (No docs)
>>>>
>>>> I was using this parameter as a masking integer to indicate the
>>>> operations(rmgr list) for which we need this feature to be enabled.
>>>> Since, this could be confusing, I've changed it accordingly so that it
>>>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>>>
>>> Why would we want that?
>>
>> It would be easier to test and develop the various modules separately.
>> As an example, if we develop a new AM which needs WAL facility or
>> adding WAL capability to an existing system (say Hash Index), we can
>> just test that module, rather than whole system.  I think it can help
>> us in narrowing down the problem, if we have facility to enable it at
>> RMGR ID level.  Having said that, I think this must have the facility
>> to enable it for all the RMGR ID's (say ALL) and probably that should
>> be default.
>
> As far as I am understanding things, we are aiming at something that
> could be used on production systems.
>

I don't think you can enable it by default in production systems.
Enabling it will lead to significant performance drop as it writes the
whole page after each record for most type of RMGR ID's.

> And, honestly, any people
> enabling it would just do it for all RMGRs because that's a
> no-brainer.

Agreed, but remember enabling it for all is not free.

> If we are designing something for testing purposes
> instead, something is wrong with this patch then.
>

What is wrong?

> Doing filtering at RMGR level for testing and development purposes
> will be done by somebody who has the skills to filter out which
> records he should look at.
>

Right, but in that way, if you see many of our guc parameters needs a
good level of understanding to set the correct  values for them.  For
example, do you think it is easy for user to set value for
"replacement_sort_tuples" without reading the description or
understanding the meaning of same.  This example might not be the best
example, but I think there are other parameters which do require some
deeper understanding of system.  The main thing is default values for
such parameters should be chosen carefully such that it represents
most common usage.

> Or he'll bump into an existing bump. So I'd
> rather keep this thing simple.
>

It seems to me that having an option of 'ALL' would make it easier for
users to set it.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Thu, Sep 1, 2016 at 8:02 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 27 August 2016 at 12:09, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>
>>>>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>>>
>>>> What does this mean? (No docs)
>>>
>>> I was using this parameter as a masking integer to indicate the
>>> operations(rmgr list) for which we need this feature to be enabled.
>>> Since, this could be confusing, I've changed it accordingly so that it
>>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>>
>> Why would we want that?
>>
>
> It would be easier to test and develop the various modules separately.
> As an example, if we develop a new AM which needs WAL facility or
> adding WAL capability to an existing system (say Hash Index), we can
> just test that module, rather than whole system.  I think it can help
> us in narrowing down the problem, if we have facility to enable it at
> RMGR ID level.  Having said that, I think this must have the facility
> to enable it for all the RMGR ID's (say ALL) and probably that should
> be default.
>

oops, I think having an option of specifying 'ALL' is good, but that
shouldn't be default, because it could have serious performance
implications.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Peter Geoghegan
Date:
On Wed, Aug 31, 2016 at 8:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> As far as I am understanding things, we are aiming at something that
>> could be used on production systems.
>>
>
> I don't think you can enable it by default in production systems.
> Enabling it will lead to significant performance drop as it writes the
> whole page after each record for most type of RMGR ID's.
>
>> And, honestly, any people
>> enabling it would just do it for all RMGRs because that's a
>> no-brainer.
>
> Agreed, but remember enabling it for all is not free.

I have sympathy for the idea that this should be as low overhead as
possible, even if that means adding complexity to the interface --
within reason. I would like to hear a practical example of where this
RMGR id interface could be put to good use, when starting with little
initial information about a problem. And, ideally, we'd also have some
indication of how big a difference that would make, it terms of
measurable performance impact.

-- 
Peter Geoghegan



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Thu, Sep 1, 2016 at 9:43 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Aug 31, 2016 at 8:26 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> As far as I am understanding things, we are aiming at something that
>>> could be used on production systems.
>>>
>>
>> I don't think you can enable it by default in production systems.
>> Enabling it will lead to significant performance drop as it writes the
>> whole page after each record for most type of RMGR ID's.
>>
>>> And, honestly, any people
>>> enabling it would just do it for all RMGRs because that's a
>>> no-brainer.
>>
>> Agreed, but remember enabling it for all is not free.
>
> I have sympathy for the idea that this should be as low overhead as
> possible, even if that means adding complexity to the interface --
> within reason. I would like to hear a practical example of where this
> RMGR id interface could be put to good use, when starting with little
> initial information about a problem.
>

One example that comes to mind is for the cases where the problem
reproduces only under high concurrency or some stress test. Now assume
the problem is with index, enabling it for all rmgr's could reduce the
probability of problem due to it's performance impact.  The second
advantage which I have already listed is it helps in future
development like the one I am doing now for hash indexes (making them
logged).

> And, ideally, we'd also have some
> indication of how big a difference that would make, it terms of
> measurable performance impact.
>

Yes, that's a valid point. I think we can do some tests to see the difference.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Robert Haas
Date:
On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I'd prefer a solution that was not dependent upon RmgrID at all.
>
> If there are various special cases that we need to cater for, ISTM
> they would be flaws in the existing WAL implementation rather than
> anything we would want to perpetuate. I hope we'll spend time fixing
> them rather than add loads of weird code to work around the
> imperfections.
>
> Underdocumented special case code is going to be unbelievably
> difficult to get right in the long term.

It seems to me that you may be conflating the issue of which changes
should be masked out as hints (which is, indeed, special case code,
whether underdocumented or not) with the issue of which rmgrs the user
may want to verify (which is just a case of matching the rmgr ID in
the WAL record against a list provided by the user, and is not special
case code at all).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Simon Riggs
Date:
On 1 September 2016 at 11:16, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> I'd prefer a solution that was not dependent upon RmgrID at all.
>>
>> If there are various special cases that we need to cater for, ISTM
>> they would be flaws in the existing WAL implementation rather than
>> anything we would want to perpetuate. I hope we'll spend time fixing
>> them rather than add loads of weird code to work around the
>> imperfections.
>>
>> Underdocumented special case code is going to be unbelievably
>> difficult to get right in the long term.
>
> It seems to me that you may be conflating the issue of which changes
> should be masked out as hints (which is, indeed, special case code,
> whether underdocumented or not) with the issue of which rmgrs the user
> may want to verify (which is just a case of matching the rmgr ID in
> the WAL record against a list provided by the user, and is not special
> case code at all).

Yep, it seems entirely likely that I am misunderstanding what is
happening here. I'd like to see an analysis/discussion before we write
code. As you might expect, I'm excited by this feature and the
discoveries it appears likely to bring.

We've got wal_log_hints and that causes lots of extra traffic. I'm
happy with assuming that is switched on in this case also. (Perhaps we
might have a wal_log_level with various levels of logging.)

So in my current understanding, a hinted change has by definition no
WAL record, so we just ship a FPW. A non-hint change has a WAL record
and it is my (possibly naive) hope that all changes to a page are
reflected in the WAL record/replay, so we can just make a simple
comparison without caring what is the rmgr of the WAL record.

If we can start by discussing which special cases we know about that
require extra code, that will help. We can then decide whether to fix
the WAL record/replay or fix the comparison logic, possibly on a case
by case basis. My current preference is to generate lots of small
fixes to existing WAL code and then have a very, very simple patch for
this actual feature, but am willing to discuss alternate approaches.

IMV this would be a feature certain users would want turned on all the
time for everything. So I'm not bothered much about making this
feature settable by rmgr. I might question why this particular feature
would be settable by rmgr, when features like wal_log_hints and
wal_compression are not, but such discussion is a minor point in
comparison to discussing the main feature.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Robert Haas
Date:
On Thu, Sep 1, 2016 at 4:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> So in my current understanding, a hinted change has by definition no
> WAL record, so we just ship a FPW.

Hmm.  An FPW would have to be contained in a WAL record, so it can't
be right to say that we ship an FPW for lack of a WAL record.  I think
what we ship is nothing at all when wal_log_hints is disabled, and
when wal_log_hints is enabled we log an FDW once per checkpoint.

> A non-hint change has a WAL record
> and it is my (possibly naive) hope that all changes to a page are
> reflected in the WAL record/replay,

I hope for this, too.

> so we can just make a simple
> comparison without caring what is the rmgr of the WAL record.

Sure, that is 100% possible, and likely a good idea as far as the
behavior on the standby is concerned.  What's not so clear is whether
a simple on/off switch is a wise plan on the master.

The purpose of this code, as I understand it, is to check for
discrepancies between "do" and "redo"; that is, to verify that the
changes made to the buffer at the time the WAL record is generated
produce the same result as replay of that WAL record on the standby.
To accomplish this purpose, a post-image of the affected buffers is
included in each and every WAL record.  On replay, that post-image can
be compared with the result of replay.  If they differ, PostgreSQL has
a bug.  I would not expect many users to run this in production,
because it will presumably be wicked expensive.  If I recall
correctly, doing full page writes once per buffer per checkpoint, the
space taken up by FPWs is >75% of WAL volume.  Doing it for every
record will be exponentially more expensive.  The primary audience of
this feature is PostgreSQL developers, who might want to use it to try
to verify that, for example, Amit's patch to add write-ahead logging
for hash indexes does not have bugs.[1]

Indeed, it had occurred to me that we might not even want to compile
this code into the server unless WAL_DEBUG is defined; after all, how
does it help a regular user to detect that the server has a bug?  Bug
or no bug, that's the code they've got.  But on further reflection, it
seems like it could be useful: if we suspect a bug in the redo code
but we can't reproduce it here, we could ask the customer to turn this
option on to see whether it produces logging indicating the nature of
the problem.  However, because of the likely expensive of enabling the
feature, it seems like it would be quite desirable to limit the
expense of generating many extra FPWs to the affected rmgr.  For
example, if a user has a table with a btree index and a gin index, and
we suspect a bug in GIN, it would be nice for the user to be able to
enable the feature *only for GIN* rather than paying the cost of
enabling it for btree and heap as well.[2]

Similarly, when we imagine a developer using this feature to test for
bugs, it may at times be useful to enable it across-the-board to look
for bugs in any aspect of the write-ahead logging system. However, at
other times, when the goal is to find bugs in a particular AM, it
might be useful to enable it only for the corresponding rmgr.  It is
altogether likely that this feature will slow the system down quite a
lot.  If enabling this feature for hash indexes also means enabling it
for the heap, the incremental performance hit might be sufficient to
mask concurrency-related bugs in the hash index code that would
otherwise have been found.  So, I think having at least some basic
filtering is probably a pretty smart idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] It probably has bugs.
[2] One could of course add filtering considerably more complex than
per-rmgr - e.g. enabling it for only one particular relfilenode on a
busy production system might be rather desirable.  But I'm not sure we
really want to go there.  It adds a fair amount of complexity to a
feature that many people are obviously hoping will be quite simple to
use.



Re: WAL consistency check facility

From
Simon Riggs
Date:
On 1 September 2016 at 17:23, Robert Haas <robertmhaas@gmail.com> wrote:

> The primary audience of this feature is PostgreSQL developers

I have spoken to users who are waiting for this feature to run in
production, which is why I suggested it.

Some people care more about correctness than they do about loss of performance.

Obviously, this would be expensive and those with a super high
performance requirement may not be able to take advantage of this. I'm
sure many people will turn it off once if they hit a performance
issue, but running it in production for the first few months will give
people a very safe feeling.

I think the primary use for an rmgr filter might well be PostgreSQL developers.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Peter Geoghegan
Date:
On Thu, Sep 1, 2016 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Indeed, it had occurred to me that we might not even want to compile
> this code into the server unless WAL_DEBUG is defined; after all, how
> does it help a regular user to detect that the server has a bug?  Bug
> or no bug, that's the code they've got.  But on further reflection, it
> seems like it could be useful: if we suspect a bug in the redo code
> but we can't reproduce it here, we could ask the customer to turn this
> option on to see whether it produces logging indicating the nature of
> the problem.  However, because of the likely expensive of enabling the
> feature, it seems like it would be quite desirable to limit the
> expense of generating many extra FPWs to the affected rmgr.  For
> example, if a user has a table with a btree index and a gin index, and
> we suspect a bug in GIN, it would be nice for the user to be able to
> enable the feature *only for GIN* rather than paying the cost of
> enabling it for btree and heap as well.[2]

Yes, that would be rather a large advantage.

I think that there really is no hard distinction between users and
hackers. Some people will want to run this in production, and it would
be a lot better if performance was at least not atrocious. If amcheck
couldn't do the majority of its verification with only an
AccessShareLock, then users probably just couldn't use it. Heroku
wouldn't have been able to use it on all production databases. It
wouldn't have mattered that the verification was no less effective,
since the bugs it found would simply never have been observed in
practice.

-- 
Peter Geoghegan



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Hello,

As per the earlier discussions, I've attached the updated patch for
WAL consistency check feature. This is how the patch works:

- If WAL consistency check is enabled for a rmgrID, we always include
the backup image in the WAL record.
- I've extended the RmgrTable with a new function pointer
rm_checkConsistency, which is called after rm_redo. (only when WAL
consistency check is enabled for this rmgrID)
- In each rm_checkConsistency, both backup pages and buffer pages are
masked accordingly before any comparison.
- In postgresql.conf, a new guc variable named 'wal_consistency' is
added. Default value of this variable is 'None'. Valid values are
combinations of Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist,
BRIN, Generic and XLOG. It can also be set to 'All' to enable all the
values.
- In recovery tests (src/test/recovery/t), I've added wal_consistency
parameter in the existing scripts. This feature doesn't change the
expected output. If there is any inconsistency, it can be verified in
corresponding log file.

Results
------------------------

I've tested with installcheck and installcheck-world in master-standby
set-up. Followings are the configuration parameters.

Master:
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 4000
hot_standby = on
wal_consistency = 'All'

Standby:
wal_consistency = 'All'

I got two types of inconsistencies as following:

1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In backup
page, BTP_DELETED and BTP_LEAF both the flags are set, whereas after
redo, only BTP_DELETED flag is set in buffer page. I assume that we
should clear all btpo_flags before setting BTP_DELETED in
_bt_unlink_halfdead_page().

2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
REVMAP page. This happens only for two cases. I'm not sure what the
reason can be.

I haven't done sufficient tests yet to measure the overhead of this
modification. I'll do that next.


Thanks to Amit Kapila, Dilip Kumar and Robert Haas for their off-line
suggestions.

Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

On Thu, Sep 1, 2016 at 11:34 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 1, 2016 at 9:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Indeed, it had occurred to me that we might not even want to compile
>> this code into the server unless WAL_DEBUG is defined; after all, how
>> does it help a regular user to detect that the server has a bug?  Bug
>> or no bug, that's the code they've got.  But on further reflection, it
>> seems like it could be useful: if we suspect a bug in the redo code
>> but we can't reproduce it here, we could ask the customer to turn this
>> option on to see whether it produces logging indicating the nature of
>> the problem.  However, because of the likely expensive of enabling the
>> feature, it seems like it would be quite desirable to limit the
>> expense of generating many extra FPWs to the affected rmgr.  For
>> example, if a user has a table with a btree index and a gin index, and
>> we suspect a bug in GIN, it would be nice for the user to be able to
>> enable the feature *only for GIN* rather than paying the cost of
>> enabling it for btree and heap as well.[2]
>
> Yes, that would be rather a large advantage.
>
> I think that there really is no hard distinction between users and
> hackers. Some people will want to run this in production, and it would
> be a lot better if performance was at least not atrocious. If amcheck
> couldn't do the majority of its verification with only an
> AccessShareLock, then users probably just couldn't use it. Heroku
> wouldn't have been able to use it on all production databases. It
> wouldn't have mattered that the verification was no less effective,
> since the bugs it found would simply never have been observed in
> practice.
>
> --
> Peter Geoghegan



--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Sep 7, 2016 at 3:52 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Hello,
>
> As per the earlier discussions, I've attached the updated patch for
> WAL consistency check feature. This is how the patch works:
>

The earlier patch (wal_consistency_v6.patch) was based on the commit
id 67e1e2aaff.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Wed, Sep 7, 2016 at 3:52 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> I got two types of inconsistencies as following:
>
> 1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In backup
> page, BTP_DELETED and BTP_LEAF both the flags are set, whereas after
> redo, only BTP_DELETED flag is set in buffer page.
>

I see that inconsistency in code as well.  I think this is harmless,
because after the page is marked as deleted, it is not used for any
purpose other than to recycle it for re-use.  After re-using it, the
caller always suppose to initialize the flags based on it's usage and
I see that is happening in the code unless I am missing something.

> I assume that we
> should clear all btpo_flags before setting BTP_DELETED in
> _bt_unlink_halfdead_page().
>

Yeah, we can do that for consistency.  If we see any problem in doing
so, then I think we can log the flags and set them during replay.


Note - Please post your replies inline rather than top posting them.
It breaks the discussion link, if you top post it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Sep 7, 2016 at 7:22 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Hello,

Could you avoid top-posting please? More reference here:
http://www.idallen.com/topposting.html

> - If WAL consistency check is enabled for a rmgrID, we always include
> the backup image in the WAL record.

What happens if wal_consistency has different settings on a standby
and its master? If for example it is set to 'all' on the standby, and
'none' on the master, or vice-versa, how do things react? An update of
this parameter should be WAL-logged, no?

> - I've extended the RmgrTable with a new function pointer
> rm_checkConsistency, which is called after rm_redo. (only when WAL
> consistency check is enabled for this rmgrID)
> - In each rm_checkConsistency, both backup pages and buffer pages are
> masked accordingly before any comparison.

This leads to heavy code duplication...

> - In postgresql.conf, a new guc variable named 'wal_consistency' is
> added. Default value of this variable is 'None'. Valid values are
> combinations of Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist,
> BRIN, Generic and XLOG. It can also be set to 'All' to enable all the
> values.

Lower-case is the usual policy for parameter values for GUC parameters.

> - In recovery tests (src/test/recovery/t), I've added wal_consistency
> parameter in the existing scripts. This feature doesn't change the
> expected output. If there is any inconsistency, it can be verified in
> corresponding log file.

I am afraid that just generating a WARNING message is going to be
useless for the buildfarm. If we want to detect errors, we could for
example have an additional GUC to trigger an ERROR or a FATAL, taking
down the cluster, and allowing things to show in red on a platform.

> Results
> ------------------------
>
> I've tested with installcheck and installcheck-world in master-standby
> set-up. Followings are the configuration parameters.

So you tested as well the recovery tests, right?

> I got two types of inconsistencies as following:
>
> 1. For Btree/UNLINK_PAGE_META, btpo_flags are different. In backup
> page, BTP_DELETED and BTP_LEAF both the flags are set, whereas after
> redo, only BTP_DELETED flag is set in buffer page. I assume that we
> should clear all btpo_flags before setting BTP_DELETED in
> _bt_unlink_halfdead_page().

The page is deleted, it does not matter, so you could just mask all
the flags for a deleted page...
[...]
+   /*
+    * Mask everything on a DELETED page.
+    */
+   if (((BTPageOpaque) PageGetSpecialPointer(page_norm))->btpo_flags
& BTP_DELETED)
And that's what is happening.

> 2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
> REVMAP page. This happens only for two cases. I'm not sure what the
> reason can be.

Hm? This smells like a block reference bug. What are the cases you are
referring to?

> I haven't done sufficient tests yet to measure the overhead of this
> modification. I'll do that next.

I did a first pass on your patch, and I think that things could be
really reduced. There is much code duplication, but see below for the
details..
#include "access/xlogutils.h"
-
+#include "storage/bufmask.h"
I know that I am a noisy one on the matter, but please double-check
for such useless noise in your patch. And there is not only one.

+   newwalconsistency = (bool *) guc_malloc(ERROR,(RM_MAX_ID + 1)*sizeof(bool));
This spacing is not project-style. You may want to go through that:
https://www.postgresql.org/docs/devel/static/source.html

+$node_master->append_conf(
+   'postgresql.conf', qq(
+wal_consistency = 'All'
+));
Instead of duplicating that 7 times, you could just do it once in the
init() method of PostgresNode.pm. This really has meaning if enabled
by default.

+           /*
+            * Followings are the rmids which can have backup blocks.
+            * We'll enable this feature only for these rmids.
+            */
+           newwalconsistency[RM_HEAP2_ID] = true;
+           newwalconsistency[RM_HEAP_ID] = true;
+           newwalconsistency[RM_BTREE_ID] = true;
+           newwalconsistency[RM_HASH_ID] = true;
+           newwalconsistency[RM_GIN_ID] = true;
+           newwalconsistency[RM_GIST_ID] = true;
+           newwalconsistency[RM_SEQ_ID] = true;
+           newwalconsistency[RM_SPGIST_ID] = true;
+           newwalconsistency[RM_BRIN_ID] = true;
+           newwalconsistency[RM_GENERIC_ID] = true;
+           newwalconsistency[RM_XLOG_ID] = true;
Here you can just use MemSet with RM_MAX_ID and simplify this code maintenance.

+           for(i = 0; i < RM_MAX_ID + 1 ; i++)
+               newwalconsistency[i] = false;
+           break;
Same here you can just use MemSet.

+       else if (pg_strcasecmp(tok, "NONE") == 0)
[...]
+       else if (pg_strcasecmp(tok, "ALL") == 0)
It seems to me that using NONE or ALL with any other keywords should
not be allowed.

+       if (pg_strcasecmp(tok, "Heap2") == 0)
+       {
+           newwalconsistency[RM_HEAP2_ID] = true;
+       }
Thinking more about it, I guess that we had better change the
definition list of rmgrs in rmgr.h and get something closer to
RmgrDescData that pg_xlogdump has to avoid all this stanza by
completing it with the name of the rmgr. The only special cases that
this code path would need to take care of would be then 'none' and
'all'. You could do this refactoring on top of the main patch to
simplify it as it is rather big (1.7k lines).

+       if (inconsistent_loc < BLCKSZ)
+           elog(WARNING,
+                "Inconsistent page (at byte %u) found, rel %u/%u/%u, "
+                "forknum %u, blkno %u", inconsistent_loc,
+                rnode.spcNode, rnode.dbNode, rnode.relNode,
+                forknum, blkno);
+       else
+           elog(DEBUG1,
+                "Consistent page found, rel %u/%u/%u, "
+                "forknum %u, blkno %u",
+                rnode.spcNode, rnode.dbNode, rnode.relNode,
+                forknum, blkno);
This is going to be very chatty. Perhaps the elog level should be raised?

-#define SEQ_MAGIC    0x1717
-
-typedef struct sequence_magic
-{
-   uint32      magic;
-} sequence_magic;
You do not need this refactoring anymore.

+   void        (*rm_checkConsistency) (XLogReaderState *record);
All your _checkConsistency functions share the same pattern, in short
they all use a for loop for each block, call each time
XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
You would get a reduction by a couple of hundreds of lines by having a
smarter refactoring. And to be honest, if I look at your patch what I
think is the correct way of doing things is to add to the rmgr not
this check consistency function, but just a pointer to the masking
function.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Hello Michael,

Thanks for your detailed review.

>> - If WAL consistency check is enabled for a rmgrID, we always include
>> the backup image in the WAL record.
>
> What happens if wal_consistency has different settings on a standby
> and its master? If for example it is set to 'all' on the standby, and
> 'none' on the master, or vice-versa, how do things react? An update of
> this parameter should be WAL-logged, no?
>
It is possible to set wal_consistency to 'All' in master and any other
values in standby. But, the scenario you mentioned will cause error in
standby since it may not get the required backup image for wal
consistency check. I think that user should be responsible to set
this value correctly. We can improve the error message to make the
user aware of the situation.

>> - I've extended the RmgrTable with a new function pointer
>> rm_checkConsistency, which is called after rm_redo. (only when WAL
>> consistency check is enabled for this rmgrID)
>> - In each rm_checkConsistency, both backup pages and buffer pages are
>> masked accordingly before any comparison.
>
> This leads to heavy code duplication...
>
> +   void        (*rm_checkConsistency) (XLogReaderState *record);
> All your _checkConsistency functions share the same pattern, in short
> they all use a for loop for each block, call each time
> XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
> You would get a reduction by a couple of hundreds of lines by having a
> smarter refactoring. And to be honest, if I look at your patch what I
> think is the correct way of doing things is to add to the rmgr not
> this check consistency function, but just a pointer to the masking
> function.
Pointer to the masking function will certainly reduce a lot of redundant
code. I'll modify it accordingly.

>> - In recovery tests (src/test/recovery/t), I've added wal_consistency
>> parameter in the existing scripts. This feature doesn't change the
>> expected output. If there is any inconsistency, it can be verified in
>> corresponding log file.
>
> I am afraid that just generating a WARNING message is going to be
> useless for the buildfarm. If we want to detect errors, we could for
> example have an additional GUC to trigger an ERROR or a FATAL, taking
> down the cluster, and allowing things to show in red on a platform.
>
Yes, we can include an additional GUC to trigger an ERROR for any inconsistency.

>> Results
>> ------------------------
>>
>> I've tested with installcheck and installcheck-world in master-standby
>> set-up. Followings are the configuration parameters.
>
> So you tested as well the recovery tests, right?
>
Yes, I've done the recovery tests after enabling tap-test.

>
> +           /*
> +            * Followings are the rmids which can have backup blocks.
> +            * We'll enable this feature only for these rmids.
> +            */
> +           newwalconsistency[RM_HEAP2_ID] = true;
> +           newwalconsistency[RM_HEAP_ID] = true;
> +           newwalconsistency[RM_BTREE_ID] = true;
> +           newwalconsistency[RM_HASH_ID] = true;
> +           newwalconsistency[RM_GIN_ID] = true;
> +           newwalconsistency[RM_GIST_ID] = true;
> +           newwalconsistency[RM_SEQ_ID] = true;
> +           newwalconsistency[RM_SPGIST_ID] = true;
> +           newwalconsistency[RM_BRIN_ID] = true;
> +           newwalconsistency[RM_GENERIC_ID] = true;
> +           newwalconsistency[RM_XLOG_ID] = true;
> Here you can just use MemSet with RM_MAX_ID and simplify this code maintenance.
>
Not all rmids can have backup blocks. So, for wal_consistency = 'all',
I've enabled only those rmids which can have backup blocks.

>
> +       if (pg_strcasecmp(tok, "Heap2") == 0)
> +       {
> +           newwalconsistency[RM_HEAP2_ID] = true;
> +       }
> Thinking more about it, I guess that we had better change the
> definition list of rmgrs in rmgr.h and get something closer to
> RmgrDescData that pg_xlogdump has to avoid all this stanza by
> completing it with the name of the rmgr. The only special cases that
> this code path would need to take care of would be then 'none' and
> 'all'. You could do this refactoring on top of the main patch to
> simplify it as it is rather big (1.7k lines).
>
I'm not sure about this point. wal_consistency doesn't support  all
the rmids. We should have some way to check this.

I'll update rest of the things as mentioned by you accordingly.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Fri, Sep 9, 2016 at 4:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>> - If WAL consistency check is enabled for a rmgrID, we always include
>>> the backup image in the WAL record.
>>
>> What happens if wal_consistency has different settings on a standby
>> and its master? If for example it is set to 'all' on the standby, and
>> 'none' on the master, or vice-versa, how do things react? An update of
>> this parameter should be WAL-logged, no?
>>
> It is possible to set wal_consistency to 'All' in master and any other
> values in standby. But, the scenario you mentioned will cause error in
> standby since it may not get the required backup image for wal
> consistency check. I think that user should be responsible to set
> this value correctly. We can improve the error message to make the
> user aware of the situation.

Let's be careful here. You should as well consider things from the
angle that some parameter updates are WAL-logged as well, like
wal_level with the WAL record XLOG_PARAMETER_CHANGE.

>>> - In recovery tests (src/test/recovery/t), I've added wal_consistency
>>> parameter in the existing scripts. This feature doesn't change the
>>> expected output. If there is any inconsistency, it can be verified in
>>> corresponding log file.
>>
>> I am afraid that just generating a WARNING message is going to be
>> useless for the buildfarm. If we want to detect errors, we could for
>> example have an additional GUC to trigger an ERROR or a FATAL, taking
>> down the cluster, and allowing things to show in red on a platform.
>>
> Yes, we can include an additional GUC to trigger an ERROR for any inconsistency.

I'd like to hear extra opinions about that, but IMO just having an
ERROR would be fine for the first implementation. Once you've bumped
into an ERROR, you are likely going to fix it first.

>> +           /*
>> +            * Followings are the rmids which can have backup blocks.
>> +            * We'll enable this feature only for these rmids.
>> +            */
>> +           newwalconsistency[RM_HEAP2_ID] = true;
>> +           newwalconsistency[RM_HEAP_ID] = true;
>> +           newwalconsistency[RM_BTREE_ID] = true;
>> +           newwalconsistency[RM_HASH_ID] = true;
>> +           newwalconsistency[RM_GIN_ID] = true;
>> +           newwalconsistency[RM_GIST_ID] = true;
>> +           newwalconsistency[RM_SEQ_ID] = true;
>> +           newwalconsistency[RM_SPGIST_ID] = true;
>> +           newwalconsistency[RM_BRIN_ID] = true;
>> +           newwalconsistency[RM_GENERIC_ID] = true;
>> +           newwalconsistency[RM_XLOG_ID] = true;
>> Here you can just use MemSet with RM_MAX_ID and simplify this code maintenance.
>>
> Not all rmids can have backup blocks. So, for wal_consistency = 'all',
> I've enabled only those rmids which can have backup blocks.

Even if some rmgrs do not support FPWs, I don't think that it is safe
to assume that the existing ones would never support it. Imagine for
example that feature X is implemented. Feature X adds rmgs Y, but rmgr
Y does not use FPWs. At a later point a new feature is added, which
makes rmgr Y using FPWs. We'd increase the number of places to update
with your patch, increasing the likelyness to introduce bugs. It would
be better to use a safe implementation from the maintenance point of
view to be honest (maintenance load of masking functions is somewhat
leveraged by the fact that on-disk format is kept compatible).

>>
>> +       if (pg_strcasecmp(tok, "Heap2") == 0)
>> +       {
>> +           newwalconsistency[RM_HEAP2_ID] = true;
>> +       }
>> Thinking more about it, I guess that we had better change the
>> definition list of rmgrs in rmgr.h and get something closer to
>> RmgrDescData that pg_xlogdump has to avoid all this stanza by
>> completing it with the name of the rmgr. The only special cases that
>> this code path would need to take care of would be then 'none' and
>> 'all'. You could do this refactoring on top of the main patch to
>> simplify it as it is rather big (1.7k lines).
>>
> I'm not sure about this point. wal_consistency doesn't support  all
> the rmids. We should have some way to check this.

I'd rather see this code done in such a way that all the rmgrs can be
handled, this approach being particularly attractive for the fact that
there is no need to change it if new rmgrs are added in the future.
(This was a reason as well why I still think that a simple on/off
switch would be plain enough, users have mostly control of the SQLs
triggering WAL. And if you run tests, you'll likely have the mind to
turn autovacuum to off to avoid it to generate FPWs and pollute the
logs at least at the second run of your tests).

And if you move forward with the approach of making this parameter a
list, I think that it would be better to add a section in the WAL
documentation about resource managers, like what they are, and list
them in this section of the docs. Then your parameter could link to
this documentation part and users would be able to see what kind of
values can be set. This leverages the need to update multiple portions
of the docs if rmgrs are added or removed in the future, as well as it
minimizes the maintenance of this code.
-- 
Michael



Re: WAL consistency check facility

From
Robert Haas
Date:
On Sat, Sep 10, 2016 at 3:19 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Sep 9, 2016 at 4:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>>> - If WAL consistency check is enabled for a rmgrID, we always include
>>>> the backup image in the WAL record.
>>>
>>> What happens if wal_consistency has different settings on a standby
>>> and its master? If for example it is set to 'all' on the standby, and
>>> 'none' on the master, or vice-versa, how do things react? An update of
>>> this parameter should be WAL-logged, no?
>>>
>> It is possible to set wal_consistency to 'All' in master and any other
>> values in standby. But, the scenario you mentioned will cause error in
>> standby since it may not get the required backup image for wal
>> consistency check. I think that user should be responsible to set
>> this value correctly. We can improve the error message to make the
>> user aware of the situation.
>
> Let's be careful here. You should as well consider things from the
> angle that some parameter updates are WAL-logged as well, like
> wal_level with the WAL record XLOG_PARAMETER_CHANGE.

It seems entirely unnecessary for the master and the standby to agree
here.  I think what we need is two GUCs.  One of them, which affects
only the master, controls whether the validation information is
including in the WAL, and the other, which affects only the standby,
affects whether validation is performed when the necessary information
is present.  Or maybe skip the second one and just decree that
standbys will always validate if the necessary information is present.
Using the same GUC on both the master and the standby but making it
mean different things in each of those places (whether to log the
validation info in one case, whether to perform validation in the
other case) is another option that also avoids needing to enforce that
the setting is the same in both places, but probably an inferior one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Sat, Sep 10, 2016 at 12:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Sep 9, 2016 at 4:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
>>>> - In recovery tests (src/test/recovery/t), I've added wal_consistency
>>>> parameter in the existing scripts. This feature doesn't change the
>>>> expected output. If there is any inconsistency, it can be verified in
>>>> corresponding log file.
>>>
>>> I am afraid that just generating a WARNING message is going to be
>>> useless for the buildfarm. If we want to detect errors, we could for
>>> example have an additional GUC to trigger an ERROR or a FATAL, taking
>>> down the cluster, and allowing things to show in red on a platform.
>>>
>> Yes, we can include an additional GUC to trigger an ERROR for any inconsistency.
>
> I'd like to hear extra opinions about that, but IMO just having an
> ERROR would be fine for the first implementation. Once you've bumped
> into an ERROR, you are likely going to fix it first.
>

+1 for just an ERROR to detect the inconsistency.  I think adding
additional GUC just to raise error level doesn't seem to be advisable.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Amit Kapila
Date:
On Sat, Sep 10, 2016 at 8:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Sep 10, 2016 at 3:19 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, Sep 9, 2016 at 4:01 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>>>>> - If WAL consistency check is enabled for a rmgrID, we always include
>>>>> the backup image in the WAL record.
>>>>
>>>> What happens if wal_consistency has different settings on a standby
>>>> and its master? If for example it is set to 'all' on the standby, and
>>>> 'none' on the master, or vice-versa, how do things react? An update of
>>>> this parameter should be WAL-logged, no?
>>>>
>>> It is possible to set wal_consistency to 'All' in master and any other
>>> values in standby. But, the scenario you mentioned will cause error in
>>> standby since it may not get the required backup image for wal
>>> consistency check. I think that user should be responsible to set
>>> this value correctly. We can improve the error message to make the
>>> user aware of the situation.
>>
>> Let's be careful here. You should as well consider things from the
>> angle that some parameter updates are WAL-logged as well, like
>> wal_level with the WAL record XLOG_PARAMETER_CHANGE.
>
> It seems entirely unnecessary for the master and the standby to agree
> here.  I think what we need is two GUCs.  One of them, which affects
> only the master, controls whether the validation information is
> including in the WAL, and the other, which affects only the standby,
> affects whether validation is performed when the necessary information
> is present.
>

I think from the clarity perspective, this option sounds good, but I
am slightly afraid that it might be inconvenient for users to set the
different values for these two parameters.

>  Or maybe skip the second one and just decree that
> standbys will always validate if the necessary information is present.

Sounds like a better alternative and probably easier to configure for users.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Hello,

Based on the previous discussions, I've modified the existing patch.

>+   void        (*rm_checkConsistency) (XLogReaderState *record);
>All your _checkConsistency functions share the same pattern, in short
>they all use a for loop for each block, call each time
>XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
>You would get a reduction by a couple of hundreds of lines by having a
>smarter refactoring. And to be honest, if I look at your patch what I
>think is the correct way of doing things is to add to the rmgr not
>this check consistency function, but just a pointer to the masking
>function.
+1. In rmgrlist, I've added a pointer to the masking function for each rmid.
A common function named checkConsistency calls these masking functions
based on their rmid and does comparison for each block.

>> - If WAL consistency check is enabled for a rmgrID, we always include
>> the backup image in the WAL record.
>
>What happens if wal_consistency has different settings on a standby
>and its master? If for example it is set to 'all' on the standby, and
>'none' on the master, or vice-versa, how do things react? An update of
>this parameter should be WAL-logged, no?
If wal_consistency is enabled for a rmid, standby will always check whether
backup image exists or not i.e. BKPBLOCK_HAS_IMAGE is set or not.
(I guess Amit and Robert also suggested the same in the thread)
Basically, BKPBLOCK_HAS_IMAGE is set if a block contains image and
BKPIMAGE_IS_REQUIRED_FOR_REDO (I've added this one) is set if that backup
image is required during redo. When we decode a wal record, has_image
flag of DecodedBkpBlock is set to BKPIMAGE_IS_REQUIRED_FOR_REDO.

>+       if (pg_strcasecmp(tok, "Heap2") == 0)
>+       {
>+           newwalconsistency[RM_HEAP2_ID] = true;
>+       }
>Thinking more about it, I guess that we had better change the
>definition list of rmgrs in rmgr.h and get something closer to
>RmgrDescData that pg_xlogdump has to avoid all this stanza by
>completing it with the name of the rmgr. The only special cases that
>this code path would need to take care of would be then 'none' and
>'all'. You could do this refactoring on top of the main patch to
>simplify it as it is rather big (1.7k lines).
I've modified it exactly like pg_xlogdump does. Additionally, it checks
whether masking function is defined for the rmid or not. Hence, in future,
if we want to include any other rmid for wal consistency check, we just need
to define its masking function.

>> - In recovery tests (src/test/recovery/t), I've added wal_consistency
>> parameter in the existing scripts. This feature doesn't change the
>> expected output. If there is any inconsistency, it can be verified in
>> corresponding log file.
>
>I am afraid that just generating a WARNING message is going to be
>useless for the buildfarm. If we want to detect errors, we could for
>example have an additional GUC to trigger an ERROR or a FATAL, taking
>down the cluster, and allowing things to show in red on a platform.
For now, I've kept this as a WARNING message to detect all inconsistencies
at once. Once, the patch is finalized, I'll modify it as an ERROR message.

Thoughts?


--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WAL consistency check facility

From
Michael Paquier
Date:
On Mon, Sep 12, 2016 at 5:06 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
>>+   void        (*rm_checkConsistency) (XLogReaderState *record);
>>All your _checkConsistency functions share the same pattern, in short
>>they all use a for loop for each block, call each time
>>XLogReadBufferExtended, etc. And this leads to a *lot* of duplication.
>>You would get a reduction by a couple of hundreds of lines by having a
>>smarter refactoring. And to be honest, if I look at your patch what I
>>think is the correct way of doing things is to add to the rmgr not
>>this check consistency function, but just a pointer to the masking
>>function.
>
> +1. In rmgrlist, I've added a pointer to the masking function for each rmid.
> A common function named checkConsistency calls these masking functions
> based on their rmid and does comparison for each block.

The patch size is down from 79kB to 38kB. That gets better :)

>>> - If WAL consistency check is enabled for a rmgrID, we always include
>>> the backup image in the WAL record.
>>
>>What happens if wal_consistency has different settings on a standby
>>and its master? If for example it is set to 'all' on the standby, and
>>'none' on the master, or vice-versa, how do things react? An update of
>>this parameter should be WAL-logged, no?
>
> If wal_consistency is enabled for a rmid, standby will always check whether
> backup image exists or not i.e. BKPBLOCK_HAS_IMAGE is set or not.
> (I guess Amit and Robert also suggested the same in the thread)
> Basically, BKPBLOCK_HAS_IMAGE is set if a block contains image and
> BKPIMAGE_IS_REQUIRED_FOR_REDO (I've added this one) is set if that backup
> image is required during redo. When we decode a wal record, has_image
> flag of DecodedBkpBlock is set to BKPIMAGE_IS_REQUIRED_FOR_REDO.

Ah I see. But do we actually store the status in the record itself,
then at replay we don't care of the value of wal_consistency at
replay. That's the same concept used by wal_compression. So shouldn't
you have more specific checks related to that in checkConsistency? You
actually don't need to check for anything in xlogreader.c, just check
for the consistency if there is a need to do so, or do nothing.

> For now, I've kept this as a WARNING message to detect all inconsistencies
> at once. Once, the patch is finalized, I'll modify it as an ERROR message.

Or say FATAL. This way the server is taken down.

> Thoughts?

A couple of extra thoughts:
1) The routines of each rmgr are located in a dedicated file, for
example GIN stuff is in ginxlog.c, etc. It seems to me that it would
be better to move each masking function where it should be instead
being centralized. A couple of routines need to be centralized, so I'd
suggest putting them in a new file, like xlogmask.c, xlog because now
this is part of WAL replay completely, including the lsn, the hint
bint and the other common routines.

2) Regarding page comparison:
+/*
+ * Compare the contents of two pages.
+ * If the two pages are exactly same, it returns BLCKSZ. Otherwise,
+ * it returns the location where the first mismatch has occurred.
+ */
+int
+comparePages(char *page1, char *page2)
We could just use memcpy() here. compareImages was useful to get a
clear image of what the inconsistencies were, but you don't do that
anymore.

3)
+static void checkConsistency(RmgrId rmid, XLogReaderState *record);
The RMGR if is part of the record decoded, so you could just remove
RmgrId from the list of arguments and simplify this interface.

4) If this patch still goes with the possibility to set up a list of
RMGRs, documentation is needed for that. I'd suggest writing first a
patch to explain what are RMGRs for WAL, then apply the WAL
consistency facility on top of it and link wal_consistency to it.

5)
+           has_image = record->blocks[block_id].has_image;
+           record->blocks[block_id].has_image = true;
+           if (!RestoreBlockImage(record, block_id, old_page))
+               elog(ERROR, "failed to restore block image");
+           record->blocks[block_id].has_image = has_image;
Er, what? And BKPIMAGE_IS_REQUIRED_FOR_REDO?

6)
+           /*
+            * Remember that, if WAL consistency check is enabled for
the current rmid,
+            * we always include backup image with the WAL record.
But, during redo we
+            * restore the backup block only if needs_backup is set.
+            */
+           if (needs_backup)
+               bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
This should use wal_consistency[rmid]?

7) This patch has zero documentation. Please add some. Any human being
on this list other than those who worked on the first versions
(Heikki, Simon and I?) is going to have a hard time to review this
patch in details moving on if there is no reference to tell what this
feature does for the user...

This patch is going to the good direction, but I don't think it's far
from being ready for commit yet. So I am going to mark it as returned
with feedback if there are no objections.
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Sun, Sep 11, 2016 at 12:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems entirely unnecessary for the master and the standby to agree
> here.  I think what we need is two GUCs.  One of them, which affects
> only the master, controls whether the validation information is
> including in the WAL, and the other, which affects only the standby,
> affects whether validation is performed when the necessary information
> is present.  Or maybe skip the second one and just decree that
> standbys will always validate if the necessary information is present.
> Using the same GUC on both the master and the standby but making it
> mean different things in each of those places (whether to log the
> validation info in one case, whether to perform validation in the
> other case) is another option that also avoids needing to enforce that
> the setting is the same in both places, but probably an inferior one.

Thinking more about that, there is no actual need to do anything
complicated here. We could just track at the record level if a
consistency check is needs to be done at replay and do it. If nothing
is set, just do nothing. That would allow us to promote this parameter
to SIGHUP. wal_compression does something similar.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
>>>> - If WAL consistency check is enabled for a rmgrID, we always include
>>>> the backup image in the WAL record.
>>>
>>>What happens if wal_consistency has different settings on a standby
>>>and its master? If for example it is set to 'all' on the standby, and
>>>'none' on the master, or vice-versa, how do things react? An update of
>>>this parameter should be WAL-logged, no?
>>
>> If wal_consistency is enabled for a rmid, standby will always check whether
>> backup image exists or not i.e. BKPBLOCK_HAS_IMAGE is set or not.
>> (I guess Amit and Robert also suggested the same in the thread)
>> Basically, BKPBLOCK_HAS_IMAGE is set if a block contains image and
>> BKPIMAGE_IS_REQUIRED_FOR_REDO (I've added this one) is set if that backup
>> image is required during redo. When we decode a wal record, has_image
>> flag of DecodedBkpBlock is set to BKPIMAGE_IS_REQUIRED_FOR_REDO.
>
>Ah I see. But do we actually store the status in the record itself,
>then at replay we don't care of the value of wal_consistency at
>replay. That's the same concept used by wal_compression. So shouldn't
>you have more specific checks related to that in checkConsistency? You
>actually don't need to check for anything in xlogreader.c, just check
>for the consistency if there is a need to do so, or do nothing.
I'm sorry, but I don't quite follow you here. If a wal record contains
an image, has_image should be set since it helps decoding the
record. But, during redo if XLogRecHasBlockImage() returns true, i.e.,
has_image is set, then it always restore the block. But, in our case,
a record can have a backup image which should not be restored. So, we need
to decide two things:
1. Does a record contain backup image? (required for decoding the record)
2. If it has an image, should we restore it during redo?
I think we sould decide these in DecodeXLogRecord() only. BKPBLOCK_HAS_IMAGE
answers the first question whereas BKPIMAGE_IS_REQUIRED_FOR_REDO
answers the second one. In DecodeXLogRecord(), we check that
BKPBLOCK_HAS_IMAGE should be set if wal_consistency is enabled for
this record. The flag has_image is set to
BKPIMAGE_IS_REQUIRED_FOR_REDO which is later used to decide whether we
want to restore a block or not.

> For now, I've kept this as a WARNING message to detect all inconsistencies
> at once. Once, the patch is finalized, I'll modify it as an ERROR message.

Or say FATAL. This way the server is taken down.

> Thoughts?
+1. I'll do that.

>A couple of extra thoughts:
>1) The routines of each rmgr are located in a dedicated file, for
>example GIN stuff is in ginxlog.c, etc. It seems to me that it would
>be better to move each masking function where it should be instead
>being centralized. A couple of routines need to be centralized, so I'd
>suggest putting them in a new file, like xlogmask.c, xlog because now
>this is part of WAL replay completely, including the lsn, the hint
>bint and the other common routines.
Sounds good. But, I think that the file name for common masking routines
should be as bufmask.c since we are masking the buffers only.

>2) Regarding page comparison:
>+/*
>+ * Compare the contents of two pages.
>+ * If the two pages are exactly same, it returns BLCKSZ. Otherwise,
>+ * it returns the location where the first mismatch has occurred.
>+ */
>+int
>+comparePages(char *page1, char *page2)
>We could just use memcpy() here. compareImages was useful to get a
>clear image of what the inconsistencies were, but you don't do that
>anymore.
memcmp(), right?

>5)
>+           has_image = record->blocks[block_id].has_image;
>+           record->blocks[block_id].has_image = true;
>+           if (!RestoreBlockImage(record, block_id, old_page))
>+               elog(ERROR, "failed to restore block image");
>+           record->blocks[block_id].has_image = has_image;
>Er, what? And BKPIMAGE_IS_REQUIRED_FOR_REDO?
Sorry, I completely missed this.

>6)
>+           /*
>+            * Remember that, if WAL consistency check is enabled for
>the current rmid,
>+            * we always include backup image with the WAL record.
>But, during redo we
>+            * restore the backup block only if needs_backup is set.
>+            */
>+           if (needs_backup)
>+               bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
>This should use wal_consistency[rmid]?
needs_backup is set when XLogRecordAssemble decides that backup image
should be included in the record for redo purpose. This image will be
restored during
redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the included
image should be restored during redo(or has_image should be set or not).

>7) This patch has zero documentation. Please add some. Any human being
>on this list other than those who worked on the first versions
>(Heikki, Simon and I?) is going to have a hard time to review this
>patch in details moving on if there is no reference to tell what this
>feature does for the user...
>
>This patch is going to the good direction, but I don't think it's far
>from being ready for commit yet. So I am going to mark it as returned
>with feedback if there are no objections
I think only major change that this patch needs a proper and detailed
documentation. Other than that there are very minor changes which can
be done quickly. Right?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Tue, Sep 13, 2016 at 6:07 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
>> For now, I've kept this as a WARNING message to detect all inconsistencies
>> at once. Once, the patch is finalized, I'll modify it as an ERROR message.
>
> Or say FATAL. This way the server is taken down.

What I'd really like to see here is a way to quickly identify in the
buildfarm the moment an inconsistent WAL has been introduced by a new
feature, some bug fix, or perhaps a deficiency in the masking
routines. We could definitely tune that later on, by controlling with
a GUC if this generates a WARNING instead of a FATAL, the former being
more useful for production environments, and the latter for tests. It
would be good to think as well about a set of tests, one rough thing
would be to modify an on-disk page for a table, and work on that to
force an inconsistency to be triggered..

>>A couple of extra thoughts:
>>1) The routines of each rmgr are located in a dedicated file, for
>>example GIN stuff is in ginxlog.c, etc. It seems to me that it would
>>be better to move each masking function where it should be instead
>>being centralized. A couple of routines need to be centralized, so I'd
>>suggest putting them in a new file, like xlogmask.c, xlog because now
>>this is part of WAL replay completely, including the lsn, the hint
>>bint and the other common routines.
> Sounds good. But, I think that the file name for common masking routines
> should be as bufmask.c since we are masking the buffers only.

That makes sense as well. No objections to that.

>>2) Regarding page comparison:
>>We could just use memcpy() here. compareImages was useful to get a
>>clear image of what the inconsistencies were, but you don't do that
>>anymore.
> memcmp(), right?

Yep :)

>>6)
>>+           /*
>>+            * Remember that, if WAL consistency check is enabled for
>>the current rmid,
>>+            * we always include backup image with the WAL record.
>>But, during redo we
>>+            * restore the backup block only if needs_backup is set.
>>+            */
>>+           if (needs_backup)
>>+               bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
>>This should use wal_consistency[rmid]?
>
> needs_backup is set when XLogRecordAssemble decides that backup image
> should be included in the record for redo purpose. This image will be
> restored during
> redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the included
> image should be restored during redo(or has_image should be set or not).

When decoding a record, I think that you had better not use has_image
to assume that a FPW has to be double-checked. This has better be a
different boolean flag, say check_page or similar. This way after
decoding a record it is possible to know if there is a PFW, and if a
check on it is needed or not.

>>7) This patch has zero documentation. Please add some. Any human being
>>on this list other than those who worked on the first versions
>>(Heikki, Simon and I?) is going to have a hard time to review this
>>patch in details moving on if there is no reference to tell what this
>>feature does for the user...
>>
>>This patch is going to the good direction, but I don't think it's far
>>from being ready for commit yet. So I am going to mark it as returned
>>with feedback if there are no objections
>
> I think only major change that this patch needs a proper and detailed
> documentation. Other than that there are very minor changes which can
> be done quickly. Right?

It seems to me that you need to think about the way to document things
properly first, with for example:
- Have a first documentation patch that explains what is a resource
manager for WAL, and what are the types available with a nice table.
- Add in your patch documentation to explain what are the benefits of
using this facility, the main purpose is testing, but there are also
mention upthread about users that would like to get that into
production, assuming that the overhead is minimal.
- Add more comments in your code to finish. One example is
checkConsistency() that is here, but explains nothing.

Well, if you'd simply use an on/off switch to control the feature, the
documentation load for rmgrs would be zero, but as I am visibly
outnumbered in this fight... We could also have an off/on switch
implemented first, and extend that later on depending on the feedback
from other users. We discussed rmgr-level or relation-level tuning of
FPW compression at some point, but we've finished with the most simple
approach, and we still stick with it.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Sep 14, 2016 at 6:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>>>2) Regarding page comparison:
>>>We could just use memcpy() here. compareImages was useful to get a
>>>clear image of what the inconsistencies were, but you don't do that
>>>anymore.
>> memcmp(), right?
>
>Yep :)
If I use memcmp(), I won't get the byte location where the first mismatch
has occurred. It will be helpful to display the byte location which causes
an inconsistency.

>>>6)
>>>+           /*
>>>+            * Remember that, if WAL consistency check is enabled for
>>>the current rmid,
>>>+            * we always include backup image with the WAL record.
>>>But, during redo we
>>>+            * restore the backup block only if needs_backup is set.
>>>+            */
>>>+           if (needs_backup)
>>>+               bimg.bimg_info |= BKPIMAGE_IS_REQUIRED_FOR_REDO;
>>>This should use wal_consistency[rmid]?
>>
>> needs_backup is set when XLogRecordAssemble decides that backup image
>> should be included in the record for redo purpose. This image will be
>> restored during
>> redo. BKPIMAGE_IS_REQUIRED_FOR_REDO indicates whether the included
>> image should be restored during redo(or has_image should be set or not).
>
>When decoding a record, I think that you had better not use has_image
>to assume that a FPW has to be double-checked. This has better be a
>different boolean flag, say check_page or similar. This way after
>decoding a record it is possible to know if there is a PFW, and if a
>check on it is needed or not.
I've done some modifications which discards the necessity of adding
anything in DecodeXLogRecord().

Master
---------------
- If wal_consistency check is enabled or needs_backup is set in
XLogRecordAssemble(), we do a fpw.
- If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
which in turns set has_image flag while decoding the record.
- If a fpw needs to be restored during redo, i.e., needs_backup is true,
then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

Standby
---------------
- In XLogReadBufferForRedoExtended(), if both XLogRecHasBlockImage() and
XLogRecHasBlockImageForRedo()(added by me*) return true, we restore the
backup image.
- In checkConsistency, we only check if XLogRecHasBlockImage() returns true
when wal_consistency check is enabled for this rmid.

*XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
BKPIMAGE_IS_REQUIRED_FOR_REDO.

Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Sep 14, 2016 at 2:56 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> Master
> ---------------
> - If wal_consistency check is enabled or needs_backup is set in
> XLogRecordAssemble(), we do a fpw.
> - If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
> which in turns set has_image flag while decoding the record.
> - If a fpw needs to be restored during redo, i.e., needs_backup is true,
> then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.

Here that should be if wal_consistency is true, no?

> Standby
> ---------------
> - In XLogReadBufferForRedoExtended(), if both XLogRecHasBlockImage() and
> XLogRecHasBlockImageForRedo()(added by me*) return true, we restore the
> backup image.
> - In checkConsistency, we only check if XLogRecHasBlockImage() returns true
> when wal_consistency check is enabled for this rmid.

My guess would have been that you do not need to check anymore for
wal_consistency in checkConsistency, making the GUC value only used on
master node.

> *XLogRecHasBlockImageForRedo() checks whether bimg_info is set with
> BKPIMAGE_IS_REQUIRED_FOR_REDO.

Yes, that's more or less what you should have.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Sep 14, 2016 at 11:31 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Sep 14, 2016 at 2:56 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> Master
>> ---------------
>> - If wal_consistency check is enabled or needs_backup is set in
>> XLogRecordAssemble(), we do a fpw.
>> - If a fpw is to be done, then fork_flags is set with BKPBLOCK_HAS_IMAGE,
>> which in turns set has_image flag while decoding the record.
>> - If a fpw needs to be restored during redo, i.e., needs_backup is true,
>> then bimg_info is set with BKPIMAGE_IS_REQUIRED_FOR_REDO.
>
> Here that should be if wal_consistency is true, no?
Nope. I'll try to explain using some pseudo-code:
XLogRecordAssemble()
{
....include_image = needs_backup || wal_consistency[rmid];if (include_image){ .... set XLogRecordBlockHeader.fork_flags
|=BKPBLOCK_HAS_IMAGE; if (needs_backup)   set XLogRecordBlockImageHeader.bimg_info                         |=
BKPIMAGE_IS_REQUIRED_FOR_REDO;....}.....
 
}

XLogReadBufferForRedoExtended()
{......if (XLogRecHasBlockImage() && XLogRecHasBlockImageForRedo()){ RestoreBlockImage(); .... return
BLK_RESTORED;}......
}

checkConsistency()
{....if (wal_consistency[rmid] && !XLogRecHasBlockImage()) throw error;.....
}

*XLogRecHasBlockImageForRedo() checks whether bimg_info is set withBKPIMAGE_IS_REQUIRED_FOR_REDO.

For a backup image any of the followings is possible:
1. consistency should be checked.
2. page should restored.
3. both 1 and 2.

Consistency check can be controlled by a guc parameter. But, standby
should be conveyed whether an image should be restored. For that, we
have used the new flag.
Suggestions?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Sep 8, 2016 at 1:20 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>> 2. For BRIN/UPDATE+INIT, block numbers (in rm_tid[0]) are different in
>> REVMAP page. This happens only for two cases. I'm not sure what the
>> reason can be.
>
> Hm? This smells like a block reference bug. What are the cases you are
> referring to?
>
Following is the only case where the backup page stored in the wal
record and the current page after redo are not consistent.

test:BRIN using gmake-check

Master
-----------------------------------------
STATEMENT:  VACUUM brintest;
LOG:  INSERT @ 0/59E1E0F8:  - BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1

Standby
----------------------------------------------
LOG:  REDO @ 0/59E1B500; LSN 0/59E1E0F8: prev 0/59E17578; xid 0; len
14; blkref #0: rel 1663/16384/30556, blk 12; blkref #1: rel
1663/16384/30556, blk 1; blkref #2: rel 1663/16384/30556, blk 2 -
BRIN/UPDATE+INIT: heapBlk 100 pagesPerRange 1 old offnum 11, new
offnum 1

WARNING:  Inconsistent page (at byte 26) found, rel 1663/16384/30556,
forknum 0, blkno 1
CONTEXT:  xlog redo at 0/59E1B500 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1

thoughts?
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Robert Haas
Date:
On Tue, Sep 13, 2016 at 9:04 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> It seems to me that you need to think about the way to document things
> properly first, with for example:
> - Have a first documentation patch that explains what is a resource
> manager for WAL, and what are the types available with a nice table.
> - Add in your patch documentation to explain what are the benefits of
> using this facility, the main purpose is testing, but there are also
> mention upthread about users that would like to get that into
> production, assuming that the overhead is minimal.

So, I don't think that this patch should be required to document all
of the currently-undocumented stuff that somebody might want to know
that it is related to this patch.  It should be enough to documented
the patch itself.   One paragraph in config.sgml in the usual format
should be fine.  Maybe two paragraphs.  We do need to list the
resource managers, but that can just be something like this:

The default value of for this setting is <literal>off</>.  To check
all records written to the write-ahead log, set this parameter to
<literal>all</literal>.  To check only same records, specify a
comma-separated list of resource managers.  The resource managers
which are currently supported are <literal>heap</>, <literal>btree</>,
<literal>hash</>, BLAH, and BLAH.

If somebody wants to write some user-facing documentation of the
write-ahead log format, great.  That could certainly be very helpful
for people who are running pg_xlogdump.  But I don't think that stuff
goes in this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
Hello,

I've added the updated the patch with the necessary documentation and comments.
I've referenced Robert's reply in this thread and Simon's reply in
Production block comparison facility thread to write the documentation.

This feature is used to check the consistency of WAL records, i.e,
whether the WAL records are inserted and applied correctly.
A guc parameter named wal_consistency is added to enable this feature.
When wal_consistency is enabled for a WAL record, it stores a full-page image
along with the record. When a full-page image arrives during redo, it compares
against the current page to check whether both are consistent.

The default value for this setting is none. To check all records written to the
write-ahead log, set this parameter to all. To check only some records, specify
a comma-separated list of resource managers. The resource managers which
are currently supported are xlog, heap2, heap, btree, hash, gin, gist, spgist,
sequence, brin and generic.

If any inconsistency is detected, it throws a WARNING. But, as per discussions
in the earlier threads, it can be changed to ERROR./FATAL(just a one
word change).
I've kept this as warning because of some inconsistency in BRIN VACUUM
during gmake check.

In recovery tests, I've enabled this feature in PostgresNode.pm.

Thanks to Amit, Dilip, Michael, Simon and Robert for their valuable feedbacks.

Thoughts?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Sep 15, 2016 at 7:30 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> Thoughts?

There are still a couple of things that this patch makes me unhappy,
particularly the handling of the GUC with the xlogreader flags. I am
not sure if I'll be able to look at that again within the next couple
of weeks, but please be sure that this is registered in the next
commit fest. You could for example do that by changing the patch from
"Returned with Feedback" to "Moved to next CF" in the commit fest app.
Be sure as well to spend a couple of cycles in reviewing patches.
Usually for one patch sent, that's one patch of equal difficulty to
review, and there are many patch still waiting for feedback.
-- 
Michael



Re: WAL consistency check facility

From
Robert Haas
Date:
On Thu, Sep 15, 2016 at 9:23 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Sep 15, 2016 at 7:30 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> Thoughts?
>
> There are still a couple of things that this patch makes me unhappy,
> particularly the handling of the GUC with the xlogreader flags. I am
> not sure if I'll be able to look at that again within the next couple
> of weeks, but please be sure that this is registered in the next
> commit fest. You could for example do that by changing the patch from
> "Returned with Feedback" to "Moved to next CF" in the commit fest app.
> Be sure as well to spend a couple of cycles in reviewing patches.
> Usually for one patch sent, that's one patch of equal difficulty to
> review, and there are many patch still waiting for feedback.

I don't think you have the right to tell Kuntal that he has to move
the patch to the next CommitFest because there are unspecified things
about the current version you don't like.  If you don't have time to
review further, that's your call, but he can leave the patch as Needs
Review and see if someone else has time.

You are right that he should review some other people's patches, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Fri, Sep 16, 2016 at 10:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think you have the right to tell Kuntal that he has to move
> the patch to the next CommitFest because there are unspecified things
> about the current version you don't like.  If you don't have time to
> review further, that's your call, but he can leave the patch as Needs
> Review and see if someone else has time.

No complain from here if done this way. I don't mean any offense :)
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Fri, Sep 16, 2016 at 10:36 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Sep 16, 2016 at 10:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't think you have the right to tell Kuntal that he has to move
>> the patch to the next CommitFest because there are unspecified things
>> about the current version you don't like.  If you don't have time to
>> review further, that's your call, but he can leave the patch as Needs
>> Review and see if someone else has time.
>
> No complain from here if done this way. I don't mean any offense :)

Seeing nothing happening, I have moved the patch to next CF as there
is a new version, but no reviews for it.
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Sep 29, 2016 at 12:49 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Seeing nothing happening, I have moved the patch to next CF as there
> is a new version, but no reviews for it.

Just a note for anybody potentially looking at this patch. I am
currently looking at it in depth, and will post a new version of the
patch in a couple of days with review comments. Thanks.
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Oct 27, 2016 at 5:08 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Sep 29, 2016 at 12:49 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Seeing nothing happening, I have moved the patch to next CF as there
>> is a new version, but no reviews for it.
>
> Just a note for anybody potentially looking at this patch. I am
> currently looking at it in depth, and will post a new version of the
> patch in a couple of days with review comments. Thanks.

And here we go. Here is a review as well as a large brush-up for this
patch. A couple of things:
- wal_consistency is using a list of RMGRs, at the cost of being
PGC_POSTMASTER. I'd suggest making it PGC_SUSER, and use a boolean (I
have been thinking hard about that, and still I don't see the point).
It is rather easy to for example default it to false, and enable it to
true to check if a certain code path is correctly exercised or not for
WAL consistency. Note that this simplification reduces the patch size
by 100~150 lines. I know, I know, I'd expect some complains about
that....
- Looking for wal_consistency at replay has no real value. What if on
a standby the parameter value is inconsistent than the one on the
master? This logic adds a whole new level of complications and
potential bugs. So instead my suggestion is to add a marker at WAL
record level to check if this record should be checked for consistency
at replay or not. This is also quite flexible if you think about it,
the standby is made independent of the WAL generated on the master and
just applies, or checks what it sees is fit checking for. The best
match here is to add a flag for XLogRecord->xl_info and make use of
one of the low 4 bits and only one is used now for
XLR_SPECIAL_REL_UPDATE. An interesting side effect of this approach is
that callers of XLogInsert can set XLR_CHECK_CONSISTENCY to enforce a
consistency check even if wal_consistency is off. It is true that we
could register such a data via XLogRegisterBuffer() instead, though
the 4 bits with the BKPBLOCK_* flags are already occupied so that
would induce a record penalty length and I have a hard time believing
that one would like to check the consistency of a record in
particular.
- Speaking of which using BKPIMAGE_IS_REQUIRED_FOR_REDO stored in the
block definition is sort of weird because we want to know if
consistency should be checked at a higher level.
- in maskPage, the new rmgr routine, there is no need for the info and
blkno arguments. info is not used at all to begin with. blkno is used
for gin pages to detect meta pages but this can be guessed using the
opaque pointer. For heap pages and speculative inserts, masking the
blkno would be fine. That's not worth it.
- Instead of palloc'ing the old and new pages to compare, it would be
more performant to keep around two static buffers worth of BLCKSZ and
just use that. This way there is no need as well to perform any palloc
calls in the masking functions, limiting the risk of errors (those
code paths had better avoid errors IMO). It would be also less costly
to just pass to the masking function a pointer to a buffer of size
BLCKSZ and just do the masking on it.
- The masking routine names can be more generic, like XXX_mask(char
*page). No need to say page, we already know they work on it via the
argument provided.
- mask_xlog_page and mask_generic_page are useless as the block
restored comes directly from a FPW, so you are comparing basically a
FPW with itself.
- In checkConsistency, there is no need to allocate the old page. As
RestorebackupImage stores the data in an already allocated buffer, you
can reuse the same location as the buffer masked afterwards.
- Removed comparePages(), using memcmp instead for simplicity(). This
does not show up the exact location of the inconsistency, still that
won't be a win as there could be more than one inconsistency across a
page. So this gives an invitation to user to look at the exact
context. memcmp can be used anyway to understand where is the
inconsistency if need be.
- I have noticed that mask_common_page is meaningfull just for the
sequence RMGR, and just that does not justify its existence so I
ripped it off.
- PostgresNode.pm enables wal_consistency. Seeing the high amount of
WAL this produces, I feel cold about doing that, the patch does
include it btw...
- Standbys now stop with FATAL when an inconsistency is found. This
makes error detection easier on buildfarm machines.
- A couple of masking functions still use 0xFFFFFF or similar marks.
Those should be replaced by MASK_MARKING. Not done that yet.
- Some of the masking routines should be refined, particularly the
heap and GIn functions. I did not spend time yet to do it.

On top of that, I have done a fair amount of testing, creating
manually some inconsistencies in the REDO routines to trigger failures
on standbys. And that was sort of fun to break things intentionally.

Another fun thing is the large amount of WAL that this generates (!),
so anyone willing to enable that in production would be crazy.
Enabling it for development and/or session is something that would
clearly help.

I am sending back the patch as waiting on author. Attached is what I
have up to now.
--
Michael

Attachment

Re: WAL consistency check facility

From
Robert Haas
Date:
On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> And here we go. Here is a review as well as a large brush-up for this
> patch. A couple of things:
> - wal_consistency is using a list of RMGRs, at the cost of being
> PGC_POSTMASTER. I'd suggest making it PGC_SUSER, and use a boolean (I
> have been thinking hard about that, and still I don't see the point).
> It is rather easy to for example default it to false, and enable it to
> true to check if a certain code path is correctly exercised or not for
> WAL consistency. Note that this simplification reduces the patch size
> by 100~150 lines. I know, I know, I'd expect some complains about
> that....

I don't understand how you can fail to see the point of that.  As you
yourself said, this facility generates a ton of WAL.  If you're
focusing on one AM, why would you want to be forced to incur the
overhead for every other AM?  A good deal has been written about this
upthread already, and just saying "I don't see the point" seems to be
ignoring the explanations already given.

> - Looking for wal_consistency at replay has no real value. What if on
> a standby the parameter value is inconsistent than the one on the
> master? This logic adds a whole new level of complications and
> potential bugs. So instead my suggestion is to add a marker at WAL
> record level to check if this record should be checked for consistency
> at replay or not.

Agreed.

> This is also quite flexible if you think about it,
> the standby is made independent of the WAL generated on the master and
> just applies, or checks what it sees is fit checking for.

+1.

> The best
> match here is to add a flag for XLogRecord->xl_info and make use of
> one of the low 4 bits and only one is used now for
> XLR_SPECIAL_REL_UPDATE.

Seems reasonable.

> - in maskPage, the new rmgr routine, there is no need for the info and
> blkno arguments. info is not used at all to begin with. blkno is used
> for gin pages to detect meta pages but this can be guessed using the
> opaque pointer. For heap pages and speculative inserts, masking the
> blkno would be fine. That's not worth it.

Passing the blkno doesn't cost anything.  If it avoids guessing,
that's entirely worth it.

> - Instead of palloc'ing the old and new pages to compare, it would be
> more performant to keep around two static buffers worth of BLCKSZ and
> just use that. This way there is no need as well to perform any palloc
> calls in the masking functions, limiting the risk of errors (those
> code paths had better avoid errors IMO). It would be also less costly
> to just pass to the masking function a pointer to a buffer of size
> BLCKSZ and just do the masking on it.

We always palloc buffers like this so that they will be aligned.  But
we could arrange not to repeat the palloc every time (see, e.g.,
BootstrapXLOG()).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Fri, Oct 28, 2016 at 11:35 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> And here we go. Here is a review as well as a large brush-up for this
> patch. A couple of things:
Thanks for reviewing the patch in detail.

> - Speaking of which using BKPIMAGE_IS_REQUIRED_FOR_REDO stored in the
> block definition is sort of weird because we want to know if
> consistency should be checked at a higher level.
A full page image can be included in the WAL record because of the following
reasons:
1. It needs to be restored during replay.
2. WAL consistency should be checked for the record.
3. Both of above.
In your patch, you've included a full page image whenever wal_consistency
is true. So, XLogReadBufferForRedoExtended always restores the image
and returns BLK_RESTORED, which is unacceptable. We can't change
the default WAL replay behaviour. A full image should only be restored if it is
necessary to do so. Although, I agree that BKPIMAGE_IS_REQUIRED_FOR_REDO
doesn't look a clean way to implement this feature.

> - wal_consistency is using a list of RMGRs, at the cost of being
> PGC_POSTMASTER. I'd suggest making it PGC_SUSER, and use a boolean (I
> have been thinking hard about that, and still I don't see the point).
> It is rather easy to for example default it to false, and enable it to
> true to check if a certain code path is correctly exercised or not for
> WAL consistency. Note that this simplification reduces the patch size
> by 100~150 lines. I know, I know, I'd expect some complains about
> that....
As Robert also told, if I'm focusing on a single AM, I really don't
want to store
full images and perform consistency check for other AMs.

> On top of that, I have done a fair amount of testing, creating
> manually some inconsistencies in the REDO routines to trigger failures
> on standbys. And that was sort of fun to break things intentionally.
I know the feeling. :)

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Mon, Oct 31, 2016 at 9:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> And here we go. Here is a review as well as a large brush-up for this
>> patch. A couple of things:
>> - wal_consistency is using a list of RMGRs, at the cost of being
>> PGC_POSTMASTER. I'd suggest making it PGC_SUSER, and use a boolean (I
>> have been thinking hard about that, and still I don't see the point).
>> It is rather easy to for example default it to false, and enable it to
>> true to check if a certain code path is correctly exercised or not for
>> WAL consistency. Note that this simplification reduces the patch size
>> by 100~150 lines. I know, I know, I'd expect some complains about
>> that....
>
> I don't understand how you can fail to see the point of that.  As you
> yourself said, this facility generates a ton of WAL.  If you're
> focusing on one AM, why would you want to be forced to incur the
> overhead for every other AM?  A good deal has been written about this
> upthread already, and just saying "I don't see the point" seems to be
> ignoring the explanations already given.

Hehe, I was expecting you to jump on those lines. While looking at the
patch I have simplified it first to focus on the core engine of the
thing. Adding back this code sounds fine to me as there is a wall of
contestation. I offer to do it myself if the effort is the problem.
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Mon, Oct 31, 2016 at 9:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
>> - Instead of palloc'ing the old and new pages to compare, it would be
>> more performant to keep around two static buffers worth of BLCKSZ and
>> just use that. This way there is no need as well to perform any palloc
>> calls in the masking functions, limiting the risk of errors (those
>> code paths had better avoid errors IMO). It would be also less costly
>> to just pass to the masking function a pointer to a buffer of size
>> BLCKSZ and just do the masking on it.
>
> We always palloc buffers like this so that they will be aligned.  But
> we could arrange not to repeat the palloc every time (see, e.g.,
> BootstrapXLOG()).

Yeah, we could go with that and there is clearly no reason to not do so.
-- 
Michael



Re: WAL consistency check facility

From
Robert Haas
Date:
On Mon, Oct 31, 2016 at 5:51 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Hehe, I was expecting you to jump on those lines. While looking at the
> patch I have simplified it first to focus on the core engine of the
> thing. Adding back this code sounds fine to me as there is a wall of
> contestation. I offer to do it myself if the effort is the problem.

IMHO, your rewrite of this patch was a bit heavy-handed.  I haven't
scrutinized the code here so maybe it was a big improvement, and if so
fine, but if not it's better to collaborate with the author than to
take over.  In any case, yeah, I think you should put that back.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Peter Geoghegan
Date:
On Mon, Oct 31, 2016 at 5:31 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 28, 2016 at 2:05 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> And here we go. Here is a review as well as a large brush-up for this
>> patch. A couple of things:
>> - wal_consistency is using a list of RMGRs, at the cost of being
>> PGC_POSTMASTER. I'd suggest making it PGC_SUSER, and use a boolean (I
>> have been thinking hard about that, and still I don't see the point).
>> It is rather easy to for example default it to false, and enable it to
>> true to check if a certain code path is correctly exercised or not for
>> WAL consistency. Note that this simplification reduces the patch size
>> by 100~150 lines. I know, I know, I'd expect some complains about
>> that....
>
> I don't understand how you can fail to see the point of that.  As you
> yourself said, this facility generates a ton of WAL.  If you're
> focusing on one AM, why would you want to be forced to incur the
> overhead for every other AM?  A good deal has been written about this
> upthread already, and just saying "I don't see the point" seems to be
> ignoring the explanations already given.

+1. I strongly agree.


-- 
Peter Geoghegan



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Tue, Nov 1, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> IMHO, your rewrite of this patch was a bit heavy-handed.

OK... Sorry for that.

> I haven't
> scrutinized the code here so maybe it was a big improvement, and if so
> fine, but if not it's better to collaborate with the author than to
> take over.

While reviewing the code, that has finished by being a large rewrite,
and that was more understandable than a review looking at all the
small tweaks and things I have been through while reading it. I have
also experimented a couple of ideas with the patch that I added, so at
the end it proves to be a gain for everybody. I think that the last
patch is an improvement, if you want to make your own opinion on the
matter looking at the differences between both patches would be the
most direct way to go.

> In any case, yeah, I think you should put that back.

Here you go with this parameter back and the allocation of the masked
buffers done beforehand, close to the moment the XLogReader is
allocated actually. I have also removed wal_consistency from
PostgresNode.pm, small buildfarm machines would really suffer on it,
and hamster is very good to track race conditions when running TAP
tests. On top of that I have replaced a bunch of 0xFFFFF thingies by
their PG_UINT_MAX equivalents to keep things cleaner.

Now, I have put back the GUC-related code exactly to the same shape as
it was originally. Here are a couple of comments regarding it after
review:
- Let's drop 'none' as a magic keyword. Users are going to use an
empty string, and the default should be defined as such IMO.
- Using an allocated array of booleans to store the values of each
RMGRs could be replaced by an integer using bitwise shifts. Your
option looks better and makes the code cleaner.

A more nitpick remark: code comments don't refer much to RMIDs, but
they use the term "resource managers" more generally. I'd suggest to
do the same.
--
Michael

Attachment

Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Nov 2, 2016 at 10:23 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Nov 1, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> IMHO, your rewrite of this patch was a bit heavy-handed.
>
> OK... Sorry for that.
>
>> I haven't
>> scrutinized the code here so maybe it was a big improvement, and if so
>> fine, but if not it's better to collaborate with the author than to
>> take over.
>
> While reviewing the code, that has finished by being a large rewrite,
> and that was more understandable than a review looking at all the
> small tweaks and things I have been through while reading it. I have
> also experimented a couple of ideas with the patch that I added, so at
> the end it proves to be a gain for everybody. I think that the last
> patch is an improvement, if you want to make your own opinion on the
> matter looking at the differences between both patches would be the
> most direct way to go.
>
If my understanding is correct regarding this feature, last two patches
completely break the fundamental idea of wal consistency check feature.
I mentioned this in my last reply as well that we've to use some flag
to indicate
whether an image should be restored during replay or not. Otherwise,
XLogReadBufferForRedoExtended will always restore the image skipping the usual
redo operation. What's happening now is the following:
1. If wal_consistency is on, include backup block image with the wal record.
2. During replay, XLogReadBufferForRedoExtended always restores the backup block
image in local buffer since XLogRecHasBlockImage is true for each block.
3. In checkConsistency, you compare the local buffer with the backup block image
from the wal record. It'll always be consistent.
This feature aims to validate whether wal replay operation is
happening correctly or not.
To achieve that aim, we should not alter the wal replay operation itself.

Rest of the suggestions are well-taken. I'll update the patch accordingly.
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Nov 2, 2016 at 4:41 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Wed, Nov 2, 2016 at 10:23 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Tue, Nov 1, 2016 at 10:31 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> IMHO, your rewrite of this patch was a bit heavy-handed.
>>
>> OK... Sorry for that.
>>
>>> I haven't
>>> scrutinized the code here so maybe it was a big improvement, and if so
>>> fine, but if not it's better to collaborate with the author than to
>>> take over.
>>
>> While reviewing the code, that has finished by being a large rewrite,
>> and that was more understandable than a review looking at all the
>> small tweaks and things I have been through while reading it. I have
>> also experimented a couple of ideas with the patch that I added, so at
>> the end it proves to be a gain for everybody. I think that the last
>> patch is an improvement, if you want to make your own opinion on the
>> matter looking at the differences between both patches would be the
>> most direct way to go.
>>
> If my understanding is correct regarding this feature, last two patches
> completely break the fundamental idea of wal consistency check feature.
> I mentioned this in my last reply as well that we've to use some flag
> to indicate
> whether an image should be restored during replay or not. Otherwise,
> XLogReadBufferForRedoExtended will always restore the image skipping the usual
> redo operation. What's happening now is the following:
> 1. If wal_consistency is on, include backup block image with the wal record.
> 2. During replay, XLogReadBufferForRedoExtended always restores the backup block
> image in local buffer since XLogRecHasBlockImage is true for each block.
> 3. In checkConsistency, you compare the local buffer with the backup block image
> from the wal record. It'll always be consistent.
> This feature aims to validate whether wal replay operation is
> happening correctly or not.
> To achieve that aim, we should not alter the wal replay operation itself.

Hm... Right. That was broken. And actually, while the record-level
flag is useful so as you don't need to rely on checking
wal_consistency when doing WAL redo, the block-level flag is useful to
make a distinction between blocks that have to be replayed and the
ones that are used only for consistency, and both types could be mixed
in a record. Using it in bimg_info would be fine... Perhaps a better
name for the flag would be something like BKPIMAGE_APPLY, to mean that
the FPW needs to be applied at redo. Or BKPIMAGE_IGNORE, to bypass it
when replaying it. IS_REQUIRED_FOR_REDO is quite confusing.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Nov 2, 2016 at 1:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Hm... Right. That was broken. And actually, while the record-level
> flag is useful so as you don't need to rely on checking
> wal_consistency when doing WAL redo, the block-level flag is useful to
> make a distinction between blocks that have to be replayed and the
> ones that are used only for consistency, and both types could be mixed
> in a record. Using it in bimg_info would be fine... Perhaps a better
> name for the flag would be something like BKPIMAGE_APPLY, to mean that
> the FPW needs to be applied at redo. Or BKPIMAGE_IGNORE, to bypass it
> when replaying it. IS_REQUIRED_FOR_REDO is quite confusing.
BKPIMAGE_APPLY seems reasonable.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Nov 2, 2016 at 1:11 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Rest of the suggestions are well-taken. I'll update the patch accordingly.
I've updated the last submitted patch(v10) with the following changes:
- added a block level flag BKPIMAGE_APPLY to distinguish backup image
blocks which needs to be restored during replay.
- at present, hash index operations are not WAL-logged. Hence, I've removed
the consistency check option for hash indices. It can be added later.

Few comments:
- Michael suggested to use an integer variable and bitwise-shift
operation to store
the RMGR values instead of using a boolean array. But, boolean array
implementation looks cleaner to me. For example,
+if (wal_consistency[rmid])
+       rechdr->xl_info |= XLR_CHECK_CONSISTENCY;

+include_image = needs_backup || wal_consistency[rmid];

- Another suggestion was to remove wal_consistency from PostgresNode.pm
because small buildfarm machines may suffer on it. Although I've no
experience in this matter, I would like to be certain that nothings breaks
in recovery tests after some modifications.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Nov 2, 2016 at 11:30 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> On Wed, Nov 2, 2016 at 1:11 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> Rest of the suggestions are well-taken. I'll update the patch accordingly.
> I've updated the last submitted patch(v10) with the following changes:
> - added a block level flag BKPIMAGE_APPLY to distinguish backup image
> blocks which needs to be restored during replay.
> - at present, hash index operations are not WAL-logged. Hence, I've removed
> the consistency check option for hash indices. It can be added later.

Both make sense.

> - Another suggestion was to remove wal_consistency from PostgresNode.pm
> because small buildfarm machines may suffer on it. Although I've no
> experience in this matter, I would like to be certain that nothings breaks
> in recovery tests after some modifications.

An extra idea that I have here would be to extend the TAP tests to
accept an environment variable that would be used to specify extra
options when starting Postgres instances. Buildfarm machines could use
it.

+            /*
+             * Remember that, if WAL consistency check is enabled for
the current rmid,
+             * we always include backup image with the WAL record.
But, during redo we
+             * restore the backup block only if needs_backup is set.
+             */
+            if (needs_backup)
+                bimg.bimg_info |= BKPIMAGE_APPLY;
+
+
You should be careful about extra newlines and noise in the code.

-    /* If it's a full-page image, restore it. */
-    if (XLogRecHasBlockImage(record, block_id))
+    /* If full-page image should be restored, do it. */
+    if (XLogRecBlockImageApply(record, block_id))
Hm. It seems to me that this modification is incorrect. If the page
does not need to be applied, aka if it needs to be used for
consistency checks, what should be done is more something like the
following in XLogReadBufferForRedoExtended:
if (Apply(record, block_id))   return;
if (HasImage)
{   //do what needs to be done with an image
}
else
{   //do default things
}

XLogRecBlockImageApply() should only check for BKP_APPLY and not imply
HasImage(). This will be more flexible when for example using it for
assertions.

With this patch the comments on top of XLogReadBufferForRedo are
wrong. A block is not unconditionally applied.

+#define XLogRecBlockImageApply(decoder, block_id) \
+    (XLogRecHasBlockImage(decoder, block_id) && \
+    (((decoder)->blocks[block_id].bimg_info & BKPIMAGE_APPLY) > 0))
Maybe != 0? That's the most common practice in the code.

It would be more consistent to have a boolean flag to treat
BKPIMAGE_APPLY in xlogreader.c. Have a look at has_image for example.
This will as well reduce dependencies on the header xlog_record.h

+            /*
+             * For a speculative tuple, the content of t_ctid is conflicting
+             * between the backup page and current page. Hence, we set it
+             * to the current block number and current offset.
+             */
+            if (HeapTupleHeaderIsSpeculative(page_htup))
+                ItemPointerSet(&page_htup->t_ctid, blkno, off);
In the set of masking functions this is the only portion of the code
depending on blkno. But isn't that actually a bug in speculative
inserts? Andres (added now in CC), Peter, could you provide some input
regarding that? The masking functions should not prevent the detection
of future errors, and this code is making me uneasy.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Nov 3, 2016 at 2:35 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> - Another suggestion was to remove wal_consistency from PostgresNode.pm
>> because small buildfarm machines may suffer on it. Although I've no
>> experience in this matter, I would like to be certain that nothings breaks
>> in recovery tests after some modifications.
>
> An extra idea that I have here would be to extend the TAP tests to
> accept an environment variable that would be used to specify extra
> options when starting Postgres instances. Buildfarm machines could use
> it.
>
It can be added as a separate feature.

>
> -    /* If it's a full-page image, restore it. */
> -    if (XLogRecHasBlockImage(record, block_id))
> +    /* If full-page image should be restored, do it. */
> +    if (XLogRecBlockImageApply(record, block_id))
> Hm. It seems to me that this modification is incorrect. If the page
> does not need to be applied, aka if it needs to be used for
> consistency checks, what should be done is more something like the
> following in XLogReadBufferForRedoExtended:
> if (Apply(record, block_id))
>     return;
> if (HasImage)
> {
>     //do what needs to be done with an image
> }
> else
> {
>     //do default things
> }
>
XLogReadBufferForRedoExtended should return a redo action
(block restored, done, block needs redo or block not found). So, we
can't just return
from the function if BLKIMAGE_APPLY flag is not set. It still has to
check whether a
redo is required or not.

> XLogRecBlockImageApply() should only check for BKP_APPLY and not imply
> HasImage(). This will be more flexible when for example using it for
> assertions.
seems reasonable.

> It would be more consistent to have a boolean flag to treat
> BKPIMAGE_APPLY in xlogreader.c. Have a look at has_image for example.
For flags in bimg_info, we directly check if the mask bit is set in bimg_info
(ex: hole, compressed). Besides, we use this flag only at
XLogReadBufferForRedoExtended.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Nov 3, 2016 at 3:24 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Thu, Nov 3, 2016 at 2:35 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>> - Another suggestion was to remove wal_consistency from PostgresNode.pm
>>> because small buildfarm machines may suffer on it. Although I've no
>>> experience in this matter, I would like to be certain that nothings breaks
>>> in recovery tests after some modifications.
>>
>> An extra idea that I have here would be to extend the TAP tests to
>> accept an environment variable that would be used to specify extra
>> options when starting Postgres instances. Buildfarm machines could use
>> it.
>>
> It can be added as a separate feature.
>
>>
>> -    /* If it's a full-page image, restore it. */
>> -    if (XLogRecHasBlockImage(record, block_id))
>> +    /* If full-page image should be restored, do it. */
>> +    if (XLogRecBlockImageApply(record, block_id))
>> Hm. It seems to me that this modification is incorrect. If the page
>> does not need to be applied, aka if it needs to be used for
>> consistency checks, what should be done is more something like the
>> following in XLogReadBufferForRedoExtended:
>> if (Apply(record, block_id))
>>     return;
>> if (HasImage)
>> {
>>     //do what needs to be done with an image
>> }
>> else
>> {
>>     //do default things
>> }
>>
> XLogReadBufferForRedoExtended should return a redo action
> (block restored, done, block needs redo or block not found). So, we
> can't just return
> from the function if BLKIMAGE_APPLY flag is not set. It still has to
> check whether a
> redo is required or not.

Wouldn't the definition of a new redo action make sense then? Say
SKIPPED. None of the existing actions match the non-apply case.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Nov 3, 2016 at 12:34 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Nov 3, 2016 at 3:24 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> On Thu, Nov 3, 2016 at 2:35 AM, Michael Paquier
>>> -    /* If it's a full-page image, restore it. */
>>> -    if (XLogRecHasBlockImage(record, block_id))
>>> +    /* If full-page image should be restored, do it. */
>>> +    if (XLogRecBlockImageApply(record, block_id))
>>> Hm. It seems to me that this modification is incorrect. If the page
>>> does not need to be applied, aka if it needs to be used for
>>> consistency checks, what should be done is more something like the
>>> following in XLogReadBufferForRedoExtended:
>>> if (Apply(record, block_id))
>>>     return;
>>> if (HasImage)
>>> {
>>>     //do what needs to be done with an image
>>> }
>>> else
>>> {
>>>     //do default things
>>> }
>>>
>> XLogReadBufferForRedoExtended should return a redo action
>> (block restored, done, block needs redo or block not found). So, we
>> can't just return
>> from the function if BLKIMAGE_APPLY flag is not set. It still has to
>> check whether a
>> redo is required or not.
>
> Wouldn't the definition of a new redo action make sense then? Say
> SKIPPED. None of the existing actions match the non-apply case.

As per my understanding, XLogReadBufferForRedoExtended works as follows:
1.  If wal record has backup block
2.  {
3.   restore the backup block;
4.   return BLK_RESTORED;
5.  }
6.  else
7.  {
8.   If block found in buffer
10.  If lsn of block is less than last replayed record
11.   return BLK_DONE;
12.  else
13.   return BLK_NEEDS_REDO;
14. else
15.  return BLK_NOT_FOUND;
16. }
Now, we can just change step 1 as follows:
1.  If wal record has backup block and it needs to be restored.

I'm not getting why we should introduce a new redo action and return
from the function beforehand.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Nov 3, 2016 at 4:04 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Wouldn't the definition of a new redo action make sense then? Say
> SKIPPED. None of the existing actions match the non-apply case.

I just took 5 minutes to look at the code and reason about it, and
something like what your patch is doing would be actually fine. Still
I don't think that checking for the apply flag in the macro routine
should look for has_image. Let's keep things separate.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Nov 3, 2016 at 2:27 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Nov 3, 2016 at 4:04 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Wouldn't the definition of a new redo action make sense then? Say
>> SKIPPED. None of the existing actions match the non-apply case.
>
> I just took 5 minutes to look at the code and reason about it, and
> something like what your patch is doing would be actually fine. Still
> I don't think that checking for the apply flag in the macro routine
> should look for has_image. Let's keep things separate.
Actually, I just verified that bimg_info is not even valid if
has_image is not set.
In DecodeXLogRecord, we initialize bimg_info only when has_image flag
is set. So, keeping them
separate doesn't look a good approach to me. If we keep them separate,
the output
of the following assert is undefined:
Assert(XLogRecHasBlockImage(record, block_id) ||
!XLogRecBlockImageApply(record, block_id)).

Thoughts??
-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Nov 3, 2016 at 5:56 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I'm not getting why we should introduce a new redo action and return
> from the function beforehand.

Per my last email, same conclusion from here :)
Sorry if I am picky and noisy on many points, I am trying to think
about the value of each change introduced in this patch, particularly
if they are meaningful, can be improved in some way, or can be
simplified and make the code more simple.
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Nov 3, 2016 at 6:15 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Actually, I just verified that bimg_info is not even valid if
> has_image is not set.
> In DecodeXLogRecord, we initialize bimg_info only when has_image flag
> is set. So, keeping them
> separate doesn't look a good approach to me. If we keep them separate,
> the output
> of the following assert is undefined:
> Assert(XLogRecHasBlockImage(record, block_id) ||
> !XLogRecBlockImageApply(record, block_id)).
>
> Thoughts??

Yes, that's exactly the reason why we should keep both macros as
checking for separate things: apply implies that has_image is set and
that's normal, hence we could use sanity checks by just using those
macros and not propagating xlogreader.h.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Nov 3, 2016 at 2:52 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Nov 3, 2016 at 6:15 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> Actually, I just verified that bimg_info is not even valid if
>> has_image is not set.
>> In DecodeXLogRecord, we initialize bimg_info only when has_image flag
>> is set. So, keeping them
>> separate doesn't look a good approach to me. If we keep them separate,
>> the output
>> of the following assert is undefined:
>> Assert(XLogRecHasBlockImage(record, block_id) ||
>> !XLogRecBlockImageApply(record, block_id)).
>>
>> Thoughts??
>
> Yes, that's exactly the reason why we should keep both macros as
> checking for separate things: apply implies that has_image is set and
> that's normal, hence we could use sanity checks by just using those
> macros and not propagating xlogreader.h.
>
No, apply doesn't mean has_image is set. If has_image is not set,
apply/bimg_info
is invalid(/undefined) and we should not use that. For example, in
XLogDumpDisplayRecord we use
bimg_info as following,
if (XLogRecHasBlockImage(record, block_id))
{  if (record->blocks[block_id].bimg_info & BKPIMAGE_IS_COMPRESSED)  {  }
}
So, whenever we are required to use bimg_info flag, we should make
sure that has_image
is set.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Nov 3, 2016 at 6:48 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> So, whenever we are required to use bimg_info flag, we should make
> sure that has_image
> is set.

OK, we are taking past each other here. There are two possible patterns:
- has_image is set, not apply, meaning that the image block is used
for consistency checks.
- has_image is set, as well as apply, meaning that the block needs to
be applied at redo.
So I mean exactly the same thing as you do. The point I am trying to
raise is that it would be meaningful to put in some code paths checks
of the type (apply && !has_image) and ERROR on them. Perhaps we could
just do that in xlogreader.c though. If having those checks external
to xlogreader.c makes sense, then using separate macros is more
portable.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
I've updated the patch for review.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Nov 3, 2016 at 7:47 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> I've updated the patch for review.
>
If an inconsistency is found, it'll just log it for now. Once, the
patch is finalized, we can
change it to FATAL as before. I was making sure that all regression
tests should pass with the patch.
It seems that there is some inconsistency in regression tests for BRIN index.

LOG:  Inconsistent page found, rel 1663/16384/30607, forknum 0, blkno 1
CONTEXT:  xlog redo at 0/9BAE08C8 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Robert Haas
Date:
On Wed, Nov 2, 2016 at 10:30 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> - Another suggestion was to remove wal_consistency from PostgresNode.pm
> because small buildfarm machines may suffer on it. Although I've no
> experience in this matter, I would like to be certain that nothings breaks
> in recovery tests after some modifications.

I think running the whole test suite with this enabled is going to
provoke complaints from buildfarm owners.  That's too bad, because I
agree with you that it would be nice to have the test coverage, but it
seems that many of the buildfarm machines are VMs with very minimal
resource allocations -- or very old physical machines -- or running
with settings like CLOBBER_CACHE_ALWAYS that make runs very slow.  If
you blow on them too hard, they fall over.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Nov 3, 2016 at 8:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Nov 2, 2016 at 10:30 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> - Another suggestion was to remove wal_consistency from PostgresNode.pm
>> because small buildfarm machines may suffer on it. Although I've no
>> experience in this matter, I would like to be certain that nothings breaks
>> in recovery tests after some modifications.
>
> I think running the whole test suite with this enabled is going to
> provoke complaints from buildfarm owners.  That's too bad, because I
> agree with you that it would be nice to have the test coverage, but it
> seems that many of the buildfarm machines are VMs with very minimal
> resource allocations -- or very old physical machines -- or running
> with settings like CLOBBER_CACHE_ALWAYS that make runs very slow.  If
> you blow on them too hard, they fall over.
>
Thanks Robert. I got your point. Then, as Michael has suggested, it is nice to
have some environment variable to pass optional conf parameters during
tap-tests.
Implementing this feature actually solves the problem.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Fri, Nov 4, 2016 at 4:16 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Thu, Nov 3, 2016 at 8:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Nov 2, 2016 at 10:30 AM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>>> - Another suggestion was to remove wal_consistency from PostgresNode.pm
>>> because small buildfarm machines may suffer on it. Although I've no
>>> experience in this matter, I would like to be certain that nothings breaks
>>> in recovery tests after some modifications.
>>
>> I think running the whole test suite with this enabled is going to
>> provoke complaints from buildfarm owners.  That's too bad, because I
>> agree with you that it would be nice to have the test coverage, but it
>> seems that many of the buildfarm machines are VMs with very minimal
>> resource allocations -- or very old physical machines -- or running
>> with settings like CLOBBER_CACHE_ALWAYS that make runs very slow.  If
>> you blow on them too hard, they fall over.

Count me in. My RPIs won't like that! Actually I have a couple of
things internally mimicking the buildfarm client code on machines with
far higher capacity. And FWIW I am definitely going to enable this
option in the test suite, finishing with reports here.

> Thanks Robert. I got your point. Then, as Michael has suggested, it is nice to
> have some environment variable to pass optional conf parameters during
> tap-tests.
> Implementing this feature actually solves the problem.

We just need make PostgresNode.pm aware of something like PGTAPOPTIONS
to enforce a server started by TAP tests to append options to it.
There is already PGCTLTIMEOUT that behaves similarly. Even if this
brings extra load to buildfarm owners, that will limit complaints.
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Nov 3, 2016 at 11:17 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> I've updated the patch for review.

Thank you for the new patch. This will be hopefully the last round of
reviews, we are getting close to something that has an acceptable
shape.

+       </para>
+      </listitem>
+     </varlistentry>
+
+      </listitem>
+     </varlistentry>
Did you try to compile the docs? Because that will break. (Likely my
fault). What needs to be done is removing one </listitem> and one
</varlistentry> markup.

+/*-------------------------------------------------------------------------
+ *
+ * bufmask.h
+ *       Buffer masking definitions.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/bufmask.h
+ */
We could likely survive here with just a copyright mention as 2016,
PGDG... Same remark for bufmask.c.

--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -25,6 +25,7 @@#include "commands/vacuum.h"#include "miscadmin.h"#include "optimizer/plancat.h"
+#include "storage/bufmask.h"#include "utils/index_selfuncs.h"#include "utils/rel.h"
This header declaration is not necessary.

+   /*
+    * Mask the Page LSN. Because, we store the page before updating the LSN.
+    * Hence, LSNs of both pages will always be different.
+    */
+   mask_page_lsn(page_norm);
I don't fully understand this comment if phrased this way. Well, I do
understand it, but people who would read this code for the first time
may have a hard time understanding it. So I would suggest removing it,
but add a comment on top of mask_page_lsn() to mention that in
consistency checks the LSN of the two pages compared will likely be
different because of concurrent operations when the WAL is generated
and the state of the page where WAL is applied.

+   maskopaq = (BTPageOpaque)
+           (((char *) page_norm) + ((PageHeader) page_norm)->pd_special);
+   /*
+    * Mask everything on a DELETED page.
+    */
Let's make the code breath and add a space here.

+/* Aligned Buffers dedicated to consistency checks of size BLCKSZ */
+static char *new_page_masked = NULL;
+static char *old_page_masked = NULL;
palloc'd buffers are aligned, so you could just remove the work
"Aligned" in this comment?

+       /* If we've just restored the block from backup image, skip
consistency check. */
+       if (XLogRecHasBlockImage(record, block_id) &&
+           XLogRecBlockImageApply(record, block_id))
+           continue;
Here you could just check for Apply() to decide if continue should be
called or not, and Assert afterwards on HasBlockImage(). The assertion
would help checking for inconsistency errors.

@@ -7810,6 +7929,7 @@ ReadCheckpointRecord(XLogReaderState
*xlogreader, XLogRecPtr RecPtr,   }
   record = ReadRecord(xlogreader, RecPtr, LOG, true);
+   info = record->xl_info & ~XLR_INFO_MASK;
   if (record == NULL)   {
@@ -7852,8 +7972,8 @@ ReadCheckpointRecord(XLogReaderState
*xlogreader, XLogRecPtr RecPtr,       }       return NULL;   }
-   if (record->xl_info != XLOG_CHECKPOINT_SHUTDOWN &&
-       record->xl_info != XLOG_CHECKPOINT_ONLINE)
+   if (info != XLOG_CHECKPOINT_SHUTDOWN &&
+       info != XLOG_CHECKPOINT_ONLINE)
Those changes are not directly related to this patch, but make sure
that record checks are done correctly or this patch would just fail.
It may be better to apply those changes independently first per the
patch on this thread:
https://www.postgresql.org/message-id/CAB7nPqSWVyaZJg7_amRKVqRpEP=_=54e+762+2PF9u3Q3+Z0Nw@mail.gmail.com
My recommendation is to do so.

+           /*
+            * Remember that, if WAL consistency check is enabled for
the current rmid,
+            * we always include backup image with the WAL record.
But, during redo we
+            * restore the backup block only if needs_backup is set.
+            */
This could be rewritten a bit:
"if WAL consistency is enabled for the resource manager of this WAL
record, a full-page image is included in the record for the block
modified. During redo, the full-page is replayed only of
BKPIMAGE_APPLY is set."

- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if BKPIMAGE_APPLY flag is not set for the backup block,
+ * the relation is extended with all-zeroes pages up to the
referenced block number.
+ * In RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value * is always BLK_NEEDS_REDO
You are forgetting to mention "if the page does not exist" in the new
comment block.

+   /* If it has a full-page image and it should be restored, do it. */
+   if (XLogRecHasBlockImage(record, block_id) &&
XLogRecBlockImageApply(record, block_id))   {
Perhaps on two lines?

The headers of the functions in bufmask.c could be more descriptive,
there should be explanations regarding in which aspect they are useful
to guide the user in using them wisely (linked to my comment upstread
if the badly formulated comments before called mask_page_lsn).

Something regarding check_wal_consistency is making uneasy... But I
can't put my finger on what that is now..

I would still for the removal of blkno in the list of arguments of the
masking functions. This is used just for speculative inserts, where we
could just enforce the page number to 0 because this does not matter,
as Peter has mentioned upthread.

Could it be possible to add in pg_xlogdump.c a mention about a FPW
that has the "apply" flag. That would be important for debugging and
development. You could just have for example "(FPW)" for a page that
won't be applied, and "(FPW) apply" for a page where the apply flag is
active.

Please update gindesc.c for FPWs that have the apply flag, issue found
while checking the callers of XLogRecHasBlockImage().

In DecodeXLogRecord@xlogreader.c, please add a boolean flag "apply"
and then please could you do some error checks on it. Only one is
needed: if "apply" is true and has_image is false, xlogreader.c should
complain about an inconsistency!

I haven't performed any tests with the patch, and that's all I have
regarding the code. With that done we should be in good shape
code-speaking I think...
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Fri, Nov 4, 2016 at 5:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Nov 3, 2016 at 11:17 PM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> I've updated the patch for review.
>
> Thank you for the new patch. This will be hopefully the last round of
> reviews, we are getting close to something that has an acceptable
> shape.

One last thing: in XLogRecordAssemble(), could you enforce the value
of info at the beginning of the routine when wal_consistency[rmid] is
true? And then use the value of info to decide if include_image is
true or not? The idea here is to allow callers of XLogInsert() to pass
by themselves XLR_CHECK_CONSISTENCY and still have consistency checks
enabled for a given record even if wal_consistency is false for the
rmgr of the record happening. This would be potentially useful for
extension and feature developers when debugging some stuff, for some
builds compiled with a DEBUG flag, or whatever.
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Fri, Nov 4, 2016 at 6:02 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Nov 4, 2016 at 5:02 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Nov 3, 2016 at 11:17 PM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>>> I've updated the patch for review.
>>
>> Thank you for the new patch. This will be hopefully the last round of
>> reviews, we are getting close to something that has an acceptable
>> shape.
>
> One last thing: in XLogRecordAssemble(), could you enforce the value
> of info at the beginning of the routine when wal_consistency[rmid] is
> true? And then use the value of info to decide if include_image is
> true or not? The idea here is to allow callers of XLogInsert() to pass
> by themselves XLR_CHECK_CONSISTENCY and still have consistency checks
> enabled for a given record even if wal_consistency is false for the
> rmgr of the record happening. This would be potentially useful for
> extension and feature developers when debugging some stuff, for some
> builds compiled with a DEBUG flag, or whatever.

And you need to rebase the patch, d5f6f13 has fixed the handling of
xl_info with XLR_INFO_MASK.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Fri, Nov 4, 2016 at 1:32 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Thank you for the new patch. This will be hopefully the last round of
> reviews, we are getting close to something that has an acceptable
> shape.
Thanks a lot for reviewing the patch. Based on your review, I've attached the
updated patch along with few comments.

> In DecodeXLogRecord@xlogreader.c, please add a boolean flag "apply"
> and then please could you do some error checks on it. Only one is
> needed: if "apply" is true and has_image is false, xlogreader.c should
> complain about an inconsistency!
Added a flag named apply_image in DecodedBkpBlock and used it to
check whether image apply is required or not.

> I would still for the removal of blkno in the list of arguments of the
> masking functions. This is used just for speculative inserts, where we
> could just enforce the page number to 0 because this does not matter,
> as Peter has mentioned upthread.
It just doesn't feel right to me to enforce the number manually when
I can use the blkno without any harm.

> I haven't performed any tests with the patch, and that's all I have
> regarding the code. With that done we should be in good shape
> code-speaking I think...
I've done a fair amount of testing which includes regression tests
and manual creation of inconsistencies in the page. I've also found the
reason behind inconsistency in brin revmap page.
Brin revmap page doesn't have standard page layout. Besides, It doesn't update
pd_upper and pd_lower in its operations as well. But, during WAL
insertions, it uses
REGBUF_STANDARD to register a reference in the WAL record. Hence, when we
restore image before consistency check, RestoreBlockImage fills the space
between pd_upper and pd_lower(page hole) with zero. I've posted this as a
separate thread.

https://www.postgresql.org/message-id/flat/CAGz5QCJ%3D00UQjScSEFbV%3D0qO5ShTZB9WWz_Fm7%2BWd83zPs9Geg%40mail.gmail.com#CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Nov 9, 2016 at 11:32 PM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> On Fri, Nov 4, 2016 at 1:32 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Thank you for the new patch. This will be hopefully the last round of
>> reviews, we are getting close to something that has an acceptable
>> shape.
>
> Thanks a lot for reviewing the patch. Based on your review, I've attached the
> updated patch along with few comments.

Thanks for the new version. pg_xlogdump is really helpful now for debugging.

>> I haven't performed any tests with the patch, and that's all I have
>> regarding the code. With that done we should be in good shape
>> code-speaking I think...
>
> I've done a fair amount of testing which includes regression tests
> and manual creation of inconsistencies in the page. I've also found the
> reason behind inconsistency in brin revmap page.
> Brin revmap page doesn't have standard page layout. Besides, It doesn't update
> pd_upper and pd_lower in its operations as well. But, during WAL
> insertions, it uses
> REGBUF_STANDARD to register a reference in the WAL record. Hence, when we
> restore image before consistency check, RestoreBlockImage fills the space
> between pd_upper and pd_lower(page hole) with zero. I've posted this as a
> separate thread.
>
https://www.postgresql.org/message-id/flat/CAGz5QCJ%3D00UQjScSEFbV%3D0qO5ShTZB9WWz_Fm7%2BWd83zPs9Geg%40mail.gmail.com#CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com

Nice to have spotted the inconsistency. This tool is going to be useful..

I have spent a couple of hours playing with the patch, and worked on
it a bit more with a couple of minor changes:
- In gindesc.c, the if blocks are more readable with brackets.
- Addition of a comment when info is set, to mention that this is done
at the beginning of the routine so as callers of XLogInsert() can pass
the flag for consistency checks.
- apply_image should be reset in ResetDecoder().
- The BRIN problem is here:
2016-11-10 12:24:10 JST [65776]: [23-1] db=,user=,app=,client= FATAL:
Inconsistent page found, rel 1663/16385/30625, forknum 0, blkno 1
2016-11-10 12:24:10 JST [65776]: [24-1] db=,user=,app=,client=
CONTEXT:  xlog redo at 0/9BD31148 for BRIN/UPDATE+INIT: heapBlk 100
pagesPerRange 1 old offnum 11, new offnum 1
2016-11-10 12:24:10 JST [65776]: [25-1] db=,user=,app=,client=
WARNING:  buffer refcount leak: [4540] (rel=base/16385/30625,
blockNum=1, flags=0x93800000, refcount=1 1)
TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2506)
Now the buffer refcount leak is not normal! The safest thing to do
here is to release the buffer once a copy of it has been taken, and
the leaks goes away when calling FATAL to report the inconsistency.
- When checking for XLR_CHECK_CONSISTENCY, better to add a != 0 to get
a boolean out of it.
- guc_malloc() should be done as late as possible, this simplifies the
code and prevents any memory leaks which is what your patch is doing
when there is an error. (I have finally put my finger on what was
itching me here).
- In btree_mask, the lookup of BTP_DELETED can be deadly simplified.
- I wondered also about putting assign_wal_consistency and
check_wal_consistency in a separate file, say xlogparams.c, concluding
that the current code does nothing bad either even if it adds rmgr.h
in the list of headers in guc.c.
- Some comment blocks are longer than 72~80 characters.
- Typos!

With the patch for BRIN applied, I am able to get installcheck-world
working with wal_consistency = all and a standby doing the consistency
checks behind. Adding wal_consistency = all in PostgresNode.pm, the
recovery tests are passing. This patch is switched as "Ready for
Committer". Thanks for completing this effort begun 3 years ago!
--
Michael

Attachment

Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Nov 10, 2016 at 10:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Nov 9, 2016 at 11:32 PM, Kuntal Ghosh
>> Thanks a lot for reviewing the patch. Based on your review, I've attached the

>> I've done a fair amount of testing which includes regression tests
>> and manual creation of inconsistencies in the page. I've also found the
>> reason behind inconsistency in brin revmap page.
>> Brin revmap page doesn't have standard page layout. Besides, It doesn't update
>> pd_upper and pd_lower in its operations as well. But, during WAL
>> insertions, it uses
>> REGBUF_STANDARD to register a reference in the WAL record. Hence, when we
>> restore image before consistency check, RestoreBlockImage fills the space
>> between pd_upper and pd_lower(page hole) with zero. I've posted this as a
>> separate thread.
>>
https://www.postgresql.org/message-id/flat/CAGz5QCJ%3D00UQjScSEFbV%3D0qO5ShTZB9WWz_Fm7%2BWd83zPs9Geg%40mail.gmail.com#CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com
>
> Nice to have spotted the inconsistency. This tool is going to be useful..
>
> I have spent a couple of hours playing with the patch, and worked on
> it a bit more with a couple of minor changes:
> - In gindesc.c, the if blocks are more readable with brackets.
> - Addition of a comment when info is set, to mention that this is done
> at the beginning of the routine so as callers of XLogInsert() can pass
> the flag for consistency checks.
> - apply_image should be reset in ResetDecoder().
> - The BRIN problem is here:
> 2016-11-10 12:24:10 JST [65776]: [23-1] db=,user=,app=,client= FATAL:
> Inconsistent page found, rel 1663/16385/30625, forknum 0, blkno 1
> 2016-11-10 12:24:10 JST [65776]: [24-1] db=,user=,app=,client=
> CONTEXT:  xlog redo at 0/9BD31148 for BRIN/UPDATE+INIT: heapBlk 100
> pagesPerRange 1 old offnum 11, new offnum 1
> 2016-11-10 12:24:10 JST [65776]: [25-1] db=,user=,app=,client=
> WARNING:  buffer refcount leak: [4540] (rel=base/16385/30625,
> blockNum=1, flags=0x93800000, refcount=1 1)
> TRAP: FailedAssertion("!(RefCountErrors == 0)", File: "bufmgr.c", Line: 2506)
> Now the buffer refcount leak is not normal! The safest thing to do
> here is to release the buffer once a copy of it has been taken, and
> the leaks goes away when calling FATAL to report the inconsistency.
> - When checking for XLR_CHECK_CONSISTENCY, better to add a != 0 to get
> a boolean out of it.
> - guc_malloc() should be done as late as possible, this simplifies the
> code and prevents any memory leaks which is what your patch is doing
> when there is an error. (I have finally put my finger on what was
> itching me here).
> - In btree_mask, the lookup of BTP_DELETED can be deadly simplified.
> - I wondered also about putting assign_wal_consistency and
> check_wal_consistency in a separate file, say xlogparams.c, concluding
> that the current code does nothing bad either even if it adds rmgr.h
> in the list of headers in guc.c.
> - Some comment blocks are longer than 72~80 characters.
> - Typos!
All the changes make perfect sense to me.

> With the patch for BRIN applied, I am able to get installcheck-world
> working with wal_consistency = all and a standby doing the consistency
> checks behind. Adding wal_consistency = all in PostgresNode.pm, the
> recovery tests are passing. This patch is switched as "Ready for
> Committer". Thanks for completing this effort begun 3 years ago!
Thanks to you for reviewing all the patches in so much detail. Amit, Robert
and Dilip also helped me a lot in developing the feature. Thanks to them
as well.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Robert Haas
Date:
On Thu, Nov 10, 2016 at 10:02 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
>> With the patch for BRIN applied, I am able to get installcheck-world
>> working with wal_consistency = all and a standby doing the consistency
>> checks behind. Adding wal_consistency = all in PostgresNode.pm, the
>> recovery tests are passing. This patch is switched as "Ready for
>> Committer". Thanks for completing this effort begun 3 years ago!
> Thanks to you for reviewing all the patches in so much detail. Amit, Robert
> and Dilip also helped me a lot in developing the feature. Thanks to them
> as well.

So, who should be credited as co-authors of this patch and in what
order, if and when it gets committed?  If X started this patch and
then Kuntal did a little more work on it, I would credit it as:

X and Kuntal Ghosh

If Kuntal did major work on it, though, then I would think of
something more like:

Kuntal Ghosh, based on an earlier patch from X

If he didn't use any of the old code but just the idea, then I would
do something like this:

Kuntal Ghosh, inspired by a previous patch from X

So, who are all of the people involved in the effort to produce this
patch, and what's the right way to attribute credit?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Fri, Nov 11, 2016 at 1:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> So, who are all of the people involved in the effort to produce this
> patch, and what's the right way to attribute credit?

The original idea was from Heikki as he has introduced the idea of
doing such checks if you look at the original thread. I added on top
of it a couple of things like the concept of page masking, and hacked
an early version of the versoin we have now
(https://www.postgresql.org/message-id/CAB7nPqR4vxdKijP+Du82vOcOnGMvutq-gfqiU2dsH4bsM77hYg@mail.gmail.com).
So it seems to me that marking Heikki as an author would be fair as
the original idea is from him. I don't mind being marked only as a
reviewer of the feature, or as someone from which has written an
earlier version of the patch, but I let that up to your judgement.
Kuntai is definitely an author.
-- 
Michael



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Fri, Nov 11, 2016 at 3:36 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Nov 11, 2016 at 1:36 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> So, who are all of the people involved in the effort to produce this
>> patch, and what's the right way to attribute credit?
>
> The original idea was from Heikki as he has introduced the idea of
> doing such checks if you look at the original thread. I added on top
> of it a couple of things like the concept of page masking, and hacked
> an early version of the versoin we have now
> (https://www.postgresql.org/message-id/CAB7nPqR4vxdKijP+Du82vOcOnGMvutq-gfqiU2dsH4bsM77hYg@mail.gmail.com).
> So it seems to me that marking Heikki as an author would be fair as
> the original idea is from him. I don't mind being marked only as a
> reviewer of the feature, or as someone from which has written an
> earlier version of the patch, but I let that up to your judgement.
> Kuntai is definitely an author.
Although lot of changes have been done later, but I've started with the patch
attached in the above thread. Hence, I feel the author of that patch should
also get the credit.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: WAL consistency check facility

From
Peter Eisentraut
Date:
On 11/9/16 11:55 PM, Michael Paquier wrote:
> +     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
> +      <term><varname>wal_consistency</varname> (<type>string</type>)
> +      <indexterm>
> +       <primary><varname>wal_consistency</> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        This parameter is used to check the consistency of WAL records, i.e,
> +        whether the WAL records are inserted and applied correctly. When
> +        <varname>wal_consistency</varname> is enabled for a WAL record, it
> +        stores a full-page image along with the record. When a full-page image
> +        arrives during redo, it compares against the current page to check whether
> +        both are consistent.
> +       </para>

Could we name this something like wal_consistency_checking?

Otherwise it sounds like you use this to select the amount of
consistency in the WAL (similar to, say, wal_level).

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Sun, Nov 13, 2016 at 12:06 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Could we name this something like wal_consistency_checking?
>
> Otherwise it sounds like you use this to select the amount of
> consistency in the WAL (similar to, say, wal_level).

Or wal_check? Or wal_consistency_check?
-- 
Michael



Re: WAL consistency check facility

From
Robert Haas
Date:
On Sat, Nov 12, 2016 at 10:06 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 11/9/16 11:55 PM, Michael Paquier wrote:
>> +     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
>> +      <term><varname>wal_consistency</varname> (<type>string</type>)
>> +      <indexterm>
>> +       <primary><varname>wal_consistency</> configuration parameter</primary>
>> +      </indexterm>
>> +      </term>
>> +      <listitem>
>> +       <para>
>> +        This parameter is used to check the consistency of WAL records, i.e,
>> +        whether the WAL records are inserted and applied correctly. When
>> +        <varname>wal_consistency</varname> is enabled for a WAL record, it
>> +        stores a full-page image along with the record. When a full-page image
>> +        arrives during redo, it compares against the current page to check whether
>> +        both are consistent.
>> +       </para>
>
> Could we name this something like wal_consistency_checking?
>
> Otherwise it sounds like you use this to select the amount of
> consistency in the WAL (similar to, say, wal_level).

+1.  I like that name.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WAL consistency check facility

From
Kuntal Ghosh
Date:
On Tue, Nov 15, 2016 at 7:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Nov 12, 2016 at 10:06 PM, Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com> wrote:
>> On 11/9/16 11:55 PM, Michael Paquier wrote:
>>> +     <varlistentry id="guc-wal-consistency" xreflabel="wal_consistency">
>>> +      <term><varname>wal_consistency</varname> (<type>string</type>)
>>> +      <indexterm>
>>> +       <primary><varname>wal_consistency</> configuration parameter</primary>
>>> +      </indexterm>
>>> +      </term>
>>> +      <listitem>
>>> +       <para>
>>> +        This parameter is used to check the consistency of WAL records, i.e,
>>> +        whether the WAL records are inserted and applied correctly. When
>>> +        <varname>wal_consistency</varname> is enabled for a WAL record, it
>>> +        stores a full-page image along with the record. When a full-page image
>>> +        arrives during redo, it compares against the current page to check whether
>>> +        both are consistent.
>>> +       </para>
>>
>> Could we name this something like wal_consistency_checking?
>>
>> Otherwise it sounds like you use this to select the amount of
>> consistency in the WAL (similar to, say, wal_level).
>
> +1.  I like that name.
I've modified the guc parameter name as wal_consistency_check (little
hesitant for a participle in suffix :) ). Also, updated the sgml and
variable name accordingly.
FYI, regression test will fail because of an inconsistency in brin
page. I've already submitted a patch for that. Following is the thread
for the same:

https://www.postgresql.org/message-id/flat/CAGz5QCJ%3D00UQjScSEFbV%3D0qO5ShTZB9WWz_Fm7%2BWd83zPs9Geg%40mail.gmail.com#CAGz5QCJ=00UQjScSEFbV=0qO5ShTZB9WWz_Fm7+Wd83zPs9Geg@mail.gmail.com
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: WAL consistency check facility

From
Michael Paquier
Date:
On Tue, Nov 15, 2016 at 7:50 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> I've modified the guc parameter name as wal_consistency_check (little
> hesitant for a participle in suffix :) ). Also, updated the sgml and
> variable name accordingly.

The changes look good to me.
-- 
Michael



Re: WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Nov 16, 2016 at 2:15 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Nov 15, 2016 at 7:50 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> I've modified the guc parameter name as wal_consistency_check (little
>> hesitant for a participle in suffix :) ). Also, updated the sgml and
>> variable name accordingly.
>
> The changes look good to me.

Moved to CF 2017-01, as no committers have showed up yet :(
-- 
Michael



Re: [HACKERS] WAL consistency check facility

From
Robert Haas
Date:
On Mon, Nov 28, 2016 at 11:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Moved to CF 2017-01, as no committers have showed up yet :(

Seeing no other volunteers, here I am.

On a first read-through of this patch -- I have not studied it in
detail yet -- this looks pretty good to me.  One concern is that this
patch adds a bit of code to XLogInsert(), which is a very hot piece of
code.  Conceivably, that might produce a regression even when this is
disabled; if so, we'd probably need to make it a build-time option.  I
hope that's not necessary, because I think it would be great to
compile this into the server by default, but we better make sure it's
not a problem.  A bulk load into an existing table might be a good
test case.

Aside from that, I think the biggest issue here is that the masking
functions are virtually free of comments, whereas I think they should
have extensive and detailed comments.  For example, in heap_mask, you
have this:

+            page_htup->t_infomask =
+                HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
+                HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

For something like this, you could write "We want to ignore
differences in hint bits, since they can be set by SetHintBits without
emitting WAL.  Force them all to be set so that we don't notice
discrepancies."  Actually, though, I think that you could be a bit
more nuanced here: HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID =
HEAP_XMIN_FROZEN, so maybe what you should do is clear
HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID but only clear the others if
one is set but not both.

Anyway, leaving that aside, I think every single change that gets
masked in every single masking routine needs a similar comment,
explaining why that change can happen on the master without also
happening on the standby and hopefully referring to the code that
makes that unlogged change.

I think wal_consistency_checking, as proposed by Peter, is better than
wal_consistency_check, as implemented.

Having StartupXLOG() call pfree() on the masking buffers is a waste of
code.  The process is going to exit anyway.

+                 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",

Primary error messages aren't capitalized.

+        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
+        {
+            /* Caller specified a bogus block_id. Do nothing. */
+            continue;
+        }

Why would the caller do something so dastardly?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Dec 21, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Nov 28, 2016 at 11:31 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Moved to CF 2017-01, as no committers have showed up yet :(
>
> Seeing no other volunteers, here I am.
>
Thanks Robert for looking into the patch.

> On a first read-through of this patch -- I have not studied it in
> detail yet -- this looks pretty good to me.  One concern is that this
> patch adds a bit of code to XLogInsert(), which is a very hot piece of
> code.  Conceivably, that might produce a regression even when this is
> disabled; if so, we'd probably need to make it a build-time option.  I
> hope that's not necessary, because I think it would be great to
> compile this into the server by default, but we better make sure it's
> not a problem.  A bulk load into an existing table might be a good
> test case.
>
I'll do this test and post the results.

> +        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
> +        {
> +            /* Caller specified a bogus block_id. Do nothing. */
> +            continue;
> +        }
>
> Why would the caller do something so dastardly?
>
Sorry, it's my bad. I've copied the code from somewhere else, but forgot
to modify the comment. It should be something like
/* block_id is not used. */

I'll modify the above along with other suggested changes.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Dec 21, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> On a first read-through of this patch -- I have not studied it in
> detail yet -- this looks pretty good to me.  One concern is that this
> patch adds a bit of code to XLogInsert(), which is a very hot piece of
> code.  Conceivably, that might produce a regression even when this is
> disabled; if so, we'd probably need to make it a build-time option.  I
> hope that's not necessary, because I think it would be great to
> compile this into the server by default, but we better make sure it's
> not a problem.  A bulk load into an existing table might be a good
> test case.
>
I've done some bulk load testing with 16,32,64 clients. I didn't
notice any regression
in the results.

> Aside from that, I think the biggest issue here is that the masking
> functions are virtually free of comments, whereas I think they should
> have extensive and detailed comments.  For example, in heap_mask, you
> have this:
>
> +            page_htup->t_infomask =
> +                HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
> +                HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
>
> For something like this, you could write "We want to ignore
> differences in hint bits, since they can be set by SetHintBits without
> emitting WAL.  Force them all to be set so that we don't notice
> discrepancies."  Actually, though, I think that you could be a bit
> more nuanced here: HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID =
> HEAP_XMIN_FROZEN, so maybe what you should do is clear
> HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID but only clear the others if
> one is set but not both.
>
I've modified it as follows:
+
+                       if (!HeapTupleHeaderXminFrozen(page_htup))
+                               page_htup->t_infomask |= HEAP_XACT_MASK;
+                       else
+                               page_htup->t_infomask |=
HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;

> Anyway, leaving that aside, I think every single change that gets
> masked in every single masking routine needs a similar comment,
> explaining why that change can happen on the master without also
> happening on the standby and hopefully referring to the code that
> makes that unlogged change.
>
I've added comments for all the masking routines.

> I think wal_consistency_checking, as proposed by Peter, is better than
> wal_consistency_check, as implemented.
>
Modified to wal_consistency_checking.

> Having StartupXLOG() call pfree() on the masking buffers is a waste of
> code.  The process is going to exit anyway.
>
> +                 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
>
Done.

> Primary error messages aren't capitalized.
>
> +        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
> +        {
> +            /* Caller specified a bogus block_id. Do nothing. */
> +            continue;
> +        }
>
> Why would the caller do something so dastardly?
>
Modified to following comment:
+                       /*
+                        * WAL record doesn't contain a block reference
+                        * with the given id. Do nothing.
+                        */

I've attached the patch with the modified changes. PFA.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Jan 5, 2017 at 2:54 PM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Wed, Dec 21, 2016 at 10:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> On a first read-through of this patch -- I have not studied it in
>> detail yet -- this looks pretty good to me.  One concern is that this
>> patch adds a bit of code to XLogInsert(), which is a very hot piece of
>> code.  Conceivably, that might produce a regression even when this is
>> disabled; if so, we'd probably need to make it a build-time option.  I
>> hope that's not necessary, because I think it would be great to
>> compile this into the server by default, but we better make sure it's
>> not a problem.  A bulk load into an existing table might be a good
>> test case.
>>
> I've done some bulk load testing with 16,32,64 clients. I didn't
> notice any regression
> in the results.
>
>> Aside from that, I think the biggest issue here is that the masking
>> functions are virtually free of comments, whereas I think they should
>> have extensive and detailed comments.  For example, in heap_mask, you
>> have this:
>>
>> +            page_htup->t_infomask =
>> +                HEAP_XMIN_COMMITTED | HEAP_XMIN_INVALID |
>> +                HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
>>
>> For something like this, you could write "We want to ignore
>> differences in hint bits, since they can be set by SetHintBits without
>> emitting WAL.  Force them all to be set so that we don't notice
>> discrepancies."  Actually, though, I think that you could be a bit
>> more nuanced here: HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID =
>> HEAP_XMIN_FROZEN, so maybe what you should do is clear
>> HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID but only clear the others if
>> one is set but not both.
>>
> I've modified it as follows:
> +
> +                       if (!HeapTupleHeaderXminFrozen(page_htup))
> +                               page_htup->t_infomask |= HEAP_XACT_MASK;
> +                       else
> +                               page_htup->t_infomask |=
> HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID;
>
>> Anyway, leaving that aside, I think every single change that gets
>> masked in every single masking routine needs a similar comment,
>> explaining why that change can happen on the master without also
>> happening on the standby and hopefully referring to the code that
>> makes that unlogged change.
>>
> I've added comments for all the masking routines.
>
>> I think wal_consistency_checking, as proposed by Peter, is better than
>> wal_consistency_check, as implemented.
>>
> Modified to wal_consistency_checking.
>
>> Having StartupXLOG() call pfree() on the masking buffers is a waste of
>> code.  The process is going to exit anyway.
>>
>> +                 "Inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u",
>>
> Done.
>
>> Primary error messages aren't capitalized.
>>
>> +        if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
>> +        {
>> +            /* Caller specified a bogus block_id. Do nothing. */
>> +            continue;
>> +        }
>>
>> Why would the caller do something so dastardly?
>>
> Modified to following comment:
> +                       /*
> +                        * WAL record doesn't contain a block reference
> +                        * with the given id. Do nothing.
> +                        */
>
> I've attached the patch with the modified changes. PFA.

Moved to CF 2017-03 with same status, "ready for committer".
-- 
Michael



Re: [HACKERS] WAL consistency check facility

From
Robert Haas
Date:
On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> I've attached the patch with the modified changes. PFA.

Can this patch check contrib/bloom?

+        /*
+         * Mask some line pointer bits, particularly those marked as
+         * used on a master and unused on a standby.
+         */

Comment doesn't explain why we need to do this.

+        /*
+         * For GIN_DELETED page, the page is initialized to empty.
+         * Hence mask everything.
+         */
+        if (opaque->flags & GIN_DELETED)
+            memset(page_norm, MASK_MARKER, BLCKSZ);
+        else
+            mask_unused_space(page_norm);

If the page is initialized to empty, why do we need to mask
anything/everything?  Anyway, it doesn't seem right to mask the
GinPageOpaque structure itself. Or at least it doesn't seem right to
mask the flags word.

+        /*
+         * For gist leaf pages, mask some line pointer bits, particularly
+         * those marked as used on a master and unused on a standby.
+         */

Comment doesn't explain why we need to do this.

+            if (!HeapTupleHeaderXminFrozen(page_htup))
+                page_htup->t_infomask |= HEAP_XACT_MASK;
+            else
+                page_htup->t_infomask |= HEAP_XMAX_COMMITTED |
HEAP_XMAX_INVALID;

Comment doesn't address this logic.  Also, the "else" case shouldn't
exist at all, I think.

+            /*
+             * For a speculative tuple, the content of t_ctid is conflicting
+             * between the backup page and current page. Hence, we set it
+             * to the current block number and current offset.
+             */

Why does it differ?  Is that a bug?

+    /*
+     * Mask everything on a DELETED page since it will be re-initialized
+     * during replay.
+     */
+    if ((maskopaq->btpo_flags & BTP_DELETED) != 0)
+    {
+        /* Mask Page Content */
+        memset(page_norm + SizeOfPageHeaderData, MASK_MARKER,
+               BLCKSZ - SizeOfPageHeaderData);
+
+        /* Mask pd_lower and pd_upper */
+        memset(&((PageHeader) page_norm)->pd_lower, MASK_MARKER,
+               sizeof(uint16));
+        memset(&((PageHeader) page_norm)->pd_upper, MASK_MARKER,
+               sizeof(uint16));

This isn't consistent with the GIN_DELETE case - it is more selective
about what it masks.  Probably that logic should be adapted to look
more like this.

+        /*
+         * Mask some line pointer bits, particularly those marked as
+         * used on a master and unused on a standby.
+         */

Comment (still) doesn't explain why we need to do this.

+    /*
+     * During replay of a btree page split, we don't set the BTP_SPLIT_END
+     * flag of the right sibling and initialize th cycle_id to 0 for the same
+     * page.
+     */

A reference to the name of the redo function might be helpful here
(and in some other places), unless it's just ${AMNAME}_redo.  "th" is
a typo for "the".

+                        appendStringInfoString(buf, " (full page
image, apply)");
+                    else
+                        appendStringInfoString(buf, " (full page image)");

How about "(full page image)" and "(full page image, for WAL verification)"?

Similarly in XLogDumpDisplayRecord, I think we should assume that
"FPW" means what it has always meant, and leave that output alone.
Instead, distinguish the WAL-consistency-checking case when it
happens, e.g. "(FPW for consistency check)".

+checkConsistency(XLogReaderState *record)

How about CheckXLogConsistency()?

+         * If needs_backup is true or wal consistency check is enabled for

...or WAL checking is enabled...

+             * If WAL consistency is enabled for the resource manager of

If WAL consistency checking is enabled...

+ * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag

with the BKPIMAGE_APPLY flag

- * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
- * with all-zeroes pages up to the referenced block number.  In
- * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
+ * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
+ * is not set for the backup block, the relation is extended with all-zeroes
+ * pages up to the referenced block number.

OK, I'm puzzled by this.  Surely we don't want the WAL consistency
checking facility to cause the relation to be extended like this.  And
I don't see why it would, because the WAL consistency checking happens
after the page changes have already been applied.  I wonder if this
hunk is in error and should be dropped.

+    PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
+    phdr->pd_prune_xid = PG_UINT32_MAX;
+    phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
+    phdr->pd_flags |= PD_ALL_VISIBLE;
+#define MASK_MARKER        0xFF
(and many others)

Why do we mask by setting bits rather than clearing bits?  My
intuition would have been to zero things we want to mask, rather than
setting them to one.

+                {
+                    newwalconsistency[i] = true;
+                }

Superfluous braces.

+    /*
+     * Leave if no masking functions defined, this is possible in the case
+     * resource managers generating just full page writes, comparing an
+     * image to itself has no meaning in those cases.
+     */
+    if (RmgrTable[rmid].rm_mask == NULL)
+        return;

...and also...

+            /*
+             * This feature is enabled only for the resource managers where
+             * a masking function is defined.
+             */
+            for (i = 0; i <= RM_MAX_ID; i++)
+            {
+                if (RmgrTable[i].rm_mask != NULL)

Why do we assume that the feature is only enabled for RMs that have a
mask function?  Why not instead assume that if there's no masking
function, no masking is required?

+        /* Definitely not an individual resource manager. Check for 'all'. */
+        if (pg_strcasecmp(tok, "all") == 0)

It seems like it might be cleaner to check for "all" first, and then
check for individual RMs afterward.

+    /*
+     * Parameter should contain either 'all' or a combination of resource
+     * managers.
+     */
+    if (isAll && isRmgrId)
+    {
+        GUC_check_errdetail("Invalid value combination");
+        return false;
+    }

That error message is not very clear, and I don't see why we even need
to check this.  If someone sets wal_consistency_checking = hash, all,
let's just set it for all and the fact that hash is also set won't
matter to anything.

+    void        (*rm_mask) (char *page, BlockNumber blkno);

Could the page be passed as type "Page" rather than a "char *" to make
things more convenient for the masking functions?  If not, could those
functions at least do something like "Page page = (Page) pagebytes;"
rather than "Page page_norm = (Page) page;"?

+        /*
+         * Read the contents from the current buffer and store it in a
+         * temporary page.
+         */
+        buf = XLogReadBufferExtended(rnode, forknum, blkno,
+                                          RBM_NORMAL);
+        if (!BufferIsValid(buf))
+            continue;
+
+        new_page = BufferGetPage(buf);
+
+        /*
+         * Read the contents from the backup copy, stored in WAL record
+         * and store it in a temporary page. There is not need to allocate
+         * a new page here, a local buffer is fine to hold its contents and
+         * a mask can be directly applied on it.
+         */
+        if (!RestoreBlockImage(record, block_id, old_page_masked))
+            elog(ERROR, "failed to restore block image");
+
+        /*
+         * Take a copy of the new page where WAL has been applied to have
+         * a comparison base before masking it...
+         */
+        memcpy(new_page_masked, new_page, BLCKSZ);
+
+        /* No need for this page anymore now that a copy is in */
+        ReleaseBuffer(buf);

The order of operations is strange here.  We read the "new" page,
holding the pin (but no lock?).  Then we restore the block image into
old_page_masked.  Now we copy the new page and release the pin.  It
would be better, ISTM, to rearrange that so that we finish with the
new page and release the pin before dealing with the old page.  Also,
I think we need to actually lock the buffer before copying it.  Maybe
that's not strictly necessary since this is all happening on the
standby, but it seems like a bad idea to set the precedent that you
can read a page without taking the content lock.

I think the "new" and "old" page terminology is kinda weird too.
Maybe we should call them the "replay image" and the "master image" or
something like that.   A few more comments wouldn't hurt either.

+     * Also mask the all-visible flag.
+     *
+     * XXX: It is unfortunate that we have to do this. If the flag is set
+     * incorrectly, that's serious, and we would like to catch it. If the flag
+     * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
+     * records don't currently set the flag, even though it is set in the
+     * master, so we must silence failures that that causes.
+     */
+    phdr->pd_flags |= PD_ALL_VISIBLE;

I'm puzzled by the reference to HEAP_CLEAN.  The thing that might set
the all-visible bit is XLOG_HEAP2_VISIBLE, not XLOG_HEAP2_CLEAN.
Unless I'm missing something, there's no situation in which
XLOG_HEAP2_CLEAN might be associated with setting PD_ALL_VISIBLE.
Also, XLOG_HEAP2_VISIBLE records do SOMETIMES set the bit, just not
always.  And there's a good reason for that, which is explained in
this comment:
        * We don't bump the LSN of the heap page when setting the visibility        * map bit (unless checksums or
wal_hint_bitsis enabled, in which        * case we must), because that would generate an unworkable volume of        *
full-pagewrites.  This exposes us to torn page hazards, but since        * we're not inspecting the existing page
contentsin any way, we        * don't care.        *        * However, all operations that clear the visibility map bit
*do*bump        * the LSN, and those operations will only be replayed if the XLOG LSN        * follows the page LSN.
Thus,if the page LSN has advanced past our        * XLOG record's LSN, we mustn't mark the page all-visible, because
   * the subsequent update won't be replayed to clear the flag.
 

So I think this comment needs to be rewritten with a bit more nuance.

+extern void mask_unused_space(Page page);
+#endif

Missing newline.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Feb 1, 2017 at 1:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> I've attached the patch with the modified changes. PFA.
>
> Can this patch check contrib/bloom?

Only full pages are applied at redo by the generic WAL facility. So
you would finish by comparing a page with itself in generic_redo.

> +            /*
> +             * For a speculative tuple, the content of t_ctid is conflicting
> +             * between the backup page and current page. Hence, we set it
> +             * to the current block number and current offset.
> +             */
>
> Why does it differ?  Is that a bug?

This has been discussed twice in this thread, once by me, once by Alvaro:
https://www.postgresql.org/message-id/CAM3SWZQC8nUgp8SjKDY3d74VLpdf9puHc7-n3zf4xcr_bghPzg%40mail.gmail.com
https://www.postgresql.org/message-id/CAB7nPqQTLGvn_XePjS07kZMMw46kS6S7LfsTocK%2BgLpTN0bcZw%40mail.gmail.com

> +    /*
> +     * Leave if no masking functions defined, this is possible in the case
> +     * resource managers generating just full page writes, comparing an
> +     * image to itself has no meaning in those cases.
> +     */
> +    if (RmgrTable[rmid].rm_mask == NULL)
> +        return;
>
> ...and also...
>
> +            /*
> +             * This feature is enabled only for the resource managers where
> +             * a masking function is defined.
> +             */
> +            for (i = 0; i <= RM_MAX_ID; i++)
> +            {
> +                if (RmgrTable[i].rm_mask != NULL)
>
> Why do we assume that the feature is only enabled for RMs that have a
> mask function?  Why not instead assume that if there's no masking
> function, no masking is required?

Not all RMs work on full pages. Tracking in smgr.h the list of RMs
that are no-ops when masking things is easier than having empty
routines declared all over the code base. Don't you think so>

> +    void        (*rm_mask) (char *page, BlockNumber blkno);
>
> Could the page be passed as type "Page" rather than a "char *" to make
> things more convenient for the masking functions?  If not, could those
> functions at least do something like "Page page = (Page) pagebytes;"
> rather than "Page page_norm = (Page) page;"?

xlog_internal.h is used as well by frontends, this makes the header
dependencies cleaner. (I have looked at using Page when hacking this
stuff, but the header dependencies are not worth it, I don't recall
all the details though this was a couple of months back).
-- 
Michael



Re: [HACKERS] WAL consistency check facility

From
Kuntal Ghosh
Date:
On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> I've attached the patch with the modified changes. PFA.
Thanks Robert for taking your time for the review.  I'll update the
patch with the changes suggested by you.


-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] WAL consistency check facility

From
Kuntal Ghosh
Date:
On Wed, Feb 1, 2017 at 8:01 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 1:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:

>> +    /*
>> +     * Leave if no masking functions defined, this is possible in the case
>> +     * resource managers generating just full page writes, comparing an
>> +     * image to itself has no meaning in those cases.
>> +     */
>> +    if (RmgrTable[rmid].rm_mask == NULL)
>> +        return;
>>
>> ...and also...
>>
>> +            /*
>> +             * This feature is enabled only for the resource managers where
>> +             * a masking function is defined.
>> +             */
>> +            for (i = 0; i <= RM_MAX_ID; i++)
>> +            {
>> +                if (RmgrTable[i].rm_mask != NULL)
>>
>> Why do we assume that the feature is only enabled for RMs that have a
>> mask function?  Why not instead assume that if there's no masking
>> function, no masking is required?
>
> Not all RMs work on full pages. Tracking in smgr.h the list of RMs
> that are no-ops when masking things is easier than having empty
> routines declared all over the code base. Don't you think so>
>
Robert's suggestion surely makes the approach more general. But,  the
existing approach makes it easier to decide the RM's for which we
support the consistency check facility. Surely, we can use a list to
track the RMs which should (/not) be masked. But, I think we always
have to mask the lsn of the pages at least. Isn't it?

>> +    void        (*rm_mask) (char *page, BlockNumber blkno);
>>
>> Could the page be passed as type "Page" rather than a "char *" to make
>> things more convenient for the masking functions?  If not, could those
>> functions at least do something like "Page page = (Page) pagebytes;"
>> rather than "Page page_norm = (Page) page;"?
>
> xlog_internal.h is used as well by frontends, this makes the header
> dependencies cleaner. (I have looked at using Page when hacking this
> stuff, but the header dependencies are not worth it, I don't recall
> all the details though this was a couple of months back).
+ 1



-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] WAL consistency check facility

From
Robert Haas
Date:
On Tue, Jan 31, 2017 at 9:31 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Feb 1, 2017 at 1:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
>> <kuntalghosh.2007@gmail.com> wrote:
>>> I've attached the patch with the modified changes. PFA.
>>
>> Can this patch check contrib/bloom?
>
> Only full pages are applied at redo by the generic WAL facility. So
> you would finish by comparing a page with itself in generic_redo.

generic_redo is more complicated than just restoring an FPI.  I admit
that generic_redo *probably* has no bugs, but I don't see why it would
hurt to allow it to be checked.  It's not IMPOSSIBLE that restoring
the page delta could go wrong somehow.

>> +            /*
>> +             * For a speculative tuple, the content of t_ctid is conflicting
>> +             * between the backup page and current page. Hence, we set it
>> +             * to the current block number and current offset.
>> +             */
>>
>> Why does it differ?  Is that a bug?
>
> This has been discussed twice in this thread, once by me, once by Alvaro:
> https://www.postgresql.org/message-id/CAM3SWZQC8nUgp8SjKDY3d74VLpdf9puHc7-n3zf4xcr_bghPzg%40mail.gmail.com
> https://www.postgresql.org/message-id/CAB7nPqQTLGvn_XePjS07kZMMw46kS6S7LfsTocK%2BgLpTN0bcZw%40mail.gmail.com

Sorry, I missed/forgot about that.  I think this is evidence that the
comment isn't really good enough.  It says "hey, this might differ"
... with no real explanation of why it might differ or why that's OK.
If there were an explanation there, I wouldn't have flagged it.

>> +    /*
>> +     * Leave if no masking functions defined, this is possible in the case
>> +     * resource managers generating just full page writes, comparing an
>> +     * image to itself has no meaning in those cases.
>> +     */
>> +    if (RmgrTable[rmid].rm_mask == NULL)
>> +        return;
>>
>> ...and also...
>>
>> +            /*
>> +             * This feature is enabled only for the resource managers where
>> +             * a masking function is defined.
>> +             */
>> +            for (i = 0; i <= RM_MAX_ID; i++)
>> +            {
>> +                if (RmgrTable[i].rm_mask != NULL)
>>
>> Why do we assume that the feature is only enabled for RMs that have a
>> mask function?  Why not instead assume that if there's no masking
>> function, no masking is required?
>
> Not all RMs work on full pages. Tracking in smgr.h the list of RMs
> that are no-ops when masking things is easier than having empty
> routines declared all over the code base. Don't you think so>

Sure, but I don't think that's what the code is doing.  If an RM is a
no-op when masking things, it must define an empty function.  If it
defines no function, checking is disabled.  I think we want to be able
to check any RM.  No?

>> +    void        (*rm_mask) (char *page, BlockNumber blkno);
>>
>> Could the page be passed as type "Page" rather than a "char *" to make
>> things more convenient for the masking functions?  If not, could those
>> functions at least do something like "Page page = (Page) pagebytes;"
>> rather than "Page page_norm = (Page) page;"?
>
> xlog_internal.h is used as well by frontends, this makes the header
> dependencies cleaner. (I have looked at using Page when hacking this
> stuff, but the header dependencies are not worth it, I don't recall
> all the details though this was a couple of months back).

OK.  I still think page_norm is a lame variable name.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL consistency check facility

From
Amit Kapila
Date:
On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> +            if (!HeapTupleHeaderXminFrozen(page_htup))
> +                page_htup->t_infomask |= HEAP_XACT_MASK;
> +            else
> +                page_htup->t_infomask |= HEAP_XMAX_COMMITTED |
> HEAP_XMAX_INVALID;
>
> Comment doesn't address this logic.  Also, the "else" case shouldn't
> exist at all, I think.
>

In the *if* check, it just checks frozen status of xmin, so I think
you need to mask xmax related bits in else check.  Can you explain
what makes you think that the else case shouldn't exist?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] WAL consistency check facility

From
Robert Haas
Date:
On Tue, Feb 7, 2017 at 6:32 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> +            if (!HeapTupleHeaderXminFrozen(page_htup))
>> +                page_htup->t_infomask |= HEAP_XACT_MASK;
>> +            else
>> +                page_htup->t_infomask |= HEAP_XMAX_COMMITTED |
>> HEAP_XMAX_INVALID;
>>
>> Comment doesn't address this logic.  Also, the "else" case shouldn't
>> exist at all, I think.
>>
>
> In the *if* check, it just checks frozen status of xmin, so I think
> you need to mask xmax related bits in else check.  Can you explain
> what makes you think that the else case shouldn't exist?

Oh, you're right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL consistency check facility

From
Kuntal Ghosh
Date:
On Tue, Jan 31, 2017 at 9:36 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jan 5, 2017 at 12:54 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
>> I've attached the patch with the modified changes. PFA.
>
> Can this patch check contrib/bloom?
>
Added support for generic resource manager type.

> +        /*
> +         * Mask some line pointer bits, particularly those marked as
> +         * used on a master and unused on a standby.
> +         */
>
> Comment doesn't explain why we need to do this.
>
Added comments.

> +        /*
> +         * For GIN_DELETED page, the page is initialized to empty.
> +         * Hence mask everything.
> +         */
> +        if (opaque->flags & GIN_DELETED)
> +            memset(page_norm, MASK_MARKER, BLCKSZ);
> +        else
> +            mask_unused_space(page_norm);
>
> If the page is initialized to empty, why do we need to mask
> anything/everything?  Anyway, it doesn't seem right to mask the
> GinPageOpaque structure itself. Or at least it doesn't seem right to
> mask the flags word.
>
Modified accordingly.

>
> +            if (!HeapTupleHeaderXminFrozen(page_htup))
> +                page_htup->t_infomask |= HEAP_XACT_MASK;
> +            else
> +                page_htup->t_infomask |= HEAP_XMAX_COMMITTED |
> HEAP_XMAX_INVALID;
>
> Comment doesn't address this logic.  Also, the "else" case shouldn't
> exist at all, I think.
Added comments. I think that "Else" part is needed for xmax.

>
> +            /*
> +             * For a speculative tuple, the content of t_ctid is conflicting
> +             * between the backup page and current page. Hence, we set it
> +             * to the current block number and current offset.
> +             */
>
> Why does it differ?  Is that a bug?
>
Added comments.

>
> +    /*
> +     * During replay of a btree page split, we don't set the BTP_SPLIT_END
> +     * flag of the right sibling and initialize th cycle_id to 0 for the same
> +     * page.
> +     */
>
> A reference to the name of the redo function might be helpful here
> (and in some other places), unless it's just ${AMNAME}_redo.  "th" is
> a typo for "the".
Corrected.

> +                        appendStringInfoString(buf, " (full page
> image, apply)");
> +                    else
> +                        appendStringInfoString(buf, " (full page image)");
>
> How about "(full page image)" and "(full page image, for WAL verification)"?
>
> Similarly in XLogDumpDisplayRecord, I think we should assume that
> "FPW" means what it has always meant, and leave that output alone.
> Instead, distinguish the WAL-consistency-checking case when it
> happens, e.g. "(FPW for consistency check)".
>
Corrected.

> +checkConsistency(XLogReaderState *record)
>
> How about CheckXLogConsistency()?
>
> +         * If needs_backup is true or wal consistency check is enabled for
>
> ...or WAL checking is enabled...
>
> +             * If WAL consistency is enabled for the resource manager of
>
> If WAL consistency checking is enabled...
>
> + * Note: when a backup block is available in XLOG with BKPIMAGE_APPLY flag
>
> with the BKPIMAGE_APPLY flag
Modified accordingly.

>
> - * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
> - * with all-zeroes pages up to the referenced block number.  In
> - * RBM_ZERO_AND_LOCK and RBM_ZERO_AND_CLEANUP_LOCK modes, the return value
> + * In RBM_ZERO_* modes, if the page doesn't exist or BKPIMAGE_APPLY flag
> + * is not set for the backup block, the relation is extended with all-zeroes
> + * pages up to the referenced block number.
>
> OK, I'm puzzled by this.  Surely we don't want the WAL consistency
> checking facility to cause the relation to be extended like this.  And
> I don't see why it would, because the WAL consistency checking happens
> after the page changes have already been applied.  I wonder if this
> hunk is in error and should be dropped.
>
Dropped comments.

> +    PageXLogRecPtrSet(phdr->pd_lsn, PG_UINT64_MAX);
> +    phdr->pd_prune_xid = PG_UINT32_MAX;
> +    phdr->pd_flags |= PD_PAGE_FULL | PD_HAS_FREE_LINES;
> +    phdr->pd_flags |= PD_ALL_VISIBLE;
> +#define MASK_MARKER        0xFF
> (and many others)
>
> Why do we mask by setting bits rather than clearing bits?  My
> intuition would have been to zero things we want to mask, rather than
> setting them to one.
>
Modified accordingly so that we can zero things during masking instead
of setting them to one.

> +                {
> +                    newwalconsistency[i] = true;
> +                }
>
> Superfluous braces.
>
Corrected.

> +    /*
> +     * Leave if no masking functions defined, this is possible in the case
> +     * resource managers generating just full page writes, comparing an
> +     * image to itself has no meaning in those cases.
> +     */
> +    if (RmgrTable[rmid].rm_mask == NULL)
> +        return;
>
> ...and also...
>
> +            /*
> +             * This feature is enabled only for the resource managers where
> +             * a masking function is defined.
> +             */
> +            for (i = 0; i <= RM_MAX_ID; i++)
> +            {
> +                if (RmgrTable[i].rm_mask != NULL)
>
> Why do we assume that the feature is only enabled for RMs that have a
> mask function?  Why not instead assume that if there's no masking
> function, no masking is required?
I've introduced a function in rmgr.c, named
consistencyCheck_is_enabled, which returns true if
wal_consistency_checking is enabled for a resource manager. It does
not have any dependency on the masking function. If a masking function
is defined, then we mask the page before consistency check. However,
I'm not sure whether rmgr.c is the right place to define the function
consistencyCheck_is_enabled.

>
> +        /* Definitely not an individual resource manager. Check for 'all'. */
> +        if (pg_strcasecmp(tok, "all") == 0)
>
> It seems like it might be cleaner to check for "all" first, and then
> check for individual RMs afterward.
>
Done.

> +    /*
> +     * Parameter should contain either 'all' or a combination of resource
> +     * managers.
> +     */
> +    if (isAll && isRmgrId)
> +    {
> +        GUC_check_errdetail("Invalid value combination");
> +        return false;
> +    }
>
> That error message is not very clear, and I don't see why we even need
> to check this.  If someone sets wal_consistency_checking = hash, all,
> let's just set it for all and the fact that hash is also set won't
> matter to anything.
>
Modified accordingly.

> +    void        (*rm_mask) (char *page, BlockNumber blkno);
>
> Could the page be passed as type "Page" rather than a "char *" to make
> things more convenient for the masking functions?  If not, could those
> functions at least do something like "Page page = (Page) pagebytes;"
> rather than "Page page_norm = (Page) page;"?
>
Modified it as "Page tempPage = (Page) page;"

> +        /*
> +         * Read the contents from the current buffer and store it in a
> +         * temporary page.
> +         */
> +        buf = XLogReadBufferExtended(rnode, forknum, blkno,
> +                                          RBM_NORMAL);
> +        if (!BufferIsValid(buf))
> +            continue;
> +
> +        new_page = BufferGetPage(buf);
> +
> +        /*
> +         * Read the contents from the backup copy, stored in WAL record
> +         * and store it in a temporary page. There is not need to allocate
> +         * a new page here, a local buffer is fine to hold its contents and
> +         * a mask can be directly applied on it.
> +         */
> +        if (!RestoreBlockImage(record, block_id, old_page_masked))
> +            elog(ERROR, "failed to restore block image");
> +
> +        /*
> +         * Take a copy of the new page where WAL has been applied to have
> +         * a comparison base before masking it...
> +         */
> +        memcpy(new_page_masked, new_page, BLCKSZ);
> +
> +        /* No need for this page anymore now that a copy is in */
> +        ReleaseBuffer(buf);
>
> The order of operations is strange here.  We read the "new" page,
> holding the pin (but no lock?).  Then we restore the block image into
> old_page_masked.  Now we copy the new page and release the pin.  It
> would be better, ISTM, to rearrange that so that we finish with the
> new page and release the pin before dealing with the old page.  Also,
> I think we need to actually lock the buffer before copying it.  Maybe
> that's not strictly necessary since this is all happening on the
> standby, but it seems like a bad idea to set the precedent that you
> can read a page without taking the content lock.
>
Modified accordingly.

> I think the "new" and "old" page terminology is kinda weird too.
> Maybe we should call them the "replay image" and the "master image" or
> something like that.   A few more comments wouldn't hurt either.
>
Done.

> +     * Also mask the all-visible flag.
> +     *
> +     * XXX: It is unfortunate that we have to do this. If the flag is set
> +     * incorrectly, that's serious, and we would like to catch it. If the flag
> +     * is cleared incorrectly, that's serious too. But redo of HEAP_CLEAN
> +     * records don't currently set the flag, even though it is set in the
> +     * master, so we must silence failures that that causes.
> +     */
> +    phdr->pd_flags |= PD_ALL_VISIBLE;
>
> I'm puzzled by the reference to HEAP_CLEAN.  The thing that might set
> the all-visible bit is XLOG_HEAP2_VISIBLE, not XLOG_HEAP2_CLEAN.
> Unless I'm missing something, there's no situation in which
> XLOG_HEAP2_CLEAN might be associated with setting PD_ALL_VISIBLE.
> Also, XLOG_HEAP2_VISIBLE records do SOMETIMES set the bit, just not
> always.  And there's a good reason for that, which is explained in
> this comment:
>
>          * We don't bump the LSN of the heap page when setting the visibility
>          * map bit (unless checksums or wal_hint_bits is enabled, in which
>          * case we must), because that would generate an unworkable volume of
>          * full-page writes.  This exposes us to torn page hazards, but since
>          * we're not inspecting the existing page contents in any way, we
>          * don't care.
>          *
>          * However, all operations that clear the visibility map bit *do* bump
>          * the LSN, and those operations will only be replayed if the XLOG LSN
>          * follows the page LSN.  Thus, if the page LSN has advanced past our
>          * XLOG record's LSN, we mustn't mark the page all-visible, because
>          * the subsequent update won't be replayed to clear the flag.
>
> So I think this comment needs to be rewritten with a bit more nuance.
>
Corrected.

> +extern void mask_unused_space(Page page);
> +#endif
>
> Missing newline.
>
Done.

Thank you Robert for the review. Please find the updated patch in the
attachment.

Thanks to Amit Kapila and Dilip Kumar for their suggestions in offline
discussions.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] WAL consistency check facility

From
Robert Haas
Date:
On Wed, Feb 8, 2017 at 1:25 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> Thank you Robert for the review. Please find the updated patch in the
> attachment.

I have committed this patch after fairly extensive revisions:

* Rewrote the documentation to give some idea what the underlying
mechanism of operation of the feature is, so that users who choose to
enable this will hopefully have some understanding of what they've
turned on.
* Renamed "char *page" arguments to "char *pagedata" and "Page page",
because tempPage doesn't seem to be to be any better a name than
page_norm.
* Moved bufmask.c to src/backend/access/common, because there's no
code in src/backend/storage/buffer that knows anything about the
format of pages; that is the job of AMs, hence src/backend/access.
* Improved some comments in bufmask.c
* Removed consistencyCheck_is_enabled in favor of determining which
RMs support masking by the presence of absence of an rm_mask function.
* Removed assertion in checkXLogConsistency that consistency checking
is enabled for the RM; that's trivially false if
wal_consistency_checking is not the same on the master and the
standby.  (Note that quite apart from the issue of whether this
function should exist at all, adding it to a header file after the
closing #endif guard is certainly not right.)
* Changed checkXLogConsistency to use RBM_NORMAL_NO_LOG instead of
RBM_NORMAL.  I'm not sure if there are any cases where this makes a
difference, but it seems safer.
* Changed checkXLogConsistency to skip pages whose LSN is newer than
that of the record.  Without this, if you shut down recovery and
restart it, it complains of inconsistent pages and dies.  (I'm not
sure this is the only scenario that needs to be covered; it would be
smart to do more testing of restarting the standby.)
* Made wal_consistency_checking a developer option instead of a WAL
option.  Even though it CAN be used in production, we don't
particularly want to encourage that; enabling WAL consistency checking
has a big performance cost and makes your system more fragile not less
-- a WAL consistency failure causes your standby to die a hard death.
(Maybe there should be a way to suppress consistency checking on the
standby -- but I think not just by requiring wal_consistency_checking
on both ends.  Or maybe we should just downgrade the FATAL to WARNING;
blowing up the standby irrevocably seems like poor behavior.)
* Coding style improvement in check_wal_consistency_checking.
* Removed commas in messages added to pg_xlogdump; those didn't look
good to me, on further review.
* Comment improvements in xlog_internal.h and xlogreader.h

I also bumped XLOG_PAGE_MAGIC (which is normally done by the
committer, not the patch author, so I wasn't expecting that to be in
the patch as submitted).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL consistency check facility

From
Kuntal Ghosh
Date:
On Thu, Feb 9, 2017 at 2:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 1:25 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> Thank you Robert for the review. Please find the updated patch in the
>> attachment.
>
> I have committed this patch after fairly extensive revisions:
>
Thank you, Robert, for the above corrections and commit. Thanks to
Michael Paquier, Peter Eisentraut, Amit Kapila, Álvaro Herrera, and
Simon Riggs for taking their time to complete the patch. It was a
great learning experience for me.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] WAL consistency check facility

From
Michael Paquier
Date:
On Thu, Feb 9, 2017 at 5:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 8, 2017 at 1:25 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>> Thank you Robert for the review. Please find the updated patch in the
>> attachment.
>
> I have committed this patch after fairly extensive revisions:

Cool. I had finally a look at what has been committed in a507b869.
Running regression tests with all RMGRs enabled, a single installcheck
generates 7GB of WAL. Woah.

> * Rewrote the documentation to give some idea what the underlying
> mechanism of operation of the feature is, so that users who choose to
> enable this will hopefully have some understanding of what they've
> turned on.

Thanks, those look good to me.

> * Renamed "char *page" arguments to "char *pagedata" and "Page page",
> because tempPage doesn't seem to be to be any better a name than
> page_norm.

> * Removed assertion in checkXLogConsistency that consistency checking
> is enabled for the RM; that's trivially false if
> wal_consistency_checking is not the same on the master and the
> standby.  (Note that quite apart from the issue of whether this
> function should exist at all, adding it to a header file after the
> closing #endif guard is certainly not right.)

I recall doing those two things the same way as in the commit. Not
sure at which point they have been re-introduced.

> * Changed checkXLogConsistency to skip pages whose LSN is newer than
> that of the record.  Without this, if you shut down recovery and
> restart it, it complains of inconsistent pages and dies.  (I'm not
> sure this is the only scenario that needs to be covered; it would be
> smart to do more testing of restarting the standby.)

Good point.

> -- a WAL consistency failure causes your standby to die a hard death.
> (Maybe there should be a way to suppress consistency checking on the
> standby -- but I think not just by requiring wal_consistency_checking
> on both ends.  Or maybe we should just downgrade the FATAL to WARNING;
> blowing up the standby irrevocably seems like poor behavior.)

Having a FATAL is useful for buildfarm members, that would show up in
red. Having a switch to generate a warning would be useful for live
deployments I agree. Now I think that we need as well two things:
- A recovery test to run regression tests with a standby behind.
- Extend the TAP tests so as it is possible to fill in postgresql.conf
with custom variables.
- have the buildfarm client run recovery tests!
I am fine to write those patches.

> I also bumped XLOG_PAGE_MAGIC (which is normally done by the
> committer, not the patch author, so I wasn't expecting that to be in
> the patch as submitted).

Here are a couple of things I have noticed while looking at the code.

+ * Portions Copyright (c) 2016, PostgreSQL Global Development Group
s/2016/2017/ in bufmask.c and bufmask.h.

+       if (ItemIdIsNormal(iid))
+       {
+
+           HeapTupleHeader page_htup = (HeapTupleHeader) page_item;
Unnecessary newline here.

+        * Read the contents from the backup copy, stored in WAL record and
+        * store it in a temporary page. There is not need to allocate a new
+        * page here, a local buffer is fine to hold its contents and a mask
+        * can be directly applied on it.
s/not need/no need/.

In checkXLogConsistency(), FPWs that have the flag BKPIMAGE_APPLY set
will still be checked, resulting in a FPW being compared to itself. I
think that those had better be bypassed.

Please find attached a patch with those fixes.
-- 
Michael

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] WAL consistency check facility

From
Robert Haas
Date:
On Thu, Feb 9, 2017 at 8:17 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Please find attached a patch with those fixes.

Committed, but I changed the copyright dates to 2016-2017 rather than
just 2017 since surely some of the code was originally written before
2017.  Even that might not really be going back far enough, but it
doesn't matter too much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Feb 15, 2017 at 2:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 8:17 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Please find attached a patch with those fixes.
>
> Committed, but I changed the copyright dates to 2016-2017 rather than
> just 2017 since surely some of the code was originally written before
> 2017.  Even that might not really be going back far enough, but it
> doesn't matter too much.

Just for curiosity: does the moment when the code has been written or
committed counts? It's no big deal seeing how liberal the Postgres
license is, but this makes me wonder...
-- 
Michael



Re: [HACKERS] WAL consistency check facility

From
Robert Haas
Date:
On Tue, Feb 14, 2017 at 5:16 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Feb 15, 2017 at 2:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Feb 9, 2017 at 8:17 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Please find attached a patch with those fixes.
>>
>> Committed, but I changed the copyright dates to 2016-2017 rather than
>> just 2017 since surely some of the code was originally written before
>> 2017.  Even that might not really be going back far enough, but it
>> doesn't matter too much.
>
> Just for curiosity: does the moment when the code has been written or
> committed counts? It's no big deal seeing how liberal the Postgres
> license is, but this makes me wonder...

IANAL, but I think if you ask one, he or she will tell you that what
matters is the date the work was created.  In the case of code, that
means when the code was written.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL consistency check facility

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Feb 14, 2017 at 5:16 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Just for curiosity: does the moment when the code has been written or
>> committed counts? It's no big deal seeing how liberal the Postgres
>> license is, but this makes me wonder...

> IANAL, but I think if you ask one, he or she will tell you that what
> matters is the date the work was created.  In the case of code, that
> means when the code was written.

FWIW, my own habit when creating new PG files is generally to write
* Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group* Portions Copyright (c) 1994, Regents of the
Universityof California
 

even if it's "all new" code.  The main reason being that it's hardly ever
the case that you didn't copy-and-paste some amount of stuff out of a
pre-existing file, and trying to sort out how much of what originated
exactly when is an unrewarding exercise.  Even if it is basically all
new code, this feels like giving an appropriate amount of credit to
Those Who Went Before Us.
        regards, tom lane



Re: [HACKERS] WAL consistency check facility

From
Robert Haas
Date:
On Tue, Feb 14, 2017 at 7:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Feb 14, 2017 at 5:16 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Just for curiosity: does the moment when the code has been written or
>>> committed counts? It's no big deal seeing how liberal the Postgres
>>> license is, but this makes me wonder...
>
>> IANAL, but I think if you ask one, he or she will tell you that what
>> matters is the date the work was created.  In the case of code, that
>> means when the code was written.
>
> FWIW, my own habit when creating new PG files is generally to write
>
>  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
>  * Portions Copyright (c) 1994, Regents of the University of California
>
> even if it's "all new" code.  The main reason being that it's hardly ever
> the case that you didn't copy-and-paste some amount of stuff out of a
> pre-existing file, and trying to sort out how much of what originated
> exactly when is an unrewarding exercise.  Even if it is basically all
> new code, this feels like giving an appropriate amount of credit to
> Those Who Went Before Us.

Right.  I tend to do the same, and wonder if we shouldn't make that a
general practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WAL consistency check facility

From
Michael Paquier
Date:
On Wed, Feb 15, 2017 at 11:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 14, 2017 at 7:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> FWIW, my own habit when creating new PG files is generally to write
>>
>>  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
>>  * Portions Copyright (c) 1994, Regents of the University of California
>>
>> even if it's "all new" code.  The main reason being that it's hardly ever
>> the case that you didn't copy-and-paste some amount of stuff out of a
>> pre-existing file, and trying to sort out how much of what originated
>> exactly when is an unrewarding exercise.  Even if it is basically all
>> new code, this feels like giving an appropriate amount of credit to
>> Those Who Went Before Us.
>
> Right.  I tend to do the same, and wonder if we shouldn't make that a
> general practice.

This looks sensible to me. No-brainer rules that make sense are less
things to worry about.
-- 
Michael