Thread: [bug fix] Cascaded standby cannot start after a clean shutdown

[bug fix] Cascaded standby cannot start after a clean shutdown

From
"Tsunakawa, Takayuki"
Date:
Hello,

Our customer encountered a rare bug of PostgreSQL which prevents a cascaded standby from starting up.  The attached
patch is a fix for it.  I hope this will be back-patched.  I'll add this to the next CF.


PROBLEM
==============================

The PostgreSQL version is 9.5.  The cluster consists of a master, a cascading standby (SB1), and a cascaded standby
(SB2). The WAL flows like this: master -> SB1 -> SB2.
 

The user shut down SB2 and tried to restart it, but failed with the following message:

    FATAL:  invalid memory alloc request size 3075129344

The master and SB1 continued to run.


CAUSE
==============================

total_len in the code below was about 3GB, so palloc() rejected the memory allocation request.

[xlogreader.c]
    record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
    total_len = record->xl_tot_len;
...
    /*
     * Enlarge readRecordBuf as needed.
     */
    if (total_len > state->readRecordBufSize &&
        !allocate_recordbuf(state, total_len))
    {

Why was XLogRecord->xl_tot_len so big?  That's because the XLOG reader read the garbage portion in a WAL file, which is
immediately after the end of valid WAL records.

Why was there the garbage?  Because the cascading standby sends just the valid WAL records, not the whole WAL blocks,
to the cascaded standby.  When the cascaded standby receives those WAL records and write them in a recycled WAL file,
the last WAL block in the file contains the mix of valid WAL records and the garbage left over.

Why did the XLOG reader try to allocate memory for garbage?  Doesn't it stop reading the WAL?  As the following code
shows, when the WAL record header crosses a WAL block boundary, the XLOG reader first allocates memory for the whole
record, and then checks the validity of the record header as soon as it reads the entire header.

[xlogreader.c]
    /*
     * If the whole record header is on this page, validate it immediately.
     * Otherwise do just a basic sanity check on xl_tot_len, and validate the
     * rest of the header after reading it from the next page.  The xl_tot_len
     * check is necessary here to ensure that we enter the "Need to reassemble
     * record" code path below; otherwise we might fail to apply
     * ValidXLogRecordHeader at all.
     */
    if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
    {
        if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
                                   randAccess))
            goto err;
        gotheader = true;
    }
    else
    {
        /* XXX: more validation should be done here */
        if (total_len < SizeOfXLogRecord)
        {
            report_invalid_record(state,
                                  "invalid record length at %X/%X: wanted %u, got %u",
                                  (uint32) (RecPtr >> 32), (uint32) RecPtr,
                                  (uint32) SizeOfXLogRecord, total_len);
            goto err;
        }
        gotheader = false;
    }


FIX
==============================

One idea is to defer the memory allocation until the entire record header is read and ValidXLogRecordHeader() is
called. However, ValidXLogRecordHeader() might misjudge the validity if the garbage contains xl_prev seeming correct.
 

So, I modified walreceiver to zero-fill the last partial WAL block.  This is the same guard that the master applies in
AdvanceXLInsertBuffer(), as below:

        /*
         * Be sure to re-zero the buffer so that bytes beyond what we've
         * written will look like zeroes and not valid XLOG records...
         */
        MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
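
To make the idea concrete, here is a minimal, self-contained sketch of the walreceiver-side behaviour I have in mind
(this is not the attached patch itself; BLOCK_SIZE and the plain write() loop are just stand-ins for XLOG_BLCKSZ and
the real WAL file I/O):

    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192         /* stands in for XLOG_BLCKSZ */

    /* write the whole buffer, retrying on short writes */
    static ssize_t
    write_all(int fd, const char *buf, size_t len)
    {
        size_t      done = 0;

        while (done < len)
        {
            ssize_t     n = write(fd, buf + done, len - done);

            if (n <= 0)
                return -1;
            done += (size_t) n;
        }
        return (ssize_t) done;
    }

    /*
     * Write "nbytes" of received WAL whose first byte lands at file offset
     * "startoff" (the fd is assumed to be positioned there already), then
     * zero-fill up to the next block boundary if the chunk stops mid-block,
     * so that a recycled segment never keeps stale bytes right after the
     * valid records.  A later chunk landing in the same block has to seek
     * back before writing.
     */
    static int
    write_wal_zero_padded(int fd, off_t startoff, const char *buf, size_t nbytes)
    {
        static const char zeroes[BLOCK_SIZE];   /* static storage is zeroed */
        size_t      tail = (size_t) ((startoff + (off_t) nbytes) % BLOCK_SIZE);

        if (write_all(fd, buf, nbytes) < 0)
            return -1;
        if (tail != 0 && write_all(fd, zeroes, BLOCK_SIZE - tail) < 0)
            return -1;
        return 0;
    }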

FYI, the following unsolved problem may be solved, too.

https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com


Regards
Takayuki Tsunakawa


Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Wed, Feb 14, 2018 at 04:37:05AM +0000, Tsunakawa, Takayuki wrote:
> The PostgreSQL version is 9.5.  The cluster consists of a master, a
> cascading standby (SB1), and a cascaded standby (SB2).  The WAL flows
> like this: master -> SB1 -> SB2.
>
> The user shut down SB2 and tried to restart it, but failed with the
> following message:
>
>     FATAL:  invalid memory alloc request size 3075129344
>
> The master and SB1 continued to run.

This matches a couple of recent bug reports we have seen around like
this one:
https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
I recalled as well this one, but in this case the user shot himself in
the foot with the failover scenario he designed:
https://www.postgresql.org/message-id/CAAc9rOyKAyzipiq7ee1%3DVbcRy2fRqV_hRujLbC0mrBkL07%3D7wQ%40mail.gmail.com

> CAUSE
> ==============================
>
> total_len in the code below was about 3GB, so palloc() rejected the
> memory allocation request.

Yes, it is limited to 1GB.

> [xlogreader.c]
>     record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
>     total_len = record->xl_tot_len;
> ...
>     /*
>      * Enlarge readRecordBuf as needed.
>      */
>     if (total_len > state->readRecordBufSize &&
>         !allocate_recordbuf(state, total_len))
>     {
>
> Why was XLogRecord->xl_tot_len so big?  That's because the XLOG reader
> read the garbage portion in a WAL file, which is immediately after the
> end of valid WAL records.
>
> Why was there the garbage?  Because the cascading standby sends just
> the valid WAL records, not the whole WAL blocks, to the cascaded
> standby.  When the cascaded standby receives those WAL records and
> write them in a recycled WAL file, the last WAL block in the file
> contains the mix of valid WAL records and the garbage left over.

Wait a minute here, when recycled, past WAL segments would be filled with
zeros before being used.

> Why did the XLOG reader try to allocate memory for garbage?  Doesn't
> it stop reading the WAL?  As the following code shows, when the WAL
> record header crosses a WAL block boundary, the XLOG reader first
> allocates memory for the whole record, and then checks the validity of
> the record header as soon as it reads the entire header.
>
> [...]
>
> FIX
> ==============================
>
> One idea is to defer the memory allocation until the entire record
> header is read and ValidXLogRecordHeader() is called.  However,
> ValidXLogRecordHeader() might misjudge the validity if the garbage
> contains xl_prev seeming correct.

It seems to me that the consolidation of the page read should happen
directly in xlogreader.c and not even in one of its callbacks so as no
garbage data is presented back to the caller using its own XLogReader.
I think that you need to initialize XLogReaderState->readBuf directly in
ReadPageInternal() before reading a page and you should be good.  With
your patch you get visibly only one portion of things addressed, what of
other code paths using xlogreader.c's APIs like pg_rewind, 2PC code and
such?

> FYI, the following unsolved problem may be solved, too.
>
> https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com

Yeah, I noticed this one too before going through your message in
details ;)

Please note that your patch has some unnecessary garbage in two places:
- * Portions Copyright (c) 2010-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2010-2017, PostgreSQL Global Development Group
[...]
-     * Only superusers and members of pg_read_all_stats can see details.
-     * Other users only get the pid value
+     * Only superusers can see details. Other users only get the pid valu

You may want to indent your patch as well before sending it.

+    if (zbuffer == NULL)
+        zbuffer = palloc0(XLOG_BLCKSZ);
You could just use a static buffer which is MemSet'd with 0 only if
necessary.
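
For clarity, the shape I have in mind is something like this (a rough sketch only; the boolean flag is an assumption
of mine, and a static array is zero-initialized at program start anyway, so the MemSet only matters if the buffer can
ever get scribbled on):

    /* block of zeroes used to pad the tail of a partial WAL block */
    static char zbuffer[XLOG_BLCKSZ];
    static bool zbuffer_ready = false;

    if (!zbuffer_ready)
    {
        MemSet(zbuffer, 0, XLOG_BLCKSZ);
        zbuffer_ready = true;
    }
    /* ... then write zbuffer to pad the block, no palloc0() needed ... */
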
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Fri, Feb 16, 2018 at 04:19:00PM +0900, Michael Paquier wrote:
> Wait a minute here, when recycled, past WAL segments would be filled with
> zeros before being used.

Please feel free to ignore this part.  I pushed the "Send" button
without seeing it, and I was thinking under which circumstances segments
get zero-filled, which is moot for this thread.
--
Michael

Attachment

RE: [bug fix] Cascaded standby cannot start after a clean shutdown

From
"Tsunakawa, Takayuki"
Date:
Thank you for reviewing.

From: Michael Paquier [mailto:michael@paquier.xyz]
> It seems to me that the consolidation of the page read should happen directly
> in xlogreader.c and not even in one of its callbacks so as no garbage data
> is presented back to the caller using its own XLogReader.
> I think that you need to initialize XLogReaderState->readBuf directly in
> ReadPageInternal() before reading a page and you should be good.  With your
> patch you get visibly only one portion of things addressed, what of other
> code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?

ReadPageInternal() doesn't know where the end of valid WAL is, so it cannot determine where to do memset(0).  For
example, in non-streaming cases, it reads the entire WAL block into readbuf, including the garbage.

    /*
     * If the current segment is being streamed from master, calculate how
     * much of the current page we have received already. We know the
     * requested record has been received, but this is for the benefit of
     * future calls, to allow quick exit at the top of this function.
     */
    if (readSource == XLOG_FROM_STREAM)
    {
        if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
            readLen = XLOG_BLCKSZ;
        else
            readLen = receivedUpto % XLogSegSize - targetPageOff;
    }
    else
        readLen = XLOG_BLCKSZ;


So we have to zero-fill the empty space of a WAL block before writing it.  Currently, the master does that in
AdvanceXLInsertBuffer(), StartupXLOG(), XLogFileCopy().  The cascading standby receives WAL from the master block by
block, so it doesn't suffer from the garbage.  pg_receivexlog zero-fills a new WAL file.

> Please note that your patch has some unnecessary garbage in two places:

Ouch, cleaned up.


> +    if (zbuffer == NULL)
> +        zbuffer = palloc0(XLOG_BLCKSZ);
> You could just use a static buffer which is MemSet'd with 0 only if necessary.

I wanted to allocate memory only when necessary.  Currently, only the cascaded standby needs this.  To that end, I
moved the memory allocation code to the right place.  Thanks for letting me notice this.

Regards
Takayuki Tsunakawa



Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Mon, Feb 19, 2018 at 03:01:15AM +0000, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:michael@paquier.xyz]

Sorry for my late reply.  I was looking at this problem for the last
couple of days here and there, still thinking about it.

>> It seems to me that the consolidation of the page read should happen directly
>> in xlogreader.c and not even in one of its callbacks so as no garbage data
>> is presented back to the caller using its own XLogReader.
>> I think that you need to initialize XLogReaderState->readBuf directly in
>> ReadPageInternal() before reading a page and you should be good.  With your
>> patch you get visibly only one portion of things addressed, what of other
>> code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?
>
> ReadPageInternal() doesn't know where the end of valid WAL is, so it
> cannot determine where to do memset(0).  For example, in non-streaming
> cases, it reads the entire WAL block into readbuf, including the
> garbage.

Why couldn't it know about it?  It would be perfectly fine to feed it
the end LSN position as well, and it is an internal API of xlogreader.c.
Note that both its callers, XLogFindNextRecord and XLogReadRecord, know
about that as well.
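
To illustrate, a hypothetical helper living next to ReadPageInternal() in xlogreader.c could look like this (the
function and its validUpto argument are assumptions of mine, not existing code):

    /*
     * Blank out everything on the page past the known end of valid WAL,
     * so that the checks done after a page read see zeroes instead of
     * leftover garbage.
     */
    static void
    zero_invalid_tail(XLogReaderState *state, XLogRecPtr pageptr,
                      int readLen, XLogRecPtr validUpto)
    {
        int         validLen;

        if (validUpto >= pageptr + readLen)
            return;             /* everything read is known to be valid */

        validLen = (validUpto > pageptr) ? (int) (validUpto - pageptr) : 0;
        memset(state->readBuf + validLen, 0, readLen - validLen);
    }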

>     /*
>      * If the current segment is being streamed from master, calculate how
>      * much of the current page we have received already. We know the
>      * requested record has been received, but this is for the benefit of
>      * future calls, to allow quick exit at the top of this function.
>      */
>     if (readSource == XLOG_FROM_STREAM)
>     {
>         if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
>             readLen = XLOG_BLCKSZ;
>         else
>             readLen = receivedUpto % XLogSegSize - targetPageOff;
>     }
>     else
>         readLen = XLOG_BLCKSZ;
>
> So we have to zero-fill the empty space of a WAL block before writing
> it.  Currently, the master does that in AdvanceXLInsertBuffer(),
> StartupXLOG(), XLogFileCopy().  The cascading standby receives WAL
> from the master block by block, so it doesn't suffer from the garbage.
> pg_receivexlog zero-fills a new WAL file.

I cannot completely parse this statement.  Physical replication sends up
to 16 WAL pages per message, rounded down to the last page completed.
Even a cascading standby sends WAL following this protocol, using the
last flush position as a base for the maximum amount of data sent.

>> +    if (zbuffer == NULL)
>> +        zbuffer = palloc0(XLOG_BLCKSZ);
>> You could just use a static buffer which is MemSet'd with 0 only if necessary.
>
> I wanted to allocate memory only when necessary.  Currently, only the
> cascaded standby needs this.  To that end, I moved the memory
> allocation code to the right place.  Thanks for letting me notice this.

Still, it seems to me that the problem is not related to the fact that
we are using cascading standbys.  As [1] says as well, similar problems
have been reported on a standby after restarting its primary.

I am definitely ready to buy that it is possible to have garbage read
into the length field, which can cause allocate_recordbuf to fail, as
that's the only code path in xlogreader.c which does such an
allocation.  Still, it seems to me that we should first try to see if
there are strange allocation patterns that happen and see if it is
possible to have a reproducible test case or a pattern which gives us
confidence that we are on the right track.  One idea I have is to
monitor those allocations like the following:
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
    newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
        newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

+#ifndef FRONTEND
+   elog(LOG, "Allocation for xlogreader increased to %u", newSize);
+#endif

This could be upgraded to a PANIC or such if it sees a larger
allocation; as pgbench generates records with known lengths, that would
be a good fit to see if there is some garbage being read.  No need to go
up to 1GB before seeing a failure.

[1]:
https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Thu, Feb 22, 2018 at 04:55:38PM +0900, Michael Paquier wrote:
> I am definitely ready to buy that it is possible to have garbage read
> into the length field, which can cause allocate_recordbuf to fail, as
> that's the only code path in xlogreader.c which does such an
> allocation.  Still, it seems to me that we should first try to see if
> there are strange allocation patterns that happen and see if it is
> possible to have a reproducible test case or a pattern which gives us
> confidence that we are on the right track.  One idea I have is to
> monitor those allocations like the following:
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
>     newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
>         newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
>
> +#ifndef FRONTEND
> +   elog(LOG, "Allocation for xlogreader increased to %u", newSize);
> +#endif

So, I have been playing a bit more with that and defined the following
strategy to see if it is possible to create inconsistencies:
- Use a primary and a standby.
- Set up max_wal_size and min_wal_size to a minimum of 80MB so as the
segment recycling takes effect more quickly.
- Create a single table with a UUID column to increase the likelihood of
random data in INSERT records and FPWs, and insert enough data to
trigger a full WAL recycling.
- Every 5 seconds, insert a set of tuples into the table; 110 to 120
tuples generate enough data for a bit more than a full WAL page.  Then
restart the primary.  This causes the standby to catch up, normally with
a streamed page which is not completely initialized, as it fetches the
page from the middle.

With the monitoring mentioned in the upper comment block, I have let the
whole thing run for a couple of hours, but I have not been able to catch
any problems, except the usual "invalid record length at 0/XXX: wanted
24, got 0".  The allocation for recordbuf did not get higher than 40960
bytes either, which matches 5 WAL pages.

Another, evil, idea that I have on top of all those things is to
directly hexedit the WAL segment of the standby just at the limit where
it would receive a record from the primary and insert in it garbage
data which would make the validation tests to blow up in xlogreader.c
for the record allocation.
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Fri, Feb 23, 2018 at 11:26:31AM +0900, Michael Paquier wrote:
> Another, evil, idea that I have on top of all those things is to
> directly hexedit the WAL segment of the standby just at the limit where
> it would receive a record from the primary and insert in it garbage
> data which would make the validation tests to blow up in xlogreader.c
> for the record allocation.

OK, I have been playing with such methods and finally I have been able
to check the theory of Tsunakawa-san here.  At first I played with
hexedit to pollute the data after the last record received by a standby,
a record which is not at the end of a WAL page.  This can easily happen
by stopping the primary, which will most likely only send data up to the
middle of a page.  Using that I have easily faced errors like this:
LOG: invalid resource manager ID 255 at 0/75353B8
And I have been able to see that garbage could be read as the length of
a record before the validation of the header is done.  With the patch
attached you can easily see a collection of garbage:
LOG: length of 77913214 fetched for record 0/30009D8
And this happens if a standby is reading a page with the last record in
the middle of it.

At this stage, the XLOG reader is dealing with random data, so this
would most likely fail in ValidXLogRecordHeader(), which is what
happened with the invalid rmgr for example.  However, if one is really
unlucky, and the probability of facing that is really low, the random
data being read *can* pass the validation checks of the header, which
would cause a set of problems:
- In the backend, this means a failure on palloc_extended if the garbage
read is higher than 1GB.  This causes a standby to stop immediately
while it should normally continue to poll the primary by spawning a new
WAL receiver and begin streaming from the beginning of the segment where
it needs its data (assuming that there is no archive).
- In the frontend, this can cause problems for pg_waldump and
pg_rewind.  For pg_waldump, this is not actually a big deal because a
failure means the end of the data that can be decoded.  However the
problem is more serious for pg_rewind.  At the initial phase of the
tool, pg_rewind scans the WAL segments of the target server to look for
the modified blocks since the last checkpoint before promotion up to the
end of the stream.  If at the end of the stream it finds garbage data,
then there is a risk that it allocates a couple of GBs of data, likely
ending with the process being killed by the automatic OOM killer,
preventing the rewind from completing :(

After some thought, I have also come up with a simpler way to test
the stability of the XLOG reader:
1) Create a primary/standby cluster.
2) Stop the standby.
3) Add to the standby's pg_wal a set of junk WAL segments generated with
dd if=/dev/urandom of=junk_walseg bs=16MB count=1.  Note that recycled
WAL segments keep the contents of past ones, so when a standby writes a
new segment on top of a recycled one it writes data on top of some
garbage.  So it is possible to face this problem anyway; this just makes
it show up faster.
4) Start the standby again, and leave it running.
5) Generate on the primary roughly 8kB worth of records, enough to fill
a complete WAL page.
6) Restart the primary cleanly, sleep ~5s to give the standby time to
stream the new partial page, and return to 5).

When 5) and 6) loop, you will see the monitoring log of the attached
patch complain after a couple of restarts.

Tsunakawa-san has proposed upthread to fix the problem by zero-ing the
page read in the WAL receiver.  While I agree that zeroing the page is
the way to go, doing so in the WAL receiver does not take care of
problems with the frontends.  I don't have the time to test that yet as
it is late here, but based on my code lookups tweaking
ReadPageInternal() so as the page is zero'ed correctly should do it for
all the cases.  This way, the checks happening after a page read would
fail because of the zero'ed fields consistently instead of the garbage
accumulated.
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Fri, Feb 23, 2018 at 11:02:19PM +0900, Michael Paquier wrote:
> Tsunakawa-san has proposed upthread to fix the problem by zero-ing the
> page read in the WAL receiver.  While I agree that zeroing the page is
> the way to go, doing so in the WAL receiver does not take care of
> problems with the frontends.  I don't have the time to test that yet as
> it is late here, but based on my code lookups tweaking
> ReadPageInternal() so as the page is zero'ed correctly should do it for
> all the cases.  This way, the checks happening after a page read would
> fail because of the zero'ed fields consistently instead of the garbage
> accumulated.

I have been playing more with that this morning, and trying to tweak the
XLOG reader so as the fetched page is zeroed where necessary does not
help much.  XLogReaderState->EndRecPtr is updated once the last record
is set so it is possible to use it and zero the rest of the page which
would prevent a lookup.  But this is actually not wise for performance:
- The data after EndRecPtr could be perfectly valid, so you could zero
data which could be reused later.
- It is necessary to drop the quick-exit checks in ReadPageInternal.

The WAL receiver approach also has a drawback.  If WAL is streamed at
full speed, then the primary sends data with a maximum of 16 WAL pages.
When beginning streaming with a new segment, the WAL sent stops at a
page boundary.  But if you stop once in the middle of a page then you
need to zero-fill the page until the current segment is finished
streaming.  So if the workload generates spiky WAL then the WAL receiver
would do a lot of extra lseek() calls with the patch applied, while all
the writes would be sequential on HEAD, so that's not good
performance-wise IMO.

Another idea I am thinking about would be to zero-fill the segments
when recycled instead of being just renamed when doing
InstallXLogFileSegment().  This would also have the advantage of making
the segments ahead more compressible, which is a gain for custom
backups, and the WAL receiver does not need any tweaks as it would write
the data on a clean file.  Zero-filling the segments is done only when a
new segment is created (see XLogFileInit).
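
As a rough sketch of what that recycling-time zero-fill could look like (plain POSIX calls for illustration only; the
real thing would go through the usual xlog.c file handling, and the segment size is hard-coded here as an assumption):

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SEG_SIZE    (16 * 1024 * 1024)  /* stands in for the WAL segment size */
    #define BLOCK_SIZE  8192                /* stands in for XLOG_BLCKSZ */

    /* overwrite an old segment with zeroes before renaming it into place */
    static int
    zero_fill_segment(const char *path)
    {
        static const char zeroes[BLOCK_SIZE];   /* static storage is zeroed */
        int         fd = open(path, O_WRONLY, 0);
        off_t       written = 0;

        if (fd < 0)
            return -1;

        while (written < SEG_SIZE)
        {
            ssize_t     n = write(fd, zeroes, BLOCK_SIZE);

            if (n <= 0)
            {
                close(fd);
                return -1;
            }
            written += n;
        }

        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        return close(fd);
    }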

Each approach has its own cost. Thoughts?
--
Michael

Attachment

RE: [bug fix] Cascaded standby cannot start after a clean shutdown

From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz]
> I have been playing more with that this morning, and trying to tweak the
> XLOG reader so as the fetched page is zeroed where necessary does not help
> much.  XLogReaderState->EndRecPtr is updated once the last record is set
> so it is possible to use it and zero the rest of the page which would prevent
> a lookup.  But this is actually not wise for performance:
> - The data after EndRecPtr could be perfectly valid, so you could zero data
> which could be reused later.
> - It is necessary to drop the quick-exit checks in ReadPageInternal.

First of all, thanks for your many experiments and ideas.

Yes, the above method doesn't work.  The reason is not only performance but also correctness.  After you zero-fill the
remaining portion of state->readBuf based on state->EndRecPtr, you need to read the same page upon the request of the
next WAL record.  And that page read brings garbage again into state->readBuf.  After all, the callers of
ReadPageInternal() have to face the garbage.


> The WAL receiver approach also has a drawback.  If WAL is streamed at full
> speed, then the primary sends data with a maximum of 16 WAL pages.
> When beginning streaming with a new segment, the WAL sent stops at a page
> boundary.  But if you stop once in the middle of a page then you need
> to zero-fill the page until the current segment is finished streaming.  So
> if the workload generates spiky WAL then the WAL receiver would do a lot
> of extra lseek() calls with the patch applied, while all the writes would
> be sequential on HEAD, so that's not good performance-wise IMO.

Does even the non-cascading standby stop in the middle of a page?  I thought the master always sends whole WAL blocks
without stopping in the middle of a page.


> Another idea I am thinking about would be to zero-fill the segments when
> recycled instead of being just renamed when doing InstallXLogFileSegment().
> This would also have the advantage of making the segments ahead more
> compressible, which is a gain for custom backups, and the WAL receiver does
> not need any tweaks as it would write the data on a clean file.  Zero-filling
> the segments is done only when a new segment is created (see XLogFileInit).

Yes, I was (and am) inclined to take this approach; this is easy and clean, but not good for performance...  I hope
there's something which justifies this approach.


Regards
Takayuki Tsunakawa




Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Mon, Feb 26, 2018 at 07:25:46AM +0000, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:michael@paquier.xyz]
>> The WAL receiver approach also has a drawback.  If WAL is streamed at full
>> speed, then the primary sends data with a maximum of 16 WAL pages.
>> When beginning streaming with a new segment, the WAL sent stops at a page
>> boundary.  But if you stop once in the middle of a page then you need
>> to zero-fill the page until the current segment is finished streaming.  So
>> if the workload generates spiky WAL then the WAL receiver would do a lot
>> of extra lseek() calls with the patch applied, while all the writes would
>> be sequential on HEAD, so that's not good performance-wise IMO.
>
> Does even the non-cascading standby stop in the middle of a page?  I
> thought the master always sends whole WAL blocks without stopping in the
> middle of a page.

You even have problems on normal standbys.  I have a small script which
is able to reproduce that if you want (it needs a small rewrite as it is
adapted to my test framework); it introduces a garbage set of WAL
segments on a stopped standby.  With the small monitoring patch I
mentioned upthread you can then see the XLOG reader finding garbage data
before validating the record header.  With any fix on the WAL
receiver, your first patch included, the garbage read goes away,
and the XLOG reader complains about a record with an incorrect length
(invalid record length at XX/YYY: wanted 24, got 0) instead of complaints
from the header validation part.  One key point is to cleanly stop the
primary, as it forces the standby's WAL receiver to write to its WAL
segment in the middle of a page.

>> Another idea I am thinking about would be to zero-fill the segments when
>> recycled instead of being just renamed when doing InstallXLogFileSegment().
>> This would also have the advantage of making the segments ahead more
>> compressible, which is a gain for custom backups, and the WAL receiver does
>> not need any tweaks as it would write the data on a clean file.  Zero-filling
>> the segments is done only when a new segment is created (see XLogFileInit).
>
> Yes, I was (and am) inclined to take this approach; this is easy and
> clean, but not good for performance...  I hope there's something which
> justifies this approach.

InstallXLogFileSegment uses a plain durable_link_or_rename() to recycle
the past segment which syncs the old segment before the rename anyway,
so the I/O effort will be there, no?

This was mentioned back in 2001 by the way, but this did not count much
for the case discussed here:
https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us
The issue here is that the streaming case makes it easier to hit the
problem as it more easily opens access to not-completely-written WAL
pages depending on the message frequency during replication.  At the
same time, we are discussing a very low-probability issue.  Note
that if the XLOG reader is bumping into this problem, then at the next
WAL receiver wake up, recovery would begin from the beginning of the
last segment, and if the primary has produced some more WAL then the
standby would be able to actually avoid the random junk.  It is also
possible to bypass the problem by zeroing manually the areas in
question, or to actually wait for the primary to generate more WAL so as
the garbage is overwritten automatically.  And you really need to be
very, very unlucky to have random garbage able to bypass the header
validation checks.
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Mon, Feb 26, 2018 at 05:08:49PM +0900, Michael Paquier wrote:
> This was mentioned back in 2001 by the way, but this did not count much
> for the case discussed here:
> https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us
> The issue here is that the streaming case makes it easier to hit the
> problem as it more easily opens access to not-completely-written WAL
> pages depending on the message frequency during replication.  At the
> same time, we are discussing a very low-probability issue.  Note
> that if the XLOG reader is bumping into this problem, then at the next
> WAL receiver wake up, recovery would begin from the beginning of the
> last segment, and if the primary has produced some more WAL then the
> standby would be able to actually avoid the random junk.  It is also
> possible to bypass the problem by zeroing manually the areas in
> question, or to actually wait for the primary to generate more WAL so as
> the garbage is overwritten automatically.  And you really need to be
> very, very unlucky to have random garbage able to bypass the header
> validation checks.

By the way, while I have my mind on it: another strategy would be
to just make the checks in XLogReadRecord() a bit smarter if the whole
record header is not on the page.  If we check at least for
AllocSizeIsValid(total_len) then this code would not fail on an
allocation as your user reported.  Still this misses the case where a
record size is lower than 1GB but invalid, so you would call
allocate_recordbuf for nothing :(

At least this extra check is costless, and avoids all kinds of hard
failures.
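
In code, the extra check would be something like the following fragment, extending the "XXX: more validation should be
done here" branch quoted at the top of the thread (a sketch only, not a final patch):

    else
    {
        /* XXX: more validation should be done here */
        if (total_len < SizeOfXLogRecord)
        {
            report_invalid_record(state,
                                  "invalid record length at %X/%X: wanted %u, got %u",
                                  (uint32) (RecPtr >> 32), (uint32) RecPtr,
                                  (uint32) SizeOfXLogRecord, total_len);
            goto err;
        }

        /*
         * Sketch of the additional check: a length that palloc could never
         * satisfy cannot belong to a valid record, so treat it as bogus
         * data instead of attempting (and hard-failing) the allocation.
         */
        if (!AllocSizeIsValid(total_len))
        {
            report_invalid_record(state,
                                  "record length %u at %X/%X too long",
                                  total_len,
                                  (uint32) (RecPtr >> 32), (uint32) RecPtr);
            goto err;
        }
        gotheader = false;
    }
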
--
Michael

Attachment

RE: [bug fix] Cascaded standby cannot start after a clean shutdown

From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz]
> By the way, while I have my mind on it: another strategy would be
> to just make the checks in XLogReadRecord() a bit smarter if the whole record
> header is not on the page.  If we check at least for
> AllocSizeIsValid(total_len) then this code would not fail on an
> allocation as your user reported.  Still this misses the case where a record
> size is lower than 1GB but invalid, so you would call allocate_recordbuf
> for nothing :(

That was my first thought, and I gave it up.  As you say, XLogReadRecord() could allocate up to 1 GB of memory for
garbage.  That allocation can fail due to memory shortage, which prevents the recovery from proceeding.


Regards
Takayuki Tsunakawa






Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Tue, Feb 27, 2018 at 05:15:29AM +0000, Tsunakawa, Takayuki wrote:
> That was my first thought, and I gave it up.  As you say,
> XLogReadRecord() could allocate up to 1 GB of memory for garbage.
> That allocation can fail due to memory shortage, which prevents the
> recovery from proceeding.

Even with that, the resulting patch is sort of ugly...  So it seems to
me that the conclusion of this thread is that there is no clear winner,
and that the problem is so unlikely to happen that it is not worth the
performance impact of zeroing all the WAL pages fetched.  Still, the
attached has the advantage of not causing the startup process to fail
suddenly because of the allocation failure but instead to fail afterwards
when validating the record's contents, which has the disadvantage of
temporarily allocating up to 1GB of memory for some of the garbage, but
that allocation is short-lived.  So that would switch the failure from a
FATAL allocation failure taking down Postgres to something which allows
fetching of new WAL records to be tried again.

All in all, that would be a win for the case reported on this thread.

Thoughts from anybody?
--
Michael

Attachment

RE: [bug fix] Cascaded standby cannot start after a clean shutdown

From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz]
> Even with that, the resulting patch is sort of ugly...  So it seems to me
> that the conclusion of this thread is that there is no clear winner, and
> that the problem is so unlikely to happen that it is not worth the performance
> impact of zeroing all the WAL pages fetched.  Still, the attached has the
> advantage of not causing the startup process to fail suddenly because of
> the allocation failure but instead to fail afterwards when validating the
> record's contents, which has the disadvantage of temporarily allocating up
> to 1GB of memory for some of the garbage, but that allocation is
> short-lived.  So that would switch the failure from a FATAL allocation
> failure taking down Postgres to something which allows fetching of new WAL
> records to be tried again.
> 
> All in all, that would be a win for the case reported on this thread.

I'm sorry to be late to reply, and thanks for another patch.

Honestly, I'm fine with either patch.  I like your simpler and cleaner one that has no performance impact.  But please
note that the allocation attempt could amount to nearly 1 GB.  That can fail due to memory shortage, which leads to
FATAL error (ereport(ERROR) results in FATAL in startup process), and cause standby to shut down.

Regards
Takayuki Tsunakawa




Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Fri, Mar 16, 2018 at 05:27:58AM +0000, Tsunakawa, Takayuki wrote:
> Honestly, I'm fine with either patch.  I like your simpler and cleaner
> one that has no performance impact.  But please note that the
> allocation attempt could amount to nearly 1 GB.  That can fail due to
> memory shortage, which leads to FATAL error (ereport(ERROR) results in
> FATAL in startup process), and cause standby to shut down.

We use palloc_extended with MCXT_ALLOC_NO_OOM in 9.5~, and malloc()
further down, so once you remove the FATAL error caused by a record
whose length is higher than 1GB, you discard all the hard failures, no?
--
Michael

Attachment

RE: [bug fix] Cascaded standby cannot start after a clean shutdown

From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz]
> We use palloc_extended with MCXT_ALLOC_NO_OOM in 9.5~, and malloc() further
> down, so once you remove the FATAL error caused by a record whose length
> is higher than 1GB, you discard all the hard failures, no?

Ouch, you're right.  If memory allocation fails, the startup process would emit a LOG message and continue to fetch new
WAL records.  Then, I'm completely happy with your patch.

I added you as the author, and marked this as ready for committer.

Regards
Takayuki Tsunakawa




Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Fri, Mar 16, 2018 at 06:02:25AM +0000, Tsunakawa, Takayuki wrote:
> Ouch, you're right.  If memory allocation fails, the startup process
> would emit a LOG message and continue to fetch new WAL records.  Then,
> I'm completely happy with your patch.

Thanks for double-checking, Tsunakawa-san.
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Sun, Mar 18, 2018 at 08:49:01AM +0900, Michael Paquier wrote:
> On Fri, Mar 16, 2018 at 06:02:25AM +0000, Tsunakawa, Takayuki wrote:
>> Ouch, you're right.  If memory allocation fails, the startup process
>> would emit a LOG message and continue to fetch new WAL records.  Then,
>> I'm completely happy with your patch.
>
> Thanks for double-checking, Tsunakawa-san.

As this is one of those small bug fixes for which we can do something,
attached is an updated patch with a commit description, and
tutti-quanti.  At the end, I have moved the size check within
allocate_recordbuf().  Even if the size calculated is not going to be
higher than total_len, it feels safer in the long term for two reasons:
- Allocation size calculation could be changed, or let's say made
smarter.
- New callers of allocate_recordbuf would not be able to see the problem
for the backend, and this could become a problem if the WAL reader
facility is extended.
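
For readers following along, the shape of that change is roughly the following (a paraphrased sketch, not the patch
verbatim; the existing caller already treats a false return from this function as bogus data and goes to its error
path, so no hard failure occurs):

    static bool
    allocate_recordbuf(XLogReaderState *state, uint32 reclength)
    {
        uint32      newSize = reclength;

        newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
        newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

    #ifndef FRONTEND
        /*
         * A garbage length read from a recycled segment could exceed what
         * the backend allocator accepts; failing softly here lets the
         * caller treat it as bogus data instead of stopping the startup
         * process with a hard allocation failure.
         */
        if (!AllocSizeIsValid(newSize))
            return false;
    #endif

        if (state->readRecordBuf)
            pfree(state->readRecordBuf);
        state->readRecordBuf =
            (char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
        if (state->readRecordBuf == NULL)
        {
            state->readRecordBufSize = 0;
            return false;
        }
        state->readRecordBufSize = newSize;
        return true;
    }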

Thoughts?
--
Michael

Attachment

RE: [bug fix] Cascaded standby cannot start after a clean shutdown

From
"Tsunakawa, Takayuki"
Date:
> From: Michael Paquier [mailto:michael@paquier.xyz]
> As this is one of those small bug fixes for which we can do something, attached
> is an updated patch with a commit description, and tutti-quanti.  At the
> end, I have moved the size check within allocate_recordbuf().  Even if the
> size calculated is not going to be higher than total_len, it feels safer
> in the long term for two reasons:

Thank you so much.  This version looks better.

+     * this would cause the instance to stop suddendly with a hard failure,

suddendly -> suddenly


Regards
Takayuki Tsunakawa






Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Tue, Jun 12, 2018 at 06:30:49AM +0000, Tsunakawa, Takayuki wrote:
> Thank you so much.  This version looks better.
>
> +     * this would cause the instance to stop suddendly with a hard failure,
>
> suddendly -> suddenly

Yep.  Thanks for the extra lookup.
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Tue, Jun 12, 2018 at 04:27:50PM +0900, Michael Paquier wrote:
> On Tue, Jun 12, 2018 at 06:30:49AM +0000, Tsunakawa, Takayuki wrote:
>> Thank you so much.  This version looks better.
>>
>> +     * this would cause the instance to stop suddendly with a hard failure,
>>
>> suddendly -> suddenly
>
> Yep.  Thanks for the extra lookup.

Note for everybody on this list: I will be out for a couple of days at
the end of this week, and my intention is to finish wrapping this patch
at the beginning of next week, with a back-patch down to 9.5 where
palloc_extended has been introduced as this has been seen a couple of
times on -bugs or -hackers.  Feel free of course to comment and look at
the thread for more background on the matter.
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Wed, Jun 13, 2018 at 09:00:47AM +0900, Michael Paquier wrote:
> Note for everybody on this list: I will be out for a couple of days at
> the end of this week, and my intention is to finish wrapping this patch
> at the beginning of next week, with a back-patch down to 9.5 where
> palloc_extended has been introduced as this has been seen a couple of
> times on -bugs or -hackers.  Feel free of course to comment and look at
> the thread for more background on the matter.

Okay, done this way.

Trying to run regression tests in parallel in ~9.5 leads to spurious
failures, which is annoying...  I had a patch fixing that but I cannot
put my finger on the thread where this has been discussed.
--
Michael

Attachment

Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Trying to run regression tests in parallel in ~9.5 leads to spurious
> failures, which is annoying...  I had a patch fixing that but I cannot
> put my finger on the thread where this has been discussed.

Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
recollection is that multiple fixes were needed to get to that point,
so I doubt it's worth trying to fix in older branches.

            regards, tom lane


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Andres Freund
Date:
Hi,

On 2018-06-17 22:31:02 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > Trying to run regression tests in parallel in ~9.5 leads to spurious
> > failures, which is annoying...  I had a patch fixing that but I cannot
> > put my finger on the thread where this has been discussed.
> 
> Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
> recollection is that multiple fixes were needed to get to that point,
> so I doubt it's worth trying to fix in older branches.

There were multiple fixes, that's right - I'd personally however still
be in favor of trying to backpatch those. Running a check world serially
for the backbranches seriously slows down things for me.

Greetings,

Andres Freund


Re: [bug fix] Cascaded standby cannot start after a clean shutdown

From
Michael Paquier
Date:
On Sun, Jun 17, 2018 at 07:33:01PM -0700, Andres Freund wrote:
> On 2018-06-17 22:31:02 -0400, Tom Lane wrote:
>> Yeah, for me parallelized check-world only works in >= 9.6.  My (vague)
>> recollection is that multiple fixes were needed to get to that point,
>> so I doubt it's worth trying to fix in older branches.
>
> There were multiple fixes, that's right - I'd personally however still
> be in favor of trying to backpatch those.

It took me some time, but I have been able to find the thread which
discusses all the fixes needed to get parallel builds to work on older
branches without having to port PostgresNode.pm, which is a way higher
level of effort:
https://www.postgresql.org/message-id/CAB7nPqQTGveO3_zvnXQAs32cMosX8c76_ewe4L1cNL=1xZmt+g@mail.gmail.com
Could that be considered?

> Running a check world serially for the backbranches seriously slows
> down things for me.

Me too.  That was already frustrating one year ago, and it still is.
--
Michael

Attachment