Thread: [bug fix] Cascaded standby cannot start after a clean shutdown
Hello,

Our customer encountered a rare bug of PostgreSQL which prevents a cascaded standby from starting up. The attached patch is a fix for it. I hope this will be back-patched. I'll add this to the next CF.

PROBLEM
==============================

The PostgreSQL version is 9.5. The cluster consists of a master, a cascading standby (SB1), and a cascaded standby (SB2). The WAL flows like this: master -> SB1 -> SB2.

The user shut down SB2 and tried to restart it, but failed with the following message:

FATAL: invalid memory alloc request size 3075129344

The master and SB1 continued to run.

CAUSE
==============================

total_len in the code below was about 3GB, so palloc() rejected the memory allocation request.

[xlogreader.c]
record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
total_len = record->xl_tot_len;
...
/*
 * Enlarge readRecordBuf as needed.
 */
if (total_len > state->readRecordBufSize &&
    !allocate_recordbuf(state, total_len))
{

Why was XLogRecord->xl_tot_len so big? That's because the XLOG reader read the garbage portion in a WAL file, which is immediately after the end of valid WAL records.

Why was there garbage? Because the cascading standby sends just the valid WAL records, not the whole WAL blocks, to the cascaded standby. When the cascaded standby receives those WAL records and writes them in a recycled WAL file, the last WAL block in the file contains a mix of valid WAL records and leftover garbage.

Why did the XLOG reader try to allocate memory for garbage? Doesn't it stop reading the WAL? As the following code shows, when the WAL record header crosses a WAL block boundary, the XLOG reader first allocates memory for the whole record, and then checks the validity of the record header as soon as it reads the entire header.

[xlogreader.c]
/*
 * If the whole record header is on this page, validate it immediately.
 * Otherwise do just a basic sanity check on xl_tot_len, and validate the
 * rest of the header after reading it from the next page. The xl_tot_len
 * check is necessary here to ensure that we enter the "Need to reassemble
 * record" code path below; otherwise we might fail to apply
 * ValidXLogRecordHeader at all.
 */
if (targetRecOff <= XLOG_BLCKSZ - SizeOfXLogRecord)
{
    if (!ValidXLogRecordHeader(state, RecPtr, state->ReadRecPtr, record,
                               randAccess))
        goto err;
    gotheader = true;
}
else
{
    /* XXX: more validation should be done here */
    if (total_len < SizeOfXLogRecord)
    {
        report_invalid_record(state,
                              "invalid record length at %X/%X: wanted %u, got %u",
                              (uint32) (RecPtr >> 32), (uint32) RecPtr,
                              (uint32) SizeOfXLogRecord, total_len);
        goto err;
    }
    gotheader = false;
}

FIX
==============================

One idea is to defer the memory allocation until the entire record header is read and ValidXLogRecordHeader() is called. However, ValidXLogRecordHeader() might misjudge the validity if the garbage contains an xl_prev that seems correct.

So, I modified walreceiver to zero-fill the last partial WAL block. This is the guard that the master applies in AdvanceXLInsertBuffer(), as below:

/*
 * Be sure to re-zero the buffer so that bytes beyond what we've
 * written will look like zeroes and not valid XLOG records...
 */
MemSet((char *) NewPage, 0, XLOG_BLCKSZ);

FYI, the following unsolved problem may be solved, too.

https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com

Regards
Takayuki Tsunakawa
Attachment
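To illustrate the proposed fix, here is a rough sketch of a walreceiver-side guard (this is not the attached patch; the function name and the write routine it calls are hypothetical): any WAL write that ends in the middle of a block is padded with zeroes up to the block boundary, mirroring the MemSet() guard quoted above from AdvanceXLInsertBuffer().

/*
 * Hypothetical sketch, not the attached patch: zero-pad a WAL write so that
 * the tail of the last touched block is zeroed before it lands in a
 * recycled segment.  XLogWalRcvWritePadded() and do_wal_write() are
 * illustrative names only.
 */
static void
XLogWalRcvWritePadded(char *buf, Size nbytes, XLogRecPtr recptr)
{
    Size    tail = (recptr + nbytes) % XLOG_BLCKSZ;

    if (tail != 0)
    {
        Size    padded = nbytes + (XLOG_BLCKSZ - tail);
        char   *zbuf = palloc0(padded);     /* trailing pad bytes stay zero */

        memcpy(zbuf, buf, nbytes);
        do_wal_write(zbuf, padded, recptr);
        pfree(zbuf);
    }
    else
        do_wal_write(buf, nbytes, recptr);
}

With such padding in place, the last touched block of a recycled segment never exposes stale bytes past the end of valid WAL to the XLOG reader.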
On Wed, Feb 14, 2018 at 04:37:05AM +0000, Tsunakawa, Takayuki wrote:
> The PostgreSQL version is 9.5. The cluster consists of a master, a
> cascading standby (SB1), and a cascaded standby (SB2). The WAL flows
> like this: master -> SB1 -> SB2.
>
> The user shut down SB2 and tried to restart it, but failed with the
> following message:
>
> FATAL: invalid memory alloc request size 3075129344
>
> The master and SB1 continued to run.

This matches a couple of recent bug reports we have seen around like this one:
https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
I recalled as well this one but in this case the user shot his own foot with the failover scenario he designed:
https://www.postgresql.org/message-id/CAAc9rOyKAyzipiq7ee1%3DVbcRy2fRqV_hRujLbC0mrBkL07%3D7wQ%40mail.gmail.com

> CAUSE
> ==============================
>
> total_len in the code below was about 3GB, so palloc() rejected the
> memory allocation request.

Yes, it is limited to 1GB.

> [xlogreader.c]
> record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
> total_len = record->xl_tot_len;
> ...
> /*
> * Enlarge readRecordBuf as needed.
> */
> if (total_len > state->readRecordBufSize &&
>     !allocate_recordbuf(state, total_len))
> {
>
> Why was XLogRecord->xl_tot_len so big? That's because the XLOG reader
> read the garbage portion in a WAL file, which is immediately after the
> end of valid WAL records.
>
> Why was there garbage? Because the cascading standby sends just
> the valid WAL records, not the whole WAL blocks, to the cascaded
> standby. When the cascaded standby receives those WAL records and
> writes them in a recycled WAL file, the last WAL block in the file
> contains a mix of valid WAL records and leftover garbage.

Wait a minute here, when recycled past WAL segments would be filled with zeros before being used.

> Why did the XLOG reader try to allocate memory for garbage? Doesn't
> it stop reading the WAL? As the following code shows, when the WAL
> record header crosses a WAL block boundary, the XLOG reader first
> allocates memory for the whole record, and then checks the validity of
> the record header as soon as it reads the entire header.
>
> [...]
>
> FIX
> ==============================
>
> One idea is to defer the memory allocation until the entire record
> header is read and ValidXLogRecordHeader() is called. However,
> ValidXLogRecordHeader() might misjudge the validity if the garbage
> contains an xl_prev that seems correct.

It seems to me that the consolidation of the page read should happen directly in xlogreader.c and not even in one of its callbacks so as no garbage data is presented back to the caller using its own XLogReader. I think that you need to initialize XLogReaderState->readBuf directly in ReadPageInternal() before reading a page and you should be good. With your patch you get visibly only one portion of things addressed, what of other code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?

> FYI, the following unsolved problem may be solved, too.
>
> https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com

Yeah, I noticed this one too before going through your message in details ;)

Please note that your patch has some unnecessary garbage in two places:

- * Portions Copyright (c) 2010-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 2010-2017, PostgreSQL Global Development Group
[...]
- * Only superusers and members of pg_read_all_stats can see details.
- * Other users only get the pid value
+ * Only superusers can see details. Other users only get the pid valu

You may want to indent your patch as well before sending it.

+ if (zbuffer == NULL)
+     zbuffer = palloc0(XLOG_BLCKSZ);

You could just use a static buffer which is MemSet'd with 0 only if necessary.
-- Michael
Attachment
On Fri, Feb 16, 2018 at 04:19:00PM +0900, Michael Paquier wrote:
> Wait a minute here, when recycled past WAL segments would be filled with
> zeros before being used.

Please feel free to ignore this part. I pushed the "Send" button without seeing it, and I was thinking under which circumstances segments get zero-filled, which is moot for this thread.
-- Michael
Attachment
Thank you for reviewing.

From: Michael Paquier [mailto:michael@paquier.xyz]
> It seems to me that the consolidation of the page read should happen directly
> in xlogreader.c and not even in one of its callbacks so as no garbage data
> is presented back to the caller using its own XLogReader.
> I think that you need to initialize XLogReaderState->readBuf directly in
> ReadPageInternal() before reading a page and you should be good. With your
> patch you get visibly only one portion of things addressed, what of other
> code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?

ReadPageInternal() doesn't know where the end of valid WAL is, so it cannot determine where to do memset(0). For example, in non-streaming cases, it reads the entire WAL block into readbuf, including the garbage.

    /*
     * If the current segment is being streamed from master, calculate how
     * much of the current page we have received already. We know the
     * requested record has been received, but this is for the benefit of
     * future calls, to allow quick exit at the top of this function.
     */
    if (readSource == XLOG_FROM_STREAM)
    {
        if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
            readLen = XLOG_BLCKSZ;
        else
            readLen = receivedUpto % XLogSegSize - targetPageOff;
    }
    else
        readLen = XLOG_BLCKSZ;

So we have to zero-fill the empty space of a WAL block before writing it. Currently, the master does that in AdvanceXLInsertBuffer(), StartupXLOG(), and XLogFileCopy(). The cascading standby receives WAL from the master block by block, so it doesn't suffer from the garbage. pg_receivexlog zero-fills a new WAL file.

> Please note that your patch has some unnecessary garbage in two places:

Ouch, cleaned up.

> + if (zbuffer == NULL)
> +     zbuffer = palloc0(XLOG_BLCKSZ);
> You could just use a static buffer which is MemSet'd with 0 only if necessary.

I wanted to allocate memory only when necessary. Currently, only the cascaded standby needs this. To that end, I moved the memory allocation code to the right place. Thanks for letting me notice this.

Regards
Takayuki Tsunakawa
Attachment
On Mon, Feb 19, 2018 at 03:01:15AM +0000, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:michael@paquier.xyz]

Sorry for my late reply. I was looking at this problem for the last couple of days here and there, still thinking about it.

>> It seems to me that the consolidation of the page read should happen directly
>> in xlogreader.c and not even in one of its callbacks so as no garbage data
>> is presented back to the caller using its own XLogReader.
>> I think that you need to initialize XLogReaderState->readBuf directly in
>> ReadPageInternal() before reading a page and you should be good. With your
>> patch you get visibly only one portion of things addressed, what of other
>> code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?
>
> ReadPageInternal() doesn't know where the end of valid WAL is, so it
> cannot determine where to do memset(0). For example, in non-streaming
> cases, it reads the entire WAL block into readbuf, including the
> garbage.

Why couldn't it know about it? It would be perfectly fine to feed it the end LSN position as well, and it is an internal API to xlogreader.c. Note that both its callers, XLogFindNextRecord and XLogReadRecord, know about that as well.

>     /*
>      * If the current segment is being streamed from master, calculate how
>      * much of the current page we have received already. We know the
>      * requested record has been received, but this is for the benefit of
>      * future calls, to allow quick exit at the top of this function.
>      */
>     if (readSource == XLOG_FROM_STREAM)
>     {
>         if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
>             readLen = XLOG_BLCKSZ;
>         else
>             readLen = receivedUpto % XLogSegSize - targetPageOff;
>     }
>     else
>         readLen = XLOG_BLCKSZ;
>
> So we have to zero-fill the empty space of a WAL block before writing
> it. Currently, the master does that in AdvanceXLInsertBuffer(),
> StartupXLOG(), and XLogFileCopy(). The cascading standby receives WAL
> from the master block by block, so it doesn't suffer from the garbage.
> pg_receivexlog zero-fills a new WAL file.

I cannot completely parse this statement. Physical replication sends up to 16 WAL pages per message, rounded down to the last page completed. Even a cascading standby sends WAL following this protocol, using the last flush position as a base for the maximum amount of data sent.

>> + if (zbuffer == NULL)
>> +     zbuffer = palloc0(XLOG_BLCKSZ);
>> You could just use a static buffer which is MemSet'd with 0 only if necessary.
>
> I wanted to allocate memory only when necessary. Currently, only the
> cascaded standby needs this. To that end, I moved the memory
> allocation code to the right place. Thanks for letting me notice this.

Still, it seems to me that any problems are not related to the fact that we are using cascading standbys. As [1] says as well, similar problems have been reported on a standby after restarting its primary.

I am definitely ready to buy that it is possible to have garbage read into the length field, which can cause allocate_recordbuf to fail, as that's the only code path in xlogreader.c which does such an allocation. Still, it seems to me that we should first try to see if there are strange allocation patterns that happen and see if it is possible to have a reproducible test case or a pattern which gives us confidence that we are on the right track.
One idea I have is to monitor those allocations like the following:

--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
 	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
 
+#ifndef FRONTEND
+	elog(LOG, "Allocation for xlogreader increased to %u", newSize);
+#endif

This could be upgraded to a PANIC or such if it sees a larger allocation; as pgbench generates records with known lengths, that would be a good fit to see if there is some garbage being read. No need to go up to 1GB before seeing a failure.

[1]: https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
-- Michael
Attachment
On Thu, Feb 22, 2018 at 04:55:38PM +0900, Michael Paquier wrote:
> I am definitely ready to buy that it is possible to have garbage read
> into the length field, which can cause allocate_recordbuf to fail, as
> that's the only code path in xlogreader.c which does such an
> allocation. Still, it seems to me that we should first try to see if
> there are strange allocation patterns that happen and see if it is
> possible to have a reproducible test case or a pattern which gives us
> confidence that we are on the right track. One idea I have is to
> monitor those allocations like the following:
> --- a/src/backend/access/transam/xlogreader.c
> +++ b/src/backend/access/transam/xlogreader.c
> @@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
> 	newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
> 	newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));
>
> +#ifndef FRONTEND
> +	elog(LOG, "Allocation for xlogreader increased to %u", newSize);
> +#endif

So, I have been playing a bit more with that and defined the following strategy to see if it is possible to create inconsistencies:

- Use a primary and a standby.
- Set up max_wal_size and min_wal_size to a minimum of 80MB so that segment recycling takes effect more quickly.
- Create a single table with a UUID column to increase the likelihood of random data in INSERT records and FPWs, and insert enough data to trigger a full WAL recycling.
- Every 5 seconds, insert a set of tuples into the table; 110 to 120 tuples generate enough data for a bit more than a full WAL page. Then restart the primary. This causes the standby to catch up, normally with a streamed page that is not completely initialized, as it fetches the page in the middle.

With the monitoring mentioned in the upper comment block, I have let the whole thing run for a couple of hours, but I have not been able to catch any problems, except the usual "invalid record length at 0/XXX: wanted 24, got 0". The allocation for recordbuf did not get higher than 40960 bytes either, which matches 5 WAL pages.

Another, evil, idea that I have on top of all those things is to directly hexedit the WAL segment of the standby just at the limit where it would receive a record from the primary and insert into it garbage data which would make the validation tests blow up in xlogreader.c for the record allocation.
-- Michael
Attachment
On Fri, Feb 23, 2018 at 11:26:31AM +0900, Michael Paquier wrote:
> Another, evil, idea that I have on top of all those things is to
> directly hexedit the WAL segment of the standby just at the limit where
> it would receive a record from the primary and insert into it garbage
> data which would make the validation tests blow up in xlogreader.c
> for the record allocation.

OK, I have been playing with such methods and finally I have been able to confirm the theory of Tsunakawa-san here. At first I played with hexedit to pollute the data after the last record received by a standby, a record which is not at the end of a WAL page. This can easily happen by stopping the primary, which will most likely only send data up to the middle of a page. Using that I have easily faced errors like this:

LOG: invalid resource manager ID 255 at 0/75353B8

And I have been able to see that garbage could be read as the length of a record before the validation of the header is done. With the patch attached you can easily see a collection of garbage:

LOG: length of 77913214 fetched for record 0/30009D8

And this happens if a standby is reading a page with the last record in the middle of it. At this stage, the XLOG reader is dealing with random data, so this would most likely fail in ValidXLogRecordHeader(), which is what happened with the invalid rmgr for example. However, if one is really unlucky, and the probability of facing that is really low, the random data being read *can* pass the validation checks of the header, which would cause a set of problems:

- In the backend, this means a failure in palloc_extended if the garbage read is higher than 1GB. This causes a standby to stop immediately while it should normally continue to poll the primary by spawning a new WAL receiver and begin streaming from the beginning of the segment where it needs its data (assuming that there is no archive).
- In the frontend, this can cause problems for pg_waldump and pg_rewind. For pg_waldump, this is not actually a big deal because a failure means the end of the data that can be decoded. However the problem is more serious for pg_rewind. At the initial phase of the tool, pg_rewind scans the WAL segments of the target server to look for the modified blocks since the last checkpoint before promotion up to the end of the stream. If at the end of the stream it finds garbage data, then there is a risk that it allocates a couple of GBs of data, likely finishing by causing the process to be killed by the automatic OOM killer, preventing the rewind from completing :(

After some thought, I have also come up with a simpler way to test the stability of the XLOG reader:
1) Create a primary/standby cluster.
2) Stop the standby.
3) Add to the standby's pg_wal a set of junk WAL segments generated with dd if=/dev/urandom of=junk_walseg bs=16MB count=1. Note that recycled WAL segments are simple copies of past ones when fetching an existing one, so when a standby writes a new segment based on a past existing one, it writes data on top of some garbage. So it is possible to face this problem anyway; this just makes it show up faster.
4) Start the standby again, and leave it running.
5) Generate on the primary enough records worth of more or less 8kB to fill in a complete WAL page.
6) Restart the primary cleanly, sleep something like 5s to give the standby time to stream the new partial page, and return to 5).

When 5) and 6) loop, you will see the monitoring log of the attached patch complain after a couple of restarts.
Tsunakawa-san has proposed upthread to fix the problem by zeroing the page read in the WAL receiver. While I agree that zeroing the page is the way to go, doing so in the WAL receiver does not take care of problems with the frontends. I don't have the time to test that yet as it is late here, but based on my code lookups, tweaking ReadPageInternal() so that the page is zeroed correctly should do it for all the cases. This way, the checks happening after a page read would consistently fail because of the zeroed fields instead of the accumulated garbage.
-- Michael
Attachment
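As an illustration of the reader-side idea mentioned just above (and, as the follow-ups show, later rejected), zeroing the tail of the fetched page could look roughly like the sketch below. The validUpto parameter is hypothetical; the real ReadPageInternal() takes no such argument, which is part of why the idea does not hold up.

/*
 * Hypothetical sketch only: after reading a page into state->readBuf,
 * zero everything past the end of known-valid WAL so that later header
 * checks see zeroes rather than leftover garbage.
 */
static void
zero_invalid_tail(XLogReaderState *state, XLogRecPtr pageptr,
                  int readLen, XLogRecPtr validUpto)
{
    if (validUpto >= pageptr && validUpto < pageptr + readLen)
    {
        int     validBytes = (int) (validUpto - pageptr);

        memset(state->readBuf + validBytes, 0, readLen - validBytes);
    }
}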
On Fri, Feb 23, 2018 at 11:02:19PM +0900, Michael Paquier wrote:
> Tsunakawa-san has proposed upthread to fix the problem by zeroing the
> page read in the WAL receiver. While I agree that zeroing the page is
> the way to go, doing so in the WAL receiver does not take care of
> problems with the frontends. I don't have the time to test that yet as
> it is late here, but based on my code lookups, tweaking
> ReadPageInternal() so that the page is zeroed correctly should do it for
> all the cases. This way, the checks happening after a page read would
> consistently fail because of the zeroed fields instead of the
> accumulated garbage.

I have been playing more with that this morning, and trying to tweak the XLOG reader so that the fetched page is zeroed where necessary does not help much. XLogReaderState->EndRecPtr is updated once the last record is set, so it is possible to use it and zero the rest of the page, which would prevent a lookup. But this is actually not wise for performance:
- The data after EndRecPtr could be perfectly valid, so you could zero data which could be reused later.
- It is necessary to drop the quick-exit checks in ReadPageInternal.

The WAL receiver approach also has a drawback. If WAL is streamed at full speed, then the primary sends data with a maximum of 6 WAL pages. When beginning streaming with a new segment, the WAL sent stops at a page boundary. But if you stop once in the middle of a page then you need to zero-fill the page until the current segment is finished streaming. So if the workload generates spiky WAL, then the WAL receiver would need a lot of extra lseek() calls with the patch applied, while all the writes would be sequential on HEAD, so that's not good performance-wise IMO.

Another idea I am thinking about would be to zero-fill the segments when recycled instead of just renaming them in InstallXLogFileSegment(). This would also have the advantage of making the segments ahead more compressible, which is a gain for custom backups, and the WAL receiver does not need any tweaks as it would write the data on a clean file. Zero-filling the segments is done only when a new segment is created (see XLogFileInit).

Each approach has its own cost. Thoughts?
-- Michael
Attachment
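For illustration, the recycle-time zero-fill idea above could look roughly like the sketch below (simplified, with hypothetical integration into InstallXLogFileSegment() and minimal error handling; this is not what was eventually committed):

/*
 * Sketch only: overwrite a to-be-recycled segment with zeros, one WAL
 * block at a time, before it is renamed into place.
 */
static void
zero_fill_recycled_segment(int fd)
{
    char    zbuf[XLOG_BLCKSZ];
    uint32  off;

    memset(zbuf, 0, sizeof(zbuf));
    for (off = 0; off < XLogSegSize; off += XLOG_BLCKSZ)
    {
        if (write(fd, zbuf, sizeof(zbuf)) != sizeof(zbuf))
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not zero-fill WAL segment: %m")));
    }
    if (pg_fsync(fd) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not fsync WAL segment: %m")));
}

The cost trade-off debated next is exactly the extra full-segment write this would add on every recycle.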
From: Michael Paquier [mailto:michael@paquier.xyz]
> I have been playing more with that this morning, and trying to tweak the
> XLOG reader so that the fetched page is zeroed where necessary does not
> help much. XLogReaderState->EndRecPtr is updated once the last record is
> set, so it is possible to use it and zero the rest of the page, which
> would prevent a lookup. But this is actually not wise for performance:
> - The data after EndRecPtr could be perfectly valid, so you could zero
> data which could be reused later.
> - It is necessary to drop the quick-exit checks in ReadPageInternal.

First of all, thanks for your many experiments and ideas.

Yes, the above method doesn't work. The reason is not only performance but also correctness. After you zero-fill the remaining portion of state->readbuf based on state->EndRecPtr, you need to read the same page upon the request of the next WAL record. And that page read brings garbage again into state->readbuf. After all, the callers of ReadPageInternal() have to face the garbage.

> The WAL receiver approach also has a drawback. If WAL is streamed at full
> speed, then the primary sends data with a maximum of 6 WAL pages.
> When beginning streaming with a new segment, the WAL sent stops at a
> page boundary. But if you stop once in the middle of a page then you need
> to zero-fill the page until the current segment is finished streaming. So
> if the workload generates spiky WAL, then the WAL receiver would need a
> lot of extra lseek() calls with the patch applied, while all the writes
> would be sequential on HEAD, so that's not good performance-wise IMO.

Does even the non-cascading standby stop in the middle of a page? I thought the master always sends the whole WAL blocks without stopping in the middle of a page.

> Another idea I am thinking about would be to zero-fill the segments when
> recycled instead of just renaming them in InstallXLogFileSegment().
> This would also have the advantage of making the segments ahead more
> compressible, which is a gain for custom backups, and the WAL receiver does
> not need any tweaks as it would write the data on a clean file. Zero-filling
> the segments is done only when a new segment is created (see XLogFileInit).

Yes, I was (and am) inclined to take this approach; this is easy and clean, but not good for performance... I hope there's something which justifies this approach.

Regards
Takayuki Tsunakawa
On Mon, Feb 26, 2018 at 07:25:46AM +0000, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:michael@paquier.xyz]
>> The WAL receiver approach also has a drawback. If WAL is streamed at full
>> speed, then the primary sends data with a maximum of 6 WAL pages.
>> When beginning streaming with a new segment, the WAL sent stops at a
>> page boundary. But if you stop once in the middle of a page then you need
>> to zero-fill the page until the current segment is finished streaming. So
>> if the workload generates spiky WAL, then the WAL receiver would need a
>> lot of extra lseek() calls with the patch applied, while all the writes
>> would be sequential on HEAD, so that's not good performance-wise IMO.
>
> Does even the non-cascading standby stop in the middle of a page? I
> thought the master always sends the whole WAL blocks without stopping in
> the middle of a page.

You even have problems on normal standbys. I have a small script which is able to reproduce that if you want (it needs a small rewrite as it is adapted to my test framework) which introduces a garbage set of WAL segments on a stopped standby. With the small monitoring patch I mentioned upthread you can see the XLOG reader finding garbage data as well before validating the record header. With any fixes on the WAL receiver, your first patch included, the garbage read goes away, and the XLOG reader complains about a record with an incorrect length (invalid record length at XX/YYY: wanted 24, got 0) instead of complaints from the header validation part. One key point is to cleanly stop the primary, as it forces the standby's WAL receiver to write to its WAL segment in the middle of a page.

>> Another idea I am thinking about would be to zero-fill the segments when
>> recycled instead of just renaming them in InstallXLogFileSegment().
>> This would also have the advantage of making the segments ahead more
>> compressible, which is a gain for custom backups, and the WAL receiver does
>> not need any tweaks as it would write the data on a clean file. Zero-filling
>> the segments is done only when a new segment is created (see XLogFileInit).
>
> Yes, I was (and am) inclined to take this approach; this is easy and
> clean, but not good for performance... I hope there's something which
> justifies this approach.

InstallXLogFileSegment uses a plain durable_link_or_rename() to recycle the past segment, which syncs the old segment before the rename anyway, so the I/O effort will be there, no? This was mentioned back in 2001 by the way, but this did not count much for the case discussed here:
https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us

The issue here is that the streaming case makes it easier to hit the problem as it more easily opens access to not-completely-written WAL pages depending on the message frequency during replication. At the same time, we are discussing a very low-probability issue. Note that if the XLOG reader is bumping into this problem, then at the next WAL receiver wake-up, recovery would begin from the beginning of the last segment, and if the primary has produced some more WAL then the standby would be able to actually avoid the random junk. It is also possible to bypass the problem by manually zeroing the areas in question, or to actually wait for the standby to generate more WAL so that the garbage is overwritten automatically. And you really need to be very, very unlucky to have random garbage able to bypass the header validation checks.
-- Michael
Attachment
On Mon, Feb 26, 2018 at 05:08:49PM +0900, Michael Paquier wrote:
> This was mentioned back in 2001 by the way, but this did not count much
> for the case discussed here:
> https://www.postgresql.org/message-id/24901.995381770%40sss.pgh.pa.us
> The issue here is that the streaming case makes it easier to hit the
> problem as it more easily opens access to not-completely-written WAL
> pages depending on the message frequency during replication. At the
> same time, we are discussing a very low-probability issue. Note
> that if the XLOG reader is bumping into this problem, then at the next
> WAL receiver wake-up, recovery would begin from the beginning of the
> last segment, and if the primary has produced some more WAL then the
> standby would be able to actually avoid the random junk. It is also
> possible to bypass the problem by manually zeroing the areas in
> question, or to actually wait for the standby to generate more WAL so
> that the garbage is overwritten automatically. And you really need to be
> very, very unlucky to have random garbage able to bypass the header
> validation checks.

By the way, while I have it in mind: another strategy would be to just make the checks in XLogReadRecord() a bit smarter when the whole record header is not on the page. If we check at least for AllocSizeIsValid(total_len), then this code would not fail on an allocation as your user reported. Still, this misses the case where a record size is lower than 1GB but invalid, so allocate_recordbuf would allocate for nothing :( At least this extra check is costless, and avoids all kinds of hard failures.
-- Michael
Attachment
From: Michael Paquier [mailto:michael@paquier.xyz]
> By the way, while I have it in mind: another strategy would be to just
> make the checks in XLogReadRecord() a bit smarter when the whole record
> header is not on the page. If we check at least for
> AllocSizeIsValid(total_len), then this code would not fail on an
> allocation as your user reported. Still, this misses the case where a
> record size is lower than 1GB but invalid, so allocate_recordbuf would
> allocate for nothing :(

That was my first thought, and I gave it up. As you say, XLogReadRecord() could allocate up to 1 GB of memory for garbage. That allocation can fail due to memory shortage, which prevents the recovery from proceeding.

Regards
Takayuki Tsunakawa
On Tue, Feb 27, 2018 at 05:15:29AM +0000, Tsunakawa, Takayuki wrote:
> That was my first thought, and I gave it up. As you say,
> XLogReadRecord() could allocate up to 1 GB of memory for garbage.
> That allocation can fail due to memory shortage, which prevents the
> recovery from proceeding.

Even with that, the resulting patch is sort of ugly... So it seems to me that the conclusion to this thread is that there is no clear winner, and that the problem is so unlikely to happen that it is not worth the performance impact to zero all the WAL pages fetched. Still, the attached has the advantage of not causing the startup process to fail suddenly because of the allocation failure, but failing afterwards when validating the record's contents. It has the disadvantage of temporarily allocating up to 1GB of memory for some of the garbage, but that allocation is short-lived. So that would switch the failure from a FATAL allocation failure taking down Postgres to something which lets fetching of new WAL records be tried again. All in all, that would be a win for the case reported on this thread.

Thoughts from anybody?
-- Michael
Attachment
From: Michael Paquier [mailto:michael@paquier.xyz]
> Even with that, the resulting patch is sort of ugly... So it seems to me
> that the conclusion to this thread is that there is no clear winner, and
> that the problem is so unlikely to happen that it is not worth the
> performance impact to zero all the WAL pages fetched. Still, the attached
> has the advantage of not causing the startup process to fail suddenly
> because of the allocation failure, but failing afterwards when validating
> the record's contents. It has the disadvantage of temporarily allocating
> up to 1GB of memory for some of the garbage, but that allocation is
> short-lived. So that would switch the failure from a FATAL allocation
> failure taking down Postgres to something which lets fetching of new WAL
> records be tried again.
>
> All in all, that would be a win for the case reported on this thread.

I'm sorry to be late to reply, and thanks for another patch.

Honestly, I'm fine with either patch. I like your simpler and cleaner one that has no performance impact. But please note that the allocation attempt could amount to nearly 1 GB. That can fail due to memory shortage, which leads to a FATAL error (ereport(ERROR) results in FATAL in the startup process), and causes the standby to shut down.

Regards
Takayuki Tsunakawa
On Fri, Mar 16, 2018 at 05:27:58AM +0000, Tsunakawa, Takayuki wrote:
> Honestly, I'm fine with either patch. I like your simpler and cleaner
> one that has no performance impact. But please note that the
> allocation attempt could amount to nearly 1 GB. That can fail due to
> memory shortage, which leads to a FATAL error (ereport(ERROR) results
> in FATAL in the startup process), and causes the standby to shut down.

We use palloc_extended with MCXT_ALLOC_NO_OOM in 9.5~, and malloc() further down, so once you remove the FATAL error caused by a record whose length is higher than 1GB, you discard all the hard failures, no?
-- Michael
Attachment
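For reference, the allocation path being discussed works roughly like this (paraphrased from memory of the 9.5-era allocate_recordbuf(), so treat details as approximate rather than an exact quote): the out-of-memory case already comes back as a soft failure rather than a FATAL error, which is why only the over-1GB size check remains a hard failure.

/* Approximate paraphrase of allocate_recordbuf(), not an exact quote. */
static bool
allocate_recordbuf(XLogReaderState *state, uint32 reclength)
{
    uint32      newSize = reclength;

    newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
    newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

    if (state->readRecordBuf)
        pfree(state->readRecordBuf);
    /* NO_OOM: returns NULL on allocation failure instead of erroring out */
    state->readRecordBuf =
        (char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
    if (state->readRecordBuf == NULL)
    {
        state->readRecordBufSize = 0;
        return false;           /* the caller reports an invalid record */
    }
    state->readRecordBufSize = newSize;
    return true;
}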
From: Michael Paquier [mailto:michael@paquier.xyz]
> We use palloc_extended with MCXT_ALLOC_NO_OOM in 9.5~, and malloc() further
> down, so once you remove the FATAL error caused by a record whose length
> is higher than 1GB, you discard all the hard failures, no?

Ouch, you're right. If memory allocation fails, the startup process would emit a LOG message and continue to fetch new WAL records. Then, I'm completely happy with your patch. I added you as the author, and marked this as ready for committer.

Regards
Takayuki Tsunakawa
On Fri, Mar 16, 2018 at 06:02:25AM +0000, Tsunakawa, Takayuki wrote: > Ouch, you're right. If memory allocation fails, the startup process > would emit a LOG message and continue to fetch new WAL records. Then, > I'm completely happy with your patch. Thanks for double-checking, Tsunakawa-san. -- Michael
Attachment
On Sun, Mar 18, 2018 at 08:49:01AM +0900, Michael Paquier wrote:
> On Fri, Mar 16, 2018 at 06:02:25AM +0000, Tsunakawa, Takayuki wrote:
>> Ouch, you're right. If memory allocation fails, the startup process
>> would emit a LOG message and continue to fetch new WAL records. Then,
>> I'm completely happy with your patch.
>
> Thanks for double-checking, Tsunakawa-san.

As this is one of those small bug fixes for which we can do something, attached is an updated patch with a commit description, and tutti-quanti. At the end, I have moved the size check within allocate_recordbuf(). Even if the size calculated is not going to be higher than total_len, it feels safer in the long term for two reasons:
- Allocation size calculation could be changed, or let's say made smarter.
- New callers of allocate_recordbuf would not be able to see the problem for the backend, and this could become a problem if the WAL reader facility is extended.

Thoughts?
-- Michael
Attachment
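As an illustration of where that relocated check sits, a sketch of the kind of backend-only guard being described is below (the exact message and placement in the attached patch may differ): oversized requests are rejected before any allocation is attempted, and the caller treats the record as bogus data.

/* Sketch of the relocated size check at the top of allocate_recordbuf(). */
#ifndef FRONTEND
    /*
     * Reject sizes that palloc cannot serve.  Garbage read from a recycled
     * segment can produce a bogus xl_tot_len; failing softly here lets the
     * startup process retry fetching WAL instead of dying on a FATAL
     * allocation error.
     */
    if (!AllocSizeIsValid(newSize))
    {
        report_invalid_record(state,
                              "record length %u at %X/%X too long",
                              reclength,
                              (uint32) (state->ReadRecPtr >> 32),
                              (uint32) state->ReadRecPtr);
        return false;
    }
#endif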
> From: Michael Paquier [mailto:michael@paquier.xyz]
> As this is one of those small bug fixes for which we can do something, attached
> is an updated patch with a commit description, and tutti-quanti. At the
> end, I have moved the size check within allocate_recordbuf(). Even if the
> size calculated is not going to be higher than total_len, it feels safer
> in the long term for two reasons:

Thank you so much. This version looks better.

+ * this would cause the instance to stop suddendly with a hard failure,

suddendly -> suddenly

Regards
Takayuki Tsunakawa
On Tue, Jun 12, 2018 at 06:30:49AM +0000, Tsunakawa, Takayuki wrote: > Thank you so much. This version looks better. > > + * this would cause the instance to stop suddendly with a hard failure, > > suddendly -> suddenly Yep. Thanks for the extra lookup. -- Michael
Attachment
On Tue, Jun 12, 2018 at 04:27:50PM +0900, Michael Paquier wrote: > On Tue, Jun 12, 2018 at 06:30:49AM +0000, Tsunakawa, Takayuki wrote: >> Thank you so much. This version looks better. >> >> + * this would cause the instance to stop suddendly with a hard failure, >> >> suddendly -> suddenly > > Yep. Thanks for the extra lookup. Note for everybody on this list: I will be out for a couple of days at the end of this week, and my intention is to finish wrapping this patch at the beginning of next week, with a back-patch down to 9.5 where palloc_extended has been introduced as this has been seen a couple of times on -bugs or -hackers. Feel free of course to comment and look at the thread for more background on the matter. -- Michael
Attachment
On Wed, Jun 13, 2018 at 09:00:47AM +0900, Michael Paquier wrote: > Note for everybody on this list: I will be out for a couple of days at > the end of this week, and my intention is to finish wrapping this patch > at the beginning of next week, with a back-patch down to 9.5 where > palloc_extended has been introduced as this has been seen a couple of > times on -bugs or -hackers. Feel free of course to comment and look at > the thread for more background on the matter. Okay, done this way. Trying to run regression tests in parallel in ~9.5 leads to spurious failures, which is annoying... I had a patch fixing that but I cannot put my finger on the thread where this has been discussed. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > Trying to run regression tests in parallel in ~9.5 leads to spurious > failures, which is annoying... I had a patch fixing that but I cannot > put my finger on the thread where this has been discussed. Yeah, for me parallelized check-world only works in >= 9.6. My (vague) recollection is that multiple fixes were needed to get to that point, so I doubt it's worth trying to fix in older branches. regards, tom lane
Hi, On 2018-06-17 22:31:02 -0400, Tom Lane wrote: > Michael Paquier <michael@paquier.xyz> writes: > > Trying to run regression tests in parallel in ~9.5 leads to spurious > > failures, which is annoying... I had a patch fixing that but I cannot > > put my finger on the thread where this has been discussed. > > Yeah, for me parallelized check-world only works in >= 9.6. My (vague) > recollection is that multiple fixes were needed to get to that point, > so I doubt it's worth trying to fix in older branches. There were multiple fixes, that's right - I'd personally however still be in favor of trying to backpatch those. Running a check world serially for the backbranches seriously slows down things for me. Greetings, Andres Freund
On Sun, Jun 17, 2018 at 07:33:01PM -0700, Andres Freund wrote:
> On 2018-06-17 22:31:02 -0400, Tom Lane wrote:
>> Yeah, for me parallelized check-world only works in >= 9.6. My (vague)
>> recollection is that multiple fixes were needed to get to that point,
>> so I doubt it's worth trying to fix in older branches.
>
> There were multiple fixes, that's right - I'd personally however still
> be in favor of trying to backpatch those.

It took me some time, but I have been able to find the thread which discusses all the fixes needed to get parallel builds to work on older branches without having to port PostgresNode.pm, which would be a much higher level of effort:
https://www.postgresql.org/message-id/CAB7nPqQTGveO3_zvnXQAs32cMosX8c76_ewe4L1cNL=1xZmt+g@mail.gmail.com
Could that be considered?

> Running a check world serially for the backbranches seriously slows
> down things for me.

Me too. That was already frustrating one year ago, and it still is.
-- Michael