Thread: Non-replayable WAL records through overflows and >MaxAllocSize lengths

Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
Hi,

Xlogreader limits the size of what it considers valid xlog records to
MaxAllocSize, but this is not currently enforced in the
XLogRecordAssemble API. This means it is possible to assemble a record
that PostgreSQL cannot replay.
Similarly, it is possible to repeatedly call XLogRegisterData() so as
to overflow rec->xl_tot_len, resulting in out-of-bounds reads and
writes while processing record data.

PFA a patch that attempts to fix both of these issues in the insertion
API, by checking against overflows and other incorrectly large values
in the relevant functions in xloginsert.c. In this patch, I've also
added a comment to the XLogRecord spec to document that xl_tot_len
should not be larger than 1GB - 1B, and why that limit exists.
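
To illustrate the second issue, here is a contrived sketch (the buffer,
its size, and the call count are hypothetical) of how a caller of the
current API could wrap the 32-bit running total without any individual
call looking suspicious:

    /*
     * Contrived example: every call passes a valid "int" length, but the
     * uint32 total that eventually becomes xl_tot_len silently wraps.
     */
    char       *buf = palloc(512 * 1024 * 1024);    /* 512 MB of payload */

    XLogBeginInsert();
    for (int i = 0; i < 9; i++)                     /* 9 * 512 MB > UINT32_MAX */
        XLogRegisterData(buf, 512 * 1024 * 1024);
    /* XLogInsert() would then assemble a record whose xl_tot_len has wrapped */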

Kind regards,

Matthias van de Meent

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Heikki Linnakangas
Date:
On 11/03/2022 17:42, Matthias van de Meent wrote:
> Hi,
> 
> Xlogreader limits the size of what it considers valid xlog records to
> MaxAllocSize; but this is not currently enforced in the
> XLogRecAssemble API. This means it is possible to assemble a record
> that postgresql cannot replay.

Oops, that would be nasty.

> Similarly; it is possible to repeatedly call XlogRegisterData() so as
> to overflow rec->xl_tot_len; resulting in out-of-bounds reads and
> writes while processing record data;

And that too.

Have you been able to create a test case for that? The largest record I 
can think of is a commit record with a huge number of subtransactions, 
dropped relations, and shared inval messages. I'm not sure if you can 
overflow a uint32 with that, but exceeding MaxAllocSize seems possible.

> PFA a patch that attempts to fix both of these issues in the insertion
> API; by checking against overflows and other incorrectly large values
> in the relevant functions in xloginsert.c. In this patch, I've also
> added a comment to the XLogRecord spec to document that xl_tot_len
> should not be larger than 1GB - 1B; and why that limit exists.
> diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
> index c260310c4c..ae654177de 100644
> --- a/src/backend/access/transam/xloginsert.c
> +++ b/src/backend/access/transam/xloginsert.c
> @@ -342,6 +342,11 @@ XLogRegisterData(char *data, int len)
>  
>      if (num_rdatas >= max_rdatas)
>          elog(ERROR, "too much WAL data");
> +
> +    /* protect against overflow */
> +    if (unlikely((uint64) mainrdata_len + (uint64) len > UINT32_MAX))
> +        elog(ERROR, "too much WAL data");
> +
>      rdata = &rdatas[num_rdatas++];
>  
>      rdata->data = data;

Could check for just AllocSizeIsValid(mainrdata_len), if we're only
worried about the total size of the data exceeding the limit, and assume
that each individual piece of data is smaller.

We also don't check for negative 'len'. I think that's fine, the caller 
bears some responsibility for passing valid arguments too. But maybe 
uint32 or size_t would be more appropriate here.

I wonder if these checks hurt performance. These are very cheap, but 
then again, this codepath is very hot. It's probably fine, but it still 
worries me a little. Maybe some of these could be Asserts.

> @@ -387,6 +392,11 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>  
>      if (num_rdatas >= max_rdatas)
>          elog(ERROR, "too much WAL data");
> +
> +    /* protect against overflow */
> +    if (unlikely((uint64) regbuf->rdata_len + (uint64) len > UINT32_MAX))
> +        elog(ERROR, "too much WAL data");
> +
>      rdata = &rdatas[num_rdatas++];
>  
>      rdata->data = data;

Could check "len > UINT16_MAX". As you noted in XLogRecordAssemble, 
that's the real limit. And if you check for that here, you don't need to 
check it in XLogRecordAssemble.

> @@ -505,7 +515,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>                     XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
>  {
>      XLogRecData *rdt;
> -    uint32        total_len = 0;
> +    uint64        total_len = 0;
>      int            block_id;
>      pg_crc32c    rdata_crc;
>      registered_buffer *prev_regbuf = NULL;

I don't think the change to uint64 is necessary, if all the data blocks
are limited to 64 kB and the number of blocks is limited too.

> @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>  
>          if (needs_data)
>          {
> +            /* protect against overflow */
> +            if (unlikely(regbuf->rdata_len > UINT16_MAX))
> +                elog(ERROR, "too much WAL data for registered buffer");
> +
>              /*
>               * Link the caller-supplied rdata chain for this buffer to the
>               * overall list.
> @@ -836,6 +850,13 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>      for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
>          COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
>  
> +    /*
> +     * Ensure that xlogreader.c can read the record; and check that we don't
> +     * accidentally overflow the size of the record.
> +     * */
> +    if (unlikely(!AllocSizeIsValid(total_len) || total_len > UINT32_MAX))
> +        elog(ERROR, "too much registered data for WAL record");
> +
>      /*
>       * Fill in the fields in the record header. Prev-link is filled in later,
>       * once we know where in the WAL the record will be inserted. The CRC does

It's enough to check AllocSizeIsValid(total_len), no need to also check 
against UINT32_MAX.

- Heikki



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Robert Haas
Date:
On Fri, Mar 11, 2022 at 3:42 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Have you been able to create a test case for that? The largest record I
> can think of is a commit record with a huge number of subtransactions,
> dropped relations, and shared inval messages. I'm not sure if you can
> overflow a uint32 with that, but exceeding MaxAllocSize seems possible.

I believe that wal_level=logical can generate very large update and
delete records, especially with REPLICA IDENTITY FULL.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Andres Freund
Date:
Hi,

On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:
> Have you been able to create a test case for that? The largest record I can
> think of is a commit record with a huge number of subtransactions, dropped
> relations, and shared inval messages. I'm not sure if you can overflow a
> uint32 with that, but exceeding MaxAllocSize seems possible.

MaxAllocSize is pretty easy:
SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);

on a standby:

2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record length 2145386550 at 0/3000060 too long



> I wonder if these checks hurt performance. These are very cheap, but then
> again, this codepath is very hot. It's probably fine, but it still worries
> me a little. Maybe some of these could be Asserts.

I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
it should be statically predicted to be not taken with the unlikely. I don't
think it's quite inner-loop enough for the instructions or the number of
"concurrently out of order branches" to be a problem.

FWIW, often the added elog()s are worse, because they require a decent amount
of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
variables etc). They can't even be deduplicated because of the line-numbers
embedded.

So maybe just collapse the new elog() with the previous elog, with a common
unlikely()?
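
I.e., roughly (just a sketch of the collapsed form, not the exact patch):

    /* sketch: one predicted-not-taken branch, one elog() call site */
    if (unlikely(num_rdatas >= max_rdatas ||
                 !AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
        elog(ERROR, "too much WAL data");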


> > @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
> >          if (needs_data)
> >          {
> > +            /* protect against overflow */
> > +            if (unlikely(regbuf->rdata_len > UINT16_MAX))
> > +                elog(ERROR, "too much WAL data for registered buffer");
> > +
> >              /*
> >               * Link the caller-supplied rdata chain for this buffer to the
> >               * overall list.

FWIW, this branch I'm a tad more concerned about - it's in a loop body where
plausibly a lot of branches could be outstanding at the same time.

ISTM that this could just be an assert?

Greetings,

Andres Freund



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
Thank you all for the feedback. Please find attached v2 of the
patchset, which contains updated comments and applies the suggested
changes.

On Sat, 12 Mar 2022 at 02:03, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2022-03-11 22:42:42 +0200, Heikki Linnakangas wrote:
> > Have you been able to create a test case for that? The largest record I can
> > think of is a commit record with a huge number of subtransactions, dropped
> > relations, and shared inval messages. I'm not sure if you can overflow a
> > uint32 with that, but exceeding MaxAllocSize seems possible.
>
> MaxAllocSize is pretty easy:
> SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);
>
> on a standby:
>
> 2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record length 2145386550 at 0/3000060 too long

Thanks for the reference. I was already playing around with 2PC log
records (which can theoretically contain >4GB of data), but your
example is much easier and takes significantly less time.

I'm not sure whether or not to include this in the test suite, though,
as this would require a machine with at least 1GB of memory available
for this test alone, and I don't know the current requirements for
running the test suite.

> > I wonder if these checks hurt performance. These are very cheap, but then
> > again, this codepath is very hot. It's probably fine, but it still worries
> > me a little. Maybe some of these could be Asserts.
>
> I wouldn't expect the added branch itself to hurt much in XLogRegisterData() -
> it should be statically predicted to be not taken with the unlikely. I don't
> think it's quite inner-loop enough for the instructions or the number of
> "concurrently out of order branches" to be a problem.
>
> FWIW, often the added elog()s are worse, because they require a decent amount
> of code and restrict the optimizer somewhat (e.g. no sibling calls, more local
> variables etc). They can't even be deduplicated because of the line-numbers
> embedded.
>
> So maybe just collapse the new elog() with the previous elog, with a common
> unlikely()?

Updated.

> > > @@ -734,6 +744,10 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
> > >             if (needs_data)
> > >             {
> > > +                   /* protect against overflow */
> > > +                   if (unlikely(regbuf->rdata_len > UINT16_MAX))
> > > +                           elog(ERROR, "too much WAL data for registered buffer");
> > > +
> > >                     /*
> > >                      * Link the caller-supplied rdata chain for this buffer to the
> > >                      * overall list.
>
> FWIW, this branch I'm a tad more concerned about - it's in a loop body where
> plausibly a lot of branches could be outstanding at the same time.
>
> ISTM that this could just be an assert?

This specific location has been replaced with an Assert, while
XLogRegisterBufData always does the unlikely()-ed bounds check.

Kind regards,

Matthias

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Andres Freund
Date:
Hi

A random thought I had while thinking about the size limits: We could use the
low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bits of xl_info. Which would
allow us to e.g. get away from needing Heap2. Which would aesthetically be
pleasing.



On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:
> I'm not sure whether or not to include this in the test suite, though,
> as this would require a machine with at least 1GB of memory available
> for this test alone, and I don't know the current requirements for
> running the test suite.

We definitely shouldn't require this much RAM for the tests.

It might be worth adding tests exercising edge cases around segment boundaries
(and perhaps page boundaries) though. E.g. record headers split across pages
and segments.



> --- a/src/backend/access/transam/xloginsert.c
> +++ b/src/backend/access/transam/xloginsert.c
> @@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len)
>  {
>      XLogRecData *rdata;
>  
> -    Assert(begininsert_called);
> +    Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));

Shouldn't we just make the length argument unsigned?


> -    if (num_rdatas >= max_rdatas)
> +    /*
> +     * Check against max_rdatas; and ensure we don't fill a record with
> +     * more data than can be replayed
> +     */
> +    if (unlikely(num_rdatas >= max_rdatas ||
> +                 !AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
>          elog(ERROR, "too much WAL data");
> +
>      rdata = &rdatas[num_rdatas++];

Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
but I doubt if it makes an actual difference to the compiler.


>      rdata->data = data;
> @@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>      registered_buffer *regbuf;
>      XLogRecData *rdata;
>  
> -    Assert(begininsert_called);
> +    Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
>  
>      /* find the registered buffer struct */
>      regbuf = &registered_buffers[block_id];
> @@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>          elog(ERROR, "no block with id %d registered with WAL insertion",
>               block_id);
>  
> -    if (num_rdatas >= max_rdatas)
> +    /*
> +     * Check against max_rdatas; and ensure we don't register more data per
> +     * buffer than can be handled by the physical record format.
> +     */
> +    if (unlikely(num_rdatas >= max_rdatas ||
> +                 regbuf->rdata_len + len > UINT16_MAX))
>          elog(ERROR, "too much WAL data");
> +
>      rdata = &rdatas[num_rdatas++];

Given the repeated check it might be worth to just put it in a static inline
used from the relevant places (which'd generate less code because the same
line number would be used for all the checks).
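
Something along these lines, I'd imagine (sketch only):

    /* sketch: one shared, line-number-free error path for all the checks */
    static inline void
    XLogErrorDataLimitExceeded(void)
    {
        elog(ERROR, "too much WAL data");
    }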

Greetings,

Andres Freund



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Mon, 14 Mar 2022 at 18:14, Andres Freund <andres@anarazel.de> wrote:
>
> Hi
>
> A random thought I had while thinking about the size limits: We could use the
> low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
> XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
> allow us to e.g. get away from needing Heap2. Which would aestethically be
> pleasing.

That would be interesting; though out of scope for this bug I'm trying to fix.

> On 2022-03-14 17:57:23 +0100, Matthias van de Meent wrote:
> > I'm not sure whether or not to include this in the test suite, though,
> > as this would require a machine with at least 1GB of memory available
> > for this test alone, and I don't know the current requirements for
> > running the test suite.
>
> We definitely shouldn't require this much RAM for the tests.
>
> It might be worth adding tests exercising edge cases around segment boundaries
> (and perhaps page boundaries) though. E.g. record headers split across pages
> and segments.
>
>
>
> > --- a/src/backend/access/transam/xloginsert.c
> > +++ b/src/backend/access/transam/xloginsert.c
> > @@ -338,10 +338,16 @@ XLogRegisterData(char *data, int len)
> >  {
> >       XLogRecData *rdata;
> >
> > -     Assert(begininsert_called);
> > +     Assert(begininsert_called && len >= 0 && AllocSizeIsValid(len));
>
> Shouldn't we just make the length argument unsigned?

I've applied that in the attached revision, but I'd like to note that
this makes the fix less straightforward to backpatch, as the changes
to the public function signatures shouldn't be applied in older
versions.

> > -     if (num_rdatas >= max_rdatas)
> > +     /*
> > +      * Check against max_rdatas; and ensure we don't fill a record with
> > +      * more data than can be replayed
> > +      */
> > +     if (unlikely(num_rdatas >= max_rdatas ||
> > +                              !AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
> >               elog(ERROR, "too much WAL data");
> > +
> >       rdata = &rdatas[num_rdatas++];
>
> Personally I'd write it as unlikely(num_rdatas >= max_rdatas) || unlikely(...)
> but I doubt if it makes an actual difference to the compiler.

Agreed, updated.

> >       rdata->data = data;
> > @@ -377,7 +383,7 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
> >       registered_buffer *regbuf;
> >       XLogRecData *rdata;
> >
> > -     Assert(begininsert_called);
> > +     Assert(begininsert_called && len >= 0 && len <= UINT16_MAX);
> >
> >       /* find the registered buffer struct */
> >       regbuf = ®istered_buffers[block_id];
> > @@ -385,8 +391,14 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
> >               elog(ERROR, "no block with id %d registered with WAL insertion",
> >                        block_id);
> >
> > -     if (num_rdatas >= max_rdatas)
> > +     /*
> > +      * Check against max_rdatas; and ensure we don't register more data per
> > +      * buffer than can be handled by the physical record format.
> > +      */
> > +     if (unlikely(num_rdatas >= max_rdatas ||
> > +                              regbuf->rdata_len + len > UINT16_MAX))
> >               elog(ERROR, "too much WAL data");
> > +
> >       rdata = &rdatas[num_rdatas++];
>
> Given the repeated check it might be worth to just put it in a static inline
> used from the relevant places (which'd generate less code because the same
> line number would be used for all the checks).

The check itself is slightly different in those 3 places, but the
error message is shared. Do you mean to extract the elog() into a
static inline function (as attached), or did I misunderstand?


-Matthias

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
Apart from registering this in CF 2022-07 last Friday, I've also just
added this issue to the Open Items list for PG15 under "Older bugs
affecting stable branches"; as a precaution to not lose track of this
issue in the buzz of the upcoming feature freeze.

-Matthias



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
Seeing that the busiest time for PG15 - the last commitfest before the
feature freeze - has passed, could someone take another look at this?

The changes that were requested by Heikki and Andres have been merged
into patch v3, and I think it would be nice to fix this security issue
in the upcoming minor release(s).


-Matthias



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Mon, Apr 18, 2022 at 05:48:50PM +0200, Matthias van de Meent wrote:
> Seeing that the busiest time for PG15 - the last commitfest before the
> feature freeze - has passed, could someone take another look at this?

The next minor release is three weeks away, so now would be a good
time to get that addressed.  Heikki, Andres, are you planning to look
more at what has been proposed here?
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
David Zhang
Date:
Hi,

> > MaxAllocSize is pretty easy:
> > SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);
> >
> > on a standby:
> >
> > 2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record length 2145386550 at 0/3000060 too long
>
> Thanks for the reference. I was already playing around with 2PC log
> records (which can theoretically contain >4GB of data); but your
> example is much easier and takes significantly less time.

A little confused here: does this patch v3 intend to solve this problem "record length 2145386550 at 0/3000060 too long"?

I set up a simple Primary and Standby stream replication environment, and used the above query to run the test before and after patch v3. The error message still exists, but with a different message.

Before patch v3, the error is shown below:

2022-06-10 15:32:25.307 PDT [4253] LOG:  record length 2145386550 at 0/3000060 too long
2022-06-10 15:32:47.763 PDT [4257] FATAL:  terminating walreceiver process due to administrator command
2022-06-10 15:32:47.763 PDT [4253] LOG:  record length 2145386550 at 0/3000060 too long

After patch v3, the error is displayed differently:

2022-06-10 15:53:53.397 PDT [12848] LOG:  record length 2145386550 at 0/3000060 too long
2022-06-10 15:54:07.249 PDT [12852] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000045 has already been removed
2022-06-10 15:54:07.275 PDT [12848] LOG:  record length 2145386550 at 0/3000060 too long

And once the error happens, the standby can't continue replication.


Is there a particular reason to say "more datas" at line 52 in patch v3?

+ * more datas than are being accounted for by the XLog infrastructure.

On 2022-04-18 10:19 p.m., Michael Paquier wrote:
> On Mon, Apr 18, 2022 at 05:48:50PM +0200, Matthias van de Meent wrote:
> > Seeing that the busiest time for PG15 - the last commitfest before the
> > feature freeze - has passed, could someone take another look at this?
> The next minor release is three weeks away, so now would be a good
> time to get that addressed.  Heikki, Andres, are you planning to look
> more at what has been proposed here?
> --
> Michael

Thank you,

--
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Sat, 11 Jun 2022 at 01:32, David Zhang <david.zhang@highgo.ca> wrote:
>
> Hi,
>
> > > MaxAllocSize is pretty easy:
> > > SELECT pg_logical_emit_message(false, long, long) FROM repeat(repeat(' ', 1024), 1024*1023) as l(long);
> > >
> > > on a standby:
> > >
> > > 2022-03-11 16:41:59.336 PST [3639744][startup][1/0:0] LOG:  record length 2145386550 at 0/3000060 too long
> >
> > Thanks for the reference. I was already playing around with 2PC log
> > records (which can theoretically contain >4GB of data); but your
> > example is much easier and takes significantly less time.
>
> A little confused here, does this patch V3 intend to solve this problem "record length 2145386550 at 0/3000060 too long"?

No, not once the record exists. But it does remove Postgres' ability
to create such records, thereby solving the problem for all systems
that generate WAL through Postgres' WAL writing APIs.

> I set up a simple Primary and Standby stream replication environment, and use the above query to run the test for
> before and after patch v3. The error message still exist, but with different message.
>
> Before patch v3, the error is showing below,
>
> 2022-06-10 15:32:25.307 PDT [4253] LOG:  record length 2145386550 at 0/3000060 too long
> 2022-06-10 15:32:47.763 PDT [4257] FATAL:  terminating walreceiver process due to administrator command
> 2022-06-10 15:32:47.763 PDT [4253] LOG:  record length 2145386550 at 0/3000060 too long
>
> After patch v3, the error displays differently
>
> 2022-06-10 15:53:53.397 PDT [12848] LOG:  record length 2145386550 at 0/3000060 too long
> 2022-06-10 15:54:07.249 PDT [12852] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment
> 000000010000000000000045 has already been removed
> 2022-06-10 15:54:07.275 PDT [12848] LOG:  record length 2145386550 at 0/3000060 too long
>
> And once the error happens, then the Standby can't continue the replication.

Did you initiate a new cluster or otherwise skip the invalid record
you generated when running the instance based on master? It seems to
me you're trying to replay the invalid record (len > MaxAllocSize),
and this patch does not try to fix that issue. This patch just tries
to forbid emitting records larger than MaxAllocSize, as per the check
in XLogRecordAssemble, so that we won't emit unreadable records into
the WAL anymore.

Reading unreadable records still won't be possible, but that's also
not something I'm trying to fix.

> Is a particular reason to say "more datas" at line 52 in patch v3?
>
> + * more datas than are being accounted for by the XLog infrastructure.

Yes. This error is thrown when you try to register a 34th block, or an
Nth rdata where the caller previously reserved only n - 1 data slots.
Hence 'datas', referring to the num_rdatas and max_rdatas variables.
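
(For context, a caller that legitimately needs many data chunks is
expected to reserve the slots up front; roughly like this sketch, with
the chunk variables being hypothetical:)

    /* sketch: reserve extra rdata slots before starting the insertion */
    XLogEnsureRecordSpace(0, 100);      /* room for up to 100 data chunks */
    XLogBeginInsert();
    for (int i = 0; i < 100; i++)
        XLogRegisterData(chunks[i], chunk_len[i]);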

Thanks for looking at the patch.

- Matthias



>> A little confused here, does this patch V3 intend to solve this problem "record length 2145386550 at 0/3000060 too long"?
> No, not once the record exists. But it does remove Postgres' ability
> to create such records, thereby solving the problem for all systems
> that generate WAL through Postgres' WAL writing APIs.
>
>> I set up a simple Primary and Standby stream replication environment, and use the above query to run the test for
>> before and after patch v3. The error message still exist, but with different message.
>>
>> Before patch v3, the error is showing below,
>>
>> 2022-06-10 15:32:25.307 PDT [4253] LOG:  record length 2145386550 at 0/3000060 too long
>> 2022-06-10 15:32:47.763 PDT [4257] FATAL:  terminating walreceiver process due to administrator command
>> 2022-06-10 15:32:47.763 PDT [4253] LOG:  record length 2145386550 at 0/3000060 too long
>>
>> After patch v3, the error displays differently
>>
>> 2022-06-10 15:53:53.397 PDT [12848] LOG:  record length 2145386550 at 0/3000060 too long
>> 2022-06-10 15:54:07.249 PDT [12852] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment
>> 000000010000000000000045 has already been removed
>> 2022-06-10 15:54:07.275 PDT [12848] LOG:  record length 2145386550 at 0/3000060 too long
>>
>> And once the error happens, then the Standby can't continue the replication.
> Did you initiate a new cluster or otherwise skip the invalid record
> you generated when running the instance based on master? It seems to
> me you're trying to replay the invalid record (len > MaxAllocSize),
> and this patch does not try to fix that issue. This patch just tries
> to forbid emitting records larger than MaxAllocSize, as per the check
> in XLogRecordAssemble, so that we wont emit unreadable records into
> the WAL anymore.
>
> Reading unreadable records still won't be possible, but that's also
> not something I'm trying to fix.

Thanks a lot for the clarification. My testing environment is pretty
simple: initdb for the primary, run basebackup and set the connection string
for the standby, then run the "pg_logical_emit_message" query and tail the
log on the standby side.

Best regards,

-- 
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Sat, Jun 11, 2022 at 09:25:48PM +0200, Matthias van de Meent wrote:
> Did you initiate a new cluster or otherwise skip the invalid record
> you generated when running the instance based on master? It seems to
> me you're trying to replay the invalid record (len > MaxAllocSize),
> and this patch does not try to fix that issue. This patch just tries
> to forbid emitting records larger than MaxAllocSize, as per the check
> in XLogRecordAssemble, so that we wont emit unreadable records into
> the WAL anymore.
>
> Reading unreadable records still won't be possible, but that's also
> not something I'm trying to fix.

As long as you cannot generate such WAL records, that should be fine, as
WAL is not reused across upgrades, so this kind of restriction is a
no-brainer on HEAD.  The back-patching argument is not on the table
anyway, as some of the routine signatures change with the unsigned
arguments, because of those safety checks.

+   if (unlikely(num_rdatas >= max_rdatas) ||
+       unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
+       XLogErrorDataLimitExceeded();
[...]
+inline void
+XLogErrorDataLimitExceeded()
+{
+   elog(ERROR, "too much WAL data");
+}
The three checks are different, OK..  Note that static is missing.

+   if (unlikely(!AllocSizeIsValid(total_len)))
+       XLogErrorDataLimitExceeded();
Rather than a single check at the end of XLogRecordAssemble(), you'd
better look after that each time total_len is added up?
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Mon, 20 Jun 2022 at 07:02, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sat, Jun 11, 2022 at 09:25:48PM +0200, Matthias van de Meent wrote:
> > Did you initiate a new cluster or otherwise skip the invalid record
> > you generated when running the instance based on master? It seems to
> > me you're trying to replay the invalid record (len > MaxAllocSize),
> > and this patch does not try to fix that issue. This patch just tries
> > to forbid emitting records larger than MaxAllocSize, as per the check
> > in XLogRecordAssemble, so that we wont emit unreadable records into
> > the WAL anymore.
> >
> > Reading unreadable records still won't be possible, but that's also
> > not something I'm trying to fix.
>
> As long as you cannot generate such WAL records that should be fine as
> wAL is not reused across upgrades, so this kind of restriction is a
> no-brainer on HEAD.  The back-patching argument is not on the table
> anyway, as some of the routine signatures change with the unsigned
> arguments, because of those safety checks.

The signature change is mostly ornamental, see attached v4.backpatch.
The main reason for changing the signature is to make sure nobody can
provide a negative value, but it's not important to the patch.

>
> +   if (unlikely(num_rdatas >= max_rdatas) ||
> +       unlikely(!AllocSizeIsValid((uint64) mainrdata_len + (uint64) len)))
> +       XLogErrorDataLimitExceeded();
> [...]
> +inline void
> +XLogErrorDataLimitExceeded()
> +{
> +   elog(ERROR, "too much WAL data");
> +}
> The three checks are different, OK..

They each check slightly different things, but with the same error. In
RegisterData, it checks that the data can still be allocated and that the
running total does not overflow; in RegisterBufData, it checks that the
total length of data registered to the block does not exceed the maximum
value of XLogRecordBlockHeader->data_length. I've updated the comments
above the checks so that this distinction is more clear.

> Note that static is missing.

Fixed in attached v4.patch

> +   if (unlikely(!AllocSizeIsValid(total_len)))
> +       XLogErrorDataLimitExceeded();
> Rather than a single check at the end of XLogRecordAssemble(), you'd
> better look after that each time total_len is added up?

I was doing so previously, but there were some good arguments against that:

- Performance of XLogRecordAssemble should be impacted as little as
possible. XLogRecordAssemble is in many hot paths, and it is highly
unlikely this check will be hit, because nobody else has previously
reported this issue. Any check, however unlikely, will add some
overhead, so reducing the number of checks reduces the overhead of this patch.

- The user or system is unlikely to care about which specific check
was hit, and only needs to care _that_ the check was hit. An attached
debugger will be able to debug the internals of the xlog machinery and
find out the specific reasons for the error, but I see no specific
reason why the specific reason would need to be reported to the
connection.

Kind regards,

Matthias van de Meent

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Mon, Jun 20, 2022 at 11:01:51AM +0200, Matthias van de Meent wrote:
> On Mon, 20 Jun 2022 at 07:02, Michael Paquier <michael@paquier.xyz> wrote:
>> +   if (unlikely(!AllocSizeIsValid(total_len)))
>> +       XLogErrorDataLimitExceeded();
>> Rather than a single check at the end of XLogRecordAssemble(), you'd
>> better look after that each time total_len is added up?
>
> I was doing so previously, but there were some good arguments against that:
>
> - Performance of XLogRecordAssemble should be impacted as little as
> possible. XLogRecordAssemble is in many hot paths, and it is highly
> unlikely this check will be hit, because nobody else has previously
> reported this issue. Any check, however unlikely, will add some
> overhead, so removing check counts reduces overhead of this patch.

Some macro-benchmarking could be in place here, and this would most
likely become noticeable when assembling a bunch of little records?

> - The user or system is unlikely to care about which specific check
> was hit, and only needs to care _that_ the check was hit. An attached
> debugger will be able to debug the internals of the xlog machinery and
> find out the specific reasons for the error, but I see no specific
> reason why the specific reason would need to be reported to the
> connection.

Okay.

+   /*
+    * Ensure that xlogreader.c can read the record by ensuring that the
+    * data section of the WAL record can be allocated.
+    */
+   if (unlikely(!AllocSizeIsValid(total_len)))
+       XLogErrorDataLimitExceeded();

By the way, while skimming through the patch, the WAL reader seems to
be a bit more pessimistic than this estimation, calculating the amount
to allocate as of DecodeXLogRecordRequiredSpace(), based on the
xl_tot_len given by a record.
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Tue, 21 Jun 2022 at 03:45, Michael Paquier <michael@paquier.xyz> wrote:
> +   /*
> +    * Ensure that xlogreader.c can read the record by ensuring that the
> +    * data section of the WAL record can be allocated.
> +    */
> +   if (unlikely(!AllocSizeIsValid(total_len)))
> +       XLogErrorDataLimitExceeded();
>
> By the way, while skimming through the patch, the WAL reader seems to
> be a bit more pessimistic than this estimation, calculating the amount
> to allocate as of DecodeXLogRecordRequiredSpace(), based on the
> xl_tot_len given by a record.

I see, thanks for notifying me about that.

PFA a correction for that issue. It does copy over the value for
MaxAllocSize from memutils.h into xlogreader.h, because we need that
value in FRONTEND builds too, and memutils.h can't be included in
FRONTEND builds. One file is suffixed with .backpatch and doesn't
include the function signature changes, but it is not optimized for
any stable branch [15].

-Matthias

PS. I'm not amused by the double copy we do in the xlogreader, as I
had expected we'd just read the record and point into that single
xl_rec_len-sized buffer. Apparently that's not how it works...

[15] it should apply to stable branches all the way back to
REL_15_STABLE and still work as expected. Any older than that I
haven't tested, but probably only require some updates for
XLogRecMaxLength in xlogreader.h.

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
David Zhang
Date:
Hi,

I tried to apply this patch v5 to the current master branch, but it complains:
"git apply --check 
v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch
error: patch failed: src/include/access/xloginsert.h:43
error: src/include/access/xloginsert.h: patch does not apply"

Then I checked out the tree before commit
`b0a55e43299c4ea2a9a8c757f9c26352407d0ccc` and applied this v5 patch.

1) both make check and make installcheck passed.

2) I can also see that this patch v5 prevents the error that happened previously:

"postgres=# SELECT pg_logical_emit_message(false, long, long) FROM 
repeat(repeat(' ', 1024), 1024*1023) as l(long);
ERROR:  too much WAL data"

3) Without this v5 patch, the same test will cause the standby to crash
as shown below, and the standby is not able to boot up after this crash.

"2022-07-08 12:28:16.425 PDT [2363] FATAL:  invalid memory alloc request 
size 2145388995
2022-07-08 12:28:16.426 PDT [2360] LOG:  startup process (PID 2363) 
exited with exit code 1
2022-07-08 12:28:16.426 PDT [2360] LOG:  terminating any other active 
server processes
2022-07-08 12:28:16.427 PDT [2360] LOG:  shutting down due to startup 
process failure
2022-07-08 12:28:16.428 PDT [2360] LOG:  database system is shut down"


Best regards,

-- 
David

Software Engineer
Highgo Software Inc. (Canada)
www.highgo.ca



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Fri, 8 Jul 2022 at 21:35, David Zhang <david.zhang@highgo.ca> wrote:
>
> Hi,
>
> I tried to apply this patch v5 to current master branch but it complains,
> "git apply --check
> v5-0001-Add-protections-in-xlog-record-APIs-against-large.patch
> error: patch failed: src/include/access/xloginsert.h:43
> error: src/include/access/xloginsert.h: patch does not apply"
>
> then I checked it out before the commit
> `b0a55e43299c4ea2a9a8c757f9c26352407d0ccc` and applied this v5 patch.

The attached rebased patchset should work with master @ 2cd2569c and
REL_15_STABLE @ 53df1e28. I've also added a patch that works for PG14
and earlier, which should be correct for all versions that include
commit 2c03216d (that is, all versions back to 9.5).

> 1) both make check and make installcheck passed.
>
> 2) and I can also see this patch v5 prevents the error happens previously,
>
> "postgres=# SELECT pg_logical_emit_message(false, long, long) FROM
> repeat(repeat(' ', 1024), 1024*1023) as l(long);
> ERROR:  too much WAL data"
>
> 3) without this v5 patch, the same test will cause the standby crash
> like below, and the standby not be able to boot up after this crash.

Thanks for reviewing.

Kind regards,

Matthias van de Meent

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote:
> Thanks for reviewing.

I think that v6 is over-engineered because there should be no need to
add a check in xlogreader.c as long as the origin of the problem is
blocked, no?  And the origin here is when the record is assembled.  At
least this is the cleanest solution for HEAD, but not in the
back-branches if we'd care about doing something with records already
generated, and I am not sure that we need to care about other things
than HEAD, TBH.  So it seems to me that there is no need to create a
XLogRecMaxLength which is close to a duplicate of
DecodeXLogRecordRequiredSpace().

@@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
                   XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)

 {
    XLogRecData *rdt;
-   uint32      total_len = 0;
+   uint64      total_len = 0;
This has no need to change.

My suggestion from upthread was close to what you proposed, but I had
in mind something simpler, as of:

+   /*
+    * Ensure that xlogreader.c can read the record.
+    */
+   if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len))))
+       elog(ERROR, "too much WAL data");

This would be the amount of data allocated by the WAL reader when it
is possible to allocate an oversized record, related to the business
of the circular buffer depending on if the read is blocking or not.

Among the two problems to solve at hand, the parts where the APIs are
changed and made more robust with unsigned types and where block data
is not overflowed with its 16-bit limit are committable, so I'd like
to do that first (still need to check its performance with some micro
benchmark on XLogRegisterBufData()).  The second part to block the
creation of the assembled record is simpler, now
DecodeXLogRecordRequiredSpace() would make the path a bit hotter,
though we could inline it in the worst case?
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Mon, 14 Mar 2022 at 18:14, Andres Freund <andres@anarazel.de> wrote:
>
> A random thought I had while thinking about the size limits: We could use the
> low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
> XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
> allow us to e.g. get away from needing Heap2. Which would aestethically be
> pleasing.

I just remembered your comment while going through the xlog code, and
had this thought about the same issue: We still have 2 bytes of padding in
XLogRecord, between xl_rmid and xl_crc. Can't we instead use that
space for rmgr-specific flags, as opposed to stealing bits from
xl_info?
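
For reference, the record header being discussed looks roughly like this
(per xlogrecord.h):

    typedef struct XLogRecord
    {
        uint32          xl_tot_len;     /* total len of entire record */
        TransactionId   xl_xid;         /* xact id */
        XLogRecPtr      xl_prev;        /* ptr to previous record in log */
        uint8           xl_info;        /* flag bits */
        RmgrId          xl_rmid;        /* resource manager for this record */
        /* 2 bytes of padding here, initialize to zero */
        pg_crc32c       xl_crc;         /* CRC for this record */

        /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
    } XLogRecord;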


Kind regards,

Matthias van de Meent



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Andres Freund
Date:
Hi,

On 2022-07-15 11:25:54 +0200, Matthias van de Meent wrote:
> On Mon, 14 Mar 2022 at 18:14, Andres Freund <andres@anarazel.de> wrote:
> >
> > A random thought I had while thinking about the size limits: We could use the
> > low bits of the length and xl_prev to store XLR_SPECIAL_REL_UPDATE |
> > XLR_CHECK_CONSISTENCY and give rmgrs the full 8 bit of xl_info. Which would
> > allow us to e.g. get away from needing Heap2. Which would aestethically be
> > pleasing.
> 
> I just remembered your comment while going through the xlog code and
> thought this about the same issue: We still have 2 bytes of padding in
> XLogRecord, between xl_rmid and xl_crc. Can't we instead use that
> space for rmgr-specific flags, as opposed to stealing bits from
> xl_info?

Sounds like a good idea to me. I'm not sure who is stealing bits from what
right now, but it clearly seems worthwhile to separate "flags" from "record
type within rmgr".

I think we should split it at least into three things:

1) generic per-record flags for xlog machinery (ie. XLR_SPECIAL_REL_UPDATE, XLR_CHECK_CONSISTENCY)
2) rmgr record type identifier (e.g. XLOG_HEAP_*)
3) rmgr specific flags (e.g. XLOG_HEAP_INIT_PAGE)

Greetings,

Andres Freund



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Heikki Linnakangas
Date:
On 13/07/2022 08:54, Michael Paquier wrote:
> I think that v6 is over-engineered because there should be no need to
> add a check in xlogreader.c as long as the origin of the problem is
> blocked, no?  And the origin here is when the record is assembled.  At
> least this is the cleanest solution for HEAD, but not in the
> back-branches if we'd care about doing something with records already
> generated, and I am not sure that we need to care about other things
> than HEAD, TBH.  So it seems to me that there is no need to create a
> XLogRecMaxLength which is close to a duplicate of
> DecodeXLogRecordRequiredSpace().
> 
> @@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>                     XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
> 
>   {
>      XLogRecData *rdt;
> -   uint32      total_len = 0;
> +   uint64      total_len = 0;
> This has no need to change.
> 
> My suggestion from upthread was close to what you proposed, but I had
> in mind something simpler, as of:
> 
> +   /*
> +    * Ensure that xlogreader.c can read the record.
> +    */
> +   if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len))))
> +       elog(ERROR, "too much WAL data");
> 
> This would be the amount of data allocated by the WAL reader when it
> is possible to allocate an oversized record, related to the business
> of the circular buffer depending on if the read is blocking or not.

The way this is written, it would change whenever we add/remove fields 
in DecodedBkpBlock, for example. That's fragile; if you added a field in 
a back-branch, you could accidentally make the new minor version unable 
to read maximum-sized WAL records generated with an older version. I'd 
like the maximum to be more explicit.

How large exactly is the maximum size that this gives? I'd prefer to set 
the limit conservatively to 1020 MB, for example, with a compile-time 
static assertion that 
AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).

> Among the two problems to solve at hand, the parts where the APIs are
> changed and made more robust with unsigned types and where block data
> is not overflowed with its 16-bit limit are committable, so I'd like
> to do that first (still need to check its performance with some micro
> benchmark on XLogRegisterBufData()).

+1. I'm not excited about adding the "unlikely()" hints, though. We have 
a pg_attribute_cold hint in ereport(), that should be enough.

- Heikki



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Wed, 13 Jul 2022 at 07:54, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jul 11, 2022 at 02:26:46PM +0200, Matthias van de Meent wrote:
> > Thanks for reviewing.
>
> I think that v6 is over-engineered because there should be no need to
> add a check in xlogreader.c as long as the origin of the problem is
> blocked, no?  And the origin here is when the record is assembled.  At
> least this is the cleanest solution for HEAD, but not in the
> back-branches if we'd care about doing something with records already
> generated, and I am not sure that we need to care about other things
> than HEAD, TBH.

I would prefer it if we would fix the "cannot catch up to primary
because of oversized WAL record" issue in backbranches too. Rather
than failing to recover after failure or breaking replication streams,
I'd rather be unable to write the singular offending WAL record and
break at most one transaction.

> So it seems to me that there is no need to create a
> XLogRecMaxLength which is close to a duplicate of
> DecodeXLogRecordRequiredSpace().
>
> @@ -519,7 +549,7 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>                    XLogRecPtr *fpw_lsn, int *num_fpi, bool *topxid_included)
>
>  {
>     XLogRecData *rdt;
> -   uint32      total_len = 0;
> +   uint64      total_len = 0;
> This has no need to change.
>
> My suggestion from upthread was close to what you proposed, but I had
> in mind something simpler, as of:
>
> +   /*
> +    * Ensure that xlogreader.c can read the record.
> +    */
> +   if (unlikely(!AllocSizeIsValid(DecodeXLogRecordRequiredSpace(total_len))))
> +       elog(ERROR, "too much WAL data");

Huh, yeah, I hadn't thought of that, but that's much simpler indeed.

> This would be the amount of data allocated by the WAL reader when it
> is possible to allocate an oversized record, related to the business
> of the circular buffer depending on if the read is blocking or not.

Yes, I see your point.

> Among the two problems to solve at hand, the parts where the APIs are
> changed and made more robust with unsigned types and where block data
> is not overflowed with its 16-byte limit are committable, so I'd like
> to do that first (still need to check its performance with some micro
> benchmark on XLogRegisterBufData()).

> The second part to block the
> creation of the assembled record is simpler, now
> DecodeXLogRecordRequiredSpace() would make the path a bit hotter,
> though we could inline it in the worst case?

I think that would be better for performance, yes.
DecodeXLogRecordRequiredSpace will already be optimized to just a
single addition by any of `-O[123]`, so keeping this indirection is
quite expensive (relative to the operation being performed).

As for your patch patch:

> +XLogRegisterData(char *data, uint32 len)
> {
>     XLogRecData *rdata;
>
>     Assert(begininsert_called);
>
> -    if (num_rdatas >= max_rdatas)
> +    if (unlikely(num_rdatas >= max_rdatas))
>         elog(ERROR, "too much WAL data");
>     rdata = &rdatas[num_rdatas++];

XLogRegisterData is designed to be called multiple times for each
record, and this allows the user of the API to overflow the internal
mainrdata_len field if we don't check that the field does not exceed
the maximum record length (or overflow the 32-bit field). As such, I'd
still want a len-check in that function.

I'll send an updated patch by tomorrow.

- Matthias



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Mon, Jul 25, 2022 at 02:12:05PM +0300, Heikki Linnakangas wrote:
> The way this is written, it would change whenever we add/remove fields in
> DecodedBkpBlock, for example. That's fragile; if you added a field in a
> back-branch, you could accidentally make the new minor version unable to
> read maximum-sized WAL records generated with an older version. I'd like the
> maximum to be more explicit.

That's a good point.

> How large exactly is the maximum size that this gives? I'd prefer to set the
> limit conservatively to 1020 MB, for example, with a compile-time static
> assertion that AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).

Something like that would work, I guess.

>  Among the two problems to solve at hand, the parts where the APIs are
>  changed and made more robust with unsigned types and where block data
>  is not overflowed with its 16-bit limit are committable, so I'd like
>  to do that first (still need to check its performance with some micro
>  benchmark on XLogRegisterBufData()).
>
> +1. I'm not excited about adding the "unlikely()" hints, though. We have a
> pg_attribute_cold hint in ereport(), that should be enough.

Okay, that makes sense.  FWIW, I have been wondering about the
addition of the extra condition in XLogRegisterBufData() and I did not
see a difference on HEAD in terms of execution time or profile, with a
micro-benchmark doing a couple of million calls in a row as of the
following, roughly:
    // Can be anything, really..
    rel = relation_open(RelationRelationId, AccessShareLock);
    buffer = ReadBuffer(rel, 0);
    for (i = 0 ; i < WAL_MAX_CALLS ; i++)
    {
        XLogBeginInsert();
        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        XLogRegisterBufData(0, buf, 10);
        XLogResetInsertion();
    }
    ReleaseBuffer(buffer);
    relation_close(rel, AccessShareLock);
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Tue, 26 Jul 2022 at 09:20, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Jul 25, 2022 at 02:12:05PM +0300, Heikki Linnakangas wrote:
> >  Among the two problems to solve at hand, the parts where the APIs are
> >  changed and made more robust with unsigned types and where block data
> >  is not overflowed with its 16-bit limit are committable, so I'd like
> >  to do that first (still need to check its performance with some micro
> >  benchmark on XLogRegisterBufData()).
> >
> > +1. I'm not excited about adding the "unlikely()" hints, though. We have a
> > pg_attribute_cold hint in ereport(), that should be enough.
>
> Okay, that makes sense.  FWIW, I have been wondering about the
> addition of the extra condition in XLogRegisterBufData() and I did not
> see a difference on HEAD in terms of execution time or profile, with a
> micro-benchmark doing a couple of million calls in a row as of the
> following, roughly:
>     [...]

Thanks for testing.

> > How large exactly is the maximum size that this gives? I'd prefer to set the
> > limit conservatively to 1020 MB, for example, with a compile-time static
> > assertion that AllocSizeIsValid(DecodeXLogRecordRequiredSpace(1020 MB)).
>
> Something like that would work, I guess.

I've gone over the patch and reviews again, and updated those places
that received comments:

- updated the MaxXLogRecordSize and XLogRecordLengthIsValid(len)
macros (now in xlogrecord.h), with a max length of the somewhat
arbitrary 1020MiB.
 This leaves room for approx. 4MiB of per-record allocation overhead
before you'd hit MaxAllocSize, and also detaches the dependency on
memutils.h.

- Retained the check in XLogRegisterData, so that we check against
integer overflows in the registerdata code instead of only an assert
in XLogRecordAssemble where it might be too late.
- Kept the inline static elog-ing function (as per Andres' suggestion
on 2022-03-14; this decreases binary sizes)
- Dropped any changes in xlogreader.h/c

Kind regards,

Matthias van de Meent

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote:
> - Retained the check in XLogRegisterData, so that we check against
> integer overflows in the registerdata code instead of only an assert
> in XLogRecordAssemble where it might be too late.

Why?  The record has not been inserted yet.  I would tend to keep only
the check at the bottom of XLogRecordAssemble(), for simplicity, and
call it a day.

> - Kept the inline static elog-ing function (as per Andres' suggestion
> on 2022-03-14; this decreases binary sizes)

I am not really convinced that this one is worth doing.

+#define MaxXLogRecordSize  (1020 * 1024 * 1024)
+
+#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)

These are used only in xloginsert.c, so we could keep them isolated.

+ * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
s/hhis/this/.

For now, I have extracted from the patch the two API changes and the
checks for the block information for uint16, and applied this part.
That's one less thing to worry about.
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Wed, 27 Jul 2022 at 11:09, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Jul 26, 2022 at 06:58:02PM +0200, Matthias van de Meent wrote:
> > - Retained the check in XLogRegisterData, so that we check against
> > integer overflows in the registerdata code instead of only an assert
> > in XLogRecordAssemble where it might be too late.
>
> Why?  The record has not been inserted yet.  I would tend to keep only
> the check at the bottom of XLogRecordAssemble(), for simplicity, and
> call it a day.

Because the sum value mainrdata_len can easily overflow in both the
current and the previous APIs, which then corrupts the WAL - one of
the two issues that I mentioned when I started the thread.

We don't re-summarize the lengths of all XLogRecData segments for the
main record data when assembling a record, to keep the performance of
XLogRecordAssemble (probably to limit the complexity when many data
segments are registered). And because I didn't want to add more changes
than necessary, this check will need to be done in the place where the
overflow may occur, which is in XLogRegisterData.
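
I.e., keep something along these lines in XLogRegisterData (a sketch;
the exact shape in v8 may differ, but it uses the macros described above):

    /* sketch: bound the running main-data total where the overflow could happen */
    if (unlikely(!XLogRecordLengthIsValid((uint64) mainrdata_len + (uint64) len)))
        XLogErrorDataLimitExceeded();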

> > - Kept the inline static elog-ing function (as per Andres' suggestion
> > on 2022-03-14; this decreases binary sizes)
>
> I am not really convinced that this one is worth doing.

I'm not married to that change, but I also don't see why this can't be
updated while this code is already being touched.

> +#define MaxXLogRecordSize  (1020 * 1024 * 1024)
> +
> +#define XLogRecordLengthIsValid(len) ((len) >= 0 && (len) < MaxXLogRecordSize)
>
> These are used only in xloginsert.c, so we could keep them isolated.

They might be only used in xloginsert right now, but that's not the
point. This is now advertised as part of the record API spec: A record
larger than 1020MB is explicitly not supported. If it was kept
internal to xloginsert, that would be implicit and other people might
start hitting issues similar to those we're hitting right now -
records that are too large to read. Although PostgreSQL is usually the
only one generating WAL, we do support physical replication from
arbitrary PG-compatible WAL streams, which means that any compatible
WAL source could be the origin of our changes - and those need to be
aware of the assumptions we make about the WAL format.

I'm fine with also updating xlogreader.c to check this while reading
records to clarify the limits there as well, if so desired.

> + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
> s/hhis/this/.

Will be included in the next update..

> For now, I have extracted from the patch the two API changes and the
> checks for the block information for uint16, and applied this part.
> That's one less thing to worry about.

Thanks.

Kind regards,

Matthias van de Meent



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
Hi Matthias,

On Wed, Jul 27, 2022 at 02:07:05PM +0200, Matthias van de Meent wrote:

My apologies for the time it took me to come back to this thread.
> > + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
> > s/hhis/this/.
>
> Will be included in the next update..

v8 fails to apply.  Could you send a rebased version?

As far as I recall the problems with the block image sizes are solved,
but we still have a bit more to do in terms of the overall record
size.  Perhaps there are some parts of the patch you'd like to
revisit?

For now, I have switched the patch back to waiting on author, and moved
it to the next CF.
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Ian Lawrence Barwick
Date:
On Wed, 5 Oct 2022 at 16:46, Michael Paquier <michael@paquier.xyz> wrote:
>
> Hi Matthias,
>
> On Wed, Jul 27, 2022 at 02:07:05PM +0200, Matthias van de Meent wrote:
>
> My apologies for the time it took me to come back to this thread.
> > > + * To accommodate some overhead, hhis MaxXLogRecordSize value allows for
> > > s/hhis/this/.
> >
> > Will be included in the next update..
>
> v8 fails to apply.  Could you send a rebased version?
>
> As far as I recall the problems with the block image sizes are solved,
> but we still have a bit more to do in terms of the overall record
> size.  Perhaps there are some parts of the patch you'd like to
> revisit?
>
> For now, I have switched the patch back to waiting on author, and
> moved it to the next CF.

Hi Matthias

CommitFest 2022-11 is currently underway, so if you are interested
in moving this patch forward, now would be a good time to update it.

Thanks

Ian Barwick



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
> CommitFest 2022-11 is currently underway, so if you are interested
> in moving this patch forward, now would be a good time to update it.

No replies after 4 weeks, so I have marked this entry as returned
with feedback.  I am still wondering what would be the best thing to
do here..
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Andres Freund
Date:
Hi,

On 2022-12-02 14:22:55 +0900, Michael Paquier wrote:
> On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
> > CommitFest 2022-11 is currently underway, so if you are interested
> > in moving this patch forward, now would be a good time to update it.
> 
> No replies after 4 weeks, so I have marked this entry as returned
> with feedback.  I am still wondering what would be the best thing to
> do here..

IMO this is a bugfix; I don't think we can just close the entry, even if
Matthias doesn't have time / energy to push it forward.


I think the big issue with the patch as it stands is that it will typically
cause PANICs on failure, because the record-too-large ERROR would be raised
inside a critical section. That's still better than generating a record that
can't be replayed, but it's not good.

There aren't all that many places with potentially huge records. I wonder if
we ought to modify at least the most prominent ones to prepare the record
before the critical section. I think by far the most prominent real-world
case is RecordTransactionCommit(). We could rename XactLogCommitRecord() to
XactBuildCommitRecord(), have it build the commit record, then have the
caller do START_CRIT_SECTION(), set DELAY_CHKPT_START, and only then do the
XLogInsert().

That'd even have the nice side-effect of reducing the window in which
DELAY_CHKPT_START is set a bit.
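
Roughly, the flow I am thinking of looks like this - just a sketch, where
XactBuildCommitRecord() is the hypothetical split-out record-building half
of today's XactLogCommitRecord(), and the flush / clog steps are omitted:

	XLogRecPtr	recptr;

	/*
	 * Sketch only.  Register all commit-record data while an ERROR can
	 * still be raised cleanly, i.e. before entering the critical section.
	 */
	XactBuildCommitRecord(/* same arguments as XactLogCommitRecord() today */);

	START_CRIT_SECTION();
	MyProc->delayChkptFlags |= DELAY_CHKPT_START;

	/* Only the insertion itself stays inside the critical section. */
	recptr = XLogInsert(RM_XACT_ID, XLOG_XACT_COMMIT);

	/* ... XLogFlush(), clog update, etc. as today ... */

	MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
	END_CRIT_SECTION();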

Greetings,

Andres Freund



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Andres Freund
Date:
Hi,

On 2022-07-26 18:58:02 +0200, Matthias van de Meent wrote:
> - updated the MaxXLogRecordSize and XLogRecordLengthIsValid(len)
> macros (now in xlogrecord.h), with a max length of the somewhat
> arbitrary 1020MiB.
>  This leaves room for approx. 4MiB of per-record allocation overhead
> before you'd hit MaxAllocSize, and also detaches the dependency on
> memutils.h.
> 
> - Retained the check in XLogRegisterData, so that we check against
> integer overflows in the registerdata code instead of only an assert
> in XLogRecordAssemble where it might be too late.
> - Kept the inline static elog-ing function (as per Andres' suggestion
> on 2022-03-14; this decreases binary sizes)

I don't think it should be a static inline. It should be a *non*-inlined
function, so we don't include the code for the elog in the callers.


> +/*
> + * Error due to exceeding the maximum size of a WAL record, or registering
> + * more datas than are being accounted for by the XLog infrastructure.
> + */
> +static inline void
> +XLogErrorDataLimitExceeded()
> +{
> +    elog(ERROR, "too much WAL data");
> +}

I think this should be pg_noinline, as mentioned above.


>  /*
>   * Begin constructing a WAL record. This must be called before the
>   * XLogRegister* functions and XLogInsert().
> @@ -348,14 +359,29 @@ XLogRegisterBlock(uint8 block_id, RelFileLocator *rlocator, ForkNumber forknum,
>   * XLogRecGetData().
>   */
>  void
> -XLogRegisterData(char *data, int len)
> +XLogRegisterData(char *data, uint32 len)
>  {
>      XLogRecData *rdata;
>  
> -    Assert(begininsert_called);
> +    Assert(begininsert_called && XLogRecordLengthIsValid(len));
> +
> +    /*
> +     * Check against max_rdatas; and ensure we don't fill a record with
> +     * more data than can be replayed. Records are allocated in one chunk
> +     * with some overhead, so ensure XLogRecordLengthIsValid() for that
> +     * size of record.
> +     *
> +     * Additionally, check that we don't accidentally overflow the
> +     * intermediate sum value on 32-bit systems by ensuring that the
> +     * sum of the two inputs is no less than one of the inputs.
> +     */
> +    if (num_rdatas >= max_rdatas ||
> +#if SIZEOF_SIZE_T == 4
> +         mainrdata_len + len < len ||
> +#endif
> +        !XLogRecordLengthIsValid((size_t) mainrdata_len + (size_t) len))
> +        XLogErrorDataLimitExceeded();

This is quite a complicated check, and the SIZEOF_SIZE_T == 4 bit is fairly
ugly.

I think we should make mainrdata_len a uint64, then we don't have to worry
about it overflowing on 32bit systems. And TBH, we don't care about some minor
inefficiency on 32bit systems.
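
For illustration, the simplified guard could look roughly like this (a
sketch assuming mainrdata_len becomes a uint64 and the elog helper from the
patch is kept; not meant as the final code):

	void
	XLogRegisterData(char *data, uint32 len)
	{
		XLogRecData *rdata;

		Assert(begininsert_called);

		/*
		 * With a 64-bit accumulator a single widened comparison suffices;
		 * no SIZEOF_SIZE_T special case is needed.
		 */
		if (num_rdatas >= max_rdatas ||
			!XLogRecordLengthIsValid((uint64) mainrdata_len + len))
			XLogErrorDataLimitExceeded();

		rdata = &rdatas[num_rdatas++];
		rdata->data = data;
		rdata->len = len;

		/* chain onto the main-data list, as before */
		mainrdata_last->next = rdata;
		mainrdata_last = rdata;

		mainrdata_len += len;
	}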



> @@ -399,8 +425,16 @@ XLogRegisterBufData(uint8 block_id, char *data, int len)
>          elog(ERROR, "no block with id %d registered with WAL insertion",
>               block_id);
>  
> -    if (num_rdatas >= max_rdatas)
> -        elog(ERROR, "too much WAL data");
> +    /*
> +     * Check against max_rdatas; and ensure we don't register more data per
> +     * buffer than can be handled by the physical data format; 
> +     * i.e. that regbuf->rdata_len does not grow beyond what
> +     * XLogRecordBlockHeader->data_length can hold.
> +     */
> +    if (num_rdatas >= max_rdatas ||
> +        regbuf->rdata_len + len > UINT16_MAX)
> +        XLogErrorDataLimitExceeded();
> +
>      rdata = &rdatas[num_rdatas++];
>  
>      rdata->data = data;

This has partially been applied in ffd1b6bb6f8; I think we should consider
adding XLogErrorDataLimitExceeded() separately too.


>              rdt_datas_last->next = regbuf->rdata_head;
> @@ -858,6 +907,16 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>      for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
>          COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
>  
> +    /*
> +     * Ensure that the XLogRecord is not too large.
> +     *
> +     * XLogReader machinery is only able to handle records up to a certain
> +     * size (ignoring machine resource limitations), so make sure we will
> +     * not emit records larger than those sizes we advertise we support.
> +     */
> +    if (!XLogRecordLengthIsValid(total_len))
> +        XLogErrorDataLimitExceeded();
> +
>      /*
>       * Fill in the fields in the record header. Prev-link is filled in later,
>       * once we know where in the WAL the record will be inserted. The CRC does

I think this needs to mention that it'll typically cause a PANIC.


Greetings,

Andres Freund



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Alvaro Herrera
Date:
On 2022-Dec-02, Andres Freund wrote:

> Hi,
> 
> On 2022-12-02 14:22:55 +0900, Michael Paquier wrote:
> > On Fri, Nov 04, 2022 at 09:52:39AM +0900, Ian Lawrence Barwick wrote:
> > > CommitFest 2022-11 is currently underway, so if you are interested
> > > in moving this patch forward, now would be a good time to update it.
> > 
> > No replies after 4 weeks, so I have marked this entry as returned
> > with feedback.  I am still wondering what would be the best thing to
> > do here..
> 
> IMO this is a bugfix; I don't think we can just close the entry, even if
> Matthias doesn't have time / energy to push it forward.

I have created one in the January commitfest,
https://commitfest.postgresql.org/41/
and rebased the patch on current master.  (I have not reviewed this.)

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La gente vulgar sólo piensa en pasar el tiempo;
el que tiene talento, en aprovecharlo"

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Mon, Dec 19, 2022 at 12:37:19PM +0100, Alvaro Herrera wrote:
> I have created one in the January commitfest,
> https://commitfest.postgresql.org/41/
> and rebased the patch on current master.  (I have not reviewed this.)

I have spent some time on that, and here are some comments with an
updated version of the patch attached.

The checks in XLogRegisterData() seemed overcomplicated to me.  In
this context, I think that we should just care about making sure that
mainrdata_len does not overflow depending on the length given by the
caller, which is where pg_add_u32_overflow() becomes handy.

XLogRegisterBufData() added a check on UINT16_MAX in an assert, though
we already check for overflow a couple of lines down.  This is not
necessary, it seems.

@@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
    XLogRecord *rechdr;
    char       *scratch = hdr_scratch;

+   /* ensure that any assembled record can be decoded */
+   Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));

A hardcoded check like that has no need to be in a code path triggered
each time a WAL record is assembled.  One place where this could go is
InitXLogInsert().  It still means that it is run once for each
backend, but having it where xloginsert.c is initialized feels
natural, at least.  A postmaster-only location would be enough, as well.
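
Concretely, I am thinking of something like the following in
InitXLogInsert() (only a sketch, reusing the macro name from your patch):

	void
	InitXLogInsert(void)
	{
		/*
		 * One-time sanity check: any record we are willing to assemble
		 * must remain decodable, i.e. the reader-side allocation for a
		 * maximum-size record has to fit under MaxAllocSize.
		 */
		Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));

		/* existing initialization of rdatas, registered_buffers, etc. */
	}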

XLogRecordMaxSize just needs to be checked once IMO, around the end of
XLogRecordAssemble() once we know the total size of the record that
will be fed to a XLogReader.  One thing that we should be more careful
of is to make sure that total_len does not overflow its uint32 value
while assembling the record, as well.
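
In other words, something like this at the end of XLogRecordAssemble(),
once total_len is final (a sketch; the exact message wording is not final):

	/*
	 * Ensure that the assembled record is not too large for the XLogReader
	 * machinery.  total_len is kept in a type wide enough that the sum
	 * cannot itself overflow before reaching this point.
	 */
	if (total_len > XLogRecordMaxSize)
		ereport(ERROR,
				(errmsg_internal("oversized WAL record"),
				 errdetail_internal("WAL record would be %llu bytes (of maximum %u bytes).",
									(unsigned long long) total_len,
									XLogRecordMaxSize)));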

I have removed XLogErrorDataLimitExceeded(), replacing it with more
context about the errors happening.  Perhaps this has no need to be
that much verbose, but it can be really useful for developers.

Some comments had no need to be updated, and there were some typos.

I am on board with the idea of a XLogRecordMaxSize that's bounded at
1020MB, leaving 4MB as room for the extra data needed by a
XLogReader.

At the end, I think that this is quite interesting long-term.  For
example, if we lift up XLogRecordMaxSize, we can evaluate the APIs
adding buffer data or main data separately.

Thoughts about this version?
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Tue, 28 Mar 2023 at 13:42, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Mon, Dec 19, 2022 at 12:37:19PM +0100, Alvaro Herrera wrote:
> > I have created one in the January commitfest,
> > https://commitfest.postgresql.org/41/
> > and rebased the patch on current master.  (I have not reviewed this.)
>
> I have spent some time on that, and here are some comments with an
> updated version of the patch attached.
>
> The checks in XLogRegisterData() seemed overcomplicated to me.  In
> this context, I think that we should just care about making sure that
> mainrdata_len does not overflow depending on the length given by the
> caller, which is where pg_add_u32_overflow() becomes handy.
>
> XLogRegisterBufData() added a check on UINT16_MAX in an assert, though
> we already check for overflow a couple of lines down.  This is not
> necessary, it seems.
>
> @@ -535,6 +567,9 @@ XLogRecordAssemble(RmgrId rmid, uint8 info,
>     XLogRecord *rechdr;
>     char       *scratch = hdr_scratch;
>
> +   /* ensure that any assembled record can be decoded */
> +   Assert(AllocSizeIsValid(DecodeXLogRecordRequiredSpace(MaxXLogRecordSize)));
>
> A hardcoded check like that has no need to be in a code path triggered
> each time a WAL record is assembled.  One place where this could go is
> InitXLogInsert().  It still means that it is run once for each
> backend, but having it where xloginsert.c is initialized feels
> natural, at least.  A postmaster-only location would be enough, as well.
>
> XLogRecordMaxSize just needs to be checked once IMO, around the end of
> XLogRecordAssemble() once we know the total size of the record that
> will be fed to a XLogReader.  One thing that we should be more careful
> of is to make sure that total_len does not overflow its uint32 value
> while assembling the record, as well.
>
> I have removed XLogErrorDataLimitExceeded(), replacing it with more
> context about the errors happening.  Perhaps this has no need to be
> that much verbose, but it can be really useful for developers.
>
> Some comments had no need to be updated, and there were some typos.
>
> I am on board with the idea of a XLogRecordMaxSize that's bounded at
> 1020MB, leaving 4MB as room for the extra data needed by a
> XLogReader.
>
> At the end, I think that this is quite interesting long-term.  For
> example, if we lift up XLogRecordMaxSize, we can evaluate the APIs
> adding buffer data or main data separately.
>
> Thoughts about this version?

I thought that the plan was to use int64 to skip checking for most
overflows and to do a single check at the end in XLogRecordAssemble,
so that the checking has minimal overhead in the performance-critical
log record assembling path and reduced load on the branch predictor.

One more issue that Andres was suggesting we'd fix was to allow XLog
assembly separate from the actual XLog insertion:
Currently you can't pre-assemble a record outside a critical section
if the record must be inserted in a critical section, which makes e.g.
commit records problematic due to the potentially oversized data
resulting in ERRORs during record assembly. This would crash postgres
because commit xlog insertion happens in a critical section. Having a
pre-assembled record would greatly improve the ergonomics in that path
and reduce the length of the critical path.

I think it was something along the lines of the attached; 0001
contains separated Commit/Abort record construction and insertion like
Andres suggested, 0002 does the size checks with updated error
messages.


Kind regards,

Matthias van de Meent

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Wed, Apr 05, 2023 at 04:35:37PM +0200, Matthias van de Meent wrote:
> I thought that the plan was to use int64 to skip checking for most
> overflows and to do a single check at the end in XLogRecordAssemble,
> so that the checking has minimal overhead in the performance-critical
> log record assembling path and reduced load on the branch predictor.

And that's the reason why your v11-0002 is better and simpler than the
v10-0001 I posted a few days ago.

+   if (regbuf->rdata_len + len > UINT16_MAX || len > UINT16_MAX)
+       ereport(ERROR,
+               (errmsg_internal("too much WAL data"),
+                errdetail_internal("Registering more than max %u bytes total to block %u: current %uB, adding %uB",
+                                   UINT16_MAX, block_id, regbuf->rdata_len, len)));

I was wondering for a few minutes about the second part of this
check..  But you are worried about the case where len is so large
that it would overflow rdata_len when calling XLogRegisterBufData()
more than once on the same block, i.e. when len is between
(UINT32_MAX-UINT16_MAX, UINT32_MAX) on the second call.
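
A small stand-alone illustration of that wrap, with numbers invented just
for the example (not taken from any real workload):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t rdata_len = 0x8000;			/* data already registered for the block */
	uint32_t len = UINT32_MAX - 0x4000;		/* pathological second registration */

	/* The 32-bit sum wraps around and looks small enough ... */
	uint32_t sum = rdata_len + len;
	printf("sum = %u, passes sum-only check = %d\n",
		   (unsigned) sum, sum <= UINT16_MAX);

	/* ... so it is the extra "len > UINT16_MAX" condition that rejects it. */
	printf("rejected by len check = %d\n", len > UINT16_MAX);
	return 0;
}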

The extra errdetail_internal() could be tweaked a bit more, but I'm
also OK with your proposal, overall.  One thing: "current %uB,
adding %uB" would read better using "bytes".

> One more issue that Andres was suggesting we'd fix was to allow XLog
> assembly separate from the actual XLog insertion:
> Currently you can't pre-assemble a record outside a critical section
> if the record must be inserted in a critical section, which makes e.g.
> commit records problematic due to the potentially oversized data
> resulting in ERRORs during record assembly. This would crash postgres
> because commit xlog insertion happens in a critical section. Having a
> pre-assembled record would greatly improve the ergonomics in that path
> and reduce the length of the critical path.
>
> I think it was something along the lines of the attached; 0001
> contains separated Commit/Abort record construction and insertion like
> Andres suggested,

I am honestly not sure whether we should complicate xloginsert.c this
way, but we could look at that for v17.

> 0002 does the size checks with updated error messages.

0002 can also be done before 0001, so I'd like to get that part
applied on HEAD before the feature freeze and close this thread.  If
there are any objections, please feel free..
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Thu, Apr 06, 2023 at 10:54:43AM +0900, Michael Paquier wrote:
> 0002 can also be done before 0001, so I'd like to get that part
> applied on HEAD before the feature freeze and close this thread.  If
> there are any objections, please feel free..

I was doing a pre-commit review of the patch, and double-checked the
uses of mainrdata_len.   And there is this part:
    /* followed by main data, if any */
    if (mainrdata_len > 0)
    {
        if (mainrdata_len > 255)
        {
            *(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
            memcpy(scratch, &mainrdata_len, sizeof(uint32));
            scratch += sizeof(uint32);
        }
        else
        {
            *(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
            *(scratch++) = (uint8) mainrdata_len;
        }
        rdt_datas_last->next = mainrdata_head;
        rdt_datas_last = mainrdata_last;
        total_len += mainrdata_len;
    }
    rdt_datas_last->next = NULL;

So bumping mainrdata_len to uint64 is actually not entirely in line
with this code.  Well, it will work because we'd still fail a couple
of lines down, but perhaps its readability should be improved so as
we have an extra check in this code path to make sure that
mainrdata_len is not higher than PG_UINT32_MAX, then use an
intermediate casted variable before saving the length in the record
data to make clear that the type of the main static length in
xloginsert.c is not the same as what a record has?  The v10 I sent
previously blocked this possibility, but not v11.
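
For reference, the shape I have in mind is roughly this (a sketch of the
tweak, not necessarily the exact patch I will send):

	/* followed by main data, if any */
	if (mainrdata_len > 0)
	{
		if (mainrdata_len > 255)
		{
			uint32		mainrdata_len_4b;

			/*
			 * mainrdata_len is the 64-bit in-memory accumulator; reject
			 * anything the on-disk uint32 length field cannot represent.
			 */
			if (mainrdata_len > PG_UINT32_MAX)
				ereport(ERROR,
						(errmsg_internal("too much WAL data"),
						 errdetail_internal("Main data length is %llu bytes for a maximum of %u bytes.",
											(unsigned long long) mainrdata_len,
											PG_UINT32_MAX)));

			mainrdata_len_4b = (uint32) mainrdata_len;
			*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
			memcpy(scratch, &mainrdata_len_4b, sizeof(uint32));
			scratch += sizeof(uint32);
		}
		else
		{
			*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
			*(scratch++) = (uint8) mainrdata_len;
		}

		/* chain the main-data rdatas and add mainrdata_len to total_len as before */
	}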
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Fri, Apr 07, 2023 at 08:08:34AM +0900, Michael Paquier wrote:
> So bumping mainrdata_len to uint64 is actually not entirely in line
> with this code.  Well, it will work because we'd still fail a couple
> of lines down, but perhaps its readability should be improved so as
> we have an extra check in this code path to make sure that
> mainrdata_len is not higher than PG_UINT32_MAX, then use an
> intermediate casted variable before saving the length in the record
> data to make clear that the type of the main static length in
> xloginsert.c is not the same as what a record has?  The v10 I sent
> previously blocked this possibility, but not v11.

So, I was thinking about something like the attached tweaking this
point, the error details a bit, applying an indentation and writing a
commit message...  Matthias?
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:


On Fri, 7 Apr 2023, 01:35 Michael Paquier, <michael@paquier.xyz> wrote:
> On Fri, Apr 07, 2023 at 08:08:34AM +0900, Michael Paquier wrote:
> > So bumping mainrdata_len to uint64 is actually not entirely in line
> > with this code.  Well, it will work because we'd still fail a couple
> > of lines down, but perhaps its readability should be improved so as
> > we have an extra check in this code path to make sure that
> > mainrdata_len is not higher than PG_UINT32_MAX, then use an
> > intermediate casted variable before saving the length in the record
> > data to make clear that the type of the main static length in
> > xloginsert.c is not the same as what a record has?  The v10 I sent
> > previously blocked this possibility, but not v11.

Yes, that was a bad oversight, which would've shown up in tests on a
system with an endianness that my computer doesn't have...

> So, I was thinking about something like the attached tweaking this
> point, the error details a bit, applying an indentation and writing a
> commit message...  Matthias?

That looks fine to me. Thanks for picking this up and fixing the issue.



Kind regards,

Matthias van de Meent

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Fri, Apr 07, 2023 at 01:50:00AM +0200, Matthias van de Meent wrote:
> Yes, that was a bad oversight, which would've shown up in tests on a system
> with an endianness that my computer doesn't have...

I don't think that we have many bigendian animals in the buildfarm,
either..

> That looks fine to me. Thanks for picking this up and fixing the issue.

Okay, cool!
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Fri, Apr 07, 2023 at 08:59:22AM +0900, Michael Paquier wrote:
> Okay, cool!

Done this one with 8fcb32d.
--
Michael

Attachment

Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Matthias van de Meent
Date:
On Fri, 7 Apr 2023 at 08:05, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Fri, Apr 07, 2023 at 08:59:22AM +0900, Michael Paquier wrote:
> > Okay, cool!
>
> Done this one with 8fcb32d.

Thanks a lot! I'll post the separation of record construction and
write-out to xlog in a future thread for 17.

One remaining question: Considering that the changes and checks of
that commit are mostly internal to xloginsert.c (or xlog.c in older
releases), and that no special public-facing changes were made, would
it be safe to backport this to older releases?

PostgreSQL 15 specifically would benefit from this, as it supports
external rmgrs which may generate WAL records and would profit from
these additional checks; but all supported releases of PostgreSQL have
pg_logical_emit_message and are thus easily subject to the issue of
writing oversized WAL records, with subsequent failures in recovery and
in the replication stream.

Kind regards,

Matthias van de Meent



Re: Non-replayable WAL records through overflows and >MaxAllocSize lengths

From
Michael Paquier
Date:
On Sat, Apr 08, 2023 at 04:24:35PM +0200, Matthias van de Meent wrote:
> Thanks a lot! I'll post the separation of record construction and
> write-out to xlog in a future thread for 17.

Thanks!  Creating a new thread makes sense.

> One remaining question: Considering that the changes and checks of
> that commit are mostly internal to xloginsert.c (or xlog.c in older
> releases), and that no special public-facing changes were made, would
> it be safe to backport this to older releases?

The routine changes done in ffd1b6b cannot be backpatched on ABI
grounds; still, you would propose to have protection around
needs_data as well as the whole record length.

> PostgreSQL 15 specifically would benefit from this as it supports
> external rmgrs which may generate WAL records and would benefit from
> these additional checks, but all supported releases of PostgreSQL have
> pg_logical_emit_message and are thus easily subject to the issue of
> writing oversized WAL records and subsequent recovery- and replication
> stream failures.

Custom RMGRs are a good argument, though I don't really see an urgent
argument about doing something in REL_15_STABLE.  For one, it would
mean more backpatching conflicts with ~14.  Another argument is that
XLogRecordMaxSize is not an exact science, either.  In ~15, a record
with a total size between XLogRecordMaxSize and
DecodeXLogRecordRequiredSpace(MaxAllocSize) would work, though it
would not in 16~ because we have the 4MB margin given as room for the
per-record allocation in the XLogReader.  A record of such a size
would not be generated anymore after a minor release update of 15.3~
if we were to do something about that by May on REL_15_STABLE.
--
Michael

Attachment