Thread: BUG #17903: There is a bug in the KeepLogSeg()

BUG #17903: There is a bug in the KeepLogSeg()

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17903
Logged by:          xu xingwang
Email address:      xu.xw2008@163.com
PostgreSQL version: 13.3
Operating system:   openEuler
Description:

Hi,

I found that KeepLogSeg() has a piece of code that is not correctly.

segno may be larger than currSegNo, since the slot_keep_segs variable is of
type "uint64", in this case the code "if (currSegNo - segno >
slot_keep_segs)" is incorrect. 

"if (currSegNo - segno < keep_segs)" is also the same.

Checkpoint calls the KeepLogSeg function, and there are many operations
between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
pg_control file, so segno may be larger than currSegNo.

KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
    XLogSegNo    currSegNo;
    XLogSegNo    segno;
    XLogRecPtr    keep;

    XLByteToSeg(recptr, currSegNo, wal_segment_size);
    segno = currSegNo;

    /*
     * Calculate how many segments are kept by slots first, adjusting for
     * max_slot_wal_keep_size.
     */
    keep = XLogGetReplicationSlotMinimumLSN();
    if (keep != InvalidXLogRecPtr)
    {
        XLByteToSeg(keep, segno, wal_segment_size);

        /* Cap by max_slot_wal_keep_size ... */
        if (max_slot_wal_keep_size_mb >= 0)
        {
            uint64        slot_keep_segs;

            slot_keep_segs =
                ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);

            if (currSegNo - segno > slot_keep_segs)
                segno = currSegNo - slot_keep_segs;
        }
    }

    /* but, keep at least wal_keep_size if that's set */
    if (wal_keep_size_mb > 0)
    {
        uint64        keep_segs;

        keep_segs = ConvertToXSegs(wal_keep_size_mb, wal_segment_size);
        if (currSegNo - segno < keep_segs)
        {
            /* avoid underflow, don't go below 1 */
            if (currSegNo <= keep_segs)
                segno = 1;
            else
                segno = currSegNo - keep_segs;
        }
    }


regards.

-- 
xu xingwang


Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Kyotaro Horiguchi
Date:
At Wed, 19 Apr 2023 10:26:13 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> I found that KeepLogSeg() has a piece of code that is not correctly.
> 
> segno may be larger than currSegNo, since the slot_keep_segs variable is of
> type "uint64", in this case the code "if (currSegNo - segno >
> slot_keep_segs)" is incorrect. 
> 
> "if (currSegNo - segno < keep_segs)" is also the same.
> 
> Checkpoint calls the KeepLogSeg function, and there are many operations
> between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
> pg_control file, so segno may be larger than currSegNo.

Correct. Thanks for the report.

If checkpointer somehow takes a long time between inserting a
checkpoint record and removing WAL files, while replication advances a
certain distnace, it can actually happen. Although that behavior
doesn't directly affect max_slot_wal_keep_size, it does disrupt the
effect of wal_keep_size.

The thinko was that we incorrectly assumed the slot minimum LSN can't
be larger than the checkpoint record LSN. We don't need to consider
max_slot_wal_keep_size if the slot minimum LSN is already larger than
currSegNo.

The attached fix works. However, I can't come up with a reasonable
testing script.

This dates back to 13, where max_slot_wal_keep_size was introduced.

regards.


-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Junwang Zhao
Date:
If `segno` can be larger than `currSegNo`, your patch seems to miss
the following branch,
are you missing this for some particular reason?

```
/* but, keep at least wal_keep_size if that's set */
if (wal_keep_size_mb > 0)
{
    uint64 keep_segs;

    keep_segs = ConvertToXSegs(wal_keep_size_mb, wal_segment_size);
    if (currSegNo - segno < keep_segs)           <<<<  see here
    {
        /* avoid underflow, don't go below 1 */
        if (currSegNo <= keep_segs)
            segno = 1;
        else
            segno = currSegNo - keep_segs;
        }
}
```

On Thu, Apr 20, 2023 at 11:04 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Wed, 19 Apr 2023 10:26:13 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> > I found that KeepLogSeg() has a piece of code that is not correctly.
> >
> > segno may be larger than currSegNo, since the slot_keep_segs variable is of
> > type "uint64", in this case the code "if (currSegNo - segno >
> > slot_keep_segs)" is incorrect.
> >
> > "if (currSegNo - segno < keep_segs)" is also the same.
> >
> > Checkpoint calls the KeepLogSeg function, and there are many operations
> > between recptr and XLogGetReplicationSlotMinimumLSN, including updating the
> > pg_control file, so segno may be larger than currSegNo.
>
> Correct. Thanks for the report.
>
> If checkpointer somehow takes a long time between inserting a
> checkpoint record and removing WAL files, while replication advances a
> certain distnace, it can actually happen. Although that behavior
> doesn't directly affect max_slot_wal_keep_size, it does disrupt the
> effect of wal_keep_size.
>
> The thinko was that we incorrectly assumed the slot minimum LSN can't
> be larger than the checkpoint record LSN. We don't need to consider
> max_slot_wal_keep_size if the slot minimum LSN is already larger than
> currSegNo.
>
> The attached fix works. However, I can't come up with a reasonable
> testing script.
>
> This dates back to 13, where max_slot_wal_keep_size was introduced.
>
> regards.
>
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center



--
Regards
Junwang Zhao



Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Kyotaro Horiguchi
Date:
At Thu, 20 Apr 2023 15:40:14 +0800, Junwang Zhao <zhjwpku@gmail.com> wrote in 
> If `segno` can be larger than `currSegNo`, your patch seems to miss
> the following branch,
> are you missing this for some particular reason?

Oops! Sorry for the mistake and thanks for pointing it out.

I should have kept segno within the reasonable range.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment

Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Junwang Zhao
Date:
This patch looks reasonable.

+1

On Thu, Apr 20, 2023 at 4:58 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Thu, 20 Apr 2023 15:40:14 +0800, Junwang Zhao <zhjwpku@gmail.com> wrote in
> > If `segno` can be larger than `currSegNo`, your patch seems to miss
> > the following branch,
> > are you missing this for some particular reason?
>
> Oops! Sorry for the mistake and thanks for pointing it out.
>
> I should have kept segno within the reasonable range.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center



--
Regards
Junwang Zhao



Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Nathan Bossart
Date:
On Thu, Apr 20, 2023 at 05:58:14PM +0900, Kyotaro Horiguchi wrote:
> +        /*
> +         * Slot minimum LSN could be greater than recptr. In such cases, no
> +         * segments are protected by slots but we still need to keep segno in a
> +         * reasonable range for subsequent calculations to avoid underflow.
> +         */
> +        if (segno > currSegNo)
> +            segno = currSegNo;
> +

I wonder if it'd be better to instead change

    if (currSegNo - segno > slot_keep_segs)
to
    if (currSegNo > segno + slot_keep_segs)

and

    if (currSegNo - segno < keep_segs)
to
    if (currSegNo < segNo + keep_segs)

If segno > currSegNo, the first conditional would be false as expected, and
the second would be true as expected.  We could also use
pg_sub_u64_overflow() to detect underflow, but that might be excessive in
this case.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Kyotaro Horiguchi
Date:
At Thu, 20 Apr 2023 15:02:42 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> I wonder if it'd be better to instead change
> 
>     if (currSegNo - segno > slot_keep_segs)
> to
>     if (currSegNo > segno + slot_keep_segs)
> 
> and
> 
>     if (currSegNo - segno < keep_segs)
> to
>     if (currSegNo < segNo + keep_segs)
> 
> If segno > currSegNo, the first conditional would be false as expected, and
> the second would be true as expected.  We could also use
> pg_sub_u64_overflow() to detect underflow, but that might be excessive in
> this case.

From what I understand, the XLogSegNo calculations are designed
without considering the actual value range. Basically, it assumes that
(XLogSegNo + <any positive int>) can overflow.  If we take the actual
value range into account, we can make that change.

The choice lies on whether we assume the actual value range or not.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Nathan Bossart
Date:
On Fri, Apr 21, 2023 at 10:42:31AM +0900, Kyotaro Horiguchi wrote:
> From what I understand, the XLogSegNo calculations are designed
> without considering the actual value range. Basically, it assumes that
> (XLogSegNo + <any positive int>) can overflow.  If we take the actual
> value range into account, we can make that change.
> 
> The choice lies on whether we assume the actual value range or not.

Yeah, after staring at this some more, I think your proposed fix is the way
to go.  Alternatively, we could adjust the conditional for the
max_slot_wal_keep_size block to

    if (keep != InvalidXLogRecPtr && keep < recptr)

but AFAICT that yields the exact same behavior.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Nathan Bossart
Date:
Here is a new version of the patch.  It is fundamentally the same as v2,
but I've adjusted the comment and commit message a bit.  Barring
objections, I am planning to commit this (and back-patch to v13) in the
near future.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Attachment

Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Kyotaro Horiguchi
Date:
At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> Here is a new version of the patch.  It is fundamentally the same as v2,
> but I've adjusted the comment and commit message a bit.  Barring
> objections, I am planning to commit this (and back-patch to v13) in the
> near future.

> When determining the oldest segment that must be kept for
> replication slots, KeepLogSeg() might calculate a segment number
> ahead of the one for the LSN given to the function.  This causes

Maybe the KeepLogSeg() is a mistake of
XLogGetReplicationSlotMinimumLSN()?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Nathan Bossart
Date:
On Tue, Apr 25, 2023 at 01:14:52PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
>> When determining the oldest segment that must be kept for
>> replication slots, KeepLogSeg() might calculate a segment number
>> ahead of the one for the LSN given to the function.  This causes
> 
> Maybe the KeepLogSeg() is a mistake of
> XLogGetReplicationSlotMinimumLSN()?

I adjusted the commit message to call out that function explicitly.  Also,
I decided to go with the "keep < recptr" approach since there's no reason
to do anything in that block otherwise.

Thanks!

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Michael Paquier
Date:
On Thu, Apr 27, 2023 at 02:47:46PM -0700, Nathan Bossart wrote:
> I adjusted the commit message to call out that function explicitly.  Also,
> I decided to go with the "keep < recptr" approach since there's no reason
> to do anything in that block otherwise.

b726236 exists in the tree, but it seems like the message of
pgsql-committers is held on moderation?
--
Michael

Attachment

Re: BUG #17903: There is a bug in the KeepLogSeg()

From
Kyotaro Horiguchi
Date:
At Thu, 27 Apr 2023 14:47:46 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> On Tue, Apr 25, 2023 at 01:14:52PM +0900, Kyotaro Horiguchi wrote:
> > At Mon, 24 Apr 2023 12:14:52 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in 
> >> When determining the oldest segment that must be kept for
> >> replication slots, KeepLogSeg() might calculate a segment number
> >> ahead of the one for the LSN given to the function.  This causes
> > 
> > Maybe the KeepLogSeg() is a mistake of
> > XLogGetReplicationSlotMinimumLSN()?
> 
> I adjusted the commit message to call out that function explicitly.  Also,
> I decided to go with the "keep < recptr" approach since there's no reason
> to do anything in that block otherwise.

That works, too. Thanks!

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center