Thread: Re: pgsql: Allow users to limit storage reserved by replication slots

Re: pgsql: Allow users to limit storage reserved by replication slots

From
Alvaro Herrera
Date:
On 2020-Apr-07, Alvaro Herrera wrote:

> src/test/recovery/t/019_replslot_limit.pl      | 217 +++++++++++++++++++++++++

I fixed the perlcritic complaint from buildfarm member crake, but
there's a new one in francolin:

#   Failed test 'check that the slot state changes to "reserved"'
#   at t/019_replslot_limit.pl line 125.
#          got: '0/15000D8|reserved|216 bytes'
#     expected: '0/1500000|reserved|216 bytes'

#   Failed test 'check that the slot state changes to "lost"'
#   at t/019_replslot_limit.pl line 135.
#          got: '0/15000D8|lost|t'
#     expected: '0/1500000|lost|t'
# Looks like you failed 2 tests of 13.
[23:07:28] t/019_replslot_limit.pl .............. 

where the Perl code is:

  $start_lsn = $node_master->lsn('write');
  $node_master->wait_for_catchup($node_standby, 'replay', $start_lsn);
  $node_standby->stop;

  # Advance WAL again without checkpoint, reducing remain by 6 MB.
  advance_wal($node_master, 6);

  # Slot gets into 'reserved' state
  $result = $node_master->safe_psql('postgres',
    "SELECT restart_lsn, wal_status, pg_size_pretty(restart_lsn - min_safe_lsn) as remain FROM pg_replication_slots WHERE slot_name = 'rep1'");
 
  is($result, "$start_lsn|reserved|216 bytes", 'check that the slot state changes to "reserved"');

0xD8 is 216, so this seems to be saying that the checkpoint record was
skipped by the restart_lsn.  I'm not clear exactly why that happened ...
is this saying that a checkpoint occurred?
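
As a quick check of the arithmetic above, here is a small Python sketch (not part of the test suite; the helper name `lsn_to_int` is made up for illustration) that converts the two LSNs from francolin's output to integers and subtracts them:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN written as 'hi/lo' (two hex numbers) to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# Values copied from the failing test output above.
got = lsn_to_int("0/15000D8")
expected = lsn_to_int("0/1500000")

gap = got - expected
print(gap)  # 216, i.e. 0xD8
```

The gap equals the value in the "remain" column of the same row. Since that column is computed as restart_lsn - min_safe_lsn, min_safe_lsn in that row works out to 0/1500000.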

One easy fix would be to remove the "restart_lsn" output column from the
query, but do we lose test specificity?  (I think the answer is no.)

However, even with that change, we're still testing that a checkpoint is
216 bytes ... in other words, whenever someone changes the definition of
struct CheckPoint, this test will fail.  That seems unnecessary and
unfriendly.  I'm not sure how to improve that without also removing that
column.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgsql: Allow users to limit storage reserved by replication slots

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I fixed the perlcritic complaint from buildfarm member crake, but
> there's a new one in francolin:

Other buildfarm members are showing related-but-different failures.
I think this test is just plain unstable.

            regards, tom lane



Re: pgsql: Allow users to limit storage reserved by replication slots

From
Andres Freund
Date:
Hi,

On April 7, 2020 6:13:51 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> I fixed the perlcritic complaint from buildfarm member crake, but
>> there's a new one in francolin:
>
>Other buildfarm members are showing related-but-different failures.
>I think this test is just plain unstable.

I have not looked at the source, but the error messages show LSNs and bytes.
I can't really imagine how that could be made stable.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: pgsql: Allow users to limit storage reserved by replication slots

From
Michael Paquier
Date:
On Tue, Apr 07, 2020 at 07:10:07PM -0700, Andres Freund wrote:
> I have not looked at the source, but the error messages show LSNs
> and bytes. I can't really imagine how that could be made stable.

Another piece of bad news is that this is page-size dependent.  What if you
removed pg_size_pretty() and replaced it with a condition that returns
a boolean status in the result itself?
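
One way to read this suggestion, sketched in Python rather than in the test's own Perl and SQL: compare a boolean derived from the byte count instead of the count itself. The condition used here (remain > 0) and the helper names are assumptions for illustration and are not taken from the thread. The 192-byte case is the value florican reports elsewhere in this thread.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert an LSN written as 'hi/lo' (two hex numbers) to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def remain_is_positive(restart_lsn: str, min_safe_lsn: str) -> bool:
    # Hypothetical stand-in for a SQL condition such as
    # "restart_lsn - min_safe_lsn > 0"; the exact condition is an assumption.
    return lsn_to_int(restart_lsn) - lsn_to_int(min_safe_lsn) > 0


# The two byte counts seen in the thread (216 and 192) give the same boolean,
# so a test comparing the boolean would not depend on the record size.
results = {
    "francolin": remain_is_positive("0/15000D8", "0/1500000"),
    "florican": remain_is_positive("0/15000C0", "0/1500000"),
}
print(results)
```

A check of this kind still verifies that some WAL remains, without tying the expected string to a particular platform's record size.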
--
Michael


Re: pgsql: Allow users to limit storage reserved by replication slots

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> However, even with that change, we're still testing that a checkpoint is
> 216 bytes ... in other words, whenever someone changes the definition of
> struct CheckPoint, this test will fail.  That seems unnecessary and
> unfriendly.  I'm not sure how to improve that without also removing that
> column.

I read florican's results as showing that sizeof(CheckPoint) is already
different on 32-bit machines than 64-bit; it's repeatably getting this:

#   Failed test 'check that the slot state changes to "reserved"'
#   at t/019_replslot_limit.pl line 125.
#          got: '0/15000C0|reserved|192 bytes'
#     expected: '0/15000C0|reserved|216 bytes'

This test case was *not* well thought out.

            regards, tom lane