Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)? - Mailing list pgsql-hackers
From | Venkata B Nagothi |
---|---|
Subject | Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)? |
Date | |
Msg-id | CAEyp7J-BmsMqCVO7zX5vpEoEuRiM+KVPZPJzodUqOdOTZsLpqQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
List | pgsql-hackers |
On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co. jp> wrote:
Hello,
At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail. gmail.com>
> On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> horiguchi.kyotaro@lab.ntt.co.jp> wrote: I suppose the "configure" means primary_slot_name in recovery.conf.> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
> >
> > - M(aster): createdb as a master with wal_keep_segments = 0
> > (default), min_log_messages = debug2
> > - M: Create a physical repslot.
> > - S(tandby): Setup a standby database.
> > - S: Edit recovery.conf to use the replication slot above then
> > start it.
> > - S: touch /tmp/hoge
> > - M: Run pgbench ...
> > - S: After a while, the standby stops.
> > > LOG: #################### STOP THE SERVER
> >
> > - M: Stop pgbench.
> > - M: Do 'checkpoint;' twice.
> > - S: rm /tmp/hoge
> > - S: Fails to catch up with the following error.
> >
> > > FATAL: could not receive data from WAL stream: ERROR: requested WAL
> > segment 00000001000000000000002B has already been removed
> >
> >
> I have been testing / reviewing the latest patch
> "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> need some more clarification on this.
>
> Before applying the patch, I tried re-producing the above error -
>
> - I had master->standby in streaming replication
> - Took the backup of master
> - with a low max_wal_size and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
> - Configured the replication slot on standby and started the standby
> - I got the below error
>
> >> 2017-03-10 11:58:15.704 AEDT [478] LOG: invalid record length at
> 0/F2000140: wanted 24, got 0
> >> 2017-03-10 11:58:15.706 AEDT [481] LOG: started streaming WAL from
> primary at 0/F2000000 on timeline 1
> >> 2017-03-10 11:58:15.706 AEDT [481] FATAL: could not receive data
> from WAL stream: ERROR: requested WAL segment 0000000100000000000000F2 has
> already been removed
Maybe you created the master slot with non-reserve (default) mode
and put a some-minites pause after making the backup and before
starting the standby. For the case the master slot doesn't keep
WAL segments unless the standby connects so a couple of
checkpoints can blow away the first segment required by the
standby. This is quite reasonable behavior. The following steps
makes this more sure.
> - Took the backup of master
> - with a low max_wal_size = 2 and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
+ - SELECT pg_switch_wal(); on master twice.
+ - checkpoint; on master twice.
> - Configured the replication slot on standby and started the standby
Creating the slot with the following command will save it.
=# select pg_create_physical_replication_slot('s1', true);
I did a test again, by applying the patch and I am not sure if the patch is doing the right thing ?
Here is test case -
- I ran pgbench
- I took the backup of the master first
- Below are the WALs on master after the stop backup -
- After the backup, i created the physical replication slotpostgres=# select pg_stop_backup();NOTICE: WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backuppg_stop_backup----------------0/8C000130(1 row)postgres=# \q[dba@buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/total 65Mdrwx------. 2 dba dba 4.0K Mar 31 09:36 archive_status-rw-------. 1 dba dba 16M Mar 31 11:09 00000001000000000000008E-rw-------. 1 dba dba 16M Mar 31 11:17 00000001000000000000008F-rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008C-rw-------. 1 dba dba 16M Mar 31 11:18 00000001000000000000008D
postgres=# select pg_create_physical_replication_slot('repslot', true); pg_create_physical_replication_slot ------------------------------------- (repslot,0/8D000028)(1 row)postgres=# select pg_walfile_name('0/8D000028');pg_walfile_name--------------------------------------- 00000001000000000000008D(1 row)
- I started the standby and the below error occursHere, When you start the standby, it would ask for the file 00000001000000000000008C, which is the first file needed for the standby and since i applied your patch, i am assuming that, the file00000001000000000000008C should also be retained without being removed - correct ?
>> 2017-03-31 11:26:01.288 AEDT [17475] LOG: invalid record length at 0/8C000108: wanted 24, got 0
>> 2017-03-31 11:26:01.291 AEDT [17486] LOG: started streaming WAL from primary at 0/8C000000 on timeline 1
>> 2017-03-31 11:26:01.291 AEDT [17486] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000001000000000000008C has already been removed
> and i could notice that the file "0000000100000000000000F2" was removed
> from the master. This can be easily re-produced and this occurs
> irrespective of configuring replication slots.
>
> As long as the file "0000000100000000000000F2" is available on the master,
> standby continues to stream WALs without any issues.
...
> If the scenario i created to reproduce the error is correct, then, applying
> the patch is not making a difference.
Yes, the patch is not for saving this case. The patch saves the
case where the previous segment to the first required segment by
standby was removed and it contains the first part of a record
continues to the first required segment. On the other hand this
case is that the segment at the start point of standby is just
removed.
Which means, the file 00000001000000000000008C must be retained as it is the first file standby is looking for - correct ?
Regards,
Venkata Balaji N
Database Consultant
pgsql-hackers by date: