Thread: TAP test to cover "EndOfLogTLI != replayTLI" case

TAP test to cover "EndOfLogTLI != replayTLI" case

From
Amul Sul
Date:
Hi,

Attached patch covers a case where TLI in the filename for a
record being read is different from where it belongs to. In other
words, it covers following case noted in StartupXLOG():

/*
 * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
 * the end-of-log. It could be different from the timeline that EndOfLog
 * nominally belongs to, if there was a timeline switch in that segment,
 * and we were reading the old WAL from a segment belonging to a higher
 * timeline.
 */
EndOfLogTLI = xlogreader->seg.ws_tli;

Test tried to set recovery target just before the end-of-recovery WAL
record.  Unfortunately, the test couldn't directly verify EndOfLogTLI
!= replayTLI case, I am not sure how to do that -- any suggestions
will be greatly appreciated.  Perhaps, having this test is good to
improve xlog.c test coverage.   Also, if you see in other angle test
covers a case where recovery_target_lsn and
recovery_target_inclusive=off which doesn't exist as of now and that
is the reason I have added this test to 003_recovery_targets.pl file.

Thoughts? Suggestions?

--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com

Attachment

Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Andres Freund
Date:
Hi,

On 2021-11-23 11:43:21 +0530, Amul Sul wrote:
> Attached patch covers a case where TLI in the filename for a
> record being read is different from where it belongs to. In other
> words, it covers following case noted in StartupXLOG():

> Thoughts? Suggestions?

It seems the test isn't quite reliable. It occasionally fails on freebsd,
macos, linux and always on windows (starting with the new CI stuff, before the
test wasn't run).

See https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3427

Greetings,

Andres Freund



Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Amul Sul
Date:
On Mon, Jan 10, 2022 at 8:25 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-11-23 11:43:21 +0530, Amul Sul wrote:
> > Attached patch covers a case where TLI in the filename for a
> > record being read is different from where it belongs to. In other
> > words, it covers following case noted in StartupXLOG():
>
> > Thoughts? Suggestions?
>
> It seems the test isn't quite reliable. It occasionally fails on freebsd,
> macos, linux and always on windows (starting with the new CI stuff, before the
> test wasn't run).
>
> See https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/36/3427
>

Thanks for the note, I can see the same test is failing on my centos
vm too with the latest master head(376ce3e404b).  The failing reason is
the "recovery_target_inclusive = off" setting which is unnecessary for
this test, the attached patch removing the same.

Regards,
Amul

Attachment

Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Julien Rouhaud
Date:
Hi,

On Mon, Jan 10, 2022 at 09:46:23AM +0530, Amul Sul wrote:
> 
> Thanks for the note, I can see the same test is failing on my centos
> vm too with the latest master head(376ce3e404b).  The failing reason is
> the "recovery_target_inclusive = off" setting which is unnecessary for
> this test, the attached patch removing the same.

This version still fails on windows according to the cfbot:

https://cirrus-ci.com/task/5882621321281536

[18:14:02.639] c:\cirrus>call perl src/tools/msvc/vcregress.pl recoverycheck
[18:14:56.114]
[18:14:56.122] #   Failed test 'check standby content before timeline switch 0/500FB30'
[18:14:56.122] #   at t/003_recovery_targets.pl line 234.
[18:14:56.122] #          got: '6000'
[18:14:56.122] #     expected: '7000'

I'm switching the cf entry to Waiting on Author.



Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Amul Sul
Date:
On Sat, Jan 15, 2022 at 11:35 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
>
> Hi,
>
> On Mon, Jan 10, 2022 at 09:46:23AM +0530, Amul Sul wrote:
> >
> > Thanks for the note, I can see the same test is failing on my centos
> > vm too with the latest master head(376ce3e404b).  The failing reason is
> > the "recovery_target_inclusive = off" setting which is unnecessary for
> > this test, the attached patch removing the same.
>
> This version still fails on windows according to the cfbot:
>
> https://cirrus-ci.com/task/5882621321281536
>
> [18:14:02.639] c:\cirrus>call perl src/tools/msvc/vcregress.pl recoverycheck
> [18:14:56.114]
> [18:14:56.122] #   Failed test 'check standby content before timeline switch 0/500FB30'
> [18:14:56.122] #   at t/003_recovery_targets.pl line 234.
> [18:14:56.122] #          got: '6000'
> [18:14:56.122] #     expected: '7000'
>
> I'm switching the cf entry to Waiting on Author.

Thanks for the note.

I am not sure what really went wrong but I think the 'standby_9'
server shutdown too early before it gets a chance to archive a
required WAL file. The attached patch tries to shutdown that server
after the required WAL are archived, unfortunately, I couldn't test
that on the window, let see how cfbot reacts to this version.

Regards,
Amul

Attachment

Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Julien Rouhaud
Date:
Hi,

On Mon, Jan 17, 2022 at 05:07:48PM +0530, Amul Sul wrote:
> 
> I am not sure what really went wrong but I think the 'standby_9'
> server shutdown too early before it gets a chance to archive a
> required WAL file. The attached patch tries to shutdown that server
> after the required WAL are archived, unfortunately, I couldn't test
> that on the window, let see how cfbot reacts to this version.

Thanks for the updated patch!  Note that thanks to Andres and Thomas work, you
can now easily rely on the exact same CI than the cfbot on your own github
repository, if you need to debug something on a platform you don't have access
to.  You can check the documentation at [1] for more detail on how to setup the
CI.

[1] https://github.com/postgres/postgres/blob/master/src/tools/ci/README



Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Julien Rouhaud
Date:
Hi,

On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:
> 
> Thanks for the updated patch!  Note that thanks to Andres and Thomas work, you
> can now easily rely on the exact same CI than the cfbot on your own github
> repository, if you need to debug something on a platform you don't have access
> to.  You can check the documentation at [1] for more detail on how to setup the
> CI.

The cfbot reports that the patch still fails on Windows but also failed on
macos with the same error: https://cirrus-ci.com/task/5655777858813952:

[14:20:43.950] #   Failed test 'check standby content before timeline switch 0/500FAF8'
[14:20:43.950] #   at t/003_recovery_targets.pl line 239.
[14:20:43.950] #          got: '6000'
[14:20:43.950] #     expected: '7000'
[14:20:43.950] # Looks like you failed 1 test of 10.

I'm switching the CF entry to Waiting on Author.



Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Kyotaro Horiguchi
Date:
At Tue, 18 Jan 2022 12:25:15 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in 
> Hi,
> 
> On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:
> > 
> > Thanks for the updated patch!  Note that thanks to Andres and Thomas work, you
> > can now easily rely on the exact same CI than the cfbot on your own github
> > repository, if you need to debug something on a platform you don't have access
> > to.  You can check the documentation at [1] for more detail on how to setup the
> > CI.
> 
> The cfbot reports that the patch still fails on Windows but also failed on
> macos with the same error: https://cirrus-ci.com/task/5655777858813952:
> 
> [14:20:43.950] #   Failed test 'check standby content before timeline switch 0/500FAF8'
> [14:20:43.950] #   at t/003_recovery_targets.pl line 239.
> [14:20:43.950] #          got: '6000'
> [14:20:43.950] #     expected: '7000'
> [14:20:43.950] # Looks like you failed 1 test of 10.
> 
> I'm switching the CF entry to Waiting on Author.

The most significant advantages of the local CI setup are

 - CI immediately responds to push.

 - You can dirty the code with additional logging aid as much as you
   like to see closely what is going on.  It makes us hesitant to do
   the same on this ML:p

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Amul Sul
Date:
On Tue, Jan 18, 2022 at 10:31 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 18 Jan 2022 12:25:15 +0800, Julien Rouhaud <rjuju123@gmail.com> wrote in
> > Hi,
> >
> > On Mon, Jan 17, 2022 at 09:33:57PM +0800, Julien Rouhaud wrote:
> > >
> > > Thanks for the updated patch!  Note that thanks to Andres and Thomas work, you
> > > can now easily rely on the exact same CI than the cfbot on your own github
> > > repository, if you need to debug something on a platform you don't have access
> > > to.  You can check the documentation at [1] for more detail on how to setup the
> > > CI.
> >
> > The cfbot reports that the patch still fails on Windows but also failed on
> > macos with the same error: https://cirrus-ci.com/task/5655777858813952:
> >
> > [14:20:43.950] #   Failed test 'check standby content before timeline switch 0/500FAF8'
> > [14:20:43.950] #   at t/003_recovery_targets.pl line 239.
> > [14:20:43.950] #          got: '6000'
> > [14:20:43.950] #     expected: '7000'
> > [14:20:43.950] # Looks like you failed 1 test of 10.
> >
> > I'm switching the CF entry to Waiting on Author.
>
> The most significant advantages of the local CI setup are
>
>  - CI immediately responds to push.
>
>  - You can dirty the code with additional logging aid as much as you
>    like to see closely what is going on.  It makes us hesitant to do
>    the same on this ML:p
>

Indeed :)

I found the cause for the test failing on window -- is due to the
custom archive command setting which wasn't setting the correct window
archive directory path.

There is no option to choose a custom wal archive and restore patch in
the TAP test. The attach 0001 patch proposes the same, which enables
you to choose a custom WAL archive and restore directory. The only
concern I have is that $node->info() will print the wrong archive
directory path in that case, do we need to fix that?  We might need to
store archive_dir like base_dir, I wasn't sure about that.  Thoughts?
Comments?

Regards,
Amul

Attachment

Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Michael Paquier
Date:
On Thu, Jan 20, 2022 at 12:13:08PM +0530, Amul Sul wrote:
> I found the cause for the test failing on window -- is due to the
> custom archive command setting which wasn't setting the correct window
> archive directory path.

After reading this patch and this thread, I have noticed that you are
testing the same thing as Heikki here, patch 0001:
https://www.postgresql.org/message-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi

The patch sent on the other thread has a better description and shape,
so perhaps we'd better drop what is posted here in favor of the other
version.  Thoughts?
--
Michael

Attachment

Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Kyotaro Horiguchi
Date:
At Thu, 27 Jan 2022 15:31:37 +0900, Michael Paquier <michael@paquier.xyz> wrote in 
> On Thu, Jan 20, 2022 at 12:13:08PM +0530, Amul Sul wrote:
> > I found the cause for the test failing on window -- is due to the
> > custom archive command setting which wasn't setting the correct window
> > archive directory path.
> 
> After reading this patch and this thread, I have noticed that you are
> testing the same thing as Heikki here, patch 0001:
> https://www.postgresql.org/message-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi
> 
> The patch sent on the other thread has a better description and shape,
> so perhaps we'd better drop what is posted here in favor of the other
> version.  Thoughts?

pg_switch_wal() is more preferable than filling-in a large number of
records as the means to advance segments because it is stable against
changes of wal_segment_size.

As you said, using has_restoring on initializing node_ptr is simpler
than explicitly setting archive_dir from that of another node.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Amul Sul
Date:
On Thu, Jan 27, 2022 at 12:01 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, Jan 20, 2022 at 12:13:08PM +0530, Amul Sul wrote:
> > I found the cause for the test failing on window -- is due to the
> > custom archive command setting which wasn't setting the correct window
> > archive directory path.
>
> After reading this patch and this thread, I have noticed that you are
> testing the same thing as Heikki here, patch 0001:
> https://www.postgresql.org/message-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi
>
> The patch sent on the other thread has a better description and shape,
> so perhaps we'd better drop what is posted here in favor of the other
> version.  Thoughts?

Yes, I do agree with you. Thanks for the comparison.

Regards,
Amul



Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Heikki Linnakangas
Date:
On 28/01/2022 12:10, Amul Sul wrote:
> On Thu, Jan 27, 2022 at 12:01 PM Michael Paquier <michael@paquier.xyz> wrote:
>>
>> After reading this patch and this thread, I have noticed that you are
>> testing the same thing as Heikki here, patch 0001:
>> https://www.postgresql.org/message-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi
>>
>> The patch sent on the other thread has a better description and shape,
>> so perhaps we'd better drop what is posted here in favor of the other
>> version.  Thoughts?
> 
> Yes, I do agree with you. Thanks for the comparison.

Pushed the test from that other thread now. Thanks!

- Heikki



Re: TAP test to cover "EndOfLogTLI != replayTLI" case

From
Amul Sul
Date:
On Mon, Feb 14, 2022 at 3:07 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 28/01/2022 12:10, Amul Sul wrote:
> > On Thu, Jan 27, 2022 at 12:01 PM Michael Paquier <michael@paquier.xyz> wrote:
> >>
> >> After reading this patch and this thread, I have noticed that you are
> >> testing the same thing as Heikki here, patch 0001:
> >> https://www.postgresql.org/message-id/52bc9ccd-8591-431b-0086-15d9acf25a3f@iki.fi
> >>
> >> The patch sent on the other thread has a better description and shape,
> >> so perhaps we'd better drop what is posted here in favor of the other
> >> version.  Thoughts?
> >
> > Yes, I do agree with you. Thanks for the comparison.
>
> Pushed the test from that other thread now. Thanks!
>

Thank you !

Regards,
Amul