Thread: In-core regression tests for replication, cascading, archiving, PITR, etc.

In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
Hi all,

The replication bug causing data corruption on hot standbys found
lately (http://wiki.postgresql.org/wiki/Nov2013ReplicationIssue) has
caused a certain amount of damage among the users of Postgres,
companies and individuals alike, and impacts a lot of people. So
perhaps it would be a good time to start thinking about adding some
dedicated regression tests for replication, archiving, PITR, data
integrity, parameter reloading, etc. The range of things and use cases
that could be covered is vast, but let's say that it involves
manipulating multiple Postgres nodes and the structures (like a WAL
archive) that define a Postgres cluster.

The main purpose of those tests would be really simple: using a given
git repository, a buildfarm member or developer should be able to run a
single "make check" command that runs those tests on a local machine,
in a fashion similar to the isolation or regression tests, to validate
builds or patches.

I imagine that there would be roughly three ways to implement such a facility:
1) Use a smart set of bash scripts. This would be easy to implement
but reduces the pluggability of custom scripts (I am sure that each
user/company already has its own set of scenarios). pg_upgrade uses
something similar with its test.sh.
2) Use a scripting language, in a way similar to how the isolation
tests are done. This would make custom tests more customizable.
Here is for example an approach that was presented at the
PGCon 2013 unconference (this would be something different though, as
that proposal does not include node manipulation *during* the tests,
like promotion):
https://wiki.postgresql.org/images/1/14/Pg_testframework.pptx
3) Import (and improve) solutions that other projects based on
Postgres technology use for those things.

In all cases, here are the common primary actions that could be run for a test:
- Define and perform actions on a node: init, start, stop, promote,
create_archive, base_backup. This is a sort of improved wrapper
around pg_ctl.
- Pass parameters to a configuration file, either postgresql.conf,
recovery.conf, or anything else.
- Run SQL commands on a node.
Things like the creation of folders for WAL archiving should simply be
hardcoded to simplify the life of developers... As well, the facility
should be smart enough to allow the use of custom commands that are
combinations of the primary actions above; for example, defining a
sync standby linked to a root node is simply: 1) create a base backup
from a node, 2) pass parameters to postgresql.conf and recovery.conf,
3) start the node.
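That sync-standby recipe, built from the primitive actions, could be
sketched like this (illustrative Python only; every name here is
invented for the sketch, and commands are recorded rather than
executed so the composition is visible):

```python
# A minimal sketch of how primitive node actions could compose into
# custom commands. All names (PostgresNode, make_sync_standby, ...)
# are hypothetical, not any real harness's API.

class PostgresNode:
    def __init__(self, name, pgdata, port):
        self.name, self.pgdata, self.port = name, pgdata, port
        self.actions = []  # a real harness would run these via subprocess

    def _run(self, *cmd):
        self.actions.append(" ".join(cmd))

    # Primitive actions: a sort of improved wrapper around pg_ctl and friends.
    def init(self):
        self._run("initdb", "-D", self.pgdata)

    def append_conf(self, conf_file, **params):
        for key, value in params.items():
            self._run("append", conf_file, "%s = '%s'" % (key, value))

    def start(self):
        self._run("pg_ctl", "-D", self.pgdata, "start")

    def promote(self):
        self._run("pg_ctl", "-D", self.pgdata, "promote")

    def base_backup(self, target_dir):
        self._run("pg_basebackup", "-D", target_dir, "-p", str(self.port))

# A custom command combining the primitives: define a sync standby.
def make_sync_standby(root, name, pgdata, port):
    standby = PostgresNode(name, pgdata, port)
    root.base_backup(pgdata)                          # 1) base backup from root
    standby.append_conf("postgresql.conf", port=port) # 2) pass parameters
    standby.append_conf("recovery.conf",
                        standby_mode="on",
                        primary_conninfo="port=%d application_name=%s"
                                         % (root.port, name))
    root.append_conf("postgresql.conf", synchronous_standby_names=name)
    standby.start()                                   # 3) start the node
    return standby
```

The point of the sketch is that the test-facing surface stays tiny:
one custom command per scenario, each a short combination of the same
few primitives.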

Let me know your thoughts.
Regards,
-- 
Michael



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Heikki Linnakangas
Date:
On 12/02/2013 08:40 AM, Michael Paquier wrote:
> The data replication bug causing data corruption on hot slaves found
> lately (http://wiki.postgresql.org/wiki/Nov2013ReplicationIssue) is
> causing a certain amount of damage among the users of Postgres, either
> companies or individuals, and impacts a lot of people. So perhaps it
> would be a good time to start thinking about adding some dedicated
> regression tests for replication, archiving, PITR, data integrity,
> parameter reloading, etc. The possible number of things or use-cases
> that could be applied to that is very vast, but let's say that it
> involves the manipulation of multiple Postgres nodes and structures
> (like a WAL archive) defining a Postgres cluster.
>
> The main purpose of those test would be really simple: using a given
> GIT repository, a buildfarm or developer should be able to run a
> single "make check" command that runs those tests on a local machine
> in a fashion similar to isolation or regression tests and validate
> builds or patches.

+1. The need for such a test suite has been mentioned every single time 
that a bug or new feature related to replication, PITR or hot standby 
has come up. So yes please! The only thing missing is someone to 
actually write the thing. So if you have the time and energy, that'd be 
great!

- Heikki



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Mon, Dec 2, 2013 at 6:24 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> +1. The need for such a test suite has been mentioned every single time that
> a bug or new feature related to replication, PITR or hot standby has come
> up. So yes please! The only thing missing is someone to actually write the
> thing. So if you have the time and energy, that'd be great!
I am sure you know who we need to convince in this case :)
-- 
Michael



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Andres Freund
Date:
On 2013-12-02 18:45:37 +0900, Michael Paquier wrote:
> On Mon, Dec 2, 2013 at 6:24 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > +1. The need for such a test suite has been mentioned every single time that
> > a bug or new feature related to replication, PITR or hot standby has come
> > up. So yes please! The only thing missing is someone to actually write the
> > thing. So if you have the time and energy, that'd be great!

> I am sure you know who we need to convince in this case :)

If you're alluding to Tom, I'd guess he doesn't need to be convinced of
such a facility in general. I seem to remember him complaining about
the lack of testing in that area as well.
Maybe it just shouldn't be part of the main regression schedule...

+many from me as well. I think the big battle will be how to do it, not
whether to do it at all.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Mon, Dec 2, 2013 at 7:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Maybe that it shouldn't be part of the main regression schedule...
Yes; like the isolation tests, I don't see those new tests being part
of the main flow either.

> +many from me as well. I think the big battle will be how to do it, not
> if in general.
Yeah, that's why we should first gather feedback about the methods
that other projects (Slony, Londiste, pgpool) are using before
settling on one solution or another. Having something, however small,
for 9.4 would also be of great help.

I am however sure that getting a small prototype integrated with some
of my in-house scripts would not take that much time...
Regards,
-- 
Michael



Andres Freund <andres@2ndquadrant.com> writes:
> Maybe that it shouldn't be part of the main regression schedule...

It *can't* be part of the main regression tests; those are supposed to
be runnable against an already-installed server, and fooling with that
server's configuration is off-limits too.  But I agree that some
other facility to simplify running tests like this would be handy.

At the same time, I'm pretty skeptical that any simple regression-test
type facility would have caught the bugs we've fixed lately ...
        regards, tom lane



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Andres Freund
Date:
On 2013-12-02 09:41:39 -0500, Tom Lane wrote:
> At the same time, I'm pretty skeptical that any simple regression-test
> type facility would have caught the bugs we've fixed lately ...

Agreed, but it would make reorganizing stuff to be more robust more
realistic. At the moment, for everything you change, you have to
hand-test everything possibly affected, which takes ages.

I think we also need support for testing xid/multixact wraparound. It
currently isn't realistically testable because of the timeframes
involved.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund <andres@2ndquadrant.com> writes:
> I think we also needs support for testing xid/multixid wraparound. It
> currently isn't realistically testable because of the timeframes
> involved.

When I've wanted to do that in the past, I've used pg_resetxlog to
adjust a cluster's counters.  It still requires some manual hacking
though because pg_resetxlog isn't bright enough to create the new
pg_clog files needed when you move the xid counter a long way.
We could fix that, or we could make the backend more forgiving of
not finding the initial clog segment present at startup ...
        regards, tom lane
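The pg_clog arithmetic behind Tom's point can be sketched as follows.
This is a simplified Python model, not the backend's code; it assumes
the default 8 kB block size and the standard SLRU layout (2 status bits
per transaction, 32 pages per segment):

```python
# Simplified model of pg_clog segment layout (assumptions: 8 kB blocks,
# 2 status bits per transaction, 32 pages per SLRU segment).
BLCKSZ = 8192
XACTS_PER_BYTE = 4                                   # 2 bits per xact
CLOG_XACTS_PER_PAGE = BLCKSZ * XACTS_PER_BYTE        # 32768
SLRU_PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT  # 1048576

def clog_segment_name(xid):
    """Return the pg_clog segment file name covering the given XID."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)

def segments_crossed(old_xid, new_xid):
    """How many segment boundaries lie between two XID counter positions."""
    return new_xid // XACTS_PER_SEGMENT - old_xid // XACTS_PER_SEGMENT
```

Under those assumptions, jumping the counter from a fresh cluster to
around 2^31 with pg_resetxlog -x crosses roughly two thousand segment
boundaries, and pg_resetxlog leaves creating those files to the user;
hence the manual hacking Tom mentions.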



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Alvaro Herrera
Date:
Tom Lane wrote:

> When I've wanted to do that in the past, I've used pg_resetxlog to
> adjust a cluster's counters.  It still requires some manual hacking
> though because pg_resetxlog isn't bright enough to create the new
> pg_clog files needed when you move the xid counter a long way.
> We could fix that, or we could make the backend more forgiving of
> not finding the initial clog segment present at startup ...

FWIW we already have some new code that creates segments when not found.
It's currently used in multixact, and the submitted "commit timestamp"
module uses it too.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Andres Freund
Date:
On 2013-12-02 09:59:12 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I think we also needs support for testing xid/multixid wraparound. It
> > currently isn't realistically testable because of the timeframes
> > involved.
> 
> When I've wanted to do that in the past, I've used pg_resetxlog to
> adjust a cluster's counters.

I've done that as well, but it's painful and not necessarily testing
the right thing. E.g. I am far from sure we handle setting the
anti-wraparound limits correctly when promoting a standby: a restart
to adapt pg_control changes things, and it might get rolled back
because of an already-logged checkpoint.

What I'd love is a function that gives me the opportunity to
*efficiently* move pg_clog and pg_multixact/offsets,members forward by
large chunks. So e.g. I could run a normal pgbench alongside another
pgbench moving the clog forward in 500k chunks, such that it creates
the necessary files I could possibly need to access.

If you do it naively you get into quite some fun with hot standby, btw.
I can tell you that from experience :P

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Mon, Dec 2, 2013 at 11:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> At the same time, I'm pretty skeptical that any simple regression-test
> type facility would have caught the bugs we've fixed lately ...
The replication bug would at least have been reproducible; Heikki
produced a simple test case able to reproduce it. For the MultiXact
stuff... well, some more infrastructure in core might be needed before
having a wrapper calling test scripts aimed at manipulating clusters
of nodes.
-- 
Michael



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Mon, Dec 2, 2013 at 7:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-02 18:45:37 +0900, Michael Paquier wrote:
>> On Mon, Dec 2, 2013 at 6:24 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>> > +1. The need for such a test suite has been mentioned every single time that
>> > a bug or new feature related to replication, PITR or hot standby has come
>> > up. So yes please! The only thing missing is someone to actually write the
>> > thing. So if you have the time and energy, that'd be great!
>
>> I am sure you know who we need to convince in this case :)
>
> If you're alluding to Tom, I'd guess he doesn't need to be convinced of
> such a facility in general. I seem to remember him complaining about the
> lack of testing that as well.
> Maybe that it shouldn't be part of the main regression schedule...
>
> +many from me as well. I think the big battle will be how to do it, not
> if in general.

(Reviving an old thread)
So I am planning to seriously focus on this stuff soon, basically
using the TAP tests as the base infrastructure for this regression
test suite. First, does using the TAP tests sound fine?

Off the top of my head, here are the items that should be tested:
- WAL replay: from archive, from stream
- hot standby and read-only queries
- node promotion
- recovery targets and their interactions when multiple targets are
specified (XID, name, timestamp, immediate)
- timelines
- recovery_target_action
- recovery_min_apply_delay (check that WAL is fetched from a source at
the correct interval; a special restore_command can be used for that)
- archive_cleanup_command (check that the command is kicked at each restart point)
- recovery_end_command (check that the command is kicked at the end of recovery)
- timeline jump of a standby after reconnecting to a promoted node
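Several of these items map onto recovery.conf parameters of that era;
purely as an illustration (all paths and values are placeholders, and
the parameters are shown together only for compactness):

```
# illustrative recovery.conf fragment (9.x era)
restore_command = 'cp /path/to/archive/%f "%p"'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
recovery_end_command = 'rm -f /path/to/trigger_file'
recovery_target_name = 'some_restore_point'   # or recovery_target_xid/_time,
                                              # or recovery_target = 'immediate'
recovery_min_apply_delay = '5min'
```

A test harness would generate per-node fragments like this, varying
one parameter at a time.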

Regards,
-- 
Michael



On 3/8/15 6:19 AM, Michael Paquier wrote:
> On Mon, Dec 2, 2013 at 7:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2013-12-02 18:45:37 +0900, Michael Paquier wrote:
>>> On Mon, Dec 2, 2013 at 6:24 PM, Heikki Linnakangas
>>> <hlinnakangas@vmware.com> wrote:
>>>> +1. The need for such a test suite has been mentioned every single time that
>>>> a bug or new feature related to replication, PITR or hot standby has come
>>>> up. So yes please! The only thing missing is someone to actually write the
>>>> thing. So if you have the time and energy, that'd be great!
>>
>>> I am sure you know who we need to convince in this case :)
>>
>> If you're alluding to Tom, I'd guess he doesn't need to be convinced of
>> such a facility in general. I seem to remember him complaining about the
>> lack of testing that as well.
>> Maybe that it shouldn't be part of the main regression schedule...
>>
>> +many from me as well. I think the big battle will be how to do it, not
>> if in general.
>
> (Reviving an old thread)
> So I am planning to seriously focus soon on this stuff, basically
> using the TAP tests as base infrastructure for this regression test
> suite. First, does using the TAP tests sound fine?
>
> On the top of my mind I got the following items that should be tested:
> - WAL replay: from archive, from stream
> - hot standby and read-only queries
> - node promotion
> - recovery targets and their interferences when multiple targets are
> specified (XID, name, timestamp, immediate)
> - timelines
> - recovery_target_action
> - recovery_min_apply_delay (check that WAL is fetch from a source at
> some correct interval, can use a special restore_command for that)
> - archive_cleanup_command (check that command is kicked at each restart point)
> - recovery_end_command (check that command is kicked at the end of recovery)
> - timeline jump of a standby after reconnecting to a promoted node

If we're keeping a list, there's also hot_standby_feedback, 
max_standby_archive_delay and max_standby_streaming_delay.
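For reference, those settings live in the standby's postgresql.conf;
an illustrative fragment (values are placeholders):

```
hot_standby = on
hot_standby_feedback = on          # ask the primary not to remove rows the standby needs
max_standby_archive_delay = 30s
max_standby_streaming_delay = 30s  # -1 waits forever, 0 cancels conflicting queries at once
```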
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



On Sun, Mar 08, 2015 at 08:19:39PM +0900, Michael Paquier wrote:
> So I am planning to seriously focus soon on this stuff, basically
> using the TAP tests as base infrastructure for this regression test
> suite. First, does using the TAP tests sound fine?

Yes.

> On the top of my mind I got the following items that should be tested:
> - WAL replay: from archive, from stream
> - hot standby and read-only queries
> - node promotion
> - recovery targets and their interferences when multiple targets are
> specified (XID, name, timestamp, immediate)
> - timelines
> - recovery_target_action
> - recovery_min_apply_delay (check that WAL is fetch from a source at
> some correct interval, can use a special restore_command for that)
> - archive_cleanup_command (check that command is kicked at each restart point)
> - recovery_end_command (check that command is kicked at the end of recovery)
> - timeline jump of a standby after reconnecting to a promoted node

Those sound good.  The TAP suites still lack support for any Windows target.
If you're inclined to fix that, it would be a great contribution.  The more we
accrue tests before doing that, the harder it will be to dig out.



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Wed, Mar 11, 2015 at 2:47 PM, Noah Misch <noah@leadboat.com> wrote:
> On Sun, Mar 08, 2015 at 08:19:39PM +0900, Michael Paquier wrote:
>> So I am planning to seriously focus soon on this stuff, basically
>> using the TAP tests as base infrastructure for this regression test
>> suite. First, does using the TAP tests sound fine?
>
> Yes.

Check.

>> On the top of my mind I got the following items that should be tested:
>> - WAL replay: from archive, from stream
>> - hot standby and read-only queries
>> - node promotion
>> - recovery targets and their interferences when multiple targets are
>> specified (XID, name, timestamp, immediate)
>> - timelines
>> - recovery_target_action
>> - recovery_min_apply_delay (check that WAL is fetch from a source at
>> some correct interval, can use a special restore_command for that)
>> - archive_cleanup_command (check that command is kicked at each restart point)
>> - recovery_end_command (check that command is kicked at the end of recovery)
>> - timeline jump of a standby after reconnecting to a promoted node
>
> Those sound good.  The TAP suites still lack support for any Windows target.
> If you're inclined to fix that, it would be a great contribution.  The more we
> accrue tests before doing that, the harder it will be to dig out.

Yeah, that's already on my TODO list.
-- 
Michael



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Wed, Mar 11, 2015 at 3:04 PM, Michael Paquier wrote:
> On Wed, Mar 11, 2015 at 2:47 PM, Noah Misch <noah@leadboat.com> wrote:
>> On Sun, Mar 08, 2015 at 08:19:39PM +0900, Michael Paquier wrote:
>>> So I am planning to seriously focus soon on this stuff, basically
>>> using the TAP tests as base infrastructure for this regression test
>>> suite. First, does using the TAP tests sound fine?
>>
>> Yes.
>
> Check.
>
>>> On the top of my mind I got the following items that should be tested:
>>> - WAL replay: from archive, from stream
>>> - hot standby and read-only queries
>>> - node promotion
>>> - recovery targets and their interferences when multiple targets are
>>> specified (XID, name, timestamp, immediate)
>>> - timelines
>>> - recovery_target_action
>>> - recovery_min_apply_delay (check that WAL is fetch from a source at
>>> some correct interval, can use a special restore_command for that)
>>> - archive_cleanup_command (check that command is kicked at each restart point)
>>> - recovery_end_command (check that command is kicked at the end of recovery)
>>> - timeline jump of a standby after reconnecting to a promoted node

So, since I had a clear picture of what I wanted to do regarding this
stuff (even if this is a busy commit fest, sorry), I have been toying
around with Perl and have finished with the attached patch, adding
some base structure for a new test suite covering recovery.

This patch includes basic tests for the following items:
- node promotion; tests of archiving, streaming, and cascading replication
- recovery targets XID, name, timestamp, immediate, and PITR
- timeline jump of a standby when reconnecting to a newly-promoted standby
- replay delay
Tests are located in src/test/recovery and are not part of the main
test suite, similarly to the ssl stuff.
I have dropped recovery_target_action for the time being, as long as
the matter on the other thread is not settled
(http://www.postgresql.org/message-id/20150315132707.GB19792@alap3.anarazel.de),
but I don't think that it would be complicated to create tests for
that, btw.

The most important part of this patch is not the tests themselves but
the base set of routines allowing one to simply create nodes, take
backups, create standbys from backups, and set up nodes to do things
like streaming, archiving, or restoring from archives. Many
configurations are of course possible in recovery.conf, but the
routines this patch presents are made to be *simple*, so as not to
overcomplicate the way tests can be written.

Feedback is of course welcome, but note that I am not seriously
expecting any until we get into 9.6 development cycle and I am adding
this patch to the next CF.

Regards,
--
Michael

Attachment

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Wed, Mar 18, 2015 at 1:59 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> Feedback is of course welcome, but note that I am not seriously
> expecting any until we get into 9.6 development cycle and I am adding
> this patch to the next CF.

I have moved this patch to CF 2015-09, as I have enough patches to
take care of for now... Let's focus on Windows support and improvements
to TAP logging in the first round. That will already be a good step
forward.
-- 
Michael



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Mon, Jun 29, 2015 at 10:11 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Mar 18, 2015 at 1:59 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> Feedback is of course welcome, but note that I am not seriously
>> expecting any until we get into 9.6 development cycle and I am adding
>> this patch to the next CF.
>
> I have moved this patch to CF 2015-09, as I have enough patches to
> take care of for now... Let's focus on Windows support and improvement
> of logging for TAP in the first round. That will be already a good
> step forward.

OK, attached is a new version of this patch, which I have largely
reworked to have more user-friendly routines for the tests. The number
of tests is still limited, but it shows what this facility can do;
that's on purpose, as it does not make much sense to code a complete
and complicated set of tests as long as the core routines are not
stable, so let's focus on those first.
I have not yet tested on Windows; I am expecting some tricks to be
needed for the archive and recovery commands generated for the tests.
Regards,
--
Michael

Attachment

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Fri, Aug 14, 2015 at 12:54 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Mon, Jun 29, 2015 at 10:11 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Mar 18, 2015 at 1:59 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> Feedback is of course welcome, but note that I am not seriously
>>> expecting any until we get into 9.6 development cycle and I am adding
>>> this patch to the next CF.
>>
>> I have moved this patch to CF 2015-09, as I have enough patches to
>> take care of for now... Let's focus on Windows support and improvement
>> of logging for TAP in the first round. That will be already a good
>> step forward.
>
> OK, attached is a new version of this patch, that I have largely
> reworked to have more user-friendly routines for the tests. The number
> of tests is still limited still it shows what this facility can do:
> that's on purpose as it does not make much sense to code a complete
> and complicated set of tests as long as the core routines are not
> stable, hence let's focus on that first.
> I have not done yet tests on Windows, I am expecting some tricks
> needed for the archive and recovery commands generated for the tests.

Attached is v3. I have tested and fixed the tests so that they can run
on Windows. archive_command and restore_command use Windows' copy when
needed. There was also a bug with the use of a hot standby instead of
a warm one, causing test 002 to fail.
I am rather happy with the shape of this patch now, so feel free to review it...
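The platform-dependent commands in question follow the documented
archiving pattern, roughly like this (paths are placeholders):

```
# Unix
archive_command = 'cp "%p" /path/to/archive/%f'
restore_command = 'cp /path/to/archive/%f "%p"'

# Windows (copy is a cmd.exe built-in; backslashes must be doubled)
archive_command = 'copy "%p" "C:\\path\\to\\archive\\%f"'
restore_command = 'copy "C:\\path\\to\\archive\\%f" "%p"'
```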
Regards,
--
Michael

Attachment

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Amir Rohan
Date:
On 08/14/2015 06:32 AM, Michael Paquier wrote:
> On Fri, Aug 14, 2015 at 12:54 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Mon, Jun 29, 2015 at 10:11 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Wed, Mar 18, 2015 at 1:59 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> Feedback is of course welcome, but note that I am not seriously
>>>> expecting any until we get into 9.6 development cycle and I am adding
>>>> this patch to the next CF.
>>> I have moved this patch to CF 2015-09, as I have enough patches to
>>> take care of for now... Let's focus on Windows support and improvement
>>> of logging for TAP in the first round. That will be already a good
>>> step forward.
>> OK, attached is a new version of this patch, that I have largely
>> reworked to have more user-friendly routines for the tests. The number
>> of tests is still limited still it shows what this facility can do:
>> that's on purpose as it does not make much sense to code a complete
>> and complicated set of tests as long as the core routines are not
>> stable, hence let's focus on that first.
>> I have not done yet tests on Windows, I am expecting some tricks
>> needed for the archive and recovery commands generated for the tests.
> Attached is v3. I have tested and fixed the tests such as they can run
> on Windows. archive_command and restore_command are using Windows'
> copy when needed. There was also a bug with the use of a hot standby
> instead of a warm one, causing test 002 to fail.
> I am rather happy with the shape of this patch now, so feel free to review it...
> Regards,

Michael, I've run these and they worked fine for me.
See attached patch with a couple of minor fixes.

Amir

Attachment

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Fri, Sep 25, 2015 at 3:11 PM, Amir Rohan <amir.rohan@mail.com> wrote:
> Michael, I've run these and they worked fine for me.
> See attached patch with a couple of minor fixes.

Thanks! I still think that we could improve a bit more the way
parametrization is done in postgresql.conf when a node is initialized,
by appending a list of parameters, or by having a set of hardcoded
behaviors including a set of default parameters and their values...
But well, feedback is welcome regarding that. I also arrived at the
conclusion that it would be better to place the new package file in
src/test/perl instead of src/test/recovery, to allow any users of the
TAP tests to have it in their PERL5LIB path and to be able to call the
new routines to create and manipulate nodes.
-- 
Michael

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Amir Rohan
Date:
On 09/25/2015 09:29 AM, Michael Paquier wrote:
> 
> 
> On Fri, Sep 25, 2015 at 3:11 PM, Amir Rohan <amir.rohan@mail.com
> <mailto:amir.rohan@mail.com>> wrote:
> 
>     On 08/14/2015 06:32 AM, Michael Paquier wrote:

>     > I am rather happy with the shape of this patch now, so feel free
>     to review it...
>     > Regards,
> 
>     Michael, I've ran these and it worked fine for me.
>     See attached patch with a couple of minor fixes.
> 
> 
> Thanks! I still think that we could improve a bit more the way
> parametrization is done in postgresql.conf when a node is initialized by
> appending a list of parameters or have a set of hardcoded behaviors
> including a set of default parameters and their values... But well
> feedback is welcome regarding that. I also arrived at the conclusion
> that it would be better to place the new package file in src/test/perl
> instead of src/test/recovery to allow any users of the TAP tests to have
> it in their PERL5LIB path and to be able to call the new routines to
> create and manipulate nodes.
> -- 
> Michael 

Having a subcommand in Greg's PEG (http://github.com/gregs1104/peg)
that allows you to create one of several "canned" clusters would be
convenient as well, for manual testing and fooling around with features.
Amir



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Fri, Sep 25, 2015 at 5:57 PM, Amir Rohan <amir.rohan@mail.com> wrote:
>
> Having a subcommand in Greg's PEG (http://github.com/gregs1104/peg),
> that allows you to create one of several "canned" clusters would be
> convenient as well, for manual testing and folling around with features.


That's the kind of thing that each serious developer on this mailing
list already has, in a rather different shape but with the same final
result: offer a way to set up a cluster ready for hacking in a single
command, and be able to switch among them easily. I am not sure we
would really find a cross-platform script generic enough to satisfy
all the needs of folks here and integrate it directly in the tree :)
-- 
Michael



Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Amir Rohan
Date:
On 09/25/2015 01:47 PM, Michael Paquier wrote:
> On Fri, Sep 25, 2015 at 5:57 PM, Amir Rohan <amir.rohan@mail.com> wrote:
>>

> That's the kind of thing that each serious developer on this mailing
> list already has in a rather different shape but with the same final
> result: 

Oh, I guess I'll have to write one then. :)


> offer a way to set up a cluster ready for hacking in a single
> command, and being able to switch among them easily. I am not sure we
> would really find a cross-platform script generic enough to satisfy
> all the needs of folks here and integrate it directly in the tree :)
> 

Yes, perl/bash seems to be the standard and both have their shortcomings,
in either convenience or familiarity... a static binary +
scripts/configuration would be the path of least resistance, and it
doesn't actually have to live in the tree, but it needs to be good
enough out of the box that someone new will prefer to use/improve it
over rolling their own from scratch, and people need to know it's there.

Coming back to the recovery testing package...
I was investigating some behaviour having to do with recovery
and tried your new library to write a repro case. This uncovered some
implicit assumptions in the package that can make things difficult
when violated. I had to rewrite nearly every function I used.

Major pain points:

1) The lib stores metadata (ports, paths, etc.) using ports as keys.
-> Assumes ports aren't reused.
-> Because it assumes servers keep running until teardown.

And

2) Behaviour (paths in particular) is hardwired rather than overridable
defaults.

This is exactly what I needed to test; problems:
3) Can't stop a server without clearing its testing data (the maps holding
paths and things). But that data might be specifically
needed, in particular the backup shouldn't disappear when the
server melts down or we have a very low-grade DBA on our hands.
4) Port assignment relies on liveness checks on running servers.
If a server is shut down and a new one instantiated, the port will get
reused, data will get trashed, and various confusing things can happen.
5) Servers are shut down with -m 'immediate', which can lead to races
in the script when archiving is turned on. That may be good for some
tests, but there's no control over it.

Other issues:
6. Directory structure: currently one directory per thing, but it would
be more logical to place all things related to an instance under a
single directory, named according to role (57333_backup, and so on).
7. enable_restoring() uses "cp -i" as 'archive_command', not a good fit
for an automated test.

Aside from running the tests, the convenience of writing them
needs to be considered. My perl is very weak, it's been at least
a decade, but it was difficult to make progress because everything
is geared toward a batch "pass/fail" run. Stdout is redirected,
and the log files can't be tailed with 'tail --retry -f' in another
terminal, because they're clobbered at every run. Also:
8. No canned way to output a pretty-printed overview of the running
system (paths, ports, for manual checking).
9. Finding things is difficult, see 6.
10. If a test passes/fails or dies due to a bug, everything is cleaned up.
Great for testing, bad for postmortem.
11. A canned "server is responding to queries" helper would be convenient.

It might be a good idea to:
1) Never reuse ports during a test. Liveness checking is used
to avoid collisions, but should not determine order of assignment.
2) Decouple cleanup from server shutdown. Do the cleanup as the end of
test only, and allow the user to keep things around.
3) Adjust the directory structure to one top directory per server with
(PGDATA, backup, archive) subdirs.
4) Instead of passing ports around as keys, have explicit functions
which can be called directly by the user (I'd like the backup *HERE*
please), with the current functions refactored to merely invoke them
by interpolating in the values associated with the port they were given.
4b) Server shutdown should perhaps be "smart" by default, or segmented
into calmly_bring_to_a_close(), pull_electric_plug() and
drop_down_the_stairs_into_swimming_pool().

Regards,
Amir








Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From
Michael Paquier
Date:
On Sat, Sep 26, 2015 at 10:25 AM, Amir Rohan wrote:
> On 09/25/2015 01:47 PM, Michael Paquier wrote:
>> offer a way to set up a cluster ready for hacking in a single
>> command, and being able to switch among them easily. I am not sure we
>> would really find a cross-platform script generic enough to satisfy
>> all the needs of folks here and integrate it directly in the tree :)
>>
> Yes, perl/bash seems to be the standard and both have their shortcomings,
> in either convenience or familiarity...

Note as well that bash does not run on Windows, so this should be in
perl for the cross-platform requirements.

> Coming back to the recovery testing package...

Thanks for providing input on this patch!

> I was investigating some behaviour having to do with recovery
> and tried your new library to write a repro case. This uncovered some
> implicit assumptions in the package that can make things difficult
> when violated. I had to rewrite nearly every function I used.

OK. Noted. Could you be more explicit here, with for example a
script or a patch?

> Major pain points:
> 1) The lib stores metadata (ports, paths, etc.) using ports as keys.
> -> Assumes ports aren't reused.

What would be the point of reusing the same port number in a test for
different nodes? Ports are assigned automatically depending on their
availability, by checking whether a server is listening on them, so they
are never reused as long as the node is running; that's a non-issue IMO.
Any server instances created during the tests should never use a
user-defined port, for portability. Hence using those ports as keys
just made sense. We could have for example custom names that have
port values assigned to them, but that's actually overkill and
complicates the whole facility.

> -> Because it assumes servers keep running until teardown.

Check. Locking down the port number is the problem here.

> And
> 2) Behaviour (paths in particular) is hardwired rather than overridable
> defaults.

This is the case of all the TAP tests. We could always use the same
base directory for all the nodes and then embed a sub-directory whose
name is decided using the port number. But I am not really sure if
that's a win.

> This is exactly what I needed to test, problems:
> 3) Can't stop server without clearing its testing data (the maps holding
> paths and things). But that data might be specifically
> needed, in particular the backup shouldn't disappear when the
> server melts down or we have a very low-grade DBA on our hands.

OK, you have a point here. You may indeed want routines to enable
and disable a node completely decoupled from start and stop, with
something like enable_node and disable_node that basically register
or unregister it from the list of active nodes. I have updated the
patch this way.
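To make the idea concrete, here is a minimal, hypothetical sketch of such a registry (the real patch is Perl; names like register_node are invented for illustration):

```python
# Hypothetical sketch: node metadata survives a stop, and is only
# dropped by an explicit teardown at the end of the test.
active_nodes = {}  # port -> node metadata (paths, running flag, ...)

def register_node(port, meta):
    """Track a node's paths and state under its port number."""
    meta["running"] = True
    active_nodes[port] = meta

def stop_node(port):
    """Stop the server (a pg_ctl stop call would go here) but keep its
    metadata registered, so backups and archives stay inspectable."""
    active_nodes[port]["running"] = False

def teardown_node(port):
    """Forceful stop plus unregister; end-of-test cleanup only."""
    active_nodes.pop(port, None)
```

With this split, a test can stop a node, poke at its backup directory, and only forget about it when the whole test tears down.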

> 4) Port assignment relies on liveness checks on running servers.
> If a server is shut down and a new one instantiated, the port will get
> reused, data will get trashed, and various confusing things can happen.

Right. The safest way to do that is to check in get_free_port if a
port number is used by a registered node, and continue to loop in if
that's the case. So done.
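The combined rule could look like this (a language-agnostic sketch in Python; the function name and port range are illustrative, not the patch's actual code):

```python
import socket

reserved_ports = set()  # every port handed out in this run, never recycled

def port_is_free(port):
    """Liveness check: true if nothing is listening on localhost:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("127.0.0.1", port)) != 0

def get_free_port(start=5433):
    """Return a port that is currently free AND was never handed out
    before, so a stopped node's port cannot be grabbed by a new node."""
    port = start
    while port in reserved_ports or not port_is_free(port):
        port += 1
    reserved_ports.add(port)
    return port
```

The reserved set is what prevents the data-trashing scenario above: the liveness check alone would happily recycle the port of a node that was merely stopped.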

> 5) Servers are shutdown with -m 'immediate', which can lead to races
> in the script when archiving is turned on. That may be good for some
> tests, but there's no control over it.

I hesitated with fast here actually. So changed this way. We would
want as well a teardown command to stop the node with immediate and
unregister the node from the active list.

> Other issues:
> 6. Directory structure: currently one directory per thing, but it would
> be more logical to place all things related to an instance under a
> single directory, named according to role (57333_backup, and so on).

Er, well. The first version of the patch did so, and then I switched
to an approach closer to what the existing TAP facility is doing. But
well let's simplify things a bit.

> 7. enable_restoring() uses "cp -i" as 'archive_command', not a good fit
> for an automated test.

This seems like a good default to me, and it's actually easily portable
to Windows. One could always append a custom archive_command in a
test when for example testing conflicting archiving when archive_mode
= always.
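For instance, a test overriding the default might append something like the fragment below (the archive path is a placeholder; on 9.5-era servers restore_command lives in recovery.conf, and dropping -i guarantees cp can never stop to prompt):

```
# appended to postgresql.conf (hypothetical archive path)
archive_mode = on
archive_command = 'cp "%p" "/path/to/archive/%f"'

# appended to recovery.conf on the standby
restore_command = 'cp "/path/to/archive/%f" "%p"'
```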

> Aside from running the tests, the convenience of writing them
> needs to be considered. My perl is very weak, it's been at least
> a decade, but it was difficult to make progress because everything
> is geared toward a batch "pass/fail" run. Stdout is redirected,
> and the log files can't be tailed with 'tail --retry -f' in another
> terminal, because they're clobbered at every run.

This relies on the existing TAP infrastructure and this has been
considered as the most portable way of doing by including Windows.
This patch is not aiming at changing that, and will use as much as
possible the existing infrastructure.

> 8. No canned way to output a pretty-printed overview of the running system
> (paths, ports, for manual checking).

Hm. Why not... Are you suggesting something like print_current_conf
that goes through all the registered nodes and outputs that? How would
you use it?

> 9. Finding things is difficult, See 6.

See my comment above.

> 10. If a test passes/fails or dies due to a bug, everything is cleaned.
> Great for testing, bad for postmortem.

That's something directly related to TestLib.pm, where
File::Temp::tempdir creates a temporary path with CLEANUP => 1. We had
discussions regarding that actually...

> 11. a canned "server is responding to queries" helper would be convenient.

Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?
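One way such a helper could be built is a generic poller wrapped around pg_isready (the poller below and server_is_ready are invented names for this sketch; pg_isready and its -h/-p flags are the real binary and options):

```python
import subprocess
import time

def wait_for(check, timeout=5.0, interval=0.1):
    """Poll a boolean callable until it returns True or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False

def server_is_ready(port):
    """True when pg_isready reports the server accepts connections."""
    return subprocess.run(
        ["pg_isready", "-h", "localhost", "-p", str(port)],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0
```

A test would then call something like `wait_for(lambda: server_is_ready(5433), timeout=30)` right after starting or promoting a node, instead of sleeping a fixed amount.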

> It might be a good idea to:
> 1) Never reuse ports during a test. Liveness checking is used
> to avoid collisions, but should not determine order of assignment.

Agreed. As far as I can see, the problem here is related to the fact
that the port of a non-running server may be fetched by another one.
That's a bug in my patch.

> 2) Decouple cleanup from server shutdown. Do the cleanup as the end of
> test only, and allow the user to keep things around.

Agreed here.

> 3) Adjust the directory structure to one top directory per server with
> (PGDATA, backup, archive) subdirs.

Hm. OK. The first version of the patch actually did so.

> 4) Instead of passing ports around as keys, have _explicit functions
> which can be called directly by the user (I'd like the backup *HERE*
> please), with the current functions refactored to merely invoke them
> by interpolating in the values associated with the port they were given.

I don't really see how this would be a win. We definitely should
have all the data depending on temporary paths during the tests to
facilitate the cleanup wrapper's work.

> 4b) server shutdown should perhaps be "smart" by default, or segmented
> into calmly_bring_to_a_close(), pull_electric_plug() and
> drop_down_the_stairs_into_swimming_pool().

Nope, not agreeing here. "immediate" is rather violent to stop a node,
hence I have switched it to use "fast", and there is now a
teardown_node routine that uses immediate, which is more aimed at
cleaning up existing things fiercely.

I have as well moved RecoveryTest.pm to src/test/perl so that all the
consumers of prove_check can use it by default, and decoupled
start_node from make_master and make_*_standby so that it is possible to
add for example new parameters to their postgresql.conf and
recovery.conf files before starting them.

Thanks a lot for the feedback! Attached is an updated patch with all
the things mentioned above done. Are included as well the typo fixes
you sent upthread.
Regards,
--
Michael

Attachment
On 10/02/2015 03:33 PM, Michael Paquier wrote:

Michael, I'm afraid my email bungling has damaged your thread.

I didn't include an "In-reply-To" header when I posted:

trinity-b4a8035d-59af-4c42-a37e-258f0f28e44a-1443795007012@3capp-mailcom-lxa08.

And we subsequently had our discussion over there instead of here, where
the commitfest app is tracking it.

https://commitfest.postgresql.org/6/197/

Perhaps it would help a little if you posted the latest patch here as
well? So that at least the app picks it up again.

Apologies for my ML n00bness,
Amir






On October 4, 2015 3:27:00 PM GMT+02:00, Amir Rohan <amir.rohan@zoho.com> wrote:

>Perhaps it would help a little if you posted the latest patch here as
>well? So that at least the app picks it up again.

You can add additional threads in the cf app.

-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.

Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On 10/04/2015 04:29 PM, Andres Freund wrote:
> On October 4, 2015 3:27:00 PM GMT+02:00, Amir Rohan <amir.rohan@zoho.com> wrote:
> 
>> Perhaps it would help a little if you posted the latest patch here as
>> well? So that at least the app picks it up again.
> 
> You can add additional threads in the cf app.
> 

Done, thank you.




On 25 September 2015 at 14:29, Michael Paquier
<michael.paquier@gmail.com> wrote:

> I also arrived at the conclusion that it would be
> better to place the new package file in src/test/perl instead of
> src/test/recovery to allow any users of the TAP tests to have it in their
> PERL5LIB path and to be able to call the new routines to create and
> manipulate nodes.

While it's Python not Perl, you might find it interesting that support
for the replication protocol is being added to psycopg2, the Python
driver for PostgreSQL. I've been reviewing the patch at
https://github.com/psycopg/psycopg2/pull/322 .

I'm using it to write protocol validation for a logical decoding
plugin at the moment, so that the decoding plugin's output can be
validated in a consistent way for easily controlled inputs.

Perhaps it's worth teaching DBD::Pg to do this? Or adopting psycopg2
for some optional protocol tests...

--
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services