Thread: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

"Amir Rohan"

Date:

02 October 2015, 14:10:19

On 10/02/2015 03:33 PM, Michael Paquier wrote:

> Any server instances created during the tests should never use a
> user-defined port for portability. Hence using those ports as keys
> just made sense. We could have for example custom names, that have
> port values assigned to them, but that's actually an overkill and
> complicates the whole facility.
> 

Something like:
   global nPortsAssigned = 0;   AssignPort() -> return is_ok(nPortsAssigned++)

was what I used.

>> 2) Behaviour (paths in particular) is hardwired rather then overridable
>> defaults.
> 
> This is the case of all the TAP tests. We could always use the same
> base directory for all the nodes and then embed a sub-directory whose
> name is decided using the port number. But I am not really sure if
> that's a win.
> 

I understand, but it eliminates the kind of scenarios this convenience
package lets you express... conveniently.

>> This is exactly what I needed to test, problems:
>> 3) Can't stop server without clearing its testing data (the maps holding
>> paths and things). But that data might be specifically
>> needed, in particular the backup shouldn't disappear when the
>> server melts down or we have a very low-grade DBA on our hands.
> 
> OK, you have a point here. You may indeed want routines for to enable
> and disable a node completely decoupled from start and stop, with
> something like enable_node and disable_node that basically registers
> or unregisters it from the list of active nodes. I have updated the
> patch this way.
> 

Excellent.

>> 4) Port assignment relies on liveness checks on running servers.
>> If a server is shut down and a new instantiated, the port will get
>> reused, data will get trashed, and various confusing things can happen.
> 
> Right. The safest way to do that is to check in get_free_port if a
> port number is used by a registered node, and continue to loop in if
> that's the case. So done.
> 

That eliminates the "sweet and gentle" variant of the scenario, but it's
susceptible to the "ABA problem":
https://en.wikipedia.org/wiki/ABA_problem
https://youtu.be/CmxkPChOcvw?t=786

Granted, you have to try fairly hard to shoot yourself in the leg,
but since the solution is so simple, why not? If we never reuse ports
within a single test, this goes away.

>> 5) Servers are shutdown with -m 'immediate', which can lead to races
>> in the script when archiving is turned on. That may be good for some
>> tests, but there's no control over it.
> 
> I hesitated with fast here actually. So changed this way. We would
> want as wall a teardown command to stop the node with immediate and
> unregister the node from the active list.
> 

In particular, I was shutting down an archiving node and the archiving
was truncated. I *think* smart doesn't do this. But again, it's really
that the test writer can't easily override, not that the default is wrong.

>> Other issues:
>> 6. Directory structure, used one directory per thing but more logical
>> to place all things related to an instance under a single directory,
>> and name them according to role (57333_backup, and so on).
> 
> Er, well. The first version of the patch did so, and then I switched
> to an approach closer to what the existing TAP facility is doing. But
> well let's simplify things a bit.
> 

I know, I know, but:
1) an instance is a "thing" in your script, so having its associated
paraphernalia in one place makes more sense (maybe only to me).
2) That's often how folks (well, how I) arrange things in deployment,
at least with archive/backup as symlinks to the nas.

Alternatively, naming the dirs with a prefix (srv_foo_HASH,
backup_foo_backupname_HASH, etc') would work as well.

>> 7. enable_restoring() uses "cp -i" 'archive_command', not a good fit
>> for an automated test.
> 
> This seems like a good default to me, and actually that's portable on
> Windows easily. One could always append a custom archive_command in a
> test when for example testing conflicting archiving when archive_mode
> = always.
> 

Ok, I wasn't sure about this, but specifically activating a switch that
asks for input from the user during a test? hmm.
>> 8. No canned way to output a pprinted overview of the running system
>> (paths, ports, for manual checking).
> 
> Hm. Why not... Are you suggesting something like print_current_conf
> that goes through all the registered nodes and outputs that? How would
> you use it?
>

For one thin, I could open a few terminals and `$(print_env_for_server
5437), so psql just worked.

I wish PEG had that as well.


>> 10. If a test passes/fails or dies due to a bug, everything is cleaned.
>> Great for testing, bad for postmortem.
> 
> That's something directly related to TestLib.pm where
> File:Temp:tempdir creates a temporary path with CLEANUP = 1. We had
> discussions regarding that actually...
> 

>> 11. a canned "server is responding to queries" helper would be convenient.
> 
> Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?
> 

Block until recovery is finished, before testing. eliminate races, and
avoid the stupid sleep(3) I used.

>> It might be a good idea to:
>> 1) Never reuse ports during a test. Liveness checking is used
>> to avoid collisions, but should not determine order of assignment.
> 
> Agreed. As far as I can see the problem here is related to the fact
> that the port of non-running server may be fetched by another one.
> That's a bug of my patch.
> 
>> 2) Decouple cleanup from server shutdown. Do the cleanup as the end of
>> test only, and allow the user to keep things around.
> 
> Agreed here.
> 
>> 3) Adjust the directory structure to one top directory per server with
>> (PGDATA, backup, archive) subdirs.
> 
> Hm. OK. The first version of the patch actually did so.
> 

Well, why does "consistency with TAP test" trump the advantages I
mentioned? does TAP actually care?

>> 4) Instead of passing ports around as keys, have _explicit functions
>> which can be called directly by the user (I'd like the backup *HERE*
>> please), with the current functions refactored to merely invoke them
>> by interpolating in the values associated with the port they were given.
> 
> I don't really see in what this would be a win. We definitely should
> have all the data depending on temporary paths during the tests to
> facilitate the cleanup wrapper work.
> 

The trouble was that I reused paths between servers. If shutdown/cleanup
are decoupled, this is probably not needed.

>> 4b) server shutdown should perhaps be "smart" by default, or segmented
>> into calmly_bring_to_a_close(), pull_electric_plug() and
>> drop_down_the_stairs_into_swimming_pool().
> 
> Nope, not agreeing here. "immediate" is rather violent to stop a node,
> hence I have switched it to use "fast" and there is now a
> teardown_node routine that uses immediate, that's more aimed at
> cleanup up existing things fiercely.
> 

Ok, not as the default, but possible to request a specific kind of
shutdown. I needed smart in my case. Plus, in a scenario, you might
expressly be testing behavior for a specific mode, it needs to be
controllable.

> I have as well moved RecoveryTest.pm to src/test/perl so as all the
> consumers of prove_check can use it by default, and decoupled
> start_node from make_master and make_*_standby so as it is possible to
> add for example new parameters to their postgresql.conf and
> recovery.conf files before starting them.
> 
> Thanks a lot for the feedback! Attached is an updated patch with all
> the things mentioned above done. Are included as well the typo fixes
> you sent upthread.
> Regards,
> 

Great! I'll look at it and post again if there's more. If any of the
above extra explanations make it clearer why I suggested some changes
you didn't like...

Thanks,
Amir

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

03 October 2015, 11:39:06

On Fri, Oct 2, 2015 at 11:10 PM, Amir Rohan wrote:
> On 10/02/2015 03:33 PM, Michael Paquier wrote:
> > Any server instances created during the tests should never use a
> > user-defined port for portability. Hence using those ports as keys
> > just made sense. We could have for example custom names, that have
> > port values assigned to them, but that's actually an overkill and
> > complicates the whole facility.
> >
>
> Something like:
>
>     global nPortsAssigned = 0;
>     AssignPort() -> return is_ok(nPortsAssigned++)
>
> was what I used.

Why do you need that. Creating a node is in the most basic way a
matter of calling make_master, make_*_standby where a port number, or
identifier gets uniquely assigned to a node. The main point of the
approach taken by this patch is to make port assignment transparent
for the caller.

> >> 4) Port assignment relies on liveness checks on running servers.
> >> If a server is shut down and a new instantiated, the port will get
> >> reused, data will get trashed, and various confusing things can happen.
> >
> > Right. The safest way to do that is to check in get_free_port if a
> > port number is used by a registered node, and continue to loop in if
> > that's the case. So done.
> >
>
> That eliminates the "sweet and gentle" variant of the scenario, but it's
> susceptible to the "ABA problem":
> https://en.wikipedia.org/wiki/ABA_problem
> https://youtu.be/CmxkPChOcvw?t=786

I learnt a new thing here. That's basically an existing problem even
with the existing perl test infrastructure relying on TestLib.pm when
tests are run in parallel. What we would need here is a global mapping
file storing all the port numbers used by all the nodes currently in
the tests.

>
> Granted, you have to try fairly hard to shoot yourself in the leg,
> but since the solution is so simple, why not? If we never reuse ports
> within a single test, this goes away.

Well, you can reuse the same port number in a test. Simply teardown
the existing node and then recreate a new one. I think that port
number assignment to a node should be transparent to the caller, in
our case the perl test script holding a scenario.

> >> 5) Servers are shutdown with -m 'immediate', which can lead to races
> >> in the script when archiving is turned on. That may be good for some
> >> tests, but there's no control over it.
> >
> > I hesitated with fast here actually. So changed this way. We would
> > want as wall a teardown command to stop the node with immediate and
> > unregister the node from the active list.
> >
>
> In particular, I was shutting down an archiving node and the archiving
> was truncated. I *think* smart doesn't do this. But again, it's really
> that the test writer can't easily override, not that the default is wrong.

Ah, OK. Then fast is just fine. It shuts down the node correctly.
"smart" would wait for all the current connections to finish but I am
wondering if it currently matters here: I don't see yet a clear use
case yet where it would make sense to have multi-threaded script... If
somebody comes back with a clear idea here perhaps we could revisit
that but it does not seem worth it now.

> >> Other issues:
> >> 6. Directory structure, used one directory per thing but more logical
> >> to place all things related to an instance under a single directory,
> >> and name them according to role (57333_backup, and so on).
> >
> > Er, well. The first version of the patch did so, and then I switched
> > to an approach closer to what the existing TAP facility is doing. But
> > well let's simplify things a bit.
> >
>
> I know, I know, but:
> 1) an instance is a "thing" in your script, so having its associated
> paraphernalia in one place makes more sense (maybe only to me).
> 2) That's often how folks (well, how I) arrange things in deployment,
> at least with archive/backup as symlinks to the nas.
>
> Alternatively, naming the dirs with a prefix (srv_foo_HASH,
> backup_foo_backupname_HASH, etc') would work as well.

The useful portion about tempdir is that it cleans up itself
automatically should an error happen. It does not seem to me we want
use that.

> >> 7. enable_restoring() uses "cp -i" 'archive_command', not a good fit
> >> for an automated test.
> >
> > This seems like a good default to me, and actually that's portable on
> > Windows easily. One could always append a custom archive_command in a
> > test when for example testing conflicting archiving when archive_mode
> > = always.
> >
>
> Ok, I wasn't sure about this, but specifically activating a switch that
> asks for input from the user during a test? hmm.

Er... The -i switch is a bad idea. I removed it. Honestly I don't
recall why it was here to begin with...

>  >> 8. No canned way to output a pprinted overview of the running system
> >> (paths, ports, for manual checking).
> >
> > Hm. Why not... Are you suggesting something like print_current_conf
> > that goes through all the registered nodes and outputs that? How would
> > you use it?
> >
>
> For one thin, I could open a few terminals and `$(print_env_for_server
> 5437), so psql just worked.
> I wish PEG had that as well.

Hm. Isn't that coupled with the case where a failure happens then but
tempdirs are cleaned up then? I would expect hackers to run those runs
until the end. If a failure happens, it would then be useful to get a
dump of what happens. However, it seems to me that we can get the same
information by logging all the information when creating a node in the
log file. I have added a routine in this sense, which is called each
time a node is initialized. It seems helpful either way.

> >> 11. a canned "server is responding to queries" helper would be convenient.
> >
> > Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?
> >
>
> Block until recovery is finished, before testing. eliminate races, and
> avoid the stupid sleep(3) I used.

TODO

> >> 4b) server shutdown should perhaps be "smart" by default, or segmented
> >> into calmly_bring_to_a_close(), pull_electric_plug() and
> >> drop_down_the_stairs_into_swimming_pool().
> >
> > Nope, not agreeing here. "immediate" is rather violent to stop a node,
> > hence I have switched it to use "fast" and there is now a
> > teardown_node routine that uses immediate, that's more aimed at
> > cleanup up existing things fiercely.
> >
>
> Ok, not as the default, but possible to request a specific kind of
> shutdown. I needed smart in my case. Plus, in a scenario, you might
> expressly be testing behavior for a specific mode, it needs to be
> controllable.

If your test script is running with a single thread, "fast" or "smart"
would not really make a difference, no?

> > I have as well moved RecoveryTest.pm to src/test/perl so as all the
> > consumers of prove_check can use it by default, and decoupled
> > start_node from make_master and make_*_standby so as it is possible to
> > add for example new parameters to their postgresql.conf and
> > recovery.conf files before starting them.
> >
> > Thanks a lot for the feedback! Attached is an updated patch with all
> > the things mentioned above done. Are included as well the typo fixes
> > you sent upthread.
> > Regards,
> >
>
> Great! I'll look at it and post again if there's more. If any of the
> above extra explanations make it clearer why I suggested some changes
> you didn't like...

I am attaching a new version. I found a small bug in test case 001
when checking if standby 2 has caught up. There is also this dump
function that is helpful. The -i switch in cp command has been removed
as well.
--
Michael

Attachment

20151003_recovery_regressions_v5.patch

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

03 October 2015, 12:50:53

On 10/03/2015 02:38 PM, Michael Paquier wrote:
> On Fri, Oct 2, 2015 at 11:10 PM, Amir Rohan wrote:
>> On 10/02/2015 03:33 PM, Michael Paquier wrote:
>>> Any server instances created during the tests should never use a
>>> user-defined port for portability. Hence using those ports as keys
>>> just made sense. We could have for example custom names, that have
>>> port values assigned to them, but that's actually an overkill and
>>> complicates the whole facility.
>>>
>>
>> Something like:
>>
>>     global nPortsAssigned = 0;
>>     AssignPort() -> return is_ok(nPortsAssigned++)
>>
>> was what I used.
> 
> Why do you need that. Creating a node is in the most basic way a
> matter of calling make_master, make_*_standby where a port number, or
> identifier gets uniquely assigned to a node. The main point of the
> approach taken by this patch is to make port assignment transparent
> for the caller.
> 

See next.

>> Granted, you have to try fairly hard to shoot yourself in the leg,
>> but since the solution is so simple, why not? If we never reuse ports
>> within a single test, this goes away.
> 
> Well, you can reuse the same port number in a test. Simply teardown
> the existing node and then recreate a new one. I think that port
> number assignment to a node should be transparent to the caller, in
> our case the perl test script holding a scenario.
> 

What part of "Never assign the same port twice during one test"
makes this "not transparent to the user"?

If you're thinking about parallel test, I don't think you
need to worry. Availability checks take care of one part,
and the portnum-as-map-key-is-test-local takes care of the
other.

But, see next.

> 
>>>> 4) Port assignment relies on liveness checks on running servers.
>>>> If a server is shut down and a new instantiated, the port will get
>>>> reused, data will get trashed, and various confusing things can happen.
>>>
>>> Right. The safest way to do that is to check in get_free_port if a
>>> port number is used by a registered node, and continue to loop in if
>>> that's the case. So done.
>>>
>>
>> That eliminates the "sweet and gentle" variant of the scenario, but it's
>> susceptible to the "ABA problem":
>> https://en.wikipedia.org/wiki/ABA_problem
>> https://youtu.be/CmxkPChOcvw?t=786
> 
> I learnt a new thing here. That's basically an existing problem even
> with the existing perl test infrastructure relying on TestLib.pm when
> tests are run in parallel. What we would need here is a global mapping
> file storing all the port numbers used by all the nodes currently in
> the tests.
> 

Yeah, a poorman's way to ensure ports aren't reused (I wasn't very
clear at top of post ) is something like:

global nPortsAssigned = 0;

AssignPort():  basePort = BASE_PORT;  # the lowest port we use  while(!available(basePort+nPortsAssigned)):
basePort++ nPortsAssigned++
 
  return basePort;

It has its glaring faults, but would probably work ok.
In any case, I'm sure you can do better.

>>
>> Granted, you have to try fairly hard to shoot yourself in the leg,
>> but since the solution is so simple, why not? If we never reuse ports
>> within a single test, this goes away.
> 
> Well, you can reuse the same port number in a test. Simply teardown
> the existing node and then recreate a new one. I think that port
> number assignment to a node should be transparent to the caller, in
> our case the perl test script holding a scenario.
> 

I was using you *never* want to reuse port numbers. That is, as long
as the lib ensures we never reuse ports within one test, all kinds
of corner cases are eliminated.

>>>> 5) Servers are shutdown with -m 'immediate', which can lead to races
>>>> in the script when archiving is turned on. That may be good for some
>>>> tests, but there's no control over it.
>>>
>>> I hesitated with fast here actually. So changed this way. We would
>>> want as wall a teardown command to stop the node with immediate and
>>> unregister the node from the active list.
>>>
>>
>> In particular, I was shutting down an archiving node and the archiving
>> was truncated. I *think* smart doesn't do this. But again, it's really
>> that the test writer can't easily override, not that the default is wrong.
> 
> Ah, OK. Then fast is just fine. It shuts down the node correctly.
> "smart" would wait for all the current connections to finish but I am
> wondering if it currently matters here: I don't see yet a clear use
> case yet where it would make sense to have multi-threaded script... If
> somebody comes back with a clear idea here perhaps we could revisit
> that but it does not seem worth it now.
> 

My mistake. Perhaps what would be useful though is a way
to force "messy" shutdown. a kill -9, basically. It's a recovery
test suite, right?.

>>>> Other issues:
>>>> 6. Directory structure, used one directory per thing but more logical
>>>> to place all things related to an instance under a single directory,
>>>> and name them according to role (57333_backup, and so on).
>>>
>>> Er, well. The first version of the patch did so, and then I switched
>>> to an approach closer to what the existing TAP facility is doing. But
>>> well let's simplify things a bit.
>>>
>>
>> I know, I know, but:
>> 1) an instance is a "thing" in your script, so having its associated
>> paraphernalia in one place makes more sense (maybe only to me).
>> 2) That's often how folks (well, how I) arrange things in deployment,
>> at least with archive/backup as symlinks to the nas.
>>
>> Alternatively, naming the dirs with a prefix (srv_foo_HASH,
>> backup_foo_backupname_HASH, etc') would work as well.
> 
> The useful portion about tempdir is that it cleans up itself
> automatically should an error happen. It does not seem to me we want
> use that.
> 

Ensuring cleanup and directory structure aren't inherently related.
Testlib makes cleanup easy if you're willing to accept its flat
structure. But writing something that does cleanup and lets yo
control directory structure is perfectly doable.

The question is only if you agree or not that having per-server
directories could be convenient. Tying into the next, if you
don't think anyone ever need to look into these directories
(which I disagree with), then dir structure indeed doesn't matter.

>>  >> 8. No canned way to output a pprinted overview of the running system
>>>> (paths, ports, for manual checking).
>>>
>>> Hm. Why not... Are you suggesting something like print_current_conf
>>> that goes through all the registered nodes and outputs that? How would
>>> you use it?
>>>
>>
>> For one thin, I could open a few terminals and `$(print_env_for_server
>> 5437), so psql just worked.
>> I wish PEG had that as well.
> 
> Hm. Isn't that coupled with the case where a failure happens then but
> tempdirs are cleaned up then? 

But I've mentioned that's inconvenient as well. If you don't think
/that/ should be fixed, then yes, there's no point adding this.
Still, perhaps doing both (+previous) would make writing tests
easier.

At least for me, writing tests isn't simply a typing exercise.
I always need to inspect the system at stages during the test.
It's only after the test is ready that cleanup is useful, before
that it actually hampers work.

> I would expect hackers to run those runs
> until the end. 

I agree -- when you're running them , but what about when you're
/writing/ them?

If a failure happens, it would then be useful to get a
> dump of what happens. However, it seems to me that we can get the same
> information by logging all the information when creating a node in the
> log file. I have added a routine in this sense, which is called each
> time a node is initialized. It seems helpful either way.
> 


>>>> 11. a canned "server is responding to queries" helper would be convenient.
>>>
>>> Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?
>>>
>>
>> Block until recovery is finished, before testing. eliminate races, and
>> avoid the stupid sleep(3) I used.
> 
> TODO
> 
>>>> 4b) server shutdown should perhaps be "smart" by default, or segmented
>>>> into calmly_bring_to_a_close(), pull_electric_plug() and
>>>> drop_down_the_stairs_into_swimming_pool().
>>>
>>> Nope, not agreeing here. "immediate" is rather violent to stop a node,
>>> hence I have switched it to use "fast" and there is now a
>>> teardown_node routine that uses immediate, that's more aimed at
>>> cleanup up existing things fiercely.
>>>
>>
>> Ok, not as the default, but possible to request a specific kind of
>> shutdown. I needed smart in my case. Plus, in a scenario, you might
>> expressly be testing behavior for a specific mode, it needs to be
>> controllable.
> 
> If your test script is running with a single thread, "fast" or "smart"
> would not really make a difference, no?
> 

It would If there's a bug in one of them and I'm trying to write
a regression test for it. Recall, this was part of broader view
of "provide defaults, allow override" I was suggesting.

>>> I have as well moved RecoveryTest.pm to src/test/perl so as all the
>>> consumers of prove_check can use it by default, and decoupled
>>> start_node from make_master and make_*_standby so as it is possible to
>>> add for example new parameters to their postgresql.conf and
>>> recovery.conf files before starting them.
>>>
>>> Thanks a lot for the feedback! Attached is an updated patch with all
>>> the things mentioned above done. Are included as well the typo fixes
>>> you sent upthread.
>>> Regards,
>>>
>>
>> Great! I'll look at it and post again if there's more. If any of the
>> above extra explanations make it clearer why I suggested some changes
>> you didn't like...
> 
> I am attaching a new version. I found a small bug in test case 001
> when checking if standby 2 has caught up. There is also this dump
> function that is helpful. The -i switch in cp command has been removed
> as well.
> 

I'm sorry I didn't review the code, but honestly my perl is so rusty I'm
afraid I'll embarrass myself :)


Thanks!
Amir

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

03 October 2015, 13:05:06

On 10/03/2015 03:50 PM, Amir Rohan wrote:
> On 10/03/2015 02:38 PM, Michael Paquier wrote:
>> On Fri, Oct 2, 2015 at 11:10 PM, Amir Rohan wrote:
>>> On 10/02/2015 03:33 PM, Michael Paquier wrote:
>>>
>>> Granted, you have to try fairly hard to shoot yourself in the leg,
>>> but since the solution is so simple, why not? If we never reuse ports
>>> within a single test, this goes away.
>>
>> Well, you can reuse the same port number in a test. Simply teardown
>> the existing node and then recreate a new one. I think that port
>> number assignment to a node should be transparent to the caller, in
>> our case the perl test script holding a scenario.
>>
> 
> What part of "Never assign the same port twice during one test"
> makes this "not transparent to the user"?
> 
> If you're thinking about parallel tests, I don't think you
> need to worry. Availability checks take care of one part,

Except now that I think of it, that's definitely a race:

Thread1: is_available(5432) -> True
Thread2: is_available(5432) -> True
Thread1: listen(5432) -> True
Thread2: listen(5432) -> #$@#$&@#$^&$#@&

I don't know if parallel tests are actually supported, though.
If theye are, you're right that this is a shared global
resource wrt concurrency.

Amir

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

03 October 2015, 13:47:38

On Sat, Oct 3, 2015 at 9:50 PM, Amir Rohan <amir.rohan@zoho.com> wrote:
> On 10/03/2015 02:38 PM, Michael Paquier wrote:
>> On Fri, Oct 2, 2015 at 11:10 PM, Amir Rohan wrote:
>>> On 10/02/2015 03:33 PM, Michael Paquier wrote:
>>>>> 4) Port assignment relies on liveness checks on running servers.
>>>>> If a server is shut down and a new instantiated, the port will get
>>>>> reused, data will get trashed, and various confusing things can happen.
>>>>
>>>> Right. The safest way to do that is to check in get_free_port if a
>>>> port number is used by a registered node, and continue to loop in if
>>>> that's the case. So done.
>>>>
>>>
>>> That eliminates the "sweet and gentle" variant of the scenario, but it's
>>> susceptible to the "ABA problem":
>>> https://en.wikipedia.org/wiki/ABA_problem
>>> https://youtu.be/CmxkPChOcvw?t=786
>>
>> I learnt a new thing here. That's basically an existing problem even
>> with the existing perl test infrastructure relying on TestLib.pm when
>> tests are run in parallel. What we would need here is a global mapping
>> file storing all the port numbers used by all the nodes currently in
>> the tests.
>>
>
> Yeah, a poorman's way to ensure ports aren't reused (I wasn't very
> clear at top of post ) is something like:
>
> global nPortsAssigned = 0;
>
> AssignPort():
>    basePort = BASE_PORT;  # the lowest port we use
>    while(!available(basePort+nPortsAssigned)):
>        basePort++
>
>    nPortsAssigned++
>
>    return basePort;
>
> It has its glaring faults, but would probably work ok.
> In any case, I'm sure you can do better.

Yeah, this would improve the exiting port lookup. I don't mind adding
a global variable in get_free_port for this purpose. This would
accelerate finding a free port in may cases for sure.

>>>
>>> Granted, you have to try fairly hard to shoot yourself in the leg,
>>> but since the solution is so simple, why not? If we never reuse ports
>>> within a single test, this goes away.
>>
>> Well, you can reuse the same port number in a test. Simply teardown
>> the existing node and then recreate a new one. I think that port
>> number assignment to a node should be transparent to the caller, in
>> our case the perl test script holding a scenario.
>>
>
> I was using you *never* want to reuse port numbers. That is, as long
> as the lib ensures we never reuse ports within one test, all kinds
> of corner cases are eliminated.

Hm, sure. Though I don't really why that would be mandatory to enforce
this condition as long as the list of ports occupied is in a single
place (as long as tests are not run in parallel...).

>>>>> 5) Servers are shutdown with -m 'immediate', which can lead to races
>>>>> in the script when archiving is turned on. That may be good for some
>>>>> tests, but there's no control over it.
>>>>
>>>> I hesitated with fast here actually. So changed this way. We would
>>>> want as wall a teardown command to stop the node with immediate and
>>>> unregister the node from the active list.
>>>>
>>>
>>> In particular, I was shutting down an archiving node and the archiving
>>> was truncated. I *think* smart doesn't do this. But again, it's really
>>> that the test writer can't easily override, not that the default is wrong.
>>
>> Ah, OK. Then fast is just fine. It shuts down the node correctly.
>> "smart" would wait for all the current connections to finish but I am
>> wondering if it currently matters here: I don't see yet a clear use
>> case yet where it would make sense to have multi-threaded script... If
>> somebody comes back with a clear idea here perhaps we could revisit
>> that but it does not seem worth it now.
>>
>
> My mistake. Perhaps what would be useful though is a way
> to force "messy" shutdown. a kill -9, basically. It's a recovery
> test suite, right?.

That's what the teardown is aimed at having, the immediate stop mode
would play that fairly good enough. There has been a patch from Tom
Lane around to stop a server should its postmaster.pid be missing as
well...

>>>>> Other issues:
>>>>> 6. Directory structure, used one directory per thing but more logical
>>>>> to place all things related to an instance under a single directory,
>>>>> and name them according to role (57333_backup, and so on).
>>>>
>>>> Er, well. The first version of the patch did so, and then I switched
>>>> to an approach closer to what the existing TAP facility is doing. But
>>>> well let's simplify things a bit.
>>>>
>>>
>>> I know, I know, but:
>>> 1) an instance is a "thing" in your script, so having its associated
>>> paraphernalia in one place makes more sense (maybe only to me).
>>> 2) That's often how folks (well, how I) arrange things in deployment,
>>> at least with archive/backup as symlinks to the nas.
>>>
>>> Alternatively, naming the dirs with a prefix (srv_foo_HASH,
>>> backup_foo_backupname_HASH, etc') would work as well.
>>
>> The useful portion about tempdir is that it cleans up itself
>> automatically should an error happen. It does not seem to me we want
>> use that.
>>
>
> Ensuring cleanup and directory structure aren't inherently related.
> Testlib makes cleanup easy if you're willing to accept its flat
> structure. But writing something that does cleanup and lets yo
> control directory structure is perfectly doable.
>
> The question is only if you agree or not that having per-server
> directories could be convenient. Tying into the next, if you
> don't think anyone ever need to look into these directories
> (which I disagree with), then dir structure indeed doesn't matter.

So your point is having one temp dir for the whole, right? I don't
disagree with that.

>> I would expect hackers to run those runs
>> until the end.
>
> I agree -- when you're running them , but what about when you're
> /writing/ them?

Well, I enforce CLEANUP=0 manually in TestLib.pm for now.

>>>>> 11. a canned "server is responding to queries" helper would be convenient.
>>>>
>>>> Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?
>>>>
>>>
>>> Block until recovery is finished, before testing. eliminate races, and
>>> avoid the stupid sleep(3) I used.
>>
>> TODO

Well. I just recalled this item in the list of things you mentioned. I
marked it but forgot to address it. It sounds right that we may want
something using pg_isready in a loop as a node in recovery would
reject connections.

>>
>>>>> 4b) server shutdown should perhaps be "smart" by default, or segmented
>>>>> into calmly_bring_to_a_close(), pull_electric_plug() and
>>>>> drop_down_the_stairs_into_swimming_pool().
>>>>
>>>> Nope, not agreeing here. "immediate" is rather violent to stop a node,
>>>> hence I have switched it to use "fast" and there is now a
>>>> teardown_node routine that uses immediate, that's more aimed at
>>>> cleanup up existing things fiercely.
>>>>
>>>
>>> Ok, not as the default, but possible to request a specific kind of
>>> shutdown. I needed smart in my case. Plus, in a scenario, you might
>>> expressly be testing behavior for a specific mode, it needs to be
>>> controllable.
>>
>> If your test script is running with a single thread, "fast" or "smart"
>> would not really make a difference, no?
>
> It would If there's a bug in one of them and I'm trying to write
> a regression test for it. Recall, this was part of broader view
> of "provide defaults, allow override" I was suggesting.

We could then extend stop_node with an optional argument containing a
mode, with fast being the default. Sounds right?

>>>> I have as well moved RecoveryTest.pm to src/test/perl so as all the
>>>> consumers of prove_check can use it by default, and decoupled
>>>> start_node from make_master and make_*_standby so as it is possible to
>>>> add for example new parameters to their postgresql.conf and
>>>> recovery.conf files before starting them.
>>>>
>>>> Thanks a lot for the feedback! Attached is an updated patch with all
>>>> the things mentioned above done. Are included as well the typo fixes
>>>> you sent upthread.
>>>> Regards,
>>>>
>>>
>>> Great! I'll look at it and post again if there's more. If any of the
>>> above extra explanations make it clearer why I suggested some changes
>>> you didn't like...
>>
>> I am attaching a new version. I found a small bug in test case 001
>> when checking if standby 2 has caught up. There is also this dump
>> function that is helpful. The -i switch in cp command has been removed
>> as well.
>>
>
> I'm sorry I didn't review the code, but honestly my perl is so rusty I'm
> afraid I'll embarrass myself :)

I don't pretend mine are good :) So we are two.
-- 
Michael

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

04 October 2015, 11:03:40

On Sat, Oct 3, 2015 at 10:47 PM, Michael Paquier wrote:
> On Sat, Oct 3, 2015 at 9:50 PM, Amir Rohan wrote:
>>>> Block until recovery is finished, before testing. eliminate races, and
>>>> avoid the stupid sleep(3) I used.
>>>
>>> TODO
>
> Well. I just recalled this item in the list of things you mentioned. I
> marked it but forgot to address it. It sounds right that we may want
> something using pg_isready in a loop as a node in recovery would
> reject connections.

I just hacked up an updated version with the following things:
- Optional argument for stop_node to define the stop mode of pg_ctl
- Addition of wait_for_node where pg_isready is used to wait until a
node is ready to accept queries
- Addition of a local lookup variable to track the last port assigned.
This accelerates get_free_port.
Regards,
--
Michael

Attachment

20151004_recovery_regressions_v6.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Robert Haas

Date:

06 October 2015, 20:58:15

On Sat, Oct 3, 2015 at 7:38 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>> Granted, you have to try fairly hard to shoot yourself in the leg,
>> but since the solution is so simple, why not? If we never reuse ports
>> within a single test, this goes away.
>
> Well, you can reuse the same port number in a test. Simply teardown
> the existing node and then recreate a new one. I think that port
> number assignment to a node should be transparent to the caller, in
> our case the perl test script holding a scenario.

It seems that these days 'make check' creates a directory in /tmp
called /tmp/pg_regress-RANDOMSTUFF.  Listening on TCP ports is
disabled, and the socket goes inside this directory with a name like
.s.PGSQL.PORT.  You can connect with psql -h
/tmp/pg_regress-RANDOMSTUFF -p PORT, but not over TCP.  This basically
removes the risk of TCP port number collisions, as well as the risk of
your temporary instance being hijacked by a malicious user on the same
machine.  I'm not sure what we do on Windows, though.

>> In particular, I was shutting down an archiving node and the archiving
>> was truncated. I *think* smart doesn't do this. But again, it's really
>> that the test writer can't easily override, not that the default is wrong.
>
> Ah, OK. Then fast is just fine. It shuts down the node correctly.
> "smart" would wait for all the current connections to finish but I am
> wondering if it currently matters here: I don't see yet a clear use
> case yet where it would make sense to have multi-threaded script... If
> somebody comes back with a clear idea here perhaps we could revisit
> that but it does not seem worth it now.

I don't have anything brilliant to say about this point, but here's a
perhaps-not-brilliant comment:

If there's a bug in one of smart and fast shutdown and the other works
great, it would be nice to catch that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

06 October 2015, 22:43:58

On Wed, Oct 7, 2015 at 5:58 AM, Robert Haas wrote:
> On Sat, Oct 3, 2015 at 7:38 AM, Michael Paquier wrote:
> It seems that these days 'make check' creates a directory in /tmp
> called /tmp/pg_regress-RANDOMSTUFF.  Listening on TCP ports is
> disabled, and the socket goes inside this directory with a name like
> .s.PGSQL.PORT.  You can connect with psql -h
> /tmp/pg_regress-RANDOMSTUFF -p PORT, but not over TCP.  This basically
> removes the risk of TCP port number collisions, as well as the risk of
> your temporary instance being hijacked by a malicious user on the same
> machine.

Right, that's for example /var/folders/ on OSX, and this is defined
once per test run via $tempdir_short. PGHOST is set to that as well.

> I'm not sure what we do on Windows, though.

sspi with include_realm through 127.0.0.1.

>>> In particular, I was shutting down an archiving node and the archiving
>>> was truncated. I *think* smart doesn't do this. But again, it's really
>>> that the test writer can't easily override, not that the default is wrong.
>>
>> Ah, OK. Then fast is just fine. It shuts down the node correctly.
>> "smart" would wait for all the current connections to finish but I am
>> wondering if it currently matters here: I don't see yet a clear use
>> case yet where it would make sense to have multi-threaded script... If
>> somebody comes back with a clear idea here perhaps we could revisit
>> that but it does not seem worth it now.
>
> I don't have anything brilliant to say about this point, but here's a
> perhaps-not-brilliant comment:
>
> If there's a bug in one of smart and fast shutdown and the other works
> great, it would be nice to catch that.

Yes, sure. I extended the patch to support other stop modes than fast,
the default being kept to fast if none is defined.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

06 October 2015, 22:51:33

On Wed, Oct 7, 2015 at 7:43 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 5:58 AM, Robert Haas wrote:
>> On Sat, Oct 3, 2015 at 7:38 AM, Michael Paquier wrote:
>> It seems that these days 'make check' creates a directory in /tmp
>> called /tmp/pg_regress-RANDOMSTUFF.  Listening on TCP ports is
>> disabled, and the socket goes inside this directory with a name like
>> .s.PGSQL.PORT.  You can connect with psql -h
>> /tmp/pg_regress-RANDOMSTUFF -p PORT, but not over TCP.  This basically
>> removes the risk of TCP port number collisions, as well as the risk of
>> your temporary instance being hijacked by a malicious user on the same
>> machine.
>
> Right, that's for example /var/folders/ on OSX, and this is defined
> once per test run via $tempdir_short. PGHOST is set to that as well.

Er, mistake here. That's actually once per standard_initdb, except
that all the tests I have included in my patch run it just once to
create a master node. It seems that it would be wiser to set one
socket dir per node then, remove the port assignment stuff, and use
tempdir_short as a key to define a node as well as in the connection
string to this node. I'll update the patch later today...
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

07 October 2015, 06:27:38

On Wed, Oct 7, 2015 at 7:51 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Oct 7, 2015 at 7:43 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Oct 7, 2015 at 5:58 AM, Robert Haas wrote:
>>> On Sat, Oct 3, 2015 at 7:38 AM, Michael Paquier wrote:
>>> It seems that these days 'make check' creates a directory in /tmp
>>> called /tmp/pg_regress-RANDOMSTUFF.  Listening on TCP ports is
>>> disabled, and the socket goes inside this directory with a name like
>>> .s.PGSQL.PORT.  You can connect with psql -h
>>> /tmp/pg_regress-RANDOMSTUFF -p PORT, but not over TCP.  This basically
>>> removes the risk of TCP port number collisions, as well as the risk of
>>> your temporary instance being hijacked by a malicious user on the same
>>> machine.
>>
>> Right, that's for example /var/folders/ on OSX, and this is defined
>> once per test run via $tempdir_short. PGHOST is set to that as well.
>
> Er, mistake here. That's actually once per standard_initdb, except
> that all the tests I have included in my patch run it just once to
> create a master node. It seems that it would be wiser to set one
> socket dir per node then, remove the port assignment stuff, and use
> tempdir_short as a key to define a node as well as in the connection
> string to this node. I'll update the patch later today...

So, my conclusion regarding multiple calls of make_master is that we
should not allow to do it. On Unix/Linux we could have a separate unix
socket directory for each node, but not on Windows where
listen_addresses is set to look after 127.0.0.1. On Unix/Linux, PGHOST
is set by the master node to a tempdir once and for all. Hence, to
make the code more consistent, I think that we should keep the port
lookup machinery here. An updated patch is attached.
--
Michael

Attachment

20151007_recovery_regressions_v7.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

07 October 2015, 07:16:44

On 10/07/2015 09:27 AM, Michael Paquier wrote:
> On Wed, Oct 7, 2015 at 7:51 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Wed, Oct 7, 2015 at 7:43 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Wed, Oct 7, 2015 at 5:58 AM, Robert Haas wrote:
>>>> On Sat, Oct 3, 2015 at 7:38 AM, Michael Paquier wrote:
>>>> It seems that these days 'make check' creates a directory in /tmp
>>>> called /tmp/pg_regress-RANDOMSTUFF.  Listening on TCP ports is
>>>> disabled, and the socket goes inside this directory with a name like
>>>> .s.PGSQL.PORT.  You can connect with psql -h
>>>> /tmp/pg_regress-RANDOMSTUFF -p PORT, but not over TCP.  This basically
>>>> removes the risk of TCP port number collisions, as well as the risk of
>>>> your temporary instance being hijacked by a malicious user on the same
>>>> machine.
>>>
>>> Right, that's for example /var/folders/ on OSX, and this is defined
>>> once per test run via $tempdir_short. PGHOST is set to that as well.
>>
>> Er, mistake here. That's actually once per standard_initdb, except
>> that all the tests I have included in my patch run it just once to
>> create a master node. It seems that it would be wiser to set one
>> socket dir per node then, remove the port assignment stuff, and use
>> tempdir_short as a key to define a node as well as in the connection
>> string to this node. I'll update the patch later today...
> 
> So, my conclusion regarding multiple calls of make_master is that we
> should not allow to do it. On Unix/Linux we could have a separate unix
> socket directory for each node, but not on Windows where
> listen_addresses is set to look after 127.0.0.1. On Unix/Linux, PGHOST
> is set by the master node to a tempdir once and for all. Hence, to
> make the code more consistent, I think that we should keep the port
> lookup machinery here. An updated patch is attached.
> 

If parallel tests are supported, get_free_port is still racy even
with last_port_found because it's:
1) process-local.
2) even if it were shared, there's the race window between the
available-check and the listen() I mentioned upthread.

If parallel tests are explicitly disallowed, a comment to that
effect (and a note on things known to break) might help someone
down the road.

Also, the removal of poll_query_until from pg_rewind looks suspiciously
like a copy-paste gone bad. Pardon if I'm missing something.

Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

07 October 2015, 07:29:34

On Wed, Oct 7, 2015 at 4:16 PM, Amir Rohan <amir.rohan@zoho.com> wrote:
> On 10/07/2015 09:27 AM, Michael Paquier wrote:
>> On Wed, Oct 7, 2015 at 7:51 AM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>> On Wed, Oct 7, 2015 at 7:43 AM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> On Wed, Oct 7, 2015 at 5:58 AM, Robert Haas wrote:
>>>>> On Sat, Oct 3, 2015 at 7:38 AM, Michael Paquier wrote:
>>>>> It seems that these days 'make check' creates a directory in /tmp
>>>>> called /tmp/pg_regress-RANDOMSTUFF.  Listening on TCP ports is
>>>>> disabled, and the socket goes inside this directory with a name like
>>>>> .s.PGSQL.PORT.  You can connect with psql -h
>>>>> /tmp/pg_regress-RANDOMSTUFF -p PORT, but not over TCP.  This basically
>>>>> removes the risk of TCP port number collisions, as well as the risk of
>>>>> your temporary instance being hijacked by a malicious user on the same
>>>>> machine.
>>>>
>>>> Right, that's for example /var/folders/ on OSX, and this is defined
>>>> once per test run via $tempdir_short. PGHOST is set to that as well.
>>>
>>> Er, mistake here. That's actually once per standard_initdb, except
>>> that all the tests I have included in my patch run it just once to
>>> create a master node. It seems that it would be wiser to set one
>>> socket dir per node then, remove the port assignment stuff, and use
>>> tempdir_short as a key to define a node as well as in the connection
>>> string to this node. I'll update the patch later today...
>>
>> So, my conclusion regarding multiple calls of make_master is that we
>> should not allow to do it. On Unix/Linux we could have a separate unix
>> socket directory for each node, but not on Windows where
>> listen_addresses is set to look after 127.0.0.1. On Unix/Linux, PGHOST
>> is set by the master node to a tempdir once and for all. Hence, to
>> make the code more consistent, I think that we should keep the port
>> lookup machinery here. An updated patch is attached.
>>
> If parallel tests are supported, get_free_port is still racy even
> with last_port_found because it's:
> 1) process-local.
> 2) even if it were shared, there's the race window between the
> available-check and the listen() I mentioned upthread.
>
> If parallel tests are explicitly disallowed, a comment to that
> effect (and a note on things known to break) might help someone
> down the road.

Actually, no, port lookup will not map and parallel tests would work
fine thinking more about it, each set of tests uses its own PGHOST to
a private unix socket directory so even if multiple tests use the same
port number they won't interact with each other because they connect
to different socket paths. MinGW is a problem though, and an existing
one in the perl test scripts, I recall that it can use make -j and
that's on Windows where PGHOST is mapping to 127.0.0.1 only.

> Also, the removal of poll_query_until from pg_rewind looks suspiciously
> like a copy-paste gone bad. Pardon if I'm missing something.

Perhaps. Do you have a suggestion regarding that? It seems to me that
this is more useful in TestLib.pm as-is.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

07 October 2015, 08:44:33

On 10/07/2015 10:29 AM, Michael Paquier wrote:
> On Wed, Oct 7, 2015 at 4:16 PM, Amir Rohan <amir.rohan@zoho.com> wrote:
>> On 10/07/2015 09:27 AM, Michael Paquier wrote:
>>> On Wed, Oct 7, 2015 at 7:51 AM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>> On Wed, Oct 7, 2015 at 7:43 AM, Michael Paquier
>>>> <michael.paquier@gmail.com> wrote:
>>>>> On Wed, Oct 7, 2015 at 5:58 AM, Robert Haas wrote:
>>>>>> On Sat, Oct 3, 2015 at 7:38 AM, Michael Paquier wrote:
>>>>>> It seems that these days 'make check' creates a directory in /tmp
>>>>>> called /tmp/pg_regress-RANDOMSTUFF.  Listening on TCP ports is
>>>>>> disabled, and the socket goes inside this directory with a name like
>>>>>> .s.PGSQL.PORT.  You can connect with psql -h
>>>>>> /tmp/pg_regress-RANDOMSTUFF -p PORT, but not over TCP.  This basically
>>>>>> removes the risk of TCP port number collisions, as well as the risk of
>>>>>> your temporary instance being hijacked by a malicious user on the same
>>>>>> machine.
>>>>>
>>>>> Right, that's for example /var/folders/ on OSX, and this is defined
>>>>> once per test run via $tempdir_short. PGHOST is set to that as well.
>>>>
>>>> Er, mistake here. That's actually once per standard_initdb, except
>>>> that all the tests I have included in my patch run it just once to
>>>> create a master node. It seems that it would be wiser to set one
>>>> socket dir per node then, remove the port assignment stuff, and use
>>>> tempdir_short as a key to define a node as well as in the connection
>>>> string to this node. I'll update the patch later today...
>>>
>>> So, my conclusion regarding multiple calls of make_master is that we
>>> should not allow to do it. On Unix/Linux we could have a separate unix
>>> socket directory for each node, but not on Windows where
>>> listen_addresses is set to look after 127.0.0.1. On Unix/Linux, PGHOST
>>> is set by the master node to a tempdir once and for all. Hence, to
>>> make the code more consistent, I think that we should keep the port
>>> lookup machinery here. An updated patch is attached.
>>>
>> If parallel tests are supported, get_free_port is still racy even
>> with last_port_found because it's:
>> 1) process-local.
>> 2) even if it were shared, there's the race window between the
>> available-check and the listen() I mentioned upthread.
>>
>> If parallel tests are explicitly disallowed, a comment to that
>> effect (and a note on things known to break) might help someone
>> down the road.
> 
> Actually, no, port lookup will not map and parallel tests would work
> fine thinking more about it, each set of tests uses its own PGHOST to
> a private unix socket directory so even if multiple tests use the same
> port number they won't interact with each other because they connect
> to different socket paths. MinGW is a problem though, and an existing
> one in the perl test scripts, I recall that it can use make -j and
> that's on Windows where PGHOST is mapping to 127.0.0.1 only.
> 

ah, the portnum is actually a real tcp port only on windows, and
the race is limited to that case as you say. Note that in the
tcp case, using psql to check is wrong:
$ nc -l 8001  # listen on 8001
$ psql -X -h lo -p 8001 postgres < /dev/null psql: could not connect to
server: Connection refusedIs the server running on host "lo" (127.0.0.1) and acceptingTCP/IP connections on port 8001?

The port isn't free, but psql is really only checking if pg is there
and reports that the port is available. That's a fairly mild issue, though.

>> Also, the removal of poll_query_until from pg_rewind looks suspiciously
>> like a copy-paste gone bad. Pardon if I'm missing something.
> 
> Perhaps. Do you have a suggestion regarding that? It seems to me that
> this is more useful in TestLib.pm as-is.
> 

My mistake, the patch only shows some internal function being deleted
but RewindTest.pm (obviously) imports TestLib. You're right, TestLib is
a better place for it.

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

08 October 2015, 05:19:43

On Wed, Oct 7, 2015 at 5:44 PM, Amir Rohan wrote:
> On 10/07/2015 10:29 AM, Michael Paquier wrote:
>> On Wed, Oct 7, 2015 at 4:16 PM, Amir Rohan wrote:
>>> Also, the removal of poll_query_until from pg_rewind looks suspiciously
>>> like a copy-paste gone bad. Pardon if I'm missing something.
>>
>> Perhaps. Do you have a suggestion regarding that? It seems to me that
>> this is more useful in TestLib.pm as-is.
>>
>
> My mistake, the patch only shows some internal function being deleted
> but RewindTest.pm (obviously) imports TestLib. You're right, TestLib is
> a better place for it.

OK. Here is a new patch version. I have removed the restriction
preventing to call make_master multiple times in the same script (one
may actually want to test some stuff related to logical decoding or
FDW for example, who knows...), forcing PGHOST to always use the same
value after it has been initialized. I have added a sanity check
though, it is not possible to create a node based on a base backup if
no master are defined. This looks like a cheap insurance... I also
refactored a bit the code, using the new init_node_info to fill in the
fields of a newly-initialized node, and I removed get_free_port,
init_node, init_node_from_backup, enable_restoring and
enable_streaming from the list of routines exposed to the users, those
can be used directly with make_master, make_warm_standby and
make_hot_standby. We could add them again if need be, somebody may
want to be able to get a free port, set up a node without those
generic routines, just that it does not seem necessary now.
Regards,
--
Michael

Attachment

20151008_recovery_regressions_v8.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

08 October 2015, 06:59:32

On 10/08/2015 08:19 AM, Michael Paquier wrote:
> On Wed, Oct 7, 2015 at 5:44 PM, Amir Rohan wrote:
>> On 10/07/2015 10:29 AM, Michael Paquier wrote:
>>> On Wed, Oct 7, 2015 at 4:16 PM, Amir Rohan wrote:
>>>> Also, the removal of poll_query_until from pg_rewind looks suspiciously
>>>> like a copy-paste gone bad. Pardon if I'm missing something.
>>>
>>> Perhaps. Do you have a suggestion regarding that? It seems to me that
>>> this is more useful in TestLib.pm as-is.
>>>
>>
>> My mistake, the patch only shows some internal function being deleted
>> but RewindTest.pm (obviously) imports TestLib. You're right, TestLib is
>> a better place for it.
> 
> OK. Here is a new patch version. I have removed the restriction
> preventing to call make_master multiple times in the same script (one
> may actually want to test some stuff related to logical decoding or
> FDW for example, who knows...), forcing PGHOST to always use the same
> value after it has been initialized. I have added a sanity check
> though, it is not possible to create a node based on a base backup if
> no master are defined. This looks like a cheap insurance... I also
> refactored a bit the code, using the new init_node_info to fill in the
> fields of a newly-initialized node, and I removed get_free_port,
> init_node, init_node_from_backup, enable_restoring and
> enable_streaming from the list of routines exposed to the users, those
> can be used directly with make_master, make_warm_standby and
> make_hot_standby. We could add them again if need be, somebody may
> want to be able to get a free port, set up a node without those
> generic routines, just that it does not seem necessary now.
> Regards,
> 

If you'd like, I can write up some tests for cascading replication which
are currently missing.

Someone mentioned a daisy chain setup which sounds fun. Anything else in
particular? Also, it would be nice to have some canned way to measure
end-to-end replication latency for variable number of nodes.
What about going back through the commit log and writing some regression
tests for the real stinkers, if someone care to volunteer some candidate
bugs

Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

08 October 2015, 07:39:37

On Thu, Oct 8, 2015 at 3:59 PM, Amir Rohan <amir.rohan@zoho.com> wrote:
> On 10/08/2015 08:19 AM, Michael Paquier wrote:
>> On Wed, Oct 7, 2015 at 5:44 PM, Amir Rohan wrote:
>>> On 10/07/2015 10:29 AM, Michael Paquier wrote:
>>>> On Wed, Oct 7, 2015 at 4:16 PM, Amir Rohan wrote:
>>>>> Also, the removal of poll_query_until from pg_rewind looks suspiciously
>>>>> like a copy-paste gone bad. Pardon if I'm missing something.
>>>>
>>>> Perhaps. Do you have a suggestion regarding that? It seems to me that
>>>> this is more useful in TestLib.pm as-is.
>>>>
>>>
>>> My mistake, the patch only shows some internal function being deleted
>>> but RewindTest.pm (obviously) imports TestLib. You're right, TestLib is
>>> a better place for it.
>>
>> OK. Here is a new patch version. I have removed the restriction
>> preventing to call make_master multiple times in the same script (one
>> may actually want to test some stuff related to logical decoding or
>> FDW for example, who knows...), forcing PGHOST to always use the same
>> value after it has been initialized. I have added a sanity check
>> though, it is not possible to create a node based on a base backup if
>> no master are defined. This looks like a cheap insurance... I also
>> refactored a bit the code, using the new init_node_info to fill in the
>> fields of a newly-initialized node, and I removed get_free_port,
>> init_node, init_node_from_backup, enable_restoring and
>> enable_streaming from the list of routines exposed to the users, those
>> can be used directly with make_master, make_warm_standby and
>> make_hot_standby. We could add them again if need be, somebody may
>> want to be able to get a free port, set up a node without those
>> generic routines, just that it does not seem necessary now.
>> Regards,
>>
>
> If you'd like, I can write up some tests for cascading replication which
> are currently missing.

001 is testing cascading, like that node1 -> node2 -> node3.

> Someone mentioned a daisy chain setup which sounds fun. Anything else in
> particular? Also, it would be nice to have some canned way to measure
> end-to-end replication latency for variable number of nodes.

Hm. Do you mean comparing the LSN position between two nodes even if
both nodes are not connected to each other? What would you use it for?

> What about going back through the commit log and writing some regression
> tests for the real stinkers, if someone care to volunteer some candidate
> bugs

I have drafted a list with a couple of items upthread:
http://www.postgresql.org/message-id/CAB7nPqSgffSPhOcrhFoAsDAnipvn6WsH2nYkf1KayRm+9_MTGw@mail.gmail.com
So based on the existing patch the list becomes as follows:
- wal_retrieve_retry_interval with a high value, say setting to for
example 2/3s and loop until it is applied by checking it is it has
been received by the standby every second.
- recovery_target_action
- archive_cleanup_command
- recovery_end_command
- pg_xlog_replay_pause and pg_xlog_replay_resume
In the list of things that could have a test, I recall that we should
test as well 2PC with the recovery delay, look at a1105c3d. This could
be included in 005.
The advantage of implementing that now is that we could see if the
existing routines are solid enough or not. Still, looking at what the
patch has now I think that we had better get a committer look at it,
and if the core portion gets integrated we could already use it for
the patch implementing quorum synchronous replication and in doing
more advanced tests with pg_rewind regarding the timeline handling
(both patches of this CF). I don't mind adding more now, though I
think that the set of sample tests included in this version is enough
as a base implementation of the facility and shows what it can do.
Regards,
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

08 October 2015, 09:04:18

On 10/08/2015 10:39 AM, Michael Paquier wrote:
> On Thu, Oct 8, 2015 at 3:59 PM, Amir Rohan <amir.rohan@zoho.com> wrote:
>> On 10/08/2015 08:19 AM, Michael Paquier wrote:
>>> On Wed, Oct 7, 2015 at 5:44 PM, Amir Rohan wrote:
>>>> On 10/07/2015 10:29 AM, Michael Paquier wrote:
>>>>> On Wed, Oct 7, 2015 at 4:16 PM, Amir Rohan wrote:
>>>>>> Also, the removal of poll_query_until from pg_rewind looks suspiciously
>>>>>> like a copy-paste gone bad. Pardon if I'm missing something.
>>>>>
>>>>> Perhaps. Do you have a suggestion regarding that? It seems to me that
>>>>> this is more useful in TestLib.pm as-is.
>>>>>
>>>>
>>>> My mistake, the patch only shows some internal function being deleted
>>>> but RewindTest.pm (obviously) imports TestLib. You're right, TestLib is
>>>> a better place for it.
>>>
>>> OK. Here is a new patch version. I have removed the restriction
>>> preventing to call make_master multiple times in the same script (one
>>> may actually want to test some stuff related to logical decoding or
>>> FDW for example, who knows...), forcing PGHOST to always use the same
>>> value after it has been initialized. I have added a sanity check
>>> though, it is not possible to create a node based on a base backup if
>>> no master are defined. This looks like a cheap insurance... I also
>>> refactored a bit the code, using the new init_node_info to fill in the
>>> fields of a newly-initialized node, and I removed get_free_port,
>>> init_node, init_node_from_backup, enable_restoring and
>>> enable_streaming from the list of routines exposed to the users, those
>>> can be used directly with make_master, make_warm_standby and
>>> make_hot_standby. We could add them again if need be, somebody may
>>> want to be able to get a free port, set up a node without those
>>> generic routines, just that it does not seem necessary now.
>>> Regards,
>>>
>>
>> If you'd like, I can write up some tests for cascading replication which
>> are currently missing.
> 
> 001 is testing cascading, like that node1 -> node2 -> node3.
> 
>> Someone mentioned a daisy chain setup which sounds fun. Anything else in
>> particular? Also, it would be nice to have some canned way to measure
>> end-to-end replication latency for variable number of nodes.
> 
> Hm. Do you mean comparing the LSN position between two nodes even if
> both nodes are not connected to each other? What would you use it for?
> 

In a cascading replication setup, the typical _time_ it takes for a
COMMIT on master to reach the slave (assuming constant WAL generation
rate) is an important operational metric.

It would be useful to catch future regressions for that metric,
which may happen even when a patch doesn't outright break cascading
replication. Just automating the measurement could be useful if
there's no pg facility that tracks performance over time in
a regimented fashion. I've seen multiple projects which consider
a "benchmark suite" to be part of its testing strategy.


As for the "daisy chain" thing, it was (IIRC) mentioned in a josh berkus
talk I caught on youtube. It's possible to setup cascading replication,
take down the master, and then reinsert it as replicating slave, so that
you end up with *all* servers replicating from the
ancestor in the chain, and no master. I think it was more
a fun hack then anything, but also an interesting corner case to
investigate.

>> What about going back through the commit log and writing some regression
>> tests for the real stinkers, if someone care to volunteer some candidate
>> bugs
> 
> I have drafted a list with a couple of items upthread:
> http://www.postgresql.org/message-id/CAB7nPqSgffSPhOcrhFoAsDAnipvn6WsH2nYkf1KayRm+9_MTGw@mail.gmail.com
> So based on the existing patch the list becomes as follows:
> - wal_retrieve_retry_interval with a high value, say setting to for
> example 2/3s and loop until it is applied by checking it is it has
> been received by the standby every second.
> - recovery_target_action
> - archive_cleanup_command
> - recovery_end_command
> - pg_xlog_replay_pause and pg_xlog_replay_resume
> In the list of things that could have a test, I recall that we should
> test as well 2PC with the recovery delay, look at a1105c3d. This could
> be included in 005.

a1105c3 Mar 23 Fix copy & paste error in 4f1b890b137.  Andres Freund
4f1b890 Mar 15 Merge the various forms of transaction commit & abort
records.  Andres Freund

Is that the right commit?

> The advantage of implementing that now is that we could see if the
> existing routines are solid enough or not. 

I can do this if you point me at a self-contained thread/#issue.

> Still, looking at what the
> patch has now I think that we had better get a committer look at it,
> and if the core portion gets integrated we could already use it for
> the patch implementing quorum synchronous replication and in doing
> more advanced tests with pg_rewind regarding the timeline handling
> (both patches of this CF). 
>
> I don't mind adding more now, though I
> think that the set of sample tests included in this version is enough
> as a base implementation of the facility and shows what it can do.

Sure.

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

08 October 2015, 13:48:00

On Thu, Oct 8, 2015 at 6:03 PM, Amir Rohan wrote:
> On 10/08/2015 10:39 AM, Michael Paquier wrote:
>>> Someone mentioned a daisy chain setup which sounds fun. Anything else in
>>> particular? Also, it would be nice to have some canned way to measure
>>> end-to-end replication latency for variable number of nodes.
>>
>> Hm. Do you mean comparing the LSN position between two nodes even if
>> both nodes are not connected to each other? What would you use it for?
>>
>
> In a cascading replication setup, the typical _time_ it takes for a
> COMMIT on master to reach the slave (assuming constant WAL generation
> rate) is an important operational metric.

Hm. You mean the exact amount of time it gets to be sure that a given
WAL position has been flushed on a cascading standby, be it across
multiple layers. Er, that's a bit tough without patching the backend
where I guess we would need to keep a track of when a LSN position has
been flushed. And calls of gettimeofday are expensive, so that does
not sound like a plausible alternative here to me...

> It would be useful to catch future regressions for that metric,
> which may happen even when a patch doesn't outright break cascading
> replication. Just automating the measurement could be useful if
> there's no pg facility that tracks performance over time in
> a regimented fashion. I've seen multiple projects which consider
> a "benchmark suite" to be part of its testing strategy.

Ah, OK. I see. That's a bit out of scope of this patch, and that's
really OS-dependent, but as long as the comparisons can be done on the
same OS it would make sense.

> As for the "daisy chain" thing, it was (IIRC) mentioned in a josh berkus
> talk I caught on youtube. It's possible to setup cascading replication,
> take down the master, and then reinsert it as replicating slave, so that
> you end up with *all* servers replicating from the
> ancestor in the chain, and no master. I think it was more
> a fun hack then anything, but also an interesting corner case to
> investigate.

Ah, yes. I recall this one. I am sure it made the audience smile. All
the nodes link to each other in closed circle.

>>> What about going back through the commit log and writing some regression
>>> tests for the real stinkers, if someone care to volunteer some candidate
>>> bugs
>>
>> I have drafted a list with a couple of items upthread:
>> http://www.postgresql.org/message-id/CAB7nPqSgffSPhOcrhFoAsDAnipvn6WsH2nYkf1KayRm+9_MTGw@mail.gmail.com
>> So based on the existing patch the list becomes as follows:
>> - wal_retrieve_retry_interval with a high value, say setting to for
>> example 2/3s and loop until it is applied by checking it is it has
>> been received by the standby every second.
>> - recovery_target_action
>> - archive_cleanup_command
>> - recovery_end_command
>> - pg_xlog_replay_pause and pg_xlog_replay_resume
>> In the list of things that could have a test, I recall that we should
>> test as well 2PC with the recovery delay, look at a1105c3d. This could
>> be included in 005.
>
> a1105c3 Mar 23 Fix copy & paste error in 4f1b890b137.  Andres Freund
> 4f1b890 Mar 15 Merge the various forms of transaction commit & abort
> records.  Andres Freund
>
> Is that the right commit?

That's this one. a1105c3 was actually rather tricky... The idea is to
simply check the WAL replay delay with COMMIT PREPARED.

>> The advantage of implementing that now is that we could see if the
>> existing routines are solid enough or not.
>
> I can do this if you point me at a self-contained thread/#issue.

Hm. This patch is already 900 lines, perhaps it would be wiser not to
make it more complicated for now..
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

08 October 2015, 14:28:22

On 10/08/2015 04:47 PM, Michael Paquier wrote:
> On Thu, Oct 8, 2015 at 6:03 PM, Amir Rohan wrote:
>> On 10/08/2015 10:39 AM, Michael Paquier wrote:
>>>> Someone mentioned a daisy chain setup which sounds fun. Anything else in
>>>> particular? Also, it would be nice to have some canned way to measure
>>>> end-to-end replication latency for variable number of nodes.
>>>
>>> Hm. Do you mean comparing the LSN position between two nodes even if
>>> both nodes are not connected to each other? What would you use it for?
>>>
>>
>> In a cascading replication setup, the typical _time_ it takes for a
>> COMMIT on master to reach the slave (assuming constant WAL generation
>> rate) is an important operational metric.
> 
> Hm. You mean the exact amount of time it gets to be sure that a given
> WAL position has been flushed on a cascading standby, be it across
> multiple layers. Er, that's a bit tough without patching the backend
> where I guess we would need to keep a track of when a LSN position has
> been flushed. And calls of gettimeofday are expensive, so that does
> not sound like a plausible alternative here to me...
> 

Wouldn't this work?

1) start timer
2) Grab pg_stat_replication.sent_location from master
3) pg_switch_xlog() # I /think/ we want this, could be wrong
4) Poll slave's pg_last_xlog_replay_location() until LSN shows up
5) stop timer

Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

08 October 2015, 23:26:04

On Thu, Oct 8, 2015 at 11:28 PM, Amir Rohan wrote:
> Wouldn't this work?
> 1) start timer
> 2) Grab pg_stat_replication.sent_location from master
> 3) pg_switch_xlog() # I /think/ we want this, could be wrong

For a warm standby, you would want that, but this depends on two factors:
- The moment master completes archiving of this segment
- The moment standby restores it.
On slow machines, those two things become by far the bottleneck,
imagine a PI restricted on I/O with a low-class SD card in the worst
case (I maintain one, with a good card, still the I/O is a
bottleneck).

> 4) Poll slave's pg_last_xlog_replay_location() until LSN shows up
> 5) stop timer

That's not really solid, there is an interval of time between the
moment the LSN position is taken from the master and the standby. An
accurate method is to log/store on master when a given WAL position
has been flushed to disk, and do the same on slave at replay for this
LSN position. In any case this is doing to flood badly the logs of
both nodes, and as the backend cares about the performance of
operations in this code path we won't want to do that anyway.

To make it short, it seems to me that simply waiting until the LSN a
test is waiting for has been replayed is just but fine for this set of
tests to ensure their run consistency, let's not forget that this is
the goal here.
Regards,
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

09 October 2015, 11:12:19

On Fri, Oct 9, 2015 at 8:25 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Oct 8, 2015 at 11:28 PM, Amir Rohan wrote:
>> Wouldn't this work?
>> 1) start timer
>> 2) Grab pg_stat_replication.sent_location from master
>> 3) pg_switch_xlog() # I /think/ we want this, could be wrong
>
> For a warm standby, you would want that, but this depends on two factors:
> - The moment master completes archiving of this segment
> - The moment standby restores it.
> On slow machines, those two things become by far the bottleneck,
> imagine a PI restricted on I/O with a low-class SD card in the worst
> case (I maintain one, with a good card, still the I/O is a
> bottleneck).
>
>> 4) Poll slave's pg_last_xlog_replay_location() until LSN shows up
>> 5) stop timer
>
> That's not really solid, there is an interval of time between the
> moment the LSN position is taken from the master and the standby. An
> accurate method is to log/store on master when a given WAL position
> has been flushed to disk, and do the same on slave at replay for this
> LSN position. In any case this is doing to flood badly the logs of
> both nodes, and as the backend cares about the performance of
> operations in this code path we won't want to do that anyway.
>
> To make it short, it seems to me that simply waiting until the LSN a
> test is waiting for has been replayed is just but fine for this set of
> tests to ensure their run consistency, let's not forget that this is
> the goal here.

In terms of features, it seems that this patch has everything it needs
to allow one to design tests to work on both Linux and Windows, and it
is careful regarding CVE-2014-0067. Thoughts about moving that as
"Ready for committer"?
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

09 October 2015, 11:48:13

On 10/09/2015 02:12 PM, Michael Paquier wrote:
> On Fri, Oct 9, 2015 at 8:25 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Thu, Oct 8, 2015 at 11:28 PM, Amir Rohan wrote:
>>> Wouldn't this work?
>>> 1) start timer
>>> 2) Grab pg_stat_replication.sent_location from master
>>> 3) pg_switch_xlog() # I /think/ we want this, could be wrong
>>
>> For a warm standby, you would want that, but this depends on two factors:
>> - The moment master completes archiving of this segment
>> - The moment standby restores it.
>> On slow machines, those two things become by far the bottleneck,
>> imagine a PI restricted on I/O with a low-class SD card in the worst
>> case (I maintain one, with a good card, still the I/O is a
>> bottleneck).
>>
>>> 4) Poll slave's pg_last_xlog_replay_location() until LSN shows up
>>> 5) stop timer
>>
>> That's not really solid, there is an interval of time between the
>> moment the LSN position is taken from the master and the standby. An
>> accurate method is to log/store on master when a given WAL position
>> has been flushed to disk, and do the same on slave at replay for this
>> LSN position. In any case this is doing to flood badly the logs of
>> both nodes, and as the backend cares about the performance of
>> operations in this code path we won't want to do that anyway.
>>
>> To make it short, it seems to me that simply waiting until the LSN a
>> test is waiting for has been replayed is just but fine for this set of
>> tests to ensure their run consistency, let's not forget that this is
>> the goal here.
> 
> In terms of features, it seems that this patch has everything it needs
> to allow one to design tests to work on both Linux and Windows, and it
> is careful regarding CVE-2014-0067. Thoughts about moving that as
> "Ready for committer"?
> 

Ok, I've put myself down as reviewer in cfapp. I don't think I can
provide any more useful feedback that would actually result in changes
at this point, but I'll read through the entire discussion once last
time and write down final comments/notes. After that I have no problem
marking this for a committer to look at.

Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

09 October 2015, 11:54:02

On Fri, Oct 9, 2015 at 8:47 PM, Amir Rohan wrote:
> Ok, I've put myself down as reviewer in cfapp. I don't think I can
> provide any more useful feedback that would actually result in changes
> at this point, but I'll read through the entire discussion once last
> time and write down final comments/notes. After that I have no problem
> marking this for a committer to look at.

OK. If you have any comments or remarks, please do not hesitate at all!
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

10 October 2015, 11:43:40

On Fri, Oct 9, 2015 at 8:53 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Oct 9, 2015 at 8:47 PM, Amir Rohan wrote:
>> Ok, I've put myself down as reviewer in cfapp. I don't think I can
>> provide any more useful feedback that would actually result in changes
>> at this point, but I'll read through the entire discussion once last
>> time and write down final comments/notes. After that I have no problem
>> marking this for a committer to look at.
>
> OK. If you have any comments or remarks, please do not hesitate at all!

So, to let everybody know the issue, Amir has reported me offlist a
bug in one of the tests that can be reproduced more easily on a slow
machine:

> Amir wrote:
> Before posting the summary, I ran the latest v8 patch on today's git
> master (9c42727) and got some errors:
> t/004_timeline_switch.pl ...
> 1..1
> # ERROR:  invalid input syntax for type pg_lsn: ""
> # LINE 1: SELECT ''::pg_lsn <= pg_last_xlog_replay_location()
> #                ^
> # No tests run!

And here is my reply:
This is a timing issue and can happen when standby1, the promoted
standby which standby2 reconnects to to check that recovery works with
a timeline jump, is still in recovery after being restarted. There is
a small windows where this is possible, and this gets easier to
reproduce on slow machines (did so on a VM). So the issue was in test
004. I have updated the script to check pg_is_in_recovery() to be sure
that the node exits recovery before querying it with
pg_current_xlog_location.

It is worth noticing that the following change has saved me a lot of pain:
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -259,6 +259,7 @@ sub psql
        my ($stdout, $stderr);
        print("# Running SQL command: $sql\n");
        run [ 'psql', '-X', '-A', '-t', '-q', '-d', $dbname, '-f',
'-'], '<', \$sql, '>', \$stdout, '2>', \$stderr or die;
+       print "# Error output: $stderr\n" if $stderr ne "";
Perhaps we should consider backpatching it, it helped me find out the
issue I faced.

Attached is an updated patch fixing 004.
Regards,
--
Michael

Attachment

20151010_recovery_regressions_v9.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

10 October 2015, 12:05:03

On 10/10/2015 02:43 PM, Michael Paquier wrote:
> On Fri, Oct 9, 2015 at 8:53 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, Oct 9, 2015 at 8:47 PM, Amir Rohan wrote:
>>> Ok, I've put myself down as reviewer in cfapp. I don't think I can
>>> provide any more useful feedback that would actually result in changes
>>> at this point, but I'll read through the entire discussion once last
>>> time and write down final comments/notes. After that I have no problem
>>> marking this for a committer to look at.
>>
>> OK. If you have any comments or remarks, please do not hesitate at all!
> 
> So, to let everybody know the issue, Amir has reported me offlist a
> bug in one of the tests that can be reproduced more easily on a slow
> machine:
> 

Yeah, I usually stick to the list for discussion, but I ran an earlier
version without issues and thought this might be a problem with my
system as I've changed things a bit this week.

Now that v9 fixes the probkem, here's a summary from going over the
entire thread one last time:

# Windows and TAP sets
Noah (2015-03) mentioned TAP doesn't work on windows, and hoped
this would include some work on that.

IIUC, the facilities and tests do run on windows, but focus was there
and not the preexisting TAP suite.

# Test coverage (in the future)
Andres wanted a test for xid/multixid wraparound which also raises
the question of the tests that will need to be written in the future.

The patch focuses on providing facilities, while providing new coverage
for several features. There should be a TODO list on the wiki (bug
tracker, actually), where the list of tests to be written can be managed.

Some were mentioned in the thread (multi/xid wraparound
hot_standby_feedback, max_standby_archive_delay and
max_standby_streaming_delay? recovery_target_action? some in your
original list?), but threads
are precisely where these things get lost in the cracks.

# Interactive use vs. TAP tests

Early on the goal was also to provide something for interactive use
in order to test scenarios. The shift has focused to the TAP tests
and some of the choices in the API reflect that. Interactive use
is possible, but wasn't a central requirement.

# Directory structure

I suggested keeping backup/log/PGDATA per instance, rejected.

# Parallel tests and port collisions

Lots about this. Final result is no port races are possible because
dedicated dirs are used per test, per instance. And because tcp
isn't used for connections on any platform (can you confirm that's
true on windows as well? I'm not familiar with sspi and what OSI
layer it lives on)

# Allow test to specify shutdown mode

Added

# decouple cleanup from node shutdown

Added (in latest patches?)

# Conveniences for test writing vs. running

My suggestions weren't picked up, but for one thing setting CLEANUP=0
in the lib (which means editing it...) can be useful for writers.

# blocking until server ready

pg_isready wrapper added.

# Multiple masters

back and forth, but supported in latest version.

That's it. I've ran the latest (v9) tests works and passed on my system
(fedora 64bit) and also under docker with --cpu-quota=10000, which
simulates a slow machine.

Michael, is there anything else to do here or shall I mark this for
committer review?

Regards,
Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

10 October 2015, 13:32:17

On Sat, Oct 10, 2015 at 9:04 PM, Amir Rohan wrote:
> Now that v9 fixes the probkem, here's a summary from going over the
> entire thread one last time:

Thanks a lot for the summary of the events.

> # Windows and TAP sets
> Noah (2015-03) mentioned TAP doesn't work on windows, and hoped
> this would include some work on that.
> IIUC, the facilities and tests do run on windows, but focus was there
> and not the preexisting TAP suite.

They do work on Windows, see 13d856e.

> # Test coverage (in the future)
> Andres wanted a test for xid/multixid wraparound which also raises
> the question of the tests that will need to be written in the future.

I recall that this would have needed extra functions on the backend...

> The patch focuses on providing facilities, while providing new coverage
> for several features. There should be a TODO list on the wiki (bug
> tracker, actually), where the list of tests to be written can be managed.
> Some were mentioned in the thread (multi/xid wraparound
> hot_standby_feedback, max_standby_archive_delay and
> max_standby_streaming_delay? recovery_target_action? some in your
> original list?), but threads
> are precisely where these things get lost in the cracks.

Sure, that's an on-going task.

> # Directory structure
> I suggested keeping backup/log/PGDATA per instance, rejected.

I guess that I am still flexible on this one, the node information
(own PGDATA, connection string, port, etc.) is logged as well so this
is not a big deal to me...

> # Parallel tests and port collisions
> Lots about this. Final result is no port races are possible because
> dedicated dirs are used per test, per instance. And because tcp
> isn't used for connections on any platform (can you confirm that's
> true on windows as well? I'm not familiar with sspi and what OSI
> layer it lives on)

On Windows you remain with the problem that all nodes initialized
using TestLib.pm will listen to 127.0.0.1, sspi being used to ensure
that the connection at user level is secure (additional entries in
pg_hba.conf are added).

> # decouple cleanup from node shutdown
> Added (in latest patches?)

Yes this was added.

> Michael, is there anything else to do here or shall I mark this for
> committer review?

I have nothing else. Thanks a lot!
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

10 October 2015, 13:50:00

On 10/10/2015 04:32 PM, Michael Paquier wrote:
> On Sat, Oct 10, 2015 at 9:04 PM, Amir Rohan wrote:
>> Now that v9 fixes the problem, here's a summary from going over the
>> entire thread one last time:
> 
> Thanks a lot for the summary of the events.
> 
>> # Windows and TAP sets
>> Noah (2015-03) mentioned TAP doesn't work on windows, and hoped
>> this would include some work on that.
>> IIUC, the facilities and tests do run on windows, but focus was there
>> and not the preexisting TAP suite.
> 
> They do work on Windows, see 13d856e.
> 

Thanks, I did not know that.

>> # Test coverage (in the future)
>> Andres wanted a test for xid/multixid wraparound which also raises
>> the question of the tests that will need to be written in the future.
> 
> I recall that this would have needed extra functions on the backend...
> 
>> The patch focuses on providing facilities, while providing new coverage
>> for several features. There should be a TODO list on the wiki (bug
>> tracker, actually), where the list of tests to be written can be managed.
>> Some were mentioned in the thread (multi/xid wraparound
>> hot_standby_feedback, max_standby_archive_delay and
>> max_standby_streaming_delay? recovery_target_action? some in your
>> original list?), but threads
>> are precisely where these things get lost in the cracks.
> 
> Sure, that's an on-going task.
> 
>> # Directory structure
>> I suggested keeping backup/log/PGDATA per instance, rejected.
> 
> I guess that I am still flexible on this one, the node information
> (own PGDATA, connection string, port, etc.) is logged as well so this
> is not a big deal to me...
> 
>> # Parallel tests and port collisions
>> Lots about this. Final result is no port races are possible because
>> dedicated dirs are used per test, per instance. And because tcp
>> isn't used for connections on any platform (can you confirm that's
>> true on windows as well? I'm not familiar with sspi and what OSI
>> layer it lives on)
> 
> On Windows you remain with the problem that all nodes initialized
> using TestLib.pm will listen to 127.0.0.1, sspi being used to ensure
> that the connection at user level is secure (additional entries in
> pg_hba.conf are added).
> 
>> # decouple cleanup from node shutdown
>> Added (in latest patches?)
> 
> Yes this was added.
> 
>> Michael, is there anything else to do here or shall I mark this for
>> committer review?
> 
> I have nothing else. Thanks a lot!
> 

Ok, marked for committer, I hope I'm following "correct" cf procedure.

Regards,
Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

10 October 2015, 13:52:42

On 10/10/2015 04:32 PM, Michael Paquier wrote:
> On Sat, Oct 10, 2015 at 9:04 PM, Amir Rohan wrote:
>> The patch focuses on providing facilities, while providing new coverage
>> for several features. There should be a TODO list on the wiki (bug
>> tracker, actually), where the list of tests to be written can be managed.
>> Some were mentioned in the thread (multi/xid wraparound
>> hot_standby_feedback, max_standby_archive_delay and
>> max_standby_streaming_delay? recovery_target_action? some in your
>> original list?), but threads
>> are precisely where these things get lost in the cracks.
> 
> Sure, that's an on-going task.
>  

I was arguing that it's an on-going task that would do
better if it had a TODO list, instead of "ideas for tests"
being scattered across 50-100 messages spanning a year or
more in one thread or another. You may disagree.

Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

10 October 2015, 23:48:04

On Sat, Oct 10, 2015 at 10:52 PM, Amir Rohan <amir.rohan@zoho.com> wrote:
> On 10/10/2015 04:32 PM, Michael Paquier wrote:
> I was arguing that it's an on-going task that would do
> better if it had a TODO list, instead of "ideas for tests"
> being scattered across 50-100 messages spanning a year or
> more in one thread or another. You may disagree.

Let's be clear. I am fully in line with your point.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

11 October 2015, 07:44:54

On 10/11/2015 02:47 AM, Michael Paquier wrote:
> On Sat, Oct 10, 2015 at 10:52 PM, Amir Rohan <amir.rohan@zoho.com> wrote:
>> On 10/10/2015 04:32 PM, Michael Paquier wrote:
>> I was arguing that it's an on-going task that would do
>> better if it had a TODO list, instead of "ideas for tests"
>> being scattered across 50-100 messages spanning a year or
>> more in one thread or another. You may disagree.
> 
> Let's be clear. I am fully in line with your point.
> 

I apologize -- that didn't came out right.
What I meant to suggest was "open an issue" to track
any works that needs to be done. But I guess that's
not the PG way.

Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

11 October 2015, 10:19:31

On Sun, Oct 11, 2015 at 4:44 PM, Amir Rohan wrote:
> On 10/11/2015 02:47 AM, Michael Paquier wrote:
> I apologize -- that didn't came out right.
> What I meant to suggest was "open an issue" to track
> any works that needs to be done. But I guess that's
> not the PG way.

No problem. I was not clear either. We could create a new item in the
TODO list (https://wiki.postgresql.org/wiki/Todo) and link it to
dedicated page on the wiki where all the potential tests would be
listed.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Amir Rohan

Date:

11 October 2015, 10:32:08

On 10/11/2015 01:19 PM, Michael Paquier wrote:
> On Sun, Oct 11, 2015 at 4:44 PM, Amir Rohan wrote:
>> On 10/11/2015 02:47 AM, Michael Paquier wrote:
>> I apologize -- that didn't came out right.
>> What I meant to suggest was "open an issue" to track
>> any works that needs to be done. But I guess that's
>> not the PG way.
> 
> No problem. I was not clear either. We could create a new item in the
> TODO list (https://wiki.postgresql.org/wiki/Todo) and link it to
> dedicated page on the wiki where all the potential tests would be
> listed.
> 

It couldn't hurt but also may be just a waste of your time.
I'm just realizing how central an issue tracker is to how I work and
how much not having one irritates me. Tough luck for me I guess.

Regards,
Amir

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

18 November 2015, 15:57:57

Hi, I just started looking this over a bit.  The first thing I noticed
is that it adds a dependency on Archive::Tar which isn't already used
anywhere else.  Did anybody check whether this exists back in 5.8
installations?

Why is "recovery" added to ALWAYS_SUBDIRS in src/test/Makefile instead
of to SUBDIRS?  Seems a strange choice.

Instead of adding    print "# Error output: $stderr\n" if $stderr ne "";
to sub psql, I think it would be better to add line separators, which
would be clearer if the error output ever turns into a multiline error
messages.  It would still show as empty if no stderr is produced; so I
think something like
if ($stderr ne '')
{print "#### Begin standard error\n"print $stderr;print "#### End standard error\n";
}
or something like that.

In my days of Perl, it was starting to become frowned upon to call
subroutines without parenthesizing arguments.  Is that no longer the
case?  Because I notice there are many places in this patch and pre-
existing that call psql with an argument list without parens.  And it's
a bit odd because I couldn't find any other subroutine that we're using
in that way.

In 005_replay_delay there's a 2s delay configured; then we test whether
something is replayed in 1s.  I hate tests that run for a long time, but
is 2s good enough considering that some of our test animals in buildfarm
are really slow?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Erik Rijkers

Date:

18 November 2015, 16:08:02

On 2015-11-18 16:21, Alvaro Herrera wrote:
> Hi, I just started looking this over a bit.  The first thing I noticed
> is that it adds a dependency on Archive::Tar which isn't already used
> anywhere else.  Did anybody check whether this exists back in 5.8
> installations?

Apparently it did not yet exist in core then, Module::CoreList says 
5.9.3:

$ perl -MModule::CoreList -e ' print 
Module::CoreList->first_release('Archive::Tar'), "\n";'
5.009003

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

19 November 2015, 02:39:48

On Wed, Nov 18, 2015 at 01:21:45PM -0200, Alvaro Herrera wrote:
> In my days of Perl, it was starting to become frowned upon to call
> subroutines without parenthesizing arguments.  Is that no longer the
> case?

I've not witnessed those frowns.

> Because I notice there are many places in this patch and pre-
> existing that call psql with an argument list without parens.  And it's
> a bit odd because I couldn't find any other subroutine that we're using
> in that way.

TestLib.pm has unparenthesized calls to "standard_initdb", "start" and "run".
070_dropuser.pl has such calls to "start_test_server" and "psql".

> In 005_replay_delay there's a 2s delay configured; then we test whether
> something is replayed in 1s.  I hate tests that run for a long time, but
> is 2s good enough considering that some of our test animals in buildfarm
> are really slow?

That test will be unreliable, agreed.

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

19 November 2015, 04:40:41

On Thu, Nov 19, 2015 at 12:21 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>
> Hi, I just started looking this over a bit.  The first thing I noticed
> is that it adds a dependency on Archive::Tar which isn't already used
> anywhere else.  Did anybody check whether this exists back in 5.8
> installations?

Actually I didn't and that's a good point, we have decided to support
TAP down to 5.8.9. The only reason why I introduced this dependency is
that there is no easy native way to copy an entire folder in perl, and
that's for handling base backups. There are things like File::NCopy of
File::Copy::Recursive however it does not seem like a good idea to
depend on other modules that IPC::Run. Would it be better to have an
in-core module dedicated to that similar to SimpleTee.pm? Or are you
guys fine to accept a dependency with another module?

> Why is "recovery" added to ALWAYS_SUBDIRS in src/test/Makefile instead
> of to SUBDIRS?  Seems a strange choice.

Because I thought that it should not be part of the main regression
suite, like ssl/. Feel free to correct me if my feeling is wrong.

> Instead of adding
>     print "# Error output: $stderr\n" if $stderr ne "";
> to sub psql, I think it would be better to add line separators, which
> would be clearer if the error output ever turns into a multiline error
> messages.  It would still show as empty if no stderr is produced; so I
> think something like
> if ($stderr ne '')
> {
>         print "#### Begin standard error\n"
>         print $stderr;
>         print "#### End standard error\n";
> }
> or something like that.

Yes, that would be better.

> In my days of Perl, it was starting to become frowned upon to call
> subroutines without parenthesizing arguments.  Is that no longer the
> case?  Because I notice there are many places in this patch and pre-
> existing that call psql with an argument list without parens.  And it's
> a bit odd because I couldn't find any other subroutine that we're using
> in that way.

Hm, yeah. If we decide about a perl coding policy I would be happy to
follow it. Personally I prefer usually using parenthesis however if we
decide to make the calls consistent we had better address that as a
separate patch.

> In 005_replay_delay there's a 2s delay configured; then we test whether
> something is replayed in 1s.  I hate tests that run for a long time, but
> is 2s good enough considering that some of our test animals in buildfarm
> are really slow?

A call to poll_query_until ensures that we wait for the standby to
replay once the minimum replay threshold is reached. Even with a slow
machine the first query would still see only 10 rows at the first try,
and then wait for the standby to replay before checking if 20 rows are
visible. Or I am not following your point.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Robert Haas

Date:

19 November 2015, 16:05:44

On Wed, Nov 18, 2015 at 10:21 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> In my days of Perl, it was starting to become frowned upon to call
> subroutines without parenthesizing arguments.  Is that no longer the
> case?  Because I notice there are many places in this patch and pre-
> existing that call psql with an argument list without parens.  And it's
> a bit odd because I couldn't find any other subroutine that we're using
> in that way.

I've been coding in Perl for more than 20 years and have never heard
of such a rule.

Maybe I am not part of the "in" crowd.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Mike Blackwell

Date:

19 November 2015, 16:44:37

On Thu, Nov 19, 2015 at 10:05 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Nov 18, 2015 at 10:21 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> In my days of Perl, it was starting to become frowned upon to call
> subroutines without parenthesizing arguments. Is that no longer the
> case?

As I understand it, there are several reasons not to make function calls in Perl without parenthesis. Whether they are good reasons is a question for the user. Modern Perl chapter 5 covers most of them.

__________________________________________________________________________________
Mike Blackwell | Technical Analyst, Distribution Services/Rollout Management | RR Donnelley
1750 Wallace Ave | St Charles, IL 60174-3401
Office: 630.313.7818
Mike.Blackwell@rrd.com
http://www.rrdonnelley.com

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

David Steele

Date:

19 November 2015, 19:23:57

On 11/19/15 11:05 AM, Robert Haas wrote:
> On Wed, Nov 18, 2015 at 10:21 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> In my days of Perl, it was starting to become frowned upon to call
>> subroutines without parenthesizing arguments.  Is that no longer the
>> case?  Because I notice there are many places in this patch and pre-
>> existing that call psql with an argument list without parens.  And it's
>> a bit odd because I couldn't find any other subroutine that we're using
>> in that way.
>
> I've been coding in Perl for more than 20 years and have never heard
> of such a rule.

I follow the convention of using parentheses for all function calls in
Perl, though this stems more from my greater familiarity with languages
that require them than any adherence to vague Perl conventions.

I do think it makes the code [more] readable.

--
-David
david@pgmasters.net

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

23 November 2015, 21:15:25

I just noticed that RecoveryTest.pm is lacking "use strict; use
warnings;".  With those added, there's a number of problems reported:

Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line66.
 
Global symbol "%backup_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line
67.
Global symbol "%archive_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line68.
 
Global symbol "%connstr_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line69.
 
Global symbol "%applname_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line70.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line92.
 
Global symbol "%connstr_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line93.
 
Global symbol "%applname_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line93.
 
Global symbol "%archive_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line104.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line111.
 
Global symbol "%archive_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line121.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line130.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line185.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line197.
 
Global symbol "@array" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line 220.
Global symbol "%backup_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line
243.
Global symbol "%archive_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line244.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line246.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line257.
 
Global symbol "%backup_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line
258.
Global symbol "%archive_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line259.
 
Global symbol "%connstr_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line260.
 
Global symbol "%applname_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line261.
 
Global symbol "%backup_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line
272.
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line287.
 
Global symbol "%backup_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line
288.
Global symbol "%archive_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line289.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line292.
 
Global symbol "$current_dir" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line
294.
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line302.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line313.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line320.
 
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line367.
 
Global symbol "%backup_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line
377.
Global symbol "%datadir_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line390.
 
Global symbol "%backup_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm line
391.
Global symbol "%archive_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line392.
 
Global symbol "%connstr_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line393.
 
Global symbol "%applname_nodes" requires explicit package name at /pgsql/source/master/src/test/perl/RecoveryTest.pm
line394.
 

Most of them are easily fixable by adding the correct "my" lines; but at
least @array and $current_dir require more code to be written.

TBH all that business with arrays that are kept in sync looks too
contrived to me.  Could we have a Perl object representing each node
instead?  That would require a "PostgresNode" package (or similar).  The
RecoveryTest.pm would have a single %nodes hash.  Also, you don't need
@active_nodes, just a flag in PostgresNode, and have the stop routine do
nothing if node is not marked active.  Also: if you pass the "root node"
when creating a node that will become a standby, you don't need to pass
it when calling, say, enable_streaming; the root node becomes an
instance variable.  (Hmm, actually, if we do that, I wonder what if in
the future we want to test node promotion and a standby is repointed to
a new master.  Maybe we don't want to have this knowledge in the Perl
code at all.)

In get_free_port, isn't it easier to use pg_isready rather than psql?

I've been messing with 003 because I think it's a bit too repetitive.
Will finish it after you post a fixed version of RecoveryTest.pm.

Thanks!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

23 November 2015, 21:27:19

Michael Paquier wrote:
> On Thu, Nov 19, 2015 at 12:21 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > Hi, I just started looking this over a bit.  The first thing I noticed
> > is that it adds a dependency on Archive::Tar which isn't already used
> > anywhere else.  Did anybody check whether this exists back in 5.8
> > installations?
> 
> Actually I didn't and that's a good point, we have decided to support
> TAP down to 5.8.9. The only reason why I introduced this dependency is
> that there is no easy native way to copy an entire folder in perl, and
> that's for handling base backups. There are things like File::NCopy of
> File::Copy::Recursive however it does not seem like a good idea to
> depend on other modules that IPC::Run. Would it be better to have an
> in-core module dedicated to that similar to SimpleTee.pm? Or are you
> guys fine to accept a dependency with another module?

It would be a lot better to not have to rely on another module existing
everywhere.  I'd rather have another simple module, following
SimpleTee's example.  Since this doesn't have to be terribly generic, it
should be reasonably short, I hope.

> > Why is "recovery" added to ALWAYS_SUBDIRS in src/test/Makefile instead
> > of to SUBDIRS?  Seems a strange choice.
> 
> Because I thought that it should not be part of the main regression
> suite, like ssl/. Feel free to correct me if my feeling is wrong.

As I understand, the problem with "ssl" is that it messes with
system-wide settings, which is not the case here.  I'm inclined to move
it to SUBDIRS.  As an example, "modules" is not part of the main
regression suite either.

> > In my days of Perl, it was starting to become frowned upon to call
> > subroutines without parenthesizing arguments.  Is that no longer the
> > case?  Because I notice there are many places in this patch and pre-
> > existing that call psql with an argument list without parens.  And it's
> > a bit odd because I couldn't find any other subroutine that we're using
> > in that way.
> 
> Hm, yeah. If we decide about a perl coding policy I would be happy to
> follow it. Personally I prefer usually using parenthesis however if we
> decide to make the calls consistent we had better address that as a
> separate patch.

Some votes against, some votes for.  Ultimately, it seems that this
depends on the committer.  I don't really care all that much about this
TBH.

> > In 005_replay_delay there's a 2s delay configured; then we test whether
> > something is replayed in 1s.  I hate tests that run for a long time, but
> > is 2s good enough considering that some of our test animals in buildfarm
> > are really slow?
> 
> A call to poll_query_until ensures that we wait for the standby to
> replay once the minimum replay threshold is reached. Even with a slow
> machine the first query would still see only 10 rows at the first try,
> and then wait for the standby to replay before checking if 20 rows are
> visible. Or I am not following your point.

Ah, I see.  Maybe it's fine then, or else I'm not following your point
;-)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

24 November 2015, 05:07:30

On Tue, Nov 24, 2015 at 6:27 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Thu, Nov 19, 2015 at 12:21 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>
>> > Hi, I just started looking this over a bit.  The first thing I noticed
>> > is that it adds a dependency on Archive::Tar which isn't already used
>> > anywhere else.  Did anybody check whether this exists back in 5.8
>> > installations?
>>
>> Actually I didn't and that's a good point, we have decided to support
>> TAP down to 5.8.9. The only reason why I introduced this dependency is
>> that there is no easy native way to copy an entire folder in perl, and
>> that's for handling base backups. There are things like File::NCopy of
>> File::Copy::Recursive however it does not seem like a good idea to
>> depend on other modules that IPC::Run. Would it be better to have an
>> in-core module dedicated to that similar to SimpleTee.pm? Or are you
>> guys fine to accept a dependency with another module?
>
> It would be a lot better to not have to rely on another module existing
> everywhere.  I'd rather have another simple module, following
> SimpleTee's example.  Since this doesn't have to be terribly generic, it
> should be reasonably short, I hope.

Sure, that would be a simple function that does directory lookup and
recursive calls. I'll move ahead with that then and reuse it in the
recovery logic.

>> > Why is "recovery" added to ALWAYS_SUBDIRS in src/test/Makefile instead
>> > of to SUBDIRS?  Seems a strange choice.
>>
>> Because I thought that it should not be part of the main regression
>> suite, like ssl/. Feel free to correct me if my feeling is wrong.
>
> As I understand, the problem with "ssl" is that it messes with
> system-wide settings, which is not the case here.  I'm inclined to move
> it to SUBDIRS.  As an example, "modules" is not part of the main
> regression suite either.

OK, I'll move it back to it then.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

24 November 2015, 05:14:51

Thanks for the review.

On Tue, Nov 24, 2015 at 6:15 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

I just noticed that RecoveryTest.pm is lacking "use strict; use
warnings;". With those added, there's a number of problems reported:

Most of them are easily fixable by adding the correct "my" lines; but at
least @array and $current_dir require more code to be written.

Oops.

TBH all that business with arrays that are kept in sync looks too
contrived to me. Could we have a Perl object representing each node
instead?

Not really to be honest.

That would require a "PostgresNode" package (or similar). The
RecoveryTest.pm would have a single %nodes hash. Also, you don't need
@active_nodes, just a flag in PostgresNode, and have the stop routine do
nothing if node is not marked active. Also: if you pass the "root node"
when creating a node that will become a standby, you don't need to pass
it when calling, say, enable_streaming; the root node becomes an
instance variable. (Hmm, actually, if we do that, I wonder what if in
the future we want to test node promotion and a standby is repointed to
a new master. Maybe we don't want to have this knowledge in the Perl
code at all.)

I think I'll get the idea. In short all the parametrization will just happen at object level, as well as basic actions on the nodes like start, stop, restart etc.

In get_free_port, isn't it easier to use pg_isready rather than psql?

Will switch.

I've been messing with 003 because I think it's a bit too repetitive.
Will finish it after you post a fixed version of RecoveryTest.pm.

Sure, thanks.

I'll rework this patch and will update a new version soon.

Thanks again for the review.

Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

24 November 2015, 14:20:08

On Tue, Nov 24, 2015 at 2:14 PM, Michael Paquier wrote:

I'll rework this patch and will update a new version soon.

So, attached is a new patch addressing all the comments received. The new version has the following changes:

- Print more verbosely stderr output in case of error in psql

- Add recovery test suite to SUBDIRS in src/test/Makefile

- Add strict and warnings to what is used in the new modules of this patch

- Manage node information using package/class PostgresNode.pm and have RecoveryTest use it. I have actually made PostgresNode bare-bone and simple on purpose: one can initialize the node, append configuration parameters to it and manage it through start/stop/restart (we may want to add reload and promote actually if needed). However, more complex configuration is left to RecoveryTest.pm, which is in charge of appending the configuration dedicated to streaming, archiving, etc though a set of routines working on PostgresNode objects. I have also arrived at the conclusion that it is not really worth adding a node status flag in PostgresNode because the port number saved there is sufficient when doing free port lookup, and the list of nodes used in a recovery test are saved in an array.

- Add new module RecursiveCopy to be used for base backups. This removes the dependency with Archive::Tar. PostgresNode makes use of that when initializing a node from a backup.

- Tests have been updated to use the PostgresNode objects instead of the port number as identifier. That's more portable.

Hopefully I have missed nothing.

Regards,

Michael

Attachment

20151124_recovery_regression_v9.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

24 November 2015, 21:22:59

Michael Paquier wrote:

> - Manage node information using package/class PostgresNode.pm and have
> RecoveryTest use it. I have actually made PostgresNode bare-bone and simple
> on purpose: one can initialize the node, append configuration parameters to
> it and manage it through start/stop/restart (we may want to add reload and
> promote actually if needed).

This looks great as a starting point.  I think we should make TestLib
depend on PostgresNode instead of the other way around.  I will have a
look at that (I realize this means messing with the existing tests).

> I have also arrived at the conclusion that it is not really worth
> adding a node status flag in PostgresNode because the port number
> saved there is sufficient when doing free port lookup, and the list of
> nodes used in a recovery test are saved in an array.

I don't disagree with this in principle, but I think the design that you
get a new PostgresNode object by calling get_free_port is strange.  I
think the port lookup code should be part of either TestLib or
PostgresNode, not RecoveryTest.

> - Add new module RecursiveCopy to be used for base backups. This removes
> the dependency with Archive::Tar. PostgresNode makes use of that when
> initializing a node from a backup.

Great.

> - Tests have been updated to use the PostgresNode objects instead of the
> port number as identifier. That's more portable.

Makes sense.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

25 November 2015, 00:08:39

On Wed, Nov 25, 2015 at 6:22 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Michael Paquier wrote:

> - Manage node information using package/class PostgresNode.pm and have
> RecoveryTest use it. I have actually made PostgresNode bare-bone and simple
> on purpose: one can initialize the node, append configuration parameters to
> it and manage it through start/stop/restart (we may want to add reload and
> promote actually if needed).

This looks great as a starting point. I think we should make TestLib
depend on PostgresNode instead of the other way around. I will have a
look at that (I realize this means messing with the existing tests).

Makes sense. My thoughts following that is that we should keep a track of the nodes started as an array which is part of TestLib, with PGHOST set once at startup using tempdir_short. That's surely an refactoring patch somewhat independent of the recovery test suite. I would not mind writing something among those lines if needed.

> I have also arrived at the conclusion that it is not really worth
> adding a node status flag in PostgresNode because the port number
> saved there is sufficient when doing free port lookup, and the list of
> nodes used in a recovery test are saved in an array.

I don't disagree with this in principle, but I think the design that you
get a new PostgresNode object by calling get_free_port is strange. I
think the port lookup code should be part of either TestLib or
PostgresNode, not RecoveryTest.

I'd vote for TestLib. I have written PostgresNode this way to allow users to set up arbitrary port numbers if they'd like to do so. That's more flexible.

Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

25 November 2015, 01:55:32

Michael Paquier wrote:
> On Wed, Nov 25, 2015 at 6:22 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
> 
> > Michael Paquier wrote:

> > This looks great as a starting point.  I think we should make TestLib
> > depend on PostgresNode instead of the other way around.  I will have a
> > look at that (I realize this means messing with the existing tests).
> 
> Makes sense. My thoughts following that is that we should keep a track of
> the nodes started as an array which is part of TestLib, with PGHOST set
> once at startup using tempdir_short. That's surely an refactoring patch
> somewhat independent of the recovery test suite. I would not mind writing
> something among those lines if needed.

OK, please do.

We can split this up in two patches: one introducing PostgresNode
(+ RecursiveCopy) together with the refactoring of existing test code,
and a subsequent one introducing RecoveryTest and the corresponding
subdir.  Sounds good?

> > > I have also arrived at the conclusion that it is not really worth
> > > adding a node status flag in PostgresNode because the port number
> > > saved there is sufficient when doing free port lookup, and the list of
> > > nodes used in a recovery test are saved in an array.
> >
> > I don't disagree with this in principle, but I think the design that you
> > get a new PostgresNode object by calling get_free_port is strange.  I
> > think the port lookup code should be part of either TestLib or
> > PostgresNode, not RecoveryTest.
> 
> I'd vote for TestLib. I have written PostgresNode this way to allow users
> to set up arbitrary port numbers if they'd like to do so. That's more
> flexible.

That works for me.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

25 November 2015, 02:01:04

On Wed, Nov 25, 2015 at 10:55 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Michael Paquier wrote:
> On Wed, Nov 25, 2015 at 6:22 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>
> > Michael Paquier wrote:

> > This looks great as a starting point. I think we should make TestLib
> > depend on PostgresNode instead of the other way around. I will have a
> > look at that (I realize this means messing with the existing tests).
>
> Makes sense. My thoughts following that is that we should keep a track of
> the nodes started as an array which is part of TestLib, with PGHOST set
> once at startup using tempdir_short. That's surely an refactoring patch
> somewhat independent of the recovery test suite. I would not mind writing
> something among those lines if needed.

OK, please do.

We can split this up in two patches: one introducing PostgresNode
(+ RecursiveCopy) together with the refactoring of existing test code,
and a subsequent one introducing RecoveryTest and the corresponding
subdir. Sounds good?

Yeah, that matches my line of thoughts. Will do so.

Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

25 November 2015, 13:20:49

On Wed, Nov 25, 2015 at 11:00 AM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Wed, Nov 25, 2015 at 10:55 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Michael Paquier wrote:
> On Wed, Nov 25, 2015 at 6:22 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>
> > Michael Paquier wrote:

> > This looks great as a starting point. I think we should make TestLib
> > depend on PostgresNode instead of the other way around. I will have a
> > look at that (I realize this means messing with the existing tests).
>
> Makes sense. My thoughts following that is that we should keep a track of
> the nodes started as an array which is part of TestLib, with PGHOST set
> once at startup using tempdir_short. That's surely an refactoring patch
> somewhat independent of the recovery test suite. I would not mind writing
> something among those lines if needed.

OK, please do.

We can split this up in two patches: one introducing PostgresNode
(+ RecursiveCopy) together with the refactoring of existing test code,
and a subsequent one introducing RecoveryTest and the corresponding
subdir. Sounds good?

Yeah, that matches my line of thoughts. Will do so.

The result of a couple of hours of hacking is attached:

- 0001 is the refactoring adding PostgresNode and RecursiveCopy. I have also found that it is quite advantageous to move some of the routines that are synonyms of system() and the stuff used for logging into another low-level library that PostgresNode depends on, that I called TestBase in this patch. This way, all the infrastructure depends on the same logging management. Existing tests have been refactored to fit into the new code, and this leads to a couple of simplifications particularly in pg_rewind tests because there is no more need to have there routines for environment cleanup and logging. I have done tests on OSX and Windows using it and tests are passing. I have as well tested that ssl tests were working.

- 0002 adds the recovery tests with RecoveryTest.pm now located in src/test/recovery.

Regards,

Michael

Attachment

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

27 November 2015, 22:53:30

Michael Paquier wrote:

> The result of a couple of hours of hacking is attached:
> - 0001 is the refactoring adding PostgresNode and RecursiveCopy. I have
> also found that it is quite advantageous to move some of the routines that
> are synonyms of system() and the stuff used for logging into another
> low-level library that PostgresNode depends on, that I called TestBase in
> this patch. This way, all the infrastructure depends on the same logging
> management. Existing tests have been refactored to fit into the new code,
> and this leads to a couple of simplifications particularly in pg_rewind
> tests because there is no more need to have there routines for environment
> cleanup and logging. I have done tests on OSX and Windows using it and
> tests are passing. I have as well tested that ssl tests were working.

Here's another version of this.  I changed the packages a bit more.  For
starters, I moved the routines around a bit; some of your choices seemed
more about keeping stuff where it was originally rather than moving it
to where it made sense.  These are the routines in each module:

TestBase:  system_or_bail system_log run_log slurp_dir slurp_file
append_to_file

TestLib:    get_new_node teardown_node psql poll_query_until command_ok
command_fails command_exit_is program_help_ok program_version_ok
program_options_handling_ok command_like issues_sql_like

I tried to get rid of teardown_node by having a DESTROY method for
PostgresNode; that method would call "pg_ctl stop -m immediate".  That
would have been much cleaner.  However, when a test fails this doesn't
work sanely because the END block for File::Temp runs earlier than that
DESTROY block, which means the datadir is already gone by the time
pg_ctl stop runs, so the node stop doesn't work at all.  (Perhaps we
could fix this by noting postmaster's PID at start time, and then
sending a signal directly instead of relying on pg_ctl).

I moved all the initialization code (deleting stuff from environment,
detecting Windows, opening SimpleTie filedescs etc) into BEGIN blocks,
which run earlier than any other code.

I perltidy'ed PostgresNode (and all the other files actually), to have
the style match the rest of our code.  I also updated some code to be
more Perlish.

I added a lot of error checking in RecursiveCopy.

You had a "cat" call somewhere, which I replaced with slurp_file.

I considered updating RewindTest so that it didn't have to export the
node global variables, but decided not to, not because of the huge code
churn for the t/*.pl files but because of the problem with the DESTROY
method above: it didn't actually buy anything.

Hmm. I just noticed RewindTest sets $ENV{PGDATABASE} outside BEGIN.  Not
sure what to think of that.  Could instead pass the database name in
$node->getConnStr() calls, like run_pg_rewind() is already doing.

I tried all the t/ tests we have and all of them pass for me.  If I'm
able, I will push this on my Sunday late evening, so that I can fix
whatever gets red on Monday first thing ...

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

taptest.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

28 November 2015, 11:20:26

On Sat, Nov 28, 2015 at 7:53 AM, Alvaro Herrera wrote:
> I moved all the initialization code (deleting stuff from environment,
> detecting Windows, opening SimpleTie filedescs etc) into BEGIN blocks,
> which run earlier than any other code.

Ah, OK. Thanks. That makes visibly the whole set of modules more consistent.

> Hmm. I just noticed RewindTest sets $ENV{PGDATABASE} outside BEGIN.  Not
> sure what to think of that.  Could instead pass the database name in
> $node->getConnStr() calls, like run_pg_rewind() is already doing.

Yes, let's remove that and pass the database name to getConnStr().

> I tried all the t/ tests we have and all of them pass for me.  If I'm
> able, I will push this on my Sunday late evening, so that I can fix
> whatever gets red on Monday first thing ...

I have done as well additional tests on Windows and this patch is
showing a green status.

A separate issue, but as long as we are working on this set of tests:
I have noticed that config_default.pl is missing the flag tap_tests in
its list. See the patch attached. Could you apply that as well and
backpatch?

I have as well noticed that RewindTest.pm is missing "1;" on its last
line. When this is loaded this would lead to compilation errors.
--
Michael

Attachment

20151128_fix_config_msvc.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

28 November 2015, 15:14:20

Michael Paquier wrote:
> On Sat, Nov 28, 2015 at 7:53 AM, Alvaro Herrera wrote:

> > Hmm. I just noticed RewindTest sets $ENV{PGDATABASE} outside BEGIN.  Not
> > sure what to think of that.  Could instead pass the database name in
> > $node->getConnStr() calls, like run_pg_rewind() is already doing.
>
> Yes, let's remove that and pass the database name to getConnStr().

Ok.


> A separate issue, but as long as we are working on this set of tests:
> I have noticed that config_default.pl is missing the flag tap_tests in
> its list. See the patch attached. Could you apply that as well and
> backpatch?

> I have as well noticed that RewindTest.pm is missing "1;" on its last
> line. When this is loaded this would lead to compilation errors.

Sure.


> > I tried all the t/ tests we have and all of them pass for me.  If I'm
> > able, I will push this on my Sunday late evening, so that I can fix
> > whatever gets red on Monday first thing ...
>
> I have done as well additional tests on Windows and this patch is
> showing a green status.

Great.

Here's your recovery test patch rebased, for your (and others'!)
perusal.  It passes for me.  (Test 003 is unchanged.)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

0002.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

29 November 2015, 12:40:06

On Sun, Nov 29, 2015 at 12:13 AM, Alvaro Herrera wrote:
> Here's your recovery test patch rebased, for your (and others'!)
> perusal.  It passes for me.  (Test 003 is unchanged.)

Are you planning to push that as well? It does not have much coverage
but I guess that's quite good for a first shot, and that can serve as
example for future tests.

Still, the first patch adds enough infrastructure to allow any other
module to have more complex regression test scenarios, the first two
targets coming immediately to my mind being the quorum syncrep patch
and pg_rewind and its timeline switch manipulation. So that's more
than welcome!
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

29 November 2015, 21:28:24

On Fri, Nov 27, 2015 at 07:53:10PM -0300, Alvaro Herrera wrote:
> Michael Paquier wrote:
> > The result of a couple of hours of hacking is attached:
> > - 0001 is the refactoring adding PostgresNode and RecursiveCopy. I have
> > also found that it is quite advantageous to move some of the routines that
> > are synonyms of system() and the stuff used for logging into another
> > low-level library that PostgresNode depends on, that I called TestBase in
> > this patch.

> Here's another version of this.  I changed the packages a bit more.  For
> starters, I moved the routines around a bit; some of your choices seemed
> more about keeping stuff where it was originally rather than moving it
> to where it made sense.  These are the routines in each module:
> 
> TestBase:  system_or_bail system_log run_log slurp_dir slurp_file
> append_to_file
> 
> TestLib:    get_new_node teardown_node psql poll_query_until command_ok
> command_fails command_exit_is program_help_ok program_version_ok
> program_options_handling_ok command_like issues_sql_like

The proposed code is short on guidance about when to put a function in TestLib
versus TestBase.  TestLib has no header comment.  The TestBase header comment
would permit, for example, command_ok() in that module.  I would try instead
keeping TestLib as the base module and moving into PostgresNode the functions
that deal with PostgreSQL clusters (get_new_node teardown_node psql
poll_query_until issues_sql_like).

> I tried to get rid of teardown_node by having a DESTROY method for
> PostgresNode; that method would call "pg_ctl stop -m immediate".  That
> would have been much cleaner.  However, when a test fails this doesn't
> work sanely because the END block for File::Temp runs earlier than that
> DESTROY block, which means the datadir is already gone by the time
> pg_ctl stop runs, so the node stop doesn't work at all.  (Perhaps we
> could fix this by noting postmaster's PID at start time, and then
> sending a signal directly instead of relying on pg_ctl).

You could disable File::Temp cleanup and handle cleanup yourself at the
desired time.  (I haven't reviewed whether the goal of removing teardown_node
is otherwise good.)

> +my $node = get_new_node();
> +# Initialize node without replication settings
> +$node->initNode(0);
> +$node->startNode();
> +my $pgdata = $node->getDataDir();
> +
> +$ENV{PGPORT} = $node->getPort();

Starting a value retrieval method name with "get" is not Perlish.  The TAP
suites currently follow "man perlstyle" in using underscored_lower_case method
names.  No PostgreSQL Perl code uses lowerFirstCamelCase, though some uses
CamelCase.  The word "Node" is redundant.  Use this style:
 $node->init(0); $node->start; my $pgdata = $node->data_dir; $ENV{PGPORT} = $node->port;

As a matter of opinion, I recommend giving "init" key/value arguments instead
of the single Boolean argument.  The method could easily need more options in
the future, and this makes the call site self-documenting:
 $node->init(hba_permit_replication => 0);

> -    'pg_controldata with nonexistent directory fails');
> +              'pg_controldata with nonexistent directory fails');

perltidy will undo this whitespace-only change.

> --- a/src/bin/pg_rewind/t/001_basic.pl
> +++ b/src/bin/pg_rewind/t/001_basic.pl
> @@ -1,9 +1,11 @@
> +# Basic pg_rewind test.
> +
>  use strict;
>  use warnings;
> -use TestLib;
> -use Test::More tests => 8;
>  
>  use RewindTest;
> +use TestLib;
> +use Test::More tests => 8;

Revert all changes to this file.  Audit the rest of the patch for whitespace
change unrelated to the subject.

> -    'fails with nonexistent table');
> +              'fails with nonexistent table');

> -'CREATE TABLE test1 (a int); CREATE INDEX test1x ON test1 (a); CLUSTER test1 USING test1x';
> +    'CREATE TABLE test1 (a int); CREATE INDEX test1x ON test1 (a); CLUSTER test1 USING test1x';

perltidy will undo these whitespace-only changes.

> +# cluster -a is not compatible with -d, hence enforce environment variables

s/cluster -a/clusterdb -a/

> -issues_sql_like(
> +$ENV{PGPORT} = $node->getPort();
> +
> +issues_sql_like($node,

perltidy will move $node to its own line.

> -command_fails([ 'createuser', 'user1' ], 'fails if role already exists');
> +command_fails([ 'createuser', 'user1' ],
> +              'fails if role already exists');

perltidy will undo this whitespace-only change.

> @@ -0,0 +1,252 @@
> +# PostgresNode, simple node representation for regression tests.
> +#
> +# Regression tests should use this basic class infrastructure to define nodes
> +# that need used in the test modules/scripts.
> +package PostgresNode;

Consider just saying, "Class representing a data directory and postmaster."

> +    my $self   = {
> +        _port     => undef,
> +        _host     => undef,
> +        _basedir  => undef,
> +        _applname => undef,
> +        _logfile  => undef };
> +
> +    # Set up each field
> +    $self->{_port}     = $pgport;
> +    $self->{_host}     = $pghost;
> +    $self->{_basedir}  = TestBase::tempdir;
> +    $self->{_applname} = "node_$pgport";
> +    $self->{_logfile}  = "$TestBase::log_path/node_$pgport.log";

Why set fields to undef immediately before filling them?

> @@ -0,0 +1,143 @@
> +# Set of low-level routines dedicated to base tasks for regression tests, like
> +# command execution and logging.
> +#
> +# This module should not depend on any other PostgreSQL regression test
> +# modules.
> +package TestBase;

This is no mere set of routines.  Just "use"-ing this module creates some
directories and alters stdin/stdout/stderr.

> +BEGIN
> +{
> +    $windows_os = $Config{osname} eq 'MSWin32' || $Config{osname} eq 'msys';
> +
> +    # Determine output directories, and create them.  The base path is the
> +    # TESTDIR environment variable, which is normally set by the invoking
> +    # Makefile.
> +    $tmp_check = $ENV{TESTDIR} ? "$ENV{TESTDIR}/tmp_check" : "tmp_check";
> +    $log_path = "$tmp_check/log";
> +
> +    mkdir $tmp_check;
> +    mkdir $log_path;

Never mutate the filesystem in a BEGIN block, because "perl -c" runs BEGIN
blocks.  (Likewise for the BEGIN block this patch adds to TestLib.)

nm

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Andrew Dunstan

Date:

29 November 2015, 22:57:29


On 11/29/2015 04:28 PM, Noah Misch wrote:
> +BEGIN
> +{
> +    $windows_os = $Config{osname} eq 'MSWin32' || $Config{osname} eq 'msys';
> +
> +    # Determine output directories, and create them.  The base path is the
> +    # TESTDIR environment variable, which is normally set by the invoking
> +    # Makefile.
> +    $tmp_check = $ENV{TESTDIR} ? "$ENV{TESTDIR}/tmp_check" : "tmp_check";
> +    $log_path = "$tmp_check/log";
> +
> +    mkdir $tmp_check;
> +    mkdir $log_path;
> Never mutate the filesystem in a BEGIN block, because "perl -c" runs BEGIN
> blocks.  (Likewise for the BEGIN block this patch adds to TestLib.)


Yeah, those two lines might belong in an INIT block. "perldoc perlmod" 
for details.


cheers

andrew

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

30 November 2015, 07:53:17

On Mon, Nov 30, 2015 at 6:28 AM, Noah Misch <noah@leadboat.com> wrote:
>
> On Fri, Nov 27, 2015 at 07:53:10PM -0300, Alvaro Herrera wrote:
> > Michael Paquier wrote:
> > > The result of a couple of hours of hacking is attached:
> > > - 0001 is the refactoring adding PostgresNode and RecursiveCopy. I have
> > > also found that it is quite advantageous to move some of the routines that
> > > are synonyms of system() and the stuff used for logging into another
> > > low-level library that PostgresNode depends on, that I called TestBase in
> > > this patch.
>
> > Here's another version of this.  I changed the packages a bit more.  For
> > starters, I moved the routines around a bit; some of your choices seemed
> > more about keeping stuff where it was originally rather than moving it
> > to where it made sense.  These are the routines in each module:
> >
> > TestBase:  system_or_bail system_log run_log slurp_dir slurp_file
> > append_to_file
> >
> > TestLib:    get_new_node teardown_node psql poll_query_until command_ok
> > command_fails command_exit_is program_help_ok program_version_ok
> > program_options_handling_ok command_like issues_sql_like
>
> The proposed code is short on guidance about when to put a function in TestLib
> versus TestBase.  TestLib has no header comment.  The TestBase header comment
> would permit, for example, command_ok() in that module.  I would try instead
> keeping TestLib as the base module and moving into PostgresNode the functions
> that deal with PostgreSQL clusters (get_new_node teardown_node psql
> poll_query_until issues_sql_like).

PostgresNode is wanted to be a base representation of how of node is,
not of how to operate on it. The ways to perform the tests, which
works on a node, is wanted as a higher-level operation.

Logging and base configuration of a test set is a lower level of
operations than PostgresNode, because cluster nodes need actually to
perform system calls, some of those system calls like run_log allowing
to log in the centralized log file. I have tried to make the headers
of those modules more verbose, please see attached.

>
> > +my $node = get_new_node();
> > +# Initialize node without replication settings
> > +$node->initNode(0);
> > +$node->startNode();
> > +my $pgdata = $node->getDataDir();
> > +
> > +$ENV{PGPORT} = $node->getPort();
>
> Starting a value retrieval method name with "get" is not Perlish.  The TAP
> suites currently follow "man perlstyle" in using underscored_lower_case method
> names.  No PostgreSQL Perl code uses lowerFirstCamelCase, though some uses
> CamelCase.  The word "Node" is redundant.  Use this style:
>
>   $node->init(0);
>   $node->start;
>   my $pgdata = $node->data_dir;
>   $ENV{PGPORT} = $node->port;

I have switched the style this way.

> As a matter of opinion, I recommend giving "init" key/value arguments instead
> of the single Boolean argument.  The method could easily need more options in
> the future, and this makes the call site self-documenting:
>
>   $node->init(hba_permit_replication => 0);

Done.

>
> > -     'pg_controldata with nonexistent directory fails');
> > +                       'pg_controldata with nonexistent directory fails');
>
> perltidy will undo this whitespace-only change.

Cleaned up.

>
> > --- a/src/bin/pg_rewind/t/001_basic.pl
> > +++ b/src/bin/pg_rewind/t/001_basic.pl
> > @@ -1,9 +1,11 @@
> > +# Basic pg_rewind test.
> > +
> >  use strict;
> >  use warnings;
> > -use TestLib;
> > -use Test::More tests => 8;
> >
> >  use RewindTest;
> > +use TestLib;
> > +use Test::More tests => 8;
>
> Revert all changes to this file.  Audit the rest of the patch for whitespace
> change unrelated to the subject.

Done.

>
>
> > -     'fails with nonexistent table');
> > +                       'fails with nonexistent table');
>
> > -'CREATE TABLE test1 (a int); CREATE INDEX test1x ON test1 (a); CLUSTER test1 USING test1x';
> > +     'CREATE TABLE test1 (a int); CREATE INDEX test1x ON test1 (a); CLUSTER test1 USING test1x';
>
> perltidy will undo these whitespace-only changes.

Cleaned up.

>
> > +# cluster -a is not compatible with -d, hence enforce environment variables
>
> s/cluster -a/clusterdb -a/

Fixed.

>
> > -command_fails([ 'createuser', 'user1' ], 'fails if role already exists');
> > +command_fails([ 'createuser', 'user1' ],
> > +                       'fails if role already exists');
>
> perltidy will undo this whitespace-only change.
>
> > @@ -0,0 +1,252 @@
> > +# PostgresNode, simple node representation for regression tests.
> > +#
> > +# Regression tests should use this basic class infrastructure to define nodes
> > +# that need used in the test modules/scripts.
> > +package PostgresNode;
>
> Consider just saying, "Class representing a data directory and postmaster."

OK, I have changed this description:
+# PostgresNode, class representing a data directory and postmaster.
+#
+# This contains a basic set of routines able to work on a PostgreSQL node,
+# allowing to start, stop, backup and initialize it with various options.

>
> > +     my $self   = {
> > +             _port     => undef,
> > +             _host     => undef,
> > +             _basedir  => undef,
> > +             _applname => undef,
> > +             _logfile  => undef };
> > +
> > +     # Set up each field
> > +     $self->{_port}     = $pgport;
> > +     $self->{_host}     = $pghost;
> > +     $self->{_basedir}  = TestBase::tempdir;
> > +     $self->{_applname} = "node_$pgport";
> > +     $self->{_logfile}  = "$TestBase::log_path/node_$pgport.log";
>
> Why set fields to undef immediately before filling them?

Fixed.

>
> > @@ -0,0 +1,143 @@
> > +# Set of low-level routines dedicated to base tasks for regression tests, like
> > +# command execution and logging.
> > +#
> > +# This module should not depend on any other PostgreSQL regression test
> > +# modules.
> > +package TestBase;
>
> This is no mere set of routines.  Just "use"-ing this module creates some
> directories and alters stdin/stdout/stderr.

I have updated the description of this file.

>
> > +BEGIN
> > +{
> > +     $windows_os = $Config{osname} eq 'MSWin32' || $Config{osname} eq 'msys';
> > +
> > +     # Determine output directories, and create them.  The base path is the
> > +     # TESTDIR environment variable, which is normally set by the invoking
> > +     # Makefile.
> > +     $tmp_check = $ENV{TESTDIR} ? "$ENV{TESTDIR}/tmp_check" : "tmp_check";
> > +     $log_path = "$tmp_check/log";
> > +
> > +     mkdir $tmp_check;
> > +     mkdir $log_path;
>
> Never mutate the filesystem in a BEGIN block, because "perl -c" runs BEGIN
> blocks.  (Likewise for the BEGIN block this patch adds to TestLib.)

Hm. It seems to me that the whole block should be part of INIT then,
because the log file where STDERR and STDOUT is recaptured depends on
those to be created as well. By doing this change, please note that
compilation errors are not recaptured into the log file (thanks Andrew
for the pointers to perlmod).

I have as well updated pg_rewind tests to remove PGDATABASE. Patch to
address those issues is attached.
Regards,
--
Michael

Attachment

20151130_tapcheck.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

30 November 2015, 14:51:57

Michael Paquier wrote:
> On Mon, Nov 30, 2015 at 6:28 AM, Noah Misch <noah@leadboat.com> wrote:

> > The proposed code is short on guidance about when to put a function in TestLib
> > versus TestBase.  TestLib has no header comment.  The TestBase header comment
> > would permit, for example, command_ok() in that module.  I would try instead
> > keeping TestLib as the base module and moving into PostgresNode the functions
> > that deal with PostgreSQL clusters (get_new_node teardown_node psql
> > poll_query_until issues_sql_like).
> 
> PostgresNode is wanted to be a base representation of how of node is,
> not of how to operate on it. The ways to perform the tests, which
> works on a node, is wanted as a higher-level operation.
> 
> Logging and base configuration of a test set is a lower level of
> operations than PostgresNode, because cluster nodes need actually to
> perform system calls, some of those system calls like run_log allowing
> to log in the centralized log file. I have tried to make the headers
> of those modules more verbose, please see attached.

I'm not terribly convinced by this argument TBH.  Perhaps we can have
PostgresNode be one package, and the logging routines be another
package, and we create a higher-level package whose @ISA=(PostgresNode,
LoggingWhatever) and then we move the routines suggested by Noah into
that new package.  Then the tests use that instead of PostgresNode
directly.


> > > --- a/src/bin/pg_rewind/t/001_basic.pl
> > > +++ b/src/bin/pg_rewind/t/001_basic.pl
> > > @@ -1,9 +1,11 @@
> > > +# Basic pg_rewind test.
> > > +
> > >  use strict;
> > >  use warnings;
> > > -use TestLib;
> > > -use Test::More tests => 8;
> > >
> > >  use RewindTest;
> > > +use TestLib;
> > > +use Test::More tests => 8;
> >
> > Revert all changes to this file.  Audit the rest of the patch for whitespace
> > change unrelated to the subject.
> 
> Done.

I perltidied several files, though not consistently.  Regarding this
particular hunk, what is going on here is that I moved "use strict;use
warnings" as one stanza, followed by all the other "use" lines as
another stanza, alphabetically.  It was previously a bit messy, with
@EXPORTS and other stuff in between "use" lines.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

01 December 2015, 06:22:29

On Mon, Nov 30, 2015 at 11:51 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Mon, Nov 30, 2015 at 6:28 AM, Noah Misch <noah@leadboat.com> wrote:
>
>> > The proposed code is short on guidance about when to put a function in TestLib
>> > versus TestBase.  TestLib has no header comment.  The TestBase header comment
>> > would permit, for example, command_ok() in that module.  I would try instead
>> > keeping TestLib as the base module and moving into PostgresNode the functions
>> > that deal with PostgreSQL clusters (get_new_node teardown_node psql
>> > poll_query_until issues_sql_like).
>>
>> PostgresNode is wanted to be a base representation of how of node is,
>> not of how to operate on it. The ways to perform the tests, which
>> works on a node, is wanted as a higher-level operation.
>>
>> Logging and base configuration of a test set is a lower level of
>> operations than PostgresNode, because cluster nodes need actually to
>> perform system calls, some of those system calls like run_log allowing
>> to log in the centralized log file. I have tried to make the headers
>> of those modules more verbose, please see attached.
>
> I'm not terribly convinced by this argument TBH.  Perhaps we can have
> PostgresNode be one package, and the logging routines be another
> package, and we create a higher-level package whose @ISA=(PostgresNode,
> LoggingWhatever) and then we move the routines suggested by Noah into
> that new package.  Then the tests use that instead of PostgresNode
> directly.

OK... I have merged TestLib and PostgresNode of the previous patch
into PostgresNode into the way suggested by Noah. TestBase has been
renamed back to TestLib, and includes as well the base test functions
like command_ok.
Regards,
--
Michael

Attachment

20151201_tapcheck_v12.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

01 December 2015, 23:11:35

Michael Paquier wrote:

> OK... I have merged TestLib and PostgresNode of the previous patch
> into PostgresNode into the way suggested by Noah. TestBase has been
> renamed back to TestLib, and includes as well the base test functions
> like command_ok.

Great, thanks.  Here's one more version, hopefully the last one.

- I discovered that not setting PGPORT was causing some of the tests
  that fail (using command_fails) to fail to test what was being tested.
  The problem is that the command was failing with "could not connect to
  server" instead of failing because of trying to cluster a nonexistant
  table, etc.  Unfortunately the only way to verify this is by looking
  at the regress_log_xxx_foobar file.  Not ideal, but not this patch's
  fault.

- I changed the routines moved to PostgresNode so that they are instance
  methods, i.e. $node->poll_until_query; also psql and
  issues_query_like.  The latter also sets "local $PGPORT" so that the
  command is run with the correct port.

- It would be nice to have command_ok and command_fails in PostgresNode
  too; that would remove the need for setting $ENV{PGPORT} but it's
  possible to run commands outside a node too, so we'd need duplicates,
  which would be worse.

- I changed start/stop/restart so that they keep track of the postmaster
  PID; also added a DESTROY sub to PostgresNode that sends SIGQUIT.
  This means that when the test finishes, the server gets an immediate
  stop signal.  We were getting a lot of errors in the server log about
  failing to write to the stats file otherwise, until the node noticed
  that the datadir was gone.

- I removed the @active_nodes array, which is now unnecessary (per
  above).

- I moved the "delete $ENV{PGxxx}" back to a BEGIN block in TestLib.
  I did it because it's possible to write test scripts without
  PostgresNode, and it's not nice to have those pick up the environment.
  This still affects anything using PostgresNode because that one uses
  TestLib.

Finally, I ran perltidy on all the files, which strangely changed stuff
that I didn't expect it to change.  I wonder if this is related to the
perltidy version.  Do you see further changes if you run perltidy on the
patched tree?  This is my version:

$ perltidy --version
This is perltidy, v20140328

Copyright 2000-2014, Steve Hancock

Perltidy is free software and may be copied under the terms of the GNU
General Public License, which is included in the distribution files.

Complete documentation for perltidy can be found using 'man perltidy'
or on the internet at http://perltidy.sourceforge.net.


--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

02 December 2015, 02:25:18

On Wed, Dec 2, 2015 at 8:11 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> - I discovered that not setting PGPORT was causing some of the tests
>   that fail (using command_fails) to fail to test what was being tested.
>   The problem is that the command was failing with "could not connect to
>   server" instead of failing because of trying to cluster a nonexistant
>   table, etc.  Unfortunately the only way to verify this is by looking
>   at the regress_log_xxx_foobar file.  Not ideal, but not this patch's
>   fault.

Nice catch.

> - I changed the routines moved to PostgresNode so that they are instance
>   methods, i.e. $node->poll_until_query; also psql and
>   issues_query_like.  The latter also sets "local $PGPORT" so that the
>   command is run with the correct port.

OK.

> - It would be nice to have command_ok and command_fails in PostgresNode
>   too; that would remove the need for setting $ENV{PGPORT} but it's
>   possible to run commands outside a node too, so we'd need duplicates,
>   which would be worse.

I am fine to let it the way your patch does it. There are already many changes.

> - I removed the @active_nodes array, which is now unnecessary (per
>   above).

So that's basically replaced by @all_nodes.

> - I moved the "delete $ENV{PGxxx}" back to a BEGIN block in TestLib.
>   I did it because it's possible to write test scripts without
>   PostgresNode, and it's not nice to have those pick up the environment.
>   This still affects anything using PostgresNode because that one uses
>   TestLib.

OK.

> Finally, I ran perltidy on all the files, which strangely changed stuff
> that I didn't expect it to change.  I wonder if this is related to the
> perltidy version.  Do you see further changes if you run perltidy on the
> patched tree?

SimpleTee.pm shows some diffs, though it doesn't seem that this patch
should care about that. The rest is showing no diffs. And I have used
perltidy v20140711.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

02 December 2015, 03:01:19

Michael Paquier wrote:
> On Wed, Dec 2, 2015 at 8:11 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > - It would be nice to have command_ok and command_fails in PostgresNode
> >   too; that would remove the need for setting $ENV{PGPORT} but it's
> >   possible to run commands outside a node too, so we'd need duplicates,
> >   which would be worse.
> 
> I am fine to let it the way your patch does it. There are already many changes.

Idea: we can have a bare command_ok exported by TestLib just as
currently, and instance method PostgresNode->command_ok that first sets
local $ENV{PGPORT} and then calls the other one.

> > - I removed the @active_nodes array, which is now unnecessary (per
> >   above).
> 
> So that's basically replaced by @all_nodes.

@all_nodes is only used to look for unused port numbers.

> > Finally, I ran perltidy on all the files, which strangely changed stuff
> > that I didn't expect it to change.  I wonder if this is related to the
> > perltidy version.  Do you see further changes if you run perltidy on the
> > patched tree?
> 
> SimpleTee.pm shows some diffs, though it doesn't seem that this patch
> should care about that. The rest is showing no diffs. And I have used
> perltidy v20140711.

Yes, the patch doesn't modify SimpleTee -- I just used "find" to indent
perl files.  What I don't understand is why doesn't my perltidy run
match what was in master currently.  It should be a no-op ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

02 December 2015, 04:04:40

On Wed, Dec 2, 2015 at 12:01 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Wed, Dec 2, 2015 at 8:11 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
>> > - It would be nice to have command_ok and command_fails in PostgresNode
>> >   too; that would remove the need for setting $ENV{PGPORT} but it's
>> >   possible to run commands outside a node too, so we'd need duplicates,
>> >   which would be worse.
>>
>> I am fine to let it the way your patch does it. There are already many changes.
>
> Idea: we can have a bare command_ok exported by TestLib just as
> currently, and instance method PostgresNode->command_ok that first sets
> local $ENV{PGPORT} and then calls the other one.

Hm. That would be cleaner and make the code more consistent. Now as
TestLib exports command_ok, command_like and command_fails, we would
get redefinition errors when compiling the code if those routines are
not named differently in PostgresNode. If you want to have the names
consistent, then I guess that the only way would be to remove those
routines from the export list of TestLib and call them directly as for
example TestLib::command_ok(). See for example the patch attached that
applies on top on your patch 2 that adds a set of routines in
PostgresNode with a slightly different name.

>> > Finally, I ran perltidy on all the files, which strangely changed stuff
>> > that I didn't expect it to change.  I wonder if this is related to the
>> > perltidy version.  Do you see further changes if you run perltidy on the
>> > patched tree?
>>
>> SimpleTee.pm shows some diffs, though it doesn't seem that this patch
>> should care about that. The rest is showing no diffs. And I have used
>> perltidy v20140711.
>
> Yes, the patch doesn't modify SimpleTee -- I just used "find" to indent
> perl files.  What I don't understand is why doesn't my perltidy run
> match what was in master currently.  It should be a no-op ...

Well I don't get it either :)
I did a run on top of your two patches and saw no differences.
--
Michael

Attachment

20151202_taptests-3.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

02 December 2015, 06:21:01

On Wed, Dec 2, 2015 at 1:04 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Wed, Dec 2, 2015 at 12:01 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Michael Paquier wrote:
>>> On Wed, Dec 2, 2015 at 8:11 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>
>>> > - It would be nice to have command_ok and command_fails in PostgresNode
>>> >   too; that would remove the need for setting $ENV{PGPORT} but it's
>>> >   possible to run commands outside a node too, so we'd need duplicates,
>>> >   which would be worse.
>>>
>>> I am fine to let it the way your patch does it. There are already many changes.
>>
>> Idea: we can have a bare command_ok exported by TestLib just as
>> currently, and instance method PostgresNode->command_ok that first sets
>> local $ENV{PGPORT} and then calls the other one.
>
> Hm. That would be cleaner and make the code more consistent. Now as
> TestLib exports command_ok, command_like and command_fails, we would
> get redefinition errors when compiling the code if those routines are
> not named differently in PostgresNode. If you want to have the names
> consistent, then I guess that the only way would be to remove those
> routines from the export list of TestLib and call them directly as for
> example TestLib::command_ok(). See for example the patch attached that
> applies on top on your patch 2 that adds a set of routines in
> PostgresNode with a slightly different name.

Well, Alvaro has whispered me a more elegant method by using TestLib()
to only import a portion of the routines and avoid the redefinition
errors. Hence, patch 0001 attached creates equivalents of command_*
for PostgresNode and tests use it without setting PGPORT. Patch 0002
is a run of perltidy on the whole.
--
Michael

Attachment

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

02 December 2015, 07:03:20

On Tue, Dec 01, 2015 at 08:11:21PM -0300, Alvaro Herrera wrote:
> Finally, I ran perltidy on all the files, which strangely changed stuff
> that I didn't expect it to change.  I wonder if this is related to the
> perltidy version.

The last pgindent run (commit 807b9e0) used perltidy v20090616, and perltidy
behavior has changed slightly over time.  Install that version to do your own
perltidy runs.

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

02 December 2015, 19:34:04

Noah Misch wrote:
> On Tue, Dec 01, 2015 at 08:11:21PM -0300, Alvaro Herrera wrote:
> > Finally, I ran perltidy on all the files, which strangely changed stuff
> > that I didn't expect it to change.  I wonder if this is related to the
> > perltidy version.
> 
> The last pgindent run (commit 807b9e0) used perltidy v20090616, and perltidy
> behavior has changed slightly over time.  Install that version to do your own
> perltidy runs.

I tried that version, but it seems to emit the same.  How did you figure
that that was the version used, anyway?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

02 December 2015, 21:59:21

Michael Paquier wrote:

> Well, Alvaro has whispered me a more elegant method by using TestLib()
> to only import a portion of the routines and avoid the redefinition
> errors. Hence, patch 0001 attached creates equivalents of command_*
> for PostgresNode and tests use it without setting PGPORT. Patch 0002
> is a run of perltidy on the whole.

It seemed better to me to have the import list be empty, i.e. "use
TestLib ()" and then qualify the routine names inside PostgresNode,
instead of having to list the names of the routines to import, so I
pushed it that way after running the tests a few more times.

(Another option would be to have those routines be in EXPORT_OK instead
of EXPORT, but then every other user of TestLib would have to
explicitely declare that it wants those routines imported.  Maybe this
is a good change; if anyone wants to push for that, patches welcome.)

I didn't push the changed for config_default you requested a few
messages upthread; it's not clear to me how setting it to undef affects
the whole thing.  If setting it to undef makes the MSVC toolchain run
the tap tests in the default config, then I can do it; let's be clear
about what branch to backpatch this to.  Also the "1;" at the end of
RewindTest.

Thanks for the patch.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

03 December 2015, 06:44:54

On Thu, Dec 3, 2015 at 6:59 AM, Alvaro Herrera wrote:
> I didn't push the changed for config_default you requested a few
> messages upthread; it's not clear to me how setting it to undef affects
> the whole thing.  If setting it to undef makes the MSVC toolchain run
> the tap tests in the default config, then I can do it; let's be clear
> about what branch to backpatch this to.  Also the "1;" at the end of
> RewindTest.


Setting it to undef will prevent the tests to run, per vcregress.pl:
    die "Tap tests not enabled in configuration"
      unless $config->{tap_tests};
Also, setting it to undef will match the existing behavior on
platforms where ./configure is used because the switch
--enable-tap-tests needs to be used there. And I would believe that in
most cases Windows environments are not going to have IPC::Run
deployed.

I have also rebased the recovery test suite as the attached, using the
infrastructure that has been committed lately.
Regards,
--
Michael

Attachment

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

03 December 2015, 19:03:20

On Thu, Dec 3, 2015 at 6:59 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I didn't push the changes for config_default you requested a few
> messages upthread; it's not clear to me how setting it to undef affects
> the whole thing.  If setting it to undef makes the MSVC toolchain run
> the tap tests in the default config, then I can do it; let's be clear
> about what branch to backpatch this to.  Also the "1;" at the end of
> RewindTest.

Setting it to undef will prevent the tests to run, per vcregress.pl:
sub tap_check
{
    die "Tap tests not enabled in configuration"
      unless $config->{tap_tests};
Also, setting it to undef will match the existing behavior on
platforms where ./configure is used because the switch
--enable-tap-tests needs to be used there. And I would believe that in
most cases Windows environments are not going to have IPC::Run
deployed.

I have also rebased the recovery test suite as the attached, using the
infrastructure that has been committed lately.
Regards,
--
Michael

Attachment

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

04 December 2015, 05:32:37

On Thu, Dec 3, 2015 at 3:44 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Dec 3, 2015 at 6:59 AM, Alvaro Herrera wrote:
>> I didn't push the changed for config_default you requested a few
>> messages upthread; it's not clear to me how setting it to undef affects
>> the whole thing.  If setting it to undef makes the MSVC toolchain run
>> the tap tests in the default config, then I can do it; let's be clear
>> about what branch to backpatch this to.  Also the "1;" at the end of
>> RewindTest.
>
>
> Setting it to undef will prevent the tests to run, per vcregress.pl:
>     die "Tap tests not enabled in configuration"
>       unless $config->{tap_tests};
> Also, setting it to undef will match the existing behavior on
> platforms where ./configure is used because the switch
> --enable-tap-tests needs to be used there. And I would believe that in
> most cases Windows environments are not going to have IPC::Run
> deployed.
>
> I have also rebased the recovery test suite as the attached, using the
> infrastructure that has been committed lately.

By the way, if there are no objections, I am going to mark this patch
as committed in the CF app. Putting in the infrastructure is already a
good step forward, and I will add an entry in next CF to track the
patch to add tests for recovery itself. Alvaro, what do you think?
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

04 December 2015, 05:43:29

On Wed, Dec 02, 2015 at 04:33:50PM -0300, Alvaro Herrera wrote:
> Noah Misch wrote:
> > On Tue, Dec 01, 2015 at 08:11:21PM -0300, Alvaro Herrera wrote:
> > > Finally, I ran perltidy on all the files, which strangely changed stuff
> > > that I didn't expect it to change.  I wonder if this is related to the
> > > perltidy version.
> > 
> > The last pgindent run (commit 807b9e0) used perltidy v20090616, and perltidy
> > behavior has changed slightly over time.  Install that version to do your own
> > perltidy runs.
> 
> I tried that version, but it seems to emit the same.
 git checkout 807b9e0 (find src -name \*.pl -o -name \*.pm ) | sort -u | xargs perltidy
--profile=src/tools/pgindent/perltidyrc

perltidy v20090616 leaves the working directory clean, but perltidy v20150815
introduces diffs:
src/backend/catalog/genbki.pl                | 15 ++++++++-------src/bin/pg_basebackup/t/010_pg_basebackup.pl |  2
+-src/tools/msvc/Install.pm                   |  6 +++---src/tools/msvc/Mkvcbuild.pm                  |  2
+-src/tools/msvc/Project.pm                   |  2 +-src/tools/msvc/Solution.pm                   |  5
++---src/tools/msvc/gendef.pl                    |  4 ++--7 files changed, 18 insertions(+), 18 deletions(-)
 

You see a different result?

> How did you figure
> that that was the version used, anyway?

I asked Bruce at one point.

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

04 December 2015, 05:47:39

On Fri, Dec 4, 2015 at 2:43 PM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Dec 02, 2015 at 04:33:50PM -0300, Alvaro Herrera wrote:
>> How did you figure
>> that that was the version used, anyway?
>
> I asked Bruce at one point.

So we are trying to use the same version over the years to keep code
consistent across back-branches? Do you think we should try to use a
newer version instead with each pgindent run? That would induce a
rebasing cost when back-patching, but we cannot stay with the same
version of perltidy forever either...
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

04 December 2015, 06:31:15

On Fri, Dec 04, 2015 at 02:47:36PM +0900, Michael Paquier wrote:
> On Fri, Dec 4, 2015 at 2:43 PM, Noah Misch <noah@leadboat.com> wrote:
> > On Wed, Dec 02, 2015 at 04:33:50PM -0300, Alvaro Herrera wrote:
> >> How did you figure
> >> that that was the version used, anyway?
> >
> > I asked Bruce at one point.
> 
> So we are trying to use the same version over the years to keep code
> consistent across back-branches?

No, I recall no policy discussion concerning this.

> Do you think we should try to use a
> newer version instead with each pgindent run? That would induce a
> rebasing cost when back-patching, but we cannot stay with the same
> version of perltidy forever either...

No.  I expect we'll implicitly change perltidy versions when Bruce upgrades
his OS.  Assuming future perltidy changes affect us not much more than past
changes (six years of perltidy development changed eighteen PostgreSQL source
lines), that's just fine.

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

04 December 2015, 16:12:14

Michael Paquier wrote:

> By the way, if there are no objections, I am going to mark this patch
> as committed in the CF app. Putting in the infrastructure is already a
> good step forward, and I will add an entry in next CF to track the
> patch to add tests for recovery itself. Alvaro, what do you think?

Feel free to do that, but I'm likely to look into it before the next CF
anyway.  The thread about this has been open since 2013, so that doesn't
seem unfair.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

04 December 2015, 16:12:35

Noah Misch wrote:

>   git checkout 807b9e0
>   (find src -name \*.pl -o -name \*.pm ) | sort -u | xargs perltidy --profile=src/tools/pgindent/perltidyrc
> 
> perltidy v20090616 leaves the working directory clean, but perltidy v20150815
> introduces diffs:
> 
>  src/backend/catalog/genbki.pl                | 15 ++++++++-------
>  src/bin/pg_basebackup/t/010_pg_basebackup.pl |  2 +-
>  src/tools/msvc/Install.pm                    |  6 +++---
>  src/tools/msvc/Mkvcbuild.pm                  |  2 +-
>  src/tools/msvc/Project.pm                    |  2 +-
>  src/tools/msvc/Solution.pm                   |  5 ++---
>  src/tools/msvc/gendef.pl                     |  4 ++--
>  7 files changed, 18 insertions(+), 18 deletions(-)
> 
> You see a different result?

Oh, you're right -- on that commit, I get the same results as you with
v20090616 (no changes).  I don't have v20150815; the version packaged by
Debian is v20140328, and Install.pm is not changed by that one (so I
only get 15 lines changed, not 18).

I think my confusion stems from code that was introduced after the last
pgindent run.  I guess I could have tidied the original files prior to
patching.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

05 December 2015, 11:45:00

On Sat, Dec 5, 2015 at 1:11 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>
>> By the way, if there are no objections, I am going to mark this patch
>> as committed in the CF app. Putting in the infrastructure is already a
>> good step forward, and I will add an entry in next CF to track the
>> patch to add tests for recovery itself. Alvaro, what do you think?
>
> Feel free to do that, but I'm likely to look into it before the next CF
> anyway.  The thread about this has been open since 2013, so that doesn't
> seem unfair.

Let's see then. For the time being I have marked this patch as
committed, and created a new entry for the set of tests regarding
standbys:
https://commitfest.postgresql.org/8/438/
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Tom Lane

Date:

05 December 2015, 19:09:45

Andrew Dunstan <andrew@dunslane.net> writes:
> On 11/29/2015 04:28 PM, Noah Misch wrote:
>> Never mutate the filesystem in a BEGIN block, because "perl -c" runs BEGIN
>> blocks.  (Likewise for the BEGIN block this patch adds to TestLib.)

> Yeah, those two lines might belong in an INIT block. "perldoc perlmod" 
> for details.

BTW, if we consider that gospel, why has PostgresNode.pm got this in its
BEGIN block?
# PGHOST is set once and for all through a single series of tests when# this module is loaded.$test_pghost =
$TestLib::windows_os? "127.0.0.1" : TestLib::tempdir_short;$ENV{PGHOST}     = $test_pghost;

On non-Windows machines, the call of tempdir_short will result in
filesystem activity, ie creation of a directory.  Offhand it looks like
all of the activity in this BEGIN block could go to an INIT block instead.

I'm also bemused by why there was any question about this being wrong:
# XXX: Should this use PG_VERSION_NUM?$last_port_assigned = 90600 % 16384 + 49152;

If that's not a hard-coded PG version number then I don't know
what it is.  Maybe it would be better to use random() instead,
but surely this isn't good as-is.
        regards, tom lane

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

07 December 2015, 03:06:10

On Wed, Dec 02, 2015 at 06:59:09PM -0300, Alvaro Herrera wrote:
> It seemed better to me to have the import list be empty, i.e. "use
> TestLib ()" and then qualify the routine names inside PostgresNode,
> instead of having to list the names of the routines to import, so I
> pushed it that way after running the tests a few more times.

I inspected commit 1caef31:

> --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
> +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl

> -like($recovery_conf, qr/^standby_mode = 'on[']$/m, 'recovery.conf sets standby_mode');
> -like($recovery_conf, qr/^primary_conninfo = '.*port=$ENV{PGPORT}.*'$/m, 'recovery.conf sets primary_conninfo');
> -
> -command_ok([ 'pg_basebackup', '-D', "$tempdir/backupxf", '-X', 'fetch' ],
> +like(
> +    $recovery_conf,
> +    qr/^standby_mode = 'on[']$/m,
> +    'recovery.conf sets standby_mode');
> +like(
> +    $recovery_conf,
> +    qr/^primary_conninfo = '.*port=$ENV{PGPORT}.*'$/m,
> +    'recovery.conf sets primary_conninfo');

This now elicits a diagnostic:
 Use of uninitialized value $ENV{"PGPORT"} in regexp compilation at t/010_pg_basebackup.pl line 175, <> line 1.

> --- a/src/bin/pg_controldata/t/001_pg_controldata.pl
> +++ b/src/bin/pg_controldata/t/001_pg_controldata.pl

> -standard_initdb "$tempdir/data";
> -command_like([ 'pg_controldata', "$tempdir/data" ],
> +
> +my $node = get_new_node();
> +$node->init;
> +$node->start;

Before the commit, this test did not start a server.

> --- /dev/null
> +++ b/src/test/perl/PostgresNode.pm

> +sub _update_pid
> +{
> +    my $self = shift;
> +
> +    # If we can open the PID file, read its first line and that's the PID we
> +    # want.  If the file cannot be opened, presumably the server is not
> +    # running; don't be noisy in that case.
> +    open my $pidfile, $self->data_dir . "/postmaster.pid";
> +    if (not defined $pidfile)

$ grep -h 'at /.*line' src/bin/*/tmp_check/log/* |sort|uniq -c     1 cannot remove directory for
/data/nmisch/src/pg/postgresql/src/bin/scripts/tmp_testNNCZ:Directory not empty at
/home/nmisch/sw/cpan/lib/perl5/File/Temp.pmline 784     1 cannot remove directory for
/data/nmisch/src/pg/postgresql/src/bin/scripts/tmp_testNNCZ/pgdata/base/13264:Directory not empty at
/home/nmisch/sw/cpan/lib/perl5/File/Temp.pmline 784     1 cannot remove directory for
/data/nmisch/src/pg/postgresql/src/bin/scripts/tmp_testNNCZ/pgdata/base:Directory not empty at
/home/nmisch/sw/cpan/lib/perl5/File/Temp.pmline 784     1 cannot remove directory for
/data/nmisch/src/pg/postgresql/src/bin/scripts/tmp_testNNCZ/pgdata:Directory not empty at
/home/nmisch/sw/cpan/lib/perl5/File/Temp.pmline 784    28 readline() on closed filehandle $pidfile at
/data/nmisch/src/pg/postgresql/src/bin/pg_rewind/../../../src/test/perl/PostgresNode.pmline 308.    28 Use of
uninitializedvalue in concatenation (.) or string at
/data/nmisch/src/pg/postgresql/src/bin/pg_rewind/../../../src/test/perl/PostgresNode.pmline 309.

$pidfile is always defined.  Test the open() return value.

> +    {
> +        $self->{_pid} = undef;
> +        print "# No postmaster PID\n";
> +        return;
> +    }
> +
> +    $self->{_pid} = <$pidfile>;

chomp() that value to remove its trailing newline.

> +    print "# Postmaster PID is $self->{_pid}\n";
> +    close $pidfile;
> +}

> +        my $devnull = $TestLib::windows_os ? "nul" : "/dev/null";

Unused variable.

> +sub DESTROY
> +{
> +    my $self = shift;
> +    return if not defined $self->{_pid};
> +    print "# signalling QUIT to $self->{_pid}\n";
> +    kill 'QUIT', $self->{_pid};

On Windows, that kill() is the wrong thing.  I suspect "pg_ctl kill" will be
the right thing, but that warrants specific testing.

Postmaster log file names became less informative.  Before the commit:

-rw-------. 1 nmisch nmisch 211265 2015-12-06 22:35:59.931114215 +0000 regress_log_001_basic
-rw-------. 1 nmisch nmisch 268887 2015-12-06 22:36:21.871165951 +0000 regress_log_002_databases
-rw-------. 1 nmisch nmisch 206808 2015-12-06 22:36:41.861213736 +0000 regress_log_003_extrafiles
-rw-------. 1 nmisch nmisch   7464 2015-12-06 22:37:00.371256795 +0000 master.log
-rw-------. 1 nmisch nmisch   6648 2015-12-06 22:37:01.381259211 +0000 standby.log
-rw-------. 1 nmisch nmisch 208561 2015-12-06 22:37:02.381261374 +0000 regress_log_004_pg_xlog_symlink

After:

-rw-------. 1 nmisch nmisch 219581 2015-12-06 23:00:50.504643175 +0000 regress_log_001_basic
-rw-------. 1 nmisch nmisch 276315 2015-12-06 23:01:12.924697085 +0000 regress_log_002_databases
-rw-------. 1 nmisch nmisch 213940 2015-12-06 23:01:33.574747195 +0000 regress_log_003_extrafiles
-rw-------. 1 nmisch nmisch   4114 2015-12-06 23:01:40.914764850 +0000 node_57834.log
-rw-------. 1 nmisch nmisch   6166 2015-12-06 23:01:43.184770615 +0000 node_57833.log
-rw-------. 1 nmisch nmisch   5550 2015-12-06 23:01:52.504792997 +0000 node_57835.log
-rw-------. 1 nmisch nmisch   9494 2015-12-06 23:01:53.514794802 +0000 node_57836.log
-rw-------. 1 nmisch nmisch 216212 2015-12-06 23:01:54.544797739 +0000 regress_log_004_pg_xlog_symlink

Should nodes get a name, so we instead see master_57834.log?

See also the things Tom Lane identified in <27550.1449342569@sss.pgh.pa.us>.

nm

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

07 December 2015, 05:34:50

On Mon, Dec 7, 2015 at 12:06 PM, Noah Misch <noah@leadboat.com> wrote:
> On Wed, Dec 02, 2015 at 06:59:09PM -0300, Alvaro Herrera wrote:
>> --- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
>> +++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
>
>> -like($recovery_conf, qr/^standby_mode = 'on[']$/m, 'recovery.conf sets standby_mode');
>> -like($recovery_conf, qr/^primary_conninfo = '.*port=$ENV{PGPORT}.*'$/m, 'recovery.conf sets primary_conninfo');
>> -
>> -command_ok([ 'pg_basebackup', '-D', "$tempdir/backupxf", '-X', 'fetch' ],
>> +like(
>> +     $recovery_conf,
>> +     qr/^standby_mode = 'on[']$/m,
>> +     'recovery.conf sets standby_mode');
>> +like(
>> +     $recovery_conf,
>> +     qr/^primary_conninfo = '.*port=$ENV{PGPORT}.*'$/m,
>> +     'recovery.conf sets primary_conninfo');
>
> This now elicits a diagnostic:
>
>   Use of uninitialized value $ENV{"PGPORT"} in regexp compilation at t/010_pg_basebackup.pl line 175, <> line 1.

Fixed.

>> --- a/src/bin/pg_controldata/t/001_pg_controldata.pl
>> +++ b/src/bin/pg_controldata/t/001_pg_controldata.pl
>
>> -standard_initdb "$tempdir/data";
>> -command_like([ 'pg_controldata', "$tempdir/data" ],
>> +
>> +my $node = get_new_node();
>> +$node->init;
>> +$node->start;
>
> Before the commit, this test did not start a server.

Fixed.

>> --- /dev/null
>> +++ b/src/test/perl/PostgresNode.pm
>
>> +sub _update_pid
>> +{
>> +     my $self = shift;
>> +
>> +     # If we can open the PID file, read its first line and that's the PID we
>> +     # want.  If the file cannot be opened, presumably the server is not
>> +     # running; don't be noisy in that case.
>> +     open my $pidfile, $self->data_dir . "/postmaster.pid";
>> +     if (not defined $pidfile)
>
> $ grep -h 'at /.*line' src/bin/*/tmp_check/log/* |sort|uniq -c
>       1 cannot remove directory for /data/nmisch/src/pg/postgresql/src/bin/scripts/tmp_testNNCZ: Directory not empty
at/home/nmisch/sw/cpan/lib/perl5/File/Temp.pm line 784 
>       1 cannot remove directory for /data/nmisch/src/pg/postgresql/src/bin/scripts/tmp_testNNCZ/pgdata/base/13264:
Directorynot empty at /home/nmisch/sw/cpan/lib/perl5/File/Temp.pm line 784 
>       1 cannot remove directory for /data/nmisch/src/pg/postgresql/src/bin/scripts/tmp_testNNCZ/pgdata/base:
Directorynot empty at /home/nmisch/sw/cpan/lib/perl5/File/Temp.pm line 784 
>       1 cannot remove directory for /data/nmisch/src/pg/postgresql/src/bin/scripts/tmp_testNNCZ/pgdata: Directory not
emptyat /home/nmisch/sw/cpan/lib/perl5/File/Temp.pm line 784 
>      28 readline() on closed filehandle $pidfile at
/data/nmisch/src/pg/postgresql/src/bin/pg_rewind/../../../src/test/perl/PostgresNode.pmline 308. 
>      28 Use of uninitialized value in concatenation (.) or string at
/data/nmisch/src/pg/postgresql/src/bin/pg_rewind/../../../src/test/perl/PostgresNode.pmline 309. 
>
> $pidfile is always defined.  Test the open() return value.

That's something I have addressed here:
http://www.postgresql.org/message-id/CAB7nPqTOP28Zxv_SXFo2axGJoesfvLLMO6syddAfV0DUvsFMDw@mail.gmail.com
I am including the fix as well here to do a group shot.

>> +     {
>> +             $self->{_pid} = undef;
>> +             print "# No postmaster PID\n";
>> +             return;
>> +     }
>> +
>> +     $self->{_pid} = <$pidfile>;
>
> chomp() that value to remove its trailing newline.

Right.

>> +     print "# Postmaster PID is $self->{_pid}\n";
>> +     close $pidfile;
>> +}
>
>> +             my $devnull = $TestLib::windows_os ? "nul" : "/dev/null";
>
> Unused variable.

Removed.

>> +sub DESTROY
>> +{
>> +     my $self = shift;
>> +     return if not defined $self->{_pid};
>> +     print "# signalling QUIT to $self->{_pid}\n";
>> +     kill 'QUIT', $self->{_pid};
>
> On Windows, that kill() is the wrong thing.  I suspect "pg_ctl kill" will be
> the right thing, but that warrants specific testing.

I don't directly see any limitation with the use of kill on Windows..
http://perldoc.perl.org/functions/kill.html
But indeed using directly pg_ctl kill seems like a better fit for the
PG infrastructure.

> Postmaster log file names became less informative.  Before the commit:
> Should nodes get a name, so we instead see master_57834.log?

I am not sure that this is necessary. There is definitely a limitation
regarding the fact that log files of nodes get overwritten after each
test, hence I would tend with just appending the test name in front of
the node_* files similarly to the other files.

> See also the things Tom Lane identified in <27550.1449342569@sss.pgh.pa.us>.

Yep, I marked this email as something to address when it was sent.

On Sun, Dec 6, 2015 at 4:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, if we consider that gospel, why has PostgresNode.pm got this in its
> BEGIN block?
>
>         # PGHOST is set once and for all through a single series of tests when
>         # this module is loaded.
>         $test_pghost =
>           $TestLib::windows_os ? "127.0.0.1" : TestLib::tempdir_short;
>         $ENV{PGHOST}     = $test_pghost;
>
> On non-Windows machines, the call of tempdir_short will result in
> filesystem activity, ie creation of a directory.  Offhand it looks like
> all of the activity in this BEGIN block could go to an INIT block instead.

OK, the whole block is switched to INIT instead.

> I'm also bemused by why there was any question about this being wrong:
>
>         # XXX: Should this use PG_VERSION_NUM?
>         $last_port_assigned = 90600 % 16384 + 49152;

> If that's not a hard-coded PG version number then I don't know
> what it is.  Maybe it would be better to use random() instead,
> but surely this isn't good as-is.

We would definitely want something within the ephemeral port range, so
we are up to that:
rand() * 16384 + 49152;

All those issues are addressed as per the patch attached.
Regards,
--
Michael

Attachment

20151206_tap_fixes.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

07 December 2015, 22:43:51

Michael Paquier wrote:

> > If that's not a hard-coded PG version number then I don't know
> > what it is.  Maybe it would be better to use random() instead,
> > but surely this isn't good as-is.
> 
> We would definitely want something within the ephemeral port range, so
> we are up to that:
> rand() * 16384 + 49152;

Yes, this seems to produce the correct range.

Thanks Noah and Tom for the review, and thanks Michael for the patch.  I
pushed it.  A slight fix was to change the chomp() call; it was always
returning 1 (number of elements chomped) so it tried to kill init.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

07 December 2015, 22:46:20

Michael Paquier wrote:
> On Thu, Dec 3, 2015 at 6:59 AM, Alvaro Herrera wrote:
> > I didn't push the changed for config_default you requested a few
> > messages upthread; it's not clear to me how setting it to undef affects
> > the whole thing.  If setting it to undef makes the MSVC toolchain run
> > the tap tests in the default config, then I can do it; let's be clear
> > about what branch to backpatch this to.  Also the "1;" at the end of
> > RewindTest.
> 
> Setting it to undef will prevent the tests to run, per vcregress.pl:
>     die "Tap tests not enabled in configuration"
>       unless $config->{tap_tests};
> Also, setting it to undef will match the existing behavior on
> platforms where ./configure is used because the switch
> --enable-tap-tests needs to be used there. And I would believe that in
> most cases Windows environments are not going to have IPC::Run
> deployed.

But if I don't set it to anything, then it will be "initialized" as
undef, so it has the same effect.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Tom Lane

Date:

07 December 2015, 22:54:32

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Michael Paquier wrote:
>> We would definitely want something within the ephemeral port range, so
>> we are up to that:
>> rand() * 16384 + 49152;

> Yes, this seems to produce the correct range.

Speaking of ephemeral port range ... if get_new_node() were to run
past port 65535, which is certainly possible with this new code,
it would fail altogether.  Seems it needs to wrap around properly
within that port range.
        regards, tom lane

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

07 December 2015, 23:08:06

Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Michael Paquier wrote:
> >> We would definitely want something within the ephemeral port range, so
> >> we are up to that:
> >> rand() * 16384 + 49152;
> 
> > Yes, this seems to produce the correct range.
> 
> Speaking of ephemeral port range ... if get_new_node() were to run
> past port 65535, which is certainly possible with this new code,
> it would fail altogether.  Seems it needs to wrap around properly
> within that port range.

Oh, of course.  Pushed fix.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

08 December 2015, 01:27:13

On Tue, Dec 8, 2015 at 7:46 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

Michael Paquier wrote:
> On Thu, Dec 3, 2015 at 6:59 AM, Alvaro Herrera wrote:
> > I didn't push the changed for config_default you requested a few
> > messages upthread; it's not clear to me how setting it to undef affects
> > the whole thing. If setting it to undef makes the MSVC toolchain run
> > the tap tests in the default config, then I can do it; let's be clear
> > about what branch to backpatch this to. Also the "1;" at the end of
> > RewindTest.
>
> Setting it to undef will prevent the tests to run, per vcregress.pl:
> die "Tap tests not enabled in configuration"
> unless $config->{tap_tests};
> Also, setting it to undef will match the existing behavior on
> platforms where ./configure is used because the switch
> --enable-tap-tests needs to be used there. And I would believe that in
> most cases Windows environments are not going to have IPC::Run
> deployed.

But if I don't set it to anything, then it will be "initialized" as
undef, so it has the same effect.

Yes, it will have the same effect. Still it is a problem to not list it in default_config.pl as the other options, no? And that's as well the case with GetFakeConfigure, which should list it I think for consistency with the rest. See the attached for example.

Michael

Attachment

20151208_tap_msvc_fix.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

09 December 2015, 21:47:09

I've been giving RecoveryTest.pm a look. I wonder if we really need that
as a separate package.  My first thought was that we could have another
class that inherits from PostgresNode (say RecoveryNode).  But later it
occured to me that we could have the new functions just be part of
PostgresNode itself directly; so we would have some new PostgresNode
methods:$node->enable_streaming$node->enable_restoring$node->enable_archiving$node->wait (your
RecoveryTest::wait_for_node;better name for this?)
 

and some additional constructors:make_mastermake_stream_standbymake_archive_standby

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

10 December 2015, 02:55:01

On Mon, Dec 07, 2015 at 02:34:39PM +0900, Michael Paquier wrote:
> On Mon, Dec 7, 2015 at 12:06 PM, Noah Misch <noah@leadboat.com> wrote:
> > On Wed, Dec 02, 2015 at 06:59:09PM -0300, Alvaro Herrera wrote:
> >> +sub DESTROY
> >> +{
> >> +     my $self = shift;
> >> +     return if not defined $self->{_pid};
> >> +     print "# signalling QUIT to $self->{_pid}\n";
> >> +     kill 'QUIT', $self->{_pid};
> >
> > On Windows, that kill() is the wrong thing.  I suspect "pg_ctl kill" will be
> > the right thing, but that warrants specific testing.
> 
> I don't directly see any limitation with the use of kill on Windows..
> http://perldoc.perl.org/functions/kill.html
> But indeed using directly pg_ctl kill seems like a better fit for the
> PG infrastructure.

From http://perldoc.perl.org/perlwin32.html, "Using signals under this port
should currently be considered unsupported."  Windows applications cannot
handle SIGQUIT: https://msdn.microsoft.com/en-us/library/xdkz3x12.aspx.  The
PostgreSQL backend does not generate or expect Windows signals; see its
signal.c emulation facility.

> > Postmaster log file names became less informative.  Before the commit:
> > Should nodes get a name, so we instead see master_57834.log?
> 
> I am not sure that this is necessary.

In general, you'd need to cross-reference the main log file to determine which
postmaster log corresponds to which action within the test.  I did plenty of
"grep $PATTERN src/bin/pg_rewind/tmp_check/log/master.log" while debugging
that test.  I'd like to be able to use /*master*.log, not rely on timestamps
or on scraping regress_log_002_databases to determine which logs are master
logs.  Feel free to skip this point if I'm the only one minding, though.

> There is definitely a limitation
> regarding the fact that log files of nodes get overwritten after each
> test, hence I would tend with just appending the test name in front of
> the node_* files similarly to the other files.

They got appended, not overwritten.  I like how you changed that to not
happen, but it doesn't address what I was reporting.

nm

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

10 December 2015, 05:52:56

On Thu, Dec 10, 2015 at 6:46 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I've been giving RecoveryTest.pm a look. I wonder if we really need that
> as a separate package.  My first thought was that we could have another
> class that inherits from PostgresNode (say RecoveryNode).  But later it
> occured to me that we could have the new functions just be part of
> PostgresNode itself directly; so we would have some new PostgresNode
> methods:
>         $node->enable_streaming
>         $node->enable_restoring
>         $node->enable_archiving

Sure.

>         $node->wait (your RecoveryTest::wait_for_node; better name for this?)

wait_for_access?

> and some additional constructors:
>         make_master
>         make_stream_standby
>         make_archive_standby

I have done that a little bit differently. Those are completely
remove, then init() and init_from_backup() are extended with a new set
of parameters to enable archive, streaming or restore on a node.

Which gives the patch attached.
--
Michael

Attachment

20151210_recovery_suite.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

10 December 2015, 20:36:04

Noah Misch wrote:
> On Mon, Dec 07, 2015 at 02:34:39PM +0900, Michael Paquier wrote:

> > I don't directly see any limitation with the use of kill on Windows..
> > http://perldoc.perl.org/functions/kill.html
> > But indeed using directly pg_ctl kill seems like a better fit for the
> > PG infrastructure.
> 
> From http://perldoc.perl.org/perlwin32.html, "Using signals under this port
> should currently be considered unsupported."  Windows applications cannot
> handle SIGQUIT: https://msdn.microsoft.com/en-us/library/xdkz3x12.aspx.  The
> PostgreSQL backend does not generate or expect Windows signals; see its
> signal.c emulation facility.

Makes sense.  What we're doing now is what you suggested, so we should
be fine.

> > > Postmaster log file names became less informative.  Before the commit:
> > > Should nodes get a name, so we instead see master_57834.log?
> > 
> > I am not sure that this is necessary.
> 
> In general, you'd need to cross-reference the main log file to determine which
> postmaster log corresponds to which action within the test.  I did plenty of
> "grep $PATTERN src/bin/pg_rewind/tmp_check/log/master.log" while debugging
> that test.  I'd like to be able to use /*master*.log, not rely on timestamps
> or on scraping regress_log_002_databases to determine which logs are master
> logs.  Feel free to skip this point if I'm the only one minding, though.

Since we now have the node name in the log file name, perhaps we no
longer need the port number in there.  In fact, I find having the file
name change on every run (based on the port number) is slightly
annoying.  I vote we change it back to using the node name without the
port number.  (Also, some PostgresNode messages refer to the instance by
datadir and port number, which is unnecessary: it would be clearer to
use the name instead.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

11 December 2015, 06:44:18

On Fri, Dec 11, 2015 at 5:35 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Noah Misch wrote:
>> On Mon, Dec 07, 2015 at 02:34:39PM +0900, Michael Paquier wrote:
>> > > Postmaster log file names became less informative.  Before the commit:
>> > > Should nodes get a name, so we instead see master_57834.log?
>> >
>> > I am not sure that this is necessary.
>>
>> In general, you'd need to cross-reference the main log file to determine which
>> postmaster log corresponds to which action within the test.  I did plenty of
>> "grep $PATTERN src/bin/pg_rewind/tmp_check/log/master.log" while debugging
>> that test.  I'd like to be able to use /*master*.log, not rely on timestamps
>> or on scraping regress_log_002_databases to determine which logs are master
>> logs.  Feel free to skip this point if I'm the only one minding, though.
>
> Since we now have the node name in the log file name, perhaps we no
> longer need the port number in there

There is no node name in the log file name as of now, they are built
using the port number, and the information of a node is dumped into
the central log file when created (see dump_info).

> In fact, I find having the file
> name change on every run (based on the port number) is slightly
> annoying.  I vote we change it back to using the node name without the
> port number.  (Also, some PostgresNode messages refer to the instance by
> datadir and port number, which is unnecessary: it would be clearer to
> use the name instead.)

OK, so... What we have now as log file for a specific node is that:
${testname}_node_${port}.log
which is equivalent to that:
${testname}_${applname}.log

I guess that to complete your idea we could allow PostgresNode to get
a custom name for its log file through an optional parameter like
logfile => 'myname' or similar. And if nothing is defined, process
falls back to applname. So this would give the following:
${testname}_${logfile}.log

It seems that we had better keep the test name as a prefix of the log
file name though, to avoid an overlap with any other test in the same
series. Thoughts?
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

11 December 2015, 11:48:47

Michael Paquier wrote:
> On Fri, Dec 11, 2015 at 5:35 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:

> > Since we now have the node name in the log file name, perhaps we no
> > longer need the port number in there
> 
> There is no node name in the log file name as of now, they are built
> using the port number, and the information of a node is dumped into
> the central log file when created (see dump_info).

Yeah, I realized this after posting.  What I thought was the node name,
based on some of the files I had laying around, was actually the test
name.

> I guess that to complete your idea we could allow PostgresNode to get
> a custom name for its log file through an optional parameter like
> logfile => 'myname' or similar. And if nothing is defined, process
> falls back to applname. So this would give the following:
> ${testname}_${logfile}.log

Sure. I don't think we should the name only for the log file, though,
but also for things like the "## " informative messages we print here
and there.  That would make the log file simpler to follow.  Also, I'm
not sure about having it be optional.  (TBH I'm not sure about applname
either; why do we keep that one?)

> It seems that we had better keep the test name as a prefix of the log
> file name though, to avoid an overlap with any other test in the same
> series. Thoughts?

Yes, agreed on that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

11 December 2015, 12:34:45

On Fri, Dec 11, 2015 at 8:48 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>> On Fri, Dec 11, 2015 at 5:35 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>> I guess that to complete your idea we could allow PostgresNode to get
>> a custom name for its log file through an optional parameter like
>> logfile => 'myname' or similar. And if nothing is defined, process
>> falls back to applname. So this would give the following:
>> ${testname}_${logfile}.log
>
> Sure. I don't think we should the name only for the log file, though,
> but also for things like the "## " informative messages we print here
> and there.  That would make the log file simpler to follow.  Also, I'm
> not sure about having it be optional.  (TBH I'm not sure about applname
> either; why do we keep that one?)

OK, so let's do this: the node name is a mandatory argument of
get_new_node, which is passed to "new PostgresNode" like the port and
the host, and it is then used in the log file name as well as in the
information messages you are mentioning. That's a patch simple enough.
Are you fine with this approach?

Regarding the application name, I still think it is useful to have it
though. pg_rewind should actually use it, and the other patch adding
the recovery routines will use it.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

12 December 2015, 02:38:00

On Fri, Dec 11, 2015 at 09:34:34PM +0900, Michael Paquier wrote:
> On Fri, Dec 11, 2015 at 8:48 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Michael Paquier wrote:
> >> On Fri, Dec 11, 2015 at 5:35 AM, Alvaro Herrera
> >> <alvherre@2ndquadrant.com> wrote:
> >> I guess that to complete your idea we could allow PostgresNode to get
> >> a custom name for its log file through an optional parameter like
> >> logfile => 'myname' or similar. And if nothing is defined, process
> >> falls back to applname. So this would give the following:
> >> ${testname}_${logfile}.log
> >
> > Sure. I don't think we should the name only for the log file, though,
> > but also for things like the "## " informative messages we print here
> > and there.  That would make the log file simpler to follow.  Also, I'm
> > not sure about having it be optional.  (TBH I'm not sure about applname
> > either; why do we keep that one?)
> 
> OK, so let's do this: the node name is a mandatory argument of
> get_new_node, which is passed to "new PostgresNode" like the port and
> the host, and it is then used in the log file name as well as in the
> information messages you are mentioning. That's a patch simple enough.
> Are you fine with this approach?

Sounds reasonable so far.

> Regarding the application name, I still think it is useful to have it
> though. pg_rewind should actually use it, and the other patch adding
> the recovery routines will use it.

Using the application_name connection parameter is fine, but I can't think of
a reason to set it to "node_".$node->port instead of $node->name.  And I can't
think of a use for the $node->applname field once you have $node->name.  What
use case would benefit?

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

12 December 2015, 11:29:23

On Sat, Dec 12, 2015 at 11:37 AM, Noah Misch <noah@leadboat.com> wrote:
> On Fri, Dec 11, 2015 at 09:34:34PM +0900, Michael Paquier wrote:
>> On Fri, Dec 11, 2015 at 8:48 PM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>> > Michael Paquier wrote:
>> >> On Fri, Dec 11, 2015 at 5:35 AM, Alvaro Herrera
>> >> <alvherre@2ndquadrant.com> wrote:
>> >> I guess that to complete your idea we could allow PostgresNode to get
>> >> a custom name for its log file through an optional parameter like
>> >> logfile => 'myname' or similar. And if nothing is defined, process
>> >> falls back to applname. So this would give the following:
>> >> ${testname}_${logfile}.log
>> >
>> > Sure. I don't think we should the name only for the log file, though,
>> > but also for things like the "## " informative messages we print here
>> > and there.  That would make the log file simpler to follow.  Also, I'm
>> > not sure about having it be optional.  (TBH I'm not sure about applname
>> > either; why do we keep that one?)
>>
>> OK, so let's do this: the node name is a mandatory argument of
>> get_new_node, which is passed to "new PostgresNode" like the port and
>> the host, and it is then used in the log file name as well as in the
>> information messages you are mentioning. That's a patch simple enough.
>> Are you fine with this approach?
>
> Sounds reasonable so far.

OK, done so.

>> Regarding the application name, I still think it is useful to have it
>> though. pg_rewind should actually use it, and the other patch adding
>> the recovery routines will use it.
>
> Using the application_name connection parameter is fine, but I can't think of
> a reason to set it to "node_".$node->port instead of $node->name.  And I can't
> think of a use for the $node->applname field once you have $node->name.  What
> use case would benefit?

I have the applname stuff, and updated the log messages to use the
node name for clarity.

The patch to address those points is attached.
Regards,
--
Michael

Attachment

20151212_tap_node_name.patch

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

21 December 2015, 07:45:29

On Sat, Dec 12, 2015 at 8:29 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Dec 12, 2015 at 11:37 AM, Noah Misch <noah@leadboat.com> wrote:
>> On Fri, Dec 11, 2015 at 09:34:34PM +0900, Michael Paquier wrote:
>>> On Fri, Dec 11, 2015 at 8:48 PM, Alvaro Herrera
>>> <alvherre@2ndquadrant.com> wrote:
>>> > Michael Paquier wrote:
>>> >> On Fri, Dec 11, 2015 at 5:35 AM, Alvaro Herrera
>>> >> <alvherre@2ndquadrant.com> wrote:
>>> >> I guess that to complete your idea we could allow PostgresNode to get
>>> >> a custom name for its log file through an optional parameter like
>>> >> logfile => 'myname' or similar. And if nothing is defined, process
>>> >> falls back to applname. So this would give the following:
>>> >> ${testname}_${logfile}.log
>>> >
>>> > Sure. I don't think we should the name only for the log file, though,
>>> > but also for things like the "## " informative messages we print here
>>> > and there.  That would make the log file simpler to follow.  Also, I'm
>>> > not sure about having it be optional.  (TBH I'm not sure about applname
>>> > either; why do we keep that one?)
>>>
>>> OK, so let's do this: the node name is a mandatory argument of
>>> get_new_node, which is passed to "new PostgresNode" like the port and
>>> the host, and it is then used in the log file name as well as in the
>>> information messages you are mentioning. That's a patch simple enough.
>>> Are you fine with this approach?
>>
>> Sounds reasonable so far.
>
> OK, done so.
>
>>> Regarding the application name, I still think it is useful to have it
>>> though. pg_rewind should actually use it, and the other patch adding
>>> the recovery routines will use it.
>>
>> Using the application_name connection parameter is fine, but I can't think of
>> a reason to set it to "node_".$node->port instead of $node->name.  And I can't
>> think of a use for the $node->applname field once you have $node->name.  What
>> use case would benefit?
>
> I have the applname stuff, and updated the log messages to use the
> node name for clarity.
>
> The patch to address those points is attached.

As this thread is stalling a bit, please find attached a series of
patch gathering all the pending issues for this thread:
- 0001, fix config_default.pl for MSVC builds to take into account TAP tests
- 0002, append a node name in get_new_node (per Noah's request)
- 0003, the actual recovery test suite
Hopefully this facilitates future reviews.
Regards,
--
Michael

Attachment

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

22 January 2016, 06:17:59

On Mon, Dec 21, 2015 at 4:45 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> As this thread is stalling a bit, please find attached a series of
> patch gathering all the pending issues for this thread:
> - 0001, fix config_default.pl for MSVC builds to take into account TAP tests
> - 0002, append a node name in get_new_node (per Noah's request)
> - 0003, the actual recovery test suite
> Hopefully this facilitates future reviews.

Patch 2 has been pushed as c8642d9 (thanks Alvaro). The remaining two
patches still apply and pass cleanly.
-- 
Michael

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Stas Kelvich

Date:

04 February 2016, 08:23:50

Hi.

I’ve looked over proposed patch and migrated my shell tests scripts that i’ve used for testing twophase commits on
master/slaveto this test framework. Everything looks mature, and I didn’t encountered any problems with writing new
testsusing this infrastructure. 

From my point of view I don’t see any problems to commit this patches in their current state.

Also some things that came into mind about test suite:

0) There are several routines that does actual checking, like is/command_ok/command_fails. I think it will be very
handyto have wrappers psql_ok/psql_fails that calls psql through the command_ok/fails. 

1) Better to raise more meaningful error when IPC::Run is absend.

2) --enable-tap-tests deserves mention in test/recovery/README and more obvious error message when one trying to run
makecheck in test/recovery without --enable-tap-tests. 

3) Is it hard to give ability to run TAP tests in extensions?

4) It will be handy if make check will write path to log files in case of failed test.

5) psql() accepts database name as a first argument, but everywhere in tests it is ‘postgres’. Isn’t it simpler to
storedbname in connstr, and have separate function to change database? 

6) Clean logs on prove restart? Clean up tmp installations?

7) Make check sets PGPORT PG_REGRESS for prove. Is it necessary?

> On 22 Jan 2016, at 09:17, Michael Paquier <michael.paquier@gmail.com> wrote:
>
> On Mon, Dec 21, 2015 at 4:45 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> As this thread is stalling a bit, please find attached a series of
>> patch gathering all the pending issues for this thread:
>> - 0001, fix config_default.pl for MSVC builds to take into account TAP tests
>> - 0002, append a node name in get_new_node (per Noah's request)
>> - 0003, the actual recovery test suite
>> Hopefully this facilitates future reviews.
>
> Patch 2 has been pushed as c8642d9 (thanks Alvaro). The remaining two
> patches still apply and pass cleanly.
> --
> Michael
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Kyotaro HORIGUCHI

Date:

04 February 2016, 09:08:06

Hello, I'm studying this.

Two hunks in 0003 needed a fix but the other part applied cleanly
on master.

At Fri, 22 Jan 2016 15:17:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTTAtVCEXAoyMtF4Xu9g=mXY4cjnP=+hy7jgYfnFzM=JA@mail.gmail.com>
> On Mon, Dec 21, 2015 at 4:45 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > As this thread is stalling a bit, please find attached a series of
> > patch gathering all the pending issues for this thread:
> > - 0001, fix config_default.pl for MSVC builds to take into account TAP tests
> > - 0002, append a node name in get_new_node (per Noah's request)
> > - 0003, the actual recovery test suite
> > Hopefully this facilitates future reviews.
> 
> Patch 2 has been pushed as c8642d9 (thanks Alvaro). The remaining two
> patches still apply and pass cleanly.

The TAP test framework doesn't remove existing temporary
directories when a test script suite (or a prove?) starts, and it
in turn removes all temprorary directories it has made even if
ended with fairures. It would be sometimes inconvenient to find
the cause of the failures and inconsistent with the behavior of
the ordinary(?)  make check, as far as my understanding goes.

tmp_check is left remained but it would be ok to preserve logs,
which is located in tmp_check differently than the ordinary
regressions.

One annoyance is the name of data directories is totally
meaningless. We cannot investigate them even if it is left
behind.

Addition to them, maybe it is useful that a test script can get
stderr content from PostgresNode->psql(). Setting
client_min_messages lower can give a plenty of useful information
about how server is working.


So, I'd like to propose four (or five) changes to this harness.
- prove_check to remove all in tmp_check
- TestLib to preserve temporary directories/files if the current  test fails.
- PostgresNode::get_new_node to create data directory with  meaningful basenames.
- PostgresNode::psql to return a list of ($stdout, $stderr) if  requested. (The previous behavior is not changed)
- (recovery/t/00x_* gives test number to node name)

As a POC, the attached diff will appliy on the 0001 and (fixed)
0003 patches.

It might be good to give test number to the name of temp dirs by
any automated way, but it is not included in it.


Opinions? Thoughts?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 4602e3e..75ddcaf 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -51,7 +51,7 @@ sub new    my $self   = {        _port     => $pgport,        _host     => $pghost,
-        _basedir  => TestLib::tempdir,
+        _basedir  => TestLib::tempdir($name),        _name     => $name,        _logfile  =>
"$TestLib::log_path/${testname}_${name}.log"};
 
@@ -513,10 +513,16 @@ sub psql        print "#### Begin standard error\n";        print $stderr;        print "#### End
standarderror\n";
 
+        if (wantarray)
+        {
+            chomp $stderr;
+            $stderr =~ s/\r//g if $Config{osname} eq 'msys';
+        }    }    chomp $stdout;    $stdout =~ s/\r//g if $Config{osname} eq 'msys';
-    return $stdout;
+
+    return wantarray ? ($stdout, $stderr) : $stdout;}# Run a query once a second, until it returns 't' (i.e. SQL
booleantrue).
 
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 3d11cbb..841dc06 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -107,13 +107,24 @@ INIT    autoflush TESTLOG 1;}
+END
+{
+    # Preserve temporary directory for this test if failure
+    if (!Test::More->builder->is_passing)
+    {
+        $File::Temp::KEEP_ALL = 1;
+    }
+}
+## Helper functions#sub tempdir{
+    my ($prefix) = @_;
+    $prefix = "tmp_test" if (!$prefix);    return File::Temp::tempdir(
-        'tmp_testXXXX',
+        $prefix.'_XXXX',        DIR => $tmp_check,        CLEANUP => 1);}
diff --git a/src/test/recovery/t/001_stream_rep.pl b/src/test/recovery/t/001_stream_rep.pl
index 3ed9be3..8a00d00 100644
--- a/src/test/recovery/t/001_stream_rep.pl
+++ b/src/test/recovery/t/001_stream_rep.pl
@@ -6,16 +6,18 @@ use TestLib;use Test::More tests => 4;# Initialize master node
-my $node_master = get_new_node('master');
+my $node_master = get_new_node('001_master');$node_master->init(allows_streaming => 1);$node_master->start;my
$backup_name= 'my_backup';
 
+$File::Temp::KEEP_ALL = 0;
+# Take backup$node_master->backup($backup_name);# Create streaming standby linking to master
-my $node_standby_1 = get_new_node('standby_1');
+my $node_standby_1 = get_new_node('001_standby_1');$node_standby_1->init_from_backup($node_master, $backup_name,
                          has_streaming => 1);$node_standby_1->start;
 
@@ -25,7 +27,7 @@ $node_standby_1->start;$node_standby_1->backup($backup_name);# Create second standby node linking to
standby1
 
-my $node_standby_2 = get_new_node('standby_2');
+my $node_standby_2 = get_new_node('001_standby_2');$node_standby_2->init_from_backup($node_standby_1, $backup_name,
                             has_streaming => 1);$node_standby_2->start;
 
@@ -43,11 +45,11 @@ $caughtup_query = "SELECT pg_last_xlog_replay_location() = write_location FROM
p$node_standby_1->poll_query_until('postgres',$caughtup_query)    or die "Timed out while waiting for standby 2 to
catchup";
 
-my $result = $node_standby_1->psql('postgres', "SELECT count(*) FROM tab_int");
+my $result = $node_standby_1->psql('postgres', "set client_min_messages to debug5;SELECT count(*) FROM tab_int");print
"standby1: $result\n";
 
-is($result, qq(1002), 'check streamed content on standby 1');
+isnt($result[0], qq(1002), 'check streamed content on standby 1');
-$result =  $node_standby_2->psql('postgres', "SELECT count(*) FROM tab_int");
+my $result =  $node_standby_2->psql('postgres', "SELECT count(*) FROM tab_int");print "standby 2:
$result\n";is($result,qq(1002), 'check streamed content on standby 2');
 
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index 930125c..bb620c6 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -7,7 +7,7 @@ use Test::More tests => 1;use File::Copy;# Initialize master node, doing archives
-my $node_master = get_new_node('master');
+my $node_master = get_new_node('002_master');$node_master->init(has_archiving => 1,                   allows_streaming
=>1);my $backup_name = 'my_backup';
 
@@ -19,7 +19,7 @@ $node_master->start;$node_master->backup($backup_name);# Initialize standby node from backup,
fetchingWAL from archives
 
-my $node_standby = get_new_node('standby');
+my $node_standby = get_new_node('002_standby');$node_standby->init_from_backup($node_master, $backup_name,
                  has_restoring => 1);$node_standby->append_conf('postgresql.conf', qq(
 
diff --git a/src/test/recovery/t/003_recovery_targets.pl b/src/test/recovery/t/003_recovery_targets.pl
index 293603a..4d59277 100644
--- a/src/test/recovery/t/003_recovery_targets.pl
+++ b/src/test/recovery/t/003_recovery_targets.pl
@@ -16,7 +16,7 @@ sub test_recovery_standby    my $num_rows = shift;    my $until_lsn = shift;
-    my $node_standby = get_new_node($node_name);
+    my $node_standby = get_new_node("003_".$node_name);    $node_standby->init_from_backup($node_master, 'my_backup',
                                 has_restoring => 1);
 
@@ -43,7 +43,7 @@ sub test_recovery_standby}# Initialize master node
-my $node_master = get_new_node('master');
+my $node_master = get_new_node('003_master');$node_master->init(has_archiving => 1, allows_streaming => 1);# Start it
diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index c58c602..0a2362b 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -11,7 +11,7 @@ use Test::More tests => 1;$ENV{PGDATABASE} = 'postgres';# Initialize master node
-my $node_master = get_new_node('master');
+my $node_master = get_new_node('004_master');$node_master->init(allows_streaming => 1);$node_master->start;
@@ -20,11 +20,11 @@ my $backup_name = 'my_backup';$node_master->backup($backup_name);# Create two standbys linking to
it
-my $node_standby_1 = get_new_node('standby_1');
+my $node_standby_1 = get_new_node('004_standby_1');$node_standby_1->init_from_backup($node_master, $backup_name,
                          has_streaming => 1);$node_standby_1->start;
 
-my $node_standby_2 = get_new_node('standby_2');
+my $node_standby_2 = get_new_node('004_standby_2');$node_standby_2->init_from_backup($node_master, $backup_name,
                          has_streaming => 1);$node_standby_2->start;
 
diff --git a/src/test/recovery/t/005_replay_delay.pl b/src/test/recovery/t/005_replay_delay.pl
index 14d9b29..ae209e4 100644
--- a/src/test/recovery/t/005_replay_delay.pl
+++ b/src/test/recovery/t/005_replay_delay.pl
@@ -6,7 +6,7 @@ use TestLib;use Test::More tests => 2;# Initialize master node
-my $node_master = get_new_node();
+my $node_master = get_new_node('005_master');$node_master->init(allows_streaming => 1);$node_master->start;
@@ -18,7 +18,7 @@ my $backup_name = 'my_backup';$node_master->backup($backup_name);# Create streaming standby from
backup
-my $node_standby = get_new_node();
+my $node_standby = get_new_node('005_standby');$node_standby->init_from_backup($node_master, $backup_name,
                  has_streaming => 1);$node_standby->append_conf('recovery.conf', qq(

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

04 February 2016, 09:59:13

On Thu, Feb 4, 2016 at 5:23 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

(Please do not top-post, this breaks the thread flow.)

> I’ve looked over proposed patch and migrated my shell tests scripts that i’ve used for testing twophase commits on
master/slaveto this test framework. Everything looks mature, and I didn’t encountered any problems with writing new
testsusing this infrastructure. 
> From my point of view I don’t see any problems to commit this patches in their current state.

Thanks for the review!

> 0) There are several routines that does actual checking, like is/command_ok/command_fails. I think it will be very
handyto have wrappers psql_ok/psql_fails that calls psql through the command_ok/fails. 

Do you have a test case in mind for it?

> 1) Better to raise more meaningful error when IPC::Run is absent.

This has been discussed before, and as far as I recall the current
behavior has been concluded as being fine. That's where
--enable-tap-tests becomes meaningful.

> 2) --enable-tap-tests deserves mention in test/recovery/README and more obvious error message when one trying to run
makecheck in test/recovery without --enable-tap-tests. 

When running without --enable-tap-tests from src/test/recovery you
would get the following error per how prove_check is defined:
"TAP tests not enabled"

> 3) Is it hard to give ability to run TAP tests in extensions?

Not really. You would need to enforce a check rule or similar. For the
recovery test suite I have mapped the check rule with prove_check.

> 4) It will be handy if make check will write path to log files in case of failed test.

Hm, perhaps. The log files are hardcoded in log/, so it is not like we
don't know it. That's an argument for the main TAP suite though, not
really this series of patch.

> 5) psql() accepts database name as a first argument, but everywhere in tests it is ‘postgres’. Isn’t it simpler to
storedbname in connstr, and have separate function to change database? 
> 6) Clean logs on prove restart? Clean up tmp installations?

Those are issues proper to the main TAP infrastructure, though I agree
that we could improve things here, particularly for temporary
installations that get automatically... Hm... Cleaned up should a test
failure happen?

> 7) Make check sets PGPORT PG_REGRESS for prove. Is it necessary?

No, that's not needed (I think I noticed that at some point) and
that's a bug. We could live without setting it.
--
Michael

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Victor Wagner

Date:

04 February 2016, 13:44:29

On Thu, 4 Feb 2016 12:59:03 +0300
Michael Paquier <michael.paquier@gmail.com> wrote:
for it?
> 
> > 1) Better to raise more meaningful error when IPC::Run is absent.  
> 
> This has been discussed before, and as far as I recall the current
> behavior has been concluded as being fine. That's where
> --enable-tap-tests becomes meaningful.

Really, it is not so hard to add configure checks for perl modules.
And we need to test not only for IPC::Run, but for Test::More too,
because some Linux distributions put modules which come with perl into
separate package.

The only problem that most m4 files with tests for perl modules, which
can be found in the Internet, have GPL license, so someone have either
to write his own and publish under PostgreSQL license or contact
author of one of them and ask to publish it under PostgreSQL license.

First seems to be much easier.

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

04 February 2016, 15:33:36

On Thu, Feb 4, 2016 at 4:43 PM, Victor Wagner <vitus@wagner.pp.ru> wrote:
> On Thu, 4 Feb 2016 12:59:03 +0300
> Michael Paquier <michael.paquier@gmail.com> wrote:
>> > 1) Better to raise more meaningful error when IPC::Run is absent.
>>
>> This has been discussed before, and as far as I recall the current
>> behavior has been concluded as being fine. That's where
>> --enable-tap-tests becomes meaningful.
>
> Really, it is not so hard to add configure checks for perl modules.
> And we need to test not only for IPC::Run, but for Test::More too,
> because some Linux distributions put modules which come with perl into
> separate package.

The last time we discussed about that on this list we concluded that
it was not really necessary to have such checks, for one it makes the
code more simple, and because this is leveraged by the presence of
--enable-tap-tests, tests which can get actually costly with
check-world. But this is digressing the subject of this thread, which
deals with the fact of having recovery tests integrated in core...
-- 
Michael

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Victor Wagner

Date:

04 February 2016, 18:13:43

On Thu, 4 Feb 2016 18:33:27 +0300
Michael Paquier <michael.paquier@gmail.com> wrote:


> > Really, it is not so hard to add configure checks for perl modules.
> > And we need to test not only for IPC::Run, but for Test::More too,
> > because some Linux distributions put modules which come with perl
> > into separate package.  
> 
> The last time we discussed about that on this list we concluded that
> it was not really necessary to have such checks, for one it makes the
> code more simple, and because this is leveraged by the presence of
> --enable-tap-tests, tests which can get actually costly with
> check-world. But this is digressing the subject of this thread, which
> deals with the fact of having recovery tests integrated in core...

Of course, such configure tests should be run only if
--enable-tap-tests is passed to the configure script

It would look  like

if test "$enable_tap_tests" = "yes"; then AX_PROG_PERL_MODULES( Test::More, , AC_MSG_ERROR([Test::More is necessary to
runTAP Tests]) AX_PROG_PERL_MODULES( IPC::Run, , AC_MSG_ERROR([IPC::Run is necessary to run TAP Tests])
 
fi

in the configure.in

May be it is not strictly necessary, but it is really useful to see
such problems as clear error message during configure stage, rather
than successfully configure, compile, run tests and only then find out,
that something is forgotten.

I don't see why having such tests in the configure.in, makes code more
complex. It just prevents configure to finish successfully if
--enable-tap-tests is specified and required modules are not available.




--                                   Victor Wagner <vitus@wagner.pp.ru>

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Victor Wagner

Date:

04 February 2016, 18:18:31

This patch adds a long-awaited functionality to the PostgreSQL test
suite - testing of cluster configuration.

It contains bare minimum of replication and recovery test, but it should
be a good starting point for other people. 

Really, adding a much more tests for replication and recovery
is problematic, because these tests are resource-hungry, and take big
enough time even on relatively powerful machines, but  it seems to be
necessary, because they need to create several temporary installation.

So, set of tests, included into this patch is reasonably good choice. 

I think that readability of tests can be improved a bit, because these
tests would serve as an example for all tap test writers.

It's quite good that patch sets standard of using 'use strict; use
warnings;' in the test script.

It is bad, that Postgres-specific perl modules do not have embedded
documentation. It would be nice to see POD documentation in the
TestLib.pm and PostgresNode.pm instead of just comments. It would be
much easier to test writers to read documentation using perldoc utility,
rather than browse through the code.

I'll second Stas' suggestion about psql_ok/psql_fail functions.

1. psql_ok instead of just psql would provide visual feedback for the
reader of code. One would see 'here condition is tested, here is
something ended with _ok/_fail'.

It would be nice that seeing say "use Test::More tests => 4"
one can immediately see "Yes, there is three _ok's and one _fail in the
script'

2. I have use case for psql_fail code. In my libpq failover patch there
is number of cases, where it should be tested that connection is not
established,

But this is rather about further evolution of the tap test library, not
about this set of tests.

I think that this patch should be commited as soon as possible in its
current form (short of already reported reject in the PostgresNode.pm
init function).


--                                   Victor Wagner <vitus@wagner.pp.ru>

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

04 February 2016, 19:18:01

On Thu, Feb 4, 2016 at 9:18 PM, Victor Wagner wrote:
> It's quite good that patch sets standard of using 'use strict; use
> warnings;' in the test script.

FWIW, this is decided as being a standard rule for any modules/script
added in the main tree.

> It is bad, that Postgres-specific perl modules do not have embedded
> documentation. It would be nice to see POD documentation in the
> TestLib.pm and PostgresNode.pm instead of just comments. It would be
> much easier to test writers to read documentation using perldoc utility,
> rather than browse through the code.

Why not. I am no perlist but those prove to be helpful, however those
Postgres modules are not dedicated to a large audience, so we could
live without for now.

> I think that this patch should be commited as soon as possible in its
> current form (short of already reported reject in the PostgresNode.pm
> init function).

Thanks for your enthusiasm. Now, to do an auto-critic of my patch:

+       if ($params{allows_streaming})
+       {
+               print $conf "wal_level = hot_standby\n";
+               print $conf "max_wal_senders = 5\n";
+               print $conf "wal_keep_segments = 20\n";
+               print $conf "max_wal_size = 128MB\n";
+               print $conf "shared_buffers = 1MB\n";
+               print $conf "wal_log_hints = on\n";
+               print $conf "hot_standby = on\n";
+       }
This could have more thoughts, particularly for wal_log_hints which is
not used all the time, I think that we'd actually want to complete
that with an optional hash of parameter/values that get appended at
the end of the configuration file, then pass wal_log_hints in the
tests where it is needed. The default set of parameter is maybe fine
if done this way, still wal_keep_segments could be removed.

+# Tets for timeline switch
+# Encure that a standby is able to follow a newly-promoted standby
Two typos in two lines.
--
Michael

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Noah Misch

Date:

05 February 2016, 03:21:43

On Thu, Feb 04, 2016 at 09:13:46PM +0300, Victor Wagner wrote:
> On Thu, 4 Feb 2016 18:33:27 +0300 Michael Paquier <michael.paquier@gmail.com> wrote:

> > > Really, it is not so hard to add configure checks for perl modules.
> > > And we need to test not only for IPC::Run, but for Test::More too,
> > > because some Linux distributions put modules which come with perl
> > > into separate package.  
> > 
> > The last time we discussed about that on this list we concluded that
> > it was not really necessary to have such checks, for one it makes the
> > code more simple, and because this is leveraged by the presence of
> > --enable-tap-tests, tests which can get actually costly with
> > check-world. But this is digressing the subject of this thread, which
> > deals with the fact of having recovery tests integrated in core...
> 
> Of course, such configure tests should be run only if
> --enable-tap-tests is passed to the configure script
> 
> It would look  like
> 
> if test "$enable_tap_tests" = "yes"; then
>   AX_PROG_PERL_MODULES( Test::More, , AC_MSG_ERROR([Test::More is
>   necessary to run TAP Tests])
>   AX_PROG_PERL_MODULES( IPC::Run, , AC_MSG_ERROR([IPC::Run is
>   necessary to run TAP Tests])
> fi
> 
> in the configure.in
> 
> May be it is not strictly necessary, but it is really useful to see
> such problems as clear error message during configure stage, rather
> than successfully configure, compile, run tests and only then find out,
> that something is forgotten.
> 
> I don't see why having such tests in the configure.in, makes code more
> complex. It just prevents configure to finish successfully if
> --enable-tap-tests is specified and required modules are not available.

Even if detecting missing modules at "configure" time is the right thing, it
belongs in a distinct patch, discussed on a distinct thread.  The absence of
IPC::Run affects the proposed replication tests in the same way it affects
current TAP suites, so this thread has no business revisiting it.

Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

05 February 2016, 07:20:20

On Thu, Feb 4, 2016 at 11:58 PM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:
>> On 04 Feb 2016, at 12:59, Michael Paquier <michael.paquier@gmail.com> wrote:
>>> 0) There are several routines that does actual checking, like is/command_ok/command_fails. I think it will be very
handyto have wrappers psql_ok/psql_fails that calls psql through the command_ok/fails. 
>>
>> Do you have a test case in mind for it?
>
> Yes, I’ve used that to test prepare/commit while recovery (script attached, it’s in WIP state, i’ll submit that later
alongwith other twophase stuff). 

Oh, OK, now I see. Well it seems to make sense for your case, though
it does not seem to be directly linked to the patch here. We could
incrementally add something on top of the existing infrastructure that
gets into the code tree once the 2PC patch gets in a more advanced
shape.

>>> 2) --enable-tap-tests deserves mention in test/recovery/README and more obvious error message when one trying to
runmake check in test/recovery without --enable-tap-tests. 
>>
>> When running without --enable-tap-tests from src/test/recovery you
>> would get the following error per how prove_check is defined:
>> "TAP tests not enabled"
>
> Yes, but that message doesn’t mention --enable-tap-tests and README also silent about that too. I didn’t know about
thatflag and had to search in makefiles for this error message to see what conditions leads to it. I think we can save
planetfrom one more stackoverflow question if the error message will mention that flag. 

Well, that works for the whole TAP test infrastructure and not really
this patch only. Let's not forget that the goal of this thread is to
provide a basic set of tests and routines to help people building test
cases for more advanced clustering scenarios, so I'd rather not
complicate the code with side things and remain focused on the core
problem.
--
Michael

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

17 February 2016, 12:52:15

On Fri, Feb 5, 2016 at 4:17 AM, Michael Paquier wrote:
> Thanks for your enthusiasm. Now, to do an auto-critic of my patch:
> +       if ($params{allows_streaming})
> +       {
> +               print $conf "wal_level = hot_standby\n";
> +               print $conf "max_wal_senders = 5\n";
> +               print $conf "wal_keep_segments = 20\n";
> +               print $conf "max_wal_size = 128MB\n";
> +               print $conf "shared_buffers = 1MB\n";
> +               print $conf "wal_log_hints = on\n";
> +               print $conf "hot_standby = on\n";
> +       }
> This could have more thoughts, particularly for wal_log_hints which is
> not used all the time, I think that we'd actually want to complete
> that with an optional hash of parameter/values that get appended at
> the end of the configuration file, then pass wal_log_hints in the
> tests where it is needed. The default set of parameter is maybe fine
> if done this way, still wal_keep_segments could be removed.

At the end I have refrained from doing that, and refactoring
setup_cluster@RewindTest.pm to use the new option allows_streaming,
simplifying a bit the code. The introduction of allows_streaming could
be done in a separate patch, though it did not seem worth the
complication when hacking at that. The split is simple, though.

> +# Tets for timeline switch
> +# Encure that a standby is able to follow a newly-promoted standby
> Two typos in two lines.

Fixed.

I also found an issue with the use of application_name causing test
001 to fail, bug squashed on the way. The generation of paths for
archive_command and restore_command was incorrect on Windows. Those
need to use two backslashes (to detect correctly files) and need to be
double-quoted (to avoid errors with command copy should a space be
included in the path). I have added as well a new subcommand in
vcregress.pl called recoverycheck where one can run the recovery test
suite on Windows using MSVC.

Attached are rebased patches, split into 3 parts doing the following:
- 0001, fix default configuration of MSVC builds ignoring TAP tests
- 0002, add a promote routine in PostgresNode.pm. pg_rewind's tests
can make immediate use of that.
- 0003, the actual test suite.
This is registered in CF 2016-03 as well for further consideration.
--
Michael

Attachment

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

26 February 2016, 02:59:01

On Wed, Feb 17, 2016 at 9:52 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Feb 5, 2016 at 4:17 AM, Michael Paquier wrote:
>> Thanks for your enthusiasm. Now, to do an auto-critic of my patch:
>> +       if ($params{allows_streaming})
>> +       {
>> +               print $conf "wal_level = hot_standby\n";
>> +               print $conf "max_wal_senders = 5\n";
>> +               print $conf "wal_keep_segments = 20\n";
>> +               print $conf "max_wal_size = 128MB\n";
>> +               print $conf "shared_buffers = 1MB\n";
>> +               print $conf "wal_log_hints = on\n";
>> +               print $conf "hot_standby = on\n";
>> +       }
>> This could have more thoughts, particularly for wal_log_hints which is
>> not used all the time, I think that we'd actually want to complete
>> that with an optional hash of parameter/values that get appended at
>> the end of the configuration file, then pass wal_log_hints in the
>> tests where it is needed. The default set of parameter is maybe fine
>> if done this way, still wal_keep_segments could be removed.
>
> At the end I have refrained from doing that, and refactoring
> setup_cluster@RewindTest.pm to use the new option allows_streaming,
> simplifying a bit the code. The introduction of allows_streaming could
> be done in a separate patch, though it did not seem worth the
> complication when hacking at that. The split is simple, though.
>
>> +# Tets for timeline switch
>> +# Encure that a standby is able to follow a newly-promoted standby
>> Two typos in two lines.
>
> Fixed.
>
> I also found an issue with the use of application_name causing test
> 001 to fail, bug squashed on the way. The generation of paths for
> archive_command and restore_command was incorrect on Windows. Those
> need to use two backslashes (to detect correctly files) and need to be
> double-quoted (to avoid errors with command copy should a space be
> included in the path). I have added as well a new subcommand in
> vcregress.pl called recoverycheck where one can run the recovery test
> suite on Windows using MSVC.
>
> Attached are rebased patches, split into 3 parts doing the following:
> - 0001, fix default configuration of MSVC builds ignoring TAP tests
> - 0002, add a promote routine in PostgresNode.pm. pg_rewind's tests
> can make immediate use of that.
> - 0003, the actual test suite.
> This is registered in CF 2016-03 as well for further consideration.

Here is a rebased set after the conflicts created by e640093, with the
following changes:
- In 0002, added perldoc for new promote routine
- In 0003, added perldoc documentation for the new options introduced
in init and init_from_backup, and fixed some log entries not using the
node name to identify the node involved when enabling archive,
streaming or recovery.
- Craig has pinged me regarding tap_tests being incorrectly updated in
config_default.pl in 0003.
I just re-ran the tests on OSX and Windows (MSVC 2010 with Win7) to be
sure that nothing broke, and nothing has been reported as broken.
--
Michael

Attachment

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Craig Ringer

Date:

26 February 2016, 04:47:59

On 26 February 2016 at 10:58, Michael Paquier <michael.paquier@gmail.com> wrote:

Here is a rebased set after the conflicts created by e640093, with the
following changes:

Thanks for rebasing on top of that. Not totally fair when your patch came first, but I guess it was simpler to merge the other one first.

- In 0002, added perldoc for new promote routine
- In 0003, added perldoc documentation for the new options introduced
in init and init_from_backup, and fixed some log entries not using the
node name to identify the node involved when enabling archive,
streaming or recovery.

Very much appreciated.

- Craig has pinged me regarding tap_tests being incorrectly updated in
config_default.pl in 0003.
I just re-ran the tests on OSX and Windows (MSVC 2010 with Win7) to be
sure that nothing broke, and nothing has been reported as broken.

I've looked over the tests. I see that you've updated the docs for the Windows tests to reflect the changes, which is good, thanks.

I like the patch and would love to see it committed soon.

I do have one major disagreement, which is that you turn autovacuum off if streaming is enabled. This is IMO completely wrong and must be removed. It's making the tests ignore a major and important part of real-world use.

If you did it to make it easier to test replay catchup etc, just use pg_xlog_location_diff instead of an equality test. Instead of:

my $caughtup_query = "SELECT '$current_lsn'::pg_lsn <= pg_last_xlog_replay_location()";

use

my $caughtup_query = "SELECT pg_xlog_location_diff('$current_lsn', pg_last_xlog_replay_location()) <= 0";

so it doesn't care if we replay past the expected LSN on the master due to autovacuum activity. That's what's done in the real world and what should be covered by the tests IMO.

The patch sets

tap_tests => 1,

in config_default.pl. Was that on purpose? I'd have no problem with running the TAP tests by default if they worked by default, but the docs say that at least with ActiveState's Perl you have to jump through some hoops to get IPC::Run.

Typo in PostgresNode.pm: passiong should be 'passing' .

Otherwise looks _really_ good and I'd love to see this committed very soon.

I'd like a way to append parameters in a way that won't clobber settings made implicitly by the module through things like enable_streaming but I can add that in a followup patch. It doesn't need to complicate this one. I'm thinking of having the tests append an include_dir directive when they create a node, maintain a hash of all parameters and rewrite a postgresql.conf.taptests file in the include_dir when params are updated. Also exposing a 'reload' call.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

26 February 2016, 05:43:33

On Fri, Feb 26, 2016 at 1:47 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 26 February 2016 at 10:58, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> Here is a rebased set after the conflicts created by e640093, with the
>> following changes:
>
> Thanks for rebasing on top of that. Not totally fair when your patch came
> first, but I guess it was simpler to merge the other one first.

At this point the final result is the same. It does not matter what
gets in first.

> I do have one major disagreement, which is that you turn autovacuum off if
> streaming is enabled. This is IMO completely wrong and must be removed. It's
> making the tests ignore a major and important part of real-world use.

This has been chosen for consistency with what is in pg_rewind tests,
the idea being to keep the runs more stable with a WAL output under
control to allow predictable results. Though I do not see any direct
reason to not remove it actually now that I think about it.

> If you did it to make it easier to test replay catchup etc, just use
> pg_xlog_location_diff instead of an equality test. Instead of:
> my $caughtup_query = "SELECT '$current_lsn'::pg_lsn <=
> pg_last_xlog_replay_location()";
> use
> my $caughtup_query = "SELECT pg_xlog_location_diff('$current_lsn',
> pg_last_xlog_replay_location()) <= 0";
> so it doesn't care if we replay past the expected LSN on the master due to
> autovacuum activity. That's what's done in the real world and what should be
> covered by the tests IMO.

Those two statements have the same meaning. pg_xlog_location_diff does
exactly the same thing as the pg_lsn data type in terms of LSN
comparisons. Choosing one or the other is really a matter of taste.
Though I see that 001 is the only test that uses an equality, this
should not be the case I agree.

> The patch sets
>
>     tap_tests => 1,
>
> in config_default.pl. Was that on purpose? I'd have no problem with running
> the TAP tests by default if they worked by default, but the docs say that at
> least with ActiveState's Perl you have to jump through some hoops to get
> IPC::Run.

No, this was an error in the previous version of the patch 0003. Those
tests should be disabled by default, to match what ./configure does,
and also because installing IPC::Run requires some extra operations,
but that's easily doable with a bit of black magic.

> Typo in PostgresNode.pm: passiong should be 'passing'.

Oops.

> I'd like a way to append parameters in a way that won't clobber settings
> made implicitly by the module through things like enable_streaming but I can
> add that in a followup patch. It doesn't need to complicate this one.

This is something that I have been thinking about for some time while
hacking this thing, but I finished with the current version to not
complicate the patch more than it needs to be, and because the current
version is enough for the needs of all the tests present. Surely this
can be extended further more. One idea that I had was for example to
pass as parameter to init() and init_from_backup() a set of key/values
that would be appended to postgresql.conf.

> I'm thinking of having the tests append an include_dir directive when they
> create a node, maintain a hash of all parameters and rewrite a
> postgresql.conf.taptests file in the include_dir when params are updated.
> Also exposing a 'reload' call.

The reload wrapper would make sense to have. That has not proved to be
necessary yet.
--
Michael

Attachment

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Craig Ringer

Date:

26 February 2016, 07:06:16

On 26 February 2016 at 13:43, Michael Paquier <michael.paquier@gmail.com> wrote:

> my $caughtup_query = "SELECT '$current_lsn'::pg_lsn <=
> pg_last_xlog_replay_location()";
> use
> my $caughtup_query = "SELECT pg_xlog_location_diff('$current_lsn',
> pg_last_xlog_replay_location()) <= 0";
> so it doesn't care if we replay past the expected LSN on the master due to
> autovacuum activity. That's what's done in the real world and what should be
> covered by the tests IMO.

Those two statements have the same meaning. pg_xlog_location_diff does
exactly the same thing as the pg_lsn data type in terms of LSN
comparisons.

Ah. Whoops. I meant to write '=' in the first, to reflect what the code does.

You're quite right that casting to pg_lsn has the same effect and looks cleaner.

I think this looks good as of the last version. I'm not keen on disabling autovacuum but that can be addressed in a followup that makes it easier to configure params, as discussed. I definitely don't want to complicate this patch with it.

Should be committed ASAP IMO.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

26 February 2016, 15:59:19

Victor Wagner wrote:

> I'll second Stas' suggestion about psql_ok/psql_fail functions.
> 
> 1. psql_ok instead of just psql would provide visual feedback for the
> reader of code. One would see 'here condition is tested, here is
> something ended with _ok/_fail'.
> 
> It would be nice that seeing say "use Test::More tests => 4"
> one can immediately see "Yes, there is three _ok's and one _fail in the
> script'
> 
> 2. I have use case for psql_fail code. In my libpq failover patch there
> is number of cases, where it should be tested that connection is not
> established,
> 
> But this is rather about further evolution of the tap test library, not
> about this set of tests.

This makes sense to me.  Please submit a patch for this.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

26 February 2016, 16:02:11

Michael Paquier wrote:
> Attached are rebased patches, split into 3 parts doing the following:
> - 0001, fix default configuration of MSVC builds ignoring TAP tests

BTW you keep submitting this one and I keep ignoring it.  I think you
should start a separate thread for this one, so that some
Windows-enabled committer can look at it and maybe push it.  I still
don't understand why this matters.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

26 February 2016, 18:43:24

Kyotaro HORIGUCHI wrote:

> So, I'd like to propose four (or five) changes to this harness.
> 
>  - prove_check to remove all in tmp_check
> 
>  - TestLib to preserve temporary directories/files if the current
>    test fails.
> 
>  - PostgresNode::get_new_node to create data directory with
>    meaningful basenames.
> 
>  - PostgresNode::psql to return a list of ($stdout, $stderr) if
>    requested. (The previous behavior is not changed)
> 
>  - (recovery/t/00x_* gives test number to node name)
> 
> As a POC, the attached diff will appliy on the 0001 and (fixed)
> 0003 patches.

These changes all seem very reasonable to me.  I'm not so sure about the
last one.  Perhaps the framework ought to generate an appropriate subdir
name using the test file name plus the node name, so that instead of
tmp_XXXX it becomes tmp_001_master_XXXX or something like that?  Having
be a coding convention doesn't look real nice to me.

I didn't try to apply your patch but I'm fairly certain it would
conflict with what's here now; can you please rebase and resend?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

26 February 2016, 19:30:22

Craig Ringer wrote:

> Should be committed ASAP IMO.

Finally pushed it.  Let's see how it does in the buildfarm.  Now let's
get going and add more tests, I know there's no shortage of people with
test scripts waiting for this.

Thanks, Michael, for the persistency, and thanks to all reviewers.  (I'm
sorry we seem to have lost Amir Rohan in the process.  He was doing
a great job.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

26 February 2016, 22:38:01

On Sat, Feb 27, 2016 at 4:30 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Craig Ringer wrote:
>> Should be committed ASAP IMO.
>
> Finally pushed it.  Let's see how it does in the buildfarm.  Now let's
> get going and add more tests, I know there's no shortage of people with
> test scripts waiting for this.
>
> Thanks, Michael, for the persistency, and thanks to all reviewers.  (I'm
> sorry we seem to have lost Amir Rohan in the process.  He was doing
> a great job.)

Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
This has been a long trip. Thanks a lot to all involved. Many people
have reviewed and helped out with this patch.
-- 
Michael

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

28 February 2016, 13:41:55

On Sat, Feb 27, 2016 at 7:37 AM, Michael Paquier wrote:
> Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
> Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
> This has been a long trip. Thanks a lot to all involved. Many people
> have reviewed and helped out with this patch.

I just had a closer look at what has been committed, and I found a
couple of minor issues, addressed via the patch attached:
1) install-windows.sgml should use the markup command when mentioning
bincheck and recoverycheck
2) src/test/recovery/.gitignore is listing /regress_log/ but that's
not needed (this is a remnant of a previous version of the patch
posted on this thread).
3) Header of 002_archiving.pl mentions that the tests are running on a
warm standby, but that's a hot standby (found by Craig and reported on
github on my account)
4) src/test/recovery/Makefile is missing a clean target, to remove tmp_check/.
5) src/tools/msvc/clean.bat is missing the same cleanup command for
the same thing after running the tests.
6) Header of 004_timeline_switch.pl should perhaps mention that a
cascading standby is used (idea of Craig, a good addition IMO)

Regards,
--
Michael

Attachment

test-recovery-fixes.patch

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

28 February 2016, 14:06:19

On Sun, Feb 28, 2016 at 10:41 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Sat, Feb 27, 2016 at 7:37 AM, Michael Paquier wrote:
>> Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
>> Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
>> This has been a long trip. Thanks a lot to all involved. Many people
>> have reviewed and helped out with this patch.
>
> I just had a closer look at what has been committed, and I found a
> couple of minor issues, addressed via the patch attached:
> 1) install-windows.sgml should use the markup command when mentioning
> bincheck and recoverycheck
> 2) src/test/recovery/.gitignore is listing /regress_log/ but that's
> not needed (this is a remnant of a previous version of the patch
> posted on this thread).
> 3) Header of 002_archiving.pl mentions that the tests are running on a
> warm standby, but that's a hot standby (found by Craig and reported on
> github on my account)
> 4) src/test/recovery/Makefile is missing a clean target, to remove tmp_check/.
> 5) src/tools/msvc/clean.bat is missing the same cleanup command for
> the same thing after running the tests.
> 6) Header of 004_timeline_switch.pl should perhaps mention that a
> cascading standby is used (idea of Craig, a good addition IMO)

7) src/test/README is not describing recovery/
8) This description in src/test/recovery/README is not exact, it
mentions a set of routines that are now part of PostgresNode.pm:
+This directory contains a test suite for recovery and replication,
+testing mainly the interactions of recovery.conf with cluster
+instances by providing a simple set of routines that can be used
+to define a custom cluster for a test, including backup, archiving,
+and streaming configuration.
I would suggest removing the last 4 lines and simplify the paragraph.
9) I have no logical explanation to explain why I am seeing all those
things now.
v2 is attached.
--
Michael

Attachment

test-recovery-fixes-v2.patch

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Craig Ringer

Date:

28 February 2016, 14:22:14

On 27 February 2016 at 06:37, Michael Paquier <michael.paquier@gmail.com> wrote:

On Sat, Feb 27, 2016 at 4:30 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Craig Ringer wrote:
>> Should be committed ASAP IMO.
>
> Finally pushed it. Let's see how it does in the buildfarm. Now let's
> get going and add more tests, I know there's no shortage of people with
> test scripts waiting for this.
>
> Thanks, Michael, for the persistency, and thanks to all reviewers. (I'm
> sorry we seem to have lost Amir Rohan in the process. He was doing
> a great job.)

Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
This has been a long trip. Thanks a lot to all involved. Many people
have reviewed and helped out with this patch.

Congratulations and thanks.

I don't see any new buildfarm failures. The BinInstallCheck failure on Windows predates this and the isolationtest failure on OpenBSD is unrelated.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

28 February 2016, 22:40:34

On Sun, Feb 28, 2016 at 11:22 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 27 February 2016 at 06:37, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> On Sat, Feb 27, 2016 at 4:30 AM, Alvaro Herrera
>> <alvherre@2ndquadrant.com> wrote:
>> > Craig Ringer wrote:
>> >> Should be committed ASAP IMO.
>> >
>> > Finally pushed it.  Let's see how it does in the buildfarm.  Now let's
>> > get going and add more tests, I know there's no shortage of people with
>> > test scripts waiting for this.
>> >
>> > Thanks, Michael, for the persistency, and thanks to all reviewers.  (I'm
>> > sorry we seem to have lost Amir Rohan in the process.  He was doing
>> > a great job.)
>>
>> Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
>> Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
>> This has been a long trip. Thanks a lot to all involved. Many people
>> have reviewed and helped out with this patch.
>
>
> Congratulations and thanks.
>
> I don't see any new buildfarm failures. The BinInstallCheck failure on
> Windows predates this and the isolationtest failure on OpenBSD is unrelated.

The buildfarm does not have infrastructure to test that yet.. I need
to craft a patch for the client-side code and send it to Andrew. Will
try to do so today.
-- 
Michael

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

29 February 2016, 04:35:59

On Mon, Feb 29, 2016 at 7:40 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> The buildfarm does not have infrastructure to test that yet.. I need
> to craft a patch for the client-side code and send it to Andrew. Will
> try to do so today.

For those interested, here is where things are going to happen:
https://github.com/PGBuildFarm/client-code/pull/7
(This patch could be more refactored, I am not sure how much Andrew
would like things to be less duplicated, so I went to the most simple
solution).
-- 
Michael

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

29 February 2016, 21:26:11

Michael Paquier wrote:

> 9) I have no logical explanation to explain why I am seeing all those
> things now.

Happens all the time ...

Pushed, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

01 March 2016, 02:28:35

On Tue, Mar 1, 2016 at 6:25 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>
>> 9) I have no logical explanation to explain why I am seeing all those
>> things now.
>
> Happens all the time ...
>
> Pushed, thanks.

Thanks. I am going to patch by buildfarm scripts to run those tests on
hamster. Let's see what happens for this machine.
-- 
Michael

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

01 March 2016, 06:24:17

On Tue, Mar 1, 2016 at 11:28 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Mar 1, 2016 at 6:25 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Michael Paquier wrote:
>>
>>> 9) I have no logical explanation to explain why I am seeing all those
>>> things now.
>>
>> Happens all the time ...
>>
>> Pushed, thanks.
>
> Thanks. I am going to patch by buildfarm scripts to run those tests on
> hamster. Let's see what happens for this machine.

It looks like it works fine (see step recovery-check):
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hamster&dt=2016-03-01%2002%3A34%3A26
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hamster&dt=2016-03-01%2002%3A34%3A26&stg=recovery-check
Let's that run for a couple of days..
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Kyotaro HORIGUCHI

Date:

01 March 2016, 08:14:45

Hello,

At Fri, 26 Feb 2016 15:43:14 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in
<20160226184314.GA205945@alvherre.pgsql>
> Kyotaro HORIGUCHI wrote:
> 
> > So, I'd like to propose four (or five) changes to this harness.
> > 
> >  - prove_check to remove all in tmp_check
> > 
> >  - TestLib to preserve temporary directories/files if the current
> >    test fails.
> > 
> >  - PostgresNode::get_new_node to create data directory with
> >    meaningful basenames.
> > 
> >  - PostgresNode::psql to return a list of ($stdout, $stderr) if
> >    requested. (The previous behavior is not changed)
> > 
> >  - (recovery/t/00x_* gives test number to node name)
> > 
> > As a POC, the attached diff will appliy on the 0001 and (fixed)
> > 0003 patches.
> 
> These changes all seem very reasonable to me.  I'm not so sure about the
> last one.  Perhaps the framework ought to generate an appropriate subdir
> name using the test file name plus the node name, so that instead of
> tmp_XXXX it becomes tmp_001_master_XXXX or something like that?  Having
> be a coding convention doesn't look real nice to me.

Thank you for mentioning this.

Sorry, the last one accidentally contained garbage code to
intentionally raise an error. The last one names the nodes as
such like '001_master_24f8'. Maybe prefixing "tmp_" would be
better.

> I didn't try to apply your patch but I'm fairly certain it would
> conflict with what's here now; can you please rebase and resend?

This was a very small patch made on the old, uncommited
patches. I remade the patch and split it into two parts.

0001-Change-behavior-...
 Changes of PostmasterNode.pm and TestLib.pm to add some features and change a behavior.

0002-Prefix-test-numbers-to-node-
 This is rather a example usage of 0001- patch (except for stderr stuff). 00n_xxx test leaves temporary directories
withthe names of 00n_(master|standby)_XXXX on failure. If this is considered reasonable, I'll make same patches for the
other/t/nnn_*.pl tests.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

01 March 2016, 12:46:05

On Tue, Mar 1, 2016 at 5:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 0001-Change-behavior-...
>
>   Changes of PostmasterNode.pm and TestLib.pm to add some
>   features and change a behavior.

+   # Preserve temporary directory for this test if failure
+   $File::Temp::KEEP_ALL = 1 unless Test::More->builder->is_passing;
+1. Having the data folders being removed even in case of a failure is
really annoying.

+   # Preserve temporary directory for this test if failure
+   $File::Temp::KEEP_ALL = 1 unless Test::More->builder->is_passing;
s/if failure/in the event of a failure/?

+   return wantarray ? ($stdout, $stderr) : $stdout;
So you are willing to extend that so as you could perform conparison
tests on the error strings returned. Why no, it looks useful, though
now there is no test in need of it I think. So without a proper need I
think that we could live without.

> 0002-Prefix-test-numbers-to-node-
>
>   This is rather a example usage of 0001- patch (except for
>   stderr stuff). 00n_xxx test leaves temporary directories with
>   the names of 00n_(master|standby)_XXXX on failure. If this is
>   considered reasonable, I'll make same patches for the other
>   /t/nnn_*.pl tests.

-my $node_master = get_new_node('master');
+my $node_master = get_new_node('001_master');
I am not a fan of appending the test number in the node name. For one,
this complicates the log file name associated with a node by
duplicating the test number in its name. Also, it is possible to
easily get the name of the data folder for a node by looking at the
logs.
-- 
Michael

Re: [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Michael Paquier

Date:

01 March 2016, 12:51:37

On Sat, Feb 27, 2016 at 1:02 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Michael Paquier wrote:
>
>> Attached are rebased patches, split into 3 parts doing the following:
>> - 0001, fix default configuration of MSVC builds ignoring TAP tests
>
> BTW you keep submitting this one and I keep ignoring it.  I think you
> should start a separate thread for this one, so that some
> Windows-enabled committer can look at it and maybe push it.  I still
> don't understand why this matters.

OK. I will create a separate thread on hackers. I think that this is
still a bug.
-- 
Michael

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Craig Ringer

Date:

01 March 2016, 13:05:39

On 1 March 2016 at 20:45, Michael Paquier <michael.paquier@gmail.com> wrote:

On Tue, Mar 1, 2016 at 5:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 0001-Change-behavior-...
>
> Changes of PostmasterNode.pm and TestLib.pm to add some
> features and change a behavior.

+ # Preserve temporary directory for this test if failure
+ $File::Temp::KEEP_ALL = 1 unless Test::More->builder->is_passing;
+1. Having the data folders being removed even in case of a failure is
really annoying.

I agree on all points re your review. Keeping tempdirs is really needed, the tempdir name change is good, the the rest I'm not keen on.

I've addressed the need to get stderr from psql in a patch I'll submit separately, which provides a thinner wrapper around IPC::Run for more complex needs, then uses that for the existing 'psql' function. It also provides a 'psql_check' that dies on error.

I'll incorporate the wanted changes into that patch.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Craig Ringer

Date:

01 March 2016, 13:53:19

On 1 March 2016 at 21:05, Craig Ringer <craig@2ndquadrant.com> wrote:

On 1 March 2016 at 20:45, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Mar 1, 2016 at 5:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 0001-Change-behavior-...
>
> Changes of PostmasterNode.pm and TestLib.pm to add some
> features and change a behavior.

+ # Preserve temporary directory for this test if failure
+ $File::Temp::KEEP_ALL = 1 unless Test::More->builder->is_passing;
+1. Having the data folders being removed even in case of a failure is
really annoying.

I agree on all points re your review. Keeping tempdirs is really needed, the tempdir name change is good, the the rest I'm not keen on.

I've addressed the need to get stderr from psql in a patch I'll submit separately, which provides a thinner wrapper around IPC::Run for more complex needs, then uses that for the existing 'psql' function. It also provides a 'psql_check' that dies on error.

I'll incorporate the wanted changes into that patch.

OK, done.

https://commitfest.postgresql.org/9/569/#

Michael, I'd value your thoughts on the patches. I think the psql one is important, I found the existing 'psql' function too limiting and I think defaulting to ignoring errors is sub-optimal.

Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

From

Alvaro Herrera

Date:

01 March 2016, 15:10:31

Michael Paquier wrote:
> On Tue, Mar 1, 2016 at 5:13 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> +   return wantarray ? ($stdout, $stderr) : $stdout;
> So you are willing to extend that so as you could perform conparison
> tests on the error strings returned. Why no, it looks useful, though
> now there is no test in need of it I think. So without a proper need I
> think that we could live without.

Does this change let us implement psql_ok and psql_fail?  I think I've
seen a few places already, both in committed code and in submitted
patches, that test for some kind of failure from psql.

> > 0002-Prefix-test-numbers-to-node-
> >
> >   This is rather a example usage of 0001- patch (except for
> >   stderr stuff). 00n_xxx test leaves temporary directories with
> >   the names of 00n_(master|standby)_XXXX on failure. If this is
> >   considered reasonable, I'll make same patches for the other
> >   /t/nnn_*.pl tests.
> 
> -my $node_master = get_new_node('master');
> +my $node_master = get_new_node('001_master');
> I am not a fan of appending the test number in the node name. For one,
> this complicates the log file name associated with a node by
> duplicating the test number in its name. Also, it is possible to
> easily get the name of the data folder for a node by looking at the
> logs.

Why don't we use something similar to what's in $test_logfile in
TestLib?

> Also, it is possible to easily get the name of the data folder for a
> node by looking at the logs.

No disagreement on it being possible, but "easily" seems a bad
description for that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services