Re: In-core regression tests for replication, cascading, archiving, PITR, etc. - Mailing list pgsql-hackers

From Amir Rohan
Subject Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Date
Msg-id trinity-b4a8035d-59af-4c42-a37e-258f0f28e44a-1443795007012@3capp-mailcom-lxa08
Whole thread Raw
Responses Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
List pgsql-hackers
On 10/02/2015 03:33 PM, Michael Paquier wrote:

> Any server instances created during the tests should never use a
> user-defined port for portability. Hence using those ports as keys
> just made sense. We could have for example custom names, that have
> port values assigned to them, but that's actually an overkill and
> complicates the whole facility.
> 

Something like:
   global nPortsAssigned = 0;   AssignPort() -> return is_ok(nPortsAssigned++)

was what I used.

>> 2) Behaviour (paths in particular) is hardwired rather then overridable
>> defaults.
> 
> This is the case of all the TAP tests. We could always use the same
> base directory for all the nodes and then embed a sub-directory whose
> name is decided using the port number. But I am not really sure if
> that's a win.
> 

I understand, but it eliminates the kind of scenarios this convenience
package lets you express... conveniently.

>> This is exactly what I needed to test, problems:
>> 3) Can't stop server without clearing its testing data (the maps holding
>> paths and things). But that data might be specifically
>> needed, in particular the backup shouldn't disappear when the
>> server melts down or we have a very low-grade DBA on our hands.
> 
> OK, you have a point here. You may indeed want routines for to enable
> and disable a node completely decoupled from start and stop, with
> something like enable_node and disable_node that basically registers
> or unregisters it from the list of active nodes. I have updated the
> patch this way.
> 

Excellent.

>> 4) Port assignment relies on liveness checks on running servers.
>> If a server is shut down and a new instantiated, the port will get
>> reused, data will get trashed, and various confusing things can happen.
> 
> Right. The safest way to do that is to check in get_free_port if a
> port number is used by a registered node, and continue to loop in if
> that's the case. So done.
> 

That eliminates the "sweet and gentle" variant of the scenario, but it's
susceptible to the "ABA problem":
https://en.wikipedia.org/wiki/ABA_problem
https://youtu.be/CmxkPChOcvw?t=786

Granted, you have to try fairly hard to shoot yourself in the leg,
but since the solution is so simple, why not? If we never reuse ports
within a single test, this goes away.

>> 5) Servers are shutdown with -m 'immediate', which can lead to races
>> in the script when archiving is turned on. That may be good for some
>> tests, but there's no control over it.
> 
> I hesitated with fast here actually. So changed this way. We would
> want as wall a teardown command to stop the node with immediate and
> unregister the node from the active list.
> 

In particular, I was shutting down an archiving node and the archiving
was truncated. I *think* smart doesn't do this. But again, it's really
that the test writer can't easily override, not that the default is wrong.

>> Other issues:
>> 6. Directory structure, used one directory per thing but more logical
>> to place all things related to an instance under a single directory,
>> and name them according to role (57333_backup, and so on).
> 
> Er, well. The first version of the patch did so, and then I switched
> to an approach closer to what the existing TAP facility is doing. But
> well let's simplify things a bit.
> 

I know, I know, but:
1) an instance is a "thing" in your script, so having its associated
paraphernalia in one place makes more sense (maybe only to me).
2) That's often how folks (well, how I) arrange things in deployment,
at least with archive/backup as symlinks to the nas.

Alternatively, naming the dirs with a prefix (srv_foo_HASH,
backup_foo_backupname_HASH, etc') would work as well.

>> 7. enable_restoring() uses "cp -i" 'archive_command', not a good fit
>> for an automated test.
> 
> This seems like a good default to me, and actually that's portable on
> Windows easily. One could always append a custom archive_command in a
> test when for example testing conflicting archiving when archive_mode
> = always.
> 

Ok, I wasn't sure about this, but specifically activating a switch that
asks for input from the user during a test? hmm.
>> 8. No canned way to output a pprinted overview of the running system
>> (paths, ports, for manual checking).
> 
> Hm. Why not... Are you suggesting something like print_current_conf
> that goes through all the registered nodes and outputs that? How would
> you use it?
>

For one thin, I could open a few terminals and `$(print_env_for_server
5437), so psql just worked.

I wish PEG had that as well.


>> 10. If a test passes/fails or dies due to a bug, everything is cleaned.
>> Great for testing, bad for postmortem.
> 
> That's something directly related to TestLib.pm where
> File:Temp:tempdir creates a temporary path with CLEANUP = 1. We had
> discussions regarding that actually...
> 

>> 11. a canned "server is responding to queries" helper would be convenient.
> 
> Do you mean a wrapper on pg_isready? Do you have use cases in mind for it?
> 

Block until recovery is finished, before testing. eliminate races, and
avoid the stupid sleep(3) I used.

>> It might be a good idea to:
>> 1) Never reuse ports during a test. Liveness checking is used
>> to avoid collisions, but should not determine order of assignment.
> 
> Agreed. As far as I can see the problem here is related to the fact
> that the port of non-running server may be fetched by another one.
> That's a bug of my patch.
> 
>> 2) Decouple cleanup from server shutdown. Do the cleanup as the end of
>> test only, and allow the user to keep things around.
> 
> Agreed here.
> 
>> 3) Adjust the directory structure to one top directory per server with
>> (PGDATA, backup, archive) subdirs.
> 
> Hm. OK. The first version of the patch actually did so.
> 

Well, why does "consistency with TAP test" trump the advantages I
mentioned? does TAP actually care?

>> 4) Instead of passing ports around as keys, have _explicit functions
>> which can be called directly by the user (I'd like the backup *HERE*
>> please), with the current functions refactored to merely invoke them
>> by interpolating in the values associated with the port they were given.
> 
> I don't really see in what this would be a win. We definitely should
> have all the data depending on temporary paths during the tests to
> facilitate the cleanup wrapper work.
> 

The trouble was that I reused paths between servers. If shutdown/cleanup
are decoupled, this is probably not needed.

>> 4b) server shutdown should perhaps be "smart" by default, or segmented
>> into calmly_bring_to_a_close(), pull_electric_plug() and
>> drop_down_the_stairs_into_swimming_pool().
> 
> Nope, not agreeing here. "immediate" is rather violent to stop a node,
> hence I have switched it to use "fast" and there is now a
> teardown_node routine that uses immediate, that's more aimed at
> cleanup up existing things fiercely.
> 

Ok, not as the default, but possible to request a specific kind of
shutdown. I needed smart in my case. Plus, in a scenario, you might
expressly be testing behavior for a specific mode, it needs to be
controllable.

> I have as well moved RecoveryTest.pm to src/test/perl so as all the
> consumers of prove_check can use it by default, and decoupled
> start_node from make_master and make_*_standby so as it is possible to
> add for example new parameters to their postgresql.conf and
> recovery.conf files before starting them.
> 
> Thanks a lot for the feedback! Attached is an updated patch with all
> the things mentioned above done. Are included as well the typo fixes
> you sent upthread.
> Regards,
> 

Great! I'll look at it and post again if there's more. If any of the
above extra explanations make it clearer why I suggested some changes
you didn't like...

Thanks,
Amir




pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Next
From: Takashi Ohnishi
Date:
Subject: Connection string parameter 'replication' in documentation