Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc. - Mailing list pgsql-hackers

From Amir Rohan
Subject Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.
Msg-id 5616317D.8060108@zoho.com
In response to Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: Re: In-core regression tests for replication, cascading, archiving, PITR, etc.  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers
On 10/08/2015 10:39 AM, Michael Paquier wrote:
> On Thu, Oct 8, 2015 at 3:59 PM, Amir Rohan <amir.rohan@zoho.com> wrote:
>> On 10/08/2015 08:19 AM, Michael Paquier wrote:
>>> On Wed, Oct 7, 2015 at 5:44 PM, Amir Rohan wrote:
>>>> On 10/07/2015 10:29 AM, Michael Paquier wrote:
>>>>> On Wed, Oct 7, 2015 at 4:16 PM, Amir Rohan wrote:
>>>>>> Also, the removal of poll_query_until from pg_rewind looks suspiciously
>>>>>> like a copy-paste gone bad. Pardon if I'm missing something.
>>>>>
>>>>> Perhaps. Do you have a suggestion regarding that? It seems to me that
>>>>> this is more useful in TestLib.pm as-is.
>>>>>
>>>>
>>>> My mistake, the patch only shows some internal function being deleted
>>>> but RewindTest.pm (obviously) imports TestLib. You're right, TestLib is
>>>> a better place for it.
>>>
>>> OK. Here is a new patch version. I have removed the restriction
>>> preventing make_master from being called multiple times in the same
>>> script (one may actually want to test some stuff related to logical
>>> decoding or FDW for example, who knows...), forcing PGHOST to always
>>> use the same value after it has been initialized. I have added a
>>> sanity check though: it is not possible to create a node from a base
>>> backup if no master is defined. This looks like cheap insurance... I
>>> also refactored the code a bit, using the new init_node_info to fill
>>> in the fields of a newly-initialized node, and I removed
>>> get_free_port, init_node, init_node_from_backup, enable_restoring
>>> and enable_streaming from the list of routines exposed to users;
>>> their functionality is reached through make_master, make_warm_standby
>>> and make_hot_standby. We could expose them again if need be (somebody
>>> may want to get a free port, or to set up a node without those
>>> generic routines), it just does not seem necessary now.
>>> Regards,
>>>
>>
>> If you'd like, I can write up some tests for cascading replication which
>> are currently missing.
> 
> 001 tests cascading: node1 -> node2 -> node3.
> 
>> Someone mentioned a daisy chain setup, which sounds fun. Anything
>> else in particular? Also, it would be nice to have some canned way to
>> measure end-to-end replication latency for a variable number of
>> nodes.
> 
> Hm. Do you mean comparing the LSN positions of two nodes even when
> they are not directly connected to each other? What would you use it
> for?
> 

In a cascading replication setup, the typical _time_ it takes for a
COMMIT on the master to reach the slave at the end of the chain
(assuming a constant WAL generation rate) is an important operational
metric.

It would be useful to catch future regressions in that metric, which
may happen even when a patch doesn't outright break cascading
replication. Just automating the measurement could be useful if
there's no pg facility that tracks performance over time in a
regimented fashion. I've seen multiple projects that consider a
"benchmark suite" part of their testing strategy.


As for the "daisy chain" thing, it was (IIRC) mentioned in a Josh
Berkus talk I caught on YouTube. It's possible to set up cascading
replication, take down the master, and then reinsert it as a
replicating slave, so that you end up with *all* servers replicating
from their predecessor in the chain, and no master. I think it was
more a fun hack than anything, but also an interesting corner case to
investigate.
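
In terms of the patch's helpers, I imagine it would go roughly like
this (pure sketch: stop_node/start_node are invented names, and
whether enable_streaming can be repointed at an existing data
directory like that is a guess on my part):

# Hypothetical "daisy chain": build master -> s1 -> s2, then demote
# the old master to a standby of s2, closing the ring so that no
# node is a master anymore.
my $master = make_master();
my $s1     = make_hot_standby($master);
my $s2     = make_hot_standby($s1);

stop_node($master);                # invented helper: pg_ctl stop
enable_streaming($s2, $master);    # your internal routine: write a
                                   # recovery.conf pointing at $s2
start_node($master);               # invented helper: pg_ctl start
# Ring complete: master -> s1 -> s2 -> master, no primary left.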

>> What about going back through the commit log and writing some
>> regression tests for the real stinkers, if someone cares to volunteer
>> some candidate bugs?
> 
> I have drafted a list with a couple of items upthread:
> http://www.postgresql.org/message-id/CAB7nPqSgffSPhOcrhFoAsDAnipvn6WsH2nYkf1KayRm+9_MTGw@mail.gmail.com
> So based on the existing patch the list becomes as follows:
> - wal_retrieve_retry_interval with a high value, say 2 or 3s, looping
> until it has been received and applied by the standby, checking every
> second.
> - recovery_target_action
> - archive_cleanup_command
> - recovery_end_command
> - pg_xlog_replay_pause and pg_xlog_replay_resume
> In the list of things that could have a test, I recall that we should
> also test 2PC with the recovery delay; look at a1105c3d. This could
> be included in 005.

a1105c3 Mar 23 Fix copy & paste error in 4f1b890b137.  Andres Freund
4f1b890 Mar 15 Merge the various forms of transaction commit & abort records.  Andres Freund

Is that the right commit?

> The advantage of implementing that now is that we could see if the
> existing routines are solid enough or not. 

I can do this if you point me at a self-contained thread/#issue.
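
For example, I'd guess the pg_xlog_replay_pause/resume item from your
list would come out looking something like this (a sketch only:
poll_query_until is the routine just moved into TestLib, though it
takes a connection string today, and the node-handling is my guess at
the patch's API):

my $master  = make_master();
my $standby = make_hot_standby($master);

psql($master, 'CREATE TABLE t (a int)');

# Pause replay on the standby, then generate WAL on the master;
# while paused the standby may receive WAL but must not apply it.
psql($standby, 'SELECT pg_xlog_replay_pause()');
psql($master,  'INSERT INTO t VALUES (1)');
# ... assert here that the row stays invisible on the standby ...

# Resume replay and wait for the standby to catch up.
psql($standby, 'SELECT pg_xlog_replay_resume()');
poll_query_until('SELECT count(*) = 1 FROM t', $standby);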

> Still, looking at what the patch has now, I think that we had better
> get a committer to look at it, and if the core portion gets
> integrated we could already use it for the patch implementing quorum
> synchronous replication and for more advanced tests of pg_rewind's
> timeline handling (both patches of this CF).
>
> I don't mind adding more now, though I
> think that the set of sample tests included in this version is enough
> as a base implementation of the facility and shows what it can do.

Sure.



