Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes - Mailing list pgsql-hackers

From Steve Singer
Subject Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes
Date
Msg-id BLU0-SMTP66008BB671292E1CB598B7DCFD0@phx.gbl
Whole thread Raw
In response to [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes
List pgsql-hackers
On 12-06-13 07:28 AM, Andres Freund wrote:
> From: Andres Freund<andres@anarazel.de>
>
> The individual changes need to be identified by an xid. The xid can be a
> subtransaction or a toplevel one, at commit those can be reintegrated by doing
> a k-way mergesort between the individual transaction.
>
> Callbacks for apply_begin, apply_change and apply_commit are provided to
> retrieve complete transactions.
>
> Missing:
> - spill-to-disk
> - correct subtransaction merge, current behaviour is simple/wrong
> - DDL handling (?)
> - resource usage controls
Here is an initial review of the ApplyCache patch.

This patch provides a module for taking actions in the WAL stream and 
groups the actions by transaction, then passing these change records to 
a set of plugin functions.

For each transaction it encounters it keeps a list of the actions in 
that transaction. The ilist included in an earlier patch is used, 
changes resulting from that patch review would effect the code here but 
not in a way that chances the design.  When the module sees a commit for 
a transaction it calls the apply_change callback for each change.

I can think of three ways that a replication system like this could try 
to apply transactions.

1) Each time it sees a new transaction it could open up a new 
transaction on the replica and makes that change.  It leaves the
transaction open and goes on applying the next change (which might be 
for the current transaction or might be for another one).
When it comes across a commit record it would then commit the 
transaction.   If 100 concurrent transactions were open on the origin 
then 100 concurrent transactions will be open on the replica.

2) Determine the commit order of the transactions, group all the changes 
for a particular transaction together and apply them in that order for 
the transaction that committed first, commit that transaction and then 
move onto the transaction that committed second.

3) Group the transactions in a way that you move the replica from one 
consistent snapshot to another.  This is what Slony and Londiste do 
because they don't have the commit order or commit timestamps. Built-in 
replication can do better.

This patch implements option (2).   If we had a way of implementing 
option (1) efficiently would we be better off?

Option (2) requires us to put unparsed WAL data (HeapTuples) in the 
apply cache.  You can't translate this to an independent LCR until you 
call the apply_change record (which happens once the commit is 
encountered).  The reason for this is because some of the changes might 
be DDL (or things generated by a DDL trigger) that will change the 
translation catalog so you can't translate the HeapData to LCR's until 
your at a stage where you can update the translation catalog.  In both 
cases you might need to see later WAL records before you can convert an 
earlier one into an LCR (ie TOAST).

Some of my concerns with the apply cache are

Big transactions (bulk loads, mass updates) will be cached in the apply 
cache until the commit comes along.  One issue Slony has todo with bulk 
operations is that the replicas can't start processing the bulk INSERT 
until after it has commited.  If it takes 10 hours to load the data on 
the master it will take another 10 hours (at best) to load the data into 
the replica(20 hours after you start the process).  With binary 
streaming replication your replica is done processing the bulk update 
shortly after the master is.

Long running transactions can sit in the cache for a long time.  When 
you spill to disk we would want the long running but inactive ones 
spilled to disk first.  This is solvable but adds to the complexity of 
this module, how were you planning on managing which items of the list 
get spilled to disk?

The idea that we can safely reorder the commands into transactional 
groupings works (as far as I know) today because DDL commands get big 
heavy locks that are held until the end of the transaction.  I think 
Robert mentioned earlier in the parent thread that maybe some of that 
will be changed one day.

The downsides of (1) that I see are:

We would want a single backend to keep open multiple transactions at 
once. How hard would that be to implement? Would subtransactions be good 
enough here?

Applying (or even translating WAL to LCR's) the changes in parallel 
across transactions might complicate the catalog structure because each 
concurrent transaction might need its own version of the catalog (or can 
you depend on the locking at the master for this? I think you can today)

With approach (1) changes that are part of a rolledback transaction 
would have more overhead because you would call apply_change on them.

With approach (1) a later component could still group the LCR's by 
transaction before applying by running the LCR's through a data 
structure very similar to the ApplyCache.


I think I need more convincing that approach (2), what this patch 
implements, is the best way doing things, compared (1). I will hold off 
on a more detailed review of the code until I get a better sense of if 
the design will change.

Steve



pgsql-hackers by date:

Previous
From: Florian Pflug
Date:
Subject: Re: libpq compression
Next
From: Peter Geoghegan
Date:
Subject: Re: sortsupport for text