Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes - Mailing list pgsql-hackers
From: Steve Singer
Subject: Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes
Msg-id: BLU0-SMTP66008BB671292E1CB598B7DCFD0@phx.gbl
In response to: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: [PATCH 08/16] Introduce the ApplyCache module which can reassemble transactions from a stream of interspersed changes
List: pgsql-hackers
On 12-06-13 07:28 AM, Andres Freund wrote:
> From: Andres Freund<andres@anarazel.de>
>
> The individual changes need to be identified by an xid. The xid can be a
> subtransaction or a toplevel one, at commit those can be reintegrated by doing
> a k-way mergesort between the individual transaction.
>
> Callbacks for apply_begin, apply_change and apply_commit are provided to
> retrieve complete transactions.
>
> Missing:
> - spill-to-disk
> - correct subtransaction merge, current behaviour is simple/wrong
> - DDL handling (?)
> - resource usage controls

Here is an initial review of the ApplyCache patch.

This patch provides a module that takes actions from the WAL stream, groups them by transaction, and then passes the change records to a set of plugin functions. For each transaction it encounters, it keeps a list of the actions in that transaction. The ilist from an earlier patch in this series is used; changes resulting from that patch's review would affect the code here, but not in a way that changes the design. When the module sees a commit for a transaction, it calls the apply_change callback for each change.

I can think of three ways that a replication system like this could try to apply transactions.

1) Each time it sees a new transaction, it could open a new transaction on the replica and make that change. It leaves the transaction open and goes on applying the next change (which might be for the current transaction or might be for another one). When it comes across a commit record it would then commit the transaction. If 100 concurrent transactions were open on the origin, then 100 concurrent transactions would be open on the replica.

2) Determine the commit order of the transactions, group all the changes for a particular transaction together, apply them in that order for the transaction that committed first, commit that transaction, and then move on to the transaction that committed second.

3) Group the transactions in a way that moves the replica from one consistent snapshot to another. This is what Slony and Londiste do, because they don't have the commit order or commit timestamps. Built-in replication can do better.

This patch implements option (2). If we had a way of implementing option (1) efficiently, would we be better off?

Option (2) requires us to put unparsed WAL data (HeapTuples) in the apply cache. You can't translate this to an independent LCR until you call the apply_change callback (which happens once the commit is encountered). The reason is that some of the changes might be DDL (or things generated by a DDL trigger) that will change the translation catalog, so you can't translate the HeapData to LCRs until you're at a stage where you can update the translation catalog. In both cases you might need to see later WAL records before you can convert an earlier one into an LCR (i.e. TOAST).

Some of my concerns with the apply cache are:

Big transactions (bulk loads, mass updates) will be cached in the apply cache until the commit comes along. One issue Slony has with bulk operations is that the replicas can't start processing a bulk INSERT until after it has committed. If it takes 10 hours to load the data on the master, it will take another 10 hours (at best) to load the data into the replica (20 hours after you start the process). With binary streaming replication your replica is done processing the bulk update shortly after the master is.

Long running transactions can sit in the cache for a long time.
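To make sure I'm reading the design right, here is a stripped-down sketch of the approach (2) flow as I understand it: changes are queued per xid and the apply callbacks only fire once the commit record shows up. Only the callback names (apply_begin, apply_change, apply_commit) come from the patch description; every other type and function name below is mine, purely for illustration.

    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned int Xid;

    /* one cached change; "data" stands in for the raw HeapTuple kept in the cache */
    typedef struct Change { const char *data; struct Change *next; } Change;

    /* per-transaction list of changes, kept in WAL order */
    typedef struct Txn { Xid xid; Change *head, *tail; struct Txn *next; } Txn;

    /* callbacks in the spirit of apply_begin / apply_change / apply_commit */
    typedef struct ApplyCallbacks {
        void (*apply_begin)(Xid xid);
        void (*apply_change)(Xid xid, const char *data);
        void (*apply_commit)(Xid xid);
    } ApplyCallbacks;

    static Txn *txns;            /* the in-memory cache, keyed by xid */

    static Txn *get_txn(Xid xid)
    {
        Txn *t;

        for (t = txns; t != NULL; t = t->next)
            if (t->xid == xid)
                return t;
        t = calloc(1, sizeof(Txn));
        t->xid = xid;
        t->next = txns;
        txns = t;
        return t;
    }

    /* seeing a change record: just queue it, no callback is invoked yet */
    static void cache_change(Xid xid, const char *data)
    {
        Txn    *t = get_txn(xid);
        Change *c = calloc(1, sizeof(Change));

        c->data = data;
        if (t->tail)
            t->tail->next = c;
        else
            t->head = c;
        t->tail = c;
    }

    /* seeing the commit record: only now is the whole transaction replayed */
    static void replay_at_commit(Xid xid, const ApplyCallbacks *cb)
    {
        Txn    *t = get_txn(xid);
        Change *c;

        cb->apply_begin(xid);
        for (c = t->head; c != NULL; c = c->next)
            cb->apply_change(xid, c->data);
        cb->apply_commit(xid);   /* freeing the cached entry omitted for brevity */
    }

    static void begin_cb(Xid x)                 { printf("BEGIN  %u\n", x); }
    static void change_cb(Xid x, const char *d) { printf("CHANGE %u: %s\n", x, d); }
    static void commit_cb(Xid x)                { printf("COMMIT %u\n", x); }

    int main(void)
    {
        ApplyCallbacks cb = { begin_cb, change_cb, commit_cb };

        /* interleaved changes from two transactions, as they arrive in the WAL stream */
        cache_change(100, "insert into foo");
        cache_change(101, "update bar");
        cache_change(100, "update foo");

        /* xid 101 commits first, so all of it is applied before any of xid 100 */
        replay_at_commit(101, &cb);
        replay_at_commit(100, &cb);
        return 0;
    }

The point of the sketch is the last part: nothing is handed to apply_change before replay_at_commit runs, which is exactly why a bulk load on the origin only starts applying on the replica once it has committed, and why everything has to sit in memory (or eventually on disk) until then.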
When you spill to disk we would want the long running but inactive ones spilled to disk first. This is solvable, but it adds to the complexity of this module: how were you planning on managing which items of the list get spilled to disk? (A rough sketch of the sort of policy I mean is in the PS at the end.)

The idea that we can safely reorder the commands into transactional groupings works (as far as I know) today because DDL commands get big heavy locks that are held until the end of the transaction. I think Robert mentioned earlier in the parent thread that maybe some of that will be changed one day.

The downsides of (1) that I see are:

- We would want a single backend to keep multiple transactions open at once. How hard would that be to implement? Would subtransactions be good enough here?

- Applying (or even translating WAL to LCRs) the changes in parallel across transactions might complicate the catalog structure, because each concurrent transaction might need its own version of the catalog (or can you depend on the locking at the master for this? I think you can today).

- With approach (1), changes that are part of a rolled-back transaction would have more overhead, because you would call apply_change on them.

With approach (1) a later component could still group the LCRs by transaction before applying, by running the LCRs through a data structure very similar to the ApplyCache.

I think I need more convincing that approach (2), which this patch implements, is the best way of doing things compared to (1). I will hold off on a more detailed review of the code until I get a better sense of whether the design will change.

Steve
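PS: To be concrete about the spill-to-disk point above, this is the sort of selection policy I have in mind: when the cache goes over its memory budget, spill the transaction that has gone the longest without a new change. Everything here (the struct, its fields, and the function name) is invented for illustration; nothing is from the patch, and the LSN handling is simplified.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t XLogRecPtr;             /* simplified stand-in for a WAL position */

    /* the bookkeeping a spill policy would need per cached transaction */
    typedef struct CachedTxn {
        unsigned int      xid;
        size_t            bytes_cached;      /* memory held by this transaction's changes */
        XLogRecPtr        last_change_lsn;   /* position of its most recent change */
        int               spilled;           /* already written out to disk? */
        struct CachedTxn *next;
    } CachedTxn;

    /*
     * Pick the long-running but inactive transaction: the in-memory one whose
     * most recent change is oldest.  Idle transactions are the least likely to
     * need their cached changes again soon, so they go to disk first.
     */
    static CachedTxn *pick_txn_to_spill(CachedTxn *txns)
    {
        CachedTxn *victim = NULL;
        CachedTxn *t;

        for (t = txns; t != NULL; t = t->next)
        {
            if (t->spilled)
                continue;
            if (victim == NULL || t->last_change_lsn < victim->last_change_lsn)
                victim = t;
        }
        return victim;
    }

Whatever policy is chosen, the module has to track this kind of per-transaction bookkeeping and decide when to consult it, which is the extra complexity I was referring to.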