Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture |
Msg-id | CA+TgmoayRrSkxpAg7N_0dk=i1bujrpZ+Nb8askx=RpVmQJcrrQ@mail.gmail.com |
In response to | Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture (Andres Freund <andres@2ndquadrant.com>) |
Responses | Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture |
List | pgsql-hackers |

On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Well, the words are fuzzy, but I would define logical replication to
>> be something which is independent of the binary format in which stuff
>> gets stored on disk. If it's not independent of the disk format, then
>> you can't do heterogeneous replication (between versions, or between
>> products). That precise limitation is the main thing that drives
>> people to use anything other than SR in the first place, IME.
> Not in mine. The main limitation I see is that you cannot write anything on
> the standby. Which sucks majorly for many things. It's pretty much impossible
> to "fix" that for SR outside of very limited cases.
> While many scenarios don't need multimaster *many* need to write outside of
> the standby's replication set.

Well, that's certainly a common problem, even if it's not IME the most
common, but I don't think we need to argue about which one is more
common, because I'm not arguing against it. The point, though, is
that if the logical format is independent of the on-disk format, the
things we can do are a strict superset of the things we can do if it
isn't. I don't want to insist that catalogs be the same (or else you
get garbage when you decode tuples). I want to tolerate the fact that
they may very well be different. That will in no way preclude writing
outside the standby's replication set, nor will it prevent
multi-master replication. It will, however, enable heterogeneous
replication, which is a very important use case. It will also mean
that innocent mistakes (like somehow ending up with a column that is
text on one server and numeric on another server) produce
comprehensible error messages, rather than garbage.

> It's not only the logging side which is a limitation in today's replication
> scenarios. The apply side scales even worse because it's *very* hard to
> distribute it between multiple backends.

I don't think that making LCR format = on-disk format is going to
solve that problem. To solve that problem, we need to track
dependencies between transactions, so that if tuple A is modified by
T1 and T2, in that order, we apply T1 before T2. But if T3 - which
committed after both T1 and T2 - touches none of the same data as T1
or T2, then we can apply it in parallel, so long as we don't commit
until T1 and T2 have committed (because allowing T3 to commit early
would produce a serialization anomaly from the point of view of a
concurrent reader).

>> Because the routines that decode tuples don't include enough sanity
>> checks to prevent running off the end of the block, or even the end of
>> memory completely. Consider a corrupt TOAST pointer that indicates
>> that there is a gigabyte of data stored in an 8kB block. One of the
>> common symptoms of corruption IME is TOAST requests for -3 bytes of
>> memory.
> Yes, but we need to put safeguards against that sort of thing anyway. So sure,
> we can have bugs but this is not a fundamental limitation.

There's a reason we haven't done that already, though: it's probably
going to stink for performance. If it turns out that it doesn't stink
for performance, great. But if it causes a 5% slowdown on common use
cases, I suspect we're not gonna do it, and I bet I can construct a
case where it's worse than that (think: 400 column table with lots of
varlenas, sorting by column 400 to return column 399). I think it's
treading on dangerous ground to assume we're going to be able to "just
go fix" this.

> PostGIS uses one information table in a few more complex functions but not in
> anything low-level. Evidenced by the fact that it was totally normal for that
> to go out of sync before < 2.0.
>
> But even if such a thing would be needed, it wouldn't be problematic to make
> extension configuration tables be replicated as well.

Ugh. That's a hack on top of a hack.
Now it all works great if type X is installed as an extension, but if
it isn't installed as an extension then the world blows up.

> I am pretty sure it's not bad-behaved. But how should the code know that? You
> want each type to explicitly say that it's unsafe if it is?

Yes, exactly. Or maybe there are varying degrees of non-safety,
allowing varying degrees of optimization. Like: wire format = binary
format is super-safe. Then having to call an I/O function that
promises not to look at any catalogs is a bit less safe. And then
there's really unsafe.

> I have played with several ideas:
>
> 1.)
> keep the decoding catalog in sync with command/event triggers, correctly
> replicating oids. If those log into some internal event table it's easy to keep
> the catalog in a correct transactional state because the events from that
> table get decoded in the transaction and replayed at exactly the right spot in
> there *after* it has been reassembled. The locking on the generating side
> takes care of the concurrency aspects.

I am not following this one completely.

> 2.)
> Keep the decoding site up2date by replicating the catalog via normal recovery
> mechanisms

This surely seems better than #1, since it won't do amazingly weird
things if the user bypasses the event triggers.

> 3.)
> Fully versioned catalog

One possible way of doing this would be to have the LCR generator run
on the primary, but hold back RecentGlobalXmin until it's captured the
information that it needs. It seems like as long as tuples can't get
pruned, the information you need must still be there, as long as you
can figure out which snapshot you need to read it under. But since
you know the commit ordering, it seems like you ought to be able to
figure out what SnapshotNow would have looked like at any given point
in the WAL stream. So you could, at that point in the WAL stream,
read the master's catalogs under what we might call SnapshotThen.

> 4.)
> Log enough information in the walstream to make decoding possible using only
> the walstream.
>
> Advantages:
> * Decoding can optionally be done on the master
> * No catalog syncing/access required
> * it's possible to make this architecture independent
>
> Disadvantage:
> * high to very high implementation overhead depending on efficiency aims
> * high space overhead in the wal because at least all the catalog information
> needs to be logged in a transactional manner repeatedly
> * misuses wal far more than other methods
> * significant new complexity in somewhat critical code paths (heapam.c)
> * insanely high space overhead if the decoding should be possible architecture
> independent

I'm not really convinced that the WAL overhead has to be that much
with this method. Most of the information you need about the catalogs
only needs to be logged when it changes, or once per checkpoint cycle,
or once per transaction, or once per transaction per checkpoint cycle.
I will concede that it looks somewhat complex, but I am not convinced
that it's undoable.

> 5.)
> The actually good idea. Yours?

Hey, look, an elephant!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
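
The dependency rule for parallel apply discussed above (apply T1 before
T2 when both modify tuple A; let T3 apply in parallel when it touches
none of the same data, but never let it commit early) can be sketched
roughly as follows. This is only an illustration and not code from the
patch: the Txn class, the function names, and the idea of identifying
tuples by a plain string key are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    touched: frozenset  # hypothetical tuple identifiers modified by this txn


def apply_dependencies(txns):
    """Map each txn to the earlier txns it must be applied after.

    `txns` is in commit order. A txn depends on the most recent earlier
    txn that touched each of its tuples.
    """
    last_writer = {}  # tuple id -> name of the latest txn that touched it
    deps = {}
    for txn in txns:
        deps[txn.name] = {last_writer[t] for t in txn.touched if t in last_writer}
        for t in txn.touched:
            last_writer[t] = txn.name
    return deps


def apply_waves(txns):
    """Group txns into waves; txns in the same wave can be applied in parallel."""
    deps = apply_dependencies(txns)
    wave_of = {}
    for txn in txns:  # commit order, so every dependency already has a wave
        wave_of[txn.name] = 1 + max((wave_of[d] for d in deps[txn.name]), default=-1)
    waves = [[] for _ in range(max(wave_of.values(), default=-1) + 1)]
    for txn in txns:
        waves[wave_of[txn.name]].append(txn.name)
    return waves


def commit_order(txns):
    """Commits are released strictly in the original commit order, so T3
    cannot become visible before T1 and T2 even if it finishes applying first."""
    return [txn.name for txn in txns]


history = [
    Txn("T1", frozenset({"A"})),
    Txn("T2", frozenset({"A"})),
    Txn("T3", frozenset({"B"})),
]
print(apply_dependencies(history))
print(apply_waves(history))
print(commit_order(history))
```

With this history, T2 waits for T1, T3 lands in the same apply wave as
T1, and the commit order stays T1, T2, T3. A real implementation would
also have to deal with things this sketch ignores, such as foreign
keys, unique indexes, and DDL.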