Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture - Mailing list pgsql-hackers

From Robert Haas
Subject Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
Date
Msg-id CA+TgmoayRrSkxpAg7N_0dk=i1bujrpZ+Nb8askx=RpVmQJcrrQ@mail.gmail.com
Whole thread Raw
In response to Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: [RFC][PATCH] Logical Replication/BDR prototype and architecture
List pgsql-hackers
On Tue, Jun 19, 2012 at 2:23 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Well, the words are fuzzy, but I would define logical replication to
>> be something which is independent of the binary format in which stuff
>> gets stored on disk.  If it's not independent of the disk format, then
>> you can't do heterogenous replication (between versions, or between
>> products).  That precise limitation is the main thing that drives
>> people to use anything other than SR in the first place, IME.
> Not in mine. The main limitation I see is that you cannot write anything on
> the standby. Which sucks majorly for many things. Its pretty much impossible
> to "fix" that for SR outside of very limited cases.
> While many scenarios don't need multimaster *many* need to write outside of
> the standby's replication set.

Well, that's certainly a common problem, even if it's not IME the most
common, but I don't think we need to argue about which one is more
common, because I'm not arguing against it.  The point, though, is
that if the logical format is independent of the on-disk format, the
things we can do are a strict superset of the things we can do if it
isn't.  I don't want to insist that catalogs be the same (or else you
get garbage when you decode tuples).  I want to tolerate the fact that
they may very well be different.  That will in no way preclude writing
outside the standby's replication set, nor will it prevent
multi-master replication.  It will, however, enable heterogenous
replication, which is a very important use case.  It will also mean
that innocent mistakes (like somehow ending up with a column that is
text on one server and numeric on another server) produce
comprehensible error messages, rather than garbage.

> Its not only the logging side which is a limitation in todays replication
> scenarios. The apply side scales even worse because its *very* hard to
> distribute it between multiple backends.

I don't think that making LCR format = on-disk format is going to
solve that problem.  To solve that problem, we need to track
dependencies between transactions, so that if tuple A is modified by
T1 and T2, in that order, we apply T1 before T2.  But if T3 - which
committed after both T1 and T2 - touches none of the same data as T1
or T2 - then we can apply it in parallel, so long as we don't commit
until T1 and T2 have committed (because allowing T3 to commit early
would produce a serialization anomaly from the point of view of a
concurrent reader).

>> Because the routines that decode tuples don't include enough sanity
>> checks to prevent running off the end of the block, or even the end of
>> memory completely.  Consider a corrupt TOAST pointer that indicates
>> that there is a gigabyte of data stored in an 8kB block.  One of the
>> common symptoms of corruption IME is TOAST requests for -3 bytes of
>> memory.
> Yes, but we need to put safeguards against that sort of thing anyway. So sure,
> we can have bugs but this is not a fundamental limitation.

There's a reason we haven't done that already, though: it's probably
going to stink for performance.  If it turns out that it doesn't stink
for performance, great.  But if it causes a 5% slowdown on common use
cases, I suspect we're not gonna do it, and I bet I can construct a
case where it's worse than that (think: 400 column table with lots of
varlenas, sorting by column 400 to return column 399).  I think it's
treading on dangerous ground to assume we're going to be able to "just
go fix" this.

> Postgis uses one information table in a few more complex functions but not in
> anything low-level. Evidenced by the fact that it was totally normal for that
> to go out of sync before < 2.0.
>
> But even if such a thing would be needed, it wouldn't be problematic to make
> extension configuration tables be replicated as well.

Ugh.  That's a hack on top of a hack.  Now it all works great if type
X is installed as an extension but if it isn't installed as an
extension then the world blows up.

> I am pretty sure its not bad-behaved. But how should the code know that? You
> want each type to explictly say that its unsafe if it is?

Yes, exactly.  Or maybe there are varying degrees of non-safety,
allowing varying degrees of optimization.  Like: wire format = binary
format is super-safe.  Then having to call an I/O function that
promises not to look at any catalogs is a bit less safe.  And then
there's really unsafe.

> I have played with several ideas:
>
> 1.)
> keep the decoding catalog in sync with command/event triggers, correctly
> replicating oids. If those log into some internal event table its easy to keep
> the catalog in a correct transactional state because the events from that
> table get decoded in the transaction and replayed at exactly the right spot in
> there *after* it has been reassembled. The locking on the generating side
> takes care of the concurrency aspects.

I am not following this one completely.

> 2.)
> Keep the decoding site up2date by replicating the catalog via normal recovery
> mechanisms

This surely seems better than #1, since it won't do amazingly weird
things if the user bypasses the event triggers.

> 3.)
> Fully versioned catalog

One possible way of doing this would be to have the LCR generator run
on the primary, but hold back RecentGlobalXmin until it's captured the
information that it needs.  It seems like as long as tuples can't get
pruned, the information you need must still be there, as long as you
can figure out which snapshot you need to read it under.  But since
you know the commit ordering, it seems like you ought to be able to
figure out what SnapshotNow would have looked like at any given point
in the WAL stream.  So you could, at that point in the WAL stream,
read the master's catalogs under what we might call SnapshotThen.

> 4.)
> Log enough information in the walstream to make decoding possible using only
> the walstream.
>
> Advantages:
> * Decoding can optionally be done on the master
> * No catalog syncing/access required
> * its possible to make this architecture independent
>
> Disadvantage:
> * high to very high implementation overhead depending on efficiency aims
> * high space overhead in the wal because at least all the catalog information
> needs to be logged in a transactional manner repeatedly
> * misuses wal far more than other methods
> * significant new complexity in somewhat cricital code paths (heapam.c)
> * insanely high space overhead if the decoding should be possible architecture
> independent

I'm not really convinced that the WAL overhead has to be that much
with this method.  Most of the information you need about the catalogs
only needs to be logged when it changes, or once per checkpoint cycle,
or once per transaction, or once per transaction per checkpoint cycle.I will concede that it looks somewhat complex,
butI am not convinced 
that it's undoable.

> 5.)
> The actually good idea. Yours?

Hey, look, an elephant!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Backport of fsync queue compaction
Next
From: Robert Haas
Date:
Subject: Re: Libxml2 load error on Windows