
From Andres Freund
Subject Re: logical changeset generation v6.2
Date 2013-10-11 16:57:10
Msg-id 20131011165710.GA4052323@alap2.anarazel.de
In response to Re: logical changeset generation v6.2  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: logical changeset generation v6.2  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

On 2013-10-11 09:08:43 -0400, Robert Haas wrote:
> On Thu, Oct 10, 2013 at 7:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-09 14:49:46 -0400, Robert Haas wrote:
> >> I spent some time looking at the sample plugin (patch 9/12).  Here are
> >> some review comments:
> >>
> >> - I think that the decoding plugin interface should work more like the
> >> foreign data wrapper interface.  Instead of using pg_dlsym to look up
> >> fixed names, I think there should be a struct of function pointers
> >> that gets filled in and registered somehow.
> >
> > You mean something like CREATE OUTPUT PLUGIN registering a function with
> > an INTERNAL return value returning a filled struct? I thought about
> > that, but it seemed more complex. Happy to change it though if it's
> > preferred.
> 
> I don't see any need for SQL syntax.  I was just thinking that the
> _PG_init function could fill in a structure and then call
> RegisterLogicalReplicationOutputPlugin(&mystruct).

Hm. We can do that, but what'd be the advantage? The current model
correctly handles things like a 'shared_preload_libraries'ed output
plugin, because its _PG_init() will not register it. With the
registration done in _PG_init(), there would be two registration paths
to worry about.
Being able to use the same .so for output plugin handling and some other
replication-solution-specific stuff is imo useful.
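
For concreteness, the registration style you describe might look roughly
like this -- the callback struct and the
RegisterLogicalReplicationOutputPlugin() signature are my guesses for
illustration, not anything from the posted patches:

/*
 * Hypothetical sketch only; none of these names are from the patchset.
 */
typedef struct LogicalDecodingContext LogicalDecodingContext;  /* opaque */

typedef struct OutputPluginCallbacks
{
    void        (*begin_cb) (LogicalDecodingContext *ctx);
    void        (*change_cb) (LogicalDecodingContext *ctx);
    void        (*commit_cb) (LogicalDecodingContext *ctx);
} OutputPluginCallbacks;

extern void RegisterLogicalReplicationOutputPlugin(OutputPluginCallbacks *cb);

static void my_begin(LogicalDecodingContext *ctx)  { /* emit BEGIN */ }
static void my_change(LogicalDecodingContext *ctx) { /* emit one change */ }
static void my_commit(LogicalDecodingContext *ctx) { /* emit COMMIT */ }

void
_PG_init(void)
{
    static OutputPluginCallbacks cb = {
        .begin_cb = my_begin,
        .change_cb = my_change,
        .commit_cb = my_commit,
    };

    RegisterLogicalReplicationOutputPlugin(&cb);
}

Note that with exactly this shape, a shared_preload_libraries'ed library
would run that _PG_init() and register itself at preload time even when
it's never used for decoding, which is the problem mentioned above.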

> >> - Still wondering how we'll use this from a bgworker.
> >
> > Simplified code to consume data:
> 
> Cool.  As long as that use case is supported, I'm happy; I just want
> to make sure we're not presuming that there must be an external
> client.

The included test cases are written using the SQL SRF interface, which in
turn is a use case that doesn't go through walsenders and such, so I hope
we won't break it accidentally ;)

> >> - The output format doesn't look very machine-parseable.   I really
> >> think we ought to provide something that is.  Maybe a CSV-like format,
> >> or maybe something else, but I don't see why someone who wants to do
> >> change logging should be forced to write and install C code.  If
> >> something like Bucardo can run on an unmodified system and extract
> >> change-sets this way without needing a .so file, that's going to be a
> >> huge win for usability.
> >
> > We can change the current format but I really see little to no chance of
> > agreeing on a replication format that's serviceable to several solutions
> > short term. Once we've gained some experience - maybe even this cycle -
> > that might be different.
> 
> I don't see why you're so pessimistic about that.  I know you haven't
> worked it out yet, but what makes this harder than sitting down and
> designing something?

Because every replication solution has different requirements for the
format, and each will want to filter the output stream with regard to its
own configuration.
E.g. Bucardo will want to include the transaction timestamp for conflict
resolution and such.

> >> More generally on this patch set, if I'm going to be committing any of
> >> this, I'd prefer to start with what is currently patches 3 and 4, once
> >> we reach agreement on those.
> >
> > Sounds like a reasonable start.
> 
> Perhaps you could reshuffle the order of the series, if it's not too much work.

Sure, that's no problem. Do I understand correctly that you'd like
wal_decoding: Add information about a tables primary key to struct RelationData
wal_decoding: Add wal_level = logical and log data required for logical decoding

earlier?

> > I'd really like to do so. I am travelling atm, but I will be back
> > tomorrow evening and will push an updated patch this weekend. The issue
> > I know of in the latest patches at
> > http://www.postgresql.org/message-id/20131007133232.GA15202@awork2.anarazel.de
> > is renaming from http://www.postgresql.org/message-id/20131008194758.GB3718183@alap2.anarazel.de
> 
> I'm a bit nervous about the way the combo CID logging works.  I would have
> thought that you would emit one record per combo CID, but what you're
> apparently doing is emitting one record per heap tuple that uses a
> combo CID.

I thought about and implemented that in the beginning. Unfortunately it's
not enough :(. That's probably the issue that took me longest to
understand in this patch series...

Combo CIDs can only fix the case where a transaction actually has created
a combo CID:

1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax = Invalid, cmin = 55, cmax = Invalid)
2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax = 2, cmin = Invalid, cmax = 1)

So, if we're decoding data that needs to look up those rows in TX1 or
TX2, we need access to cmin and cmax both times, but neither transaction
will have created a combo CID. That can only be an issue in transactions
with catalog modifications.
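
The underlying problem, in a simplified sketch (this mirrors, but is
not, the real HeapTupleHeader definitions): the tuple header has only a
single CommandId slot, so cmin and cmax can never both be read back from
the heap at the same time.

/*
 * Simplified, for illustration only; not the actual definitions from
 * access/htup_details.h.  The single t_cid slot holds cmin, OR cmax,
 * OR a combo CID standing in for both; which one is determined by
 * infomask flags, and a combo CID's (cmin, cmax) pair only exists in
 * the creating backend's memory.
 */
typedef unsigned int TransactionId;
typedef unsigned int CommandId;

typedef struct SimplifiedHeapTupleHeader
{
    TransactionId   t_xmin;     /* inserting transaction */
    TransactionId   t_xmax;     /* deleting/locking transaction */
    CommandId       t_cid;      /* cmin, cmax, or combo CID: one slot */
    unsigned short  t_infomask; /* flags saying how to read t_cid */
} SimplifiedHeapTupleHeader;

In example 1) above, TX2's DELETE stores its cmax in that slot, so the
cmin = 55 that decoding TX1 needs is gone from the heap and has to come
from somewhere else.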

A slightly more complex variant also requires this when combo CIDs are
involved:

1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax = Invalid, cmin = 55, cmax = Invalid)
2) TX1: SAVEPOINT foo;
3) TX1-2: UPDATE id = 1 at 0/1: (xmin = 1, xmax = 2, cmin = 55, cmax = 56, combo = 123)
          new at 0/2: (xmin = 2, xmax = Invalid, cmin = 57, cmax = Invalid)
4) TX1-2: ROLLBACK TO foo;
5) TX3: DELETE id = 1 at 0/1: (xmin = 1, xmax = 3, cmin = Invalid, cmax = 1)

If you're decoding data that was inserted after 1), you still need to see
cmin = 55. At 5) you need to see cmin = Invalid.

So just remembering the correct values for each tuple that a specific
transaction needs seems like the simplest way here.
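
Concretely, such a per-tuple record could look something like this
sketch (field names are mine, not necessarily the patch's; typedefs as
in the sketch above):

/*
 * Sketch of a per-tuple (relfilenode, ctid, cmin, cmax) WAL record;
 * illustrative only.
 */
typedef struct NewCidRecordSketch
{
    TransactionId   top_xid;    /* catalog-modifying toplevel xact */
    CommandId       cmin;       /* inserting command, if any */
    CommandId       cmax;       /* deleting command, if any */
    CommandId       combocid;   /* combo CID in the tuple, if any */
    unsigned int    spcNode;    /* tablespace: relfilenode part 1 */
    unsigned int    dbNode;     /* database:   relfilenode part 2 */
    unsigned int    relNode;    /* relation:   relfilenode part 3 */
    unsigned int    tid_block;  /* ctid: block number */
    unsigned short  tid_offset; /* ctid: offset within the page */
} NewCidRecordSketch;

During decoding, records like this can then be collected per toplevel
transaction, so the correct cmin/cmax for a catalog tuple is available
while that transaction's own changes are decoded.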

> Either way, I also wonder what happens if a (logical?) checkpoint
> occurs between the combo CID record and the heap record to which it
> refers, or how you prevent that from happening.

Logical checkpoints contain the so-called 'restart_decoding' LSN, which
is defined to be the LSN from which we can restart reading WAL and be
guaranteed to decode all transactions that haven't been confirmed as
received.
Normal checkpoints shouldn't play any role here.
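
In code form, the invariant is roughly this (all names made up):

#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* A transaction streamed but not yet confirmed by the receiver. */
typedef struct UnconfirmedTxn
{
    XLogRecPtr  first_lsn;      /* LSN of the transaction's first record */
} UnconfirmedTxn;

/*
 * The restart LSN may not be later than the first record of any
 * not-yet-confirmed transaction, so each of them can be re-assembled
 * from scratch when decoding restarts.
 */
static XLogRecPtr
compute_restart_lsn(const UnconfirmedTxn *txns, int ntxns,
                    XLogRecPtr confirmed_lsn)
{
    XLogRecPtr  restart = confirmed_lsn;
    int         i;

    for (i = 0; i < ntxns; i++)
    {
        if (txns[i].first_lsn < restart)
            restart = txns[i].first_lsn;
    }

    return restart;
}

So a combo CID record and the heap record it refers to are always
replayed together or not at all when decoding restarts from that point.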

> What if the combo CID record is written and the transaction aborts
> before writing the heap record (maybe without writing an abort record
> to WAL)?

In the currently implemented model, where we log (relfilenode, ctid,
cmin, cmax), we only ever need access to those rows when decoding data
changes from *within* the catalog-modifying toplevel transaction, never
in other toplevel transactions.

> What are the performance implications of this additional WAL logging?
> What's the worst case?

I haven't been able to notice any difference above jitter for anything
that touches actual relations. The overhead there is far bigger than
that single XLogInsert().
Maybe there's something that doesn't interact with much of the rest of
the system, where the effort would be noticeable and which is actually
relevant for performance? I couldn't really think of anything.

> noticeable overhead when wal_level < logical?

Couldn't measure anything there either, which is not surprising given
that I couldn't measure the overhead in the first place.

I've done some parallel INSERT/DELETE pgbenching around the
wal_level = logical changes, and I couldn't measure any overhead with it
disabled. With wal_level = logical, UPDATEs and DELETEs do get a bit
slower, but that's to be expected.

It'd probably not hurt to redo those benchmarks to make sure...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


