Re: Replication identifiers, take 3 - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Replication identifiers, take 3
Msg-id CA+TgmoZOhCT9_ENZ165=3pRKH+p=Ads8g5E3voM8sSufW_cYYQ@mail.gmail.com
In response to Re: Replication identifiers, take 3  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: Replication identifiers, take 3
List pgsql-hackers
On Fri, Sep 26, 2014 at 5:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Let me try to summarize the information requirements for each of these
>> things.  For #1, you need to know, after crash recovery, for each
>> standby, the last commit LSN which the client has confirmed via a
>> feedback message.
>
> I'm not sure I understand what you mean here? This is all happening on
> the *standby*. The standby needs to know, after crash recovery, the
> latest commit LSN from the primary that it has successfully replayed.

Ah, sorry, you're right: so, you need to know, after crash recovery,
for each machine you are replicating *from*, the last transaction (in
terms of LSN) from that server that you successfully replayed.
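Here is a minimal C sketch of that requirement. It assumes a small in-memory table keyed by a hypothetical upstream name. Each time a replayed transaction commits locally, we record that transaction's commit LSN on the upstream. After recovery, we read the value back to decide where to resume. The names (`upstream_progress`, `record_replayed_commit`, `resume_point`) are illustrative only. A real implementation would also have to persist this table crash-safely, together with the local commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t lsn_t;                /* a WAL position, modeled like XLogRecPtr */

#define MAX_UPSTREAMS 8

/* One entry per server we replicate *from* (hypothetical layout). */
typedef struct {
    bool  in_use;
    char  name[64];                    /* hypothetical upstream identifier */
    lsn_t last_replayed;               /* upstream commit LSN of last replayed xact */
} upstream_progress;

static upstream_progress progress[MAX_UPSTREAMS];

/* Find the entry for an upstream, optionally creating it. */
static upstream_progress *find_upstream(const char *name, bool create) {
    for (int i = 0; i < MAX_UPSTREAMS; i++)
        if (progress[i].in_use && strcmp(progress[i].name, name) == 0)
            return &progress[i];
    if (!create)
        return NULL;
    assert(strlen(name) < sizeof progress[0].name);
    for (int i = 0; i < MAX_UPSTREAMS; i++)
        if (!progress[i].in_use) {
            progress[i].in_use = true;
            strcpy(progress[i].name, name);
            progress[i].last_replayed = 0;
            return &progress[i];
        }
    return NULL;                       /* table full */
}

/* Called as each replayed transaction commits locally. Upstream commits
 * are replayed in LSN order, so the stored value only moves forward. */
static void record_replayed_commit(const char *upstream, lsn_t commit_lsn) {
    upstream_progress *p = find_upstream(upstream, true);
    assert(p != NULL);
    if (commit_lsn > p->last_replayed)
        p->last_replayed = commit_lsn;
}

/* After crash recovery: the last upstream commit we know we replayed,
 * or 0 if nothing from this upstream has been replayed yet. */
static lsn_t resume_point(const char *upstream) {
    const upstream_progress *p = find_upstream(upstream, false);
    return p != NULL ? p->last_replayed : 0;
}
```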

>> > Similarly, the problem of identifying the origin of changes
>> > during decoding can be solved nicely by adding the origin
>> > node of every change to changes/transactions. At first it might seem
>> > to be sufficient to do so on transaction level, but for cascading
>> > scenarios it's very useful to be able to have changes from different
>> > source transactions combined into a larger one.
>>
>> I think this is a lot more problematic.  I agree that having the data
>> in the commit record isn't sufficient here, because for filtering
>> purposes (for example) you really want to identify the problematic
>> transactions at the beginning, so you can chuck their WAL, rather than
>> reassembling the transaction first and then throwing it out.  But
>> putting the origin ID in every insert/update/delete is pretty
>> unappealing from my point of view - not just because it adds bytes to
>> WAL, though that's a non-trivial concern, but also because it adds
>> complexity - IMHO, a non-trivial amount of complexity.  I'd be a lot
>> happier with a solution where, say, we have a separate WAL record that
>> says "XID 1234 will be decoding for origin 567 until further notice".
>
> I think it actually ends up much simpler than what you propose. In the
> apply process, you simply execute
> SELECT pg_replication_identifier_setup_replaying_from('bdr: this-is-my-identifier');
> or its C equivalent. That sets a global variable whose value
> XLogInsert() includes in the record.
> Note that this doesn't actually require any additional space in the WAL
> - padding bytes in struct XLogRecord are used to store the
> identifier. These have been unused at least since 8.0.

Sure, that's simpler for logical decoding.  That doesn't make it the
right decision for the system overall.
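For comparison, here is a rough sketch of the per-record scheme described in the quoted text. It is a simplified model, not PostgreSQL's actual XLogRecord layout. `wal_record_header`, `setup_replaying_from()` and `make_record()` are hypothetical stand-ins for the padding field, the SQL-callable setup function, and XLogInsert(). The filter policy (forward only locally originated changes) is also an assumption. The point it shows is that the decoding-side filter needs no state beyond the record itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t origin_id;             /* 16 bits: the size of the reused padding */
#define ORIGIN_LOCAL ((origin_id) 0)    /* assumption: 0 means "made locally" */

/* Heavily simplified, hypothetical record header. */
typedef struct {
    uint32_t  xid;
    uint8_t   info;
    origin_id origin;                   /* lives in former padding bytes */
} wal_record_header;

/* Session-global state, set once by the apply process. */
static origin_id replaying_from = ORIGIN_LOCAL;

static void setup_replaying_from(origin_id origin) {
    assert(origin != ORIGIN_LOCAL);     /* 0 is reserved for local changes */
    replaying_from = origin;
}

static void reset_replaying_from(void) { replaying_from = ORIGIN_LOCAL; }

/* Stand-in for XLogInsert(): every record is stamped with the current origin. */
static wal_record_header make_record(uint32_t xid, uint8_t info) {
    wal_record_header rec = { .xid = xid, .info = info, .origin = replaying_from };
    return rec;
}

/* Decoding-side filter: a pure function of one record, no per-XID state. */
static bool should_forward(const wal_record_header *rec) {
    return rec->origin == ORIGIN_LOCAL;
}
```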

> I don't think a solution which logs the change of origin will be
> simpler. When the origin is in every record, you can filter without keeping
> track of any state. That's different if the origin can be switched per
> transaction: at the very least you need an in-memory entry for the origin.

But again, that complexity pertains only to logical decoding.
Somebody who wants to tweak the WAL format for an UPDATE in the future
doesn't need to understand how this works, or care.  You know me: I've
been a huge advocate of logical decoding.  But just like row-level
security or BRIN indexes or any other feature, I think it needs to be
designed in a way that minimizes the impact it has on the rest of the
system.  I simply don't believe your contention that this isn't adding
any complexity to the code path for regular DML operations.  It's
entirely possible we could need bit space in those records in the
future for something that actually pertains to those operations; if
you've burned it for logical decoding, it'll be difficult to claw it
back.  And what if Tom gets around, some day, to doing that pluggable
heap AM work?  Then every heap AM has got to allow for those bits, and
maybe that doesn't happen to be free for them.

Admittedly, these are hypothetical scenarios, but I don't think
they're particularly far-fetched.  And as a fringe benefit, if you do
it the way that I'm proposing, you can use an OID instead of a 16-bit
identifier whose width we picked only because 16 bits happens to be
100% of the available bit-space.  Yeah, there's some complexity on decoding,
but it's minimal: one more piece of fixed-size state to track per XID.
That's trivial compared to what you've already got.

>> What's the point of the short-to-long mappings in the first place?  Is
>> that only required because of the possibility that there might be
>> multiple replication solutions in play on the same node?
>
> In my original proposal, more than two years ago, I only used short numeric
> ids. And people didn't like it because it requires coordination between
> the replication solutions and possibly between servers. Using a string
> identifier like the one above makes it easy to build unique names, and
> lets every solution add the information it needs to its replication
> identifiers.

I get that, but what I'm asking is why those mappings can't be managed
on a per-replication-solution basis.  I think that's just because
there's a limited namespace and so coordination is needed between
multiple replication solutions that might possibly be running on the
same system.  But I want to confirm whether that's actually what you're
thinking.
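This sketch makes the namespace question concrete. It assumes a single node-wide registry that maps long, solution-prefixed names (such as 'bdr: this-is-my-identifier' from the quoted example) onto short 16-bit ids. Everything in it is hypothetical. Because every replication solution allocates from the same table, two solutions can never hand out the same short id. A per-solution mapping would need some other way to avoid collisions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t origin_id;
#define INVALID_ORIGIN ((origin_id) 0)
#define MAX_ORIGINS 16   /* tiny for illustration; 16 bits would allow 65535 */

/* Hypothetical node-wide registry shared by every replication solution. */
static char origin_names[MAX_ORIGINS + 1][64];   /* index 0 unused */
static origin_id next_id = 1;

/* Returns the existing id for `name`, or assigns the next free one.
 * Returns INVALID_ORIGIN once the short-id namespace is exhausted. */
static origin_id origin_get_or_create(const char *name) {
    for (origin_id id = 1; id < next_id; id++)
        if (strcmp(origin_names[id], name) == 0)
            return id;
    if (next_id > MAX_ORIGINS)
        return INVALID_ORIGIN;
    assert(strlen(name) < sizeof origin_names[0]);
    strcpy(origin_names[next_id], name);
    return next_id++;
}

/* Reverse mapping: short id back to the long name, or NULL if unknown. */
static const char *origin_name(origin_id id) {
    return (id >= 1 && id < next_id) ? origin_names[id] : NULL;
}
```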

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


