Re: logical changeset generation v3 - Source for Slony - Mailing list pgsql-hackers

From Andres Freund
Subject Re: logical changeset generation v3 - Source for Slony
Msg-id 20121120114432.GB5800@awork2.anarazel.de
In response to Re: logical changeset generation v3 - Source for Slony  (Steve Singer <steve@ssinger.info>)
List pgsql-hackers
Hi,

On 2012-11-19 19:50:32 -0500, Steve Singer wrote:
> On 12-11-18 11:07 AM, Andres Freund wrote:
> >I think we should provide some glue code to do this, otherwise people
> >will start replicating all the bugs I hacked into this... More
> >seriously: I think we should have support code here, no user will want
> >to learn the intricacies of feedback messages and such. Where that would
> >live? No idea.

> libpglogicalrep.so ?

Yea. We don't really have the infrastructure for that yet
though... Robert and I were just talking about that recently...


> >I wholeheartedly agree. It should also be cleaned up a fair bit before
> >others copy it, should we not go for having some client-side library.
> >
> >Imo the library could very roughly be something like:
> >
> >state = SetupStreamingLLog(replication-slot, ...);
> >while ((message = StreamingLLogNextMessage(state)))
> >{
> >      write(outfd, message->data, message->length);
> >      if (received_100_messages)
> >      {
> >           fsync(outfd);
> >           StreamingLLogConfirm(message);
> >      }
> >}
> >
> >Although I guess that's not good enough because StreamingLLogNextMessage
> >would be blocking, but that shouldn't be too hard to work around.
> >
>
> How about we pass a timeout value to StreamingLLogNextMessage (..) where it
> returns if no data is available after the timeout to give the caller a
> chance to do something else.

Doesn't really integrate into the sort of loop that's often built around
poll(2), select(2) and similar. It probably should return NULL if
there's nothing there yet, and we should have a
StreamingLLogWaitForMessage() or such.

> >>This is basically the Slony 2.2 sl_log format minus a few columns we no
> >>longer need (txid, actionseq).
> >>command_args is a PostgreSQL text array of column=value pairs, i.e. [
> >>{id=1},{name='steve'},{project='slony'}]
> >It seems to me that that makes escaping unnecessarily complicated, but
> >given you already have all the code... ;)
>
> When I look at the actual code/representation we picked it is closer to
> {column1,value1,column2,value2...}

Still means you need to escape and later parse columnN, valueN
values. I would have expected something like (length:data, length:data)+

> >>I don't think our output plugin will be much more complicated than the
> >>test_decoding plugin.
> >Good. That's the idea ;). Are you ok with the interface as it is now or
> >would you like to change something?
>
> I'm going to think about this some more and maybe try to write an example
> plugin before I can say anything with confidence.

That would be very good.

> >Yes. We will also need something like that. If you remember the first
> >prototype we sent to the list, it included the concept of an
> >'origin_node' in wal record. I think you actually reviewed that one ;)
> >
> >That was exactly aimed at something like this...
> >
> >Since then my thoughts about how the origin_id looks like have changed a
> >bit:
> >- origin id is internally still represented as an uint32/Oid
> >   - never visible outside of wal/system catalogs
> >- externally visible it gets
> >   - assigned an uuid
> >   - optionally assigned a user defined name
> >- user settable (permissions?) origin when executing sql:
> >   - SET change_origin_uuid = 'uuid';
> >   - SET change_origin_name = 'user-settable-name';
> >   - defaults to the local node
> >- decoding callbacks get passed the origin of a change
> >   - txn->{origin_uuid, origin_name, origin_internal?}
> >- the init decoding callback can setup an array of interesting origins,
> >   so the others don't even get the ReorderBuffer treatment
> >
> >I have to thank the discussion on -hackers and a march through Prague
> >with Marko here...

> So would the uuid and optional name assignment be done in the output plugin
> or somewhere else?

That would be postgres infrastructure. The output plugin would get
passed at least uuid and name and potentially the internal name as well
(might be useful to build some internal caching of information).

> When/how does the uuid get generated and where do we store it so the same
> uuid gets returned when postgres restarts.  Slony today stores all this type
> of stuff in user-level tables and user-level functions (because it has no
> other choice).

Would need to be its own system catalog.

> What is the connection between these values and the
> 'slot-id' in your proposal for the init arguments? Does the slot-id need to
> be the external uuid of the other end or is there no direct connection?

None, really. The "slot-id" is only an identifier for a
replication connection (which should live longer than a single
postmaster run) and contains information about the point up to which
you replicated. We need to manage some local resources based on that.

> Today slony allows us to replicate between two databases in the same
> postgresql cluster (I use this for testing all the time)
> Slony also allows for two different 'slony clusters' to be setup in the same
> database (or so I'm told, I don't think I have ever tried this myself).

Yuck. I haven't thought about this very much. I honestly don't see
support for the first case right now. The second shouldn't be too hard,
we already have the database oid available everywhere we need it.

> plugin functions that let me query the local database and then return the
> uuid and origin_name would work in this model.

Should be possible.

> +1 on being able to mark the 'change origin' in a SET command when the
> replication process is pushing data into the replica.

Good.

>
> >>Exactly how we do this filtering is an open question,  I think the output
> >>plugin will at a minimum need to know:
> >>
> >>a) What the slony node id is of the node it is running on.  This is easy to
> >>figure out if the output plugin is able/allowed to query its database.  Will
> >>this be possible? I would expect to be able to query the database as it
> >>exists now (at plugin invocation time), not as it existed in the past when the
> >>WAL was generated.   In addition to the node ID I can see us wanting to be
> >>able to query other slony tables (sl_table,sl_set etc...)
> >Hm. There is no fundamental reason not to allow normal database access
> >to the current database but it won't be all that cheap, so doing it
> >frequently is not a good idea.
> >The reason it's not cheap is that you basically need to tear down the
> >postgres internal caches if you switch the timestream in which you are
> >working.
> >
> >Would go something like:
> >
> >TransactionContext = AllocSetCreate(...);
> >RevertFromDecodingSnapshot();
> >InvalidateSystemCaches();
> >StartTransactionCommand();
> >/* do database work */
> >CommitTransactionCommand();
> >/* cleanup memory*/
> >SetupDecodingSnapshot(snapshot, data);
> >InvalidateSystemCaches();
> >
> >Why do you need to be able to query the present? I thought it might be
> >necessary to allow additional tables be accessed in a timetraveling
> >manner, but not this way round.
> >I guess an initial round of querying during plugin initialization won't
> >be good enough?
>
> For example my output plugin would want the list of replicated tables (or
> the list of tables replicated to a particular replica). This list can change
> over time.  As administrators issue commands to add or remove tables to
> replication or otherwise reshape the cluster the output plugin will need to
> know about this.  I MIGHT be able to get away with having slon disconnect
> and reconnect on reconfiguration events so only the init() call would need
> this data, but I am not sure.
>
> One of the ways slony allows you to shoot your foot off is by changing
> certain configuration things (like dropping a table from a set) while a
> subscription is in progress.   Being able to timetravel the slony
> configuration tables might make this type of foot-gun a lot harder to
> encounter but that might be asking for too much.

Actually timetravel access to those tables is considerably
easier/faster. I wanted to provide such tables anyway (because you need
them to safely write your own pg_enum-alike types). It means that you
log slightly more (32 + sizeof(XLogRecord) bytes, AFAIR) per modified row.

> >>b) What the slony node id is of the node we are streaming to.   It would be
> >>nice if we could pass extra, arbitrary data/parameters to the output plugins
> >>that could include that, or other things.  At the moment the
> >>start_logical_replication rule in repl_gram.y doesn't allow for that but I
> >>don't see why we couldn't make it do so.
> >Yes, I think we want something like that. I even asked input on that
> >recently ;):
> >http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
> >
> >Input welcome!
>
> How flexible will the datatypes for the arguments be? If I wanted to pass in
> a list of tables (ie an array?) could I?

I was thinking of just a textual (key = value, ...) style list, similar
to options to EXPLAIN, COPY et al.

> Above I talked about having the init() or change() methods query the local
> database.  Another option might be to make the slon build up this data (by
> querying the database over a normal psql connection) and just passing the
> data in.   However that might mean passing in a list of a few thousand table
> names, which doesn't sound like a good idea.

No, it certainly doesn't.

> >
> >>Even though, from a data-correctness point of view, slony could commit the
> >>transaction on the replica after it sees the t1 commit, we won't want it to
> >>do commits other than on a SYNC boundary.  This means that the replicas will
> >>continue to move between consistent SYNC snapshots and that we can still
> >>track the state/progress of replication by knowing what events (SYNC or
> >>otherwise) have been confirmed.
> >I don't know enough about slony internals, but: why? This will prohibit
> >you from ever doing (per-transaction) synchronous replication...
>
> A lot of this has to do with the stuff I discuss in the section below on
> cluster reshaping that you didn't understand.  Slony depends on knowing what
> data has, or hasn't, been sent to a replica at a particular event id.  If
> 'some' transactions in between two SYNC events have committed but not others
> then slony has no idea what data it needs to get elsewhere on a FAILOVER
> type event.  There might be a way to make this work otherwise but I'm not
> sure what that is and how long it will take to debug out the issues.

Ah, it starts to make sense.

The way I solved that issue in the prototype from around PGCon was that
I included the LSN of the original commit record of the remote
transaction in the commit record of the local transaction (with the
origin_id set to the remote side). That made it trivial to restore the
exact state of replication after a crash, even with
synchronous_commit=off, as during replay you could simply ensure the
replication-surely-received LSN of every remote side was up to date.
Then you can simply do a START_LOGICAL_REPLICATION 'slot'
just-recovered-lsn; and restart applying (*not*
INIT_LOGICAL_REPLICATION).


Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


