Re: logical changeset generation v3 - Source for Slony - Mailing list pgsql-hackers

From Steve Singer
Subject Re: logical changeset generation v3 - Source for Slony
Msg-id BLU0-SMTP271EFDEB5FD775CC6EE26FDC550@phx.gbl
In response to Re: logical changeset generation v3 - Source for Slony  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On 12-11-18 11:07 AM, Andres Freund wrote:
> Hi Steve!
>
>
> I think we should provide some glue code to do this, otherwise people
> will start replicating all the bugs I hacked into this... More
> seriously: I think we should have support code here, no user will want
> to learn the intricacies of feedback messages and such. Where that would
> live? No idea.

libpglogicalrep.so ?

> I wholeheartedly agree. It should also be cleaned up a fair bit before
> others copy it, should we not go for having some client side library.
>
> Imo the library could very roughly be something like:
>
> state = SetupStreamingLLog(replication-slot, ...);
> while ((message = StreamingLLogNextMessage(state)) != NULL)
> {
>       write(outfd, message->data, message->length);
>       if (received_100_messages)
>       {
>            fsync(outfd);
>            StreamingLLogConfirm(message);
>       }
> }
>
> Although I guess that's not good enough because StreamingLLogNextMessage
> would be blocking, but that shouldn't be too hard to work around.
>

How about we pass a timeout value to StreamingLLogNextMessage(...) so
that it returns if no data is available after the timeout, giving the
caller a chance to do something else?
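
Something along these lines is what I have in mind. It's only a sketch:
the StreamingLLog* names are the hypothetical ones from your example
above, and the timeout argument and the opaque types are my additions.

#include <stddef.h>
#include <unistd.h>

/* Hypothetical client-side API from the sketch above, extended with a
 * timeout; none of these types or functions exist yet. */
typedef struct StreamingLLogState StreamingLLogState;
typedef struct
{
    char   *data;
    size_t  length;
} StreamingLLogMessage;

extern StreamingLLogState *SetupStreamingLLog(const char *slot);
extern StreamingLLogMessage *StreamingLLogNextMessage(StreamingLLogState *state,
                                                      int timeout_ms);
extern void StreamingLLogConfirm(StreamingLLogState *state,
                                 StreamingLLogMessage *message);

static void
consume_stream(StreamingLLogState *state, int outfd)
{
    StreamingLLogMessage *message;
    int         unconfirmed = 0;

    for (;;)
    {
        /* Wait at most 500ms; NULL means the timeout expired. */
        message = StreamingLLogNextMessage(state, 500);

        if (message == NULL)
        {
            /* A chance to handle slon events, shutdown requests,
             * keepalives, etc. before waiting again. */
            continue;
        }

        write(outfd, message->data, message->length);

        if (++unconfirmed >= 100)
        {
            fsync(outfd);
            StreamingLLogConfirm(state, message);
            unconfirmed = 0;
        }
    }
}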

>> This is basically the Slony 2.2 sl_log format minus a few columns we no
>> longer need (txid, actionseq).
>> command_args is a PostgreSQL text array of column=value pairs, i.e.
>> [{id=1},{name='steve'},{project='slony'}]
> It seems to me that that makes escaping unnecessarily complicated, but
> given you already have all the code... ;)

When I look at the actual code/representation we picked, it is closer to
{column1,value1,column2,value2,...}
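
For example, inserting the row (1, 'steve', 'slony') into a table with
columns id, name and project would (ignoring quoting and escaping
details) come out roughly as:

{id,1,name,steve,project,slony}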



>> I don't t think our output plugin will be much more complicated than the
>> test_decoding plugin.
> Good. That's the idea ;). Are you ok with the interface as it is now or
> would you like to change something?

I'm going to think about this some more and maybe try to write an 
example plugin before I can say anything with confidence.

>
> Yes. We will also need something like that. If you remember the first
> prototype we sent to the list, it included the concept of an
> 'origin_node' in wal record. I think you actually reviewed that one ;)
>
> That was exactly aimed at something like this...
>
> Since then my thoughts about what the origin_id looks like have changed a
> bit:
> - origin id is internally still represented as an uint32/Oid
>    - never visible outside of wal/system catalogs
> - externally visible it gets
>    - assigned an uuid
>    - optionally assigned a user defined name
> - user settable (permissions?) origin when executing sql:
>    - SET change_origin_uuid = 'uuid';
>    - SET change_origin_name = 'user-settable-name';
>    - defaults to the local node
> - decoding callbacks get passed the origin of a change
>    - txn->{origin_uuid, origin_name, origin_internal?}
> - the init decoding callback can setup an array of interesting origins,
>    so the others don't even get the ReorderBuffer treatment
>
> I have to thank the discussion on -hackers and a march through prague
> with Marko here...
So would the uuid and optional name assignment be done in the output
plugin or somewhere else?
When/how does the uuid get generated, and where do we store it so the
same uuid gets returned when postgres restarts?  Slony today stores all
this type of stuff in user-level tables and user-level functions
(because it has no other choice).  What is the connection between
these values and the 'slot-id' in your proposal for the init arguments?
Does the slot-id need to be the external uuid of the other end, or is
there no direct connection?

Today slony allows us to replicate between two databases in the same
postgresql cluster (I use this for testing all the time).
Slony also allows for two different 'slony clusters' to be set up in the
same database (or so I'm told; I don't think I have ever tried this myself).

Plugin functions that let me query the local database and then return
the uuid and origin_name would work in this model.

+1 on being able to mark the 'change origin' in a SET command when the 
replication process is pushing data into the replica.
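
To make that concrete, the apply side of the slon could issue the SET
over its normal libpq connection before applying a batch.  This is only
a sketch, and change_origin_name is the proposed (not yet existing)
setting from your list above:

#include <stdio.h>
#include <libpq-fe.h>

/* Sketch: tag everything this session applies with the originating
 * node's name so that the decoding plugin on this side can filter those
 * changes out again and avoid replication loops.  Real code would
 * escape origin_name properly instead of just using snprintf. */
static int
begin_apply_batch(PGconn *conn, const char *origin_name)
{
    char        sql[256];
    PGresult   *res;

    snprintf(sql, sizeof(sql),
             "SET change_origin_name = '%s'", origin_name);

    res = PQexec(conn, sql);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "could not set change origin: %s",
                PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }
    PQclear(res);
    return 0;
}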

>> Exactly how we do this filtering is an open question.  I think the output
>> plugin will at a minimum need to know:
>>
>> a) What the slony node id is of the node it is running on.  This is easy to
>> figure out if the output plugin is able/allowed to query its database.  Will
>> this be possible? I would expect to be able to query the database as it
>> exists now (at plugin invocation time), not as it existed in the past when
>> the WAL was generated.  In addition to the node ID I can see us wanting to
>> be able to query other slony tables (sl_table, sl_set, etc...)
> Hm. There is no fundamental reason not to allow normal database access
> to the current database but it won't be all that cheap, so doing it
> frequently is not a good idea.
> The reason it's not cheap is that you basically need to tear down the
> postgres internal caches if you switch the timestream in which you are
> working.
>
> Would go something like:
>
> TransactionContext = AllocSetCreate(...);
> RevertFromDecodingSnapshot();
> InvalidateSystemCaches();
> StartTransactionCommand();
> /* do database work */
> CommitTransactionCommand();
> /* cleanup memory*/
> SetupDecodingSnapshot(snapshot, data);
> InvalidateSystemCaches();
>
> Why do you need to be able to query the present? I thought it might be
> necessary to allow additional tables to be accessed in a timetraveling
> manner, but not this way round.
> I guess an initial round of querying during plugin initialization won't
> be good enough?

For example, my output plugin would want the list of replicated tables
(or the list of tables replicated to a particular replica). This list
can change over time.  As administrators issue commands to add tables to
or remove tables from replication, or otherwise reshape the cluster, the
output plugin will need to know about this.  I MIGHT be able to get away
with having slon disconnect and reconnect on reconfiguration events so
that only the init() call would need this data, but I am not sure.
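
To illustrate, the init() call could run a one-time query along these
lines and cache the result for the change callbacks.  This is purely a
sketch: the init() entry point, the assumption that normal
(non-timetraveling) catalog access is allowed at that point, and the
_cluster schema name are all placeholders on my part.

#include "postgres.h"
#include "executor/spi.h"
#include "nodes/pg_list.h"
#include "utils/memutils.h"

/* Sketch: fetch the names of replicated tables once, at output plugin
 * initialization, and keep them in a list the change callback can
 * consult.  A real version would filter by the sets subscribed to by
 * the receiving node and refresh on configuration events. */
static List *replicated_tables = NIL;

static void
load_replicated_tables(void)
{
    uint64      i;

    if (SPI_connect() != SPI_OK_CONNECT)
        elog(ERROR, "SPI_connect failed");

    if (SPI_execute("SELECT tab_relname FROM _cluster.sl_table",
                    true, 0) != SPI_OK_SELECT)
        elog(ERROR, "could not read sl_table");

    for (i = 0; i < SPI_processed; i++)
    {
        char *relname = SPI_getvalue(SPI_tuptable->vals[i],
                                     SPI_tuptable->tupdesc, 1);

        /* copy out of the SPI memory context so the name survives
         * SPI_finish() */
        replicated_tables = lappend(replicated_tables,
                                    MemoryContextStrdup(TopMemoryContext,
                                                        relname));
    }

    SPI_finish();
}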

One of the ways slony allows you to shoot your foot off is by changing
certain configuration things (like dropping a table from a set) while a
subscription is in progress.  Being able to timetravel the slony
configuration tables might make this type of foot-gun a lot harder to
encounter, but that might be asking for too much.




>> b) What the slony node id is of the node we are streaming to.  It would be
>> nice if we could pass extra, arbitrary data/parameters to the output plugins
>> that could include that, or other things.  At the moment the
>> start_logical_replication rule in repl_gram.y doesn't allow for that but I
>> don't see why we couldn't make it do so.
> Yes, I think we want something like that. I even asked input on that
> recently ;):
> http://archives.postgresql.org/message-id/20121115014250.GA5844@awork2.anarazel.de
>
> Input welcome!

How flexible will the datatypes for the arguments be? If I wanted to
pass in a list of tables (i.e. an array?) could I?
Above I talked about having the init() or change() methods query the
local database.  Another option might be to have the slon build up this
data (by querying the database over a normal psql connection) and just
pass the data in.  However, that might mean passing in a list of a
few thousand table names, which doesn't sound like a good idea.
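
Just to show what I mean by passing a list in: if the table names
arrived as one comma-separated text argument, the plugin side of the
split is trivial (sketch below, with the option name and how it reaches
the plugin left open); the concern is really the size of that list.

#include <stdlib.h>
#include <string.h>

/* Sketch: split a comma-separated table list passed as a single text
 * argument, e.g. "public.t1,public.t2,public.t3".  How (and whether)
 * such an argument reaches the output plugin is the open question. */
static char **
split_table_list(const char *arg, int *ntables)
{
    char   *copy = strdup(arg);
    char   *saveptr = NULL;
    char   *tok;
    char  **tables = NULL;
    int     n = 0;

    for (tok = strtok_r(copy, ",", &saveptr);
         tok != NULL;
         tok = strtok_r(NULL, ",", &saveptr))
    {
        tables = realloc(tables, (n + 1) * sizeof(char *));
        tables[n++] = strdup(tok);
    }

    free(copy);
    *ntables = n;
    return tables;
}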

>
>> Even though, from a data-correctness point of view, slony could commit the
>> transaction on the replica after it sees the t1 commit, we won't want it to
>> do commits other than on a SYNC boundary.  This means that the replicas will
>> continue to move between consistent SYNC snapshots and that we can still
>> track the state/progress of replication by knowing what events (SYNC or
>> otherwise) have been confirmed.
> I don't know enough about slony internals, but: why? This will prohibit
> you from ever doing (per-transaction) synchronous replication...

A lot of this has to do with the stuff I discuss in the section below on
cluster reshaping that you didn't understand.  Slony depends on knowing
what data has, or hasn't, been sent to a replica at a particular event
id.  If 'some' transactions in between two SYNC events have committed
but not others, then slony has no idea what data it needs to get
elsewhere on a FAILOVER type event.  There might be a way to make this
work otherwise, but I'm not sure what that is and how long it would take
to debug the issues.

> Cool! Don't hesitate to mention anything that you think would make your
> life easier; chances are that you're not the only one who could benefit
> from it... Thanks, Andres





