Re: logical changeset generation v6.2 - Mailing list pgsql-hackers

From Robert Haas
Subject Re: logical changeset generation v6.2
Msg-id CA+TgmoZGGrp36Mrn0Tm=MQ=A1=9w6B4FyOm0LJ0wwQQvhOGseA@mail.gmail.com
In response to Re: logical changeset generation v6.2  (Andres Freund <andres@2ndquadrant.com>)
Responses Re: logical changeset generation v6.2  (Andres Freund <andres@2ndquadrant.com>)
Re: logical changeset generation v6.2  (David Fetter <david@fetter.org>)
Re: logical changeset generation v6.2  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On Tue, Oct 15, 2013 at 9:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> It allows you to use the shared library both as a normal extension loaded
> via shared_preload_libraries or ad hoc and as an output plugin, which seems
> like a sensible goal.
> We could have a single _PG_init_output_plugin() symbol that fills in
> such a struct which would then not conflict with using the .so
> independently. If you prefer that I'll change things around.

I think part of the problem may be that you're using the library name
to identify the output plugin.  I'm not excited about that design.
For functions, you give the function a name and that is a pointer to
where to actually find the function, which may be a 2-tuple
<library-name, function-name>, or perhaps just a 1-tuple
<builtin-function-name>, or maybe the whole text of a PL/pgsql
procedure that should be compiled.

Perhaps this ought to work similarly.  Create a function in pg_proc
which returns the structure containing the function pointers.  Then,
when that output plugin is selected, it'll automatically trigger
loading the correct shared library if that's needed; and the shared
library name may (but need not) match the output plugin name.
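To make the idea concrete, here is a small sketch (in a scripting language, not PostgreSQL internals) of that shape: the plugin name maps to an init function that fills in a structure of callbacks, and selecting the plugin is what triggers loading. All names here are illustrative assumptions, not the real API.

```python
# Hypothetical sketch of the proposed lookup scheme: plugin name -> init
# function returning a struct of callbacks, decoupled from the library name.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class OutputPluginCallbacks:
    # Callback slots; the names are illustrative, not the real API.
    begin_txn: Callable[[int], str]
    change: Callable[[int, str, dict], str]
    commit_txn: Callable[[int], str]

# Maps plugin name -> init function.  In the proposal this lookup would go
# through pg_proc, so the library name need not match the plugin name.
_registry: Dict[str, Callable[[], OutputPluginCallbacks]] = {}

def register_plugin(name: str, init: Callable[[], OutputPluginCallbacks]) -> None:
    _registry[name] = init

def load_plugin(name: str) -> OutputPluginCallbacks:
    # Selecting the plugin triggers the load; the callbacks come back as
    # one structure of function pointers.
    return _registry[name]()

# A toy plugin whose init function fills in the callback struct.
def _demo_init() -> OutputPluginCallbacks:
    return OutputPluginCallbacks(
        begin_txn=lambda xid: f"BEGIN {xid}",
        change=lambda xid, table, row: f"CHANGE {xid} {table} {row}",
        commit_txn=lambda xid: f"COMMIT {xid}",
    )

register_plugin("demo", _demo_init)
cb = load_plugin("demo")
print(cb.begin_txn(42))  # BEGIN 42
```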

>> What I'd probably do is
>> emit the data in CSV format, with the first column of each line being
>> a single character indicating what sort of row this is: H means a
>> header row, defining the format of subsequent rows
>> (H,table_name,new_column1,...,new_columnj,old_key_column1,...,old_key_columnk;
>> a new header row is emitted only when the column list changes); I, U,
>> or D means an insert, update, or delete, with column 2 being the
>> transaction ID, column 3 being the table name, and the remaining
>> columns matching the last header row emitted for that table; T
>> means meta-information about a transaction, whatever we have (e.g.
>> T,txn_id,commit_time).
>
> There are two issues I have with this:
> a) CSV seems like a bad format for this. If a transaction inserts into
> multiple tables the number of columns will constantly change. Many CSV
> parsers don't deal with that all too gracefully. E.g. you can't even
> load the data into another postgres database as an audit log.

We can pick some other separator.  I don't think ragged CSV is a big
problem; I'm actually more worried about having an easy way to handle
embedded commas and newlines and so on.  But I'd be fine with
tab-separated data or something too, if you think that's better.  What
I want is something that someone can parse with a script that can be
written in a reasonable amount of time in their favorite scripting
language.  I predict that if we provide something like this we'll
vastly expand the number of users who can make use of this new
functionality.
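As a sketch of how short such a script could be, here is a toy parser for the row format proposed upthread (H = header giving the column list per table, I/U/D = change rows that reference the last header for that table, T = transaction metadata). The sample data, field layout, and column names are invented for illustration.

```python
# Sketch of a "favorite scripting language" consumer for the proposed
# H/I/U/D/T row format.  Layout and field names are assumptions, not a spec.
import csv
from io import StringIO

sample = """\
H,accounts,id,balance,old_key_id
I,731,accounts,1,100.00,
U,731,accounts,1,95.00,1
T,731,2013-10-15 09:17:00
"""

def parse_stream(lines):
    headers = {}  # table name -> column list from the last H row
    events = []
    for row in csv.reader(lines):
        kind = row[0]
        if kind == "H":
            # A new header row redefines the column list for that table.
            headers[row[1]] = row[2:]
        elif kind in ("I", "U", "D"):
            xid, table = row[1], row[2]
            cols = headers[table]  # last header row emitted for that table
            events.append((kind, int(xid), table, dict(zip(cols, row[3:]))))
        elif kind == "T":
            events.append(("T", int(row[1]), row[2]))
    return events

events = parse_stream(StringIO(sample))
print(events[1][0], events[1][3]["balance"])  # U 95.00
```

Note that the "ragged CSV" concern is visible even here: the reader happily accepts rows of varying width, but the consumer must track the last header per table to interpret them.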

User: So, what's new in PostgreSQL 9.4?
Hacker: Well, now we have logical replication!
User: Why is that cool?
Hacker: Well, streaming replication is awesome for HA, but it has
significant limitations.  And trigger-based systems are very mature,
but the overhead is high and their lack of core integration makes them
hard to use.  With this technology, you can build systems that will
replicate individual tables or even parts of tables, multi-master
systems, and lots of other cool stuff.
User: Wow, that sounds great.  How do I use it?
Hacker: Well, first you write an output plugin in C using a special API.
User: Hey, do you know whether the MongoDB guys came to this conference?

Let's try that again.

User: Wow, that sounds great.  How do I use it?
Hacker: Well, currently, the output gets dumped as a series of text
files that are designed to be parsed using a scripting language.  We
have sample parsers written in Perl and Python that you can use as-is
or hack up to meet your needs.

Now, some users are still going to head for the hills.  But at least
from where I sit it sounds a hell of a lot better than the first
answer.  We're not going to solve all of the tooling problems around
this technology in one release, for sure.  But as far as 95% of our
users are concerned, a C API might as well not exist at all.  People
WILL try to machine parse the output of whatever demo plugins we
provide; so I think we should try hard to provide at least one such
plugin that is designed to make that as easy as possible.

> If we go for CSV I think we should put the entire primary key as one
> column (containing all the columns) and the entire row as another.
>
> We also don't have any nice facilities for actually writing CSV - so
> we'll need to start extracting escaping code from COPY. In the end all
> that will make the output plugin very hard to use as an example because
> the code will get more complicated.
>
> b) Emitting new row descriptors every time the schema changes will
> require keeping track of the schema. I think that won't be trivial. It
> also makes consumption of the data more complicated in comparison to
> including the description with every row.
>
> Both are even more true once we extend the format to support streaming
> of transactions while they are performed.

All fair points, but IMHO this is exactly why we need to provide a
well-written output plugin, not leave it to users to solve these
problems.

>> But regular people don't want to write C code; they just want to write
>> a crappy Perl script.  And I think we can facilitate that without too
>> much work.
>
> I think the generic output plugin should be a separate one from the
> example one (which is the one included in the patchset).

That's OK with me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


