Thread: Changeset Extraction Interfaces

Changeset Extraction Interfaces

From
Andres Freund
Date:
Hi,

Short recap:

From the perspective of the user interface, the changeset extraction
feature consists of two abstract interfaces that the "user" has to
deal with:

1) The "slot" or "changestream" management interface which manages
individual streams of changes. The user can create and destroy a
changestream, and most importantly stream the changes.

Simplified, a "logical replication slot" is a position in the WAL and a
bunch of state associated with it. As long as a slot exists, the user
can ask for all changes that happened since the last time he asked to
be streamed out.

It is abstract, because different usecases require the changes to be
streamed out via different methods. The series contains two
implementations of that interface:
I) One integrated into walsender that allows for efficient streaming,
including support for synchronous replication.
II) Another that is accessible via SQL functions, very useful for
writing pg_regress/isolationtester tests.

It is, with a relatively low amount of code, possible to add other such
interfaces without touching core code. One example that has been asked
for by a number of people is consuming the changestream in a background
worker without involving SQL or connecting to a walsender.

There's basically three major 'verbs' that can be performed on a
stream, currently named (walsender names):
* INIT_LOGICAL_REPLICATION "name" "output_plugin"
* START_LOGICAL_REPLICATION "name" last_received ("option_name" value,...)
* FREE_LOGICAL_REPLICATION "name"

The SQL variant currently has:
* init_logical_replication(name, plugin)
* start_logical_replication(name, stream_upto, options[])
* stop_logical_replication(name)

You might have noticed the slight inconsistency...

2) The "output plugin" interface, which transforms a changestream
(begin, change, commit) into the desired target format.

There are 5 callbacks, 3 of them obligatory:
* pg_decode_init(context, is_initial) [optional]
* pg_decode_begin(context, txn)
* pg_decode_change(context, txn, relation, change)
* pg_decode_commit(context, txn)
* pg_decode_cleanup(context) [optional]

Every output plugin can be used from every slot management
interface.
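To make the shape of that interface a bit more concrete, here is a minimal
sketch of what a plugin might look like in C. The context/txn/change struct
definitions below are purely illustrative stand-ins for whatever the
decoding API finally exposes; only the callback names match the list above:

#include <stdio.h>

/* illustrative placeholder types, not the real API */
typedef struct DecodingContext { void *output_private; } DecodingContext;
typedef struct DecodedTxn { unsigned int xid; } DecodedTxn;
typedef struct DecodedChange { int action; } DecodedChange;
typedef struct RelationData *Relation;

void
pg_decode_begin(DecodingContext *ctx, DecodedTxn *txn)
{
    printf("BEGIN %u\n", txn->xid);
}

void
pg_decode_change(DecodingContext *ctx, DecodedTxn *txn,
                 Relation relation, DecodedChange *change)
{
    /* emit the change in whatever target format this plugin implements */
    printf("CHANGE in xid %u, action %d\n", txn->xid, change->action);
}

void
pg_decode_commit(DecodingContext *ctx, DecodedTxn *txn)
{
    printf("COMMIT %u\n", txn->xid);
}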

The current pain points that I'd like to discuss are:
a) Better naming for the slot management between walsender, SQL and
possible future interfaces.

b) Decide which of the SQL functions should be in a contrib module, and
which in core. Currently init_logical_replication() and
stop_logical_replication() are in core, whereas
start_logical_replication() is in the 'test_logical_decoding'
extension. The reasoning behind that is that init/stop ones are
important to the DBA and the start_logical_replication() SRF isn't
all that useful in the real world because our SRFs don't support
streaming changes out.

c) Which data-types does start_logical_replication() return. Currently
it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
column text has some obvious disadvantages though - there's obvious
usecases for output plugins that return binary data. But making it bytea
sucks, because the output is harder to read by default...

d) How does a slot acquire the callbacks of an output plugin.

For a), my current feeling is to name them:
* LOGICAL_DECODING_SLOT_CREATE/pg_logical_decoding_slot_create()
* LOGICAL_DECODING_SLOT_STREAM/pg_logical_decoding_slot_extract()
* LOGICAL_DECODING_SLOT_DESTROY/pg_logical_decoding_slot_destroy()
with an intentional discrepancy between stream and extract, to make the
difference obvious. One day we might have the facility - which would be
rather cool - to do the streaming from SQL as well.

Better ideas? Leave out the "logical"?

For b), I am happy with that split, I would just like others to comment.

For c), I have no better idea than two functions.

d) is my main question, and Robert, Peter G. and I previously argued
about it a fair bit. I know of the following alternatives:

I) The output plugin that's specified in INIT_LOGICAL_REPLICATION is
actually a library name, and we simply lookup the fixed symbol names in
it. That's what's currently implemented.
The advantage is that it's pretty easy to implement, works on a HS
standby without involving the primary, and doesn't have a problem if the
library is used in shared_preload_libraries.
The disadvantages are: All output plugins need to be shared libraries
and there can only be one output plugin per shared library (although you
could route differently, via options, but ugh).

II) Keep the output plugin a library, but only lookup a
_PG_init_output_plugin() which registers/returns the callbacks. Pretty
much the same tradeoffs as I)

III) Keep the output plugin a library, but simply rely on _PG_init()
calling a function to register all callbacks. Imo it's worse than I) and
II) because it basically prohibits using the library in
shared_preload_libraries as well, because then its _PG_init() doesn't
get called when starting to stream, and another library might have
registered other callbacks.

IV) Make output plugins a SQL-level object/catalog table where a plugin
can be registered, and the callbacks are normal pg_proc entries. It's
more in line with other stuff, but has the disadvantage that we need to
register plugins on the primary, even if we only stream from a
standby. But then, we're used to that with CREATE EXTENSION et al.

I personally lean towards I), followed by II) and IV).

Comments?

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Robert Haas
Date:
On Wed, Dec 4, 2013 at 7:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> There's basically three major 'verbs' that can be performed on a
> stream, currently named (walsender names):
> * INIT_LOGICAL_REPLICATION "name" "output_plugin"
> * START_LOGICAL_REPLICATION "name" last_received ("option_name" value,...)
> * FREE_LOGICAL_REPLICATION "name"
>
> The SQL variant currently has:
> * init_logical_replication(name, plugin)
> * start_logical_replication(name, stream_upto, options[])
> * stop_logical_replication(name)
>
> You might have noticed the slight inconsistency...

I think this naming is probably not the greatest.  When I hear "init",
I don't think "permanently allocate a resource that will never be
released unless I explicitly throw it away", and when I hear "stop", I
don't think "free that resource".  I suggest naming based around
create/drop, register/unregister, or acquire/release.  Since, as
previously noted, I'm gunning for these slots to apply to ordinary
replication as well, I kind of like ACQUIRE_REPLICATION_SLOT and
RELEASE_REPLICATION_SLOT.  If we're going to make them specific to
logical replication, then your suggestion of CREATE_DECODING_SLOT and
DROP_DECODING_SLOT, or maybe ACQUIRE_DECODING_SLOT and
RELEASE_DECODING_SLOT, sounds fine.

It also strikes me that just as it's possible to stream WAL without
allocating a slot first (since we don't at present have slots),
perhaps it ought also to be possible to stream logical replication
data without acquiring a slot first.  You could argue that it was a
mistake not to introduce slots in the first place, but the stateless
nature of WAL streaming definitely has some benefits, and it's unclear
to me why you shouldn't be able to do the same thing with logical
decoding.

> b) Decide which of the SQL functions should be in a contrib module, and
>    which in core. Currently init_logical_replication() and
>    stop_logical_replication() are in core, whereas
>    start_logical_replication() is in the 'test_logical_decoding'
>    extension. The reasoning behind that is that init/stop ones are
>    important to the DBA and the start_logical_replication() SRF isn't
>    all that useful in the real world because our SRFs don't support
>    streaming changes out.

Seems pretty arbitrary to me.  If start_logical_replication() is
usable with any output plugin, then it ought to be in core.  I think
the name isn't great, though; the actual functionality seems to be
more or less decode-from-last-position-up-to-present, which doesn't
sound much like "start".

> c) Which data-types does start_logical_replication() return. Currently
> it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
> column text has some obvious disadvantages though - there's obvious
> usecases for output plugins that return binary data. But making it bytea
> sucks, because the output is harder to read by default...

I think having two functions might be sensible.  I'm not sure what
happens if the text function is used and the plugin outputs something
that's not valid in the database encoding, though.  I guess you'd
better check for that and error out.

> d) is my main question, and Robert, Peter G. and I previously argued
> about it a fair bit. I know of the following alternatives:
>
> I) The output plugin that's specified in INIT_LOGICAL_REPLICATION is
> actually a library name, and we simply lookup the fixed symbol names in
> it. That's what's currently implemented.
> The advantage is that it's pretty easy to implement, works on a HS
> standby without involving the primary, and doesn't have a problem if the
> library is used in shared_preload_libraries.
> The disadvantages are: All output plugins need to be shared libraries
> and there can only be one output plugin per shared library (although you
> could route differently, via options, but ugh).

I still dislike this.

> II) Keep the output plugin a library, but only lookup a
> _PG_init_output_plugin() which registers/returns the callbacks. Pretty
> much the same tradeoffs as I)
>
> III) Keep the output plugin a library, but simply rely on _PG_init()
> calling a function to register all callbacks. Imo it's worse than I) and
> II) because it basically prohibits using the library in
> shared_preload_libraries as well, because then its _PG_init() doesn't
> get called when starting to stream, and another library might have
> registered other callbacks.

I don't understand the disadvantage that you're describing here.  What
I'm imagining is that you have some struct that looks like this:

struct output_plugin
{
    char name[64];
    void (*begin)(args);
    void (*change)(args);
    void (*commit)(args);
};

Now you provide a function RegisterOutputPlugin(output_plugin *).  If
there are any output plugins built into core, core will call
RegisterOutputPlugin once for each one.  If a shared library
containing an output plugin is loaded, the library's _PG_init function
does the same thing.  When someone tries to use a plugin, they ask for
it by name.  We go iterate through the data saved by all previous
calls to RegisterOutputPlugin() until we find one with a matching
name, and then we use the callbacks embedded in that struct.

It doesn't matter when _PG_init() runs, except that it had better run
before the output plugin is requested by name.  It also doesn't matter
that another library may have registered other callbacks, because we
use the name to decide which output plugin to call, from among all
those we know about.
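Concretely, the registry could be as small as the following sketch (the
fixed-size array, the find function, and the error handling are just
illustrative, not a proposal for the actual code):

#define MAX_OUTPUT_PLUGINS 32

static struct output_plugin registered_plugins[MAX_OUTPUT_PLUGINS];
static int n_registered_plugins = 0;

void
RegisterOutputPlugin(struct output_plugin *plugin)
{
    if (n_registered_plugins >= MAX_OUTPUT_PLUGINS)
        elog(ERROR, "too many output plugins");
    /* copy the caller's struct, so it may live in a loadable library */
    registered_plugins[n_registered_plugins++] = *plugin;
}

static struct output_plugin *
find_output_plugin(const char *name)
{
    int         i;

    for (i = 0; i < n_registered_plugins; i++)
        if (strcmp(registered_plugins[i].name, name) == 0)
            return &registered_plugins[i];
    return NULL;
}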

> IV) Make output plugins a SQL-level object/catalog table where a plugin
> can be registered, and the callbacks are normal pg_proc entries. It's
> more in line with other stuff, but has the disadvantage that we need to
> register plugins on the primary, even if we only stream from a
> standby. But then, we're used to that with CREATE EXTENSION et al.

I don't think I'd make every callback a pg_proc entry; I'd make a
single pg_proc entry that returns a struct of function pointers, as we
do for FDWs.  But I think this has merit.  One significant advantage
of it over (III) is that execution of a function in pg_proc can
trigger a library load without any extra pushups, which is nice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
Hello Robert,

On 2013-12-11 22:29:46 -0500, Robert Haas wrote:
> On Wed, Dec 4, 2013 at 7:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > There's basically three major 'verbs' that can be performed on a
> > stream, currently named (walsender names):
> > * INIT_LOGICAL_REPLICATION "name" "output_plugin"
> > * START_LOGICAL_REPLICATION "name" last_received ("option_name" value,...)
> > * FREE_LOGICAL_REPLICATION "name"
> >
> > The SQL variant currently has:
> > * init_logical_replication(name, plugin)
> > * start_logical_replication(name, stream_upto, options[])
> > * stop_logical_replication(name)
> >
> > You might have noticed the slight inconsistency...
> 
> I think this naming is probably not the greatest.

Completely agreed.

> When I hear "init",
> I don't think "permanently allocate a resource that will never be
> released unless I explicitly throw it away", and when I hear "stop", I
> don't think "free that resource".  I suggest naming based around
> create/drop, register/unregister, or acquire/release.  Since, as
> previously noted, I'm gunning for these slots to apply to ordinary
> replication as well, I kind of like ACQUIRE_REPLICATION_SLOT and
> RELEASE_REPLICATION_SLOT.  If we're going to make them specific to
> logical replication, then your suggestion of CREATE_DECODING_SLOT and
> DROP_DECODING_SLOT, or maybe ACQUIRE_DECODING_SLOT and
> RELEASE_DECODING_SLOT, sounds fine.

I think there'll always be a bit of a difference between slots for
physical and logical data, even if 90% of the implementation is the
same. We can signal that difference by specifying logical/physical as an
option or having two different sets of commands.

Maybe?

ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
-- already exists without slot, PHYSICAL arguments
START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
START_REPLICATION SLOT LOGICAL slot plugin_options
RELEASE_REPLICATION_SLOT slot_name

> It also strikes me that just as it's possible to stream WAL without
> allocating a slot first (since we don't at present have slots),
> perhaps it ought also to be possible to stream logical replication
> data without acquiring a slot first.  You could argue that it was a
> mistake not to introduce slots in the first place, but the stateless
> nature of WAL streaming definitely has some benefits, and it's unclear
> to me why you shouldn't be able to do the same thing with logical
> decoding.

I think it would be quite a bit harder for logical decoding. The
difference is that, from the perspective of the walsender, for plain WAL
streaming, all that needs to be checked is whether the WAL is still
there. For decoding though, we need to be sure that a) the catalog xmin
is still low enough and has been all along, and b) that we are able to
instantly build a historical MVCC snapshot from the point we want to start
streaming.
Both a) and b) are solved by keeping the xmin and the point where to
reread WAL from in the slot data and by serializing data about
historical snapshots to disk. But those are removed if there isn't a
slot around requiring them...

So what you could get is something that starts streaming you changes
sometime after you asked it to start streaming, without a guarantee that
you can restart at exactly the position you stopped. If that's useful,
we can do it, but I am not sure what the usecase would be?

> > b) Decide which of the SQL functions should be in a contrib module, and
> >    which in core. Currently init_logical_replication() and
> >    stop_logical_replication() are in core, whereas
> >    start_logical_replication() is in the 'test_logical_decoding'
> >    extension. The reasoning behind that is that init/stop ones are
> >    important to the DBA and the start_logical_replication() SRF isn't
> >    all that useful in the real world because our SRFs don't support
> >    streaming changes out.
> 
> Seems pretty arbitrary to me.  If start_logical_replication() is
> usable with any output plugin, then it ought to be in core.

Ok, I certainly have no problem with that.

> I think
> the name isn't great, though; the actual functionality seems to be
> more or less decode-from-last-position-up-to-present, which doesn't
> sound much like "start".

The name originally comes from the START_REPLICATION walsender command,
but I agree, there's not much point in trying to keep that name.

I am also open to different behaviour for the SRF, but I am not sure
what that could be. There's just no sensible way to stream data on the
SQL level afaics.

What about pg_decoding_slot_get_[binary_]changes()?

> > c) Which data-types does start_logical_replication() return. Currently
> > it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
> > column text has some obvious disadvantages though - there's obvious
> > usecases for output plugins that return binary data. But making it bytea
> > sucks, because the output is harder to read by default...

> I think having two functions might be sensible.  I'm not sure what
> happens if the text function is used and the plugin outputs something
> that's not valid in the database encoding, though.  I guess you'd
> better check for that and error out.

I wonder if we should let the output plugin tell us whether it will
output data in binary? I think it generally would be a good idea to let
the output plugin's _init() function return some configuration
data. That will make extending the interface to support more features
easier.
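Roughly what I have in mind, with made-up names just to illustrate the
shape of it:

/*
 * Sketch only: filled in by the output plugin's _init() callback so that
 * the slot management code knows what to expect from the plugin.
 * (The context type stands in for whatever the decoding API passes.)
 */
typedef struct OutputPluginOptions
{
    bool    output_is_binary;   /* false: textual output in the server encoding */
    /* room for future capabilities, e.g. supports_streaming */
} OutputPluginOptions;

void pg_decode_init(void *context, bool is_initial,
                    OutputPluginOptions *options);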

> > III) Keep the output plugin a library, but simply rely on _PG_init()
> > calling a function to register all callbacks. Imo it's worse than I) and
> > II) because it basically prohibits using the library in
> > shared_preload_libraries as well, because then its _PG_init() doesn't
> > get called when starting to stream, and another library might have
> > registered other callbacks.

> I don't understand the disadvantage that you're describing here.  What
> I'm imagining is that you have some struct that looks like this:

> Now you provide a function RegisterOutputPlugin(output_plugin *).  If
> there are any output plugins built into core, core will call
> RegisterOutputPlugin once for each one.  If a shared library
> containing an output plugin is loaded, the library's _PG_init function
> does the same thing.  When someone tries to use a plugin, they ask for
> it by name.  We go iterate through the data saved by all previous
> calls to RegisterOutputPlugin() until we find one with a matching
> name, and then we use the callbacks embedded in that struct.

But if we don't pass in a .so's name, how can additional plugins be
registered except by adding them to [shared|local]_preload_libraries? If
we do pass in one, it seems confusing if you suddenly get a plugin
implemented somewhere else.

> > IV) Make output plugins a SQL-level object/catalog table where a plugin
> > can be registered, and the callbacks are normal pg_proc entries. It's
> > more in line with other stuff, but has the disadvantage that we need to
> > register plugins on the primary, even if we only stream from a
> > standby. But then, we're used to that with CREATE EXTENSION et al.
> 
> I don't think I'd make every callback a pg_proc entry; I'd make a
> single pg_proc entry that returns a struct of function pointers, as we
> do for FDWs.  But I think this has merit.  One significant advantage
> of it over (III) is that execution of a function in pg_proc can
> trigger a library load without any extra pushups, which is nice.

So I guess this is it? It has the advantage that an output plugin can
create any additional functionality it needs in the course of its
CREATE EXTENSION.

As far as I have been thinking of, this would be another catalog table like
pg_decoding_plugin(oid, dpname name, dpload regproc).

I guess that we at least need a function for adding output plugins
there, that also creates proper pg_depend entries? We could also go for
full DDL commands, although I have a bit of a hard time finding
suitable, already used, keywords... If we're not worried about that, maybe
CREATE/DROP OUTPUT PLUGIN ...;

Thanks for the input!

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Robert Haas
Date:
On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I think there'll always be a bit of a difference between slots for
> physical and logical data, even if 90% of the implementation is the
> same. We can signal that difference by specifying logical/physical as an
> option or having two different sets of commands.
>
> Maybe?
>
> ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
> ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
> -- already exists without slot, PHYSICAL arguments
> START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
> START_REPLICATION SLOT LOGICAL slot plugin_options
> RELEASE_REPLICATION_SLOT slot_name

I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
but basically this seems OK to me.  I hadn't realized that the options
were going to be different for logical vs. physical.  So you could
also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
not sure whether that's better.

>> It also strikes me that just as it's possible to stream WAL without
>> allocating a slot first (since we don't at present have slots),
>> perhaps it ought also to be possible to stream logical replication
>> data without acquiring a slot first.  You could argue that it was a
>> mistake not to introduce slots in the first place, but the stateless
>> nature of WAL streaming definitely has some benefits, and it's unclear
>> to me why you shouldn't be able to do the same thing with logical
>> decoding.
>
> I think it would be quite a bit harder for logical decoding. The
> difference is that, from the perspective of the walsender, for plain WAL
> streaming, all that needs to be checked is whether the WAL is still
> there. For decoding though, we need to be sure that a) the catalog xmin
> is still low enough and has been all along, and b) that we are able to
> instantly build a historical MVCC snapshot from the point we want to start
> streaming.
> Both a) and b) are solved by keeping the xmin and the point where to
> reread WAL from in the slot data and by serializing data about
> historical snapshots to disk. But those are removed if there isn't a
> slot around requiring them...
>
> So what you could get is something that starts streaming you changes
> sometime after you asked it to start streaming, without a guarantee that
> you can restart at exactly the position you stopped. If that's useful,
> we can do it, but I am not sure what the usecase would be?

I haven't yet looked closely at the snapshot-building stuff, but my
thought is that you ought to be able to decode any transactions that
start after you make the connection.  You might not be able to decode
transactions that are already in progress at that point, because you
might have already missed XID assignment records, catalog changes,
etc. that they've performed.  But transactions that begin after that
point ought to be OK.  I have a feeling you're going to tell me it
doesn't work like that, but maybe it should, because there's a whole
lot of benefit in having decoding start up quickly, and a whole lot of
benefit also to having the rules for that be easy to understand.

Now if you have that, then I think ad-hoc decoding is potentially
useful.  Granted, you're not going to want to build a full-fledged
replication solution that way, but you might want to just connect and
watch the world stream by... or you might imagine an application that
opens a replication connection and a regular connection, copies a
table, and then applies the stream of changes made to that table after
the fact.  When that completes, the table is sync'd between the two
machines as of the end of the copy.  Useful enough to bother with?  I
don't know.  But not obviously useless.

> I am also open to different behaviour for the SRF, but I am not sure
> what that could be. There's just no sensible way to stream data on the
> SQL level afaics.

I don't have a problem with the behavior.  Seems useful.  One useful
addition might be to provide an option to stream out up to X changes
but without consuming them, so that the DBA can peek at the
replication stream.  I think it's a safe bet DBAs will want to do
things like that, so it'd be nice to make it easy, if we can.

> What about pg_decoding_slot_get_[binary_]changes()?

Sounds about right, but I think we need to get religion about figuring
out what terminology to use.  At the moment it seems to vary quite a
bit between "logical", "logical decoding", and "decoding".  Not sure
how to nail that down.

As a more abstract linguistic question, what do we think the
difference is between logical *replication* and logical *decoding*?
Are they the same or different?  If different, how?

> I wonder if we should let the output plugin tell us whether it will
> output data in binary? I think it generally would be a good idea to let
> the output plugin's _init() function return some configuration
> data. That will make extending the interface to support more features
> easier.

Maybe, but you've got to consider the question of encoding, too.  You
could make the choices "binary" and "the database encoding", I
suppose.

>> Now you provide a function RegisterOutputPlugin(output_plugin *).  If
>> there are any output plugins built into core, core will call
>> RegisterOutputPlugin once for each one.  If a shared library
>> containing an output plugin is loaded, the library's _PG_init function
>> does the same thing.  When someone tries to use a plugin, they ask for
>> it by name.  We go iterate through the data saved by all previous
>> calls to RegisterOutputPlugin() until we find one with a matching
>> name, and then we use the callbacks embedded in that struct.
>
> But if we don't pass in a .so's name, how can additional plugins be
> registered except by adding them to [shared|local]_preload_libraries? If
> we do pass in one, it seems confusing if you suddenly get a plugin
> implemented somewhere else.

I don't see what the confusion is.  You can use any method you like
for loading those libraries, including shared_preload_libraries,
local_preload_libraries, or the LOAD command.  The replication grammar
might have to grow support for LOAD.  But I think the winner may be
the next option.  Some backends might not actually use the library in
question, but that's true of any library you preload.

>> > IV) Make output plugins a SQL-level object/catalog table where a plugin
>> > can be registered, and the callbacks are normal pg_proc entries. It's
>> > more in line with other stuff, but has the disadvantage that we need to
>> > register plugins on the primary, even if we only stream from a
>> > standby. But then, we're used to that with CREATE EXTENSION et al.
>>
>> I don't think I'd make every callback a pg_proc entry; I'd make a
>> single pg_proc entry that returns a struct of function pointers, as we
>> do for FDWs.  But I think this has merit.  One significant advantage
>> of it over (III) is that execution of a function in pg_proc can
>> trigger a library load without any extra pushups, which is nice.
>
> So I guess this is it? It has the advantage that an output plugin can
> create any additional functionality it needs in the course of its
> CREATE EXTENSION.

Yes, that's elegant.

> As far as I have been thinking of, this would be another catalog table like
> pg_decoding_plugin(oid, dpname name, dpload regproc).

Instead of adding another catalog table, I think we should just define
a new type.  Again, please look at the way that foreign data wrappers
do this:

rhaas=# \df file_fdw_handler
                              List of functions
 Schema |       Name       | Result data type | Argument data types |  Type
--------+------------------+------------------+---------------------+--------
 public | file_fdw_handler | fdw_handler      |                     | normal
(1 row)

Is there any reason not to slavishly copy that design?  Note that if
you do it this way, you don't need any special DDL or pg_dump support;
a separate catalog table will raise the bar considerably.
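On the C side, that would look pretty much like an FDW handler does today.
A sketch (output_plugin_handler and OutputPluginCallbacks don't exist,
they're only here to show the intended shape):

/*
 * Handler function that returns the plugin's callbacks, the same way
 * file_fdw_handler returns an FdwRoutine.  The SQL declaration would be
 * something like:
 *     CREATE FUNCTION test_decoding_handler()
 *         RETURNS output_plugin_handler ...;
 */
PG_FUNCTION_INFO_V1(test_decoding_handler);

Datum
test_decoding_handler(PG_FUNCTION_ARGS)
{
    OutputPluginCallbacks *cb = palloc0(sizeof(OutputPluginCallbacks));

    cb->begin_cb = pg_decode_begin;
    cb->change_cb = pg_decode_change;
    cb->commit_cb = pg_decode_commit;

    PG_RETURN_POINTER(cb);
}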

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
> On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think there'll always be a bit of a difference between slots for
> > physical and logical data, even if 90% of the implementation is the
> > same. We can signal that difference by specifying logical/physical as an
> > option or having two different sets of commands.
> >
> > Maybe?
> >
> > ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
> > ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
> > -- already exists without slot, PHYSICAL arguments
> > START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
> > START_REPLICATION SLOT LOGICAL slot plugin_options
> > RELEASE_REPLICATION_SLOT slot_name
> 
> I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
> but basically this seems OK to me.

Uh, yes.

> I hadn't realized that the options were going to be different for
> logical vs. physical.

I don't see how we could avoid that, there just are some differences
between both.

> So you could
> also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
> START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
> not sure whether that's better.

Not sure either, but I slightly favor keeping the toplevel slot
commands the same. I think we'll want one namespace for both and
possibly similar reporting functions, and that seems less surprising if
they are treated more similarly.

> > So what you could get is something that starts streaming you changes
> > sometime after you asked it to start streaming, without a guarantee that
> > you can restart at exactly the position you stopped. If that's useful,
> > we can do it, but I am not sure what the usecase would be?
> 
> I haven't yet looked closely at the snapshot-building stuff, but my
> thought is that you ought to be able to decode any transactions that
> start after you make the connection.  You might not be able to decode
> transactions that are already in progress at that point, because you
> might have already missed XID assignment records, catalog changes,
> etc. that they've performed.  But transactions that begin after that
> point ought to be OK.

It works mostly like that, yes. At least on primaries. When we start
decoding, we jot down the current xlog insertion pointer to know where
to start decoding from, then trigger a xl_running_xacts record to be
logged so we have enough information. Then we start reading from that
point onwards. On standbys the process is the same, just that we have to
wait for the primary to issue a xl_running_xacts.
(I had considered starting with information from the procarray, but
turns out that's hard to do without race conditions.)

We only decode changes in transactions that commit after the last
transaction that was in-progress when we started observing has finished
though. That allows us to export a snapshot once the last still-running
transaction has finished, which shows a snapshot of the database that can be
rolled forward exactly by the changes contained in the changestream. I
think that's a useful property for the majority of cases.

If we were to start out streaming changes before the last running
transaction has finished, they would be visible in that exported
snapshot and you couldn't use it to roll forward from anymore.

It'd be pretty easy to optionally decode the transactions we currently
skip if we want that feature later. That would remove the option to
export a snapshot in many cases though (think suboverflowed snapshots).

> I have a feeling you're going to tell me it
> doesn't work like that, but maybe it should, because there's a whole
> lot of benefit in having decoding start up quickly, and a whole lot of
> benefit also to having the rules for that be easy to understand.

I am not sure if the above qualifies as "doesn't work like that", if
not, sometimes the correct thing isn't the immediately obvious thing. I
think "all transactions that were running when initiating decoding need
to finish" is reasonably easy to explain.

> > I am also open to different behaviour for the SRF, but I am not sure
> > what that could be. There's just no sensible way to stream data on the
> > SQL level afaics.

> I don't have a problem with the behavior.  Seems useful.  One useful
> addition might be to provide an option to stream out up to X changes
> but without consuming them, so that the DBA can peek at the
> replication stream.  I think it's a safe bet DBAs will want to do
> things like that, so it'd be nice to make it easy, if we can.

It's not too difficult to provide an option to do that. What I've been
thinking of was to correlate the confirmation of consumption with the
transaction the SRF is running in. So, confirm the data as consumed if
it commits, and don't if not. I think we could do that relatively easily
by registering a XACT_EVENT_COMMIT.
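Something like the following sketch; RegisterXactCallback() and
XACT_EVENT_COMMIT already exist in xact.h, whereas the confirm function
and the pending-LSN bookkeeping are made up:

static XLogRecPtr pending_confirm_lsn = InvalidXLogRecPtr;

/* only mark the changes as consumed if the SRF's transaction commits */
static void
decoding_xact_callback(XactEvent event, void *arg)
{
    if (event == XACT_EVENT_COMMIT &&
        pending_confirm_lsn != InvalidXLogRecPtr)
        confirm_changes_received(pending_confirm_lsn);  /* hypothetical */
}

void
decoding_srf_init(void)
{
    RegisterXactCallback(decoding_xact_callback, NULL);
}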

> > What about pg_decoding_slot_get_[binary_]changes()?
>
> Sounds about right, but I think we need to get religion about figuring
> out what terminology to use.  At the moment it seems to vary quite a
> bit between "logical", "logical decoding", and "decoding".  Not sure
> how to nail that down.

Agreed. Perhaps we should just avoid both logical and decoding entirely
and go for "changestream" or similar?

> As a more abstract linguistic question, what do we think the
> difference is between logical *replication* and logical *decoding*?
> Are they the same or different?  If different, how?

For me "logical decoding" can be the basis of "logical replication", but
also for other features.

> > I wonder if we should let the output plugin tell us whether it will
> > output data in binary? I think it generally would be a good idea to let
> > the output plugin's _init() function return some configuration
> > data. That will make extending the interface to support more features
> > easier.
> 
> Maybe, but you've got to consider the question of encoding, too.  You
> could make the choices "binary" and "the database encoding", I
> suppose.

Yes, I think that should be the choice. There seems little justification
for an output plugin to produce textual output in anything but the
server encoding.

I am not sure if we want to verify that in !USE_ASSERT_CHECKING builds? That'd be quite
expensive...

> > As far as I have been thinking of, this would be another catalog table like
> > pg_decoding_plugin(oid, dpname name, dpload regproc).
> 
> Instead of adding another catalog table, I think we should just define
> a new type.  Again, please look at the way that foreign data wrappers
> do this:

I don't really see what the usage of a special type has to do with this,
but I think that's besides your main point. What you're saying is that
the output plugin is just defined by a function name, possibly schema
prefixed. That has an elegance to it. +1

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Robert Haas
Date:
On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I hadn't realized that the options were going to be different for
>> logical vs. physical.
>
> I don't see how we could avoid that, there just are some differences
> between both.

Right, I'm not complaining, just observing that it was a point I had overlooked.

>> So you could
>> also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
>> START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT.  I'm
>> not sure whether that's better.
>
> Not sure either, but I slightly favor keeping the toplevel slot
> commands the same. I think we'll want one namespace for both and
> possibly similar reporting functions, and that seems less surprising if
> they are treated more similarly.

OK.

> If we were to start out streaming changes before the last running
> transaction has finished, they would be visible in that exported
> snapshot and you couldn't use it to roll forward from anymore.

Actually, you could.  You'd just have to throw away any transactions
whose XIDs are visible to the exported snapshot.  In other words, you
begin replication at time T0, and all transactions which begin after
that time are included in the change stream.  At some later time T1,
all transactions in progress at time T0 have ended, and now you can
export a snapshot at that time, or any later time, from which you can
roll forward.  Any change-stream entries for XIDs which would be
visible to that snapshot shouldn't be replayed when rolling forward
from it, though.

I think it sucks (that's the technical term) to have to wait for all
currently-running transactions to terminate before being able to begin
streaming changes, because that could take a long time.  And you might
well know that the long-running transaction which is rolling up
enormous table A that you don't care about is never going to touch
table B which you actually want to replicate.  Now, ideally, the DBA
would have a way to ignore that long-running transaction and force
replication to start, perhaps with the caveat that if that
long-running transaction actually does touch B after all then we have
to resync.  Your model's fine when we want to replicate the whole
database, but a big part of why I want this feature is to allow
finer-grained replication, down to the table level, or even slices of
tables.

So imagine this.  After initiating logical replication, a replication
solution either briefly x-locks a table it wants to replicate, so that
there can't be anyone else touching it, or it observes who has a lock
>= RowExclusiveLock and waits for all of those locks to drop away.  At
that point, it knows that no currently-in-progress transaction can
have modified the table prior to the start of replication, and begins
copying the table.  If a transaction that began before the start of
replication subsequently modifies the table, a WAL record will be
written, and the core logical decoding support could let the plugin
know by means of an optional callback (hey, btw, a change I can't
decode just hit table XYZ).  The plugin will need to respond by
recopying the table, which sucks, but it was the plugin's decision to
be optimistic in the first place, and that will in many cases be a
valid policy decision.  If no such callback arrives before the
safe-snapshot point, then the plugin made the right bet and will reap
the just rewards of its optimism.

>> I don't have a problem with the behavior.  Seems useful.  One useful
>> addition might be to provide an option to stream out up to X changes
>> but without consuming them, so that the DBA can peek at the
>> replication stream.  I think it's a safe bet DBAs will want to do
>> things like that, so it'd be nice to make it easy, if we can.
>
> It's not too difficult to provide an option to do that. What I've been
> thinking of was to correlate the confirmation of consumption with the
> transaction the SRF is running in. So, confirm the data as consumed if
> it commits, and don't if not. I think we could do that relatively easily
> by registering a XACT_EVENT_COMMIT.

That's a bit too accident-prone for my taste.  I'd rather the DBA had
some equivalent of peek_at_replication(nchanges int).

>> Sounds about right, but I think we need to get religion about figuring
>> out what terminology to use.  At the moment it seems to vary quite a
>> bit between "logical", "logical decoding", and "decoding".  Not sure
>> how to nail that down.
>
> Agreed. Perhaps we should just avoid both logical and decoding entirely
> and go for "changestream" or similar?

So wal_level=changestream?  Not feeling it.  Of course we don't have
to be 100% rigid about this, but we should try to make our terminology
correspond with natural semantic boundaries.  Maybe we should call
the process logical decoding, and the results logical streams, or
something like that.

>> As a more abstract linguistic question, what do we think the
>> difference is between logical *replication* and logical *decoding*?
>> Are they the same or different?  If different, how?
>
> For me "logical decoding" can be the basis of "logical replication", but
> also for other features.

Such as?

>> > I wonder if we should let the output plugin tell us whether it will
>> > output data in binary? I think it generally would be a good idea to let
>> > the output plugin's _init() function return some configuration
>> > data. That will make extending the interface to support more features
>> > easier.
>>
>> Maybe, but you've got to consider the question of encoding, too.  You
>> could make the choices "binary" and "the database encoding", I
>> suppose.
>
> Yes, I think that should be the choice. There seems little justification
> for an output plugin to produce textual output in anything but the
> server encoding.
>
> I am not sure if we want to verify that in !USE_ASSERT_CHECKING builds? That'd be quite
> expensive...

Since it's all C code anyway, it's probably fine to push the
responsibility back onto the output plugin, as long as that's a
documented part of the API contract.

>> > As far as I have been thinking of, this would be another catalog table like
>> > pg_decoding_plugin(oid, dpname name, dpload regproc).
>>
>> Instead of adding another catalog table, I think we should just define
>> a new type.  Again, please look at the way that foreign data wrappers
>> do this:
>
> I don't really see what the usage of a special type has to do with this,
> but I think that's besides your main point. What you're saying is that
> the output plugin is just defined by a function name, possibly schema
> prefixed. That has an elegance to it. +1

Well, file_fdw_handler returns type fdw_handler.  That's nice, because
we can validate that we've got the right sort of object when what we
want is an FDW handler.  If it just returned type internal, it would
be too easy to mix it up with something unrelated that passed back
some other kind of binary goop.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
On 2013-12-12 12:13:24 -0500, Robert Haas wrote:
> On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > If we were to start out streaming changes before the last running
> > transaction has finished, they would be visible in that exported
> > snapshot and you couldn't use it to roll forward from anymore.
> 
> Actually, you could.  You'd just have to throw away any transactions
> whose XIDs are visible to the exported snapshot.  In other words, you
> begin replication at time T0, and all transactions which begin after
> that time are included in the change stream.  At some later time T1,
> all transactions in progress at time T0 have ended, and now you can
> export a snapshot at that time, or any later time, from which you can
> roll forward.  Any change-stream entries for XIDs which would be
> visible to that snapshot shouldn't be replayed when rolling forward
> from it, though.

But that would become too complex an interface, imo without a
corresponding benefit. If you skip the changes when rolling forward,
there's no point in streaming them out in the first place.

> I think it sucks (that's the technical term) to have to wait for all
> currently-running transactions to terminate before being able to begin
> streaming changes, because that could take a long time.

I don't think there's much of an alternative for replication solutions;
for other usecases, we may want to add an option to skip the wait. It's
not like that's something you do all the time. As soon as a slot has been
acquired, there are no further waits.

> And you might
> well know that the long-running transaction which is rolling up
> enormous table A that you don't care about is never going to touch
> table B which you actually want to replicate.  Now, ideally, the DBA
> would have a way to ignore that long-running transaction and force
> replication to start, perhaps with the caveat that if that
> long-running transaction actually does touch B after all then we have
> to resync.

Puh. I honestly have zero confidence in DBAs making an informed decision
about something like this. Honestly, for a replication solution, how
often do you think this will be an issue?

> So imagine this.  After initiating logical replication, a replication
> solution either briefly x-locks a table it wants to replicate, so that
> there can't be anyone else touching it, or it observes who has a lock
> >= RowExclusiveLock and waits for all of those locks to drop away.  At
> that point, it knows that no currently-in-progress transaction can
> have modified the table prior to the start of replication, and begins
> copying the table.  If a transaction that began before the start of
> replication subsequently modifies the table, a WAL record will be
> written, and the core logical decoding support could let the plugin
> know by means of an optional callback (hey, btw, a change I can't
> decode just hit table XYZ).  The plugin will need to respond by
> recopying the table, which sucks, but it was the plugin's decision to
> be optimistic in the first place, and that will in many cases be a
> valid policy decision.  If no such callback arrives before the
> safe-snapshot point, then the plugin made the right bet and will reap
> the just rewards of its optimism.

Sure, all that's possible. But hell, it's complicated to use. If reality
proves people want this, let's go there, but let's get the basics right
and committed first.

All the logic around whether to decode a transaction is:
void
SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
                   int nsubxacts, TransactionId *subxacts)
...
    if (builder->state < SNAPBUILD_CONSISTENT)
    {
        /* ensure that only commits after this are getting replayed */
        if (builder->transactions_after < lsn)
            builder->transactions_after = lsn;

and then

/*
 * Should the contents of a transaction ending at 'ptr' be decoded?
 */
bool
SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
{
    return ptr <= builder->transactions_after;
}

so it's not like it will require all too many changes.

What I can see as possibly getting into 9.4 is a FASTSTART option that
doesn't support exporting a snapshot, but doesn't have to wait for the
SNAPBUILD_CONSISTENT state in return. That's fine for some usecases,
although I don't think for any of the major ones.

> > It's not too difficult to provide an option to do that. What I've been
> > thinking of was to correlate the confirmation of consumption with the
> > transaction the SRF is running in. So, confirm the data as consumed if
> > it commits, and don't if not. I think we could do that relatively easily
> > by registering a XACT_EVENT_COMMIT.
> 
> That's a bit too accident-prone for my taste.  I'd rather the DBA had
> some equivalent of peek_at_replication(nchanges int).

One point for my suggested behaviour is that it closes a bigger
race condition. Currently, as soon as start_logical_replication() has
finished building the tuplestore, it marks the end position as
received. But we can very well fail before the user has received all
those changes.
The only other idea I have to close that situation is to add an explicit
function to confirm receiving the changes, but that sounds icky for
something exposed to SQL.

> >> Sounds about right, but I think we need to get religion about figuring
> >> out what terminology to use.  At the moment it seems to vary quite a
> >> bit between "logical", "logical decoding", and "decoding".  Not sure
> >> how to nail that down.
> >
> > Agreed. Perhaps we should just avoid both logical and decoding entirely
> > and go for "changestream" or similar?
> 
> So wal_level=changestream?  Not feeling it.  Of course we don't have
> to be 100% rigid about this but we should try to make our terminology
> corresponding with natural semantic boundaries.  Maybe we should call
> the process logical decoding, and the results logical streams, or
> something like that.

I am fine with that, but I wouldn't mind some opinions of people knowing
less about the implementation than you and I.

> > For me "logical decoding" can be the basis of "logical replication", but
> > also for other features.
> 
> Such as?

* auditing
* cache invalidation
* concurrent, rewriting ALTER TABLE
* concurrent VACUUM/CLUSTER similar to pg_reorg

> > I don't really see what the usage of a special type has to do with this,
> > but I think that's besides your main point. What you're saying is that
> > the output plugin is just defined by a function name, possibly schema
> > prefixed. That has an elegance to it. +1
> 
> Well, file_fdw_handler returns type fdw_handler.  That's nice, because
> we can validate that we've got the right sort of object when what we
> want is an FDW handler.  If it just returned type internal, it would
> be too easy to mix it up with something unrelated that passed back
> some other kind of binary goop.

My, badly expressed, point is that returning INTERNAL or a specialized
type seems orthogonal to the choice between using a special catalog mapping
the output plugin name to a function returning callbacks and just using
the function's name as the output plugin name.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Robert Haas
Date:
On Thu, Dec 12, 2013 at 1:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Puh. I honestly have zero confidence in DBAs making an informed decision
> about something like this. Honestly, for a replication solution, how
> often do you think this will be an issue?

If you imagine a scenario where somebody establishes a replication
slot and then keeps it forever, not often.  But if you're trying to do
something more ad hoc, where replication slots might be used just for
short periods of time and then abandoned, I think it could come up
pretty frequently.  Generally, I think you're being too dismissive of
the stuff I'm complaining about here.  If we just can't get this, well
then I suppose we can't.  But I think the amount of time that it takes
Hot Standby to open for connections is an issue, precisely because
it's got to wait until certain criteria are met before it can
establish a snapshot, and sometimes that takes an unpleasantly long
time.  I think it unlikely that we can export that logic to this case
also and experience no pain as a result.

In fact, I think that even restricting things to streaming changes
from transactions started after we initiate replication is going to be
an annoying amount of delay for some purposes.  People will accept it
because, no matter how you slice it, this is an awesome new
capability.  Full stop.  That having been said, I don't find it at all
hard to imagine someone wanting to jump into the replication stream at
an arbitrary point in time and see changes from every transaction that
*commits* after that point, even if it began earlier, or even to see
changes from transactions that have not yet committed as they happen.
I realize that's asking for a pony, and I'm not saying you have to go
off and do that right now in order for this to move forward, or indeed
that it will ever happen at all.  What I am saying is that I find it
entirely likely that people are going to push the limits of this
thing, that this is one of the limits I expect them to push, and that
the more we can do to put policy in the hands of the user without
pre-judging the sanity of what they're trying to do, the happier we
(and our users) will be.

>> > It's not too difficult to provide an option to do that. What I've been
>> > thinking of was to correlate the confirmation of consumption with the
>> > transaction the SRF is running in. So, confirm the data as consumed if
>> > it commits, and don't if not. I think we could do that relatively easily
>> > by registering a XACT_EVENT_COMMIT.
>>
>> That's a bit too accident-prone for my taste.  I'd rather the DBA had
>> some equivalent of peek_at_replication(nchanges int).
>
> One point for my suggested behaviour is that it closes a bigger
> race condition. Currently, as soon as start_logical_replication() has
> finished building the tuplestore, it marks the end position as
> received. But we can very well fail before the user has received all
> those changes.

Right.  I think your idea is good, but maybe there should also be a
version of the function that never confirms receipt even if the
transaction commits.  That would be useful for ad-hoc poking at the
queue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
On 2013-12-13 08:30:41 -0500, Robert Haas wrote:
> On Thu, Dec 12, 2013 at 1:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Puh. I honestly have zero confidence in DBAs making an informed decision
> > about something like this. Honestly, for a replication solution, how
> > often do you think this will be an issue?
> 
> If you imagine a scenario where somebody establishes a replication
> slot and then keeps it forever, not often.  But if you're trying to do
> something more ad hoc, where replication slots might be used just for
> short periods of time and then abandoned, I think it could come up
> pretty frequently.

But can you imagine those users needing an exported snapshot? I can
think of several short-lived usages, but all of those are unlikely to
need a consistent view of the overall database. And those are less
likely to be full blown replication solutions.
I.e. it's not the DBA making that decision but the developer making the
decision based on whether he requires the snapshot or not.

> Generally, I think you're being too dismissive of the stuff I'm
> complaining about here.  If we just can't get this, well then I
> suppose we can't.

I think I am just scared of needing to add more features before getting
the basics done and in consequence overrunning 9.4...

> But I think the amount of time that it takes
> Hot Standby to open for connections is an issue, precisely because
> it's got to wait until certain criteria are met before it can
> establish a snapshot, and sometimes that takes an unpleasantly long
> time.

The HS situation is quite different though. There the problem is hit on
every restart, not just the initial start; suboverflowed xl_running_xacts
records can delay HS startup for a long time; and we don't start up at the
exact point where we know an xl_running_xacts record has just been logged.

FWIW I think we need to fix HS to store its visibility state at
restartpoints, so we get rid of that problem. Not sure who will have
time to do that tho.

> That having been said, I don't find it at all
> hard to imagine someone wanting to jump into the replication stream at
> an arbitrary point in time and see changes from every transaction that
> *commits* after that point, even if it began earlier, or even to see
> changes from transactions that have not yet committed as they happen.

I agree they are desirable, but all those will require additional state
to be kept about running transactions. That's not saying it
shouldn't/cannot be done, to the contrary, just that it requires some
engineering effort.

> I realize that's asking for a pony, and I'm not saying you have to go
> off and do that right now in order for this to move forward, or indeed
> that it will ever happen at all.  What I am saying is that I find it
> entirely likely that people are going to push the limits of this
> thing, that this is one of the limits I expect them to push, and that
> the more we can do to put policy in the hands of the user without
> pre-judging the sanity of what they're trying to do, the happier we
> (and our users) will be.

Completely agreed. I really think this the most basic building block,
missing many important features. And we'll be busy for some time adding
those.

> > One point for my suggested behaviour is that it closes a bigger
> > race condition. Currently, as soon as start_logical_replication() has
> > finished building the tuplestore, it marks the end position as
> > received. But we can very well fail before the user has received all
> > those changes.
> 
> Right.  I think your idea is good, but maybe there should also be a
> version of the function that never confirms receipt even if the
> transaction commits.  That would be useful for ad-hoc poking at the
> queue.

Ok, that sounds easy enough, maybe
pg_decoding_slot_get_[binary_]changes()
pg_decoding_slot_peek_[binary_]changes()
?
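
The difference between the two would then just be whether we advance the
slot's confirmed position once the calling transaction commits. Roughly,
and with completely made-up helper names, something like:

/*
 * Sketch only: both SRF variants share the decoding step and differ only
 * in whether the consumed-up-to position is remembered.  None of the
 * helpers below exist (yet).
 */
static void
stream_slot_changes(const char *slotname, bool confirm_on_commit)
{
    XLogRecPtr  consumed_upto;

    LogicalDecodingSlotAcquire(slotname);

    /* decode from the slot's last confirmed position into the tuplestore */
    consumed_upto = DecodeChangesIntoTuplestore(slotname);

    if (confirm_on_commit)
    {
        /*
         * get_changes(): remember the new position, but only persist it
         * once the surrounding transaction commits, so a failed SELECT
         * doesn't lose changes.
         */
        LogicalDecodingSlotConfirmAtCommit(consumed_upto);
    }

    /* peek_changes(): slot untouched, the same changes can be re-read */
    LogicalDecodingSlotRelease();
}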

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Robert Haas
Date:
On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> If you imagine a scenario where somebody establishes a replication
>> slot and then keeps it forever, not often.  But if you're trying to do
>> something more ad hoc, where replication slots might be used just for
>> short periods of time and then abandoned, I think it could come up
>> pretty frequently.
>
> But can you imagine those users needing an exported snapshot? I can
> think of several short-lived usages, but all of those are unlikely to
> need a consistent view of the overall database. And those are less
> likely to be full blown replication solutions.
> I.e. it's not the DBA making that decision but the developer making the
> decision based on whether he requires the snapshot or not.

Well, it still seems to me that the right way to think about this is
that the change stream begins at a certain point, and then once you
cross a certain threshold (all transactions in progress at that time
have ended) any subsequent snapshot is a possible point from which to
roll forward.  You'll need to avoid applying any transactions that are
already visible in the snapshot, but I don't really think that's
any great matter.  You're focusing on the first point at which the
consistent snapshot can be taken, and on throwing away any logical
changes that might have been available before that point so that they
don't have to be ignored in the application code, but I think that's
myopic.

For example, suppose somebody is replicating tables from node A to node
B.  And then they decide to replicate some of the same tables to node
C.  Well, one way to do this is to have node C connect to node A and
acquire its own slot, but that means decoding everything twice.
Alternatively, you could reuse the same change stream, but you'll need
a new snapshot to roll forward from.  That doesn't seem like a problem
unless the API makes it a problem.

>> Generally, I think you're being too dismissive of the stuff I'm
>> complaining about here.  If we just can't get this, well then I
>> suppose we can't.
>
> I think I am just scared of needing to add more features before getting
> the basics done and in consequence overrunning 9.4...

I am sensitive to that.  On the other hand, this API is going to be a
lot harder to change once it's released, so we really need to avoid
painting ourselves into a corner with v1.  As far as high-level design
concerns go, there are three things that I'm not happy with:

1. Slots.  We know we need physical slots as well as logical slots,
but the patch as currently constituted only offers logical slots.
2. Snapshot Management.  This issue.
3. Incremental Decoding.  So that we can begin applying a really big
transaction speculatively before it's actually committed.

I'm willing to completely punt #3 as far as 9.4 is concerned, because
I see a pretty clear path to fixing that later.  I am not yet
convinced that either of the other two can or should be postponed.

>> Right.  I think your idea is good, but maybe there should also be a
>> version of the function that never confirms receipt even if the
>> transaction commits.  That would be useful for ad-hoc poking at the
>> queue.
>
> Ok, that sounds easy enough, maybe
> pg_decoding_slot_get_[binary_]changes()
> pg_decoding_slot_peek_[binary_]changes()
> ?

s/pg_decoding_slot/pg_logical_stream/?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
On 2013-12-14 11:50:00 -0500, Robert Haas wrote:
> On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > But can you imagine those users needing an exported snapshot? I can
> > think of several short-lived usages, but all of those are unlikely to
> > need a consistent view of the overall database. And those are less
> > likely to be full blown replication solutions.
> > I.e. it's not the DBA making that decision but the developer making the
> > decision based on whether he requires the snapshot or not.
> 
> Well, it still seems to me that the right way to think about this is
> that the change stream begins at a certain point, and then once you
> cross a certain threshold (all transactions in progress at that time
> have ended) any subsequent snapshot is a possible point from which to
> roll forward.

Unfortunately it's not possible to build exportable snapshots at any
time - it requires keeping far more state around, since we need to care
about all transactions, not just transactions touching the catalog.
Currently you can only export a snapshot at the one point where we
become consistent; after that we stop maintaining that state.

I see pretty much no chance of changing that without major effort
that's imo clearly out of scope for 9.4. Maybe we want to provide that
at some point, but it's going to be a while.
Also note that in order to import a snapshot, the transaction that
exported it still has to be alive and holding that snapshot. So, until
the snapshot has been imported, replication cannot progress. Those are
existing limitations of the snapshot import code, but the reasoning
behind them makes it unlikely that we can easily change them.

In that light, use cases like your example won't be able to benefit
from getting changes before the snapshot, or do you still see use cases
for that?

I've already prototyped the "quickstart" option that doesn't allow
exporting a snapshot, although I don't know how the UI for it is going
to look.

> > I think I am just scared of needing to add more features before getting
> > the basics done and in consequence overrunning 9.4...
> 
> I am sensitive to that.  On the other hand, this API is going to be a
> lot harder to change once it's released, so we really need to avoid
> painting ourselves into a corner with v1.

I think we need to accept the fact that quite possibly we won't get
everything right in 9.4. Obviously we should strive to avoid that, but
trying to design things perfectly usually ends in not getting there.
Yes, the API changes for FDWs caused pain. But there's no way we would
have the current features if we'd tried to introduce the correct API
allowing them from the get-go.

>  As far as high-level design
> concerns go, there are three things that I'm not happy with:
> 
> 1. Slots.  We know we need physical slots as well as logical slots,
> but the patch as currently constituted only offers logical slots.

Well, then tell me the way you want to go forward on that end. I can
make the slot interface more generic if we know exactly what we need,
but it doesn't seem fair to take this patch hostage until I develop a
separate, not-so-small feature. Why is that my task?
Because I think it's important, and because by now I know the related
code pretty well, I am willing to provide the parts that prevent
required WAL from being deleted, peg xmin, and report the current state
to SQL, but somebody else is going to have to do the rest.

> s/pg_decoding_slot/pg_logical_stream/?

Works for me.

Regards,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Jim Nasby
Date:
On 12/12/13 11:13 AM, Robert Haas wrote:
> I think it sucks (that's the technical term) to have to wait for all
> currently-running transactions to terminate before being able to begin
> streaming changes, because that could take a long time.  And you might
> well know that the long-running transaction which is rolling up
> enormous table A that you don't care about is never going to touch
> table B which you actually want to replicate.  Now, ideally, the DBA
> would have a way to ignore that long-running transaction and force
> replication to start, perhaps with the caveat that if that
> long-running transaction actually does touch B after all then we have
> to resync.  Your model's fine when we want to replicate the whole
> database, but a big part of why I want this feature is to allow
> finer-grained replication, down to the table level, or even slices of
> tables.

I know you're not going to attempt this for 9.4, but I want to mention a related case here. I've often wanted the
ability to limit the tables a transaction can touch, so that it will not interfere with vacuuming other tables.

This would be useful when you have some tables that see very frequent updates/deletes in a database that also has to
support long-running transactions that don't hit those tables. You'd explicitly limit the tables your long-running
transaction will touch, and that way vacuum can ignore the long-running XID when calculating minimum tuple age for the
heavy-hit tables.

If we had that capability it could also be used to improve the time required to get a snapshot for a limited set of
tables.
-- 
Jim C. Nasby, Data Architect                       jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net



Re: Changeset Extraction Interfaces

From
Robert Haas
Date:
On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-14 11:50:00 -0500, Robert Haas wrote:
>> On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > But can you imagine those users needing an exported snapshot? I can
>> > think of several short-lived usages, but all of those are unlikely to
>> > need a consistent view of the overall database. And those are less
>> > likely to be full blown replication solutions.
>> > I.e. it's not the DBA making that decision but the developer making the
>> > decision based on whether he requires the snapshot or not.
>>
>> Well, it still seems to me that the right way to think about this is
>> that the change stream begins at a certain point, and then once you
>> cross a certain threshold (all transactions in progress at that time
>> have ended) any subsequent snapshot is a possible point from which to
>> roll forward.
>
> Unfortunately it's not possible to build exportable snapshots at any
> time - it requires keeping far more state around since we need to care
> about all transactions, not just transactions touching the
> catalog. Currently you can only export the snapshot in the one point we
> become consistent, after that we stop maintaining that state.

I don't get it.  Once all the old transactions are gone, I don't see
why you need any state at all to build an exportable snapshot.  Just
take a snapshot.

>> 1. Slots.  We know we need physical slots as well as logical slots,
>> but the patch as currently constituted only offers logical slots.
>
> Well, then tell me the way you want to go forward on that end. I can
> make the slot interface more generic if we know exactly what we need,
> but I doesn't seem fair to take this patch hostage until I develop a
> separate not so small feature. Why is that my task?
> Because I think it's important, and because by now I know the related
> code pretty well by now, I am willing to provide the parts of the that
> prevent required WAL from being deleted, peg xmin and report the current
> state to SQL, but somebody else is going to have to the rest.

The part that you're expressing willingness to do sounds entirely
satisfactory to me.  As I mentioned on the other thread, I'm perhaps
even willing to punt that feature entirely provided that we have a
clear design for how to add it later, but I think it'd be nicer to get
it done now.

And just for the record, I think the idea that I am holding this patch
hostage is absurd.  I have devoted a large amount of time and energy
to moving this forward and plan to devote more.  Because of that work,
big chunks of what is needed here are already committed.  If my secret
plan is to make it as difficult as possible for you to get this
committed, I'm playing a deep game.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
On 2013-12-16 23:01:16 -0500, Robert Haas wrote:
> On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-12-14 11:50:00 -0500, Robert Haas wrote:
> >> Well, it still seems to me that the right way to think about this is
> >> that the change stream begins at a certain point, and then once you
> >> cross a certain threshold (all transactions in progress at that time
> >> have ended) any subsequent snapshot is a possible point from which to
> >> roll forward.
> >
> > Unfortunately it's not possible to build exportable snapshots at any
> > time - it requires keeping far more state around since we need to care
> > about all transactions, not just transactions touching the
> > catalog. Currently you can only export the snapshot in the one point we
> > become consistent, after that we stop maintaining that state.
> 
> I don't get it.  Once all the old transactions are gone, I don't see
> why you need any state at all to build an exportable snapshot.  Just
> take a snapshot.

The state we're currently decoding, somewhere in already fsynced WAL,
won't correspond to the state in the procarray. There might be
situations where it will, but we can't guarantee that we ever reach that
point without taking locks that will be problematic.

> The part that you're expressing willingness to do sounds entirely
> satisfactory to me.  As I mentioned on the other thread, I'm perhaps
> even willing to punt that feature entirely provided that we have a
> clear design for how to add it later, but I think it'd be nicer to get
> it done now.

We'll see what the next version looks like. Not sure on that myself yet
;)

> And just for the record, I think the idea that I am holding this patch
> hostage is absurd.  I have devoted a large amount of time and energy
> to moving this forward and plan to devote more.  Because of that work,
> big chunks of what is needed here are already committed. If my secret
> plan is to make it as difficult as possible for you to get this
> committed, I'm playing a deep game.

I am not saying at all that you're planning to stop the patch from
getting in. You've delivered pretty clear proof that that's not the
case.
But that doesn't prevent us from arguing over details and disagreeing
about whether they are dealbreakers or not, does it ;)

I think you know that I am hugely grateful for the work you've put into
the topic.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Robert Haas
Date:
On Tue, Dec 17, 2013 at 4:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-16 23:01:16 -0500, Robert Haas wrote:
>> On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2013-12-14 11:50:00 -0500, Robert Haas wrote:
>> >> Well, it still seems to me that the right way to think about this is
>> >> that the change stream begins at a certain point, and then once you
>> >> cross a certain threshold (all transactions in progress at that time
>> >> have ended) any subsequent snapshot is a possible point from which to
>> >> roll forward.
>> >
>> > Unfortunately it's not possible to build exportable snapshots at any
>> > time - it requires keeping far more state around since we need to care
>> > about all transactions, not just transactions touching the
>> > catalog. Currently you can only export the snapshot in the one point we
>> > become consistent, after that we stop maintaining that state.
>>
>> I don't get it.  Once all the old transactions are gone, I don't see
>> why you need any state at all to build an exportable snapshot.  Just
>> take a snapshot.
>
> The state we're currently decoding, somewhere in already fsynced WAL,
> won't correspond to the state in the procarray. There might be
> situations where it will, but we can't guarantee that we ever reach that
> point without taking locks that will be problematic.

You don't need to guarantee that.  Just take a current snapshot and
then throw away (or don't decode in the first place) any transactions
that would be visible to that snapshot.  This is simpler and more
flexible, and possibly more performant, too, because with your design
you'll have to hold back xmin to the historical snapshot you build
while copying the table rather than to a current snapshot.

I really think we should consider whether we can't get by with ripping
out the build-an-exportable-snapshot code altogether.  I don't see
that it's really buying us much.  We need a way for the client to know
when decoding has reached the point where it is guaranteed complete -
i.e. all transactions in progress at the time decoding was initiated
have ended.  We also need a way for a backend performing decoding to
take a current MVCC snapshot, export it, and send the identifier to
the client.  And we need a way for the client to know whether any
given one of those snapshots includes a particular XID we may have
decoded.  But I think all of that might still be simpler than what you
have now, and it's definitely more flexible.
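
Concretely, the check the client needs is no more than this (a rough
sketch; it ignores subtransactions and assumes the decoded stream only
ever reports committed top-level XIDs):

/* uses Snapshot from utils/snapshot.h and xid helpers from access/transam.h */
static bool
xid_needs_apply(TransactionId xid, Snapshot copy_snapshot)
{
    uint32      i;

    if (TransactionIdPrecedes(xid, copy_snapshot->xmin))
        return false;       /* committed before the copy, already included */
    if (TransactionIdFollowsOrEquals(xid, copy_snapshot->xmax))
        return true;        /* only started after the snapshot was taken */
    for (i = 0; i < copy_snapshot->xcnt; i++)
    {
        if (TransactionIdEquals(xid, copy_snapshot->xip[i]))
            return true;    /* was still running when the copy was made */
    }
    return false;           /* visible to the copy's snapshot, skip it */
}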

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
> On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think there'll always be a bit of a difference between slots for
> > physical and logical data, even if 90% of the implementation is the
> > same. We can signal that difference by specifying logical/physical as an
> > option or having two different sets of commands.
> >
> > Maybe?
> >
> > ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
> > ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
> > -- already exists without slot, PHYSICAL arguments
> > START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
> > START_REPLICATION SLOT LOGICAL slot plugin_options
> > RELEASE_REPLICATION_SLOT slot_name
> 
> I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
> but basically this seems OK to me.

When writing the code for this, I decided that I need to renege a bit
on those names - they don't work nicely enough at the C level for me.
Specifically, during a START_REPLICATION we need to temporarily mark the
slot as being actively used and mark it unused again afterwards. That's
much more Acquire/Release-like than the persistent Acquire/Release
above.

The C names in the version I am working on currently are:
extern void ReplicationSlotCreate(const char *name);
extern void ReplicationSlotDrop(const char *name);
extern void ReplicationSlotAcquire(const char *name);
extern void ReplicationSlotRelease(void);
extern void ReplicationSlotSave(void);

which would make the walsender ones

CREATE_REPLICATION_SLOT ...
START_REPLICATION [SLOT slot] [LOGICAL | PHYSICAL] ...
DROP_REPLICATION_SLOT ...

where START_REPLICATION internally does acquire/release on the passed
SLOT.
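
I.e. on the walsender side, streaming would bracket itself with
acquire/release, very roughly like this (only the ReplicationSlot*
functions above are real, the streaming helper is made up):

static void
StartLogicalReplication(const char *slotname, XLogRecPtr startptr,
                        List *plugin_options)
{
    /* mark the slot as in use only for the duration of this command */
    ReplicationSlotAcquire(slotname);

    PG_TRY();
    {
        StreamLogicalChangesUntilShutdown(startptr, plugin_options);
    }
    PG_CATCH();
    {
        /* make sure the slot becomes usable again after an error */
        ReplicationSlotRelease();
        PG_RE_THROW();
    }
    PG_END_TRY();

    ReplicationSlotRelease();
}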

Does that work for you?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Robert Haas
Date:
On Fri, Jan 3, 2014 at 10:12 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
>> On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > I think there'll always be a bit of a difference between slots for
>> > physical and logical data, even if 90% of the implementation is the
>> > same. We can signal that difference by specifying logical/physical as an
>> > option or having two different sets of commands.
>> >
>> > Maybe?
>> >
>> > ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
>> > ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
>> > -- already exists without slot, PHYSICAL arguments
>> > START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
>> > START_REPLICATION SLOT LOGICAL slot plugin_options
>> > RELEASE_REPLICATION_SLOT slot_name
>>
>> I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
>> but basically this seems OK to me.
>
> When writing the code for this, I decided that I need to reneg a bit on
> those names - they don't work nicely enough on the C level for
> me. Specifically during a START_REPLICATION we need to temporarily mark
> the slot as being actively used and mark it unused again
> afterwards. That's much more Acquire/Release like than the persistent
> Acquire/Release above for me.
>
> The C names in the version I am working on currently are:
> extern void ReplicationSlotCreate(const char *name);
> extern void ReplicationSlotDrop(const char *name);
> extern void ReplicationSlotAcquire(const char *name);
> extern void ReplicationSlotRelease(void);
> extern void ReplicationSlotSave(void);
>
> which would make the walsender ones
>
> CREATE_REPLICATION_SLOT ...
> START_REPLICATION [SLOT slot] [LOGICAL | PHYSICAL] ...
> DROP_REPLICATION_SLOT ...
>
> where START_REPLICATION internally does acquire/release on the passed
> SLOT.
>
> Does that work for you?

Yep, no objections.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
On 2013-12-12 16:49:33 +0100, Andres Freund wrote:
> On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
> > On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > As far as I have been thinking of, this would be another catalog table like
> > > pg_decoding_plugin(oid, dpname name, dpload regproc).
> > 
> > Instead of adding another catalog table, I think we should just define
> > a new type.  Again, please look at the way that foreign data wrappers
> > do this:
> 
> I don't really see what the usage of a special type has to do with this,
> but I think that's besides your main point. What you're saying is that
> the output plugin is just defined by a function name, possibly schema
> prefixed. That has an elegance to it. +1

Ok, so I've implemented this, but I am not so sure it's sufficient;
there's an issue:
Currently a logical replication slot has a plugin assigned; previously
that was just identified by the basename of a .so. But with the above
proposal the identifier points to a function, currently via its oid. But
what happens if somebody drops or recreates the function? We can't make
pg_depend entries or anything, since that won't work on a standby.
Earlier, if somebody removed the .so we'd just error out, but pg's
dependency tracking has only ever covered objects inside the catalogs.

I see the following possible solutions for this:

1) accept that fact, and throw an error if the function doesn't exist
anymore, or has an unsuitable signature. We can check that the function
returns the output_plugin_callbacks type, so that's a pretty specific
test; a sketch of that check follows below, after 3).

2) Create a pg_output_plugin catalog and prevent DROP OUTPUT PLUGIN (or
similar) when there's a slot defined. But how'd that work if the slot is
only defined on standbys? We could have the redo routine block and/or
kill the slot if necessary?

3) Don't assign a specific output plugin to a slot, but have it
specified every time data is streamed, not just when a slot is
created. Currently that wouldn't be a problem, but I am afraid it will
constrain some future optimizations.
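
For 1), the check could be about as simple as the sketch below; the
output_plugin_callbacks type and its oid are of course hypothetical at
this point:

static Oid
lookup_output_plugin_function(const char *qualified_name)
{
    List       *names = stringToQualifiedNameList(qualified_name);
    Oid         fargtypes[1];   /* dummy, the function takes no arguments */
    Oid         funcoid;

    /* errors out if the function has been dropped in the meantime */
    funcoid = LookupFuncName(names, 0, fargtypes, false);

    if (get_func_rettype(funcoid) != OUTPUT_PLUGIN_CALLBACKS_TYPE_OID)
        ereport(ERROR,
                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                 errmsg("function \"%s\" does not return type output_plugin_callbacks",
                        qualified_name)));

    return funcoid;
}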

Good ideas?

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Changeset Extraction Interfaces

From
Andres Freund
Date:
On 2014-01-07 17:54:21 +0100, Andres Freund wrote:
> On 2013-12-12 16:49:33 +0100, Andres Freund wrote:
> > On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
> > > On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > > As far as I have been thinking of, this would be another catalog table like
> > > > pg_decoding_plugin(oid, dpname name, dpload regproc).
> > > 
> > > Instead of adding another catalog table, I think we should just define
> > > a new type.  Again, please look at the way that foreign data wrappers
> > > do this:
> > 
> > I don't really see what the usage of a special type has to do with this,
> > but I think that's besides your main point. What you're saying is that
> > the output plugin is just defined by a function name, possibly schema
> > prefixed. That has an elegance to it. +1
> 
> Ok, so I've implemented this, but I am not so sure it's sufficient,
> there's some issue:
> Currently a logical replication slot has a plugin assigned, previously
> that has just been identified by the basename of a .so. But with the
> above proposal the identifier is pointing to a function, currently via
> its oid. But what happens if somebody drops or recreates the function?
> We can't make pg_depend entries or anything since that won't work on a
> standby.
> Earlier, if somebody removed the .so we'd just error out, but pg's
> dependency tracking always only mattered to things inside the catalogs.
> 
> I see the following possible solutions for this:
> 
> 1) accept that fact, and throw an error if the function doesn't exist
> anymore, or has an unsuitable signature. We can check the return type of
> output_plugin_callbacks, so that's a pretty specific test.
> 
> 2) Create a pg_output_plugin catalog and prevent DROP OUTPUT PLUGIN (or
> similar) when there's a slot defined. But how'd that work if the slot is
> only defined on standbys? We could have the redo routine block and/or
> kill the slot if necessary?
> 
> 3) Don't assign a specific output plugin to a slot, but have it
> specified everytime data is streamed, not just when a slot is
> created. Currently that wouldn't be a problem, but I am afraid it will
> constrict some future optimizations.
> 
> Good ideas?

So, Robert and I had an IM discussion about this. Neither of us was
particularly happy with the proposed solutions.

What we concluded is that using a function as the handler doesn't work
out well enough given the constraints (primarily the inability to
create dependency records on a HS node). The best way forward is a
variant of the current implementation where the output plugin is
specified as a dynamic library:
CREATE_REPLICATION_SLOT slot_name LOGICAL OUTPUT_PLUGIN library_name;
but in contrast to the current code, where each individual output plugin
callback is dlsym()ed via a fixed function name, only a
_PG_output_plugin_init() function is looked up and called, which fills
out a struct containing the individual callbacks. Additionally, the
"init" and "cleanup" output plugin callbacks will be renamed to
startup/shutdown to avoid possible confusion.
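
The contract would then look roughly like this; struct layout, typedef
and callback signatures below are a sketch and not final:

/* filled out by the plugin; "startup" and "shutdown" may be left NULL */
typedef struct OutputPluginCallbacks
{
    void        (*startup_cb) (void *ctx, bool is_init);
    void        (*begin_cb) (void *ctx, void *txn);
    void        (*change_cb) (void *ctx, void *txn,
                              Relation relation, void *change);
    void        (*commit_cb) (void *ctx, void *txn);
    void        (*shutdown_cb) (void *ctx);
} OutputPluginCallbacks;

typedef void (*OutputPluginInit) (OutputPluginCallbacks *cb);

/* in the plugin .so: the single symbol core ever looks up */
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
    cb->startup_cb = my_startup;    /* my_* are the plugin's own functions */
    cb->begin_cb = my_begin;
    cb->change_cb = my_change;
    cb->commit_cb = my_commit;
    cb->shutdown_cb = my_shutdown;
}

/* in core, when the slot's plugin gets loaded: */
OutputPluginCallbacks callbacks;
OutputPluginInit plugin_init;

plugin_init = (OutputPluginInit)
    load_external_function(library_name, "_PG_output_plugin_init",
                           true, NULL);
plugin_init(&callbacks);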

This unfortunately still prohibits implementing output plugins within
the core postgres binary, but that can be solved by shipping
core-provided output plugins (should they ever exist) as shared objects,
as is already done for libpqwalreceiver.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services