Thread: Changeset Extraction Interfaces
Hi,

Short recap: From the perspective of the user interface, the changeset extraction feature consists of two abstract interfaces that the "user" has to deal with:

1) The "slot" or "changestream" management interface, which manages individual streams of changes. The user can create and destroy a changestream and, most importantly, stream the changes. Simplified, a "logical replication slot" is a position in the WAL and a bunch of state associated with it. As long as a slot exists, the user can ask for all changes that have happened since the last time he asked to be streamed out.

It is abstract because different usecases require the changes to be streamed out via different methods. The series contains two implementations of that interface:

I) One integrated into walsender that allows for efficient streaming, including support for synchronous replication.

II) Another that is accessible via SQL functions, very useful for writing pg_regress/isolationtester tests.

It is possible, with a relatively low amount of code, to add other such interfaces without touching core code. One example, which has been asked for by a number of people, is consuming the changestream in a background worker without involving SQL or connecting to a walsender.

There are basically three major 'verbs' that can be performed on a stream, currently named (walsender names):
* INIT_LOGICAL_REPLICATION "name" "output_plugin"
* START_LOGICAL_REPLICATION "name" last_received ("option_name" value,...)
* FREE_LOGICAL_REPLICATION "name"

The SQL variant currently has:
* init_logical_replication(name, plugin)
* start_logical_replication(name, stream_upto, options[])
* stop_logical_replication(name)

You might have noticed the slight inconsistency...

2) The "output plugin" interface, which transforms a changestream (begin, change, commit) into the desired target format.
There are 5 callbacks, 3 of them obligatory:
* pg_decode_init(context, is_initial) [optional]
* pg_decode_begin(context, txn)
* pg_decode_change(context, txn, relation, change)
* pg_decode_commit(context, txn)
* pg_decode_cleanup(context) [optional]

Every output plugin can be used from every slot management interface.

The current pain points, that I'd like to discuss, are:

a) Better naming for the slot management between walsender, SQL and possible future interfaces.

b) Decide which of the SQL functions should be in a contrib module, and which in core. Currently init_logical_replication() and stop_logical_replication() are in core, whereas start_logical_replication() is in the 'test_logical_decoding' extension. The reasoning behind that is that the init/stop ones are important to the DBA and the start_logical_replication() SRF isn't all that useful in the real world because our SRFs don't support streaming changes out.

c) Which data types does start_logical_replication() return? Currently it's OUT location text, OUT xid bigint, OUT data text. Making the 'data' column text has some obvious disadvantages though - there are obvious usecases for output plugins that return binary data. But making it bytea sucks, because the output is harder to read by default...

d) How does a slot acquire the callbacks of an output plugin?

For a), my current feeling is to name them:
* LOGICAL_DECODING_SLOT_CREATE/pg_logical_decoding_slot_create()
* LOGICAL_DECODING_SLOT_STREAM/pg_logical_decoding_slot_extract()
* LOGICAL_DECODING_SLOT_DESTROY/pg_logical_decoding_slot_destroy()
with an intentional discrepancy between stream and extract, to make the difference obvious. One day we might have the facility - which would be rather cool - to do the streaming from SQL as well. Better ideas? Leave out the "logical"?

For b), I am happy with that split, I would just like others to comment.

For c), I have no better idea than two functions.

d) is my main question, and Robert, Peter G.
and I previously argued about it a fair bit. I know of the following alternatives:

I) The output plugin that's specified in INIT_LOGICAL_REPLICATION is actually a library name, and we simply look up the fixed symbol names in it. That's what's currently implemented.
The advantage is that it's pretty easy to implement, works on a HS standby without involving the primary, and doesn't have a problem if the library is used in shared_preload_libraries.
The disadvantages are: all output plugins need to be shared libraries, and there can only be one output plugin per shared library (although you could route differently, via options, but ugh).

II) Keep the output plugin a library, but only look up a _PG_init_output_plugin() which registers/returns the callbacks. Pretty much the same tradeoffs as I).

III) Keep the output plugin a library, but simply rely on _PG_init() calling a function to register all callbacks. Imo it's worse than I) and II) because it basically prohibits using the library in shared_preload_libraries as well, because then its _PG_init() doesn't get called when starting to stream, and another library might have registered other callbacks.

IV) Make output plugins a SQL-level object/catalog table where a plugin can be registered, and the callbacks are normal pg_proc entries. It's more in line with other stuff, but has the disadvantage that we need to register plugins on the primary, even if we only stream from a standby. But then, we're used to that with CREATE EXTENSION et al.

I personally lean towards I), followed by II) and IV). Comments?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
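[For concreteness, option II could be sketched roughly as below. The struct layout and names here are invented for illustration; they are not from the patch. Core would dlsym() the one fixed symbol and call it once to obtain all callbacks, instead of looking up one symbol per callback.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Opaque stand-ins for the real decoding types. */
typedef struct DecodingContext DecodingContext;
typedef struct ReorderBufferTXN ReorderBufferTXN;

/* Callback table an output plugin hands back to core. */
typedef struct OutputPluginCallbacks
{
	void		(*begin_cb) (DecodingContext *ctx, ReorderBufferTXN *txn);
	void		(*change_cb) (DecodingContext *ctx, ReorderBufferTXN *txn);
	void		(*commit_cb) (DecodingContext *ctx, ReorderBufferTXN *txn);
} OutputPluginCallbacks;

static void demo_begin(DecodingContext *ctx, ReorderBufferTXN *txn)  { printf("BEGIN\n"); }
static void demo_change(DecodingContext *ctx, ReorderBufferTXN *txn) { printf("CHANGE\n"); }
static void demo_commit(DecodingContext *ctx, ReorderBufferTXN *txn) { printf("COMMIT\n"); }

/*
 * Under option II, core looks up this single fixed symbol in the library
 * named in INIT_LOGICAL_REPLICATION and calls it to fill in the
 * callback table.
 */
void
_PG_init_output_plugin(OutputPluginCallbacks *cb)
{
	cb->begin_cb = demo_begin;
	cb->change_cb = demo_change;
	cb->commit_cb = demo_commit;
}
```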
On Wed, Dec 4, 2013 at 7:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> There's basically three major 'verbs' that can be performed on a
> stream, currently named (walsender names):
> * INIT_LOGICAL_REPLICATION "name" "output_plugin"
> * START_LOGICAL_REPLICATION "name" last_received ("option_name" value,...)
> * FREE_LOGICAL_REPLICATION "name"
>
> The SQL variant currently has:
> * init_logical_replication(name, plugin)
> * start_logical_replication(name, stream_upto, options[])
> * stop_logical_replication(name)
>
> You might have noticed the slight inconsistency...

I think this naming is probably not the greatest. When I hear "init", I don't think "permanently allocate a resource that will never be released unless I explicitly throw it away", and when I hear "stop", I don't think "free that resource". I suggest naming based around create/drop, register/unregister, or acquire/release. Since, as previously noted, I'm gunning for these slots to apply to ordinary replication as well, I kind of like ACQUIRE_REPLICATION_SLOT and RELEASE_REPLICATION_SLOT. If we're going to make them specific to logical replication, then your suggestion of CREATE_DECODING_SLOT and DROP_DECODING_SLOT, or maybe ACQUIRE_DECODING_SLOT and RELEASE_DECODING_SLOT, sounds fine.

It also strikes me that just as it's possible to stream WAL without allocating a slot first (since we don't at present have slots), perhaps it ought also to be possible to stream logical replication data without acquiring a slot first. You could argue that it was a mistake not to introduce slots in the first place, but the stateless nature of WAL streaming definitely has some benefits, and it's unclear to me why you shouldn't be able to do the same thing with logical decoding.

> b) Decide which of the SQL functions should be in a contrib module, and
> which in core.
> Currently init_logical_replication() and
> stop_logical_replication() are in core, whereas
> start_logical_replication() is in the 'test_logical_decoding'
> extension. The reasoning behind that is that init/stop ones are
> important to the DBA and the start_logical_replication() SRF isn't
> all that useful in the real world because our SRFs don't support
> streaming changes out.

Seems pretty arbitrary to me. If start_logical_replication() is usable with any output plugin, then it ought to be in core. I think the name isn't great, though; the actual functionality seems to be more or less decode-from-last-position-up-to-present, which doesn't sound much like "start".

> c) Which data types does start_logical_replication() return? Currently
> it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
> column text has some obvious disadvantages though - there's obvious
> usecases for output plugins that return binary data. But making it bytea
> sucks, because the output is harder to read by default...

I think having two functions might be sensible. I'm not sure what happens if the text function is used and the plugin outputs something that's not valid in the database encoding, though. I guess you'd better check for that and error out.

> d) is my main question, and Robert, Peter G. and I previously argued
> about it a fair bit. I know of the following alternatives:
>
> I) The output plugin that's specified in INIT_LOGICAL_REPLICATION is
> actually a library name, and we simply look up the fixed symbol names in
> it. That's what's currently implemented.
> The advantage is that it's pretty easy to implement, works on a HS
> standby without involving the primary, and doesn't have a problem if the
> library is used in shared_preload_libraries.
> The disadvantages are: all output plugins need to be shared libraries
> and there can only be one output plugin per shared library (although you
> could route differently, via options, but ugh).

I still dislike this.
> II) Keep the output plugin a library, but only look up a
> _PG_init_output_plugin() which registers/returns the callbacks. Pretty
> much the same tradeoffs as I).
>
> III) Keep the output plugin a library, but simply rely on _PG_init()
> calling a function to register all callbacks. Imo it's worse than I) and
> II) because it basically prohibits using the library in
> shared_preload_libraries as well, because then its _PG_init() doesn't
> get called when starting to stream, and another library might have
> registered other callbacks.

I don't understand the disadvantage that you're describing here. What I'm imagining is that you have some struct that looks like this:

struct output_plugin
{
    char name[64];
    void (*begin) (args);
    void (*change) (args);
    void (*commit) (args);
};

Now you provide a function RegisterOutputPlugin(output_plugin *). If there are any output plugins built into core, core will call RegisterOutputPlugin once for each one. If a shared library containing an output plugin is loaded, the library's _PG_init function does the same thing. When someone tries to use a plugin, they ask for it by name. We go iterate through the data saved by all previous calls to RegisterOutputPlugin() until we find one with a matching name, and then we use the callbacks embedded in that struct.

It doesn't matter when _PG_init() runs, except that it had better run before the output plugin is requested by name. It also doesn't matter that another library may have registered other callbacks, because we use the name to decide which output plugin to call, from among all those we know about.

> IV) Make output plugins a SQL-level object/catalog table where a plugin
> can be registered, and the callbacks are normal pg_proc entries. It's
> more in line with other stuff, but has the disadvantage that we need to
> register plugins on the primary, even if we only stream from a
> standby. But then, we're used to that with CREATE EXTENSION et al.
I don't think I'd make every callback a pg_proc entry; I'd make a single pg_proc entry that returns a struct of function pointers, as we do for FDWs. But I think this has merit. One significant advantage of it over (III) is that execution of a function in pg_proc can trigger a library load without any extra pushups, which is nice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
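[The registry Robert describes above could be wired up along these lines. This is a minimal illustrative sketch, not the patch: a real version would live in core, use PostgreSQL's actual callback signatures and error reporting, and have no fixed-size limit.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct output_plugin
{
	char		name[64];
	void		(*begin) (void);
	void		(*change) (void);
	void		(*commit) (void);
} output_plugin;

#define MAX_OUTPUT_PLUGINS 32

static output_plugin *registered_plugins[MAX_OUTPUT_PLUGINS];
static int	n_registered = 0;

/* Called by core for built-in plugins, and by a library's _PG_init(). */
void
RegisterOutputPlugin(output_plugin *plugin)
{
	assert(n_registered < MAX_OUTPUT_PLUGINS);
	registered_plugins[n_registered++] = plugin;
}

/*
 * Resolve a plugin by name when a slot asks for it.  It doesn't matter
 * when _PG_init() ran, as long as it ran before this lookup.
 */
output_plugin *
FindOutputPlugin(const char *name)
{
	int			i;

	for (i = 0; i < n_registered; i++)
		if (strcmp(registered_plugins[i]->name, name) == 0)
			return registered_plugins[i];
	return NULL;
}
```

A plugin library's _PG_init() would simply call RegisterOutputPlugin() on a statically allocated struct; lookup by name then works regardless of how or when the library was loaded.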
Hello Robert,

On 2013-12-11 22:29:46 -0500, Robert Haas wrote:
> On Wed, Dec 4, 2013 at 7:15 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > There's basically three major 'verbs' that can be performed on a
> > stream, currently named (walsender names):
> > * INIT_LOGICAL_REPLICATION "name" "output_plugin"
> > * START_LOGICAL_REPLICATION "name" last_received ("option_name" value,...)
> > * FREE_LOGICAL_REPLICATION "name"
> >
> > The SQL variant currently has:
> > * init_logical_replication(name, plugin)
> > * start_logical_replication(name, stream_upto, options[])
> > * stop_logical_replication(name)
> >
> > You might have noticed the slight inconsistency...
>
> I think this naming is probably not the greatest.

Completely agreed.

> When I hear "init",
> I don't think "permanently allocate a resource that will never be
> released unless I explicitly throw it away", and when I hear "stop", I
> don't think "free that resource". I suggest naming based around
> create/drop, register/unregister, or acquire/release. Since, as
> previously noted, I'm gunning for these slots to apply to ordinary
> replication as well, I kind of like ACQUIRE_REPLICATION_SLOT and
> RELEASE_REPLICATION_SLOT. If we're going to make them specific to
> logical replication, then your suggestion of CREATE_DECODING_SLOT and
> DROP_DECODING_SLOT, or maybe ACQUIRE_DECODING_SLOT and
> RELEASE_DECODING_SLOT, sounds fine.

I think there'll always be a bit of a difference between slots for physical and logical data, even if 90% of the implementation is the same. We can signal that difference by specifying logical/physical as an option or having two different sets of commands.

Maybe?
ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
-- already exists without slot, PHYSICAL arguments
START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
START_REPLICATION SLOT LOGICAL slot plugin_options
RELEASE_REPLICATION_SLOT slot_name

> It also strikes me that just as it's possible to stream WAL without
> allocating a slot first (since we don't at present have slots),
> perhaps it ought also to be possible to stream logical replication
> data without acquiring a slot first. You could argue that it was a
> mistake not to introduce slots in the first place, but the stateless
> nature of WAL streaming definitely has some benefits, and it's unclear
> to me why you shouldn't be able to do the same thing with logical
> decoding.

I think it would be quite a bit harder for logical decoding. The difference is that, from the perspective of the walsender, for plain WAL streaming, all that needs to be checked is whether the WAL is still there. For decoding though, we need to be sure that a) the catalog xmin is still low enough and has been all along, and b) that we are able to instantly build a historical MVCC snapshot from the point we want to start streaming.

Both a) and b) are solved by keeping the xmin and the point where to reread WAL from in the slot data, and by serializing data about historical snapshots to disk. But those are removed if there isn't a slot around requiring them...

So what you could get is something that starts streaming changes to you sometime after you asked it to start streaming, without a guarantee that you can restart at exactly the position you stopped. If that's useful, we can do it, but I am not sure what the usecase would be?

> > b) Decide which of the SQL functions should be in a contrib module, and
> > which in core.
> > Currently init_logical_replication() and
> > stop_logical_replication() are in core, whereas
> > start_logical_replication() is in the 'test_logical_decoding'
> > extension. The reasoning behind that is that init/stop ones are
> > important to the DBA and the start_logical_replication() SRF isn't
> > all that useful in the real world because our SRFs don't support
> > streaming changes out.
>
> Seems pretty arbitrary to me. If start_logical_replication() is
> usable with any output plugin, then it ought to be in core.

Ok, I certainly have no problem with that.

> I think
> the name isn't great, though; the actual functionality seems to be
> more or less decode-from-last-position-up-to-present, which doesn't
> sound much like "start".

The name originally comes from the START_REPLICATION walsender command, but I agree, there's not much point in trying to keep that name.

I am also open to different behaviour for the SRF, but I am not sure what that could be. There's just no sensible way to stream data on the SQL level afaics.

What about pg_decoding_slot_get_[binary_]changes()?

> > c) Which data types does start_logical_replication() return? Currently
> > it's OUT location text, OUT xid bigint, OUT data text. Making the 'data'
> > column text has some obvious disadvantages though - there's obvious
> > usecases for output plugins that return binary data. But making it bytea
> > sucks, because the output is harder to read by default...
>
> I think having two functions might be sensible. I'm not sure what
> happens if the text function is used and the plugin outputs something
> that's not valid in the database encoding, though. I guess you'd
> better check for that and error out.

I wonder if we should let the output plugin tell us whether it will output data in binary? I think it generally would be a good idea to let the output plugin's _init() function return some configuration data. That will make extending the interface to support more features easier.
> > III) Keep the output plugin a library, but simply rely on _PG_init()
> > calling a function to register all callbacks. Imo it's worse than I) and
> > II) because it basically prohibits using the library in
> > shared_preload_libraries as well, because then its _PG_init() doesn't
> > get called when starting to stream, and another library might have
> > registered other callbacks.
>
> I don't understand the disadvantage that you're describing here. What
> I'm imagining is that you have some struct that looks like this:
>
> Now you provide a function RegisterOutputPlugin(output_plugin *). If
> there are any output plugins built into core, core will call
> RegisterOutputPlugin once for each one. If a shared library
> containing an output plugin is loaded, the library's _PG_init function
> does the same thing. When someone tries to use a plugin, they ask for
> it by name. We go iterate through the data saved by all previous
> calls to RegisterOutputPlugin() until we find one with a matching
> name, and then we use the callbacks embedded in that struct.

But if we don't pass in a .so's name, how can additional plugins be registered except by adding them to [shared|local]_preload_libraries? If we do pass in one, it seems confusing if you suddenly get a plugin implemented somewhere else.

> > IV) Make output plugins a SQL-level object/catalog table where a plugin
> > can be registered, and the callbacks are normal pg_proc entries. It's
> > more in line with other stuff, but has the disadvantage that we need to
> > register plugins on the primary, even if we only stream from a
> > standby. But then, we're used to that with CREATE EXTENSION et al.
>
> I don't think I'd make every callback a pg_proc entry; I'd make a
> single pg_proc entry that returns a struct of function pointers, as we
> do for FDWs. But I think this has merit.
> One significant advantage
> of it over (III) is that execution of a function in pg_proc can
> trigger a library load without any extra pushups, which is nice.

So I guess this is your preference? It has the advantage that an output plugin can create any additional functionality it needs in the course of its CREATE EXTENSION.

As far as I have been thinking, this would be another catalog table like pg_decoding_plugin(oid, dpname name, dpload regproc). I guess that we at least need a function for adding output plugins there, that also creates proper pg_depend entries? We could also go for full DDL commands, although I have a bit of a hard time finding suitable, already used, keywords... If we're not worried about that, maybe CREATE/DROP OUTPUT PLUGIN ...;

Thanks for the input!

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I think there'll always be a bit of a difference between slots for
> physical and logical data, even if 90% of the implementation is the
> same. We can signal that difference by specifying logical/physical as an
> option or having two different sets of commands.
>
> Maybe?
>
> ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
> ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
> -- already exists without slot, PHYSICAL arguments
> START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
> START_REPLICATION SLOT LOGICAL slot plugin_options
> RELEASE_REPLICATION_SLOT slot_name

I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options, but basically this seems OK to me. I hadn't realized that the options were going to be different for logical vs. physical. So you could also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT, START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT. I'm not sure whether that's better.

>> It also strikes me that just as it's possible to stream WAL without
>> allocating a slot first (since we don't at present have slots),
>> perhaps it ought also to be possible to stream logical replication
>> data without acquiring a slot first. You could argue that it was a
>> mistake not to introduce slots in the first place, but the stateless
>> nature of WAL streaming definitely has some benefits, and it's unclear
>> to me why you shouldn't be able to do the same thing with logical
>> decoding.
>
> I think it would be quite a bit harder for logical decoding. The
> difference is that, from the perspective of the walsender, for plain WAL
> streaming, all that needs to be checked is whether the WAL is still
> there. For decoding though, we need to be sure that a) the catalog xmin
> is still low enough and has been all along b) that we are able to
> instantly build a historical mvcc snapshot from the point we want to
> start streaming.
> Both a) and b) are solved by keeping the xmin and the point where to
> reread WAL from in the slot data and by serializing data about
> historical snapshots to disk. But those are removed if there isn't a
> slot around requiring them...
>
> So what you could get is something that starts streaming you changes
> sometime after you asked it to start streaming, without a guarantee that
> you can restart at exactly the position you stopped. If that's useful,
> we can do it, but I am not sure what the usecase would be?

I haven't yet looked closely at the snapshot-building stuff, but my thought is that you ought to be able to decode any transactions that start after you make the connection. You might not be able to decode transactions that are already in progress at that point, because you might have already missed XID assignment records, catalog changes, etc. that they've performed. But transactions that begin after that point ought to be OK. I have a feeling you're going to tell me it doesn't work like that, but maybe it should, because there's a whole lot of benefit in having decoding start up quickly, and a whole lot of benefit also to having the rules for that be easy to understand.

Now if you have that, then I think ad-hoc decoding is potentially useful. Granted, you're not going to want to build a full-fledged replication solution that way, but you might want to just connect and watch the world stream by... or you might imagine an application that opens a replication connection and a regular connection, copies a table, and then applies the stream of changes made to that table after the fact. When that completes, the table is sync'd between the two machines as of the end of the copy. Useful enough to bother with? I don't know. But not obviously useless.

> I am also open to different behaviour for the SRF, but I am not sure
> what that could be. There's just no sensible way to stream data on the
> SQL level afaics.

I don't have a problem with the behavior.
Seems useful. One useful addition might be to provide an option to stream out up to X changes but without consuming them, so that the DBA can peek at the replication stream. I think it's a safe bet DBAs will want to do things like that, so it'd be nice to make it easy, if we can.

> What about pg_decoding_slot_get_[binary_]changes()?

Sounds about right, but I think we need to get religion about figuring out what terminology to use. At the moment it seems to vary quite a bit between "logical", "logical decoding", and "decoding". Not sure how to nail that down.

As a more abstract linguistic question, what do we think the difference is between logical *replication* and logical *decoding*? Are they the same or different? If different, how?

> I wonder if we should let the output plugin tell us whether it will
> output data in binary? I think it generally would be a good idea to let
> the output plugin's _init() function return some configuration
> data. That will make extending the interface to support more features
> easier.

Maybe, but you've got to consider the question of encoding, too. You could make the choices "binary" and "the database encoding", I suppose.

>> Now you provide a function RegisterOutputPlugin(output_plugin *). If
>> there are any output plugins built into core, core will call
>> RegisterOutputPlugin once for each one. If a shared library
>> containing an output plugin is loaded, the library's _PG_init function
>> does the same thing. When someone tries to use a plugin, they ask for
>> it by name. We go iterate through the data saved by all previous
>> calls to RegisterOutputPlugin() until we find one with a matching
>> name, and then we use the callbacks embedded in that struct.
>
> But if we don't pass in a .so's name, how can additional plugins be
> registered except by adding them to [shared|local]_preload_libraries? If
> we do pass in one, it seems confusing if you suddenly get a plugin
> implemented somewhere else.
I don't see what the confusion is. You can use any method you like for loading those libraries, including shared_preload_libraries, local_preload_libraries, or the LOAD command. The replication grammar might have to grow support for LOAD. But I think the winner may be the next option. Some backends might not actually use the library in question, but that's true of any library you preload.

>> > IV) Make output plugins a SQL-level object/catalog table where a plugin
>> > can be registered, and the callbacks are normal pg_proc entries. It's
>> > more in line with other stuff, but has the disadvantage that we need to
>> > register plugins on the primary, even if we only stream from a
>> > standby. But then, we're used to that with CREATE EXTENSION et al.
>>
>> I don't think I'd make every callback a pg_proc entry; I'd make a
>> single pg_proc entry that returns a struct of function pointers, as we
>> do for FDWs. But I think this has merit. One significant advantage
>> of it over (III) is that execution of a function in pg_proc can
>> trigger a library load without any extra pushups, which is nice.
>
> So I guess this is? It has the advantage that an output plugin can
> create any additional functionality it needs in the course of its
> CREATE EXTENSION.

Yes, that's elegant.

> As far as I have been thinking, this would be another catalog table like
> pg_decoding_plugin(oid, dpname name, dpload regproc).

Instead of adding another catalog table, I think we should just define a new type. Again, please look at the way that foreign data wrappers do this:

rhaas=# \df file_fdw_handler
                             List of functions
 Schema |       Name       | Result data type | Argument data types |  Type
--------+------------------+------------------+---------------------+--------
 public | file_fdw_handler | fdw_handler      |                     | normal
(1 row)

Is there any reason not to slavishly copy that design?
Note that if you do it this way, you don't need any special DDL or pg_dump support; a separate catalog table will raise the bar considerably.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
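[In C terms, the fdw_handler-style design above boils down to one entry point that hands back the whole callback set. The sketch below is illustrative only: the names are made up, and in PostgreSQL proper the handler would be a pg_proc entry returning a pseudo-type, not a plain C function.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct OutputPluginRoutine
{
	void		(*begin) (void);
	void		(*change) (void);
	void		(*commit) (void);
} OutputPluginRoutine;

static void demo_begin(void)  { puts("BEGIN"); }
static void demo_change(void) { puts("CHANGE"); }
static void demo_commit(void) { puts("COMMIT"); }

/*
 * Analogous to an fdw_handler function: a single entry point whose
 * execution loads the library (if necessary) and returns the full set
 * of callbacks, so no per-callback catalog entries are needed.
 */
OutputPluginRoutine *
demo_output_plugin_handler(void)
{
	static OutputPluginRoutine routine = {
		demo_begin, demo_change, demo_commit
	};

	return &routine;
}
```

Since executing the handler triggers the library load, the catalog only ever needs to name one function per plugin, which is what makes pg_dump support essentially free.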
On 2013-12-12 10:01:21 -0500, Robert Haas wrote:
> On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I think there'll always be a bit of a difference between slots for
> > physical and logical data, even if 90% of the implementation is the
> > same. We can signal that difference by specifying logical/physical as an
> > option or having two different sets of commands.
> >
> > Maybe?
> >
> > ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts
> > ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts
> > -- already exists without slot, PHYSICAL arguments
> > START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline
> > START_REPLICATION SLOT LOGICAL slot plugin_options
> > RELEASE_REPLICATION_SLOT slot_name
>
> I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options,
> but basically this seems OK to me.

Uh, yes.

> I hadn't realized that the options were going to be different for
> logical vs. physical.

I don't see how we could avoid that, there just are some differences between the two.

> So you could
> also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT,
> START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT. I'm
> not sure whether that's better.

Not sure either, but I slightly favor keeping the toplevel slot commands the same. I think we'll want one namespace for both, and possibly similar reporting functions, and that seems less surprising if they are treated more similarly.

> > So what you could get is something that starts streaming you changes
> > sometime after you asked it to start streaming, without a guarantee that
> > you can restart at exactly the position you stopped. If that's useful,
> > we can do it, but I am not sure what the usecase would be?
>
> I haven't yet looked closely at the snapshot-building stuff, but my
> thought is that you ought to be able to decode any transactions that
> start after you make the connection.
> You might not be able to decode
> transactions that are already in progress at that point, because you
> might have already missed XID assignment records, catalog changes,
> etc. that they've performed. But transactions that begin after that
> point ought to be OK.

It works mostly like that, yes. At least on primaries. When we start decoding, we jot down the current xlog insertion pointer to know where to start decoding from, then trigger an xl_running_xacts record to be logged so we have enough information. Then we start reading from that point onwards. On standbys the process is the same, just that we have to wait for the primary to issue an xl_running_xacts. (I had considered starting with information from the procarray, but it turns out that's hard to do without race conditions.)

We only decode changes in transactions that commit after the last transaction that was in progress when we started observing has finished, though. That allows us to export a snapshot when the last still-running transaction finishes, which shows a snapshot of the database that can be rolled forward exactly by the changes contained in the changestream. I think that's a useful property for the majority of cases.

If we were to start out streaming changes before the last running transaction has finished, they would be visible in that exported snapshot and you couldn't use it to roll forward from anymore.

It'd be pretty easy to optionally decode the transactions we currently skip if we want that feature later. That would remove the option to export a snapshot in many cases though (think suboverflowed snapshots).

> I have a feeling you're going to tell me it
> doesn't work like that, but maybe it should, because there's a whole
> lot of benefit in having decoding start up quickly, and a whole lot of
> benefit also to having the rules for that be easy to understand.
I am not sure if the above qualifies as "doesn't work like that"; if not, sometimes the correct thing isn't the immediately obvious thing. I think "all transactions that were running when initiating decoding need to finish" is reasonably easy to explain.

> > I am also open to different behaviour for the SRF, but I am not sure
> > what that could be. There's just no sensible way to stream data on the
> > SQL level afaics.
>
> I don't have a problem with the behavior. Seems useful. One useful
> addition might be to provide an option to stream out up to X changes
> but without consuming them, so that the DBA can peek at the
> replication stream. I think it's a safe bet DBAs will want to do
> things like that, so it'd be nice to make it easy, if we can.

It's not too difficult to provide an option to do that. What I've been thinking of was to correlate the confirmation of consumption with the transaction the SRF is running in. So, confirm the data as consumed if it commits, and don't if not. I think we could do that relatively easily by registering a XACT_EVENT_COMMIT callback.

> > What about pg_decoding_slot_get_[binary_]changes()?
>
> Sounds about right, but I think we need to get religion about figuring
> out what terminology to use. At the moment it seems to vary quite a
> bit between "logical", "logical decoding", and "decoding". Not sure
> how to nail that down.

Agreed. Perhaps we should just avoid both "logical" and "decoding" entirely and go for "changestream" or similar?

> As a more abstract linguistic question, what do we think the
> difference is between logical *replication* and logical *decoding*?
> Are they the same or different? If different, how?

For me "logical decoding" can be the basis of "logical replication", but also of other features.

> > I wonder if we should let the output plugin tell us whether it will
> > output data in binary? I think it generally would be a good idea to let
> > the output plugin's _init() function return some configuration
> > data.
That will make extending the interface to support more features > > easier. > > Maybe, but you've got to consider the question of encoding, too. You > could make the choices "binary" and "the database encoding", I > suppose. Yes, I think that should be the choice. There seems little justification for an output plugin to produce textual output in anything but the server encoding. I am not sure if we want to verify that in !USE_ASSERT? That'd be quite expensive... > > As far as I have been thinking of, this would be another catalog table like > > pg_decoding_plugin(oid, dpname name, dpload regproc). > > Instead of adding another catalog table, I think we should just define > a new type. Again, please look at the way that foreign data wrappers > do this: I don't really see what the usage of a special type has to do with this, but I think that's besides your main point. What you're saying is that the output plugin is just defined by a function name, possibly schema prefixed. That has an elegance to it. +1 Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> I hadn't realized that the options were going to be different for >> logical vs. physical. > > I don't see how we could avoid that, there just are some differences > between both. Right, I'm not complaining, just observing that it was a point I had overlooked. >> So you could >> also do ACQUIRE_LOGICAL_SLOT, ACQUIRE_PHYSICAL_SLOT, >> START_REPLICATION, START_LOGICAL_REPLICATION, and RELEASE_SLOT. I'm >> not sure whether that's better. > > Not sure either, but I slightly favor keeping the toplevel slot > commands the same. I think we'll want one namespace for both and > possibly similar reporting functions and that seems less surprising if > they are treated more similar. OK. > If we were to start out streaming changes before the last running > transaction has finished, they would be visible in that exported > snapshot and you couldn't use it to roll forward from anymore. Actually, you could. You'd just have to throw away any transactions whose XIDs are visible to the exported snapshot. In other words, you begin replication at time T0, and all transactions which begin after that time are included in the change stream. At some later time T1, all transactions in progress at time T0 have ended, and now you can export a snapshot at that time, or any later time, from which you can roll forward. Any change-stream entries for XIDs which would be visible to that snapshot shouldn't be replayed when rolling forward from it, though. I think it sucks (that's the technical term) to have to wait for all currently-running transactions to terminate before being able to begin streaming changes, because that could take a long time. And you might well know that the long-running transaction which is rolling up enormous table A that you don't care about is never going to touch table B which you actually want to replicate. 
Now, ideally, the DBA would have a way to ignore that long-running transaction and force replication to start, perhaps with the caveat that if that long-running transaction actually does touch B after all then we have to resync. Your model's fine when we want to replicate the whole database, but a big part of why I want this feature is to allow finer-grained replication, down to the table level, or even slices of tables. So imagine this. After initiating logical replication, a replication solution either briefly x-locks a table it wants to replicate, so that there can't be anyone else touching it, or it observes who has a lock >= RowExclusiveLock and waits for all of those locks to drop away. At that point, it knows that no currently-in-progress transaction can have modified the table prior to the start of replication, and begins copying the table. If a transaction that began before the start of replication subsequently modifies the table, a WAL record will be written, and the core logical decoding support could let the plugin know by means of an optional callback (hey, btw, a change I can't decode just hit table XYZ). The plugin will need to respond by recopying the table, which sucks, but it was the plugin's decision to be optimistic in the first place, and that will in many cases be a valid policy decision. If no such callback arrives before the safe-snapshot point, then the plugin made the right bet and will reap the just rewards of its optimism. >> I don't have a problem with the behavior. Seems useful. One useful >> addition might be to provide an option to stream out up to X changes >> but without consuming them, so that the DBA can peek at the >> replication stream. I think it's a safe bet DBAs will want to do >> things like that, so it'd be nice to make it easy, if we can. > > It's not too difficult to provide an option to do that. What I've been > thinking of was to correlate the confirmation of consumption with the > transaction the SRF is running in. 
So, confirm the data as consumed if > it commits, and don't if not. I think we could do that relatively easily > by registering a XACT_EVENT_COMMIT. That's a bit too accident-prone for my taste. I'd rather the DBA had some equivalent of peek_at_replication(nchanges int). >> Sounds about right, but I think we need to get religion about figuring >> out what terminology to use. At the moment it seems to vary quite a >> bit between "logical", "logical decoding", and "decoding". Not sure >> how to nail that down. > > Agreed. Perhaps we should just avoid both logical and decoding entirely > and go for "changestream" or similar? So wal_level=changestream? Not feeling it. Of course we don't have to be 100% rigid about this but we should try to make our terminology correspond with natural semantic boundaries. Maybe we should call the process logical decoding, and the results logical streams, or something like that. >> As a more abstract linguistic question, what do we think the >> difference is between logical *replication* and logical *decoding*? >> Are they the same or different? If different, how? > > For me "logical decoding" can be the basis of "logical replication", but > also for other features. Such as? >> > I wonder if we should let the output plugin tell us whether it will >> > output data in binary? I think it generally would be a good idea to let >> > the output plugin's _init() function return some configuration >> > data. That will make extending the interface to support more features >> > easier. >> >> Maybe, but you've got to consider the question of encoding, too. You >> could make the choices "binary" and "the database encoding", I >> suppose. > > Yes, I think that should be the choice. There seems little justification > for an output plugin to produce textual output in anything but the > server encoding. > > I am not sure if we want to verify that in !USE_ASSERT? That'd be quite > expensive... 
Since it's all C code anyway, it's probably fine to push the responsibility back onto the output plugin, as long as that's a documented part of the API contract. >> > As far as I have been thinking of, this would be another catalog table like >> > pg_decoding_plugin(oid, dpname name, dpload regproc). >> >> Instead of adding another catalog table, I think we should just define >> a new type. Again, please look at the way that foreign data wrappers >> do this: > > I don't really see what the usage of a special type has to do with this, > but I think that's besides your main point. What you're saying is that > the output plugin is just defined by a function name, possibly schema > prefixed. That has an elegance to it. +1 Well, file_fdw_handler returns type fdw_handler. That's nice, because we can validate that we've got the right sort of object when what we want is an FDW handler. If it just returned type internal, it would be too easy to mix it up with something unrelated that passed back some other kind of binary goop. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-12-12 12:13:24 -0500, Robert Haas wrote: > On Thu, Dec 12, 2013 at 10:49 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > If we were to start out streaming changes before the last running > > transaction has finished, they would be visible in that exported > > snapshot and you couldn't use it to roll forward from anymore. > > Actually, you could. You'd just have to throw away any transactions > whose XIDs are visible to the exported snapshot. In other words, you > begin replication at time T0, and all transactions which begin after > that time are included in the change stream. At some later time T1, > all transactions in progress at time T0 have ended, and now you can > export a snapshot at that time, or any later time, from which you can > roll forward. Any change-stream entries for XIDs which would be > visible to that snapshot shouldn't be replayed when rolling forward > from it, though. But that would become too complex an interface, imo without a corresponding benefit. If you skip the changes when rolling forward, there's no point in streaming them out in the first place. > I think it sucks (that's the technical term) to have to wait for all > currently-running transactions to terminate before being able to begin > streaming changes, because that could take a long time. I don't think there's much of an alternative for replication solutions; for other usecases we may want to add an option to skip the wait. It's not like that's something you do all the time. Once a slot has been acquired, there are no further waits. > And you might > well know that the long-running transaction which is rolling up > enormous table A that you don't care about is never going to touch > table B which you actually want to replicate. 
Now, ideally, the DBA > would have a way to ignore that long-running transaction and force > replication to start, perhaps with the caveat that if that > long-running transaction actually does touch B after all then we have > to resync. Puh. I honestly have zero confidence in DBAs making an informed decision about something like this. Honestly, for a replication solution, how often do you think this will be an issue? > So imagine this. After initiating logical replication, a replication > solution either briefly x-locks a table it wants to replicate, so that > there can't be anyone else touching it, or it observes who has a lock > >= RowExclusiveLock and waits for all of those locks to drop away. At > that point, it knows that no currently-in-progress transaction can > have modified the table prior to the start of replication, and begins > copying the table. If a transaction that began before the start of > replication subsequently modifies the table, a WAL record will be > written, and the core logical decoding support could let the plugin > know by means of an optional callback (hey, btw, a change I can't > decode just hit table XYZ). The plugin will need to respond by > recopying the table, which sucks, but it was the plugin's decision to > be optimistic in the first place, and that will in many cases be a > valid policy decision. If no such callback arrives before the > safe-snapshot point, then the plugin made the right bet and will reap > the just rewards of its optimism. Sure, all that's possible. But hell, it's complicated to use. If reality proves people want this, let's go there, but let's get the basics right and committed first. 
All the logic around whether to decode a transaction is:

    void
    SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
                       int nsubxacts, TransactionId *subxacts)
    {
        ...
        if (builder->state < SNAPBUILD_CONSISTENT)
        {
            /* ensure that only commits after this are getting replayed */
            if (builder->transactions_after < lsn)
                builder->transactions_after = lsn;
        ...

and then

    /*
     * Should the contents of a transaction ending at 'ptr' be decoded?
     */
    bool
    SnapBuildXactNeedsSkip(SnapBuild *builder, XLogRecPtr ptr)
    {
        return ptr <= builder->transactions_after;
    }

so it's not like it will require all too many changes. What I can see as possibly getting into 9.4 is a FASTSTART option that doesn't support exporting a snapshot, but doesn't have to wait for the SNAPBUILD_CONSISTENT state in return. That's fine for some usecases, although I don't think for any of the major ones.

> > It's not too difficult to provide an option to do that. What I've been > > thinking of was to correlate the confirmation of consumption with the > > transaction the SRF is running in. So, confirm the data as consumed if > > it commits, and don't if not. I think we could do that relatively easily > > by registering a XACT_EVENT_COMMIT. > > That's a bit too accident-prone for my taste. I'd rather the DBA had > some equivalent of peek_at_replication(nchanges int).

One point for my suggested behaviour is that it closes a bigger race condition. Currently as soon as start_logical_replication() has finished building the tuplestore it marks the endposition as received. But we very well can fail before the user has received all those changes. The only other idea I have to close that situation is to add an explicit function to confirm receiving the changes, but that sounds icky for something exposed to SQL.

> >> Sounds about right, but I think we need to get religion about figuring > >> out what terminology to use. At the moment it seems to vary quite a > >> bit between "logical", "logical decoding", and "decoding". 
Not sure > >> how to nail that down. > > > > Agreed. Perhaps we should just avoid both logical and decoding entirely > > and go for "changestream" or similar? > > So wal_level=changestream? Not feeling it. Of course we don't have > to be 100% rigid about this but we should try to make our terminology > correspond with natural semantic boundaries. Maybe we should call > the process logical decoding, and the results logical streams, or > something like that. I am fine with that, but I wouldn't mind some opinions of people knowing less about the implementation than you and I. > > For me "logical decoding" can be the basis of "logical replication", but > > also for other features. > > Such as?

* auditing
* cache invalidation
* concurrent, rewriting ALTER TABLE
* concurrent VACUUM/CLUSTER similar to pg_reorg

> > I don't really see what the usage of a special type has to do with this, > > but I think that's besides your main point. What you're saying is that > > the output plugin is just defined by a function name, possibly schema > > prefixed. That has an elegance to it. +1 > Well, file_fdw_handler returns type fdw_handler. That's nice, because > we can validate that we've got the right sort of object when what we > want is an FDW handler. If it just returned type internal, it would > be too easy to mix it up with something unrelated that passed back > some other kind of binary goop. My, badly expressed, point is that returning INTERNAL or a specialized type seems orthogonal to either using a special catalog mapping the output plugin name to a function returning callbacks in comparison to just using the function's name as the output plugin name. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 12, 2013 at 1:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Puh. I honestly have zero confidence in DBAs making an informed decision > about something like this. Honestly, for a replication solution, how > often do you think this will be an issue? If you imagine a scenario where somebody establishes a replication slot and then keeps it forever, not often. But if you're trying to do something more ad hoc, where replication slots might be used just for short periods of time and then abandoned, I think it could come up pretty frequently. Generally, I think you're being too dismissive of the stuff I'm complaining about here. If we just can't get this, well then I suppose we can't. But I think the amount of time that it takes Hot Standby to open for connections is an issue, precisely because it's got to wait until certain criteria are met before it can establish a snapshot, and sometimes that takes an unpleasantly long time. I think it unlikely that we can export that logic to this case also and experience no pain as a result. In fact, I think that even restricting things to streaming changes from transactions started after we initiate replication is going to be an annoying amount of delay for some purposes. People will accept it because, no matter how you slice it, this is an awesome new capability. Full stop. That having been said, I don't find it at all hard to imagine someone wanting to jump into the replication stream at an arbitrary point in time and see changes from every transaction that *commits* after that point, even if it began earlier, or even to see changes from transactions that have not yet committed as they happen. I realize that's asking for a pony, and I'm not saying you have to go off and do that right now in order for this to move forward, or indeed that it will ever happen at all. 
What I am saying is that I find it entirely likely that people are going to push the limits of this thing, that this is one of the limits I expect them to push, and that the more we can do to put policy in the hands of the user without pre-judging the sanity of what they're trying to do, the happier we (and our users) will be. >> > It's not too difficult to provide an option to do that. What I've been >> > thinking of was to correlate the confirmation of consumption with the >> > transaction the SRF is running in. So, confirm the data as consumed if >> > it commits, and don't if not. I think we could do that relatively easily >> > by registering a XACT_EVENT_COMMIT. >> >> That's a bit too accident-prone for my taste. I'd rather the DBA had >> some equivalent of peek_at_replication(nchanges int). > > One point for my suggested behaviour is that it closes a bigger > racecondition. Currently as soon as start_logical_replication() has > finished building the tuplestore it marks the endposition as > received. But we very well can fail before the user has received all > those changes. Right. I think your idea is good, but maybe there should also be a version of the function that never confirms receipt even if the transaction commits. That would be useful for ad-hoc poking at the queue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-12-13 08:30:41 -0500, Robert Haas wrote: > On Thu, Dec 12, 2013 at 1:52 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Puh. I honestly have zero confidence in DBAs making an informed decision > > about something like this. Honestly, for a replication solution, how > > often do you think this will be an issue? > > If you imagine a scenario where somebody establishes a replication > slot and then keeps it forever, not often. But if you're trying to do > something more ad hoc, where replication slots might be used just for > short periods of time and then abandoned, I think it could come up > pretty frequently. But can you imagine those users needing an exported snapshot? I can think of several short-lived usages, but all of those are unlikely to need a consistent view of the overall database. And those are less likely to be full blown replication solutions. I.e. it's not the DBA making that decision but the developer making the decision based on whether he requires the snapshot or not. > Generally, I think you're being too dismissive of the stuff I'm > complaining about here. If we just can't get this, well then I > suppose we can't. I think I am just scared of needing to add more features before getting the basics done and in consequence overrunning 9.4... > But I think the amount of time that it takes > Hot Standby to open for connections is an issue, precisely because > it's got to wait until certain criteria are met before it can > establish a snapshot, and sometimes that takes an unpleasantly long > time. The HS situation is quite different though. There the problem is hit on every restart, not just the initial start; suboverflowed xl_running_xacts records can delay HS startup for a long time; and we don't start up at the exact point we know an xl_running_xacts record has just been logged. FWIW I think we need to fix HS to store its visibility state at restartpoints, so we get rid of that problem. Not sure who will have time to do that tho. 
> That having been said, I don't find it at all > hard to imagine someone wanting to jump into the replication stream at > an arbitrary point in time and see changes from every transaction that > *commits* after that point, even if it began earlier, or even to see > changes from transactions that have not yet committed as they happen. I agree they are desirable, but all those will require additional state to be kept about running transactions. That's not saying it shouldn't/cannot be done, to the contrary, just that it requires some engineering effort. > I realize that's asking for a pony, and I'm not saying you have to go > off and do that right now in order for this to move forward, or indeed > that it will ever happen at all. What I am saying is that I find it > entirely likely that people are going to push the limits of this > thing, that this is one of the limits I expect them to push, and that > the more we can do to put policy in the hands of the user without > pre-judging the sanity of what they're trying to do, the happier we > (and our users) will be. Completely agreed. I really think this is the most basic building block, missing many important features. And we'll be busy for some time adding those. > > One point for my suggested behaviour is that it closes a bigger > > racecondition. Currently as soon as start_logical_replication() has > > finished building the tuplestore it marks the endposition as > > received. But we very well can fail before the user has received all > > those changes. > > Right. I think your idea is good, but maybe there should also be a > version of the function that never confirms receipt even if the > transaction commits. That would be useful for ad-hoc poking at the > queue. Ok, that sounds easy enough, maybe pg_decoding_slot_get_[binary_]changes() pg_decoding_slot_peek_[binary_]changes() ? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> If you imagine a scenario where somebody establishes a replication >> slot and then keeps it forever, not often. But if you're trying to do >> something more ad hoc, where replication slots might be used just for >> short periods of time and then abandoned, I think it could come up >> pretty frequently. > > But can you imagine those users needing an exported snapshot? I can > think of several short-lived usages, but all of those are unlikely to > need a consistent view of the overall database. And those are less > likely to be full blown replication solutions. > I.e. it's not the DBA making that decision but the developer making the > decision based on whether he requires the snapshot or not. Well, it still seems to me that the right way to think about this is that the change stream begins at a certain point, and then once you cross a certain threshold (all transactions in progress at that time have ended) any subsequent snapshot is a possible point from which to roll forward. You'll need to avoid applying any transactions that are already included during the snapshot, but I don't really think that's any great matter. You're focusing on the first point at which the consistent snapshot can be taken, and on throwing away any logical changes that might have been available before that point so that they don't have to be ignored in the application code, but I think that's myopic. For example, suppose somebody is replicating tables from node A to node B. And then they decide to replicate some of the same tables to node C. Well, one way to do this is to have node C connect to node A and acquire its own slot, but that means decoding everything twice. Alternatively, you could reuse the same change stream, but you'll need a new snapshot to roll forward from. That doesn't seem like a problem unless the API makes it a problem. 
>> Generally, I think you're being too dismissive of the stuff I'm >> complaining about here. If we just can't get this, well then I >> suppose we can't. > > I think I am just scared of needing to add more features before getting > the basics done and in consequence overrunning 9.4... I am sensitive to that. On the other hand, this API is going to be a lot harder to change once it's released, so we really need to avoid painting ourselves into a corner with v1. As far as high-level design concerns go, there are three things that I'm not happy with: 1. Slots. We know we need physical slots as well as logical slots, but the patch as currently constituted only offers logical slots. 2. Snapshot Management. This issue. 3. Incremental Decoding. So that we can begin applying a really big transaction speculatively before it's actually committed. I'm willing to completely punt #3 as far as 9.4 is concerned, because I see a pretty clear path to fixing that later. I am not yet convinced that either of the other two can or should be postponed. >> Right. I think your idea is good, but maybe there should also be a >> version of the function that never confirms receipt even if the >> transaction commits. That would be useful for ad-hoc poking at the >> queue. > > Ok, that sounds easy enough, maybe > pg_decoding_slot_get_[binary_]changes() > pg_decoding_slot_peek_[binary_]changes() > ? s/pg_decoding_slot/pg_logical_stream/? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-12-14 11:50:00 -0500, Robert Haas wrote: > On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > But can you imagine those users needing an exported snapshot? I can > > think of several short-lived usages, but all of those are unlikely to > > need a consistent view of the overall database. And those are less > > likely to be full blown replication solutions. > > I.e. it's not the DBA making that decision but the developer making the > > decision based on whether he requires the snapshot or not. > > Well, it still seems to me that the right way to think about this is > that the change stream begins at a certain point, and then once you > cross a certain threshold (all transactions in progress at that time > have ended) any subsequent snapshot is a possible point from which to > roll forward. Unfortunately it's not possible to build exportable snapshots at any time - it requires keeping far more state around since we need to care about all transactions, not just transactions touching the catalog. Currently you can only export the snapshot at the one point we become consistent, after that we stop maintaining that state. I see pretty much no chance to change that without major effort that's imo clearly out of scope for 9.4. Maybe we want to provide that at some point, but it's going to be a bit. Also note that in order to import a snapshot, the transaction that exported it still has to be alive. So, until the snapshot has been imported, replication cannot progress. Those are existing limitations from the snapshot import code, but the reasoning behind them makes it unlikely that we can easily change them. Under that light, usecases like your example won't be able to benefit from getting changes before the snapshot, or do you still see usecases for that? 
I've already prototyped the "quickstart" option that doesn't allow exporting a snapshot, although I don't know how the UI for it is going to look. > > I think I am just scared of needing to add more features before getting > > the basics done and in consequence overrunning 9.4... > > I am sensitive to that. On the other hand, this API is going to be a > lot harder to change once it's released, so we really need to avoid > painting ourselves into a corner with v1. I think we need to accept the fact that quite possibly we won't get everything right in 9.4. Obviously we should strive to avoid that, but trying to design things perfectly usually ends in not getting there. Yes, the API changes for FDWs caused pain. But there's no way we would have the current features if we'd tried to introduce the correct API allowing them from the get go. > As far as high-level design > concerns go, there are three things that I'm not happy with: > > 1. Slots. We know we need physical slots as well as logical slots, > but the patch as currently constituted only offers logical slots. Well, then tell me the way you want to go forward on that end. I can make the slot interface more generic if we know exactly what we need, but it doesn't seem fair to take this patch hostage until I develop a separate not so small feature. Why is that my task? Because I think it's important, and because by now I know the related code pretty well, I am willing to provide the parts that prevent required WAL from being deleted, peg xmin, and report the current state to SQL, but somebody else is going to have to do the rest. > s/pg_decoding_slot/pg_logical_stream/? Works for me. Regards, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 12/12/13 11:13 AM, Robert Haas wrote: > I think it sucks (that's the technical term) to have to wait for all > currently-running transactions to terminate before being able to begin > streaming changes, because that could take a long time. And you might > well know that the long-running transaction which is rolling up > enormous table A that you don't care about is never going to touch > table B which you actually want to replicate. Now, ideally, the DBA > would have a way to ignore that long-running transaction and force > replication to start, perhaps with the caveat that if that > long-running transaction actually does touch B after all then we have > to resync. Your model's fine when we want to replicate the whole > database, but a big part of why I want this feature is to allow > finer-grained replication, down to the table level, or even slices of > tables. I know you're not going to attempt this for 9.4, but I want to mention a related case here. I've often wanted the ability to limit the tables a transaction can touch, so that it will not interfere with vacuuming other tables. This would be useful when you have some tables that see very frequent updates/deletes in a database that also has to support long-running transactions that don't hit those tables. You'd explicitly limit the tables your long-running transaction will touch and that way vacuum can ignore the long-running XID when calculating minimum tuple age for the heavy-hit tables. If we had that capability it could also be used to improve the time required to get a snapshot for a limited set of tables. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-12-14 11:50:00 -0500, Robert Haas wrote: >> On Fri, Dec 13, 2013 at 9:14 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > But can you imagine those users needing an exported snapshot? I can >> > think of several short-lived usages, but all of those are unlikely to >> > need a consistent view of the overall database. And those are less >> > likely to be full blown replication solutions. >> > I.e. it's not the DBA making that decision but the developer making the >> > decision based on whether he requires the snapshot or not. >> >> Well, it still seems to me that the right way to think about this is >> that the change stream begins at a certain point, and then once you >> cross a certain threshold (all transactions in progress at that time >> have ended) any subsequent snapshot is a possible point from which to >> roll forward. > > Unfortunately it's not possible to build exportable snapshots at any > time - it requires keeping far more state around since we need to care > about all transactions, not just transactions touching the > catalog. Currently you can only export the snapshot in the one point we > become consistent, after that we stop maintaining that state. I don't get it. Once all the old transactions are gone, I don't see why you need any state at all to build an exportable snapshot. Just take a snapshot. >> 1. Slots. We know we need physical slots as well as logical slots, >> but the patch as currently constituted only offers logical slots. > > Well, then tell me the way you want to go forward on that end. I can > make the slot interface more generic if we know exactly what we need, > but it doesn't seem fair to take this patch hostage until I develop a > separate not so small feature. Why is that my task? 
> Because I think it's important, and because by now I know the related > code pretty well, I am willing to provide the parts that > prevent required WAL from being deleted, peg xmin, and report the current > state to SQL, but somebody else is going to have to do the rest. The part that you're expressing willingness to do sounds entirely satisfactory to me. As I mentioned on the other thread, I'm perhaps even willing to punt that feature entirely provided that we have a clear design for how to add it later, but I think it'd be nicer to get it done now. And just for the record, I think the idea that I am holding this patch hostage is absurd. I have devoted a large amount of time and energy to moving this forward and plan to devote more. Because of that work, big chunks of what is needed here are already committed. If my secret plan is to make it as difficult as possible for you to get this committed, I'm playing a deep game. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-12-16 23:01:16 -0500, Robert Haas wrote: > On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > On 2013-12-14 11:50:00 -0500, Robert Haas wrote: > >> Well, it still seems to me that the right way to think about this is > >> that the change stream begins at a certain point, and then once you > >> cross a certain threshold (all transactions in progress at that time > >> have ended) any subsequent snapshot is a possible point from which to > >> roll forward. > > > > Unfortunately it's not possible to build exportable snapshots at any > > time - it requires keeping far more state around since we need to care > > about all transactions, not just transactions touching the > > catalog. Currently you can only export the snapshot in the one point we > > become consistent, after that we stop maintaining that state. > > I don't get it. Once all the old transactions are gone, I don't see > why you need any state at all to build an exportable snapshot. Just > take a snapshot. The state we're currently decoding, somewhere in already fsynced WAL, won't correspond to the state in the procarray. There might be situations where it will, but we can't guarantee that we ever reach that point without taking locks that will be problematic. > The part that you're expressing willingness to do sounds entirely > satisfactory to me. As I mentioned on the other thread, I'm perhaps > even willing to punt that feature entirely provided that we have a > clear design for how to add it later, but I think it'd be nicer to get > it done now. We'll see what the next version looks like. Not sure about that myself yet ;) > And just for the record, I think the idea that I am holding this patch > hostage is absurd. I have devoted a large amount of time and energy > to moving this forward and plan to devote more. Because of that work, > big chunks of what is needed here are already committed. 
If my secret > plan is to make it as difficult as possible for you to get this > committed, I'm playing a deep game. I am not saying at all that you're planning to stop the patch from getting in. You've delivered pretty clear proof that that's not the case. But that doesn't prevent us from arguing over details and disagreeing whether they are dealbreakers or not, does it ;) I think you know that I am hugely grateful for the work you've put into the topic. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 17, 2013 at 4:31 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-12-16 23:01:16 -0500, Robert Haas wrote: >> On Sat, Dec 14, 2013 at 12:37 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> > On 2013-12-14 11:50:00 -0500, Robert Haas wrote: >> >> Well, it still seems to me that the right way to think about this is >> >> that the change stream begins at a certain point, and then once you >> >> cross a certain threshold (all transactions in progress at that time >> >> have ended) any subsequent snapshot is a possible point from which to >> >> roll forward. >> > >> > Unfortunately it's not possible to build exportable snapshots at any >> > time - it requires keeping far more state around since we need to care >> > about all transactions, not just transactions touching the >> > catalog. Currently you can only export the snapshot in the one point we >> > become consistent, after that we stop maintaining that state. >> >> I don't get it. Once all the old transactions are gone, I don't see >> why you need any state at all to build an exportable snapshot. Just >> take a snapshot. > > The state we're currently decoding, somewhere in already fsynced WAL, > won't correspond to the state in the procarray. There might be > situations where it will, but we can't guarantee that we ever reach that > point without taking locks that will be problematic. You don't need to guarantee that. Just take a current snapshot and then throw away (or don't decode in the first place) any transactions that would be visible to that snapshot. This is simpler and more flexible, and possibly more performant, too, because with your design you'll have to hold back xmin to the historical snapshot you build while copying the table rather than to a current snapshot. I really think we should consider whether we can't get by with ripping out the build-an-exportable-snapshot code altogether. I don't see that it's really buying us much. 
We need a way for the client to know when decoding has reached the point where it is guaranteed complete - i.e. all transactions in progress at the time decoding was initiated have ended. We also need a way for a backend performing decoding to take a current MVCC snapshot, export it, and send the identifier to the client. And we need a way for the client to know whether any given one of those snapshots includes a particular XID we may have decoded. But I think all of that might still be simpler than what you have now, and it's definitely more flexible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
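Robert's alternative above — take a current MVCC snapshot and simply skip decoded transactions whose effects that snapshot already sees — can be sketched in miniature. Everything below (the `Snapshot` struct layout and `XidVisibleInSnapshot()`) is an invented simplification for illustration, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int TransactionId;

/* Toy snapshot: xids below xmin are visible, xids at or above xmax are
 * not, and xids listed in xip[] were still in progress when the
 * snapshot was taken. */
typedef struct Snapshot
{
    TransactionId xmin;
    TransactionId xmax;
    TransactionId *xip;         /* in-progress xids at snapshot time */
    int xcnt;
} Snapshot;

/* Would this decoded transaction's effects already be visible to the
 * client's exported snapshot?  If yes, the client must not replay it
 * again; if no, it still has to be streamed out. */
static bool
XidVisibleInSnapshot(TransactionId xid, const Snapshot *snap)
{
    if (xid < snap->xmin)
        return true;            /* finished before the snapshot */
    if (xid >= snap->xmax)
        return false;           /* began after the snapshot */
    for (int i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return false;       /* still running at snapshot time */
    return true;
}
```

Under this scheme the decoder would call something like `XidVisibleInSnapshot()` per decoded transaction and drop the visible ones, instead of maintaining extra state to become able to export a snapshot at exactly one point.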
On 2013-12-12 10:01:21 -0500, Robert Haas wrote: > On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > I think there'll always be a bit of a difference between slots for > > physical and logical data, even if 90% of the implementation is the > > same. We can signal that difference by specifying logical/physical as an > > option or having two different sets of commands. > > > > Maybe? > > > > ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts > > ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts > > -- already exists without slot, PHYSICAL arguments > > START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline > > START_REPLICATION SLOT LOGICAL slot plugin_options > > RELEASE_REPLICATION_SLOT slot_name > > I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options, > but basically this seems OK to me. When writing the code for this, I decided that I need to renege a bit on those names - they don't work nicely enough on the C level for me. Specifically during a START_REPLICATION we need to temporarily mark the slot as being actively used and mark it unused again afterwards. That's much more Acquire/Release-like to me than the persistent Acquire/Release above. The C names in the version I am working on currently are: extern void ReplicationSlotCreate(const char *name); extern void ReplicationSlotDrop(const char *name); extern void ReplicationSlotAcquire(const char *name); extern void ReplicationSlotRelease(void); extern void ReplicationSlotSave(void); which would make the walsender ones CREATE_REPLICATION_SLOT ... START_REPLICATION [SLOT slot] [LOGICAL | PHYSICAL] ... DROP_REPLICATION_SLOT ... where START_REPLICATION internally does acquire/release on the passed SLOT. Does that work for you? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 3, 2014 at 10:12 AM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2013-12-12 10:01:21 -0500, Robert Haas wrote: >> On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > I think there'll always be a bit of a difference between slots for >> > physical and logical data, even if 90% of the implementation is the >> > same. We can signal that difference by specifying logical/physical as an >> > option or having two different sets of commands. >> > >> > Maybe? >> > >> > ACQUIRE_REPLICATION_SLOT slot_name PHYSICAL physical_opts >> > ACQUIRE_REPLICATION_SLOT slot_name LOGICAL logical_opts >> > -- already exists without slot, PHYSICAL arguments >> > START_REPLICATION [SLOT slot] [PHYSICAL] RECPTR opt_timeline >> > START_REPLICATION SLOT LOGICAL slot plugin_options >> > RELEASE_REPLICATION_SLOT slot_name >> >> I assume you meant START_REPLICATION SLOT slot LOGICAL plugin_options, >> but basically this seems OK to me. > > When writing the code for this, I decided that I need to renege a bit on > those names - they don't work nicely enough on the C level for > me. Specifically during a START_REPLICATION we need to temporarily mark > the slot as being actively used and mark it unused again > afterwards. That's much more Acquire/Release like than the persistent > Acquire/Release above for me. > > The C names in the version I am working on currently are: > extern void ReplicationSlotCreate(const char *name); > extern void ReplicationSlotDrop(const char *name); > extern void ReplicationSlotAcquire(const char *name); > extern void ReplicationSlotRelease(void); > extern void ReplicationSlotSave(void); > > which would make the walsender ones > > CREATE_REPLICATION_SLOT ... > START_REPLICATION [SLOT slot] [LOGICAL | PHYSICAL] ... > DROP_REPLICATION_SLOT ... > > where START_REPLICATION internally does acquire/release on the passed > SLOT. > > Does that work for you? Yep, no objections. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
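The split agreed above — persistent create/drop versus a transient acquire/release bracketing each START_REPLICATION — can be modeled with a toy slot table. Only the function names mirror the proposal; the struct, the fixed-size array, and all the bodies below are invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_SLOTS 4

typedef struct ReplicationSlot
{
    char name[64];
    bool in_use;    /* slot exists (persistent across sessions) */
    bool active;    /* slot currently acquired by some backend */
} ReplicationSlot;

static ReplicationSlot slots[MAX_SLOTS];
static ReplicationSlot *MyReplicationSlot;  /* slot this backend holds */

static ReplicationSlot *
find_slot(const char *name)
{
    for (int i = 0; i < MAX_SLOTS; i++)
        if (slots[i].in_use && strcmp(slots[i].name, name) == 0)
            return &slots[i];
    return NULL;
}

void
ReplicationSlotCreate(const char *name)
{
    assert(find_slot(name) == NULL);        /* no duplicate names */
    for (int i = 0; i < MAX_SLOTS; i++)
        if (!slots[i].in_use)
        {
            snprintf(slots[i].name, sizeof(slots[i].name), "%s", name);
            slots[i].in_use = true;
            slots[i].active = false;
            return;
        }
    assert(!"all replication slots in use");
}

void
ReplicationSlotAcquire(const char *name)
{
    ReplicationSlot *slot = find_slot(name);
    assert(slot != NULL && !slot->active);  /* one user at a time */
    slot->active = true;
    MyReplicationSlot = slot;
}

void
ReplicationSlotRelease(void)
{
    assert(MyReplicationSlot != NULL);
    MyReplicationSlot->active = false;
    MyReplicationSlot = NULL;
}

void
ReplicationSlotDrop(const char *name)
{
    ReplicationSlot *slot = find_slot(name);
    assert(slot != NULL && !slot->active);  /* cannot drop while acquired */
    slot->in_use = false;
}

/* START_REPLICATION SLOT s ... brackets the streaming with
 * acquire/release; CREATE/DROP manage the persistent entry. */
void
StartReplication(const char *slotname)
{
    ReplicationSlotAcquire(slotname);
    /* ... stream changes from the slot's position ... */
    ReplicationSlotRelease();
}
```

The key property this models is that "active" is session-lifetime state while "in_use" survives it, which is why the acquire/release verbs moved inside START_REPLICATION rather than being the client-visible commands.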
On 2013-12-12 16:49:33 +0100, Andres Freund wrote: > On 2013-12-12 10:01:21 -0500, Robert Haas wrote: > > On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > > As far as I have been thinking of, this would be another catalog table like > > > pg_decoding_plugin(oid, dpname name, dpload regproc). > > > > Instead of adding another catalog table, I think we should just define > > a new type. Again, please look at the way that foreign data wrappers > > do this: > > I don't really see what the usage of a special type has to do with this, > but I think that's besides your main point. What you're saying is that > the output plugin is just defined by a function name, possibly schema > prefixed. That has an elegance to it. +1 Ok, so I've implemented this, but I am not so sure it's sufficient; there's an issue: Currently a logical replication slot has a plugin assigned, previously identified just by the basename of a .so. But with the above proposal the identifier points to a function, currently via its oid. What happens if somebody drops or recreates the function? We can't make pg_depend entries or anything since that won't work on a standby. Earlier, if somebody removed the .so we'd just error out, but pg's dependency tracking has always only mattered for things inside the catalogs. I see the following possible solutions for this: 1) Accept that fact, and throw an error if the function doesn't exist anymore, or has an unsuitable signature. We can check the return type of output_plugin_callbacks, so that's a pretty specific test. 2) Create a pg_output_plugin catalog and prevent DROP OUTPUT PLUGIN (or similar) when there's a slot defined. But how'd that work if the slot is only defined on standbys? We could have the redo routine block and/or kill the slot if necessary? 3) Don't assign a specific output plugin to a slot, but have it specified every time data is streamed, not just when a slot is created. 
Currently that wouldn't be a problem, but I am afraid it will constrain some future optimizations. Any good ideas? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2014-01-07 17:54:21 +0100, Andres Freund wrote: > On 2013-12-12 16:49:33 +0100, Andres Freund wrote: > > On 2013-12-12 10:01:21 -0500, Robert Haas wrote: > > > On Thu, Dec 12, 2013 at 7:04 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > > > As far as I have been thinking of, this would be another catalog table like > > > > pg_decoding_plugin(oid, dpname name, dpload regproc). > > > > > > Instead of adding another catalog table, I think we should just define > > > a new type. Again, please look at the way that foreign data wrappers > > > do this: > > > > I don't really see what the usage of a special type has to do with this, > > but I think that's besides your main point. What you're saying is that > > the output plugin is just defined by a function name, possibly schema > > prefixed. That has an elegance to it. +1 > > Ok, so I've implemented this, but I am not so sure it's sufficient, > there's some issue: > Currently a logical replication slot has a plugin assigned, previously > that has just been identified by the basename of a .so. But with the > above proposal the identifier is pointing to a function, currently via > its oid. But what happens if somebody drops or recreates the function? > We can't make pg_depend entries or anything since that won't work on a > standby. > Earlier, if somebody removed the .so we'd just error out, but pg's > dependency tracking always only mattered to things inside the catalogs. > > I see the following possible solutions for this: > > 1) accept that fact, and throw an error if the function doesn't exist > anymore, or has an unsuitable signature. We can check the return type of > output_plugin_callbacks, so that's a pretty specific test. > > 2) Create a pg_output_plugin catalog and prevent DROP OUTPUT PLUGIN (or > similar) when there's a slot defined. But how'd that work if the slot is > only defined on standbys? We could have the redo routine block and/or > kill the slot if necessary? 
> > 3) Don't assign a specific output plugin to a slot, but have it > specified every time data is streamed, not just when a slot is > created. Currently that wouldn't be a problem, but I am afraid it will > constrict some future optimizations. > > Good ideas? So, Robert and I had an IM discussion about this. Neither of us was particularly happy about the proposed solutions. What we've concluded is that using a function as the handler doesn't work out well enough given the constraints (primarily the inability to create dependency records on a HS node). We've concluded that the best way forward is a variant of the current implementation where the output plugin is specified as a dynamic library. That is: CREATE_REPLICATION_SLOT slot_name LOGICAL OUTPUT_PLUGIN library_name; but in contrast to the current code where each individual output plugin callback is dlsym()ed via a fixed function name, only a _PG_output_plugin_init() function is looked up and called, which fills out a struct containing the individual callbacks. Additionally the "init" and "cleanup" output plugin callbacks will be renamed to startup/shutdown to avoid possible confusion. This unfortunately still prohibits implementing output plugins within the core postgres binary, but that can be solved by shipping core-provided output plugins - should they ever exist - as shared objects, as is already done for libpqwalreceiver. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
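On the plugin side, the proposed scheme — look up a single _PG_output_plugin_init() symbol via dlsym() and have it fill in a callback struct — would look roughly like this. The struct layout and the simplified callback signatures below are illustrative guesses, not the patch's actual definitions:

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative stand-in for the callback struct the backend would pass
 * to the plugin's init function; real signatures take decoding
 * context, transaction, relation, and change arguments. */
typedef struct OutputPluginCallbacks
{
    void (*startup_cb)(void *ctx);   /* optional (was "init") */
    void (*begin_cb)(void *ctx);     /* obligatory */
    void (*change_cb)(void *ctx);    /* obligatory */
    void (*commit_cb)(void *ctx);    /* obligatory */
    void (*shutdown_cb)(void *ctx);  /* optional (was "cleanup") */
} OutputPluginCallbacks;

static void my_begin(void *ctx)  { (void) ctx; puts("BEGIN"); }
static void my_change(void *ctx) { (void) ctx; puts("CHANGE"); }
static void my_commit(void *ctx) { (void) ctx; puts("COMMIT"); }

/* The one symbol the backend dlsym()s after loading the plugin's
 * shared object; every other callback is reached through the struct. */
void
_PG_output_plugin_init(OutputPluginCallbacks *cb)
{
    cb->startup_cb = NULL;      /* optional callback omitted */
    cb->begin_cb = my_begin;
    cb->change_cb = my_change;
    cb->commit_cb = my_commit;
    cb->shutdown_cb = NULL;     /* optional callback omitted */
}
```

Compared to dlsym()ing five fixed names, this gives the backend one well-typed handshake point and lets the plugin leave optional callbacks NULL instead of the backend probing for missing symbols.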