Thread: Logical decoding & exported base snapshot
Hi,

When initiating a new logical replication "slot" I want to provide a
'SET TRANSACTION SNAPSHOT'-able snapshot which can be used to set up a new
replica. I have some questions around this on which I could use some input.

First, some basics about how this currently works. Connect with a replication
connection:

psql "port=5440 host=/tmp dbname=postgres replication=1"

postgres=# IDENTIFY_SYSTEM;
      systemid       | timeline |  xlogpos  |  dbname
---------------------+----------+-----------+----------
 5820768138794874841 |        1 | 0/190AF98 | postgres
(1 row)

Now, initiate a replication slot:

postgres=# INIT_LOGICAL_REPLICATION 'test';
WARNING:  Initiating logical rep
WARNING:  reached consistent point, stopping!
 replication_id | consistent_point | snapshot_name | plugin
----------------+------------------+---------------+--------
 id-0           | 0/190AFD0        | 0xDEADBEEF    | test
(1 row)

If you actually wanted to receive changes you would now (and later) use

START_LOGICAL_REPLICATION 'id-0' 0/190AFD0;

to stream the changes from that point onwards.

INIT_LOGICAL_REPLICATION starts to decode WAL from the last checkpoint on and
reads until it finds a point where it has enough information (including a
suitable xmin horizon) to decode the contents. Then it tells you the name of
the newly created slot, the WAL location and the name of a snapshot it
exported (obviously fake here). I have the code to export a real snapshot, but
exporting the snapshot is actually the easy part.

The idea with the exported snapshot obviously is that you can start a pg_dump
using it to replicate the base data for a new replica.

Problem 1:

One problem I see is that while exporting a snapshot solves the visibility
issues of the table's contents it does not protect against schema changes. I
am not sure whether that's a problem.

If somebody runs a CLUSTER or something like that, the table's contents will
be preserved including MVCC semantics. That's fine. The more problematic cases
I see are TRUNCATE, DROP and ALTER TABLE. Imagine the following:

S1: INIT_LOGICAL_REPLICATION
S1: get snapshot: 00000333-1
S2: ALTER TABLE foo ALTER COLUMN blub TYPE text USING (blub::text);
S3: pg_dump --snapshot 00000333-1
S1: START_LOGICAL_REPLICATION

In that case the pg_dump would dump foo using the schema *after* the ALTER
TABLE but showing only rows visible to our snapshot. After
START_LOGICAL_REPLICATION all changes after the xlog position from
INIT_LOGICAL_REPLICATION will be returned though - including all the tuples
from the ALTER TABLE and potentially - if some form of schema capturing was in
place - the ALTER TABLE itself. The copied schema would have the new format
already though.

Does anybody see that as a problem or is it just a case of PEBKAC? One
argument for the latter is that this is already a problematic case for normal
pg_dump runs. It's just that the window is a bit larger here.

Problem 2:

Control Flow.

To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction needs to
be alive. That's currently solved by exporting the snapshot in the walsender
connection that did the INIT_LOGICAL_REPLICATION. The question is how long
should we preserve that snapshot?

There is no requirement - and there *cannot* be one in the general case, the
slot needs to be usable after a restart - that START_LOGICAL_REPLICATION has
to be run in the same replication connection as INIT.

Possible solutions:

1) INIT_LOGICAL_REPLICATION waits for an answer from the client that confirms
   that logical replication initialization is finished. Before that the
   walsender connection cannot be used for anything else.

2) We remove the snapshot as soon as any other command is received; this way
   the replication connection stays usable, e.g. to issue a
   START_LOGICAL_REPLICATION in parallel to the initial data dump. In that
   case the snapshot would have to be imported *before* the next command was
   received, as SET TRANSACTION SNAPSHOT requires the source transaction to
   still be open.

Opinions?

Andres

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
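For illustration, a minimal sketch of how a client could consume the exported
snapshot, using the slot and the (placeholder) snapshot name from the example
above. The only firm rules are that the importing transaction must run at
REPEATABLE READ or higher and must import the snapshot before issuing any
query:

    -- walsender connection (replication protocol):
    INIT_LOGICAL_REPLICATION 'test';
    --   returns slot 'id-0', consistent point 0/190AFD0 and a snapshot
    --   name, e.g. 00000333-1

    -- ordinary SQL connection, while the exported snapshot is still alive:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000333-1';   -- before any query in this xact
    -- ... copy the base data here, seeing exactly the slot's initial state ...
    COMMIT;

    -- walsender connection, to stream everything after the consistent point:
    START_LOGICAL_REPLICATION 'id-0' 0/190AFD0;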
On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> One problem I see is that while exporting a snapshot solves the
> visibility issues of the table's contents it does not protect against
> schema changes. I am not sure whether that's a problem.
>
> If somebody runs a CLUSTER or something like that, the table's contents
> will be preserved including MVCC semantics. That's fine.
> The more problematic cases I see are TRUNCATE, DROP and ALTER
> TABLE.

This is why the pg_dump master process executes a

    lock table <table> in access share mode

for every table, so your commands would all block.

In fact it's even more complicated because the pg_dump worker processes also
need to lock the table. They try to get a similar lock in "NOWAIT" mode right
before dumping the table. If they don't get the lock, that means that somebody
else is waiting for an exclusive lock (this is the case you describe) and the
backup will fail.

> Problem 2:
>
> To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
> needs to be alive. That's currently solved by exporting the snapshot in
> the walsender connection that did the INIT_LOGICAL_REPLICATION. The
> question is how long should we preserve that snapshot?

You lost me on this one after the first sentence... But in general the
snapshot isn't so much a magical thing: as long as the exporting transaction
is open, it guarantees that there is a transaction out there that is holding
off vacuum from removing data, and it also guarantees that the snapshot as
such existed at some time in the past.

Once it is applied to another transaction you now have two transactions that
will hold off vacuum because they share the same xmin/xmax values. You could
also just end the exporting transaction at that point.

One thought at the time was to somehow integrate exported snapshots with the
prepared transactions feature: then you could export a snapshot, prepare the
exporting transaction and have that snapshot frozen forever, and it would even
survive a server restart.
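A rough sketch of the locking pattern Joachim describes, with a made-up table
name; this is a simplification of the idea, not pg_dump's actual code:

    -- master process, at the start of its dump transaction:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    LOCK TABLE public.foo IN ACCESS SHARE MODE;

    -- worker process, in its own transaction, right before dumping the table:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    LOCK TABLE public.foo IN ACCESS SHARE MODE NOWAIT;
    -- failure here means a conflicting (e.g. ACCESS EXCLUSIVE) lock request
    -- was queued after the master's lock was granted, and pg_dump aborts the
    -- backup rather than dump a possibly inconsistent table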
Joachim Wieland <joe@mcknight.de> writes:
> On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> One problem I see is that while exporting a snapshot solves the
>> visibility issues of the table's contents it does not protect against
>> schema changes. I am not sure whether that's a problem.
>>
>> If somebody runs a CLUSTER or something like that, the table's contents
>> will be preserved including MVCC semantics. That's fine.
>> The more problematic cases I see are TRUNCATE, DROP and ALTER
>> TABLE.

> This is why the pg_dump master process executes a
>     lock table <table> in access share mode
> for every table, so your commands would all block.

A lock doesn't protect against schema changes made before the lock was taken.
The reason that the described scenario is problematic is that pg_dump is going
to be expected to work against a snapshot made before it gets a chance to take
those table locks. Thus, there's a window where DDL is dangerous and will
invalidate the dump --- perhaps without any warning.

Now, we have this problem today, in that pg_dump has to read pg_class before
it can take table locks, so some window exists already. What's bothering me
about what Andres describes is that the window for trouble seems to be getting
much bigger.

			regards, tom lane
On 12-12-11 06:52 PM, Andres Freund wrote:
> Hi,
>
> Problem 1:
>
> One problem I see is that while exporting a snapshot solves the
> visibility issues of the table's contents it does not protect against
> schema changes. I am not sure whether that's a problem.
>
> If somebody runs a CLUSTER or something like that, the table's contents
> will be preserved including MVCC semantics. That's fine.
> The more problematic cases I see are TRUNCATE, DROP and ALTER
> TABLE. Imagine the following:
>
> S1: INIT_LOGICAL_REPLICATION
> S1: get snapshot: 00000333-1
> S2: ALTER TABLE foo ALTER COLUMN blub TYPE text USING (blub::text);
> S3: pg_dump --snapshot 00000333-1
> S1: START_LOGICAL_REPLICATION
>
> In that case the pg_dump would dump foo using the schema *after* the
> ALTER TABLE but showing only rows visible to our snapshot. After
> START_LOGICAL_REPLICATION all changes after the xlog position from
> INIT_LOGICAL_REPLICATION will be returned though - including all the
> tuples from the ALTER TABLE and potentially - if some form of schema
> capturing was in place - the ALTER TABLE itself. The copied schema would
> have the new format already though.
>
> Does anybody see that as a problem or is it just a case of PEBKAC? One
> argument for the latter is that this is already a problematic case for
> normal pg_dump runs. It's just that the window is a bit larger here.

Is there any way to detect this situation as part of the pg_dump? If I could
detect this, abort my pg_dump and then go and get a new snapshot, then I don't
see this as a big deal. I can live with telling users, "don't do DDL-like
things while subscribing a new node; if you do, the subscription will
restart". I am less keen on telling users "don't do DDL-like things while
subscribing a new node or the results will be unpredictable".

I'm trying to convince myself whether I will be able to take a pg_dump from an
exported snapshot plus the changes made in between the snapshot and some later
time and turn the results into a consistent database. What if something like
this comes along:

INIT REPLICATION
insert into foo (id,bar) values (1,2);
alter table foo drop column bar;
pg_dump --snapshot

The schema I get as part of the pg_dump won't have bar because it has been
dropped, even though it will have the rows that existed with bar. I then go to
process the INSERT statement. It will have a WAL record with column data for
bar, and the logical replication replay will look up the catalog rows from
before the ALTER TABLE, so it will generate a logical INSERT record with bar.
That will fail on the replica.

I'm worried that it will be difficult to programmatically stitch together the
inconsistent snapshot from the pg_dump plus the logical records generated in
between the snapshot and the dump (along with any record of the DDL if it
exists).

> Problem 2:
>
> Control Flow.
>
> To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
> needs to be alive. That's currently solved by exporting the snapshot in
> the walsender connection that did the INIT_LOGICAL_REPLICATION. The
> question is how long should we preserve that snapshot?
>
> There is no requirement - and there *cannot* be one in the general case,
> the slot needs to be usable after a restart - that
> START_LOGICAL_REPLICATION has to be run in the same replication
> connection as INIT.
>
> Possible solutions:
> 1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
>    confirms that logical replication initialization is finished. Before
>    that the walsender connection cannot be used for anything else.
>
> 2) We remove the snapshot as soon as any other command is received; this
>    way the replication connection stays usable, e.g. to issue a
>    START_LOGICAL_REPLICATION in parallel to the initial data dump. In
>    that case the snapshot would have to be imported *before* the next
>    command was received, as SET TRANSACTION SNAPSHOT requires the source
>    transaction to still be open.
>
> Opinions?

Option 2 sounds more flexible. Is it more difficult to implement?

> Andres
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-11 21:05:51 -0500, Joachim Wieland wrote:
> On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > One problem I see is that while exporting a snapshot solves the
> > visibility issues of the table's contents it does not protect against
> > schema changes. I am not sure whether that's a problem.
> >
> > If somebody runs a CLUSTER or something like that, the table's contents
> > will be preserved including MVCC semantics. That's fine.
> > The more problematic cases I see are TRUNCATE, DROP and ALTER
> > TABLE.
>
> This is why the pg_dump master process executes a
>     lock table <table> in access share mode
> for every table, so your commands would all block.
>
> In fact it's even more complicated because the pg_dump worker
> processes also need to lock the table. They try to get a similar lock
> in "NOWAIT" mode right before dumping the table. If they don't get the
> lock, that means that somebody else is waiting for an exclusive lock
> (this is the case you describe) and the backup will fail.

[Tom explains why this is problematic way better than I do, see his email]

A trivial example - you can play nastier games than that though:

S1: CREATE TABLE test(id int);
S1: INSERT INTO test VALUES(1), (2);
S1: BEGIN;
S1: ALTER TABLE test RENAME COLUMN id TO id2;
S2: pg_dump
S2: waits
S1: COMMIT;

pg_dump: [archiver (db)] query failed: ERROR:  column "id" of relation "test" does not exist
pg_dump: [archiver (db)] query was: COPY public.test (id) TO stdout;

> > Problem 2:
> >
> > To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
> > needs to be alive. That's currently solved by exporting the snapshot in
> > the walsender connection that did the INIT_LOGICAL_REPLICATION. The
> > question is how long should we preserve that snapshot?
>
> You lost me on this one after the first sentence... But in general the
> snapshot isn't so much a magical thing: as long as the exporting
> transaction is open, it guarantees that there is a transaction out
> there that is holding off vacuum from removing data, and it also
> guarantees that the snapshot as such existed at some time in the past.
>
> Once it is applied to another transaction you now have two
> transactions that will hold off vacuum because they share the same
> xmin/xmax values. You could also just end the exporting transaction at
> that point.

I should probably have given a bit more context here... All this is in the
context of decoding WAL back into something useful for the purpose of
replication. This is done in the already existing walsender process, using the
commands I explained in the original post.

What we do there is walk the WAL from some point until we can assemble a
snapshot for exactly that LSN. Several preconditions need to be met there
(like a low enough xmin horizon, so the old catalog entries are still around).
Using that snapshot we can now decode the entries stored in the WAL.

If you set up a new replica though, getting all changes from one point where
you're consistent isn't all that interesting if you cannot also get a backup
from exactly that point in time, because otherwise it's very hard to start
with something sensible on the standby without locking everything down. So the
idea is that once we have found that consistent snapshot we also export it.

Only that we are in a walsender connection which doesn't expose transactional
semantics, so the snapshot cannot be "deallocated" by COMMITting. So the
question is basically about how to design the user interface of that
deallocation.

Makes slightly more sense now?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
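For contrast, in an ordinary backend the lifetime question mostly answers
itself, because the exported snapshot is tied to a plain transaction (a
minimal sketch; the snapshot name shown is just an example):

    -- ordinary backend:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();   -- returns a name such as '00000333-1'
    -- the snapshot can be imported by other sessions until this
    -- transaction ends:
    COMMIT;

The walsender has no such transaction to COMMIT, which is exactly why the
deallocation question above needs an explicit answer.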
On 2012-12-11 22:20:18 -0500, Tom Lane wrote:
> Joachim Wieland <joe@mcknight.de> writes:
> > On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> One problem I see is that while exporting a snapshot solves the
> >> visibility issues of the table's contents it does not protect against
> >> schema changes. I am not sure whether that's a problem.
> >>
> >> If somebody runs a CLUSTER or something like that, the table's contents
> >> will be preserved including MVCC semantics. That's fine.
> >> The more problematic cases I see are TRUNCATE, DROP and ALTER
> >> TABLE.
>
> > This is why the pg_dump master process executes a
> >     lock table <table> in access share mode
> > for every table, so your commands would all block.
>
> A lock doesn't protect against schema changes made before the lock was
> taken. The reason that the described scenario is problematic is that
> pg_dump is going to be expected to work against a snapshot made before
> it gets a chance to take those table locks. Thus, there's a window
> where DDL is dangerous and will invalidate the dump --- perhaps without
> any warning.

Exactly.

> Now, we have this problem today, in that pg_dump has to read pg_class
> before it can take table locks, so some window exists already.

Yea. And if somebody else already has conflicting locks it's pretty damn easy
to hit, as shown in my answer to Joachim... I am pretty sure there are lots of
nasty silent variants.

> What's bothering me about what Andres describes is that the window for
> trouble seems to be getting much bigger.

Me too. If it only were clearly visible errors I wouldn't be bothered too
much, but ...

This morning I wondered whether we couldn't protect against that by acquiring
share locks on the catalog rows pg_dump reads. That would result in "could not
serialize access due to concurrent update" type errors, which would be easy
enough to discern/translate. While pretty damn ugly, that should take care of
most of those issues, shouldn't it?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-11 22:39:14 -0500, Steve Singer wrote:
> On 12-12-11 06:52 PM, Andres Freund wrote:
> > Hi,
> >
> > Problem 1:
> >
> > One problem I see is that while exporting a snapshot solves the
> > visibility issues of the table's contents it does not protect against
> > schema changes. I am not sure whether that's a problem.
> >
> > If somebody runs a CLUSTER or something like that, the table's contents
> > will be preserved including MVCC semantics. That's fine.
> > The more problematic cases I see are TRUNCATE, DROP and ALTER
> > TABLE. Imagine the following:
> >
> > S1: INIT_LOGICAL_REPLICATION
> > S1: get snapshot: 00000333-1
> > S2: ALTER TABLE foo ALTER COLUMN blub TYPE text USING (blub::text);
> > S3: pg_dump --snapshot 00000333-1
> > S1: START_LOGICAL_REPLICATION
> >
> > In that case the pg_dump would dump foo using the schema *after* the
> > ALTER TABLE but showing only rows visible to our snapshot. After
> > START_LOGICAL_REPLICATION all changes after the xlog position from
> > INIT_LOGICAL_REPLICATION will be returned though - including all the
> > tuples from the ALTER TABLE and potentially - if some form of schema
> > capturing was in place - the ALTER TABLE itself. The copied schema would
> > have the new format already though.
> >
> > Does anybody see that as a problem or is it just a case of PEBKAC? One
> > argument for the latter is that this is already a problematic case for
> > normal pg_dump runs. It's just that the window is a bit larger here.
>
> Is there any way to detect this situation as part of the pg_dump? If I
> could detect this, abort my pg_dump and then go and get a new snapshot,
> then I don't see this as a big deal. I can live with telling users, "don't
> do DDL-like things while subscribing a new node; if you do, the
> subscription will restart". I am less keen on telling users "don't do
> DDL-like things while subscribing a new node or the results will be
> unpredictable".

I am trying to think of an unintrusive way to detect this...

> I'm trying to convince myself whether I will be able to take a pg_dump
> from an exported snapshot plus the changes made in between the snapshot
> and some later time and turn the results into a consistent database. What
> if something like this comes along
>
> ...
>
> I'm worried that it will be difficult to programmatically stitch together
> the inconsistent snapshot from the pg_dump plus the logical records
> generated in between the snapshot and the dump (along with any record of
> the DDL if it exists).

I think trying to solve that in the replication solution is a bad idea. There
are too many possible scenarios, and some of them are very subtle and hard to
detect. So I think it's either:

1) find something that tests for this and abort if so
2) acquire locks earlier, preventing DDL altogether until pg_dump starts
3) don't care. The problem exists today and not many people have complained.

> > Problem 2:
> >
> > Control Flow.
> >
> > To be able to do a "SET TRANSACTION SNAPSHOT" the source transaction
> > needs to be alive. That's currently solved by exporting the snapshot in
> > the walsender connection that did the INIT_LOGICAL_REPLICATION. The
> > question is how long should we preserve that snapshot?
> >
> > There is no requirement - and there *cannot* be one in the general case,
> > the slot needs to be usable after a restart - that
> > START_LOGICAL_REPLICATION has to be run in the same replication
> > connection as INIT.
> >
> > Possible solutions:
> > 1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
> >    confirms that logical replication initialization is finished. Before
> >    that the walsender connection cannot be used for anything else.
> >
> > 2) We remove the snapshot as soon as any other command is received; this
> >    way the replication connection stays usable, e.g. to issue a
> >    START_LOGICAL_REPLICATION in parallel to the initial data dump. In
> >    that case the snapshot would have to be imported *before* the next
> >    command was received, as SET TRANSACTION SNAPSHOT requires the source
> >    transaction to still be open.
>
> Option 2 sounds more flexible. Is it more difficult to implement?

No, I don't think so. It's a bit more intrusive in that it requires knowledge
about logical replication in more parts of walsender, but it should be ok.

Note, btw, that my description of 1) was easy to misunderstand. The "that" in
"Before that the walsender connection cannot be used for anything else." is
the answer from the client, not the usage of the exported snapshot. Once the
snapshot has been imported into other session(s) the source doesn't need to be
alive anymore.

Does that explanation change anything?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-12 12:13:44 +0100, Andres Freund wrote:
> On 2012-12-11 22:20:18 -0500, Tom Lane wrote:
> > Joachim Wieland <joe@mcknight.de> writes:
> > > On Tue, Dec 11, 2012 at 6:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > >> One problem I see is that while exporting a snapshot solves the
> > >> visibility issues of the table's contents it does not protect against
> > >> schema changes. I am not sure whether that's a problem.
> > >>
> > >> If somebody runs a CLUSTER or something like that, the table's contents
> > >> will be preserved including MVCC semantics. That's fine.
> > >> The more problematic cases I see are TRUNCATE, DROP and ALTER
> > >> TABLE.
> >
> > > This is why the pg_dump master process executes a
> > >     lock table <table> in access share mode
> > > for every table, so your commands would all block.
> >
> > A lock doesn't protect against schema changes made before the lock was
> > taken. The reason that the described scenario is problematic is that
> > pg_dump is going to be expected to work against a snapshot made before
> > it gets a chance to take those table locks. Thus, there's a window
> > where DDL is dangerous and will invalidate the dump --- perhaps without
> > any warning.
> >
> > Now, we have this problem today, in that pg_dump has to read pg_class
> > before it can take table locks, so some window exists already.
> >
> > What's bothering me about what Andres describes is that the window for
> > trouble seems to be getting much bigger.
>
> This morning I wondered whether we couldn't protect against that by
> acquiring share locks on the catalog rows pg_dump reads. That would
> result in "could not serialize access due to concurrent update" type
> errors, which would be easy enough to discern/translate. While pretty
> damn ugly, that should take care of most of those issues, shouldn't it?

After a quick look it doesn't look too hard to add this; does anybody have an
opinion on whether it's something worthwhile? And possibly a suggested option
name? I can't come up with something better than --recheck-locks or something,
but that's awful.

I don't think there's too much point in locking anything but pg_class,
pg_attribute and pg_type; nearly everything else is read using an MVCC
snapshot anyway or isn't all that critical. Other candidates?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2012-12-12 12:13:44 +0100, Andres Freund wrote:
>> This morning I wondered whether we couldn't protect against that by
>> acquiring share locks on the catalog rows pg_dump reads. That would
>> result in "could not serialize access due to concurrent update" type
>> errors, which would be easy enough to discern/translate. While pretty
>> damn ugly, that should take care of most of those issues, shouldn't it?

How would it fix anything? The problem is with DDL that's committed and gone
before pg_dump ever gets to the table's pg_class row. Once it does, and takes
AccessShareLock on the relation, it's safe. Adding a SELECT FOR SHARE step
just adds more time before we can get that lock.

Also, locking the pg_class row doesn't provide protection against DDL that
doesn't modify the relation's pg_class row, of which there is plenty.

			regards, tom lane
On 2012-12-12 18:52:33 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2012-12-12 12:13:44 +0100, Andres Freund wrote:
> >> This morning I wondered whether we couldn't protect against that by
> >> acquiring share locks on the catalog rows pg_dump reads. That would
> >> result in "could not serialize access due to concurrent update" type
> >> errors, which would be easy enough to discern/translate. While pretty
> >> damn ugly, that should take care of most of those issues, shouldn't it?
>
> How would it fix anything? The problem is with DDL that's committed and
> gone before pg_dump ever gets to the table's pg_class row. Once it does,
> and takes AccessShareLock on the relation, it's safe. Adding a SELECT FOR
> SHARE step just adds more time before we can get that lock.

Getting a FOR SHARE lock ought to error out with a serialization failure if
the row was updated since our snapshot started, as pg_dump uses repeatable
read/serializable. Now, that obviously doesn't fix the situation, but making
it detectable in a safe way seems to be good enough for me.

> Also, locking the pg_class row doesn't provide protection against DDL
> that doesn't modify the relation's pg_class row, of which there is
> plenty.

Well, that's why I thought of pg_class, pg_attribute and pg_type. Maybe that
list needs to be extended a bit, but I think just those three should detect
most dangerous situations.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
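For anyone unfamiliar with the mechanism Andres is relying on, here is a
minimal illustration of the error he expects, using an ordinary table as a
stand-in for the catalog rows the proposal would actually lock (pg_class,
pg_attribute, pg_type):

    -- setup:
    CREATE TABLE catalog_stand_in(relname text);
    INSERT INTO catalog_stand_in VALUES ('foo');

    -- session 1, playing the role of pg_dump:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT * FROM catalog_stand_in;   -- first query fixes the snapshot

    -- session 2, playing the role of concurrent DDL touching a catalog row:
    UPDATE catalog_stand_in SET relname = 'foo_renamed';

    -- session 1, trying to take the row-level share lock:
    SELECT * FROM catalog_stand_in WHERE relname = 'foo' FOR SHARE;
    -- ERROR:  could not serialize access due to concurrent update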
On 12-12-12 06:20 AM, Andres Freund wrote:
>> Possible solutions:
>> 1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
>>    confirms that logical replication initialization is finished. Before
>>    that the walsender connection cannot be used for anything else.
>>
>> 2) We remove the snapshot as soon as any other command is received; this
>>    way the replication connection stays usable, e.g. to issue a
>>    START_LOGICAL_REPLICATION in parallel to the initial data dump. In
>>    that case the snapshot would have to be imported *before* the next
>>    command was received, as SET TRANSACTION SNAPSHOT requires the source
>>    transaction to still be open.
>> Option 2 sounds more flexible. Is it more difficult to implement?
> No, I don't think so. It's a bit more intrusive in that it requires
> knowledge about logical replication in more parts of walsender, but it
> should be ok.
>
> Note, btw, that my description of 1) was easy to misunderstand. The
> "that" in "Before that the walsender connection cannot be used for
> anything else." is the answer from the client, not the usage of the
> exported snapshot. Once the snapshot has been imported into other
> session(s) the source doesn't need to be alive anymore.
> Does that explanation change anything?

I think I understood you were saying the walsender connection can't be used
for anything else (i.e. streaming WAL) until the exported snapshot has been
imported. I think your clarification is still consistent with this?

With option 2 I can still get the option 1 behaviour by not sending the next
command to the walsender until I am done importing the snapshot. However, if I
want to start processing WAL before the snapshot has been imported I can do
that with option 2.

I am not sure I would need to do that, I'm just saying having the option is
more flexible.

> Greetings,
>
> Andres Freund
>
> --
>  Andres Freund                     http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-13 11:02:06 -0500, Steve Singer wrote:
> On 12-12-12 06:20 AM, Andres Freund wrote:
> >> Possible solutions:
> >> 1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
> >>    confirms that logical replication initialization is finished. Before
> >>    that the walsender connection cannot be used for anything else.
> >>
> >> 2) We remove the snapshot as soon as any other command is received;
> >>    this way the replication connection stays usable, e.g. to issue a
> >>    START_LOGICAL_REPLICATION in parallel to the initial data dump. In
> >>    that case the snapshot would have to be imported *before* the next
> >>    command was received, as SET TRANSACTION SNAPSHOT requires the
> >>    source transaction to still be open.
> >> Option 2 sounds more flexible. Is it more difficult to implement?
> > No, I don't think so. It's a bit more intrusive in that it requires
> > knowledge about logical replication in more parts of walsender, but it
> > should be ok.
> >
> > Note, btw, that my description of 1) was easy to misunderstand. The
> > "that" in "Before that the walsender connection cannot be used for
> > anything else." is the answer from the client, not the usage of the
> > exported snapshot. Once the snapshot has been imported into other
> > session(s) the source doesn't need to be alive anymore.
> > Does that explanation change anything?
>
> I think I understood you were saying the walsender connection can't be
> used for anything else (i.e. streaming WAL) until the exported snapshot
> has been imported. I think your clarification is still consistent with
> this?

Yes, that's correct.

> With option 2 I can still get the option 1 behaviour by not sending the
> next command to the walsender until I am done importing the snapshot.
> However, if I want to start processing WAL before the snapshot has been
> imported I can do that with option 2.
>
> I am not sure I would need to do that, I'm just saying having the option
> is more flexible.

True.

Still not sure what's better, but since you're slightly leaning towards 2) I
am going to implement that.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
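To spell out the ordering constraint option 2 imposes on a client, a rough
sketch of the intended sequence, reusing the slot and placeholder snapshot
name from earlier in the thread:

    -- walsender connection:
    INIT_LOGICAL_REPLICATION 'test';
    -- returns slot 'id-0', consistent point 0/190AFD0, snapshot '00000333-1';
    -- with option 2 the exported snapshot only survives until the next
    -- command arrives on this connection

    -- separate SQL connection, before anything else is sent on the walsender:
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SET TRANSACTION SNAPSHOT '00000333-1';
    -- the base data dump runs inside this transaction

    -- walsender connection: now safe to start streaming, in parallel with
    -- the dump running in the SQL connection above
    START_LOGICAL_REPLICATION 'id-0' 0/190AFD0;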
On Tue, Dec 11, 2012 at 10:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This is why the pg_dump master process executes a
>>     lock table <table> in access share mode
>> for every table, so your commands would all block.
>
> A lock doesn't protect against schema changes made before the lock was
> taken. The reason that the described scenario is problematic is that
> pg_dump is going to be expected to work against a snapshot made before
> it gets a chance to take those table locks. Thus, there's a window
> where DDL is dangerous and will invalidate the dump --- perhaps without
> any warning.
>
> Now, we have this problem today, in that pg_dump has to read pg_class
> before it can take table locks, so some window exists already. What's
> bothering me about what Andres describes is that the window for trouble
> seems to be getting much bigger.

It would be sort of nice to have some kind of lock that would freeze out all
DDL (except for temporary objects, perhaps?). Then pg_dump could take that
lock before taking a snapshot. As a nifty side benefit, it would perhaps
provide a way around the problem that databases with many relations are
undumpable due to lock table exhaustion.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
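For reference, the lock table exhaustion Robert mentions is the familiar
failure when pg_dump takes an AccessShareLock per table in a database with
very many relations; the usual workaround today is raising
max_locks_per_transaction (the value below is only an example):

    ERROR:  out of shared memory
    HINT:  You might need to increase max_locks_per_transaction.

    # postgresql.conf - example value only; the shared lock table has room
    # for roughly max_locks_per_transaction * (max_connections +
    # max_prepared_transactions) object locks
    max_locks_per_transaction = 1024        # default is 64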
On 2012-12-13 17:11:31 -0500, Robert Haas wrote:
> On Tue, Dec 11, 2012 at 10:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> This is why the pg_dump master process executes a
> >>     lock table <table> in access share mode
> >> for every table, so your commands would all block.
> >
> > A lock doesn't protect against schema changes made before the lock was
> > taken. The reason that the described scenario is problematic is that
> > pg_dump is going to be expected to work against a snapshot made before
> > it gets a chance to take those table locks. Thus, there's a window
> > where DDL is dangerous and will invalidate the dump --- perhaps without
> > any warning.
> >
> > Now, we have this problem today, in that pg_dump has to read pg_class
> > before it can take table locks, so some window exists already. What's
> > bothering me about what Andres describes is that the window for trouble
> > seems to be getting much bigger.
>
> It would be sort of nice to have some kind of lock that would freeze
> out all DDL (except for temporary objects, perhaps?). Then pg_dump
> could take that lock before taking a snapshot. As a nifty side
> benefit, it would perhaps provide a way around the problem that
> databases with many relations are undumpable due to lock table
> exhaustion.

That would solve the consistency problem, yes. Would we need a special kind of
permission for this? I would say superuser/database owner only?

It should actually even work for standbys, in contrast to my FOR SHARE hack
idea, as we could "just" make it conflict with the exclusive locks logged into
the WAL.

I don't think it will help against the too-many-locks problem itself though,
unless we play some additional tricks. The locks will be acquired later when
the tables are dumped anyway. Now, given the snapshot import/export feature we
actually could dump them table by table in a single transaction, but that's a
bit more complicated.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Dec 13, 2012 at 5:25 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> That would solve the consistency problem, yes. Would we need a special
> kind of permission for this? I would say superuser/database owner only?

Yeah, I doubt we would need a whole new permission for it, but it would
certainly have to be disallowed for ordinary users.

> It should actually even work for standbys, in contrast to my FOR SHARE
> hack idea, as we could "just" make it conflict with the exclusive locks
> logged into the WAL.
>
> I don't think it will help against the too-many-locks problem itself
> though, unless we play some additional tricks. The locks will be
> acquired later when the tables are dumped anyway. Now, given the
> snapshot import/export feature we actually could dump them table by
> table in a single transaction, but that's a bit more complicated.

Well, I was thinking that if a transaction already had the
no-DDL-on-nontemp-objects lock then perhaps we could optimize away
AccessShareLocks on individual relations. Of course that would rely on the
global lock being an adequate substitute for all cases where an
AccessShareLock would otherwise be needed. Not sure how feasible that is. But
it would be really cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Dec 13, 2012 at 5:25 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> That would solve the consistency problem, yes. Would we need a special
>> kind of permission for this? I would say superuser/database owner only?

> Yeah, I doubt we would need a whole new permission for it, but it
> would certainly have to be disallowed for ordinary users.

Giving up the ability to run pg_dump as a non-superuser would be pretty
annoying, so this would have to be an optional feature if we restrict it that
way.

			regards, tom lane
On Thu, Dec 13, 2012 at 5:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Dec 13, 2012 at 5:25 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>>> That would solve the consistency problem, yes. Would we need a special
>>> kind of permission for this? I would say superuser/database owner only?
>
>> Yeah, I doubt we would need a whole new permission for it, but it
>> would certainly have to be disallowed for ordinary users.
>
> Giving up the ability to run pg_dump as a non-superuser would be pretty
> annoying, so this would have to be an optional feature if we restrict it
> that way.

Yeah, for sure.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2012-12-13 21:40:43 +0100, Andres Freund wrote:
> On 2012-12-13 11:02:06 -0500, Steve Singer wrote:
> > On 12-12-12 06:20 AM, Andres Freund wrote:
> > >> Possible solutions:
> > >> 1) INIT_LOGICAL_REPLICATION waits for an answer from the client that
> > >>    confirms that logical replication initialization is finished.
> > >>    Before that the walsender connection cannot be used for anything
> > >>    else.
> > >>
> > >> 2) We remove the snapshot as soon as any other command is received;
> > >>    this way the replication connection stays usable, e.g. to issue a
> > >>    START_LOGICAL_REPLICATION in parallel to the initial data dump. In
> > >>    that case the snapshot would have to be imported *before* the next
> > >>    command was received, as SET TRANSACTION SNAPSHOT requires the
> > >>    source transaction to still be open.
> > >> Option 2 sounds more flexible. Is it more difficult to implement?
> > > No, I don't think so. It's a bit more intrusive in that it requires
> > > knowledge about logical replication in more parts of walsender, but it
> > > should be ok.
> > >
> > > Note, btw, that my description of 1) was easy to misunderstand. The
> > > "that" in "Before that the walsender connection cannot be used for
> > > anything else." is the answer from the client, not the usage of the
> > > exported snapshot. Once the snapshot has been imported into other
> > > session(s) the source doesn't need to be alive anymore.
> > > Does that explanation change anything?
> >
> > I think I understood you were saying the walsender connection can't be
> > used for anything else (i.e. streaming WAL) until the exported snapshot
> > has been imported. I think your clarification is still consistent with
> > this?
>
> Yes, that's correct.
>
> > With option 2 I can still get the option 1 behaviour by not sending the
> > next command to the walsender until I am done importing the snapshot.
> > However, if I want to start processing WAL before the snapshot has been
> > imported I can do that with option 2.
> >
> > I am not sure I would need to do that, I'm just saying having the option
> > is more flexible.
>
> True.
>
> Still not sure what's better, but since you're slightly leaning towards 2)
> I am going to implement that.

Pushed and lightly tested.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services