Thread: [HACKERS] snapbuild woes
Hi,

I recently found a couple of issues with the way the initial logical decoding snapshot is built.

The first one is an outright bug, which has to do with how we track running transactions. What snapbuild basically does while building the initial snapshot is read the xl_running_xacts record, store the list of running transactions, and then wait until they all finish. The problem with this is that xl_running_xacts does not ensure that it only logs transactions that are actually still running (to avoid locking PGPROC), so there might be xids in xl_running_xacts that had already committed before it was logged. This in turn means that snapbuild will never reach a consistent snapshot when that happens. The reason this hasn't bitten us much yet is that snapbuild also considers all transactions finished when an empty xl_running_xacts arrives, which usually happens after a while (but might never happen on a consistently busy server). The fix I came up with is to extend that mechanism and also consider all transactions finished when an xl_running_xacts arrives whose xmin is bigger than the previous record's xmax. This seems to work pretty well on busy servers. The fix is attached as 0001-Mark-snapshot-consistent-when-all-running-txes-have.patch. I believe it should be backpatched all the way to 9.4.

The other issue is a performance problem with the initial snapshot, again on busy servers. We track transactions that modify the catalog so that we can do proper catalog time travel when decoding changes. For transactions that were already running when we started working towards the initial consistent snapshot, there is no good way to tell whether they made catalog changes, so we treat them all as catalog-changing and build a separate historical snapshot for every such transaction. That by itself is fine; the problem is that the current implementation also treats transactions that started after we began watching for changes, but before we reached the consistent state, as catalog-changing, even though for those we do know whether they made any catalog change. In practice this means we build snapshots that are not really necessary, and if there is a long-running transaction the snapshot builder has to wait for, we can create thousands of unused snapshots, which hurts performance badly (I've seen the initial snapshot take an hour because of this). The attached 0002-Skip-unnecessary-snapshot-builds.patch changes the behavior so that we don't build snapshots for transactions we have seen in their entirety and know made no catalog changes, even if we haven't reached consistency yet.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
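A minimal sketch of the check described above, roughly where SnapBuildFindSnapshot() handles a new xl_running_xacts record (the field names follow the existing builder->running bookkeeping, and TransactionIdFollows() is used so the comparison stays wraparound-safe, which is what later versions of the patch settle on; this is an illustration of the idea, not the patch itself, and the thread below ends up refining it into restarting the tracking instead):

    /*
     * The new record's oldest running xid is past the xmax of the
     * xl_running_xacts we have been tracking, so every transaction we
     * were waiting for must have finished, even if we never decoded a
     * commit record for some of them.
     */
    if (builder->running.xcnt > 0 &&
        TransactionIdFollows(running->oldestRunningXid,
                             builder->running.xmax))
    {
        builder->running.xcnt = 0;          /* nothing left to wait for */
        builder->state = SNAPBUILD_CONSISTENT;
    }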
Attachment
On 10/12/16 23:10, Petr Jelinek wrote:
>
> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
> behavior so that we don't make snapshots for transactions that we seen
> wholly and know that they didn't make catalog changes even if we didn't
> reach consistency yet.
>

Eh, attached wrong patch. This one is correct.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote:
On 10/12/16 23:10, Petr Jelinek wrote:
>
> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
> behavior so that we don't make snapshots for transactions that we seen
> wholly and know that they didn't make catalog changes even if we didn't
> reach consistency yet.
>
Eh, attached wrong patch. This one is correct.

Attached no patch second time?
On Sun, Dec 11, 2016 at 1:17 AM, Craig Ringer <craig.ringer@2ndquadrant.com> wrote: > On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote: >> On 10/12/16 23:10, Petr Jelinek wrote: >> >>> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this >>> behavior so that we don't make snapshots for transactions that we seen >>> wholly and know that they didn't make catalog changes even if we didn't >>> reach consistency yet. >> >> Eh, attached wrong patch. This one is correct. > > Attached no patch second time? I see an attachment, and it shows in the archives. https://www.postgresql.org/message-id/aee1d499-e3ca-e091-56da-1ee6a47741c8%402ndquadrant.com -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12 December 2016 at 00:36, Kevin Grittner <kgrittn@gmail.com> wrote: > On Sun, Dec 11, 2016 at 1:17 AM, Craig Ringer > <craig.ringer@2ndquadrant.com> wrote: >> On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote: >>> On 10/12/16 23:10, Petr Jelinek wrote: >>> >>>> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this >>>> behavior so that we don't make snapshots for transactions that we seen >>>> wholly and know that they didn't make catalog changes even if we didn't >>>> reach consistency yet. >>> >>> Eh, attached wrong patch. This one is correct. >> >> Attached no patch second time? > > I see an attachment, and it shows in the archives. > > https://www.postgresql.org/message-id/aee1d499-e3ca-e091-56da-1ee6a47741c8%402ndquadrant.com

Sorry for the noise; apparently my phone's mail client was being dense.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: > Hi, > First one is outright bug, which has to do with how we track running > transactions. What snapbuild basically does while doing initial snapshot > is read the xl_running_xacts record, store the list of running txes and > then wait until they all finish. The problem with this is that > xl_running_xacts does not ensure that it only logs transactions that are > actually still running (to avoid locking PGPROC) so there might be xids > in xl_running_xacts that already committed before it was logged. I don't think that's actually true? Notice how LogStandbySnapshot() only releases the lock *after* the LogCurrentRunningXacts() iff wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you observed must actually be a bit more complex :( Regards, Andres
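For readers without the source at hand, the structure Andres is referring to looks roughly like this (paraphrased and heavily trimmed from LogStandbySnapshot() in src/backend/storage/ipc/standby.c, not an exact quote):

    running = GetRunningTransactionData();   /* returns with ProcArrayLock held */

    /*
     * For wal_level < logical the lock can be released before the record
     * is inserted; for logical decoding it is kept until after the insert,
     * so the record should not be able to claim an already-committed xid
     * is still running... in theory.
     */
    if (wal_level < WAL_LEVEL_LOGICAL)
        LWLockRelease(ProcArrayLock);

    recptr = LogCurrentRunningXacts(running);

    if (wal_level >= WAL_LEVEL_LOGICAL)
        LWLockRelease(ProcArrayLock);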
On 12/12/16 22:42, Andres Freund wrote: > Hi, > > On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: >> Hi, >> First one is outright bug, which has to do with how we track running >> transactions. What snapbuild basically does while doing initial snapshot >> is read the xl_running_xacts record, store the list of running txes and >> then wait until they all finish. The problem with this is that >> xl_running_xacts does not ensure that it only logs transactions that are >> actually still running (to avoid locking PGPROC) so there might be xids >> in xl_running_xacts that already committed before it was logged. > > I don't think that's actually true? Notice how LogStandbySnapshot() > only releases the lock *after* the LogCurrentRunningXacts() iff > wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you > observed must actually be a bit more complex :( >

Hmm, interesting. I did see the transaction commit in the WAL before the xl_running_xacts that listed its xid as running. I've only seen it on a production system though; I didn't manage to reproduce it easily locally.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote: > On 12/12/16 22:42, Andres Freund wrote: > > Hi, > > > > On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: > >> Hi, > >> First one is outright bug, which has to do with how we track running > >> transactions. What snapbuild basically does while doing initial snapshot > >> is read the xl_running_xacts record, store the list of running txes and > >> then wait until they all finish. The problem with this is that > >> xl_running_xacts does not ensure that it only logs transactions that are > >> actually still running (to avoid locking PGPROC) so there might be xids > >> in xl_running_xacts that already committed before it was logged. > > > > I don't think that's actually true? Notice how LogStandbySnapshot() > > only releases the lock *after* the LogCurrentRunningXacts() iff > > wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you > > observed must actually be a bit more complex :( > > > > Hmm, interesting, I did see the transaction commit in the WAL before the > xl_running_xacts that contained the xid as running. I only seen it on > production system though, didn't really manage to easily reproduce it > locally. I suspect the reason for that is that RecordTransactionCommit() doesn't conflict with ProcArrayLock in the first place - only ProcArrayEndTransaction() does. So they're still running in the PGPROC sense, just not the crash-recovery sense... Andres
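Spelled out, the interleaving that explanation implies would look something like this (B is a backend committing xid X while the bgwriter or a checkpoint logs a standby snapshot; the function names are the real ones, the schedule is of course hypothetical):

    B:         RecordTransactionCommit()   -- commit record for X written to WAL
    bgwriter:  LogStandbySnapshot()        -- acquires ProcArrayLock, still sees X
               LogCurrentRunningXacts()    -- xl_running_xacts listing X as running
    B:         ProcArrayEndTransaction()   -- X finally leaves the procarray

So the commit of X appears in the WAL before an xl_running_xacts that still lists X, and a decoder starting from that record waits for a commit it will never see.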
On 12/12/16 23:33, Andres Freund wrote: > On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote: >> On 12/12/16 22:42, Andres Freund wrote: >>> Hi, >>> >>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: >>>> Hi, >>>> First one is outright bug, which has to do with how we track running >>>> transactions. What snapbuild basically does while doing initial snapshot >>>> is read the xl_running_xacts record, store the list of running txes and >>>> then wait until they all finish. The problem with this is that >>>> xl_running_xacts does not ensure that it only logs transactions that are >>>> actually still running (to avoid locking PGPROC) so there might be xids >>>> in xl_running_xacts that already committed before it was logged. >>> >>> I don't think that's actually true? Notice how LogStandbySnapshot() >>> only releases the lock *after* the LogCurrentRunningXacts() iff >>> wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you >>> observed must actually be a bit more complex :( >>> >> >> Hmm, interesting, I did see the transaction commit in the WAL before the >> xl_running_xacts that contained the xid as running. I only seen it on >> production system though, didn't really manage to easily reproduce it >> locally. > > I suspect the reason for that is that RecordTransactionCommit() doesn't > conflict with ProcArrayLock in the first place - only > ProcArrayEndTransaction() does. So they're still running in the PGPROC > sense, just not the crash-recovery sense... > That looks like reasonable explanation. BTW I realized my patch needs bit more work, currently it will break the actual snapshot as it behaves same as if the xl_running_xacts was empty which is not correct AFAICS. Also if we did the approach suggested by my patch (ie using this xmin/xmax comparison) I guess we wouldn't need to hold the lock for extra time in wal_level logical anymore. That is of course unless you think it should be approached from the other side of the stream and try log correct xl_running_xacts. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 13/12/16 00:38, Petr Jelinek wrote: > On 12/12/16 23:33, Andres Freund wrote: >> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote: >>> On 12/12/16 22:42, Andres Freund wrote: >>>> Hi, >>>> >>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: >>>>> Hi, >>>>> First one is outright bug, which has to do with how we track running >>>>> transactions. What snapbuild basically does while doing initial snapshot >>>>> is read the xl_running_xacts record, store the list of running txes and >>>>> then wait until they all finish. The problem with this is that >>>>> xl_running_xacts does not ensure that it only logs transactions that are >>>>> actually still running (to avoid locking PGPROC) so there might be xids >>>>> in xl_running_xacts that already committed before it was logged. >>>> >>>> I don't think that's actually true? Notice how LogStandbySnapshot() >>>> only releases the lock *after* the LogCurrentRunningXacts() iff >>>> wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you >>>> observed must actually be a bit more complex :( >>>> >>> >>> Hmm, interesting, I did see the transaction commit in the WAL before the >>> xl_running_xacts that contained the xid as running. I only seen it on >>> production system though, didn't really manage to easily reproduce it >>> locally. >> >> I suspect the reason for that is that RecordTransactionCommit() doesn't >> conflict with ProcArrayLock in the first place - only >> ProcArrayEndTransaction() does. So they're still running in the PGPROC >> sense, just not the crash-recovery sense... >> > > That looks like reasonable explanation. BTW I realized my patch needs > bit more work, currently it will break the actual snapshot as it behaves > same as if the xl_running_xacts was empty which is not correct AFAICS. >

Hi,

I got back to working on this. Unfortunately I haven't found a solution I am entirely happy with. What I did: when we read an xl_running_xacts in which all the transactions we are tracking have already finished, we start tracking again from that new xl_running_xacts, with the difference that we use the commits we have seen so far to clean up the list of "running" transactions. That means that on a busy server we may have to wait for multiple xl_running_xacts records rather than just one, but at least we have a chance to finish, unlike the current coding, which effectively waits for an empty xl_running_xacts. I also removed the additional locking for logical wal_level in LogStandbySnapshot() since it does not actually prevent the problem.

I also identified another bug in snapbuild while looking at the code: logical decoding will try to use an on-disk serialized snapshot for the initial snapshot export when it can. The problem is that these snapshots are quite special and are not really usable as snapshots for data (for example, logical decoding snapshots regularly have xmax smaller than xmin), so when a client tries to use such an exported snapshot it gets completely wrong data because the snapshot is broken for that purpose. I think this explains Erik Rijkers's problems with the initial COPY patch for logical replication; at least for me the issues go away when I disable the use of the on-disk snapshots. I didn't really find a better solution than that (disabling the use of on-disk snapshots for the initial consistent snapshot).

So to summarize the attached patches:

0001 - Fixes a performance issue where we build tons of snapshots that we don't need, which kills the CPU.

0002 - Disables the use of on-disk historical snapshots for the initial consistent snapshot export, as it may result in corrupt data. This definitely needs a backport.

0003 - Fixes a bug where we might never reach a consistent snapshot on a busy server due to a race condition in xl_running_xacts logging. The original use of extra locking does not seem to be enough in practice. Once we have an agreed fix for this it's probably worth backpatching. There are still some comments that need updating; this is more of a PoC.

Thoughts or better ideas?

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
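A rough sketch of the 0003 approach described above, reusing the builder's existing bookkeeping (the purge loop over previously decoded commits is the same call the actual patch uses, as quoted in the review later in the thread; the rest is illustrative):

    /*
     * All transactions from the xl_running_xacts we were tracking have
     * finished, yet we never became consistent, so some of them must have
     * committed before that record was written.  Restart tracking from the
     * new record as if it were the first one we saw ...
     */
    /* ... re-initialize builder->running from the new record here ... */

    /*
     * ... and immediately mark as finished any of those xids whose commit
     * we have already decoded, so we do not wait for them forever.
     */
    for (off = 0; off < builder->committed.xcnt; off++)
        SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);

    if (builder->state == SNAPBUILD_CONSISTENT)
        return false;       /* all of them had in fact already finished */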
Attachment
On 2017-02-22 03:05, Petr Jelinek wrote: > > So to summarize attached patches: > 0001 - Fixes performance issue where we build tons of snapshots that we > don't need which kills CPU. > > 0002 - Disables the use of ondisk historical snapshots for initial > consistent snapshot export as it may result in corrupt data. This > definitely needs backport. > > 0003 - Fixes bug where we might never reach snapshot on busy server due > to race condition in xl_running_xacts logging. The original use of > extra > locking does not seem to be enough in practice. Once we have agreed fix > for this it's probably worth backpatching. There are still some > comments > that need updating, this is more of a PoC. >

I am not entirely sure what to expect. Should a server with these 3 patches do the initial data copy or not? The sgml seems to imply there is no initial data copy, but my test does copy something.

Anyway, I have repeated the same old pgbench test, assuming the initial data copy should be working. With

> 0001-Skip-unnecessary-snapshot-builds.patch
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch

the consistent (but wrong) end state is always that only one of the four pgbench tables, pgbench_history, is replicated (always correctly).

Below is the output from the test (I've edited the lines for email). Here a, b, t, h stand for pgbench_accounts, pgbench_branches, pgbench_tellers, pgbench_history; the master is on port 6972, the replica on port 6973.

    port
    6972 a,b,t,h: 100000 1 10 347
    6973 a,b,t,h: 0 0 0 347

    a,b,t,h: a68efc81a 2c27f7ba5 128590a57 1e4070879 master
    a,b,t,h: d41d8cd98 d41d8cd98 d41d8cd98 1e4070879 replica NOK

The md5 strings are the md5 of the whole content of each table (an ordered select *).

I repeated this a few times: the number of rows in pgbench_history varies a bit, of course, but otherwise the result is always the same: 3 empty replica tables, with pgbench_history replicated correctly.

Something is not right.

thanks,

Erik Rijkers
On 22/02/17 11:29, Erik Rijkers wrote: > On 2017-02-22 03:05, Petr Jelinek wrote: >> >> So to summarize attached patches: >> 0001 - Fixes performance issue where we build tons of snapshots that we >> don't need which kills CPU. >> >> 0002 - Disables the use of ondisk historical snapshots for initial >> consistent snapshot export as it may result in corrupt data. This >> definitely needs backport. >> >> 0003 - Fixes bug where we might never reach snapshot on busy server due >> to race condition in xl_running_xacts logging. The original use of extra >> locking does not seem to be enough in practice. Once we have agreed fix >> for this it's probably worth backpatching. There are still some comments >> that need updating, this is more of a PoC. >> > > I am not not entirely sure what to expect. Should a server with these 3 > patches do initial data copy or not? The sgml seems to imply there is > not inital data copy. But my test does copy something. > Not by itself (without the copy patch), those fixes are for snapshots. > > With >> 0001-Skip-unnecessary-snapshot-builds.patch >> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch >> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch > > the consistent (but wrong) end state is always that only one of the four > pgbench tables, pgbench_history, is replicated (always correctly). > > Below is the output from the test (I've edited the lines for email) > (below, a,b,t,h stand for: pgbench_accounts, pgbench_branches, > pgbench_tellers, pgbench_history) > (master on port 6972, replica on port 6973.) > > port > 6972 a,b,t,h: 100000 1 10 347 > 6973 a,b,t,h: 0 0 0 347 > > a,b,t,h: a68efc81a 2c27f7ba5 128590a57 1e4070879 master > a,b,t,h: d41d8cd98 d41d8cd98 d41d8cd98 1e4070879 replica NOK > > The md5-initstrings are from a md5 of the whole content of each table > (an ordered select *) > > I repeated this a few times: of course, the number of rows in > pgbench_history varies a bit but otherwise it is always the same: 3 > empty replica tables, pgbench_history replicated correctly. > That's actually correct behaviour without the initial copy patch, it replicates changes, but since the 3 tables only get updates there is nothing to replicate as there is no data downstream. However inserts will of course work fine even without data downstream. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 22/02/17 03:05, Petr Jelinek wrote: > On 13/12/16 00:38, Petr Jelinek wrote: >> On 12/12/16 23:33, Andres Freund wrote: >>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote: >>>> On 12/12/16 22:42, Andres Freund wrote: >>>>> Hi, >>>>> >>>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: >>>>>> Hi, >>>>>> First one is outright bug, which has to do with how we track running >>>>>> transactions. What snapbuild basically does while doing initial snapshot >>>>>> is read the xl_running_xacts record, store the list of running txes and >>>>>> then wait until they all finish. The problem with this is that >>>>>> xl_running_xacts does not ensure that it only logs transactions that are >>>>>> actually still running (to avoid locking PGPROC) so there might be xids >>>>>> in xl_running_xacts that already committed before it was logged. >>>>> >>>>> I don't think that's actually true? Notice how LogStandbySnapshot() >>>>> only releases the lock *after* the LogCurrentRunningXacts() iff >>>>> wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you >>>>> observed must actually be a bit more complex :( >>>>> >>>> >>>> Hmm, interesting, I did see the transaction commit in the WAL before the >>>> xl_running_xacts that contained the xid as running. I only seen it on >>>> production system though, didn't really manage to easily reproduce it >>>> locally. >>> >>> I suspect the reason for that is that RecordTransactionCommit() doesn't >>> conflict with ProcArrayLock in the first place - only >>> ProcArrayEndTransaction() does. So they're still running in the PGPROC >>> sense, just not the crash-recovery sense... >>> >> >> That looks like reasonable explanation. BTW I realized my patch needs >> bit more work, currently it will break the actual snapshot as it behaves >> same as if the xl_running_xacts was empty which is not correct AFAICS. >> > > Hi, > > I got to work on this again. Unfortunately I haven't found solution that > I would be very happy with. What I did is in case we read > xl_running_xacts which has all transactions we track finished, we start > tracking from that new xl_running_xacts again with the difference that > we clean up the running transactions based on previously seen committed > ones. That means that on busy server we may wait for multiple > xl_running_xacts rather than just one, but at least we have chance to > finish unlike with current coding which basically waits for empty > xl_running_xacts. I also removed the additional locking for logical > wal_level in LogStandbySnapshot() since it does not work. Not hearing any opposition to this idea so I decided to polish this and also optimize it a bit. That being said, thanks to testing from Erik Rijkers I've identified one more bug in how we do the initial snapshot. Apparently we don't reserve the global xmin when we start building the initial exported snapshot for a slot (we only reserver catalog_xmin which is fine for logical decoding but not for the exported snapshot) so the VACUUM and heap pruning will happily delete old versions of rows that are still needed by anybody trying to use that exported snapshot. Attached are updated versions of patches: 0001 - Fixes the above mentioned global xmin tracking issues. Needs to be backported all the way to 9.4 0002 - Removes use of the logical decoding saved snapshots for initial exported snapshot since those snapshots only work for catalogs and not user data. Also needs to be backported all the way to 9.4. 
0003 - Changes the handling of xl_running_xacts in the initial snapshot build to what I wrote above and removes the extra locking from LogStandbySnapshot() introduced by logical decoding.

0004 - Improves the performance of initial snapshot building by skipping the catalog snapshot build for transactions that don't do catalog changes.

0001 and 0002 are bug fixes, because without them the exported snapshots are basically corrupted. 0003 and 0004 are performance improvements, but on busy servers the snapshot export might never complete, so they address rather serious performance issues.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
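For reference, the heart of the 0001 fix mentioned above is tiny; trimmed from the patch as it is quoted in the review further down the thread, CreateInitDecodingContext() now reserves a data xmin for the slot as well, while ProcArrayLock is held:

    LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

    slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
    slot->data.catalog_xmin = slot->effective_catalog_xmin;
    /* also pin the data horizon so the exported snapshot stays usable */
    slot->effective_xmin = slot->effective_catalog_xmin;

Per the patch's own comments, this effective_xmin only lives in the in-memory slot state and is reset once the exported snapshot is no longer needed, so nothing changes for crash recovery.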
Attachment
On 24/02/17 22:56, Petr Jelinek wrote: > On 22/02/17 03:05, Petr Jelinek wrote: >> On 13/12/16 00:38, Petr Jelinek wrote: >>> On 12/12/16 23:33, Andres Freund wrote: >>>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote: >>>>> On 12/12/16 22:42, Andres Freund wrote: >>>>>> Hi, >>>>>> >>>>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: >>>>>>> Hi, >>>>>>> First one is outright bug, which has to do with how we track running >>>>>>> transactions. What snapbuild basically does while doing initial snapshot >>>>>>> is read the xl_running_xacts record, store the list of running txes and >>>>>>> then wait until they all finish. The problem with this is that >>>>>>> xl_running_xacts does not ensure that it only logs transactions that are >>>>>>> actually still running (to avoid locking PGPROC) so there might be xids >>>>>>> in xl_running_xacts that already committed before it was logged. >>>>>> >>>>>> I don't think that's actually true? Notice how LogStandbySnapshot() >>>>>> only releases the lock *after* the LogCurrentRunningXacts() iff >>>>>> wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you >>>>>> observed must actually be a bit more complex :( >>>>>> >>>>> >>>>> Hmm, interesting, I did see the transaction commit in the WAL before the >>>>> xl_running_xacts that contained the xid as running. I only seen it on >>>>> production system though, didn't really manage to easily reproduce it >>>>> locally. >>>> >>>> I suspect the reason for that is that RecordTransactionCommit() doesn't >>>> conflict with ProcArrayLock in the first place - only >>>> ProcArrayEndTransaction() does. So they're still running in the PGPROC >>>> sense, just not the crash-recovery sense... >>>> >>> >>> That looks like reasonable explanation. BTW I realized my patch needs >>> bit more work, currently it will break the actual snapshot as it behaves >>> same as if the xl_running_xacts was empty which is not correct AFAICS. >>> >> >> Hi, >> >> I got to work on this again. Unfortunately I haven't found solution that >> I would be very happy with. What I did is in case we read >> xl_running_xacts which has all transactions we track finished, we start >> tracking from that new xl_running_xacts again with the difference that >> we clean up the running transactions based on previously seen committed >> ones. That means that on busy server we may wait for multiple >> xl_running_xacts rather than just one, but at least we have chance to >> finish unlike with current coding which basically waits for empty >> xl_running_xacts. I also removed the additional locking for logical >> wal_level in LogStandbySnapshot() since it does not work. > > Not hearing any opposition to this idea so I decided to polish this and > also optimize it a bit. > > That being said, thanks to testing from Erik Rijkers I've identified one > more bug in how we do the initial snapshot. Apparently we don't reserve > the global xmin when we start building the initial exported snapshot for > a slot (we only reserver catalog_xmin which is fine for logical decoding > but not for the exported snapshot) so the VACUUM and heap pruning will > happily delete old versions of rows that are still needed by anybody > trying to use that exported snapshot. > Aaand I found one more bug in snapbuild. Apparently we don't protect the snapshot builder xmin from going backwards which can yet again result in corrupted exported snapshot. Summary of attached patches: 0001 - Fixes the above mentioned global xmin tracking issues. 
Needs to be backported all the way to 9.4.

0002 - Removes the use of the logical decoding saved snapshots for the initial exported snapshot, since those snapshots only work for catalogs and not user data. Also needs to be backported all the way to 9.4.

0003 - Makes sure the snapshot builder xmin is not moved backwards by xl_running_xacts (which can otherwise happen during initial snapshot building). Should also be backported to 9.4.

0004 - Changes the handling of xl_running_xacts in the initial snapshot build to what I wrote above and removes the extra locking from LogStandbySnapshot() introduced by logical decoding.

0005 - Improves the performance of initial snapshot building by skipping the catalog snapshot build for transactions that don't do catalog changes.

I made some improvements to the other patches as well, so they are not the same as in the previous post, hence the version bump.

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
- snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
- snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
- snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
- snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch
On 26/02/17 01:43, Petr Jelinek wrote: > On 24/02/17 22:56, Petr Jelinek wrote: >> On 22/02/17 03:05, Petr Jelinek wrote: >>> On 13/12/16 00:38, Petr Jelinek wrote: >>>> On 12/12/16 23:33, Andres Freund wrote: >>>>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote: >>>>>> On 12/12/16 22:42, Andres Freund wrote: >>>>>>> Hi, >>>>>>> >>>>>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote: >>>>>>>> Hi, >>>>>>>> First one is outright bug, which has to do with how we track running >>>>>>>> transactions. What snapbuild basically does while doing initial snapshot >>>>>>>> is read the xl_running_xacts record, store the list of running txes and >>>>>>>> then wait until they all finish. The problem with this is that >>>>>>>> xl_running_xacts does not ensure that it only logs transactions that are >>>>>>>> actually still running (to avoid locking PGPROC) so there might be xids >>>>>>>> in xl_running_xacts that already committed before it was logged. >>>>>>> >>>>>>> I don't think that's actually true? Notice how LogStandbySnapshot() >>>>>>> only releases the lock *after* the LogCurrentRunningXacts() iff >>>>>>> wal_level >= WAL_LEVEL_LOGICAL. So the explanation for the problem you >>>>>>> observed must actually be a bit more complex :( >>>>>>> >>>>>> >>>>>> Hmm, interesting, I did see the transaction commit in the WAL before the >>>>>> xl_running_xacts that contained the xid as running. I only seen it on >>>>>> production system though, didn't really manage to easily reproduce it >>>>>> locally. >>>>> >>>>> I suspect the reason for that is that RecordTransactionCommit() doesn't >>>>> conflict with ProcArrayLock in the first place - only >>>>> ProcArrayEndTransaction() does. So they're still running in the PGPROC >>>>> sense, just not the crash-recovery sense... >>>>> >>>> >>>> That looks like reasonable explanation. BTW I realized my patch needs >>>> bit more work, currently it will break the actual snapshot as it behaves >>>> same as if the xl_running_xacts was empty which is not correct AFAICS. >>>> >>> >>> Hi, >>> >>> I got to work on this again. Unfortunately I haven't found solution that >>> I would be very happy with. What I did is in case we read >>> xl_running_xacts which has all transactions we track finished, we start >>> tracking from that new xl_running_xacts again with the difference that >>> we clean up the running transactions based on previously seen committed >>> ones. That means that on busy server we may wait for multiple >>> xl_running_xacts rather than just one, but at least we have chance to >>> finish unlike with current coding which basically waits for empty >>> xl_running_xacts. I also removed the additional locking for logical >>> wal_level in LogStandbySnapshot() since it does not work. >> >> Not hearing any opposition to this idea so I decided to polish this and >> also optimize it a bit. >> >> That being said, thanks to testing from Erik Rijkers I've identified one >> more bug in how we do the initial snapshot. Apparently we don't reserve >> the global xmin when we start building the initial exported snapshot for >> a slot (we only reserver catalog_xmin which is fine for logical decoding >> but not for the exported snapshot) so the VACUUM and heap pruning will >> happily delete old versions of rows that are still needed by anybody >> trying to use that exported snapshot. >> > > Aaand I found one more bug in snapbuild. Apparently we don't protect the > snapshot builder xmin from going backwards which can yet again result in > corrupted exported snapshot. 
>
> Summary of attached patches:
> 0001 - Fixes the above mentioned global xmin tracking issues. Needs to
> be backported all the way to 9.4
>
> 0002 - Removes use of the logical decoding saved snapshots for initial
> exported snapshot since those snapshots only work for catalogs and not
> user data. Also needs to be backported all the way to 9.4.

I've been a bit overzealous about this one (I removed the use of the saved snapshots completely). So here is a fix for that (and a rebase on top of current HEAD).

> 0003 - Makes sure snapshot builder xmin is not moved backwards by
> xl_running_xacts (which can otherwise happen during initial snapshot
> building). Also should be backported to 9.4.
>
> 0004 - Changes handling of the xl_running_xacts in initial snapshot
> build to what I wrote above and removes the extra locking from
> LogStandbySnapshot introduced by logical decoding.
>
> 0005 - Improves performance of initial snapshot building by skipping
> catalog snapshot build for transactions that don't do catalog changes.
>

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
- snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
- snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
- snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
- snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
- snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch
On 2017-03-03 01:30, Petr Jelinek wrote: With these patches: 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch 0002-Fix-after-trigger-execution-in-logical-replication.patch 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch 0001-Logical-replication-support-for-initial-data-copy-v6.patch I get: subscriptioncmds.c:47:12: error: static declaration of ‘oid_cmp’ follows non-static declaration static int oid_cmp(const void *p1, const void *p2); ^~~~~~~ In file included from subscriptioncmds.c:42:0: ../../../src/include/utils/builtins.h:70:12: note: previous declaration of ‘oid_cmp’ was here extern int oid_cmp(const void *p1, const void *p2); ^~~~~~~ make[3]: *** [subscriptioncmds.o] Error 1 make[3]: *** Waiting for unfinished jobs.... make[2]: *** [commands-recursive] Error 2 make[2]: *** Waiting for unfinished jobs.... make[1]: *** [all-backend-recurse] Error 2 make: *** [all-src-recurse] Error 2
On 03/03/17 01:53, Erik Rijkers wrote: > On 2017-03-03 01:30, Petr Jelinek wrote: > > With these patches: > > 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch > 0002-Fix-after-trigger-execution-in-logical-replication.patch > 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch > snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch > snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch > > snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch > snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch > snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch > 0001-Logical-replication-support-for-initial-data-copy-v6.patch > > I get: > > subscriptioncmds.c:47:12: error: static declaration of ‘oid_cmp’ follows > non-static declaration > static int oid_cmp(const void *p1, const void *p2); > ^~~~~~~ > In file included from subscriptioncmds.c:42:0: > ../../../src/include/utils/builtins.h:70:12: note: previous declaration > of ‘oid_cmp’ was here > extern int oid_cmp(const void *p1, const void *p2); > ^~~~~~~ > make[3]: *** [subscriptioncmds.o] Error 1 > make[3]: *** Waiting for unfinished jobs.... > make[2]: *** [commands-recursive] Error 2 > make[2]: *** Waiting for unfinished jobs.... > make[1]: *** [all-backend-recurse] Error 2 > make: *** [all-src-recurse] Error 2 > Yes the copy patch needs rebase as well. But these ones are fine. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
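For anyone hitting the same build failure, the rebase of the copy patch essentially amounts to dropping its now-duplicate helper and relying on the declaration that has meanwhile moved into utils/builtins.h (which, per the compiler output above, subscriptioncmds.c already includes). A sketch, assuming the patch also carries a matching static definition of the function:

    /* In the copy patch's subscriptioncmds.c, remove the local declaration */
    /* (and its static definition, if the patch has one):                   */
    static int oid_cmp(const void *p1, const void *p2);    /* <- delete     */

    /* Existing callers such as qsort(..., sizeof(Oid), oid_cmp) then bind  */
    /* to the backend-wide oid_cmp() declared in utils/builtins.h.          */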
On 3/2/17 7:54 PM, Petr Jelinek wrote: > > Yes the copy patch needs rebase as well. But these ones are fine. This bug has been moved to CF 2017-07. -- -David david@pgmasters.net
On 2017-04-08 09:51:39 -0400, David Steele wrote: > On 3/2/17 7:54 PM, Petr Jelinek wrote: > > > > Yes the copy patch needs rebase as well. But these ones are fine. > > This bug has been moved to CF 2017-07. FWIW, as these are bug-fixes that need to be backpatched, I do plan to work on them soon. - Andres
On 2017-04-08 15:56, Andres Freund wrote: > On 2017-04-08 09:51:39 -0400, David Steele wrote: >> On 3/2/17 7:54 PM, Petr Jelinek wrote: >> > >> > Yes the copy patch needs rebase as well. But these ones are fine. >> >> This bug has been moved to CF 2017-07. > > FWIW, as these are bug-fixes that need to be backpatched, I do plan to > work on them soon. > CF 2017-07 pertains to postgres 11, is that right? But I hope you mean to commit these snapbuild patches before the postgres 10 release? As far as I know, logical replication is still very broken without them (or at least some of that set of 5 patches - I don't know which ones are essential and which may not be). If it's at all useful I can repeat tests to show how often current master still fails (easily 50% or so failure-rate). This would be the pgbench-over-logical-replication test that I did so often earlier on. thanks, Erik Rijkers
On 2017-04-08 16:29:10 +0200, Erik Rijkers wrote: > On 2017-04-08 15:56, Andres Freund wrote: > > On 2017-04-08 09:51:39 -0400, David Steele wrote: > > > On 3/2/17 7:54 PM, Petr Jelinek wrote: > > > > > > > > Yes the copy patch needs rebase as well. But these ones are fine. > > > > > > This bug has been moved to CF 2017-07. > > > > FWIW, as these are bug-fixes that need to be backpatched, I do plan to > > work on them soon. > > > > CF 2017-07 pertains to postgres 11, is that right? > > But I hope you mean to commit these snapbuild patches before the postgres 10 > release? As far as I know, logical replication is still very broken without > them (or at least some of that set of 5 patches - I don't know which ones > are essential and which may not be). Yes, these should go into 10 *and* earlier releases, and I don't plan to wait for 2017-07. - Andres
On 4/8/17 10:29 AM, Erik Rijkers wrote: > On 2017-04-08 15:56, Andres Freund wrote: >> On 2017-04-08 09:51:39 -0400, David Steele wrote: >>> On 3/2/17 7:54 PM, Petr Jelinek wrote: >>> > >>> > Yes the copy patch needs rebase as well. But these ones are fine. >>> >>> This bug has been moved to CF 2017-07. >> >> FWIW, as these are bug-fixes that need to be backpatched, I do plan to >> work on them soon. >> > > CF 2017-07 pertains to postgres 11, is that right? In general, yes, but bugs will always be fixed as needed. It doesn't matter what CF they are in. -- -David david@pgmasters.net
On Sat, Apr 08, 2017 at 07:30:59AM -0700, Andres Freund wrote: > On 2017-04-08 16:29:10 +0200, Erik Rijkers wrote: > > On 2017-04-08 15:56, Andres Freund wrote: > > > On 2017-04-08 09:51:39 -0400, David Steele wrote: > > > > On 3/2/17 7:54 PM, Petr Jelinek wrote: > > > > > > > > > > Yes the copy patch needs rebase as well. But these ones are fine. > > > > > > > > This bug has been moved to CF 2017-07. > > > > > > FWIW, as these are bug-fixes that need to be backpatched, I do plan to > > > work on them soon. > > > > > > > CF 2017-07 pertains to postgres 11, is that right? > > > > But I hope you mean to commit these snapbuild patches before the postgres 10 > > release? As far as I know, logical replication is still very broken without > > them (or at least some of that set of 5 patches - I don't know which ones > > are essential and which may not be). > > Yes, these should go into 10 *and* earlier releases, and I don't plan to > wait for 2017-07. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Peter, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On 4/12/17 02:31, Noah Misch wrote: >>> But I hope you mean to commit these snapbuild patches before the postgres 10 >>> release? As far as I know, logical replication is still very broken without >>> them (or at least some of that set of 5 patches - I don't know which ones >>> are essential and which may not be). >> >> Yes, these should go into 10 *and* earlier releases, and I don't plan to >> wait for 2017-07. > > [Action required within three days. This is a generic notification.] I'm hoping for a word from Andres on this. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3 March 2017 at 00:30, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>> 0004 - Changes handling of the xl_running_xacts in initial snapshot
>> build to what I wrote above and removes the extra locking from
>> LogStandbySnapshot introduced by logical decoding.

This seems OK and unlikely to have wider impact. The "race condition" we're speaking about is by design, not a bug.

I think the initial comment could be slightly better worded; if I didn't already know what is being discussed, I wouldn't be much further forward after reading those comments.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote: > On 4/12/17 02:31, Noah Misch wrote: > >>> But I hope you mean to commit these snapbuild patches before the postgres 10 > >>> release? As far as I know, logical replication is still very broken without > >>> them (or at least some of that set of 5 patches - I don't know which ones > >>> are essential and which may not be). > >> > >> Yes, these should go into 10 *and* earlier releases, and I don't plan to > >> wait for 2017-07. > > > > [Action required within three days. This is a generic notification.] > > I'm hoping for a word from Andres on this. Feel free to reassign to me.
Hi, On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote: > From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001 > From: Petr Jelinek <pjmodos@pjmodos.net> > Date: Fri, 24 Feb 2017 21:39:03 +0100 > Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export > > Otherwise the VACUUM or pruning might remove tuples still needed by the > exported snapshot. > --- > src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++- > 1 file changed, 20 insertions(+), 1 deletion(-) > > diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c > index 5529ac8..57c392c 100644 > --- a/src/backend/replication/logical/logical.c > +++ b/src/backend/replication/logical/logical.c > @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin, > * the slot machinery about the new limit. Once that's done the > * ProcArrayLock can be released as the slot machinery now is > * protecting against vacuum. > + * > + * Note that we only store the global xmin temporarily in the in-memory > + * state so that the initial snapshot can be exported. After initial > + * snapshot is done global xmin should be reset and not tracked anymore > + * so we are fine with losing the global xmin after crash. > * ---- > */ > LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); > > slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId(); > slot->data.catalog_xmin = slot->effective_catalog_xmin; > + slot->effective_xmin = slot->effective_catalog_xmin; > void > FreeDecodingContext(LogicalDecodingContext *ctx) > { > + ReplicationSlot *slot = MyReplicationSlot; > + > if (ctx->callbacks.shutdown_cb != NULL) > shutdown_cb_wrapper(ctx); > > + /* > + * Cleanup global xmin for the slot that we may have set in > + * CreateInitDecodingContext(). Hm. Is that actually a meaningful point to do so? For one, it gets called by pg_logical_slot_get_changes_guts(), but more importantly, the snapshot is exported till SnapBuildClearExportedSnapshot(), which is the next command? If we rely on the snapshot magic done by ExportSnapshot() it'd be worthwhile to mention that... > We do not take ProcArrayLock or similar > + * since we only reset xmin here and there's not much harm done by a > + * concurrent computation missing that. > + */ Hum. I was prepared to complain about this, but ISTM, that there's absolutely no risk because the following ReplicationSlotsComputeRequiredXmin(false); actually does all the necessary locking? But still, I don't see much point in the optimization. > This patch changes the code so that stored snapshots are only used for > logical decoding restart but not for initial slot snapshot. Yea, that's a very good point... > @@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn > > return false; > } > - /* c) valid on disk state */ > - else if (SnapBuildRestore(builder, lsn)) > + /* c) valid on disk state and not exported snapshot */ > + else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) && > + SnapBuildRestore(builder, lsn)) Hm. Is this a good signaling mechanism? It'll also trigger for the SQL interface, where it'd strictly speaking not be required, right? > From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001 > From: Petr Jelinek <pjmodos@pjmodos.net> > Date: Sun, 26 Feb 2017 01:07:33 +0100 > Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards A bit more commentary would be good. What does that protect us against? 
> From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001 > From: Petr Jelinek <pjmodos@pjmodos.net> > Date: Wed, 22 Feb 2017 00:57:33 +0100 > Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder > > Due to race condition, the xl_running_xacts might contain no longer > running transactions. Previous coding tried to get around this by > additional locking but that did not work correctly for committs. Instead > try combining decoded commits and multiple xl_running_xacts to get the > consistent snapshot. Needs more explanation about approach. > @@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn > * simply track the number of in-progress toplevel transactions and > * lower it whenever one commits or aborts. When that number > * (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT > - * to CONSISTENT. > + * to CONSISTENT. Sometimes we might get xl_running_xacts which has > + * all tracked transactions as finished. We'll need to restart tracking > + * in that case and use previously collected committed transactions to > + * purge transactions mistakenly marked as running in the > + * xl_running_xacts which exist as a result of race condition in > + * LogStandbySnapshot(). I'm not following this yet. > @@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn > * b) first encounter of a useable xl_running_xacts record. If we had > * found one earlier we would either track running transactions (i.e. > * builder->running.xcnt != 0) or be consistent (this function wouldn't > - * get called). > + * get called). However it's possible that we could not see all > + * transactions that were marked as running in xl_running_xacts, so if > + * we get new one that says all were closed but we are not consistent > + * yet, we need to restart the tracking while taking previously seen > + * transactions into account. This needs to revise the preceding comment more heavily. "This is the first!!! Or maybe not!" isn't easy to understand. > */ > - else if (!builder->running.xcnt) > + else if (!builder->running.xcnt || > + running->oldestRunningXid > builder->running.xmax) Isn't that wrong under wraparound? > { > int off; > + bool first = builder->running.xcnt == 0; > > /* > * We only care about toplevel xids as those are the ones we > @@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn > builder->running.xmin = builder->running.xip[0]; > builder->running.xmax = builder->running.xip[running->xcnt - 1]; > > + > /* makes comparisons cheaper later */ > TransactionIdRetreat(builder->running.xmin); > TransactionIdAdvance(builder->running.xmax); > > builder->state = SNAPBUILD_FULL_SNAPSHOT; > > - ereport(LOG, > - (errmsg("logical decoding found initial starting point at %X/%X", > - (uint32) (lsn >> 32), (uint32) lsn), > - errdetail_plural("%u transaction needs to finish.", > - "%u transactions need to finish.", > - builder->running.xcnt, > - (uint32) builder->running.xcnt))); > - > + /* > + * If this is the first time we've seen xl_running_xacts, we are done. 
> + */ > + if (first) > + { > + ereport(LOG, > + (errmsg("logical decoding found initial starting point at %X/%X", > + (uint32) (lsn >> 32), (uint32) lsn), > + errdetail_plural("%u transaction needs to finish.", > + "%u transactions need to finish.", > + builder->running.xcnt, > + (uint32) builder->running.xcnt))); > + } > + else > + { > + /* > + * Because of the race condition in LogStandbySnapshot() the > + * transactions recorded in xl_running_xacts as running might have > + * already committed by the time the xl_running_xacts was written > + * to WAL. Use the information about decoded transactions that we > + * gathered so far to update our idea about what's still running. > + * > + * We can use SnapBuildEndTxn directly as it only does the > + * transaction running check and handling without any additional > + * side effects. > + */ > + for (off = 0; off < builder->committed.xcnt; off++) > + SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]); > + if (builder->state == SNAPBUILD_CONSISTENT) > + return false; > + > + ereport(LOG, > + (errmsg("logical decoding moved initial starting point to %X/%X", > + (uint32) (lsn >> 32), (uint32) lsn), > + errdetail_plural("%u transaction needs to finish.", > + "%u transactions need to finish.", > + builder->running.xcnt, > + (uint32) builder->running.xcnt))); > + } Hm, this is not pretty. > From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001 > From: Petr Jelinek <pjmodos@pjmodos.net> > Date: Tue, 21 Feb 2017 19:58:18 +0100 > Subject: [PATCH 5/5] Skip unnecessary snapshot builds > > When doing initial snapshot build during logical decoding > initialization, don't build snapshots for transactions where we know the > transaction didn't do any catalog changes. Otherwise we might end up > with thousands of useless snapshots on busy server which can be quite > expensive. > --- > src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++---------- > 1 file changed, 53 insertions(+), 29 deletions(-) > > diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c > index 1a1c9ba..c800aa5 100644 > --- a/src/backend/replication/logical/snapbuild.c > +++ b/src/backend/replication/logical/snapbuild.c > @@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, > bool forced_timetravel = false; > bool sub_needs_timetravel = false; > bool top_needs_timetravel = false; > + bool skip_forced_snapshot = false; > > TransactionId xmax = xid; > > @@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, > /* > * We could avoid treating !SnapBuildTxnIsRunning transactions as > * timetravel ones, but we want to be able to export a snapshot when > - * we reached consistency. > + * we reached consistency so we need to keep track of them. > */ > forced_timetravel = true; > elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid); > + > + /* > + * It is however desirable to skip building new snapshot for > + * !SnapBuildTxnIsRunning transactions as otherwise we might end up > + * building thousands of unused snapshots on busy servers which can > + * be very expensive. > + */ > + if (!SnapBuildTxnIsRunning(builder, xid)) > + skip_forced_snapshot = true; > } That's pretty crudely bolted on the existing logic, isn't there a simpler way? Greetings, Andres Freund
On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote: > On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote: > > On 4/12/17 02:31, Noah Misch wrote: > > >>> But I hope you mean to commit these snapbuild patches before the postgres 10 > > >>> release? As far as I know, logical replication is still very broken without > > >>> them (or at least some of that set of 5 patches - I don't know which ones > > >>> are essential and which may not be). > > >> > > >> Yes, these should go into 10 *and* earlier releases, and I don't plan to > > >> wait for 2017-07. > > > > > > [Action required within three days. This is a generic notification.] > > > > I'm hoping for a word from Andres on this. > > Feel free to reassign to me. Thanks for volunteering; I'll do that shortly. Please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On April 12, 2017 9:58:12 PM PDT, Noah Misch <noah@leadboat.com> wrote: >On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote: >> On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote: >> > On 4/12/17 02:31, Noah Misch wrote: >> > >>> But I hope you mean to commit these snapbuild patches before >the postgres 10 >> > >>> release? As far as I know, logical replication is still very >broken without >> > >>> them (or at least some of that set of 5 patches - I don't know >which ones >> > >>> are essential and which may not be). >> > >> >> > >> Yes, these should go into 10 *and* earlier releases, and I don't >plan to >> > >> wait for 2017-07. >> > > >> > > [Action required within three days. This is a generic >notification.] >> > >> > I'm hoping for a word from Andres on this. >> >> Feel free to reassign to me. > >Thanks for volunteering; I'll do that shortly. Please observe the >policy on >open item ownership[1] and send a status update within three calendar >days of >this message. Include a date for your subsequent status update. > >[1] >https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

Well, volunteering might be the wrong word. These are all my fault, although none look v10 specific.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Thanks for looking at this! On 13/04/17 02:29, Andres Freund wrote: > Hi, > On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote: > >> From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001 >> From: Petr Jelinek <pjmodos@pjmodos.net> >> Date: Fri, 24 Feb 2017 21:39:03 +0100 >> Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export >> >> Otherwise the VACUUM or pruning might remove tuples still needed by the >> exported snapshot. >> --- >> src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++- >> 1 file changed, 20 insertions(+), 1 deletion(-) >> >> diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c >> index 5529ac8..57c392c 100644 >> --- a/src/backend/replication/logical/logical.c >> +++ b/src/backend/replication/logical/logical.c >> @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin, >> * the slot machinery about the new limit. Once that's done the >> * ProcArrayLock can be released as the slot machinery now is >> * protecting against vacuum. >> + * >> + * Note that we only store the global xmin temporarily in the in-memory >> + * state so that the initial snapshot can be exported. After initial >> + * snapshot is done global xmin should be reset and not tracked anymore >> + * so we are fine with losing the global xmin after crash. >> * ---- >> */ >> LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); >> >> slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId(); >> slot->data.catalog_xmin = slot->effective_catalog_xmin; >> + slot->effective_xmin = slot->effective_catalog_xmin; > > >> void >> FreeDecodingContext(LogicalDecodingContext *ctx) >> { >> + ReplicationSlot *slot = MyReplicationSlot; >> + >> if (ctx->callbacks.shutdown_cb != NULL) >> shutdown_cb_wrapper(ctx); >> >> + /* >> + * Cleanup global xmin for the slot that we may have set in >> + * CreateInitDecodingContext(). > > Hm. Is that actually a meaningful point to do so? For one, it gets > called by pg_logical_slot_get_changes_guts(), but more importantly, the > snapshot is exported till SnapBuildClearExportedSnapshot(), which is the > next command? If we rely on the snapshot magic done by ExportSnapshot() > it'd be worthwhile to mention that... > (Didn't see the patch for couple of months so don't remember all the detailed decisions anymore) Yes we rely on backend's xmin to be set for exported snapshot. We only care about global xmin for exported snapshots really, I assumed that's clear enough from "so that the initial snapshot can be exported" but I guess there should be more clear comment about this where we actually clean this up. > >> We do not take ProcArrayLock or similar >> + * since we only reset xmin here and there's not much harm done by a >> + * concurrent computation missing that. >> + */ > > Hum. I was prepared to complain about this, but ISTM, that there's > absolutely no risk because the following > ReplicationSlotsComputeRequiredXmin(false); actually does all the > necessary locking? But still, I don't see much point in the > optimization. > Well, if we don't need it in LogicalConfirmReceivedLocation, I don't see why we need it here. Please enlighten me. > > >> This patch changes the code so that stored snapshots are only used for >> logical decoding restart but not for initial slot snapshot. > > Yea, that's a very good point... 
> >> @@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn >> >> return false; >> } >> - /* c) valid on disk state */ >> - else if (SnapBuildRestore(builder, lsn)) >> + /* c) valid on disk state and not exported snapshot */ >> + else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) && >> + SnapBuildRestore(builder, lsn)) > > Hm. Is this a good signaling mechanism? It'll also trigger for the SQL > interface, where it'd strictly speaking not be required, right? > Good point. Maybe we should really tell snapshot builder if the snapshot is going to be exported or not explicitly (see the rant all the way down). > >> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001 >> From: Petr Jelinek <pjmodos@pjmodos.net> >> Date: Sun, 26 Feb 2017 01:07:33 +0100 >> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards > > A bit more commentary would be good. What does that protect us against? > I think I explained that in the email. We might export snapshot with xmin smaller than global xmin otherwise. > > >> From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001 >> From: Petr Jelinek <pjmodos@pjmodos.net> >> Date: Wed, 22 Feb 2017 00:57:33 +0100 >> Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder >> >> Due to race condition, the xl_running_xacts might contain no longer >> running transactions. Previous coding tried to get around this by >> additional locking but that did not work correctly for committs. Instead >> try combining decoded commits and multiple xl_running_xacts to get the >> consistent snapshot. > > Needs more explanation about approach. > > >> @@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn >> * simply track the number of in-progress toplevel transactions and >> * lower it whenever one commits or aborts. When that number >> * (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT >> - * to CONSISTENT. >> + * to CONSISTENT. Sometimes we might get xl_running_xacts which has >> + * all tracked transactions as finished. We'll need to restart tracking >> + * in that case and use previously collected committed transactions to >> + * purge transactions mistakenly marked as running in the >> + * xl_running_xacts which exist as a result of race condition in >> + * LogStandbySnapshot(). > > I'm not following this yet. > Let me try to explain: We get xl_running_xacts with txes 1,3,4. But the 1 already committed before so the decoding will never see it and we never get snapshot. Now at some point we might get xl_running_xact with txes 6,7,8 so we know that all transactions from the initial xl_running_xacts must be closed. We restart the tracking here from beginning as if this was first xl_running_xacts we've seen, with the exception that we look into past if we seen the 6,7,8 transactions already as finished, then we'll mark them as finished immediately (hence avoiding the issue with transaction 6 being already committed before xl_running_xacts was written). > >> @@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn >> * b) first encounter of a useable xl_running_xacts record. If we had >> * found one earlier we would either track running transactions (i.e. >> * builder->running.xcnt != 0) or be consistent (this function wouldn't >> - * get called). >> + * get called). 
However it's possible that we could not see all >> + * transactions that were marked as running in xl_running_xacts, so if >> + * we get new one that says all were closed but we are not consistent >> + * yet, we need to restart the tracking while taking previously seen >> + * transactions into account. > > This needs to revise the preceding comment more heavily. "This is the > first!!! Or maybe not!" isn't easy to understand. > Yeah, I found it bit hard to make it sound correct and not confusing, even wondered if I should split this code to two because of that but it would lead into quite a bit of code duplication, dunno if that's better. Maybe we could move the "reset" code into separate function to avoid most of the duplication. > >> */ >> - else if (!builder->running.xcnt) >> + else if (!builder->running.xcnt || >> + running->oldestRunningXid > builder->running.xmax) > > Isn't that wrong under wraparound? > Right, should use TransactionIdFollows. > >> { >> int off; >> + bool first = builder->running.xcnt == 0; >> >> /* >> * We only care about toplevel xids as those are the ones we >> @@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn >> builder->running.xmin = builder->running.xip[0]; >> builder->running.xmax = builder->running.xip[running->xcnt - 1]; >> >> + >> /* makes comparisons cheaper later */ >> TransactionIdRetreat(builder->running.xmin); >> TransactionIdAdvance(builder->running.xmax); >> >> builder->state = SNAPBUILD_FULL_SNAPSHOT; >> >> - ereport(LOG, >> - (errmsg("logical decoding found initial starting point at %X/%X", >> - (uint32) (lsn >> 32), (uint32) lsn), >> - errdetail_plural("%u transaction needs to finish.", >> - "%u transactions need to finish.", >> - builder->running.xcnt, >> - (uint32) builder->running.xcnt))); >> - >> + /* >> + * If this is the first time we've seen xl_running_xacts, we are done. >> + */ >> + if (first) >> + { >> + ereport(LOG, >> + (errmsg("logical decoding found initial starting point at %X/%X", >> + (uint32) (lsn >> 32), (uint32) lsn), >> + errdetail_plural("%u transaction needs to finish.", >> + "%u transactions need to finish.", >> + builder->running.xcnt, >> + (uint32) builder->running.xcnt))); >> + } >> + else >> + { >> + /* >> + * Because of the race condition in LogStandbySnapshot() the >> + * transactions recorded in xl_running_xacts as running might have >> + * already committed by the time the xl_running_xacts was written >> + * to WAL. Use the information about decoded transactions that we >> + * gathered so far to update our idea about what's still running. >> + * >> + * We can use SnapBuildEndTxn directly as it only does the >> + * transaction running check and handling without any additional >> + * side effects. >> + */ >> + for (off = 0; off < builder->committed.xcnt; off++) >> + SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]); >> + if (builder->state == SNAPBUILD_CONSISTENT) >> + return false; >> + >> + ereport(LOG, >> + (errmsg("logical decoding moved initial starting point to %X/%X", >> + (uint32) (lsn >> 32), (uint32) lsn), >> + errdetail_plural("%u transaction needs to finish.", >> + "%u transactions need to finish.", >> + builder->running.xcnt, >> + (uint32) builder->running.xcnt))); >> + } > > Hm, this is not pretty. > Hmm? There are two possible scenarios that need to be handled differently. Possibly another reason to split it out completely as mentioned above. 
> >> From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001 >> From: Petr Jelinek <pjmodos@pjmodos.net> >> Date: Tue, 21 Feb 2017 19:58:18 +0100 >> Subject: [PATCH 5/5] Skip unnecessary snapshot builds >> >> When doing initial snapshot build during logical decoding >> initialization, don't build snapshots for transactions where we know the >> transaction didn't do any catalog changes. Otherwise we might end up >> with thousands of useless snapshots on busy server which can be quite >> expensive. >> --- >> src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++---------- >> 1 file changed, 53 insertions(+), 29 deletions(-) >> >> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c >> index 1a1c9ba..c800aa5 100644 >> --- a/src/backend/replication/logical/snapbuild.c >> +++ b/src/backend/replication/logical/snapbuild.c >> @@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, >> bool forced_timetravel = false; >> bool sub_needs_timetravel = false; >> bool top_needs_timetravel = false; >> + bool skip_forced_snapshot = false; >> >> TransactionId xmax = xid; >> >> @@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, >> /* >> * We could avoid treating !SnapBuildTxnIsRunning transactions as >> * timetravel ones, but we want to be able to export a snapshot when >> - * we reached consistency. >> + * we reached consistency so we need to keep track of them. >> */ >> forced_timetravel = true; >> elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid); >> + >> + /* >> + * It is however desirable to skip building new snapshot for >> + * !SnapBuildTxnIsRunning transactions as otherwise we might end up >> + * building thousands of unused snapshots on busy servers which can >> + * be very expensive. >> + */ >> + if (!SnapBuildTxnIsRunning(builder, xid)) >> + skip_forced_snapshot = true; >> } > > That's pretty crudely bolted on the existing logic, isn't there a > simpler way? > Agreed, however, every time I tried to make this prettier I ended up either producing subtle bugs (see the initial email in this thread for example of that), so I eventually gave up on pretty. As a side note, my opinion after all this is that it's probably mistake to try to use various situational conditions to make sure we are building exportable snapshot. ISTM we should tell the snapbuilder explicitly that the snapshot will be exported and it should behave accordingly based on that. Because for example, we also should track aborted transactions in the snapshot which is to be exported because otherwise enough of them happening during snapshot building will result in export failure due to too big snapshot. But this seems like too invasive change to be back-portable. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 13/04/17 07:02, Andres Freund wrote: > > On April 12, 2017 9:58:12 PM PDT, Noah Misch <noah@leadboat.com> wrote: >> On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote: >>> On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote: >>>> On 4/12/17 02:31, Noah Misch wrote: >>>>>>> But I hope you mean to commit these snapbuild patches before >> the postgres 10 >>>>>>> release? As far as I know, logical replication is still very >> broken without >>>>>>> them (or at least some of that set of 5 patches - I don't know >> which ones >>>>>>> are essential and which may not be). >>>>>> >>>>>> Yes, these should go into 10 *and* earlier releases, and I don't >> plan to >>>>>> wait for 2017-07. >>>>> >>>>> [Action required within three days. This is a generic >> notification.] >>>> >>>> I'm hoping for a word from Andres on this. >>> >>> Feel free to reassign to me. >> >> Thanks for volunteering; I'll do that shortly. Please observe the >> policy on >> open item ownership[1] and send a status update within three calendar >> days of >> this message. Include a date for your subsequent status update. >> >> [1] >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > Will, volunteering might be the wrong word. These ate all my fault, although none look v10 specific. > Yeah none of this is v10 specific, the importance to v10 is that it affects the logical replication in core, not just extensions like in 9.4-9.6. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, here is updated patch (details inline). On 13/04/17 20:00, Petr Jelinek wrote: > Thanks for looking at this! > > On 13/04/17 02:29, Andres Freund wrote: >> Hi, >> On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote: >> >>> From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001 >>> From: Petr Jelinek <pjmodos@pjmodos.net> >>> Date: Fri, 24 Feb 2017 21:39:03 +0100 >>> Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export >>> >>> Otherwise the VACUUM or pruning might remove tuples still needed by the >>> exported snapshot. >>> --- >>> src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++- >>> 1 file changed, 20 insertions(+), 1 deletion(-) >>> >>> diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c >>> index 5529ac8..57c392c 100644 >>> --- a/src/backend/replication/logical/logical.c >>> +++ b/src/backend/replication/logical/logical.c >>> @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin, >>> * the slot machinery about the new limit. Once that's done the >>> * ProcArrayLock can be released as the slot machinery now is >>> * protecting against vacuum. >>> + * >>> + * Note that we only store the global xmin temporarily in the in-memory >>> + * state so that the initial snapshot can be exported. After initial >>> + * snapshot is done global xmin should be reset and not tracked anymore >>> + * so we are fine with losing the global xmin after crash. >>> * ---- >>> */ >>> LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE); >>> >>> slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId(); >>> slot->data.catalog_xmin = slot->effective_catalog_xmin; >>> + slot->effective_xmin = slot->effective_catalog_xmin; >> >> >>> void >>> FreeDecodingContext(LogicalDecodingContext *ctx) >>> { >>> + ReplicationSlot *slot = MyReplicationSlot; >>> + >>> if (ctx->callbacks.shutdown_cb != NULL) >>> shutdown_cb_wrapper(ctx); >>> >>> + /* >>> + * Cleanup global xmin for the slot that we may have set in >>> + * CreateInitDecodingContext(). >> >> Hm. Is that actually a meaningful point to do so? For one, it gets >> called by pg_logical_slot_get_changes_guts(), but more importantly, the >> snapshot is exported till SnapBuildClearExportedSnapshot(), which is the >> next command? If we rely on the snapshot magic done by ExportSnapshot() >> it'd be worthwhile to mention that... >> > > (Didn't see the patch for couple of months so don't remember all the > detailed decisions anymore) > > Yes we rely on backend's xmin to be set for exported snapshot. We only > care about global xmin for exported snapshots really, I assumed that's > clear enough from "so that the initial snapshot can be exported" but I > guess there should be more clear comment about this where we actually > clean this up. > Okay, wrote new comment there, how is it now? >> >>> We do not take ProcArrayLock or similar >>> + * since we only reset xmin here and there's not much harm done by a >>> + * concurrent computation missing that. >>> + */ >> >> Hum. I was prepared to complain about this, but ISTM, that there's >> absolutely no risk because the following >> ReplicationSlotsComputeRequiredXmin(false); actually does all the >> necessary locking? But still, I don't see much point in the >> optimization. >> > > Well, if we don't need it in LogicalConfirmReceivedLocation, I don't see > why we need it here. Please enlighten me. 
> I kept this as it was, after rereading, the ReplicationSlotsComputeRequiredXmin(false) will do shared lock while if we wanted to avoid mutex and do the xmin update under lock we'd need to do exclusive lock so I think it's worth the optimization... >> >> >>> This patch changes the code so that stored snapshots are only used for >>> logical decoding restart but not for initial slot snapshot. >> >> Yea, that's a very good point... >> >>> @@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn >>> >>> return false; >>> } >>> - /* c) valid on disk state */ >>> - else if (SnapBuildRestore(builder, lsn)) >>> + /* c) valid on disk state and not exported snapshot */ >>> + else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) && >>> + SnapBuildRestore(builder, lsn)) >> >> Hm. Is this a good signaling mechanism? It'll also trigger for the SQL >> interface, where it'd strictly speaking not be required, right? >> > > Good point. Maybe we should really tell snapshot builder if the snapshot > is going to be exported or not explicitly (see the rant all the way down). > I added the new signaling mechanism (the new boolean option indicating if we are building full snapshot which is only set when the snapshot is exported or used by the transaction). >> >>> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001 >>> From: Petr Jelinek <pjmodos@pjmodos.net> >>> Date: Sun, 26 Feb 2017 01:07:33 +0100 >>> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards >> >> A bit more commentary would be good. What does that protect us against? >> > > I think I explained that in the email. We might export snapshot with > xmin smaller than global xmin otherwise. > Updated commit message with explanation as well. >> >> >>> From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001 >>> From: Petr Jelinek <pjmodos@pjmodos.net> >>> Date: Wed, 22 Feb 2017 00:57:33 +0100 >>> Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder >>> >>> Due to race condition, the xl_running_xacts might contain no longer >>> running transactions. Previous coding tried to get around this by >>> additional locking but that did not work correctly for committs. Instead >>> try combining decoded commits and multiple xl_running_xacts to get the >>> consistent snapshot. >> >> Needs more explanation about approach. >> >> >>> @@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn >>> * simply track the number of in-progress toplevel transactions and >>> * lower it whenever one commits or aborts. When that number >>> * (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT >>> - * to CONSISTENT. >>> + * to CONSISTENT. Sometimes we might get xl_running_xacts which has >>> + * all tracked transactions as finished. We'll need to restart tracking >>> + * in that case and use previously collected committed transactions to >>> + * purge transactions mistakenly marked as running in the >>> + * xl_running_xacts which exist as a result of race condition in >>> + * LogStandbySnapshot(). >> >> I'm not following this yet. >> > > Let me try to explain: > We get xl_running_xacts with txes 1,3,4. But the 1 already committed > before so the decoding will never see it and we never get snapshot. Now > at some point we might get xl_running_xact with txes 6,7,8 so we know > that all transactions from the initial xl_running_xacts must be closed. 
> We restart the tracking here from beginning as if this was first > xl_running_xacts we've seen, with the exception that we look into past > if we seen the 6,7,8 transactions already as finished, then we'll mark > them as finished immediately (hence avoiding the issue with transaction > 6 being already committed before xl_running_xacts was written). > >> >>> @@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn >>> * b) first encounter of a useable xl_running_xacts record. If we had >>> * found one earlier we would either track running transactions (i.e. >>> * builder->running.xcnt != 0) or be consistent (this function wouldn't >>> - * get called). >>> + * get called). However it's possible that we could not see all >>> + * transactions that were marked as running in xl_running_xacts, so if >>> + * we get new one that says all were closed but we are not consistent >>> + * yet, we need to restart the tracking while taking previously seen >>> + * transactions into account. >> >> This needs to revise the preceding comment more heavily. "This is the >> first!!! Or maybe not!" isn't easy to understand. >> > > Yeah, I found it bit hard to make it sound correct and not confusing, > even wondered if I should split this code to two because of that but it > would lead into quite a bit of code duplication, dunno if that's better. > Maybe we could move the "reset" code into separate function to avoid > most of the duplication. Rewrote and moved this comment to it's own thing. > >> >>> */ >>> - else if (!builder->running.xcnt) >>> + else if (!builder->running.xcnt || >>> + running->oldestRunningXid > builder->running.xmax) >> >> Isn't that wrong under wraparound? >> > > Right, should use TransactionIdFollows. > Fixed. >> >>> { >>> int off; >>> + bool first = builder->running.xcnt == 0; >>> >>> /* >>> * We only care about toplevel xids as those are the ones we >>> @@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn >>> builder->running.xmin = builder->running.xip[0]; >>> builder->running.xmax = builder->running.xip[running->xcnt - 1]; >>> >>> + >>> /* makes comparisons cheaper later */ >>> TransactionIdRetreat(builder->running.xmin); >>> TransactionIdAdvance(builder->running.xmax); >>> >>> builder->state = SNAPBUILD_FULL_SNAPSHOT; >>> >>> - ereport(LOG, >>> - (errmsg("logical decoding found initial starting point at %X/%X", >>> - (uint32) (lsn >> 32), (uint32) lsn), >>> - errdetail_plural("%u transaction needs to finish.", >>> - "%u transactions need to finish.", >>> - builder->running.xcnt, >>> - (uint32) builder->running.xcnt))); >>> - >>> + /* >>> + * If this is the first time we've seen xl_running_xacts, we are done. >>> + */ >>> + if (first) >>> + { >>> + ereport(LOG, >>> + (errmsg("logical decoding found initial starting point at %X/%X", >>> + (uint32) (lsn >> 32), (uint32) lsn), >>> + errdetail_plural("%u transaction needs to finish.", >>> + "%u transactions need to finish.", >>> + builder->running.xcnt, >>> + (uint32) builder->running.xcnt))); >>> + } >>> + else >>> + { >>> + /* >>> + * Because of the race condition in LogStandbySnapshot() the >>> + * transactions recorded in xl_running_xacts as running might have >>> + * already committed by the time the xl_running_xacts was written >>> + * to WAL. Use the information about decoded transactions that we >>> + * gathered so far to update our idea about what's still running. 
>>> + * >>> + * We can use SnapBuildEndTxn directly as it only does the >>> + * transaction running check and handling without any additional >>> + * side effects. >>> + */ >>> + for (off = 0; off < builder->committed.xcnt; off++) >>> + SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]); >>> + if (builder->state == SNAPBUILD_CONSISTENT) >>> + return false; >>> + >>> + ereport(LOG, >>> + (errmsg("logical decoding moved initial starting point to %X/%X", >>> + (uint32) (lsn >> 32), (uint32) lsn), >>> + errdetail_plural("%u transaction needs to finish.", >>> + "%u transactions need to finish.", >>> + builder->running.xcnt, >>> + (uint32) builder->running.xcnt))); >>> + } >> >> Hm, this is not pretty. >> > Changed this whole thing to be 2 different code paths with common function doing the common work. > >> >>> From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001 >>> From: Petr Jelinek <pjmodos@pjmodos.net> >>> Date: Tue, 21 Feb 2017 19:58:18 +0100 >>> Subject: [PATCH 5/5] Skip unnecessary snapshot builds >>> >>> When doing initial snapshot build during logical decoding >>> initialization, don't build snapshots for transactions where we know the >>> transaction didn't do any catalog changes. Otherwise we might end up >>> with thousands of useless snapshots on busy server which can be quite >>> expensive. >>> --- >>> src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++---------- >>> 1 file changed, 53 insertions(+), 29 deletions(-) >>> >>> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c >>> index 1a1c9ba..c800aa5 100644 >>> --- a/src/backend/replication/logical/snapbuild.c >>> +++ b/src/backend/replication/logical/snapbuild.c >>> @@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, >>> bool forced_timetravel = false; >>> bool sub_needs_timetravel = false; >>> bool top_needs_timetravel = false; >>> + bool skip_forced_snapshot = false; >>> >>> TransactionId xmax = xid; >>> >>> @@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid, >>> /* >>> * We could avoid treating !SnapBuildTxnIsRunning transactions as >>> * timetravel ones, but we want to be able to export a snapshot when >>> - * we reached consistency. >>> + * we reached consistency so we need to keep track of them. >>> */ >>> forced_timetravel = true; >>> elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid); >>> + >>> + /* >>> + * It is however desirable to skip building new snapshot for >>> + * !SnapBuildTxnIsRunning transactions as otherwise we might end up >>> + * building thousands of unused snapshots on busy servers which can >>> + * be very expensive. >>> + */ >>> + if (!SnapBuildTxnIsRunning(builder, xid)) >>> + skip_forced_snapshot = true; >>> } >> >> That's pretty crudely bolted on the existing logic, isn't there a >> simpler way? >> > > Agreed, however, every time I tried to make this prettier I ended up > either producing subtle bugs (see the initial email in this thread for > example of that), so I eventually gave up on pretty. > Okay, gave it one more try with fresh head, hopefully without new bugs, what do you think? > As a side note, my opinion after all this is that it's probably mistake > to try to use various situational conditions to make sure we are > building exportable snapshot. 
ISTM we should tell the snapbuilder > explicitly that the snapshot will be exported and it should behave > accordingly based on that. Because for example, we also should track > aborted transactions in the snapshot which is to be exported because > otherwise enough of them happening during snapshot building will result > in export failure due to too big snapshot. But this seems like too > invasive change to be back-portable. > Ended up doing this in the 0002 and also use those changes in 0005, does not seem to be that bad. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On Thu, Apr 13, 2017 at 12:58:12AM -0400, Noah Misch wrote: > On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote: > > On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote: > > > On 4/12/17 02:31, Noah Misch wrote: > > > >>> But I hope you mean to commit these snapbuild patches before the postgres 10 > > > >>> release? As far as I know, logical replication is still very broken without > > > >>> them (or at least some of that set of 5 patches - I don't know which ones > > > >>> are essential and which may not be). > > > >> > > > >> Yes, these should go into 10 *and* earlier releases, and I don't plan to > > > >> wait for 2017-07. > > > > > > > > [Action required within three days. This is a generic notification.] > > > > > > I'm hoping for a word from Andres on this. > > > > Feel free to reassign to me. > > Thanks for volunteering; I'll do that shortly. Please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On 2017-04-16 22:04:04 -0400, Noah Misch wrote: > On Thu, Apr 13, 2017 at 12:58:12AM -0400, Noah Misch wrote: > > On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote: > > > On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote: > > > > On 4/12/17 02:31, Noah Misch wrote: > > > > >>> But I hope you mean to commit these snapbuild patches before the postgres 10 > > > > >>> release? As far as I know, logical replication is still very broken without > > > > >>> them (or at least some of that set of 5 patches - I don't know which ones > > > > >>> are essential and which may not be). > > > > >> > > > > >> Yes, these should go into 10 *and* earlier releases, and I don't plan to > > > > >> wait for 2017-07. > > > > > > > > > > [Action required within three days. This is a generic notification.] > > > > > > > > I'm hoping for a word from Andres on this. > > > > > > Feel free to reassign to me. > > > > Thanks for volunteering; I'll do that shortly. Please observe the policy on > > open item ownership[1] and send a status update within three calendar days of > > this message. Include a date for your subsequent status update. > > > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > This PostgreSQL 10 open item is past due for your status update. Kindly send > a status update within 24 hours, and include a date for your subsequent status > update. I've since the previous update reviewed Petr's patch, which he since has updated over the weekend. I'll do another round tomorrow, and will see how it looks. I think we might need some more tests for this to be committable, so it might not become committable tomorrow. I hope we'll have something in tree by end of this week, if not I'll send an update. - Andres
On 2017-04-17 21:16:57 -0700, Andres Freund wrote: > I think we might need some more tests for this to be committable, so > it might not become committable tomorrow. I'm working on some infrastructure around this. Not sure if it needs to be committed, but it's certainly useful for evaluation. Basically it's a small UDF that: 1) creates a slot via walsender protocol (to some dsn) 2) imports that snapshot into yet another connection to that dsn 3) runs some query over that new connection That makes it reasonably easy to run e.g. pgbench and continually create slots, and use the snapshot to run queries "verifying" that things look good. It's a bit shoestring-ed together, but everything else seems to require more code. And it's just test. Unless somebody has a better idea? - Andres
On 20/04/17 02:09, Andres Freund wrote: > On 2017-04-17 21:16:57 -0700, Andres Freund wrote: >> I think we might need some more tests for this to be committable, so >> it might not become committable tomorrow. > > I'm working on some infrastructure around this. Not sure if it needs to > be committed, but it's certainly useful for evaluation. Basically it's > a small UDF that: > 1) creates a slot via walsender protocol (to some dsn) > 2) imports that snapshot into yet another connection to that dsn > 3) runs some query over that new connection > > That makes it reasonably easy to run e.g. pgbench and continually create > slots, and use the snapshot to run queries "verifying" that things look > good. It's a bit shoestring-ed together, but everything else seems to > require more code. And it's just test. > > Unless somebody has a better idea? > I don't. I mean it would be nice to have isolation tester support walsender protocol, but I don't know anything about isolation tester internals so no idea how much work that is. On top of that some of the issues are not even possible to provoke via isolation tester (or anything similar that would give us control over timing) unless we expose a lot of guts of xlog/xact as a UDFs as well. So I think simple function that does what you said and pgbench are reasonable solutions. I guess you plan to make that as one of the test/modules or something similar (the UDF)? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2017-04-20 13:32:10 +0200, Petr Jelinek wrote: > On 20/04/17 02:09, Andres Freund wrote: > > On 2017-04-17 21:16:57 -0700, Andres Freund wrote: > > I'm working on some infrastructure around this. Not sure if it needs to > > be committed, but it's certainly useful for evaluation. Basically it's > > a small UDF that: > > 1) creates a slot via walsender protocol (to some dsn) > > 2) imports that snapshot into yet another connection to that dsn > > 3) runs some query over that new connection > > > > That makes it reasonably easy to run e.g. pgbench and continually create > > slots, and use the snapshot to run queries "verifying" that things look > > good. It's a bit shoestring-ed together, but everything else seems to > > require more code. And it's just test. > > > > Unless somebody has a better idea? > I don't. I mean it would be nice to have isolation tester support > walsender protocol, but I don't know anything about isolation tester > internals so no idea how much work that is. Now that the replication protocol supports normal queries, it's actually not much of an issue on its own. The problem is more that isolationtester's client side language isn't powerful enough - you can't extract the snapshot name from one session and import it in another. While that might be something we want to address, I certainly don't want to tackle it for v10. I'd started to develop a C toolkit as above, but after I got the basics running I actually noticed it's pretty much unnecessary: You can just as well do it with dblink and some plpgsql. I can reliably reproduce several of the bugs in this thread in a relatively short amount of time before applying the patch, and so far not after. That's great! > I guess you plan to make that as one of the test/modules or something > similar (the UDF)? I've a bunch of tests, but I don't quite know whether we can expose all of them via classical tests. There are several easy ones that I definitely want to add (import "empty" snapshot; import snapshot with running xacts; create snapshot, perform some ddl, import snapshot, perform some ddl, check things work reasonably crazy), but there's enough others that are just probabilistic. I was wondering about adding a loop that simply runs for like 30s and then quits or such, but who knows. Testing around this made me wonder whether we need to make bgwriter.c's LOG_SNAPSHOT_INTERVAL_MS configurable - for efficient testing reducing it is quite valuable, and on busier machines it'll also almost always be a win to log more frequently... Opinions? Greetings, Andres Freund
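A rough sketch of the dblink-plus-plpgsql approach described above might look like the following; this is only an illustration of the idea, not the actual test code from this thread. It assumes the dblink extension is installed, uses test_decoding as a stand-in output plugin, and relies on the walsender connection string containing replication=database so that CREATE_REPLICATION_SLOT is accepted:

-- Sketch only: 'dsn' and 'slotname' are placeholders supplied by the caller,
-- and test_decoding stands in for whatever output plugin is actually used.
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE OR REPLACE FUNCTION check_slot_snapshot(dsn text, slotname text)
RETURNS bigint LANGUAGE plpgsql AS
$$
DECLARE
    snapname text;
    balance  bigint;
BEGIN
    -- 1) create a logical slot over the walsender protocol and remember
    --    the exported snapshot name it returns
    PERFORM dblink_connect('wal', dsn || ' replication=database');
    SELECT snapshot_name INTO snapname
      FROM dblink('wal', 'CREATE_REPLICATION_SLOT ' || slotname ||
                         ' LOGICAL test_decoding')
           AS t(slot_name text, consistent_point text,
                snapshot_name text, output_plugin text);

    -- 2) import that snapshot into a second, ordinary connection;
    --    the walsender keeps the snapshot alive until its next command
    PERFORM dblink_connect('qry', dsn);
    PERFORM dblink_exec('qry', 'BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ');
    PERFORM dblink_exec('qry', 'SET TRANSACTION SNAPSHOT ' || quote_literal(snapname));

    -- 3) run some query under the imported snapshot
    SELECT s INTO balance
      FROM dblink('qry', 'SELECT sum(abalance) FROM pgbench_accounts') AS t(s bigint);

    -- clean up: finish the importing transaction, drop the slot, disconnect
    PERFORM dblink_exec('qry', 'COMMIT');
    PERFORM dblink_exec('wal', 'DROP_REPLICATION_SLOT ' || slotname);
    PERFORM dblink_disconnect('qry');
    PERFORM dblink_disconnect('wal');
    RETURN balance;
END;
$$;

Calling such a function in a loop while pgbench keeps committing should be enough to exercise the snapshot-export races discussed in this thread.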
On 2017-04-17 21:16:57 -0700, Andres Freund wrote: > I've since the previous update reviewed Petr's patch, which he since has > updated over the weekend. I'll do another round tomorrow, and will see > how it looks. I think we might need some more tests for this to be > committable, so it might not become committable tomorrow. I hope we'll > have something in tree by end of this week, if not I'll send an update. I was less productive this week than I'd hoped, and creating a testsuite was more work than I'd anticipated, so I'm slightly lagging behind. I hope to have a patchset tomorrow, aiming to commit something Monday/Tuesday. - Andres
Hi, On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote: > Hi, here is updated patch (details inline). I'm not yet all that happy, sorry: Looking at 0001: - GetOldestSafeDecodingTransactionId() only guarantees to return an xid safe for decoding (note how procArray->replication_slot_catalog_xmin is checked), not one for the initial snapshot - so afaics this whole exercise doesn't guarantee much so far. - A later commit introduces need_full_snapshot as a CreateInitDecodingContext, but you don't use it, not here. That seems wrong. - I remain unhappy with the handling of the reset of effective_xmin in FreeDecodingContext(). What if we ERROR/FATAL out before that happens? What do you think about something like the attached? I've not yet tested it any way except running the regression tests. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 25/04/17 00:59, Andres Freund wrote: > Hi, > > On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote: >> Hi, here is updated patch (details inline). > > I'm not yet all that happy, sorry: > > Looking at 0001: > - GetOldestSafeDecodingTransactionId() only guarantees to return an xid > safe for decoding (note how procArray->replication_slot_catalog_xmin > is checked), not one for the initial snapshot - so afaics this whole > exercise doesn't guarantee much so far. > - A later commit introduces need_full_snapshot as a > CreateInitDecodingContext, but you don't use it, not here. That seems > wrong. Ah yeah looks like that optimization is useful even here. > - I remain unhappy with the handling of the reset of effective_xmin in > FreeDecodingContext(). What if we ERROR/FATAL out before that happens? > Oh your problem was that I did it in FreeDecodingContext() instead of slot release, that I didn't get, yeah sure it's possibly better place. > What do you think about something like the attached? I've not yet > tested it any way except running the regression tests. > > - if (!already_locked) > - LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED); Don't really understand this change much, but otherwise the patch looks good to me. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote: > On 2017-04-17 21:16:57 -0700, Andres Freund wrote: > > I've since the previous update reviewed Petr's patch, which he since has > > updated over the weekend. I'll do another round tomorrow, and will see > > how it looks. I think we might need some more tests for this to be > > committable, so it might not become committable tomorrow. I hope we'll > > have something in tree by end of this week, if not I'll send an update. > > I was less productive this week than I'd hoped, and creating a testsuite > was more work than I'd anticipated, so I'm slightly lagging behind. I > hope to have a patchset tomorrow, aiming to commit something > Monday/Tuesday. This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote: >On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote: >> On 2017-04-17 21:16:57 -0700, Andres Freund wrote: >> > I've since the previous update reviewed Petr's patch, which he >since has >> > updated over the weekend. I'll do another round tomorrow, and will >see >> > how it looks. I think we might need some more tests for this to be >> > committable, so it might not become committable tomorrow. I hope >we'll >> > have something in tree by end of this week, if not I'll send an >update. >> >> I was less productive this week than I'd hoped, and creating a >testsuite >> was more work than I'd anticipated, so I'm slightly lagging behind. >I >> hope to have a patchset tomorrow, aiming to commit something >> Monday/Tuesday. > >This PostgreSQL 10 open item is past due for your status update. >Kindly send >a status update within 24 hours, and include a date for your subsequent >status >update. Refer to the policy on open item ownership: >https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com I committed part of the series today, plan to continue doing so over the next few days. Changes require careful review &testing, this is easy to get wrong... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote: > I've a bunch of tests, but I don't quite know whether we can expose all > of them via classical tests. There are several easy ones that I > definitely want to add (import "empty" snapshot; import snapshot with > running xacts; create snapshot, perform some ddl, import snapshot, > perform some ddl, check things work reasonably crazy), but there's > enough others that are just probabilistic. I was wondering about adding > a loop that simply runs for like 30s and then quits or such, but who > knows. If the probabilistic test catches the bug even 5% of the time in typical configurations, the buildfarm will rapidly identify any regression. I'd choose a 7s test that detects the bug 5% of the time over a 30s test that detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl for a probabilistic bug, I sized that test to finish in 1s and catch its bug half the time. In its case, only two buildfarm members were able to demonstrate the original bug, so 5% detection would have been too low.)
Hi, On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote: > + /* > + * c) we already seen the xl_running_xacts and tried to do the above. > + * However because of race condition in LogStandbySnapshot() there might > + * have been transaction reported as running but in reality has already > + * written commit record before the xl_running_xacts so decoding has > + * missed it. We now see xl_running_xacts that suggests all transactions > + * from the original one were closed but the consistent state wasn't > + * reached which means the race condition has indeed happened. > + * > + * Start tracking again as if this was the first xl_running_xacts we've > + * seen, with the advantage that because decoding was already running, > + * any transactions committed before the xl_running_xacts record will be > + * known to us so we won't hit with the same issue again. > + */ Unfortunately I don't think that's true, as coded. You're using information about committed transactions: > + else if (TransactionIdFollows(running->oldestRunningXid, > + builder->running.xmax)) > + { > + int off; > + > + SnapBuildStartXactTracking(builder, running); > + > /* > + * Nark any transactions that are known to have committed before the > + * xl_running_xacts as finished to avoid the race condition in > + * LogStandbySnapshot(). > * > + * We can use SnapBuildEndTxn directly as it only does the > + * transaction running check and handling without any additional > + * side effects. > */ > + for (off = 0; off < builder->committed.xcnt; off++) > + SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]); but a transaction might just have *aborted* before the new snapshot, no? Since we don't keep track of those, I don't think this guarantees anything? ISTM, we need a xip_status array in SnapBuild->running. Then, whenever a xl_running_xacts is encountered while builder->running.xcnt > 0, loop for i in 0 .. xcnt_space, and end SnapBuildEndTxn() if xip_status[i] but the xid is not xl_running_xacts? Does that make sense? - Andres
On 2017-04-30 17:59:21 -0700, Andres Freund wrote: > ISTM, we need a xip_status array in SnapBuild->running. Then, whenever > a xl_running_xacts is encountered while builder->running.xcnt > 0, loop > for i in 0 .. xcnt_space, and end SnapBuildEndTxn() if xip_status[i] but > the xid is not xl_running_xacts? Does that make sense? A hasty implementation, untested besides check-world, of that approach is attached. I'm going out for dinner now, will subject it to mean things afterwards. Needs more testing, comment and commit message policing. But I think the idea is sound? - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 01/05/17 03:35, Andres Freund wrote: > On 2017-04-30 17:59:21 -0700, Andres Freund wrote: >> ISTM, we need a xip_status array in SnapBuild->running. Then, whenever >> a xl_running_xacts is encountered while builder->running.xcnt > 0, loop >> for i in 0 .. xcnt_space, and end SnapBuildEndTxn() if xip_status[i] but >> the xid is not xl_running_xacts? Does that make sense? > > A hasty implementation, untested besides check-world, of that approach > is attached. I'm going out for dinner now, will subject it to mean > things afterwards. > > Needs more testing, comment and commit message policing. But I think > the idea is sound? > I agree with adding running, I think that's good thing even for the per transaction tracking and snapshot exports - we could use the newly added field to get rid of the issue we have with 'snapshot too large' when there were many aborted transactions while we waited for running ones to finish. Because so far we only tracked committed transactions, any aborted one would show as running inside the exported snapshot so it can grow over the maximum number of backends and can't be exported anymore. So +1 for that part. But, I still think we need to restart the tracking after new xl_running_xacts. Reason for that is afaics any of the catalog snapshots that we assigned to transactions at the end of SnapBuildCommitTxn might be corrupted otherwise as they were built before we knew one of the supposedly running txes was actually already committed and that transaction might have done catalog changes. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 1 May 2017 at 09:54, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > But, I still think we need to restart the tracking after new > xl_running_xacts. Reason for that is afaics any of the catalog snapshots > that we assigned to transactions at the end of SnapBuildCommitTxn might > be corrupted otherwise as they were built before we knew one of the > supposedly running txes was actually already committed and that > transaction might have done catalog changes. Due to the race where LogStandbySnapshot() collects running-xacts info while a concurrent xact commits, such that the xl_xact_commit appears before the xl_running_xacts, but the xl_running_xacts still has the committed xact listed as running, right? Because we update PGXACT only after we write the commit to WAL, so there's a window where an xact is committed in WAL but not shown as committed in shmem. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 01/05/17 04:29, Craig Ringer wrote: > On 1 May 2017 at 09:54, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > >> But, I still think we need to restart the tracking after new >> xl_running_xacts. Reason for that is afaics any of the catalog snapshots >> that we assigned to transactions at the end of SnapBuildCommitTxn might >> be corrupted otherwise as they were built before we knew one of the >> supposedly running txes was actually already committed and that >> transaction might have done catalog changes. > > Due to the race where LogStandbySnapshot() collects running-xacts info > while a concurrent xact commits, such that the xl_xact_commit appears > before the xl_running_xacts, but the xl_running_xacts still has the > committed xact listed as running, right? Because we update PGXACT only > after we write the commit to WAL, so there's a window where an xact is > committed in WAL but not shown as committed in shmem. > Yes, that's what the patch at hand tries to fix, but Andres approached it from a too simplistic standpoint, as we don't only care about the exported snapshot but also about whatever catalog snapshots we made for the transactions we already track, unless I am missing something. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote: > I agree with adding running, I think that's good thing even for the per > transaction tracking and snapshot exports - we could use the newly added > field to get rid of the issue we have with 'snapshot too large' when > there were many aborted transactions while we waited for running ones to > finish. I'm not sure of that - what I was proposing would only track this for the ->running substructure. How'd that help? > But, I still think we need to restart the tracking after new > xl_running_xacts. Reason for that is afaics any of the catalog snapshots > that we assigned to transactions at the end of SnapBuildCommitTxn might > be corrupted otherwise as they were built before we knew one of the > supposedly running txes was actually already committed and that > transaction might have done catalog changes. I'm afraid you're right. But I think this is even more complicated: The argument in your version that this can only happen once, seems to also be holey: Just imagine a pg_usleep(3000 * 1000000) right before ProcArrayEndTransaction() and enjoy the picture. Wonder if we should just (re-)add a stage between SNAPBUILD_START and SNAPBUILD_FULL_SNAPSHOT. Enter SNAPBUILD_BUILD_INITIAL_SNAPSHOT at the first xl_running_xacts, wait for all transactions to end with my approach, while populating SnapBuild->committed, only then start collecting changes for transactions (i.e. return true from SnapBuildProcessChange()), return true once all xacts have finished again. That'd presumably be a bit easier to understand, more robust - and slower. - Andres
On 01/05/17 10:03, Andres Freund wrote: > On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote: >> I agree with adding running, I think that's good thing even for the per >> transaction tracking and snapshot exports - we could use the newly added >> field to get rid of the issue we have with 'snapshot too large' when >> there were many aborted transactions while we waited for running ones to >> finish. > > I'm not sure of that - what I was proposing would only track this for > the ->running substructure. How'd that help? > Well not as is, but it's a building block for it. > >> But, I still think we need to restart the tracking after new >> xl_running_xacts. Reason for that is afaics any of the catalog snapshots >> that we assigned to transactions at the end of SnapBuildCommitTxn might >> be corrupted otherwise as they were built before we knew one of the >> supposedly running txes was actually already committed and that >> transaction might have done catalog changes. > > I'm afraid you're right. But I think this is even more complicated: The > argument in your version that this can only happen once, seems to also > be holey: Just imagine a pg_usleep(3000 * 1000000) right before > ProcArrayEndTransaction() and enjoy the picture. > Well yes, transaction can in theory have written commit/abort xlog record and stayed in proc for more than single xl_running_xacts write. But then the condition which we test that the new xl_running_xacts has bigger xmin than the previously tracked one's xmax would not be satisfied and we would not enter the relevant code path yet. So I think we should not be able to get any xids we didn't see. But we have to restart tracking from beginning (after first checking if we didn't already see anything that the xl_running_xacts considers as running), that's what my code did. > Wonder if we should just (re-)add a stage between SNAPBUILD_START and > SNAPBUILD_FULL_SNAPSHOT. Enter SNAPBUILD_BUILD_INITIAL_SNAPSHOT at the > first xl_running_xacts, wait for all transactions to end with my > approach, while populating SnapBuild->committed, only then start > collecting changes for transactions (i.e. return true from > SnapBuildProcessChange()), return true once all xacts have finished > again. That'd presumably be a bit easier to understand, more robust - > and slower. > That would also work, but per above, I don't understand why it's needed. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Noah Misch <noah@leadboat.com> writes: > On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote: >> ... I was wondering about adding >> a loop that simply runs for like 30s and then quits or such, but who >> knows. > If the probabilistic test catches the bug even 5% of the time in typical > configurations, the buildfarm will rapidly identify any regression. I'd > choose a 7s test that detects the bug 5% of the time over a 30s test that > detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl > for a probabilistic bug, I sized that test to finish in 1s and catch its bug > half the time. In its case, only two buildfarm members were able to > demonstrate the original bug, so 5% detection would have been too low.) 30sec is kind of a big lump from a buildfarm standpoint, especially if you mean "it runs for 30s on my honkin' fast workstation". I'm fine with individual tests that run for ~ 1sec. (This is top-of-mind for me right now because I've been looking around for ways to speed up the regression tests.) regards, tom lane
On 05/01/2017 08:46 AM, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: >> On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote: >>> ... I was wondering about adding >>> a loop that simply runs for like 30s and then quits or such, but who >>> knows. >> If the probabilistic test catches the bug even 5% of the time in typical >> configurations, the buildfarm will rapidly identify any regression. I'd >> choose a 7s test that detects the bug 5% of the time over a 30s test that >> detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl >> for a probabilistic bug, I sized that test to finish in 1s and catch its bug >> half the time. In its case, only two buildfarm members were able to >> demonstrate the original bug, so 5% detection would have been too low.) > 30sec is kind of a big lump from a buildfarm standpoint, especially if > you mean "it runs for 30s on my honkin' fast workstation". I'm fine > with individual tests that run for ~ 1sec. > > (This is top-of-mind for me right now because I've been looking around > for ways to speed up the regression tests.) > > Yes, me too. We're getting a bit lazy about that - see thread nearby that will let us avoid unnecessary temp installs among other things. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-05-01 08:46:47 -0400, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote: > >> ... I was wondering about adding > >> a loop that simply runs for like 30s and then quits or such, but who > >> knows. > > > If the probabilistic test catches the bug even 5% of the time in typical > > configurations, the buildfarm will rapidly identify any regression. I'd > > choose a 7s test that detects the bug 5% of the time over a 30s test that > > detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl > > for a probabilistic bug, I sized that test to finish in 1s and catch its bug > > half the time. In its case, only two buildfarm members were able to > > demonstrate the original bug, so 5% detection would have been too low.) > > 30sec is kind of a big lump from a buildfarm standpoint, especially if > you mean "it runs for 30s on my honkin' fast workstation". I'm fine > with individual tests that run for ~ 1sec. I was more thinking of pgbench -T$XX, rather than constant number of iterations. I currently can reproduce the issues within like 3-4 minutes, so 5s is probably not quite sufficient to get decent coverage. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-05-01 08:46:47 -0400, Tom Lane wrote: >> 30sec is kind of a big lump from a buildfarm standpoint, especially if >> you mean "it runs for 30s on my honkin' fast workstation". I'm fine >> with individual tests that run for ~ 1sec. > I was more thinking of pgench -T$XX, rather than constant number of > iterations. I currently can reproduce the issues within like 3-4 > minutes, so 5s is probably not quite sufficient to get decent coverage. Adding a five-minute pgbench run to the buildfarm sequence is definitely going to get you ridden out of town on a rail. But quite aside from the question of whether we can afford the cycles, it seems like the wrong approach. IMO the buildfarm is mainly for verifying portability, not for trying to prove that race-like conditions don't exist. In most situations we're going out of our way to ensure reproduceability of tests we add to the buildfarm sequence; but it seems like this is looking for irreproducible results. regards, tom lane
On 2017-05-01 12:32:07 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-05-01 08:46:47 -0400, Tom Lane wrote: > >> 30sec is kind of a big lump from a buildfarm standpoint, especially if > >> you mean "it runs for 30s on my honkin' fast workstation". I'm fine > >> with individual tests that run for ~ 1sec. > > > I was more thinking of pgench -T$XX, rather than constant number of > > iterations. I currently can reproduce the issues within like 3-4 > > minutes, so 5s is probably not quite sufficient to get decent coverage. > > Adding a five-minute pgbench run to the buildfarm sequence is definitely > going to get you ridden out of town on a rail. Right - that was referring to Noah's comment upthread: On 2017-04-29 14:42:01 -0700, Noah Misch wrote: > If the probabilistic test catches the bug even 5% of the time in typical > configurations, the buildfarm will rapidly identify any regression. I'd > choose a 7s test that detects the bug 5% of the time over a 30s test that > detects it 99% of the time. (When I wrote src/bin/pgbench/t/001_pgbench.pl > for a probabilistic bug, I sized that test to finish in 1s and catch its bug > half the time. In its case, only two buildfarm members were able to > demonstrate the original bug, so 5% detection would have been too low.) and I suspect that you'd not find these within 5s within sufficient time, because the detection rate would be too low. > But quite aside from the question of whether we can afford the cycles, > it seems like the wrong approach. IMO the buildfarm is mainly for > verifying portability, not for trying to prove that race-like > conditions don't exist. In most situations we're going out of our way > to ensure reproduceability of tests we add to the buildfarm sequence; > but it seems like this is looking for irreproducible results. Yea, I wondered about that upthread as well. But the tests are quite useful nonetheless. Wonder about adding them simply as a separate target. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-05-01 12:32:07 -0400, Tom Lane wrote: >> But quite aside from the question of whether we can afford the cycles, >> it seems like the wrong approach. IMO the buildfarm is mainly for >> verifying portability, not for trying to prove that race-like >> conditions don't exist. In most situations we're going out of our way >> to ensure reproduceability of tests we add to the buildfarm sequence; >> but it seems like this is looking for irreproducible results. > Yea, I wondered about that upthread as well. But the tests are quite > useful nonetheless. Wonder about adding them simply as a separate > target. I have no objection to adding more tests as a non-default target. regards, tom lane
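For what it's worth, wiring such a suite up as a non-default target is mostly Makefile boilerplate. A minimal sketch using the TAP framework - the directory and target names here are hypothetical, nothing like this exists in the tree - could look like:

# src/test/snapbuild_stress/Makefile (hypothetical location)
subdir = src/test/snapbuild_stress
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global

# Deliberately not hooked into the default check/installcheck targets;
# run explicitly with "make check-stress" when chasing race conditions.
# (The recipe line must be indented with a tab, as usual for make.)
check-stress:
	$(prove_check)

That keeps the long, probabilistic runs out of every buildfarm cycle while leaving them one command away - though, as noted downthread, discoverability and rot are the real problems with anything that isn't run by default.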
On 2017-05-01 11:09:44 +0200, Petr Jelinek wrote: > On 01/05/17 10:03, Andres Freund wrote: > > On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote: > >> But, I still think we need to restart the tracking after new > >> xl_running_xacts. Reason for that is afaics any of the catalog snapshots > >> that we assigned to transactions at the end of SnapBuildCommitTxn might > >> be corrupted otherwise as they were built before we knew one of the > >> supposedly running txes was actually already committed and that > >> transaction might have done catalog changes. > > > > I'm afraid you're right. But I think this is even more complicated: The > > argument in your version that this can only happen once, seems to also > > be holey: Just imagine a pg_usleep(3000 * 1000000) right before > > ProcArrayEndTransaction() and enjoy the picture. > Well yes, transaction can in theory have written commit/abort xlog > record and stayed in proc for more than single xl_running_xacts write. > But then the condition which we test that the new xl_running_xacts has > bigger xmin than the previously tracked one's xmax would not be > satisfied and we would not enter the relevant code path yet. So I think > we should not be able to get any xids we didn't see. But we have to > restart tracking from beginning (after first checking if we didn't > already see anything that the xl_running_xacts considers as running), > that's what my code did. But to get that correct, we'd have to not only track ->committed, but also somehow maintain ->aborted, and not just for the transactions in the original set of running transactions. That'd be fairly complicated and large. The reason I was trying - and it's definitely not correct as I had proposed - to use the original running_xacts record is that that only required tracking as many transaction statuses as in the first xl_running_xacts. Am I missing something? The probabilistic tests catch the issues here fairly quickly, btw, if you run with synchronous_commit=on while pgbench is running, because the WAL flushes make this more likely. The test runs this query:

SELECT account_count, teller_count, account_sum - teller_sum s
FROM
    (
        SELECT count(*) account_count, SUM(abalance) account_sum
        FROM pgbench_accounts
    ) a,
    (
        SELECT count(*) teller_count, SUM(tbalance) teller_sum
        FROM pgbench_tellers
    ) t

which, for my scale, should always return:

┌─────────┬─────┬───┐
│    a    │  t  │ s │
├─────────┼─────┼───┤
│ 2000000 │ 200 │ 0 │
└─────────┴─────┴───┘

but with my POC patch occasionally returns things like:

┌─────────┬─────┬───────┐
│    a    │  t  │   s   │
├─────────┼─────┼───────┤
│ 2000000 │ 212 │ 37358 │
└─────────┴─────┴───────┘

which obviously shouldn't be the case.
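To make the shape of that test concrete, a self-contained harness around the same invariant can be as small as the sketch below (ports, scale, and timings are assumptions; the point is only that every TPC-B transaction changes pgbench_accounts and pgbench_tellers by the same amount, so once any initial copy has finished the two sums must agree in every consistent snapshot of the node consuming the decoded changes):

#!/bin/sh
# Sketch: pgbench write load on the publisher, consistency probe on whatever
# node applies the decoded changes (a logical replica in this example).
PUB="-p 5440"   # assumed publisher port
SUB="-p 5441"   # assumed subscriber port

pgbench $PUB -i -s 25 postgres
pgbench $PUB -n -M simple -c 10 -T 60 postgres &

for i in $(seq 1 60); do
    s=$(psql $SUB -XAt -c "SELECT coalesce(sum(abalance), 0) -
            (SELECT coalesce(sum(tbalance), 0) FROM pgbench_tellers)
        FROM pgbench_accounts" postgres)
    [ "$s" = "0" ] || echo "iteration $i: accounts and tellers differ by $s"
    sleep 1
done
wait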
On 5/1/17 13:02, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2017-05-01 12:32:07 -0400, Tom Lane wrote: >>> But quite aside from the question of whether we can afford the cycles, >>> it seems like the wrong approach. IMO the buildfarm is mainly for >>> verifying portability, not for trying to prove that race-like >>> conditions don't exist. In most situations we're going out of our way >>> to ensure reproduceability of tests we add to the buildfarm sequence; >>> but it seems like this is looking for irreproducible results. > >> Yea, I wondered about that upthread as well. But the tests are quite >> useful nonetheless. Wonder about adding them simply as a separate >> target. > > I have no objection to adding more tests as a non-default target. Well, the problem with nondefault targets is that they are hard to find if you don't know them, and then they will rot. Sure, we need a way to distinguish different classes of tests, but lets think about the bigger scheme, too. Ideas welcome. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, May 01, 2017 at 12:32:07PM -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-05-01 08:46:47 -0400, Tom Lane wrote: > >> 30sec is kind of a big lump from a buildfarm standpoint, especially if > >> you mean "it runs for 30s on my honkin' fast workstation". I'm fine > >> with individual tests that run for ~ 1sec. > > > I was more thinking of pgench -T$XX, rather than constant number of > > iterations. I currently can reproduce the issues within like 3-4 > > minutes, so 5s is probably not quite sufficient to get decent coverage. You might hit the race faster by adding a dedicated stress test function to regress.c. > IMO the buildfarm is mainly for verifying portability, not for > trying to prove that race-like conditions don't exist. Perhaps so, but it has excelled at both tasks.
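To sketch what such a regress.c-style helper could look like: the race being chased is a backend that has written its commit record but not yet cleared itself from the procarray, so a SQL-callable function that logs xl_running_xacts records back to back raises the odds that one of them lands in exactly that window. This is purely illustrative - no such function exists in regress.c - and would need a matching CREATE FUNCTION ... LANGUAGE C declaration in the test's SQL file:

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/standby.h"

PG_FUNCTION_INFO_V1(log_running_xacts_repeatedly);

/*
 * Emit the requested number of xl_running_xacts records in a tight loop.
 * Each record may catch a concurrent transaction between writing its
 * commit record and leaving the procarray, which is the window the
 * snapshot builder has to cope with.  Requires wal_level >= replica
 * (logical, for these tests).
 */
Datum
log_running_xacts_repeatedly(PG_FUNCTION_ARGS)
{
	int32		n = PG_GETARG_INT32(0);
	int32		i;

	for (i = 0; i < n; i++)
	{
		CHECK_FOR_INTERRUPTS();
		(void) LogStandbySnapshot();
	}

	PG_RETURN_VOID();
}

Calling that from one session of the stress workload samples the window far more often than waiting for the periodic records the background writer and checkpoints emit.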
On 01/05/17 21:14, Andres Freund wrote: > On 2017-05-01 11:09:44 +0200, Petr Jelinek wrote: >> On 01/05/17 10:03, Andres Freund wrote: >>> On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote: > >>>> But, I still think we need to restart the tracking after new >>>> xl_running_xacts. Reason for that is afaics any of the catalog snapshots >>>> that we assigned to transactions at the end of SnapBuildCommitTxn might >>>> be corrupted otherwise as they were built before we knew one of the >>>> supposedly running txes was actually already committed and that >>>> transaction might have done catalog changes. >>> >>> I'm afraid you're right. But I think this is even more complicated: The >>> argument in your version that this can only happen once, seems to also >>> be holey: Just imagine a pg_usleep(3000 * 1000000) right before >>> ProcArrayEndTransaction() and enjoy the picture. > >> Well yes, transaction can in theory have written commit/abort xlog >> record and stayed in proc for more than single xl_running_xacts write. >> But then the condition which we test that the new xl_running_xacts has >> bigger xmin than the previously tracked one's xmax would not be >> satisfied and we would not enter the relevant code path yet. So I think >> we should not be able to get any xids we didn't see. But we have to >> restart tracking from beginning (after first checking if we didn't >> already see anything that the xl_running_xacts considers as running), >> that's what my code did. > > But to get that correct, we'd have to not only track ->committed, but > also somehow maintain ->aborted, and not just for the transactions in > the original set of running transactions. That'd be fairly complicated > and large. The reason I was trying - and it's definitely not correct as > I had proposed - to use the original running_xacts record is that that > only required tracking as many transaction statuses as in the first > xl_running_xacts. Am I missing something? Aah, now I understand we talked about slightly different things, I considered the running thing to be first step towards tracking aborted txes everywhere. I am not sure if it's complicated, it would be exactly the same as committed tracking, except we'd do it only before we reach SNAPBUILD_CONSISTENT. It would be definitely larger patch I agree, but I can give it at try. If you think that adding the SNAPBUILD_BUILD_INITIAL_SNAPSHOT would be less invasive/smaller patch I am okay with doing that for PG10. I think we'll have to revisit tracking of aborted transactions in PG11 then though because of the 'snapshot too large' issue when exporting, at least I don't see any other way to fix that. > > The probabilistic tests catch the issues here fairly quickly, btw, if > you run with synchronous_commit=on, while pgbench is running, because > the WAL flushes make this more likely. Runs this query: > > SELECT account_count, teller_count, account_sum - teller_sum s > FROM > ( > SELECT count(*) account_count, SUM(abalance) account_sum > FROM pgbench_accounts > ) a, > ( > SELECT count(*) teller_count, SUM(tbalance) teller_sum > FROM pgbench_tellers > ) t > > which, for my scale, should always return: > ┌─────────┬─────┬───┐ > │ a │ t │ s │ > ├─────────┼─────┼───┤ > │ 2000000 │ 200 │ 0 │ > └─────────┴─────┴───┘ > but with my POC patch occasionally returns things like: > ┌─────────┬─────┬───────┐ > │ a │ t │ s │ > ├─────────┼─────┼───────┤ > │ 2000000 │ 212 │ 37358 │ > └─────────┴─────┴───────┘ > > which obviously shouldn't be the case. 
Very nice (the test, not the failures ;))! -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote: > > > On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote: > >On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote: > >> On 2017-04-17 21:16:57 -0700, Andres Freund wrote: > >> > I've since the previous update reviewed Petr's patch, which he > >since has > >> > updated over the weekend. I'll do another round tomorrow, and will > >see > >> > how it looks. I think we might need some more tests for this to be > >> > committable, so it might not become committable tomorrow. I hope > >we'll > >> > have something in tree by end of this week, if not I'll send an > >update. > >> > >> I was less productive this week than I'd hoped, and creating a > >testsuite > >> was more work than I'd anticipated, so I'm slightly lagging behind. > >I > >> hope to have a patchset tomorrow, aiming to commit something > >> Monday/Tuesday. > > > >This PostgreSQL 10 open item is past due for your status update. > >Kindly send > >a status update within 24 hours, and include a date for your subsequent > >status > >update. Refer to the policy on open item ownership: > >https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > I committed part of the series today, plan to continue doing so over the next few days. Changes require careful review& testing, this is easy to get wrong... This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Also, this open item has been alive for three weeks, well above guideline. I understand it's a tricky bug, but I'm worried this isn't on track to end. What is missing to make it end? Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
On 04/05/17 07:45, Noah Misch wrote: > On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote: >> >> >> On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote: >>> On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote: >>>> On 2017-04-17 21:16:57 -0700, Andres Freund wrote: >>>>> I've since the previous update reviewed Petr's patch, which he >>> since has >>>>> updated over the weekend. I'll do another round tomorrow, and will >>> see >>>>> how it looks. I think we might need some more tests for this to be >>>>> committable, so it might not become committable tomorrow. I hope >>> we'll >>>>> have something in tree by end of this week, if not I'll send an >>> update. >>>> >>>> I was less productive this week than I'd hoped, and creating a >>> testsuite >>>> was more work than I'd anticipated, so I'm slightly lagging behind. >>> I >>>> hope to have a patchset tomorrow, aiming to commit something >>>> Monday/Tuesday. >>> >>> This PostgreSQL 10 open item is past due for your status update. >>> Kindly send >>> a status update within 24 hours, and include a date for your subsequent >>> status >>> update. Refer to the policy on open item ownership: >>> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com >> >> I committed part of the series today, plan to continue doing so over the next few days. Changes require careful review & testing, this is easy to get wrong... > > This PostgreSQL 10 open item is past due for your status update. Kindly send > a status update within 24 hours, and include a date for your subsequent status > update. > > Also, this open item has been alive for three weeks, well above guideline. I > understand it's a tricky bug, but I'm worried this isn't on track to end. > What is missing to make it end? It's actually five tricky bugs, and they are quite sensitive to rare timing/concurrency events. The first two are fixed, and we can live with the 5th being done later (as it's not a correctness fix, but a fix for very bad performance). We haven't yet found a way of fixing the 4th one that we all agree is a good solution (everything proposed so far still had bugs). I am not quite sure what happened to the 3rd one. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, On 2017-05-02 08:55:53 +0200, Petr Jelinek wrote: > Aah, now I understand we talked about slightly different things, I > considered the running thing to be first step towards tracking aborted > txes everywhere. > I think > we'll have to revisit tracking of aborted transactions in PG11 then > though because of the 'snapshot too large' issue when exporting, at > least I don't see any other way to fix that. FWIW, that seems unnecessary - we can just check for that using the clog. Should be very simple to check for aborted xacts when exporting the snapshot (like 2 lines + comments). That should address your concern, right? > If you think that adding the SNAPBUILD_BUILD_INITIAL_SNAPSHOT would be > less invasive/smaller patch I am okay with doing that for PG10. Attached is a prototype patch for that. What I decided is that essentially tracking the running xacts is too unreliable due to the race, so I decided that just relying on oldestRunningXid and nextXid - which are solely in the procArray and thus race-free - is better. It's not perfect yet, primarily because we'd need to take a bit more care about being ABI compatible for older releases, and because we'd probably have to trigger LogStandbySnapshot() a bit more frequently (presumably while waiting for WAL). The change means we'll have to wait a bit longer for slot creation, but it's considerably simpler / more robust. Could you have a look? Regards, Andres
On 2017-05-04 17:00:04 -0700, Andres Freund wrote: > Attached is a prototype patch for that. Oops. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 05/05/17 02:00, Andres Freund wrote: > Hi, > > On 2017-05-02 08:55:53 +0200, Petr Jelinek wrote: >> Aah, now I understand we talked about slightly different things, I >> considered the running thing to be first step towards tracking aborted >> txes everywhere. >> I think >> we'll have to revisit tracking of aborted transactions in PG11 then >> though because of the 'snapshot too large' issue when exporting, at >> least I don't see any other way to fix that. > > FWIW, that seems unnecessary - we can just check for that using the > clog. Should be very simple to check for aborted xacts when exporting > the snapshot (like 2 lines + comments). That should address your > concern, right? Right, because there isn't a practical difference between a running and an aborted transaction for us, so we don't mind if the abort only happens in the future. Yeah, the export code could do the check; it seems quite simple. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
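For the archives, the clog check being discussed really is small. A sketch of it - paraphrasing the loop that fills the exported snapshot's xip array in SnapBuildInitialSnapshot(), not a verbatim patch - might look like:

#include "access/transam.h"		/* TransactionIdDidAbort() */

	/*
	 * xid is one of the transactions the builder still treats as running.
	 * If the clog already knows it aborted, its tuples can never become
	 * visible whether or not it is listed in the snapshot, so it is safe
	 * to leave it out of the exported xip array - which is what keeps the
	 * exported snapshot from overflowing when many of the "running"
	 * transactions are in fact long dead.
	 */
	if (TransactionIdDidAbort(xid))
		continue;

Transactions that are still genuinely in progress return false from that check and keep being exported as running, so visibility is unaffected.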
On 05/05/17 02:42, Andres Freund wrote: > On 2017-05-04 17:00:04 -0700, Andres Freund wrote: >> Attached is a prototype patch for that. > I am not sure I understand the ABI comment for started_collection_at. What's the ABI issue? The struct is private to the snapbuild.c module. Or do you want to store it in the ondisk snapshot as well? As for a better name for it, what about something like oldest_full_xact? Otherwise the logic seems right on a first read; I will ponder it a bit more. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2017-05-05 13:53:16 +0200, Petr Jelinek wrote: > On 05/05/17 02:42, Andres Freund wrote: > > On 2017-05-04 17:00:04 -0700, Andres Freund wrote: > >> Attached is a prototype patch for that. > > > > I am not sure I understand the ABI comment for started_collection_at. > What's the ABI issue? The struct is private to snapbuild.c module. Or > you want to store it in the ondisk snapshot as well? It's stored on-disk already :( > Otherwise the logic seems to be right on first read, will ponder it a > bit more Cool! Greetings, Andres Freund
On 05/05/17 18:18, Andres Freund wrote: > On 2017-05-05 13:53:16 +0200, Petr Jelinek wrote: >> On 05/05/17 02:42, Andres Freund wrote: >>> On 2017-05-04 17:00:04 -0700, Andres Freund wrote: >>>> Attached is a prototype patch for that. >>> >> >> I am not sure I understand the ABI comment for started_collection_at. >> What's the ABI issue? The struct is private to snapbuild.c module. Or >> you want to store it in the ondisk snapshot as well? > > It's stored on-disk already :( Hmm okay, well then I guess we'll have to store it somewhere inside 'running'; we should not normally care about that, since we only load CONSISTENT snapshots, for which 'running' doesn't matter anymore. Alternatively we could bump the SNAPBUILD_VERSION, but that would make it impossible to downgrade to an older minor version, which is bad (I guess we'll want to do that for master, but not for the back-branches). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On May 3, 2017 10:45:16 PM PDT, Noah Misch <noah@leadboat.com> wrote: >On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote: >> >> >> On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> >wrote: >> >On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote: >> >> On 2017-04-17 21:16:57 -0700, Andres Freund wrote: >> >> > I've since the previous update reviewed Petr's patch, which he >> >since has >> >> > updated over the weekend. I'll do another round tomorrow, and >will >> >see >> >> > how it looks. I think we might need some more tests for this to >be >> >> > committable, so it might not become committable tomorrow. I >hope >> >we'll >> >> > have something in tree by end of this week, if not I'll send an >> >update. >> >> >> >> I was less productive this week than I'd hoped, and creating a >> >testsuite >> >> was more work than I'd anticipated, so I'm slightly lagging >behind. >> >I >> >> hope to have a patchset tomorrow, aiming to commit something >> >> Monday/Tuesday. >> > >> >This PostgreSQL 10 open item is past due for your status update. >> >Kindly send >> >a status update within 24 hours, and include a date for your >subsequent >> >status >> >update. Refer to the policy on open item ownership: >> >>https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com >> >> I committed part of the series today, plan to continue doing so over >the next few days. Changes require careful review & testing, this is >easy to get wrong... > >This PostgreSQL 10 open item is past due for your status update. >Kindly send >a status update within 24 hours, and include a date for your subsequent >status >update. > >Also, this open item has been alive for three weeks, well above >guideline. I >understand it's a tricky bug, but I'm worried this isn't on track to >end. >What is missing to make it end? > >Refer to the policy on open item ownership: >https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com I plan to commit the next pending patch after the back branch releases are cut - it's an invasive fix and the issue doesn't cause corruption, "just" slow slot creation. So it seems better to wait for a few days, rather than hurry it into the release. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 2017-05-05 02:00, Andres Freund wrote: > > Could you have a look? Running tests with these three patches: > 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ > 0002-WIP-Possibly-more-robust-snapbuild-approach.patch + > fix-statistics-reporting-in-logical-replication-work.patch (on top of 44c528810) I test by 15-minute pgbench runs while there is a logical replication connection. Primary and replica are on the same machine. I have seen errors on 3 different machines (where error means: at least 1 of the 4 pgbench tables is not md5-equal). It seems better, faster machines yield less errors. Normally I see in pg_stat_replication (on master) one process in state 'streaming'. pid | wal | replay_loc | diff | state | app | sync_state 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async Often there are another two processes in pg_stat_replication that remain in state 'startup'. In the failing sessions the 'streaming'-state process is missing; in failing sessions there are only the two processes that are and remain in 'startup'. FWIW, below the output of a succesful and a failed run: ------------------ successful run: creating tables... 1590400 of 2500000 tuples (63%) done (elapsed 5.34 s, remaining 3.05 s) 2500000 of 2500000 tuples (100%) done (elapsed 9.63 s, remaining 0.00 s) vacuum... set primary keys... done. create publication pub1 for all tables; create subscription sub1 connection 'port=6972 application_name=derail2' publication pub1 with (disabled); alter subscription sub1 enable; -- pgbench -c 90 -j 8 -T 900 -P 180 -n -- scale 25 progress: 180.0 s, 82.5 tps, lat 1086.845 ms stddev 3211.785 progress: 360.0 s, 25.4 tps, lat 3469.040 ms stddev 6297.440 progress: 540.0 s, 28.9 tps, lat 3131.438 ms stddev 4288.130 progress: 720.0 s, 27.5 tps, lat 3285.024 ms stddev 4113.841 progress: 900.0 s, 47.2 tps, lat 1896.698 ms stddev 2182.695 transaction type: <builtin: TPC-B (sort of)> scaling factor: 25 query mode: simple number of clients: 90 number of threads: 8 duration: 900 s number of transactions actually processed: 38175 latency average = 2128.606 ms latency stddev = 3948.634 ms tps = 42.151205 (including connections establishing) tps = 42.151589 (excluding connections establishing) -- waiting 0s... (always) port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state 6972 | pg_stat_replication | 2545 | 18/432B2180 | 18/432B2180 | 0 | t | streaming | derail2 | async 2017.05.08 23:19:22 -- getting md5 (cb) 6972 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837 d1afac950 d4abcc72e master 6973 a,b,t,h: 2500000 25 250 38175 b2ba48b53 b3788a837 d1afac950 d4abcc72e replica ok bee2312c7 2017.05.08 23:20:48 port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state 6972| pg_stat_replication | 2545 | 18/4AEEC8C0 | 18/453FBD20 | 95357856 | f | streaming | derail2 | async ------------------------------------ ------------------ failure: creating tables... 1777100 of 2500000 tuples (71%) done (elapsed 5.06 s, remaining 2.06 s) 2500000 of 2500000 tuples (100%) done (elapsed 7.41 s, remaining 0.00 s) vacuum... set primary keys... done. create publication pub1 for all tables; create subscription sub1 connection 'port=6972 application_name=derail2' publication pub1 with (disabled); alter subscription sub1 enable; port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? 
| state | app | sync_state 6972 | pg_stat_replication | 11945 | 18/5E2913D0 | | | | catchup | derail2 | async -- pgbench -c 90 -j 8 -T 900 -P 180 -n -- scale 25 progress: 180.0 s, 78.4 tps, lat 1138.348 ms stddev 2884.815 progress: 360.0 s, 69.2 tps, lat 1309.716 ms stddev 2594.231 progress: 540.0 s, 59.0 tps, lat 1519.146 ms stddev 2033.400 progress: 720.0 s, 62.9 tps, lat 1421.854 ms stddev 1775.066 progress: 900.0 s, 57.0 tps, lat 1575.693 ms stddev 1681.800 transaction type: <builtin: TPC-B (sort of)> scaling factor: 25 query mode: simple number of clients: 90 number of threads: 8 duration: 900 s number of transactions actually processed: 58846 latency average = 1378.259 ms latency stddev = 2304.159 ms tps = 65.224168 (including connections establishing) tps = 65.224788 (excluding connections establishing) -- waiting 0s... (always) port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state 6972 | pg_stat_replication | 11948 | 18/7469A038 | | | | startup | derail2 | async 6972 | pg_stat_replication | 12372 | 18/7469A038 | | | | startup |derail2 | async ------------------------------------ During my tests, I keep an eye on pg_stat_replication (refreshing every 2s), and when I see those two 'startup'-lines in pg_stat_replication without any 'streaming'-line I know the test is going to fail. I believe this has been true for all failure cases that I've seen (except the much more rare stuck-DROP-SUBSCRIPTION which is mentioned in another thread). Sorry, I have not been able to get any thing more clear or definitive... thanks, Erik Rijkers
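For anyone trying to reproduce this: the "md5-equal" check above boils down to computing a digest per pgbench table on both nodes and comparing. A sketch of such a digest query (an approximation of the harness, not its actual code; imposing a sort inside the aggregate makes the digest independent of physical row order):

-- Run the same statement against the publisher and the subscriber port
-- (e.g. psql -p 6972 / -p 6973 in the setup above) and compare the output.
SELECT 'pgbench_accounts' AS tab, count(*) AS n_rows,
       md5(string_agg(a::text, E'\n' ORDER BY a::text)) AS digest
FROM pgbench_accounts a
UNION ALL
SELECT 'pgbench_branches', count(*),
       md5(string_agg(b::text, E'\n' ORDER BY b::text))
FROM pgbench_branches b
UNION ALL
SELECT 'pgbench_tellers', count(*),
       md5(string_agg(t::text, E'\n' ORDER BY t::text))
FROM pgbench_tellers t
UNION ALL
SELECT 'pgbench_history', count(*),
       md5(string_agg(h::text, E'\n' ORDER BY h::text))
FROM pgbench_history h;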
On 09/05/17 00:03, Erik Rijkers wrote: > On 2017-05-05 02:00, Andres Freund wrote: >> >> Could you have a look? > > Running tests with these three patches: > >> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ >> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch + >> fix-statistics-reporting-in-logical-replication-work.patch > (on top of 44c528810) > > I test by 15-minute pgbench runs while there is a logical replication > connection. Primary and replica are on the same machine. > > I have seen errors on 3 different machines (where error means: at least > 1 of the 4 pgbench tables is not md5-equal). It seems better, faster > machines yield less errors. > > Normally I see in pg_stat_replication (on master) one process in state > 'streaming'. > > pid | wal | replay_loc | diff | state | app | > sync_state > 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async > > Often there are another two processes in pg_stat_replication that remain > in state 'startup'. > > In the failing sessions the 'streaming'-state process is missing; in > failing sessions there are only the two processes that are and remain in > 'startup'. Hmm, startup is the state where slot creation is happening. I wonder if it's just taking long time to create snapshot because of the 5th issue which is not yet fixed (and the original patch will not apply on top of this change). Alternatively there is a bug in this patch. Did you see high CPU usage during the test when there were those "startup" state walsenders? -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2017-05-09 10:50, Petr Jelinek wrote: > On 09/05/17 00:03, Erik Rijkers wrote: >> On 2017-05-05 02:00, Andres Freund wrote: >>> >>> Could you have a look? >> >> Running tests with these three patches: >> >>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ >>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch + >>> fix-statistics-reporting-in-logical-replication-work.patch >> (on top of 44c528810) >> >> I test by 15-minute pgbench runs while there is a logical replication >> connection. Primary and replica are on the same machine. >> >> I have seen errors on 3 different machines (where error means: at >> least >> 1 of the 4 pgbench tables is not md5-equal). It seems better, faster >> machines yield less errors. >> >> Normally I see in pg_stat_replication (on master) one process in state >> 'streaming'. >> >> pid | wal | replay_loc | diff | state | app | >> sync_state >> 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | >> async >> >> Often there are another two processes in pg_stat_replication that >> remain >> in state 'startup'. >> >> In the failing sessions the 'streaming'-state process is missing; in >> failing sessions there are only the two processes that are and remain >> in >> 'startup'. > > Hmm, startup is the state where slot creation is happening. I wonder if > it's just taking long time to create snapshot because of the 5th issue > which is not yet fixed (and the original patch will not apply on top of > this change). Alternatively there is a bug in this patch. > > Did you see high CPU usage during the test when there were those > "startup" state walsenders? > I haven't noticed but I didn't pay attention to that particularly. I'll try to get some CPU-info logged...
On 09/05/17 10:59, Erik Rijkers wrote: > On 2017-05-09 10:50, Petr Jelinek wrote: >> On 09/05/17 00:03, Erik Rijkers wrote: >>> On 2017-05-05 02:00, Andres Freund wrote: >>>> >>>> Could you have a look? >>> >>> Running tests with these three patches: >>> >>>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ >>>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch + >>>> fix-statistics-reporting-in-logical-replication-work.patch >>> (on top of 44c528810) >>> >>> I test by 15-minute pgbench runs while there is a logical replication >>> connection. Primary and replica are on the same machine. >>> >>> I have seen errors on 3 different machines (where error means: at least >>> 1 of the 4 pgbench tables is not md5-equal). It seems better, faster >>> machines yield less errors. >>> >>> Normally I see in pg_stat_replication (on master) one process in state >>> 'streaming'. >>> >>> pid | wal | replay_loc | diff | state | app | >>> sync_state >>> 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | >>> async >>> >>> Often there are another two processes in pg_stat_replication that remain >>> in state 'startup'. >>> >>> In the failing sessions the 'streaming'-state process is missing; in >>> failing sessions there are only the two processes that are and remain in >>> 'startup'. >> >> Hmm, startup is the state where slot creation is happening. I wonder if >> it's just taking long time to create snapshot because of the 5th issue >> which is not yet fixed (and the original patch will not apply on top of >> this change). Alternatively there is a bug in this patch. >> >> Did you see high CPU usage during the test when there were those >> "startup" state walsenders? >> > > I haven't noticed but I didn't pay attention to that particularly. > > I'll try to get some CPU-info logged... > I rebased the above mentioned patch to apply to the patches Andres sent, if you could try to add it on top of what you have and check if it still fails, that would be helpful. Thanks! -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 2017-05-09 11:50, Petr Jelinek wrote: > On 09/05/17 10:59, Erik Rijkers wrote: >> On 2017-05-09 10:50, Petr Jelinek wrote: >>> On 09/05/17 00:03, Erik Rijkers wrote: >>>> On 2017-05-05 02:00, Andres Freund wrote: >>>>> >>>>> Could you have a look? [...] > I rebased the above mentioned patch to apply to the patches Andres > sent, > if you could try to add it on top of what you have and check if it > still > fails, that would be helpful. I suppose you mean these; but they do not apply anymore: 20170505/0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch 20170505/0002-WIP-Possibly-more-robust-snapbuild-approach.patch Andres, any chance you could update them? Alternatively I could use the older version again.. thanks, Erik Rijkers
On 2017-05-09 11:50, Petr Jelinek wrote: > I rebased the above mentioned patch to apply to the patches Andres > sent, > if you could try to add it on top of what you have and check if it > still > fails, that would be helpful. It still fails. With these patches - 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ - 00002-WIP-Possibly-more-robust-snapbuild-approach.patch + - fix-statistics-reporting-in-logical-replication-work.patch + - Skip-unnecessary-snapshot-builds.patch built again on top of 44c528810a1 ( so I had to add the 'fix-statistics-rep*' patch because without it I immediately got that Assertion failure again ). As always most runs succeed (especially on this large 192GB 16-core server). But attached is an output file of a number of runs of my pgbench_derail2.sh test. Overal result: -- out_20170509_1635.txt 3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n -- scale 25 2 -- All is well. 1 -- Not good, but breaking out of wait (21 times no change) I broke it off after iteration 4, so 5 never ran, and iteration 1 failed due to a mistake in the harness (somethind stupid I did) - not interesting. iteration 2 succeeds. (eventually has 'replica ok') iteration 3 succeeds. (eventually has 'replica ok') iteration 4 fails. Just after 'alter subscription sub1 enable' I caught (as is usual) pg_stat_replication.state as 'catchup'. So far so good. After the 15-minute pgbench run pg_stat_replication has only 2 'startup' lines (and none 'catchup' or 'streaming'): port | pg_stat_replication | pid | wal | replay_loc | diff | ?column? | state | app | sync_state 6972 | pg_stat_replication | 108349 | 19/8FBCC248 | | | | startup | derail2 | async 6972 | pg_stat_replication | 108351 | 19/8FBCC248 | | | | startup | derail2 | async (that's from: select $port1 as port,'pg_stat_replication' as pg_stat_replication, pid , pg_current_wal_location() wal, replay_location replay_loc, pg_current_wal_location() - replay_location as diff , pg_current_wal_location() <= replay_location , state, application_name as app, sync_state from pg_stat_replication ) This remains in this state for as long as my test-programs lets it (i.e., 20 x 30s, or something like that, and then the loop is exited); in the ouput file it says: 'Not good, but breaking out of wait' Below is the accompanying ps (with the 2 'deranged senders' as Jeff Janes would surely call them): UID PID PPID C STIME TTY STAT TIME CMD rijkers 107147 1 0 17:11 pts/35 S+ 0:00 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations rijkers 107149 107147 0 17:11 ? Ss 0:00 \_ postgres: logger process rijkers 107299 107147 0 17:11 ? Ss 0:01 \_ postgres: checkpointer process rijkers 107300 107147 0 17:11 ? Ss 0:00 \_ postgres: writer process rijkers 107301 107147 0 17:11 ? Ss 0:00 \_ postgres: wal writer process rijkers 107302 107147 0 17:11 ? Ss 0:00 \_ postgres: autovacuum launcher process rijkers 107303 107147 0 17:11 ? Ss 0:00 \_ postgres: stats collector process rijkers 107304 107147 0 17:11 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher rijkers 108348 107147 0 17:12 ? Ss 0:01 \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293 rijkers 108350 107147 0 17:12 ? Ss 0:00 \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298 rijkers 107145 1 0 17:11 pts/35 S+ 0:02 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations rijkers 107151 107145 0 17:11 ? 
Ss 0:00 \_ postgres: logger process rijkers 107160 107145 0 17:11 ? Ss 0:08 \_ postgres: checkpointer process rijkers 107161 107145 0 17:11 ? Ss 0:07 \_ postgres: writer process rijkers 107162 107145 0 17:11 ? Ss 0:02 \_ postgres: wal writer process rijkers 107163 107145 0 17:11 ? Ss 0:00 \_ postgres: autovacuum launcher process rijkers 107164 107145 0 17:11 ? Ss 0:02 \_ postgres: stats collector process rijkers 107165 107145 0 17:11 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher rijkers 108349 107145 0 17:12 ? Ss 0:27 \_ postgres: wal sender process rijkers [local] idle rijkers 108351 107145 0 17:12 ? Ss 0:26 \_ postgres: wal sender process rijkers [local] idle I have had no time to add (or view) any CPUinfo. Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 09/05/17 19:54, Erik Rijkers wrote: > On 2017-05-09 11:50, Petr Jelinek wrote: > >> I rebased the above mentioned patch to apply to the patches Andres sent, >> if you could try to add it on top of what you have and check if it still >> fails, that would be helpful. > > It still fails. > > With these patches > > - 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+ > - 00002-WIP-Possibly-more-robust-snapbuild-approach.patch + > - fix-statistics-reporting-in-logical-replication-work.patch + > - Skip-unnecessary-snapshot-builds.patch > > built again on top of 44c528810a1 ( so I had to add the > 'fix-statistics-rep*' patch because without it I immediately got that > Assertion failure again ). > > As always most runs succeed (especially on this large 192GB 16-core > server). > > But attached is an output file of a number of runs of my > pgbench_derail2.sh test. > > Overal result: > > -- out_20170509_1635.txt > 3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n -- scale 25 > 2 -- All is well. > 1 -- Not good, but breaking out of wait (21 times no change) > > I broke it off after iteration 4, so 5 never ran, and > iteration 1 failed due to a mistake in the harness (somethind stupid I > did) - not interesting. > > iteration 2 succeeds. (eventually has 'replica ok') > > iteration 3 succeeds. (eventually has 'replica ok') > > iteration 4 fails. > Just after 'alter subscription sub1 enable' I caught (as is usual) > pg_stat_replication.state as 'catchup'. So far so good. > After the 15-minute pgbench run pg_stat_replication has only 2 > 'startup' lines (and none 'catchup' or 'streaming'): > > port | pg_stat_replication | pid | wal | replay_loc | diff | > ?column? | state | app | sync_state > 6972 | pg_stat_replication | 108349 | 19/8FBCC248 | | | > | startup | derail2 | async > 6972 | pg_stat_replication | 108351 | 19/8FBCC248 | | | > | startup | derail2 | async > > (that's from: > select $port1 as port,'pg_stat_replication' as pg_stat_replication, pid > , pg_current_wal_location() wal, replay_location replay_loc, > pg_current_wal_location() - replay_location as diff > , pg_current_wal_location() <= replay_location > , state, application_name as app, sync_state > from pg_stat_replication > ) > > This remains in this state for as long as my test-programs lets it > (i.e., 20 x 30s, or something like that, and then the loop is exited); > in the ouput file it says: 'Not good, but breaking out of wait' > > Below is the accompanying ps (with the 2 'deranged senders' as Jeff > Janes would surely call them): > > > UID PID PPID C STIME TTY STAT TIME CMD > rijkers 107147 1 0 17:11 pts/35 S+ 0:00 > /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres > -D /var/data1/pg_stuff/pg_installations > rijkers 107149 107147 0 17:11 ? Ss 0:00 \_ postgres: > logger process > rijkers 107299 107147 0 17:11 ? Ss 0:01 \_ postgres: > checkpointer process > rijkers 107300 107147 0 17:11 ? Ss 0:00 \_ postgres: > writer process > rijkers 107301 107147 0 17:11 ? Ss 0:00 \_ postgres: wal > writer process > rijkers 107302 107147 0 17:11 ? Ss 0:00 \_ postgres: > autovacuum launcher process > rijkers 107303 107147 0 17:11 ? Ss 0:00 \_ postgres: stats > collector process > rijkers 107304 107147 0 17:11 ? Ss 0:00 \_ postgres: > bgworker: logical replication launcher > rijkers 108348 107147 0 17:12 ? Ss 0:01 \_ postgres: > bgworker: logical replication worker for subscription 70310 sync 70293 > rijkers 108350 107147 0 17:12 ? 
Ss 0:00 \_ postgres: > bgworker: logical replication worker for subscription 70310 sync 70298 > rijkers 107145 1 0 17:11 pts/35 S+ 0:02 > /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres > -D /var/data1/pg_stuff/pg_installations > rijkers 107151 107145 0 17:11 ? Ss 0:00 \_ postgres: > logger process > rijkers 107160 107145 0 17:11 ? Ss 0:08 \_ postgres: > checkpointer process > rijkers 107161 107145 0 17:11 ? Ss 0:07 \_ postgres: > writer process > rijkers 107162 107145 0 17:11 ? Ss 0:02 \_ postgres: wal > writer process > rijkers 107163 107145 0 17:11 ? Ss 0:00 \_ postgres: > autovacuum launcher process > rijkers 107164 107145 0 17:11 ? Ss 0:02 \_ postgres: stats > collector process > rijkers 107165 107145 0 17:11 ? Ss 0:00 \_ postgres: > bgworker: logical replication launcher > rijkers 108349 107145 0 17:12 ? Ss 0:27 \_ postgres: wal > sender process rijkers [local] idle > rijkers 108351 107145 0 17:12 ? Ss 0:26 \_ postgres: wal > sender process rijkers [local] idle > > I have had no time to add (or view) any CPUinfo. > > Ah okay, so this is same issue that's reported by both Masahiko Sawada [1] and Jeff Janes [2]. [1] https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2017-05-09 21:00, Petr Jelinek wrote: > On 09/05/17 19:54, Erik Rijkers wrote: >> On 2017-05-09 11:50, Petr Jelinek wrote: >> > > Ah okay, so this is same issue that's reported by both Masahiko Sawada > [1] and Jeff Janes [2]. > > [1] > https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com > [2] > https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com I don't understand why you come to that conclusion: both Masahiko Sawada and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. Isn't that a real difference? ( I do sometimes get that DROP-SUBSCRIPTION too, but much less often than the sync-failure. )
On 09/05/17 22:11, Erik Rijkers wrote: > On 2017-05-09 21:00, Petr Jelinek wrote: >> On 09/05/17 19:54, Erik Rijkers wrote: >>> On 2017-05-09 11:50, Petr Jelinek wrote: >>> >> >> Ah okay, so this is same issue that's reported by both Masahiko Sawada >> [1] and Jeff Janes [2]. >> >> [1] >> https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com >> [2] >> https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com >> > > I don't understand why you come to that conclusion: both Masahiko Sawada > and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. > Isn't that a real difference? > Because I only see the sync workers running in the output you showed. The bug described in those threads can, as one of its side effects, cause the sync workers to wait forever for an apply worker that was killed, crashed, etc. at just the right moment, which I guess is what happened during your failed test (there should be some info in the log about the apply worker exiting). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
> On 10 May 2017, at 11:43, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: > > On 09/05/17 22:11, Erik Rijkers wrote: >> On 2017-05-09 21:00, Petr Jelinek wrote: >>> On 09/05/17 19:54, Erik Rijkers wrote: >>>> On 2017-05-09 11:50, Petr Jelinek wrote: >>>> >>> >>> Ah okay, so this is same issue that's reported by both Masahiko Sawada >>> [1] and Jeff Janes [2]. >>> >>> [1] >>> https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com >>> >>> [2] >>> https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com >>> >> >> I don't understand why you come to that conclusion: both Masahiko Sawada >> and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. >> Isn't that a real difference? >> > Just noted another issue/feature with the snapshot builder — while logical decoding is in progress and a VACUUM FULL is issued, quite a large number of files appears in pg_logical/mappings, and they stay there until the checkpoint. Running consecutive VACUUM FULLs before the checkpoint yields even more files. Having two pgbench-filled instances with logical replication between them:

for i in {1..100}; do psql -c 'vacuum full' && ls -la pg_logical/mappings | wc -l; done;
VACUUM
73
VACUUM
454
VACUUM
1146
VACUUM
2149
VACUUM
3463
VACUUM
5088
<...>
VACUUM
44708
<…> // checkpoint happens somewhere here
VACUUM
20921
WARNING: could not truncate file "base/16384/61773": Too many open files in system
ERROR: could not fsync file "pg_logical/mappings/map-4000-4df-0_A4EA29F8-5aa5-5ae6": Too many open files in system

I'm not sure whether this boils down to some of the previous issues mentioned here or not, so I'm posting it here as an observation. Stas Kelvich Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 11/05/17 16:33, Stas Kelvich wrote: > >> On 10 May 2017, at 11:43, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote: >> >> On 09/05/17 22:11, Erik Rijkers wrote: >>> On 2017-05-09 21:00, Petr Jelinek wrote: >>>> On 09/05/17 19:54, Erik Rijkers wrote: >>>>> On 2017-05-09 11:50, Petr Jelinek wrote: >>>>> >>>> >>>> Ah okay, so this is same issue that's reported by both Masahiko Sawada >>>> [1] and Jeff Janes [2]. >>>> >>>> [1] >>>> https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com >>>> >>>> [2] >>>> https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com >>>> >>> >>> I don't understand why you come to that conclusion: both Masahiko Sawada >>> and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. >>> Isn't that a real difference? >>> >> > > Just noted another one issue/feature with snapshot builder — when logical decoding is in progress > and vacuum full command is being issued quite a big amount of files appears in pg_logical/mappings > and reside there till the checkpoint. Also having consequent vacuum full’s before checkpoint yields even > more files. > > Having two pgbench-filled instances with logical replication between them: > > for i in {1..100}; do psql -c 'vacuum full' && ls -la pg_logical/mappings | wc -l; done; > VACUUM > 73 > VACUUM > 454 > VACUUM > 1146 > VACUUM > 2149 > VACUUM > 3463 > VACUUM > 5088 > <...> > VACUUM > 44708 > <…> // checkpoint happens somewhere here > VACUUM > 20921 > WARNING: could not truncate file "base/16384/61773": Too many open files in system > ERROR: could not fsync file "pg_logical/mappings/map-4000-4df-0_A4EA29F8-5aa5-5ae6": Too many open files in system > > > I’m not sure whether this is boils down to some of the previous issues mentioned here or not, so posting > here as observation. > This has nothing to do with snapshot builder so this is not the thread for it. See the comment section near the bottom of src/backend/access/heap/rewriteheap.c for explanation of why it does what it does. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2017-05-08 00:10:12 -0700, Andres Freund wrote: > I plan to commit the next pending patch after the back branch releases > are cut - it's an invasive fix and the issue doesn't cause corruption > "just" slow slot creation. So it seems better to wait for a few days, > rather than hurry it into the release. Now that that's done, here's an updated version of that patch. Note the new logic to trigger xl_running_xact's to be logged at the right spot. Works well in my testing. I plan to commit this fairly soon, unless somebody wants a bit more time to look into it. Unless somebody protests, I'd like to slightly revise how the on-disk snapshots are stored on master - given the issues this bug/commit showed with the current method - but I can see one could argue that that should rather be done next release. - Andres
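To illustrate the "logged at the right spot" part: the building block is LogStandbySnapshot(), and the idea is that slot creation can emit a running-xacts record itself while it is waiting, instead of sitting idle until the background writer's next periodic record. A rough sketch of that idea only - placement and naming are illustrative, not the committed code:

#include "postgres.h"

#include "access/xlogdefs.h"
#include "storage/standby.h"

/*
 * While waiting for the snapshot builder to reach a consistent point,
 * make sure there is a fresh xl_running_xacts record ahead of us to
 * decode, rather than waiting for the bgwriter's next periodic one.
 */
static XLogRecPtr
force_running_xacts_record(void)
{
	XLogRecPtr	ptr;

	ptr = LogStandbySnapshot();
	elog(DEBUG1, "forced xl_running_xacts record at %X/%X",
		 (uint32) (ptr >> 32), (uint32) ptr);
	return ptr;
}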
On Thu, May 11, 2017 at 2:51 PM, Andres Freund <andres@anarazel.de> wrote: > Now that that's done, here's an updated version of that patch. Note the > new logic to trigger xl_running_xact's to be logged at the right spot. > Works well in my testing. You forgot the patch. :-) -- Peter Geoghegan VMware vCenter Server https://www.vmware.com/
On 2017-05-11 14:51:55 -0700, wrote: > On 2017-05-08 00:10:12 -0700, Andres Freund wrote: > > I plan to commit the next pending patch after the back branch releases > > are cut - it's an invasive fix and the issue doesn't cause corruption > > "just" slow slot creation. So it seems better to wait for a few days, > > rather than hurry it into the release. > > Now that that's done, here's an updated version of that patch. Note the > new logic to trigger xl_running_xact's to be logged at the right spot. > Works well in my testing. > > I plan to commit this fairly soon, unless somebody wants a bit more time > to look into it. > > Unless somebody protests, I'd like to slightly revise how the on-disk > snapshots are stored on master - given the issues this bug/commit showed > with the current method - but I can see one could argue that that should > rather be done next release. As usual I forgot my attachement. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote: > >>> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001 > >>> From: Petr Jelinek <pjmodos@pjmodos.net> > >>> Date: Sun, 26 Feb 2017 01:07:33 +0100 > >>> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards > >> > >> A bit more commentary would be good. What does that protect us against? > >> > > > > I think I explained that in the email. We might export snapshot with > > xmin smaller than global xmin otherwise. > > > > Updated commit message with explanation as well. > From ae60b52ae0ca96bc14169cd507f101fbb5dfdf52 Mon Sep 17 00:00:00 2001 > From: Petr Jelinek <pjmodos@pjmodos.net> > Date: Sun, 26 Feb 2017 01:07:33 +0100 > Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards > > Logical decoding snapshot builder may encounter xl_running_xacts with > older xmin than the xmin of the builder. This can happen because > LogStandbySnapshot() sometimes sees already committed transactions as > running (there is difference between "running" in terms for WAL and in > terms of ProcArray). When this happens we must make sure that the xmin > of snapshot builder won't go back otherwise the resulting snapshot would > show some transaction as running even though they have already > committed. > --- > src/backend/replication/logical/snapbuild.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c > index ada618d..3e34f75 100644 > --- a/src/backend/replication/logical/snapbuild.c > +++ b/src/backend/replication/logical/snapbuild.c > @@ -1165,7 +1165,8 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact > * looking, it's correct and actually more efficient this way since we hit > * fast paths in tqual.c. > */ > - builder->xmin = running->oldestRunningXid; > + if (TransactionIdFollowsOrEquals(running->oldestRunningXid, builder->xmin)) > + builder->xmin = running->oldestRunningXid; > > /* Remove transactions we don't need to keep track off anymore */ > SnapBuildPurgeCommittedTxn(builder); > -- > 2.7.4 I still don't understand. The snapshot's xmin is solely managed via xl_running_xacts, so I don't see how the WAL/procarray difference can play a role here. ->committed isn't pruned before xl_running_xacts indicates it can be done, so I don't understand your explanation above. I'd be ok with adding this as paranoia check, but I still don't understand when it could practically be hit. - Andres
On 2017-05-11 14:54:26 -0700, Andres Freund wrote: > On 2017-05-11 14:51:55 -0700, wrote: > > On 2017-05-08 00:10:12 -0700, Andres Freund wrote: > > > I plan to commit the next pending patch after the back branch releases > > > are cut - it's an invasive fix and the issue doesn't cause corruption > > > "just" slow slot creation. So it seems better to wait for a few days, > > > rather than hurry it into the release. > > > > Now that that's done, here's an updated version of that patch. Note the > > new logic to trigger xl_running_xact's to be logged at the right spot. > > Works well in my testing. > > > > I plan to commit this fairly soon, unless somebody wants a bit more time > > to look into it. > > > > Unless somebody protests, I'd like to slightly revise how the on-disk > > snapshots are stored on master - given the issues this bug/commit showed > > with the current method - but I can see one could argue that that should > > rather be done next release. > > As usual I forgot my attachement. This patch also seems to offer a way to do your 0005 in, possibly, more efficient manner. We don't ever need to assume a transaction is timetravelling anymore. Could you check whether the attached, hasty, patch resolves the performance problems you measured? Also, do you have a "reference" workload for that? Regards, Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
On 12/05/17 03:31, Andres Freund wrote: > On 2017-05-11 14:54:26 -0700, Andres Freund wrote: >> On 2017-05-11 14:51:55 -0700, wrote: >>> On 2017-05-08 00:10:12 -0700, Andres Freund wrote: >>>> I plan to commit the next pending patch after the back branch releases >>>> are cut - it's an invasive fix and the issue doesn't cause corruption >>>> "just" slow slot creation. So it seems better to wait for a few days, >>>> rather than hurry it into the release. >>> >>> Now that that's done, here's an updated version of that patch. Note the >>> new logic to trigger xl_running_xact's to be logged at the right spot. >>> Works well in my testing. >>> >>> I plan to commit this fairly soon, unless somebody wants a bit more time >>> to look into it. >>> >>> Unless somebody protests, I'd like to slightly revise how the on-disk >>> snapshots are stored on master - given the issues this bug/commit showed >>> with the current method - but I can see one could argue that that should >>> rather be done next release. >> >> As usual I forgot my attachement. > > This patch also seems to offer a way to do your 0005 in, possibly, more > efficient manner. We don't ever need to assume a transaction is > timetravelling anymore. Could you check whether the attached, hasty, > patch resolves the performance problems you measured? Also, do you have > a "reference" workload for that? > Hmm, well it helps but actually now that we don't track individual running transactions anymore it got much less effective (my version of 0005 does as well). The example workload I test with is: session 1: open transaction, do a write, keep it open session 2: pgbench -M simple -N -c 10 -P 1 -T 5 session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender session 2: pgbench -M simple -N -c 10 -P 1 -T 20 session 1: commit And wait for session 3 to finish slot creation, takes about 20 mins on my laptop without patches, minute and half with your patches for 0004 and 0005 (or with your 0004 and my 0005) and about 2s with my original 0004 and 0005. What makes it slow is the constant qsorting IIRC. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
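For anyone wanting to replay that workload without juggling terminals, here it is as a rough script (database name, slot name, and the test_decoding plugin are assumptions; the only requirement is that the first session's transaction stays open until after the second pgbench run):

#!/bin/sh
DB=bench
createdb $DB
pgbench -i -s 10 $DB
psql -d $DB -c "CREATE TABLE pin (x int)"

# "session 1": write something and keep the transaction open for a while
( echo "BEGIN;"; echo "INSERT INTO pin VALUES (1);"; sleep 60; echo "COMMIT;" ) | psql -d $DB &

pgbench -M simple -N -c 10 -P 1 -T 5 $DB        # "session 2", first run

# "session 3": CREATE_REPLICATION_SLOT ... LOGICAL over the walsender protocol;
# it cannot reach a consistent point until the open transaction commits.
( date; pg_recvlogical -d $DB --slot=snaptest --plugin=test_decoding --create-slot; date ) &

pgbench -M simple -N -c 10 -P 1 -T 20 $DB       # "session 2", second run

wait   # session 1 commits, then watch how long the slot creation takes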
On 12/05/17 10:57, Petr Jelinek wrote: > On 12/05/17 03:31, Andres Freund wrote: >> On 2017-05-11 14:54:26 -0700, Andres Freund wrote: >>> On 2017-05-11 14:51:55 -0700, wrote: >>>> On 2017-05-08 00:10:12 -0700, Andres Freund wrote: >>>>> I plan to commit the next pending patch after the back branch releases >>>>> are cut - it's an invasive fix and the issue doesn't cause corruption >>>>> "just" slow slot creation. So it seems better to wait for a few days, >>>>> rather than hurry it into the release. >>>> >>>> Now that that's done, here's an updated version of that patch. Note the >>>> new logic to trigger xl_running_xact's to be logged at the right spot. >>>> Works well in my testing. >>>> >>>> I plan to commit this fairly soon, unless somebody wants a bit more time >>>> to look into it. >>>> >>>> Unless somebody protests, I'd like to slightly revise how the on-disk >>>> snapshots are stored on master - given the issues this bug/commit showed >>>> with the current method - but I can see one could argue that that should >>>> rather be done next release. >>> >>> As usual I forgot my attachement. >> >> This patch also seems to offer a way to do your 0005 in, possibly, more >> efficient manner. We don't ever need to assume a transaction is >> timetravelling anymore. Could you check whether the attached, hasty, >> patch resolves the performance problems you measured? Also, do you have >> a "reference" workload for that? >> > > Hmm, well it helps but actually now that we don't track individual > running transactions anymore it got much less effective (my version of > 0005 does as well). > > The example workload I test with is: > session 1: open transaction, do a write, keep it open > session 2: pgbench -M simple -N -c 10 -P 1 -T 5 > session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender > session 2: pgbench -M simple -N -c 10 -P 1 -T 20 > session 1: commit > > And wait for session 3 to finish slot creation, takes about 20 mins on > my laptop without patches, minute and half with your patches for 0004 > and 0005 (or with your 0004 and my 0005) and about 2s with my original > 0004 and 0005. > > What makes it slow is the constant qsorting IIRC. > Btw I still think that we should pursue the approach you proposed. I think we can deal with the qsorting in some other ways (ordered insert perhaps?) later. What you propose fixes the correctness, makes performance less awful in the above workload and also makes the snapbuild bit easier to read. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
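On the "ordered insert perhaps?" idea: the sketch below shows the general shape - keep the xid array sorted by inserting each new entry at its binary-search position, so lookups never need a fresh qsort(). The names and the plain numeric ordering mirror how snapbuild keeps its committed-xids array sorted, but this is an illustration, not a patch:

#include "postgres.h"

#include <string.h>

/*
 * Insert xid into the sorted array xip (current length *xcnt), keeping it
 * sorted in plain numeric order.  The caller must have made sure there is
 * room for one more element.
 */
static void
sorted_xid_insert(TransactionId *xip, uint32 *xcnt, TransactionId xid)
{
	uint32		lo = 0;
	uint32		hi = *xcnt;

	/* binary search for the insertion point */
	while (lo < hi)
	{
		uint32		mid = lo + (hi - lo) / 2;

		if (xip[mid] < xid)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* shift the tail up by one slot and place the new xid */
	memmove(&xip[lo + 1], &xip[lo], (*xcnt - lo) * sizeof(TransactionId));
	xip[lo] = xid;
	(*xcnt)++;
}

Whether that beats sorting lazily (only when the array actually changed since the last lookup) would need measuring against the workload above.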
On 2017-05-12 10:57:55 +0200, Petr Jelinek wrote:
> Hmm, well, it helps, but now that we don't track individual running
> transactions anymore it got much less effective (my version of 0005 does
> as well).
>
> The example workload I test with is:
>
> session 1: open a transaction, do a write, keep it open
> session 2: pgbench -M simple -N -c 10 -P 1 -T 5
> session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
> session 2: pgbench -M simple -N -c 10 -P 1 -T 20
> session 1: commit
>
> Then wait for session 3 to finish slot creation. That takes about 20
> minutes on my laptop without patches, a minute and a half with your
> patches for 0004 and 0005 (or with your 0004 and my 0005), and about 2s
> with my original 0004 and 0005.

Is that with assertions enabled or not? With assertions enabled, all of the
post-patch time seems to be spent in some Asserts in reorderbuffer.c;
without them, it takes less than a second for me here.

I'm applying these now.

Greetings,

Andres Freund
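For readers reproducing the numbers above: --enable-cassert defines
USE_ASSERT_CHECKING, and without it Assert() compiles away to a no-op,
which is why assertion-enabled builds are unsuitable for benchmarking. The
definition below is a simplified sketch of that mechanism, not the exact
macro from PostgreSQL's c.h:

    /*
     * Simplified sketch of how assertion checking is compiled in or out.
     * USE_ASSERT_CHECKING is the real symbol set by --enable-cassert, but
     * this is not the exact definition from PostgreSQL's c.h.
     */
    #include <stdlib.h>

    #ifdef USE_ASSERT_CHECKING
    #define Assert(condition) \
        do { \
            if (!(condition)) \
                abort();    /* assert-enabled builds pay for every check */ \
        } while (0)
    #else
    #define Assert(condition) ((void) 0)    /* production builds: no cost */
    #endif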
On 13/05/17 22:23, Andres Freund wrote:
> On 2017-05-12 10:57:55 +0200, Petr Jelinek wrote:
>> Hmm, well, it helps, but now that we don't track individual running
>> transactions anymore it got much less effective (my version of 0005 does
>> as well).
>>
>> The example workload I test with is:
>>
>> session 1: open a transaction, do a write, keep it open
>> session 2: pgbench -M simple -N -c 10 -P 1 -T 5
>> session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
>> session 2: pgbench -M simple -N -c 10 -P 1 -T 20
>> session 1: commit
>>
>> Then wait for session 3 to finish slot creation. That takes about 20
>> minutes on my laptop without patches, a minute and a half with your
>> patches for 0004 and 0005 (or with your 0004 and my 0005), and about 2s
>> with my original 0004 and 0005.
>
> Is that with assertions enabled or not? With assertions enabled, all of the
> post-patch time seems to be spent in some Asserts in reorderbuffer.c;
> without them, it takes less than a second for me here.
>

Right you are, I always forget to switch that off before benchmarks.

> I'm applying these now.

Cool. Just for posterity, this also fixes issue number 3, as the switch to
the consistent state is done purely based on xl_running_xacts and not on
decoded commits/aborts.

So all done here, thanks!

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2017-05-14 14:54:37 +0200, Petr Jelinek wrote:
> On 13/05/17 22:23, Andres Freund wrote:
> >> Then wait for session 3 to finish slot creation. That takes about 20
> >> minutes on my laptop without patches, a minute and a half with your
> >> patches for 0004 and 0005 (or with your 0004 and my 0005), and about 2s
> >> with my original 0004 and 0005.
> >
> > Is that with assertions enabled or not? With assertions enabled, all of
> > the post-patch time seems to be spent in some Asserts in reorderbuffer.c;
> > without them, it takes less than a second for me here.
> >
>
> Right you are, I always forget to switch that off before benchmarks.

Phew ;)

> > I'm applying these now.
>
> Cool. Just for posterity, this also fixes issue number 3, as the switch to
> the consistent state is done purely based on xl_running_xacts and not on
> decoded commits/aborts.

Cool. Although I'm still not convinced, as noted somewhere in this thread,
that it actually did much to start with ;)

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> wrote:

> Looking at 0001:
> - GetOldestSafeDecodingTransactionId() only guarantees to return an xid
>   safe for decoding (note how procArray->replication_slot_catalog_xmin
>   is checked), not one for the initial snapshot - so afaics this whole
>   exercise doesn't guarantee much so far.

I happen to use CreateInitDecodingContext() in an extension, so I had to
think about what the new argument exactly means (as for the incompatibility
between PG 9.6.2 and 9.6.3, I suppose preprocessor directives can handle
it).

One thing I'm failing to understand is: if TRUE is passed for
need_full_snapshot, then slot->effective_xmin receives the result of

    GetOldestSafeDecodingTransactionId(need_full_snapshot)

but this does include the "catalog xmin".

If the value returned is determined by an existing slot which has a valid
effective_catalog_xmin and an invalid effective_xmin (i.e. that slot only
protects catalog tables from VACUUM, but not the regular ones), then the
new slot will think it (i.e. the new slot) protects even non-catalog
tables, but that's not true.

Shouldn't the xmin_horizon be computed by this call instead?

    GetOldestSafeDecodingTransactionId(!need_full_snapshot);

(If so, I think "considerCatalog" would be a clearer argument name than
"catalogOnly".)

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at
On 2017-06-09 09:25:34 +0200, Antonin Houska wrote:
> Andres Freund <andres@anarazel.de> wrote:
>
> > Looking at 0001:
> > - GetOldestSafeDecodingTransactionId() only guarantees to return an xid
> >   safe for decoding (note how procArray->replication_slot_catalog_xmin
> >   is checked), not one for the initial snapshot - so afaics this whole
> >   exercise doesn't guarantee much so far.
>
> I happen to use CreateInitDecodingContext() in an extension, so I had to
> think about what the new argument exactly means (as for the incompatibility
> between PG 9.6.2 and 9.6.3, I suppose preprocessor directives can handle
> it).
>
> One thing I'm failing to understand is: if TRUE is passed for
> need_full_snapshot, then slot->effective_xmin receives the result of
>
>     GetOldestSafeDecodingTransactionId(need_full_snapshot)
>
> but this does include the "catalog xmin".
>
> If the value returned is determined by an existing slot which has a valid
> effective_catalog_xmin and an invalid effective_xmin (i.e. that slot only
> protects catalog tables from VACUUM, but not the regular ones), then the
> new slot will think it (i.e. the new slot) protects even non-catalog
> tables, but that's not true.
>
> Shouldn't the xmin_horizon be computed by this call instead?
>
>     GetOldestSafeDecodingTransactionId(!need_full_snapshot);
>
> (If so, I think "considerCatalog" would be a clearer argument name than
> "catalogOnly".)

Good catch. Pushing a fix. Afaict it's luckily inconsequential at the
moment because of the way we wait for concurrent snapshots when creating a
slot, but it obviously nevertheless needs to be fixed.

Greetings,

Andres Freund
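To make the shape of that fix concrete: based on the discussion above, the
change is presumably just negating the argument at the call site in
CreateInitDecodingContext(). The snippet below is an illustrative stub, not
the committed hunk - compute_xmin_horizon() is a hypothetical wrapper, and
the extern declaration merely stands in for the real function in
procarray.c:

    /*
     * Illustrative stub only - not the committed change.  The real
     * GetOldestSafeDecodingTransactionId() lives in procarray.c; the
     * wrapper below is hypothetical and just shows the negated argument
     * discussed above.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t TransactionId;

    /* when catalogOnly is true, other slots' catalog-only xmins are also
     * considered safe bounds */
    extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);

    static TransactionId
    compute_xmin_horizon(bool need_full_snapshot)
    {
        /*
         * A slot that needs a full snapshot must not derive its horizon
         * from another slot's catalog-only xmin, since that does not
         * protect non-catalog tuples - hence the argument is negated.
         */
        return GetOldestSafeDecodingTransactionId(!need_full_snapshot);
    }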