Thread: [HACKERS] snapbuild woes

[HACKERS] snapbuild woes

From
Petr Jelinek
Date:
Hi,

I recently found couple of issues with the way initial logical decoding
snapshot is made.

First one is outright bug, which has to do with how we track running
transactions. What snapbuild basically does while doing initial snapshot
is read the xl_running_xacts record, store the list of running txes and
then wait until they all finish. The problem with this is that
xl_running_xacts does not ensure that it only logs transactions that are
actually still running (to avoid locking PGPROC) so there might be xids
in xl_running_xacts that already committed before it was logged. This in
turn means that the snapbuild will never make snapshot if the situation
occurs. The reason why this didn't bite us much yet is that the
snapbuild will consider all transactions finished if the empty
xl_running_xacts comes which usually happens after a while (but might
not on a consistently busy server).

The fix I came up with this is to extend the mechanism to also consider
all transactions finished when xl_running_xacts which has xmin bigger
then that previous xmax comes. This seems to work pretty well on busy
servers. This fix is attached as
0001-Mark-snapshot-consistent-when-all-running-txes-have.patch. I
believe it should be backpatched all the way to 9.4.


The other issue is performance problem again on busy servers with
initial snapshot. We track transactions for catalog modifications so
that we can do proper catalog time travel for decoding of changes. But
for transactions that were running while we started trying to get
initial consistent snapshot, there is no good way to tell if they did
catalog changes or not so we consider them all as catalog changing. We
make separate historical snapshot for every such transaction. This by
itself is fine, the problem is that current implementation also
considers all the transactions that started after we started watching
for changes but before we reached consistent state to also do catalog
changes even though there we actually do know if they did any catalog
change or not. In practice it means we make snapshots that are not
really necessary and if there was long running transaction for which the
snapshot builder has to wait for then we can create thousands of unused
snapshots which affects performance in bad ways (I've seen the initial
snapshot taking hour because of this).

The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
behavior so that we don't make snapshots for transactions that we seen
wholly and know that they didn't make catalog changes even if we didn't
reach consistency yet.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 10/12/16 23:10, Petr Jelinek wrote:
> 
> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
> behavior so that we don't make snapshots for transactions that we seen
> wholly and know that they didn't make catalog changes even if we didn't
> reach consistency yet.
>

Eh, attached wrong patch. This one is correct.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Craig Ringer
Date:


On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote:
On 10/12/16 23:10, Petr Jelinek wrote:
>
> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
> behavior so that we don't make snapshots for transactions that we seen
> wholly and know that they didn't make catalog changes even if we didn't
> reach consistency yet.
>

Eh, attached wrong patch. This one is correct.

Attached no patch second time?

Re: [HACKERS] snapbuild woes

From
Kevin Grittner
Date:
On Sun, Dec 11, 2016 at 1:17 AM, Craig Ringer
<craig.ringer@2ndquadrant.com> wrote:
> On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote:
>> On 10/12/16 23:10, Petr Jelinek wrote:
>>
>>> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
>>> behavior so that we don't make snapshots for transactions that we seen
>>> wholly and know that they didn't make catalog changes even if we didn't
>>> reach consistency yet.
>>
>> Eh, attached wrong patch. This one is correct.
>
> Attached no patch second time?

I see an attachment, and it shows in the archives.

https://www.postgresql.org/message-id/aee1d499-e3ca-e091-56da-1ee6a47741c8%402ndquadrant.com

-- 
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] snapbuild woes

From
Craig Ringer
Date:
On 12 December 2016 at 00:36, Kevin Grittner <kgrittn@gmail.com> wrote:
> On Sun, Dec 11, 2016 at 1:17 AM, Craig Ringer
> <craig.ringer@2ndquadrant.com> wrote:
>> On 11 Dec. 2016 06:50, "Petr Jelinek" <petr.jelinek@2ndquadrant.com> wrote:
>>> On 10/12/16 23:10, Petr Jelinek wrote:
>>>
>>>> The attached 0002-Skip-unnecessary-snapshot-builds.patch changes this
>>>> behavior so that we don't make snapshots for transactions that we seen
>>>> wholly and know that they didn't make catalog changes even if we didn't
>>>> reach consistency yet.
>>>
>>> Eh, attached wrong patch. This one is correct.
>>
>> Attached no patch second time?
>
> I see an attachment, and it shows in the archives.
>
> https://www.postgresql.org/message-id/aee1d499-e3ca-e091-56da-1ee6a47741c8%402ndquadrant.com

Sorry for the noise, apparently my phone's mail client was being dense.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Remote DBA, Training
&Services
 



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
Hi,

On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
> Hi,
> First one is outright bug, which has to do with how we track running
> transactions. What snapbuild basically does while doing initial snapshot
> is read the xl_running_xacts record, store the list of running txes and
> then wait until they all finish. The problem with this is that
> xl_running_xacts does not ensure that it only logs transactions that are
> actually still running (to avoid locking PGPROC) so there might be xids
> in xl_running_xacts that already committed before it was logged.

I don't think that's actually true?  Notice how LogStandbySnapshot()
only releases the lock *after* the LogCurrentRunningXacts() iff
wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
observed must actually be a bit more complex :(

Regards,

Andres



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 12/12/16 22:42, Andres Freund wrote:
> Hi,
> 
> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
>> Hi,
>> First one is outright bug, which has to do with how we track running
>> transactions. What snapbuild basically does while doing initial snapshot
>> is read the xl_running_xacts record, store the list of running txes and
>> then wait until they all finish. The problem with this is that
>> xl_running_xacts does not ensure that it only logs transactions that are
>> actually still running (to avoid locking PGPROC) so there might be xids
>> in xl_running_xacts that already committed before it was logged.
> 
> I don't think that's actually true?  Notice how LogStandbySnapshot()
> only releases the lock *after* the LogCurrentRunningXacts() iff
> wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
> observed must actually be a bit more complex :(
> 

Hmm, interesting, I did see the transaction commit in the WAL before the
xl_running_xacts that contained the xid as running. I only seen it on
production system though, didn't really manage to easily reproduce it
locally.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:
> On 12/12/16 22:42, Andres Freund wrote:
> > Hi,
> > 
> > On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
> >> Hi,
> >> First one is outright bug, which has to do with how we track running
> >> transactions. What snapbuild basically does while doing initial snapshot
> >> is read the xl_running_xacts record, store the list of running txes and
> >> then wait until they all finish. The problem with this is that
> >> xl_running_xacts does not ensure that it only logs transactions that are
> >> actually still running (to avoid locking PGPROC) so there might be xids
> >> in xl_running_xacts that already committed before it was logged.
> > 
> > I don't think that's actually true?  Notice how LogStandbySnapshot()
> > only releases the lock *after* the LogCurrentRunningXacts() iff
> > wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
> > observed must actually be a bit more complex :(
> > 
> 
> Hmm, interesting, I did see the transaction commit in the WAL before the
> xl_running_xacts that contained the xid as running. I only seen it on
> production system though, didn't really manage to easily reproduce it
> locally.

I suspect the reason for that is that RecordTransactionCommit() doesn't
conflict with ProcArrayLock in the first place - only
ProcArrayEndTransaction() does.  So they're still running in the PGPROC
sense, just not the crash-recovery sense...

Andres



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 12/12/16 23:33, Andres Freund wrote:
> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:
>> On 12/12/16 22:42, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
>>>> Hi,
>>>> First one is outright bug, which has to do with how we track running
>>>> transactions. What snapbuild basically does while doing initial snapshot
>>>> is read the xl_running_xacts record, store the list of running txes and
>>>> then wait until they all finish. The problem with this is that
>>>> xl_running_xacts does not ensure that it only logs transactions that are
>>>> actually still running (to avoid locking PGPROC) so there might be xids
>>>> in xl_running_xacts that already committed before it was logged.
>>>
>>> I don't think that's actually true?  Notice how LogStandbySnapshot()
>>> only releases the lock *after* the LogCurrentRunningXacts() iff
>>> wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
>>> observed must actually be a bit more complex :(
>>>
>>
>> Hmm, interesting, I did see the transaction commit in the WAL before the
>> xl_running_xacts that contained the xid as running. I only seen it on
>> production system though, didn't really manage to easily reproduce it
>> locally.
> 
> I suspect the reason for that is that RecordTransactionCommit() doesn't
> conflict with ProcArrayLock in the first place - only
> ProcArrayEndTransaction() does.  So they're still running in the PGPROC
> sense, just not the crash-recovery sense...
> 

That looks like reasonable explanation. BTW I realized my patch needs
bit more work, currently it will break the actual snapshot as it behaves
same as if the xl_running_xacts was empty which is not correct AFAICS.

Also if we did the approach suggested by my patch (ie using this
xmin/xmax comparison) I guess we wouldn't need to hold the lock for
extra time in wal_level logical anymore.

That is of course unless you think it should be approached from the
other side of the stream and try log correct xl_running_xacts.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 13/12/16 00:38, Petr Jelinek wrote:
> On 12/12/16 23:33, Andres Freund wrote:
>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:
>>> On 12/12/16 22:42, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
>>>>> Hi,
>>>>> First one is outright bug, which has to do with how we track running
>>>>> transactions. What snapbuild basically does while doing initial snapshot
>>>>> is read the xl_running_xacts record, store the list of running txes and
>>>>> then wait until they all finish. The problem with this is that
>>>>> xl_running_xacts does not ensure that it only logs transactions that are
>>>>> actually still running (to avoid locking PGPROC) so there might be xids
>>>>> in xl_running_xacts that already committed before it was logged.
>>>>
>>>> I don't think that's actually true?  Notice how LogStandbySnapshot()
>>>> only releases the lock *after* the LogCurrentRunningXacts() iff
>>>> wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
>>>> observed must actually be a bit more complex :(
>>>>
>>>
>>> Hmm, interesting, I did see the transaction commit in the WAL before the
>>> xl_running_xacts that contained the xid as running. I only seen it on
>>> production system though, didn't really manage to easily reproduce it
>>> locally.
>>
>> I suspect the reason for that is that RecordTransactionCommit() doesn't
>> conflict with ProcArrayLock in the first place - only
>> ProcArrayEndTransaction() does.  So they're still running in the PGPROC
>> sense, just not the crash-recovery sense...
>>
> 
> That looks like reasonable explanation. BTW I realized my patch needs
> bit more work, currently it will break the actual snapshot as it behaves
> same as if the xl_running_xacts was empty which is not correct AFAICS.
> 

Hi,

I got to work on this again. Unfortunately I haven't found solution that
I would be very happy with. What I did is in case we read
xl_running_xacts which has all transactions we track finished, we start
tracking from that new xl_running_xacts again with the difference that
we clean up the running transactions based on previously seen committed
ones. That means that on busy server we may wait for multiple
xl_running_xacts rather than just one, but at least we have chance to
finish unlike with current coding which basically waits for empty
xl_running_xacts. I also removed the additional locking for logical
wal_level in LogStandbySnapshot() since it does not work.

I also identified another bug in snapbuild while looking at the code.
That is the logical decoding will try to use on disk serialized snapshot
for initial snapshot export when it can. The problem is that these
snapshots are quite special and are not really usable as snapshots for
data (ie, the logical decoding snapshots regularly have xmax smaller
than xmin). So then when client tries to use this exported snapshot it
gets completely wrong data as the snapshot is broken. I think this is
explanation for Erik Rijker's problems with the initial COPY patch for
logical replication. At least for me the issues go away when I disable
use of the ondisk snapshots.

I didn't really find better solution than that though (disabling the use
of ondisk snapshots for initial consistent snapshot).

So to summarize attached patches:
0001 - Fixes performance issue where we build tons of snapshots that we
don't need which kills CPU.

0002 - Disables the use of ondisk historical snapshots for initial
consistent snapshot export as it may result in corrupt data. This
definitely needs backport.

0003 - Fixes bug where we might never reach snapshot on busy server due
to race condition in xl_running_xacts logging. The original use of extra
locking does not seem to be enough in practice. Once we have agreed fix
for this it's probably worth backpatching. There are still some comments
that need updating, this is more of a PoC.

Thoughts or better ideas?

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Erik Rijkers
Date:
On 2017-02-22 03:05, Petr Jelinek wrote:
> 
> So to summarize attached patches:
> 0001 - Fixes performance issue where we build tons of snapshots that we
> don't need which kills CPU.
> 
> 0002 - Disables the use of ondisk historical snapshots for initial
> consistent snapshot export as it may result in corrupt data. This
> definitely needs backport.
> 
> 0003 - Fixes bug where we might never reach snapshot on busy server due
> to race condition in xl_running_xacts logging. The original use of 
> extra
> locking does not seem to be enough in practice. Once we have agreed fix
> for this it's probably worth backpatching. There are still some 
> comments
> that need updating, this is more of a PoC.
> 

I am not not entirely sure what to expect.  Should a server with these 3 
patches do initial data copy or not?  The sgml seems to imply there is 
not inital data copy.  But my test does copy something.

Anyway, I have repeated the same old pgbench-test, assuming inital data 
copy should be working.

With
> 0001-Skip-unnecessary-snapshot-builds.patch
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch

the consistent (but wrong) end state is always that only one of the four 
pgbench tables, pgbench_history, is replicated (always correctly).

Below is the output from the test (I've edited the lines for email)
(below, a,b,t,h stand for: pgbench_accounts, pgbench_branches, 
pgbench_tellers, pgbench_history)
(master on port 6972, replica on port 6973.)

port
6972 a,b,t,h: 100000      1     10    347
6973 a,b,t,h:      0      0      0    347

a,b,t,h: a68efc81a  2c27f7ba5  128590a57  1e4070879   master
a,b,t,h: d41d8cd98  d41d8cd98  d41d8cd98  1e4070879   replica NOK

The md5-initstrings are from a md5 of the whole content of each table 
(an ordered select *)

I repeated this a few times: of course, the number of rows in 
pgbench_history varies a bit but otherwise it is always the same: 3 
empty replica tables, pgbench_history replicated correctly.

Something is not right.


thanks,


Erik Rijkers



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:

On 22/02/17 11:29, Erik Rijkers wrote:
> On 2017-02-22 03:05, Petr Jelinek wrote:
>>
>> So to summarize attached patches:
>> 0001 - Fixes performance issue where we build tons of snapshots that we
>> don't need which kills CPU.
>>
>> 0002 - Disables the use of ondisk historical snapshots for initial
>> consistent snapshot export as it may result in corrupt data. This
>> definitely needs backport.
>>
>> 0003 - Fixes bug where we might never reach snapshot on busy server due
>> to race condition in xl_running_xacts logging. The original use of extra
>> locking does not seem to be enough in practice. Once we have agreed fix
>> for this it's probably worth backpatching. There are still some comments
>> that need updating, this is more of a PoC.
>>
> 
> I am not not entirely sure what to expect.  Should a server with these 3
> patches do initial data copy or not?  The sgml seems to imply there is
> not inital data copy.  But my test does copy something.
> 

Not by itself (without the copy patch), those fixes are for snapshots.

> 
> With
>> 0001-Skip-unnecessary-snapshot-builds.patch
>> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
>> 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> 
> the consistent (but wrong) end state is always that only one of the four
> pgbench tables, pgbench_history, is replicated (always correctly).
> 
> Below is the output from the test (I've edited the lines for email)
> (below, a,b,t,h stand for: pgbench_accounts, pgbench_branches,
> pgbench_tellers, pgbench_history)
> (master on port 6972, replica on port 6973.)
> 
> port
> 6972 a,b,t,h: 100000      1     10    347
> 6973 a,b,t,h:      0      0      0    347
> 
> a,b,t,h: a68efc81a  2c27f7ba5  128590a57  1e4070879   master
> a,b,t,h: d41d8cd98  d41d8cd98  d41d8cd98  1e4070879   replica NOK
> 
> The md5-initstrings are from a md5 of the whole content of each table
> (an ordered select *)
> 
> I repeated this a few times: of course, the number of rows in
> pgbench_history varies a bit but otherwise it is always the same: 3
> empty replica tables, pgbench_history replicated correctly.
> 

That's actually correct behaviour without the initial copy patch, it
replicates changes, but since the 3 tables only get updates there is
nothing to replicate as there is no data downstream. However inserts
will of course work fine even without data downstream.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 22/02/17 03:05, Petr Jelinek wrote:
> On 13/12/16 00:38, Petr Jelinek wrote:
>> On 12/12/16 23:33, Andres Freund wrote:
>>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:
>>>> On 12/12/16 22:42, Andres Freund wrote:
>>>>> Hi,
>>>>>
>>>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
>>>>>> Hi,
>>>>>> First one is outright bug, which has to do with how we track running
>>>>>> transactions. What snapbuild basically does while doing initial snapshot
>>>>>> is read the xl_running_xacts record, store the list of running txes and
>>>>>> then wait until they all finish. The problem with this is that
>>>>>> xl_running_xacts does not ensure that it only logs transactions that are
>>>>>> actually still running (to avoid locking PGPROC) so there might be xids
>>>>>> in xl_running_xacts that already committed before it was logged.
>>>>>
>>>>> I don't think that's actually true?  Notice how LogStandbySnapshot()
>>>>> only releases the lock *after* the LogCurrentRunningXacts() iff
>>>>> wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
>>>>> observed must actually be a bit more complex :(
>>>>>
>>>>
>>>> Hmm, interesting, I did see the transaction commit in the WAL before the
>>>> xl_running_xacts that contained the xid as running. I only seen it on
>>>> production system though, didn't really manage to easily reproduce it
>>>> locally.
>>>
>>> I suspect the reason for that is that RecordTransactionCommit() doesn't
>>> conflict with ProcArrayLock in the first place - only
>>> ProcArrayEndTransaction() does.  So they're still running in the PGPROC
>>> sense, just not the crash-recovery sense...
>>>
>>
>> That looks like reasonable explanation. BTW I realized my patch needs
>> bit more work, currently it will break the actual snapshot as it behaves
>> same as if the xl_running_xacts was empty which is not correct AFAICS.
>>
> 
> Hi,
> 
> I got to work on this again. Unfortunately I haven't found solution that
> I would be very happy with. What I did is in case we read
> xl_running_xacts which has all transactions we track finished, we start
> tracking from that new xl_running_xacts again with the difference that
> we clean up the running transactions based on previously seen committed
> ones. That means that on busy server we may wait for multiple
> xl_running_xacts rather than just one, but at least we have chance to
> finish unlike with current coding which basically waits for empty
> xl_running_xacts. I also removed the additional locking for logical
> wal_level in LogStandbySnapshot() since it does not work.

Not hearing any opposition to this idea so I decided to polish this and
also optimize it a bit.

That being said, thanks to testing from Erik Rijkers I've identified one
more bug in how we do the initial snapshot. Apparently we don't reserve
the global xmin when we start building the initial exported snapshot for
a slot (we only reserver catalog_xmin which is fine for logical decoding
but not for the exported snapshot) so the VACUUM and heap pruning will
happily delete old versions of rows that are still needed by anybody
trying to use that exported snapshot.

Attached are updated versions of patches:

0001 - Fixes the above mentioned global xmin tracking issues. Needs to
be backported all the way to 9.4

0002 - Removes use of the logical decoding saved snapshots for initial
exported snapshot since those snapshots only work for catalogs and not
user data. Also needs to be backported all the way to 9.4.

0003 - Changes handling of the xl_running_xacts in initial snapshot
build to what I wrote above and removes the extra locking from
LogStandbySnapshot introduced by logical decoding.

0004 - Improves performance of initial snapshot building by skipping
catalog snapshot build for transactions that don't do catalog changes.

The 0001 and 0002 are bug fixes because without them the exported
snapshots are basically corrupted. The 0003 and 0004 are performance
improvements, but on busy servers the snapshot export might never happen
so it's for rather serious performance issues.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 24/02/17 22:56, Petr Jelinek wrote:
> On 22/02/17 03:05, Petr Jelinek wrote:
>> On 13/12/16 00:38, Petr Jelinek wrote:
>>> On 12/12/16 23:33, Andres Freund wrote:
>>>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:
>>>>> On 12/12/16 22:42, Andres Freund wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
>>>>>>> Hi,
>>>>>>> First one is outright bug, which has to do with how we track running
>>>>>>> transactions. What snapbuild basically does while doing initial snapshot
>>>>>>> is read the xl_running_xacts record, store the list of running txes and
>>>>>>> then wait until they all finish. The problem with this is that
>>>>>>> xl_running_xacts does not ensure that it only logs transactions that are
>>>>>>> actually still running (to avoid locking PGPROC) so there might be xids
>>>>>>> in xl_running_xacts that already committed before it was logged.
>>>>>>
>>>>>> I don't think that's actually true?  Notice how LogStandbySnapshot()
>>>>>> only releases the lock *after* the LogCurrentRunningXacts() iff
>>>>>> wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
>>>>>> observed must actually be a bit more complex :(
>>>>>>
>>>>>
>>>>> Hmm, interesting, I did see the transaction commit in the WAL before the
>>>>> xl_running_xacts that contained the xid as running. I only seen it on
>>>>> production system though, didn't really manage to easily reproduce it
>>>>> locally.
>>>>
>>>> I suspect the reason for that is that RecordTransactionCommit() doesn't
>>>> conflict with ProcArrayLock in the first place - only
>>>> ProcArrayEndTransaction() does.  So they're still running in the PGPROC
>>>> sense, just not the crash-recovery sense...
>>>>
>>>
>>> That looks like reasonable explanation. BTW I realized my patch needs
>>> bit more work, currently it will break the actual snapshot as it behaves
>>> same as if the xl_running_xacts was empty which is not correct AFAICS.
>>>
>>
>> Hi,
>>
>> I got to work on this again. Unfortunately I haven't found solution that
>> I would be very happy with. What I did is in case we read
>> xl_running_xacts which has all transactions we track finished, we start
>> tracking from that new xl_running_xacts again with the difference that
>> we clean up the running transactions based on previously seen committed
>> ones. That means that on busy server we may wait for multiple
>> xl_running_xacts rather than just one, but at least we have chance to
>> finish unlike with current coding which basically waits for empty
>> xl_running_xacts. I also removed the additional locking for logical
>> wal_level in LogStandbySnapshot() since it does not work.
> 
> Not hearing any opposition to this idea so I decided to polish this and
> also optimize it a bit.
> 
> That being said, thanks to testing from Erik Rijkers I've identified one
> more bug in how we do the initial snapshot. Apparently we don't reserve
> the global xmin when we start building the initial exported snapshot for
> a slot (we only reserver catalog_xmin which is fine for logical decoding
> but not for the exported snapshot) so the VACUUM and heap pruning will
> happily delete old versions of rows that are still needed by anybody
> trying to use that exported snapshot.
> 

Aaand I found one more bug in snapbuild. Apparently we don't protect the
snapshot builder xmin from going backwards which can yet again result in
corrupted exported snapshot.

Summary of attached patches:
0001 - Fixes the above mentioned global xmin tracking issues. Needs to
be backported all the way to 9.4

0002 - Removes use of the logical decoding saved snapshots for initial
exported snapshot since those snapshots only work for catalogs and not
user data. Also needs to be backported all the way to 9.4.

0003 - Makes sure snapshot builder xmin is not moved backwards by
xl_running_xacts (which can otherwise happen during initial snapshot
building). Also should be backported to 9.4.

0004 - Changes handling of the xl_running_xacts in initial snapshot
build to what I wrote above and removes the extra locking from
LogStandbySnapshot introduced by logical decoding.

0005 - Improves performance of initial snapshot building by skipping
catalog snapshot build for transactions that don't do catalog changes.

I did some improvements to the other patches as well so they are not the
same as in previous post, hence the version bump.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 26/02/17 01:43, Petr Jelinek wrote:
> On 24/02/17 22:56, Petr Jelinek wrote:
>> On 22/02/17 03:05, Petr Jelinek wrote:
>>> On 13/12/16 00:38, Petr Jelinek wrote:
>>>> On 12/12/16 23:33, Andres Freund wrote:
>>>>> On 2016-12-12 23:27:30 +0100, Petr Jelinek wrote:
>>>>>> On 12/12/16 22:42, Andres Freund wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 2016-12-10 23:10:19 +0100, Petr Jelinek wrote:
>>>>>>>> Hi,
>>>>>>>> First one is outright bug, which has to do with how we track running
>>>>>>>> transactions. What snapbuild basically does while doing initial snapshot
>>>>>>>> is read the xl_running_xacts record, store the list of running txes and
>>>>>>>> then wait until they all finish. The problem with this is that
>>>>>>>> xl_running_xacts does not ensure that it only logs transactions that are
>>>>>>>> actually still running (to avoid locking PGPROC) so there might be xids
>>>>>>>> in xl_running_xacts that already committed before it was logged.
>>>>>>>
>>>>>>> I don't think that's actually true?  Notice how LogStandbySnapshot()
>>>>>>> only releases the lock *after* the LogCurrentRunningXacts() iff
>>>>>>> wal_level >= WAL_LEVEL_LOGICAL.  So the explanation for the problem you
>>>>>>> observed must actually be a bit more complex :(
>>>>>>>
>>>>>>
>>>>>> Hmm, interesting, I did see the transaction commit in the WAL before the
>>>>>> xl_running_xacts that contained the xid as running. I only seen it on
>>>>>> production system though, didn't really manage to easily reproduce it
>>>>>> locally.
>>>>>
>>>>> I suspect the reason for that is that RecordTransactionCommit() doesn't
>>>>> conflict with ProcArrayLock in the first place - only
>>>>> ProcArrayEndTransaction() does.  So they're still running in the PGPROC
>>>>> sense, just not the crash-recovery sense...
>>>>>
>>>>
>>>> That looks like reasonable explanation. BTW I realized my patch needs
>>>> bit more work, currently it will break the actual snapshot as it behaves
>>>> same as if the xl_running_xacts was empty which is not correct AFAICS.
>>>>
>>>
>>> Hi,
>>>
>>> I got to work on this again. Unfortunately I haven't found solution that
>>> I would be very happy with. What I did is in case we read
>>> xl_running_xacts which has all transactions we track finished, we start
>>> tracking from that new xl_running_xacts again with the difference that
>>> we clean up the running transactions based on previously seen committed
>>> ones. That means that on busy server we may wait for multiple
>>> xl_running_xacts rather than just one, but at least we have chance to
>>> finish unlike with current coding which basically waits for empty
>>> xl_running_xacts. I also removed the additional locking for logical
>>> wal_level in LogStandbySnapshot() since it does not work.
>>
>> Not hearing any opposition to this idea so I decided to polish this and
>> also optimize it a bit.
>>
>> That being said, thanks to testing from Erik Rijkers I've identified one
>> more bug in how we do the initial snapshot. Apparently we don't reserve
>> the global xmin when we start building the initial exported snapshot for
>> a slot (we only reserver catalog_xmin which is fine for logical decoding
>> but not for the exported snapshot) so the VACUUM and heap pruning will
>> happily delete old versions of rows that are still needed by anybody
>> trying to use that exported snapshot.
>>
> 
> Aaand I found one more bug in snapbuild. Apparently we don't protect the
> snapshot builder xmin from going backwards which can yet again result in
> corrupted exported snapshot.
> 
> Summary of attached patches:
> 0001 - Fixes the above mentioned global xmin tracking issues. Needs to
> be backported all the way to 9.4
> 
> 0002 - Removes use of the logical decoding saved snapshots for initial
> exported snapshot since those snapshots only work for catalogs and not
> user data. Also needs to be backported all the way to 9.4.

I've been bit overzealous about this one (removed the use of the saved
snapshots completely). So here is a fix for that (and rebase on top of
current HEAD).

> 
> 0003 - Makes sure snapshot builder xmin is not moved backwards by
> xl_running_xacts (which can otherwise happen during initial snapshot
> building). Also should be backported to 9.4.
> 
> 0004 - Changes handling of the xl_running_xacts in initial snapshot
> build to what I wrote above and removes the extra locking from
> LogStandbySnapshot introduced by logical decoding.
> 
> 0005 - Improves performance of initial snapshot building by skipping
> catalog snapshot build for transactions that don't do catalog changes.
> 


-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Erik Rijkers
Date:
On 2017-03-03 01:30, Petr Jelinek wrote:

With these patches:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

I get:

subscriptioncmds.c:47:12: error: static declaration of ‘oid_cmp’ follows 
non-static declaration static int oid_cmp(const void *p1, const void *p2);            ^~~~~~~
In file included from subscriptioncmds.c:42:0:
../../../src/include/utils/builtins.h:70:12: note: previous declaration 
of ‘oid_cmp’ was here extern int oid_cmp(const void *p1, const void *p2);            ^~~~~~~
make[3]: *** [subscriptioncmds.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [commands-recursive] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2





Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 03/03/17 01:53, Erik Rijkers wrote:
> On 2017-03-03 01:30, Petr Jelinek wrote:
> 
> With these patches:
> 
> 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
> 0002-Fix-after-trigger-execution-in-logical-replication.patch
> 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
> snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
> snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
> 
> snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
> snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
> snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch
> 0001-Logical-replication-support-for-initial-data-copy-v6.patch
> 
> I get:
> 
> subscriptioncmds.c:47:12: error: static declaration of ‘oid_cmp’ follows
> non-static declaration
>  static int oid_cmp(const void *p1, const void *p2);
>             ^~~~~~~
> In file included from subscriptioncmds.c:42:0:
> ../../../src/include/utils/builtins.h:70:12: note: previous declaration
> of ‘oid_cmp’ was here
>  extern int oid_cmp(const void *p1, const void *p2);
>             ^~~~~~~
> make[3]: *** [subscriptioncmds.o] Error 1
> make[3]: *** Waiting for unfinished jobs....
> make[2]: *** [commands-recursive] Error 2
> make[2]: *** Waiting for unfinished jobs....
> make[1]: *** [all-backend-recurse] Error 2
> make: *** [all-src-recurse] Error 2
> 

Yes the copy patch needs rebase as well. But these ones are fine.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
David Steele
Date:
On 3/2/17 7:54 PM, Petr Jelinek wrote:
> 
> Yes the copy patch needs rebase as well. But these ones are fine.

This bug has been moved to CF 2017-07.

-- 
-David
david@pgmasters.net



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-08 09:51:39 -0400, David Steele wrote:
> On 3/2/17 7:54 PM, Petr Jelinek wrote:
> > 
> > Yes the copy patch needs rebase as well. But these ones are fine.
> 
> This bug has been moved to CF 2017-07.

FWIW, as these are bug-fixes that need to be backpatched, I do plan to
work on them soon.

- Andres



Re: [HACKERS] snapbuild woes

From
Erik Rijkers
Date:
On 2017-04-08 15:56, Andres Freund wrote:
> On 2017-04-08 09:51:39 -0400, David Steele wrote:
>> On 3/2/17 7:54 PM, Petr Jelinek wrote:
>> >
>> > Yes the copy patch needs rebase as well. But these ones are fine.
>> 
>> This bug has been moved to CF 2017-07.
> 
> FWIW, as these are bug-fixes that need to be backpatched, I do plan to
> work on them soon.
> 

CF 2017-07 pertains to postgres 11, is that right?

But I hope you mean to commit these snapbuild patches before the 
postgres 10 release?  As far as I know, logical replication is still 
very broken without them (or at least some of that set of 5 patches - I 
don't know which ones are essential and which may not be).

If it's at all useful I can repeat tests to show how often current 
master still fails (easily 50% or so failure-rate).

This would be the pgbench-over-logical-replication test that I did so 
often earlier on.

thanks,

Erik Rijkers



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-08 16:29:10 +0200, Erik Rijkers wrote:
> On 2017-04-08 15:56, Andres Freund wrote:
> > On 2017-04-08 09:51:39 -0400, David Steele wrote:
> > > On 3/2/17 7:54 PM, Petr Jelinek wrote:
> > > >
> > > > Yes the copy patch needs rebase as well. But these ones are fine.
> > > 
> > > This bug has been moved to CF 2017-07.
> > 
> > FWIW, as these are bug-fixes that need to be backpatched, I do plan to
> > work on them soon.
> > 
> 
> CF 2017-07 pertains to postgres 11, is that right?
> 
> But I hope you mean to commit these snapbuild patches before the postgres 10
> release?  As far as I know, logical replication is still very broken without
> them (or at least some of that set of 5 patches - I don't know which ones
> are essential and which may not be).

Yes, these should go into 10 *and* earlier releases, and I don't plan to
wait for 2017-07.

- Andres



Re: [HACKERS] snapbuild woes

From
David Steele
Date:
On 4/8/17 10:29 AM, Erik Rijkers wrote:
> On 2017-04-08 15:56, Andres Freund wrote:
>> On 2017-04-08 09:51:39 -0400, David Steele wrote:
>>> On 3/2/17 7:54 PM, Petr Jelinek wrote:
>>> >
>>> > Yes the copy patch needs rebase as well. But these ones are fine.
>>>
>>> This bug has been moved to CF 2017-07.
>>
>> FWIW, as these are bug-fixes that need to be backpatched, I do plan to
>> work on them soon.
>>
> 
> CF 2017-07 pertains to postgres 11, is that right?

In general, yes, but bugs will always be fixed as needed.  It doesn't
matter what CF they are in.

-- 
-David
david@pgmasters.net



Re: [HACKERS] snapbuild woes

From
Noah Misch
Date:
On Sat, Apr 08, 2017 at 07:30:59AM -0700, Andres Freund wrote:
> On 2017-04-08 16:29:10 +0200, Erik Rijkers wrote:
> > On 2017-04-08 15:56, Andres Freund wrote:
> > > On 2017-04-08 09:51:39 -0400, David Steele wrote:
> > > > On 3/2/17 7:54 PM, Petr Jelinek wrote:
> > > > >
> > > > > Yes the copy patch needs rebase as well. But these ones are fine.
> > > > 
> > > > This bug has been moved to CF 2017-07.
> > > 
> > > FWIW, as these are bug-fixes that need to be backpatched, I do plan to
> > > work on them soon.
> > > 
> > 
> > CF 2017-07 pertains to postgres 11, is that right?
> > 
> > But I hope you mean to commit these snapbuild patches before the postgres 10
> > release?  As far as I know, logical replication is still very broken without
> > them (or at least some of that set of 5 patches - I don't know which ones
> > are essential and which may not be).
> 
> Yes, these should go into 10 *and* earlier releases, and I don't plan to
> wait for 2017-07.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] snapbuild woes

From
Peter Eisentraut
Date:
On 4/12/17 02:31, Noah Misch wrote:
>>> But I hope you mean to commit these snapbuild patches before the postgres 10
>>> release?  As far as I know, logical replication is still very broken without
>>> them (or at least some of that set of 5 patches - I don't know which ones
>>> are essential and which may not be).
>>
>> Yes, these should go into 10 *and* earlier releases, and I don't plan to
>> wait for 2017-07.
> 
> [Action required within three days.  This is a generic notification.]

I'm hoping for a word from Andres on this.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] snapbuild woes

From
Simon Riggs
Date:
On 3 March 2017 at 00:30, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

>> 0004 - Changes handling of the xl_running_xacts in initial snapshot
>> build to what I wrote above and removes the extra locking from
>> LogStandbySnapshot introduced by logical decoding.

This seems OK and unlikely to have wider impact.

The "race condition" we're speaking about is by design, not a bug.

I think the initial comment could be slightly better worded; if I
didn't already know what is being discussed, I wouldnt be too much
further forwards from those comments.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:
> On 4/12/17 02:31, Noah Misch wrote:
> >>> But I hope you mean to commit these snapbuild patches before the postgres 10
> >>> release?  As far as I know, logical replication is still very broken without
> >>> them (or at least some of that set of 5 patches - I don't know which ones
> >>> are essential and which may not be).
> >>
> >> Yes, these should go into 10 *and* earlier releases, and I don't plan to
> >> wait for 2017-07.
> > 
> > [Action required within three days.  This is a generic notification.]
> 
> I'm hoping for a word from Andres on this.

Feel free to reassign to me.



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
Hi,
On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote:

> From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
> From: Petr Jelinek <pjmodos@pjmodos.net>
> Date: Fri, 24 Feb 2017 21:39:03 +0100
> Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export
> 
> Otherwise the VACUUM or pruning might remove tuples still needed by the
> exported snapshot.
> ---
>  src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
> index 5529ac8..57c392c 100644
> --- a/src/backend/replication/logical/logical.c
> +++ b/src/backend/replication/logical/logical.c
> @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
>       * the slot machinery about the new limit. Once that's done the
>       * ProcArrayLock can be released as the slot machinery now is
>       * protecting against vacuum.
> +     *
> +     * Note that we only store the global xmin temporarily in the in-memory
> +     * state so that the initial snapshot can be exported. After initial
> +     * snapshot is done global xmin should be reset and not tracked anymore
> +     * so we are fine with losing the global xmin after crash.
>       * ----
>       */
>      LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>  
>      slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
>      slot->data.catalog_xmin = slot->effective_catalog_xmin;
> +    slot->effective_xmin = slot->effective_catalog_xmin;


>  void
>  FreeDecodingContext(LogicalDecodingContext *ctx)
>  {
> +    ReplicationSlot *slot = MyReplicationSlot;
> +
>      if (ctx->callbacks.shutdown_cb != NULL)
>          shutdown_cb_wrapper(ctx);
>  
> +    /*
> +     * Cleanup global xmin for the slot that we may have set in
> +     * CreateInitDecodingContext().

Hm.  Is that actually a meaningful point to do so? For one, it gets
called by pg_logical_slot_get_changes_guts(), but more importantly, the
snapshot is exported till SnapBuildClearExportedSnapshot(), which is the
next command?  If we rely on the snapshot magic done by ExportSnapshot()
it'd be worthwhile to mention that...


> We do not take ProcArrayLock or similar
> +     * since we only reset xmin here and there's not much harm done by a
> +     * concurrent computation missing that.
> +     */

Hum.  I was prepared to complain about this, but ISTM, that there's
absolutely no risk because the following
ReplicationSlotsComputeRequiredXmin(false); actually does all the
necessary locking?  But still, I don't see much point in the
optimization.



> This patch changes the code so that stored snapshots are only used for
> logical decoding restart but not for initial slot snapshot.

Yea, that's a very good point...

> @@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>  
>          return false;
>      }
> -    /* c) valid on disk state */
> -    else if (SnapBuildRestore(builder, lsn))
> +    /* c) valid on disk state and not exported snapshot */
> +    else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
> +             SnapBuildRestore(builder, lsn))

Hm. Is this a good signaling mechanism? It'll also trigger for the SQL
interface, where it'd strictly speaking not be required, right?


> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
> From: Petr Jelinek <pjmodos@pjmodos.net>
> Date: Sun, 26 Feb 2017 01:07:33 +0100
> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards

A bit more commentary would be good. What does that protect us against?



> From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
> From: Petr Jelinek <pjmodos@pjmodos.net>
> Date: Wed, 22 Feb 2017 00:57:33 +0100
> Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder
> 
> Due to race condition, the xl_running_xacts might contain no longer
> running transactions. Previous coding tried to get around this by
> additional locking but that did not work correctly for committs. Instead
> try combining decoded commits and multiple xl_running_xacts to get the
> consistent snapshot.

Needs more explanation about approach.


> @@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>       *      simply track the number of in-progress toplevel transactions and
>       *      lower it whenever one commits or aborts. When that number
>       *      (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
> -     *      to CONSISTENT.
> +     *      to CONSISTENT. Sometimes we might get xl_running_xacts which has
> +     *      all tracked transactions as finished. We'll need to restart tracking
> +     *      in that case and use previously collected committed transactions to
> +     *      purge transactions mistakenly marked as running in the
> +     *      xl_running_xacts which exist as a result of race condition in
> +     *      LogStandbySnapshot().

I'm not following this yet.


> @@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>       * b) first encounter of a useable xl_running_xacts record. If we had
>       * found one earlier we would either track running transactions (i.e.
>       * builder->running.xcnt != 0) or be consistent (this function wouldn't
> -     * get called).
> +     * get called). However it's possible that we could not see all
> +     * transactions that were marked as running in xl_running_xacts, so if
> +     * we get new one that says all were closed but we are not consistent
> +     * yet, we need to restart the tracking while taking previously seen
> +     * transactions into account.

This needs to revise the preceding comment more heavily.  "This is the
first!!! Or maybe not!" isn't easy to understand.


>       */
> -    else if (!builder->running.xcnt)
> +    else if (!builder->running.xcnt ||
> +             running->oldestRunningXid > builder->running.xmax)

Isn't that wrong under wraparound?


>      {
>          int            off;
> +        bool        first = builder->running.xcnt == 0;
>  
>          /*
>           * We only care about toplevel xids as those are the ones we
> @@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>          builder->running.xmin = builder->running.xip[0];
>          builder->running.xmax = builder->running.xip[running->xcnt - 1];
>  
> +
>          /* makes comparisons cheaper later */
>          TransactionIdRetreat(builder->running.xmin);
>          TransactionIdAdvance(builder->running.xmax);
>  
>          builder->state = SNAPBUILD_FULL_SNAPSHOT;
>  
> -        ereport(LOG,
> -            (errmsg("logical decoding found initial starting point at %X/%X",
> -                    (uint32) (lsn >> 32), (uint32) lsn),
> -             errdetail_plural("%u transaction needs to finish.",
> -                              "%u transactions need to finish.",
> -                              builder->running.xcnt,
> -                              (uint32) builder->running.xcnt)));
> -
> +        /*
> +         * If this is the first time we've seen xl_running_xacts, we are done.
> +         */
> +        if (first)
> +        {
> +            ereport(LOG,
> +                (errmsg("logical decoding found initial starting point at %X/%X",
> +                        (uint32) (lsn >> 32), (uint32) lsn),
> +                 errdetail_plural("%u transaction needs to finish.",
> +                                  "%u transactions need to finish.",
> +                                  builder->running.xcnt,
> +                                  (uint32) builder->running.xcnt)));
> +        }
> +        else
> +        {
> +            /*
> +             * Because of the race condition in LogStandbySnapshot() the
> +             * transactions recorded in xl_running_xacts as running might have
> +             * already committed by the time the xl_running_xacts was written
> +             * to WAL. Use the information about decoded transactions that we
> +             * gathered so far to update our idea about what's still running.
> +             *
> +             * We can use SnapBuildEndTxn directly as it only does the
> +             * transaction running check and handling without any additional
> +             * side effects.
> +             */
> +            for (off = 0; off < builder->committed.xcnt; off++)
> +                SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
> +            if (builder->state == SNAPBUILD_CONSISTENT)
> +                return false;
> +
> +            ereport(LOG,
> +                (errmsg("logical decoding moved initial starting point to %X/%X",
> +                        (uint32) (lsn >> 32), (uint32) lsn),
> +                 errdetail_plural("%u transaction needs to finish.",
> +                                  "%u transactions need to finish.",
> +                                  builder->running.xcnt,
> +                                  (uint32) builder->running.xcnt)));
> +        }

Hm, this is not pretty.




> From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
> From: Petr Jelinek <pjmodos@pjmodos.net>
> Date: Tue, 21 Feb 2017 19:58:18 +0100
> Subject: [PATCH 5/5] Skip unnecessary snapshot builds
> 
> When doing initial snapshot build during logical decoding
> initialization, don't build snapshots for transactions where we know the
> transaction didn't do any catalog changes. Otherwise we might end up
> with thousands of useless snapshots on busy server which can be quite
> expensive.
> ---
>  src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
>  1 file changed, 53 insertions(+), 29 deletions(-)
> 
> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
> index 1a1c9ba..c800aa5 100644
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>      bool        forced_timetravel = false;
>      bool        sub_needs_timetravel = false;
>      bool        top_needs_timetravel = false;
> +    bool        skip_forced_snapshot = false;
>  
>      TransactionId xmax = xid;
>  
> @@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>          /*
>           * We could avoid treating !SnapBuildTxnIsRunning transactions as
>           * timetravel ones, but we want to be able to export a snapshot when
> -         * we reached consistency.
> +         * we reached consistency so we need to keep track of them.
>           */
>          forced_timetravel = true;
>          elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
> +
> +        /*
> +         * It is however desirable to skip building new snapshot for
> +         * !SnapBuildTxnIsRunning transactions as otherwise we might end up
> +         * building thousands of unused snapshots on busy servers which can
> +         * be very expensive.
> +         */
> +        if (!SnapBuildTxnIsRunning(builder, xid))
> +            skip_forced_snapshot = true;
>      }

That's pretty crudely bolted on the existing logic, isn't there a
simpler way?


Greetings,

Andres Freund



Re: [HACKERS] snapbuild woes

From
Noah Misch
Date:
On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:
> On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:
> > On 4/12/17 02:31, Noah Misch wrote:
> > >>> But I hope you mean to commit these snapbuild patches before the postgres 10
> > >>> release?  As far as I know, logical replication is still very broken without
> > >>> them (or at least some of that set of 5 patches - I don't know which ones
> > >>> are essential and which may not be).
> > >>
> > >> Yes, these should go into 10 *and* earlier releases, and I don't plan to
> > >> wait for 2017-07.
> > > 
> > > [Action required within three days.  This is a generic notification.]
> > 
> > I'm hoping for a word from Andres on this.
> 
> Feel free to reassign to me.

Thanks for volunteering; I'll do that shortly.  Please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:

On April 12, 2017 9:58:12 PM PDT, Noah Misch <noah@leadboat.com> wrote:
>On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:
>> On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:
>> > On 4/12/17 02:31, Noah Misch wrote:
>> > >>> But I hope you mean to commit these snapbuild patches before
>the postgres 10
>> > >>> release?  As far as I know, logical replication is still very
>broken without
>> > >>> them (or at least some of that set of 5 patches - I don't know
>which ones
>> > >>> are essential and which may not be).
>> > >>
>> > >> Yes, these should go into 10 *and* earlier releases, and I don't
>plan to
>> > >> wait for 2017-07.
>> > >
>> > > [Action required within three days.  This is a generic
>notification.]
>> >
>> > I'm hoping for a word from Andres on this.
>>
>> Feel free to reassign to me.
>
>Thanks for volunteering; I'll do that shortly.  Please observe the
>policy on
>open item ownership[1] and send a status update within three calendar
>days of
>this message.  Include a date for your subsequent status update.
>
>[1]
>https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

Will, volunteering might be the wrong word.  These ate all my fault, although none look v10 specific.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
Thanks for looking at this!

On 13/04/17 02:29, Andres Freund wrote:
> Hi,
> On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote:
> 
>> From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
>> From: Petr Jelinek <pjmodos@pjmodos.net>
>> Date: Fri, 24 Feb 2017 21:39:03 +0100
>> Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export
>>
>> Otherwise the VACUUM or pruning might remove tuples still needed by the
>> exported snapshot.
>> ---
>>  src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
>>  1 file changed, 20 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
>> index 5529ac8..57c392c 100644
>> --- a/src/backend/replication/logical/logical.c
>> +++ b/src/backend/replication/logical/logical.c
>> @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
>>       * the slot machinery about the new limit. Once that's done the
>>       * ProcArrayLock can be released as the slot machinery now is
>>       * protecting against vacuum.
>> +     *
>> +     * Note that we only store the global xmin temporarily in the in-memory
>> +     * state so that the initial snapshot can be exported. After initial
>> +     * snapshot is done global xmin should be reset and not tracked anymore
>> +     * so we are fine with losing the global xmin after crash.
>>       * ----
>>       */
>>      LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>>  
>>      slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
>>      slot->data.catalog_xmin = slot->effective_catalog_xmin;
>> +    slot->effective_xmin = slot->effective_catalog_xmin;
> 
> 
>>  void
>>  FreeDecodingContext(LogicalDecodingContext *ctx)
>>  {
>> +    ReplicationSlot *slot = MyReplicationSlot;
>> +
>>      if (ctx->callbacks.shutdown_cb != NULL)
>>          shutdown_cb_wrapper(ctx);
>>  
>> +    /*
>> +     * Cleanup global xmin for the slot that we may have set in
>> +     * CreateInitDecodingContext().
> 
> Hm.  Is that actually a meaningful point to do so? For one, it gets
> called by pg_logical_slot_get_changes_guts(), but more importantly, the
> snapshot is exported till SnapBuildClearExportedSnapshot(), which is the
> next command?  If we rely on the snapshot magic done by ExportSnapshot()
> it'd be worthwhile to mention that...
> 

(Didn't see the patch for couple of months so don't remember all the
detailed decisions anymore)

Yes we rely on backend's xmin to be set for exported snapshot. We only
care about global xmin for exported snapshots really, I assumed that's
clear enough from "so that the initial snapshot can be exported" but I
guess there should be more clear comment about this where we actually
clean this up.

> 
>> We do not take ProcArrayLock or similar
>> +     * since we only reset xmin here and there's not much harm done by a
>> +     * concurrent computation missing that.
>> +     */
> 
> Hum.  I was prepared to complain about this, but ISTM, that there's
> absolutely no risk because the following
> ReplicationSlotsComputeRequiredXmin(false); actually does all the
> necessary locking?  But still, I don't see much point in the
> optimization.
> 

Well, if we don't need it in LogicalConfirmReceivedLocation, I don't see
why we need it here. Please enlighten me.

> 
> 
>> This patch changes the code so that stored snapshots are only used for
>> logical decoding restart but not for initial slot snapshot.
> 
> Yea, that's a very good point...
> 
>> @@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>  
>>          return false;
>>      }
>> -    /* c) valid on disk state */
>> -    else if (SnapBuildRestore(builder, lsn))
>> +    /* c) valid on disk state and not exported snapshot */
>> +    else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
>> +             SnapBuildRestore(builder, lsn))
> 
> Hm. Is this a good signaling mechanism? It'll also trigger for the SQL
> interface, where it'd strictly speaking not be required, right?
> 

Good point. Maybe we should really tell snapshot builder if the snapshot
is going to be exported or not explicitly (see the rant all the way down).

> 
>> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
>> From: Petr Jelinek <pjmodos@pjmodos.net>
>> Date: Sun, 26 Feb 2017 01:07:33 +0100
>> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards
> 
> A bit more commentary would be good. What does that protect us against?
> 

I think I explained that in the email. We might export snapshot with
xmin smaller than global xmin otherwise.

> 
> 
>> From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
>> From: Petr Jelinek <pjmodos@pjmodos.net>
>> Date: Wed, 22 Feb 2017 00:57:33 +0100
>> Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder
>>
>> Due to race condition, the xl_running_xacts might contain no longer
>> running transactions. Previous coding tried to get around this by
>> additional locking but that did not work correctly for committs. Instead
>> try combining decoded commits and multiple xl_running_xacts to get the
>> consistent snapshot.
> 
> Needs more explanation about approach.
> 
> 
>> @@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>       *      simply track the number of in-progress toplevel transactions and
>>       *      lower it whenever one commits or aborts. When that number
>>       *      (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
>> -     *      to CONSISTENT.
>> +     *      to CONSISTENT. Sometimes we might get xl_running_xacts which has
>> +     *      all tracked transactions as finished. We'll need to restart tracking
>> +     *      in that case and use previously collected committed transactions to
>> +     *      purge transactions mistakenly marked as running in the
>> +     *      xl_running_xacts which exist as a result of race condition in
>> +     *      LogStandbySnapshot().
> 
> I'm not following this yet.
> 

Let me try to explain:
We get xl_running_xacts with txes 1,3,4. But the 1 already committed
before so the decoding will never see it and we never get snapshot. Now
at some point we might get xl_running_xact with txes 6,7,8 so we know
that all transactions from the initial xl_running_xacts must be closed.
We restart the tracking here from beginning as if this was first
xl_running_xacts we've seen, with the exception that we look into past
if we seen the 6,7,8 transactions already as finished, then we'll mark
them as finished immediately (hence avoiding the issue with transaction
6 being already committed before xl_running_xacts was written).

> 
>> @@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>       * b) first encounter of a useable xl_running_xacts record. If we had
>>       * found one earlier we would either track running transactions (i.e.
>>       * builder->running.xcnt != 0) or be consistent (this function wouldn't
>> -     * get called).
>> +     * get called). However it's possible that we could not see all
>> +     * transactions that were marked as running in xl_running_xacts, so if
>> +     * we get new one that says all were closed but we are not consistent
>> +     * yet, we need to restart the tracking while taking previously seen
>> +     * transactions into account.
> 
> This needs to revise the preceding comment more heavily.  "This is the
> first!!! Or maybe not!" isn't easy to understand.
> 

Yeah, I found it bit hard to make it sound correct and not confusing,
even wondered if I should split this code to two because of that but it
would lead into quite a bit of code duplication, dunno if that's better.
Maybe we could move the "reset" code into separate function to avoid
most of the duplication.

> 
>>       */
>> -    else if (!builder->running.xcnt)
>> +    else if (!builder->running.xcnt ||
>> +             running->oldestRunningXid > builder->running.xmax)
> 
> Isn't that wrong under wraparound?
> 

Right, should use TransactionIdFollows.

> 
>>      {
>>          int            off;
>> +        bool        first = builder->running.xcnt == 0;
>>  
>>          /*
>>           * We only care about toplevel xids as those are the ones we
>> @@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>          builder->running.xmin = builder->running.xip[0];
>>          builder->running.xmax = builder->running.xip[running->xcnt - 1];
>>  
>> +
>>          /* makes comparisons cheaper later */
>>          TransactionIdRetreat(builder->running.xmin);
>>          TransactionIdAdvance(builder->running.xmax);
>>  
>>          builder->state = SNAPBUILD_FULL_SNAPSHOT;
>>  
>> -        ereport(LOG,
>> -            (errmsg("logical decoding found initial starting point at %X/%X",
>> -                    (uint32) (lsn >> 32), (uint32) lsn),
>> -             errdetail_plural("%u transaction needs to finish.",
>> -                              "%u transactions need to finish.",
>> -                              builder->running.xcnt,
>> -                              (uint32) builder->running.xcnt)));
>> -
>> +        /*
>> +         * If this is the first time we've seen xl_running_xacts, we are done.
>> +         */
>> +        if (first)
>> +        {
>> +            ereport(LOG,
>> +                (errmsg("logical decoding found initial starting point at %X/%X",
>> +                        (uint32) (lsn >> 32), (uint32) lsn),
>> +                 errdetail_plural("%u transaction needs to finish.",
>> +                                  "%u transactions need to finish.",
>> +                                  builder->running.xcnt,
>> +                                  (uint32) builder->running.xcnt)));
>> +        }
>> +        else
>> +        {
>> +            /*
>> +             * Because of the race condition in LogStandbySnapshot() the
>> +             * transactions recorded in xl_running_xacts as running might have
>> +             * already committed by the time the xl_running_xacts was written
>> +             * to WAL. Use the information about decoded transactions that we
>> +             * gathered so far to update our idea about what's still running.
>> +             *
>> +             * We can use SnapBuildEndTxn directly as it only does the
>> +             * transaction running check and handling without any additional
>> +             * side effects.
>> +             */
>> +            for (off = 0; off < builder->committed.xcnt; off++)
>> +                SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
>> +            if (builder->state == SNAPBUILD_CONSISTENT)
>> +                return false;
>> +
>> +            ereport(LOG,
>> +                (errmsg("logical decoding moved initial starting point to %X/%X",
>> +                        (uint32) (lsn >> 32), (uint32) lsn),
>> +                 errdetail_plural("%u transaction needs to finish.",
>> +                                  "%u transactions need to finish.",
>> +                                  builder->running.xcnt,
>> +                                  (uint32) builder->running.xcnt)));
>> +        }
> 
> Hm, this is not pretty.
> 

Hmm? There are two possible scenarios that need to be handled
differently. Possibly another reason to split it out completely as
mentioned above.

> 
>> From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
>> From: Petr Jelinek <pjmodos@pjmodos.net>
>> Date: Tue, 21 Feb 2017 19:58:18 +0100
>> Subject: [PATCH 5/5] Skip unnecessary snapshot builds
>>
>> When doing initial snapshot build during logical decoding
>> initialization, don't build snapshots for transactions where we know the
>> transaction didn't do any catalog changes. Otherwise we might end up
>> with thousands of useless snapshots on busy server which can be quite
>> expensive.
>> ---
>>  src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
>>  1 file changed, 53 insertions(+), 29 deletions(-)
>>
>> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
>> index 1a1c9ba..c800aa5 100644
>> --- a/src/backend/replication/logical/snapbuild.c
>> +++ b/src/backend/replication/logical/snapbuild.c
>> @@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>>      bool        forced_timetravel = false;
>>      bool        sub_needs_timetravel = false;
>>      bool        top_needs_timetravel = false;
>> +    bool        skip_forced_snapshot = false;
>>  
>>      TransactionId xmax = xid;
>>  
>> @@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>>          /*
>>           * We could avoid treating !SnapBuildTxnIsRunning transactions as
>>           * timetravel ones, but we want to be able to export a snapshot when
>> -         * we reached consistency.
>> +         * we reached consistency so we need to keep track of them.
>>           */
>>          forced_timetravel = true;
>>          elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
>> +
>> +        /*
>> +         * It is however desirable to skip building new snapshot for
>> +         * !SnapBuildTxnIsRunning transactions as otherwise we might end up
>> +         * building thousands of unused snapshots on busy servers which can
>> +         * be very expensive.
>> +         */
>> +        if (!SnapBuildTxnIsRunning(builder, xid))
>> +            skip_forced_snapshot = true;
>>      }
> 
> That's pretty crudely bolted on the existing logic, isn't there a
> simpler way?
> 

Agreed, however, every time I tried to make this prettier I ended up
either producing subtle bugs (see the initial email in this thread for
example of that), so I eventually gave up on pretty.

As a side note, my opinion after all this is that it's probably mistake
to try to use various situational conditions to make sure we are
building exportable snapshot. ISTM we should tell the snapbuilder
explicitly that the snapshot will be exported and it should behave
accordingly based on that. Because for example, we also should track
aborted transactions in the snapshot which is to be exported because
otherwise enough of them happening during snapshot building will result
in export failure due to too big snapshot. But this seems like too
invasive change to be back-portable.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 13/04/17 07:02, Andres Freund wrote:
> 
> On April 12, 2017 9:58:12 PM PDT, Noah Misch <noah@leadboat.com> wrote:
>> On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:
>>> On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:
>>>> On 4/12/17 02:31, Noah Misch wrote:
>>>>>>> But I hope you mean to commit these snapbuild patches before
>> the postgres 10
>>>>>>> release?  As far as I know, logical replication is still very
>> broken without
>>>>>>> them (or at least some of that set of 5 patches - I don't know
>> which ones
>>>>>>> are essential and which may not be).
>>>>>>
>>>>>> Yes, these should go into 10 *and* earlier releases, and I don't
>> plan to
>>>>>> wait for 2017-07.
>>>>>
>>>>> [Action required within three days.  This is a generic
>> notification.]
>>>>
>>>> I'm hoping for a word from Andres on this.
>>>
>>> Feel free to reassign to me.
>>
>> Thanks for volunteering; I'll do that shortly.  Please observe the
>> policy on
>> open item ownership[1] and send a status update within three calendar
>> days of
>> this message.  Include a date for your subsequent status update.
>>
>> [1]
>> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> Will, volunteering might be the wrong word.  These ate all my fault, although none look v10 specific.
> 

Yeah none of this is v10 specific, the importance to v10 is that it
affects the logical replication in core, not just extensions like in
9.4-9.6.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
Hi, here is updated patch (details inline).

On 13/04/17 20:00, Petr Jelinek wrote:
> Thanks for looking at this!
> 
> On 13/04/17 02:29, Andres Freund wrote:
>> Hi,
>> On 2017-03-03 01:30:11 +0100, Petr Jelinek wrote:
>>
>>> From 7d5b48c8cb80e7c867b2096c999d08feda50b197 Mon Sep 17 00:00:00 2001
>>> From: Petr Jelinek <pjmodos@pjmodos.net>
>>> Date: Fri, 24 Feb 2017 21:39:03 +0100
>>> Subject: [PATCH 1/5] Reserve global xmin for create slot snasphot export
>>>
>>> Otherwise the VACUUM or pruning might remove tuples still needed by the
>>> exported snapshot.
>>> ---
>>>  src/backend/replication/logical/logical.c | 21 ++++++++++++++++++++-
>>>  1 file changed, 20 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
>>> index 5529ac8..57c392c 100644
>>> --- a/src/backend/replication/logical/logical.c
>>> +++ b/src/backend/replication/logical/logical.c
>>> @@ -267,12 +267,18 @@ CreateInitDecodingContext(char *plugin,
>>>       * the slot machinery about the new limit. Once that's done the
>>>       * ProcArrayLock can be released as the slot machinery now is
>>>       * protecting against vacuum.
>>> +     *
>>> +     * Note that we only store the global xmin temporarily in the in-memory
>>> +     * state so that the initial snapshot can be exported. After initial
>>> +     * snapshot is done global xmin should be reset and not tracked anymore
>>> +     * so we are fine with losing the global xmin after crash.
>>>       * ----
>>>       */
>>>      LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
>>>  
>>>      slot->effective_catalog_xmin = GetOldestSafeDecodingTransactionId();
>>>      slot->data.catalog_xmin = slot->effective_catalog_xmin;
>>> +    slot->effective_xmin = slot->effective_catalog_xmin;
>>
>>
>>>  void
>>>  FreeDecodingContext(LogicalDecodingContext *ctx)
>>>  {
>>> +    ReplicationSlot *slot = MyReplicationSlot;
>>> +
>>>      if (ctx->callbacks.shutdown_cb != NULL)
>>>          shutdown_cb_wrapper(ctx);
>>>  
>>> +    /*
>>> +     * Cleanup global xmin for the slot that we may have set in
>>> +     * CreateInitDecodingContext().
>>
>> Hm.  Is that actually a meaningful point to do so? For one, it gets
>> called by pg_logical_slot_get_changes_guts(), but more importantly, the
>> snapshot is exported till SnapBuildClearExportedSnapshot(), which is the
>> next command?  If we rely on the snapshot magic done by ExportSnapshot()
>> it'd be worthwhile to mention that...
>>
> 
> (Didn't see the patch for couple of months so don't remember all the
> detailed decisions anymore)
> 
> Yes we rely on backend's xmin to be set for exported snapshot. We only
> care about global xmin for exported snapshots really, I assumed that's
> clear enough from "so that the initial snapshot can be exported" but I
> guess there should be more clear comment about this where we actually
> clean this up.
> 

Okay, wrote new comment there, how is it now?

>>
>>> We do not take ProcArrayLock or similar
>>> +     * since we only reset xmin here and there's not much harm done by a
>>> +     * concurrent computation missing that.
>>> +     */
>>
>> Hum.  I was prepared to complain about this, but ISTM, that there's
>> absolutely no risk because the following
>> ReplicationSlotsComputeRequiredXmin(false); actually does all the
>> necessary locking?  But still, I don't see much point in the
>> optimization.
>>
> 
> Well, if we don't need it in LogicalConfirmReceivedLocation, I don't see
> why we need it here. Please enlighten me.
> 

I kept this as it was, after rereading, the
ReplicationSlotsComputeRequiredXmin(false) will do shared lock while if
we wanted to avoid mutex and do the xmin update under lock we'd need to
do exclusive lock so I think it's worth the optimization...

>>
>>
>>> This patch changes the code so that stored snapshots are only used for
>>> logical decoding restart but not for initial slot snapshot.
>>
>> Yea, that's a very good point...
>>
>>> @@ -1284,13 +1286,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>>  
>>>          return false;
>>>      }
>>> -    /* c) valid on disk state */
>>> -    else if (SnapBuildRestore(builder, lsn))
>>> +    /* c) valid on disk state and not exported snapshot */
>>> +    else if (!TransactionIdIsNormal(builder->initial_xmin_horizon) &&
>>> +             SnapBuildRestore(builder, lsn))
>>
>> Hm. Is this a good signaling mechanism? It'll also trigger for the SQL
>> interface, where it'd strictly speaking not be required, right?
>>
> 
> Good point. Maybe we should really tell snapshot builder if the snapshot
> is going to be exported or not explicitly (see the rant all the way down).
> 

I added the new signaling mechanism (the new boolean option indicating
if we are building full snapshot which is only set when the snapshot is
exported or used by the transaction).

>>
>>> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
>>> From: Petr Jelinek <pjmodos@pjmodos.net>
>>> Date: Sun, 26 Feb 2017 01:07:33 +0100
>>> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards
>>
>> A bit more commentary would be good. What does that protect us against?
>>
> 
> I think I explained that in the email. We might export snapshot with
> xmin smaller than global xmin otherwise.
> 

Updated commit message with explanation as well.

>>
>>
>>> From 53193b40f26dd19c712f3b9b77af55f81eb31cc4 Mon Sep 17 00:00:00 2001
>>> From: Petr Jelinek <pjmodos@pjmodos.net>
>>> Date: Wed, 22 Feb 2017 00:57:33 +0100
>>> Subject: [PATCH 4/5] Fix xl_running_xacts usage in snapshot builder
>>>
>>> Due to race condition, the xl_running_xacts might contain no longer
>>> running transactions. Previous coding tried to get around this by
>>> additional locking but that did not work correctly for committs. Instead
>>> try combining decoded commits and multiple xl_running_xacts to get the
>>> consistent snapshot.
>>
>> Needs more explanation about approach.
>>
>>
>>> @@ -1221,7 +1221,12 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>>       *      simply track the number of in-progress toplevel transactions and
>>>       *      lower it whenever one commits or aborts. When that number
>>>       *      (builder->running.xcnt) reaches zero, we can go from FULL_SNAPSHOT
>>> -     *      to CONSISTENT.
>>> +     *      to CONSISTENT. Sometimes we might get xl_running_xacts which has
>>> +     *      all tracked transactions as finished. We'll need to restart tracking
>>> +     *      in that case and use previously collected committed transactions to
>>> +     *      purge transactions mistakenly marked as running in the
>>> +     *      xl_running_xacts which exist as a result of race condition in
>>> +     *      LogStandbySnapshot().
>>
>> I'm not following this yet.
>>
> 
> Let me try to explain:
> We get xl_running_xacts with txes 1,3,4. But the 1 already committed
> before so the decoding will never see it and we never get snapshot. Now
> at some point we might get xl_running_xact with txes 6,7,8 so we know
> that all transactions from the initial xl_running_xacts must be closed.
> We restart the tracking here from beginning as if this was first
> xl_running_xacts we've seen, with the exception that we look into past
> if we seen the 6,7,8 transactions already as finished, then we'll mark
> them as finished immediately (hence avoiding the issue with transaction
> 6 being already committed before xl_running_xacts was written).
> 
>>
>>> @@ -1298,11 +1303,17 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>>       * b) first encounter of a useable xl_running_xacts record. If we had
>>>       * found one earlier we would either track running transactions (i.e.
>>>       * builder->running.xcnt != 0) or be consistent (this function wouldn't
>>> -     * get called).
>>> +     * get called). However it's possible that we could not see all
>>> +     * transactions that were marked as running in xl_running_xacts, so if
>>> +     * we get new one that says all were closed but we are not consistent
>>> +     * yet, we need to restart the tracking while taking previously seen
>>> +     * transactions into account.
>>
>> This needs to revise the preceding comment more heavily.  "This is the
>> first!!! Or maybe not!" isn't easy to understand.
>>
> 
> Yeah, I found it bit hard to make it sound correct and not confusing,
> even wondered if I should split this code to two because of that but it
> would lead into quite a bit of code duplication, dunno if that's better.
> Maybe we could move the "reset" code into separate function to avoid
> most of the duplication.

Rewrote and moved this comment to it's own thing.

> 
>>
>>>       */
>>> -    else if (!builder->running.xcnt)
>>> +    else if (!builder->running.xcnt ||
>>> +             running->oldestRunningXid > builder->running.xmax)
>>
>> Isn't that wrong under wraparound?
>>
> 
> Right, should use TransactionIdFollows.
> 

Fixed.

>>
>>>      {
>>>          int            off;
>>> +        bool        first = builder->running.xcnt == 0;
>>>  
>>>          /*
>>>           * We only care about toplevel xids as those are the ones we
>>> @@ -1338,20 +1349,13 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
>>>          builder->running.xmin = builder->running.xip[0];
>>>          builder->running.xmax = builder->running.xip[running->xcnt - 1];
>>>  
>>> +
>>>          /* makes comparisons cheaper later */
>>>          TransactionIdRetreat(builder->running.xmin);
>>>          TransactionIdAdvance(builder->running.xmax);
>>>  
>>>          builder->state = SNAPBUILD_FULL_SNAPSHOT;
>>>  
>>> -        ereport(LOG,
>>> -            (errmsg("logical decoding found initial starting point at %X/%X",
>>> -                    (uint32) (lsn >> 32), (uint32) lsn),
>>> -             errdetail_plural("%u transaction needs to finish.",
>>> -                              "%u transactions need to finish.",
>>> -                              builder->running.xcnt,
>>> -                              (uint32) builder->running.xcnt)));
>>> -
>>> +        /*
>>> +         * If this is the first time we've seen xl_running_xacts, we are done.
>>> +         */
>>> +        if (first)
>>> +        {
>>> +            ereport(LOG,
>>> +                (errmsg("logical decoding found initial starting point at %X/%X",
>>> +                        (uint32) (lsn >> 32), (uint32) lsn),
>>> +                 errdetail_plural("%u transaction needs to finish.",
>>> +                                  "%u transactions need to finish.",
>>> +                                  builder->running.xcnt,
>>> +                                  (uint32) builder->running.xcnt)));
>>> +        }
>>> +        else
>>> +        {
>>> +            /*
>>> +             * Because of the race condition in LogStandbySnapshot() the
>>> +             * transactions recorded in xl_running_xacts as running might have
>>> +             * already committed by the time the xl_running_xacts was written
>>> +             * to WAL. Use the information about decoded transactions that we
>>> +             * gathered so far to update our idea about what's still running.
>>> +             *
>>> +             * We can use SnapBuildEndTxn directly as it only does the
>>> +             * transaction running check and handling without any additional
>>> +             * side effects.
>>> +             */
>>> +            for (off = 0; off < builder->committed.xcnt; off++)
>>> +                SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);
>>> +            if (builder->state == SNAPBUILD_CONSISTENT)
>>> +                return false;
>>> +
>>> +            ereport(LOG,
>>> +                (errmsg("logical decoding moved initial starting point to %X/%X",
>>> +                        (uint32) (lsn >> 32), (uint32) lsn),
>>> +                 errdetail_plural("%u transaction needs to finish.",
>>> +                                  "%u transactions need to finish.",
>>> +                                  builder->running.xcnt,
>>> +                                  (uint32) builder->running.xcnt)));
>>> +        }
>>
>> Hm, this is not pretty.
>>
> 

Changed this whole thing to be 2 different code paths with common
function doing the common work.

> 
>>
>>> From 4217da872e9aa48750c020542d8bc22c863a3d75 Mon Sep 17 00:00:00 2001
>>> From: Petr Jelinek <pjmodos@pjmodos.net>
>>> Date: Tue, 21 Feb 2017 19:58:18 +0100
>>> Subject: [PATCH 5/5] Skip unnecessary snapshot builds
>>>
>>> When doing initial snapshot build during logical decoding
>>> initialization, don't build snapshots for transactions where we know the
>>> transaction didn't do any catalog changes. Otherwise we might end up
>>> with thousands of useless snapshots on busy server which can be quite
>>> expensive.
>>> ---
>>>  src/backend/replication/logical/snapbuild.c | 82 +++++++++++++++++++----------
>>>  1 file changed, 53 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
>>> index 1a1c9ba..c800aa5 100644
>>> --- a/src/backend/replication/logical/snapbuild.c
>>> +++ b/src/backend/replication/logical/snapbuild.c
>>> @@ -954,6 +954,7 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>>>      bool        forced_timetravel = false;
>>>      bool        sub_needs_timetravel = false;
>>>      bool        top_needs_timetravel = false;
>>> +    bool        skip_forced_snapshot = false;
>>>  
>>>      TransactionId xmax = xid;
>>>  
>>> @@ -975,10 +976,19 @@ SnapBuildCommitTxn(SnapBuild *builder, XLogRecPtr lsn, TransactionId xid,
>>>          /*
>>>           * We could avoid treating !SnapBuildTxnIsRunning transactions as
>>>           * timetravel ones, but we want to be able to export a snapshot when
>>> -         * we reached consistency.
>>> +         * we reached consistency so we need to keep track of them.
>>>           */
>>>          forced_timetravel = true;
>>>          elog(DEBUG1, "forced to assume catalog changes for xid %u because it was running too early", xid);
>>> +
>>> +        /*
>>> +         * It is however desirable to skip building new snapshot for
>>> +         * !SnapBuildTxnIsRunning transactions as otherwise we might end up
>>> +         * building thousands of unused snapshots on busy servers which can
>>> +         * be very expensive.
>>> +         */
>>> +        if (!SnapBuildTxnIsRunning(builder, xid))
>>> +            skip_forced_snapshot = true;
>>>      }
>>
>> That's pretty crudely bolted on the existing logic, isn't there a
>> simpler way?
>>
> 
> Agreed, however, every time I tried to make this prettier I ended up
> either producing subtle bugs (see the initial email in this thread for
> example of that), so I eventually gave up on pretty.
> 

Okay, gave it one more try with fresh head, hopefully without new bugs,
what do you think?

> As a side note, my opinion after all this is that it's probably mistake
> to try to use various situational conditions to make sure we are
> building exportable snapshot. ISTM we should tell the snapbuilder
> explicitly that the snapshot will be exported and it should behave
> accordingly based on that. Because for example, we also should track
> aborted transactions in the snapshot which is to be exported because
> otherwise enough of them happening during snapshot building will result
> in export failure due to too big snapshot. But this seems like too
> invasive change to be back-portable.
> 

Ended up doing this in the 0002 and also use those changes in 0005, does
not seem to be that bad.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Noah Misch
Date:
On Thu, Apr 13, 2017 at 12:58:12AM -0400, Noah Misch wrote:
> On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:
> > On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:
> > > On 4/12/17 02:31, Noah Misch wrote:
> > > >>> But I hope you mean to commit these snapbuild patches before the postgres 10
> > > >>> release?  As far as I know, logical replication is still very broken without
> > > >>> them (or at least some of that set of 5 patches - I don't know which ones
> > > >>> are essential and which may not be).
> > > >>
> > > >> Yes, these should go into 10 *and* earlier releases, and I don't plan to
> > > >> wait for 2017-07.
> > > > 
> > > > [Action required within three days.  This is a generic notification.]
> > > 
> > > I'm hoping for a word from Andres on this.
> > 
> > Feel free to reassign to me.
> 
> Thanks for volunteering; I'll do that shortly.  Please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.
> 
> [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-16 22:04:04 -0400, Noah Misch wrote:
> On Thu, Apr 13, 2017 at 12:58:12AM -0400, Noah Misch wrote:
> > On Wed, Apr 12, 2017 at 10:21:51AM -0700, Andres Freund wrote:
> > > On 2017-04-12 11:03:57 -0400, Peter Eisentraut wrote:
> > > > On 4/12/17 02:31, Noah Misch wrote:
> > > > >>> But I hope you mean to commit these snapbuild patches before the postgres 10
> > > > >>> release?  As far as I know, logical replication is still very broken without
> > > > >>> them (or at least some of that set of 5 patches - I don't know which ones
> > > > >>> are essential and which may not be).
> > > > >>
> > > > >> Yes, these should go into 10 *and* earlier releases, and I don't plan to
> > > > >> wait for 2017-07.
> > > > > 
> > > > > [Action required within three days.  This is a generic notification.]
> > > > 
> > > > I'm hoping for a word from Andres on this.
> > > 
> > > Feel free to reassign to me.
> > 
> > Thanks for volunteering; I'll do that shortly.  Please observe the policy on
> > open item ownership[1] and send a status update within three calendar days of
> > this message.  Include a date for your subsequent status update.
> > 
> > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.

I've since the previous update reviewed Petr's patch, which he since has
updated over the weekend.  I'll do another round tomorrow, and will see
how it looks.  I think we might need some more tests for this to be
committable, so it might not become committable tomorrow.  I hope we'll
have something in tree by end of this week, if not I'll send an update.

- Andres



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
> I think we might need some more tests for this to be committable, so
> it might not become committable tomorrow.

I'm working on some infrastructure around this.  Not sure if it needs to
be committed, but it's certainly useful for evaluation.  Basically it's
a small UDF that:
1) creates a slot via walsender protocol (to some dsn)
2) imports that snapshot into yet another connection to that dsn
3) runs some query over that new connection

That makes it reasonably easy to run e.g. pgbench and continually create
slots, and use the snapshot to run queries "verifying" that things look
good.  It's a bit shoestring-ed together, but everything else seems to
require more code. And it's just test.

Unless somebody has a better idea?

- Andres



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 20/04/17 02:09, Andres Freund wrote:
> On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
>> I think we might need some more tests for this to be committable, so
>> it might not become committable tomorrow.
> 
> I'm working on some infrastructure around this.  Not sure if it needs to
> be committed, but it's certainly useful for evaluation.  Basically it's
> a small UDF that:
> 1) creates a slot via walsender protocol (to some dsn)
> 2) imports that snapshot into yet another connection to that dsn
> 3) runs some query over that new connection
> 
> That makes it reasonably easy to run e.g. pgbench and continually create
> slots, and use the snapshot to run queries "verifying" that things look
> good.  It's a bit shoestring-ed together, but everything else seems to
> require more code. And it's just test.
> 
> Unless somebody has a better idea?
> 

I don't. I mean it would be nice to have isolation tester support
walsender protocol, but I don't know anything about isolation tester
internals so no idea how much work that is. On top of that some of the
issues are not even possible to provoke via isolation tester (or
anything similar that would give us control over timing) unless we
expose a lot of guts of xlog/xact as a UDFs as well. So I think simple
function that does what you said and pgbench are reasonable solutions. I
guess you plan to make that as one of the test/modules or something
similar (the UDF)?

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-20 13:32:10 +0200, Petr Jelinek wrote:
> On 20/04/17 02:09, Andres Freund wrote:
> > On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
> > I'm working on some infrastructure around this.  Not sure if it needs to
> > be committed, but it's certainly useful for evaluation.  Basically it's
> > a small UDF that:
> > 1) creates a slot via walsender protocol (to some dsn)
> > 2) imports that snapshot into yet another connection to that dsn
> > 3) runs some query over that new connection
> > 
> > That makes it reasonably easy to run e.g. pgbench and continually create
> > slots, and use the snapshot to run queries "verifying" that things look
> > good.  It's a bit shoestring-ed together, but everything else seems to
> > require more code. And it's just test.
> > 
> > Unless somebody has a better idea?

> I don't. I mean it would be nice to have isolation tester support
> walsender protocol, but I don't know anything about isolation tester
> internals so no idea how much work that is.

Now that the replication protocol supports normal queries, it's actually
not much of an issue on its own.  The problem is more that
isolationtester's client side language isn't powerfull enough - you
can't extract the snapshot name from one session and import it in
another. While that might be something we want to address, I certainly
don't want to tackle it for v10.

I'd started to develop a C toolkit as above, but after I got the basics
running I actually noticed it's pretty much unnecessary: You can just as
well do it with dblink and some plpgsql.

I can reliably reproduce several of the bugs in this thread in
relatively short amount of time before applying the patch, and so far
not after.  Thats great!


> I guess you plan to make that as one of the test/modules or something
> similar (the UDF)?

I've a bunch of tests, but I don't quite know whether we can expose all
of them via classical tests.  There are several easy ones that I
definitely want to add (import "empty" snapshot; import snapshot with
running xacts; create snapshot, perform some ddl, import snapshot,
perform some ddl, check things work reasonably crazy), but there's
enough others that are just probabilistic.  I was wondering about adding
a loop that simply runs for like 30s and then quits or such, but who
knows.

Testing around this made me wonder whether we need to make bgwriter.c's
LOG_SNAPSHOT_INTERVAL_MS configurable - for efficient testing reducing
it is quite valuable, and on busier machines it'll also almost always be
a win to log more frequently...  Opinions?

Greetings,

Andres Freund



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
> I've since the previous update reviewed Petr's patch, which he since has
> updated over the weekend.  I'll do another round tomorrow, and will see
> how it looks.  I think we might need some more tests for this to be
> committable, so it might not become committable tomorrow.  I hope we'll
> have something in tree by end of this week, if not I'll send an update.

I was less productive this week than I'd hoped, and creating a testsuite
was more work than I'd anticipated, so I'm slightly lagging behind.  I
hope to have a patchset tomorrow, aiming to commit something
Monday/Tuesday.

- Andres



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
Hi,

On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote:
> Hi, here is updated patch (details inline).

I'm not yet all that happy, sorry:

Looking at 0001:
- GetOldestSafeDecodingTransactionId() only guarantees to return an xid
  safe for decoding (note how procArray->replication_slot_catalog_xmin
  is checked), not one for the initial snapshot - so afaics this whole
  exercise doesn't guarantee much so far.
- A later commit introduces need_full_snapshot as a
  CreateInitDecodingContext, but you don't use it, not here.  That seems
  wrong.
- I remain unhappy with the handling of the reset of effective_xmin in
  FreeDecodingContext(). What if we ERROR/FATAL out before that happens?

What do you think about something like the attached?  I've not yet
tested it any way except running the regression tests.

- Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 25/04/17 00:59, Andres Freund wrote:
> Hi,
> 
> On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote:
>> Hi, here is updated patch (details inline).
> 
> I'm not yet all that happy, sorry:
> 
> Looking at 0001:
> - GetOldestSafeDecodingTransactionId() only guarantees to return an xid
>   safe for decoding (note how procArray->replication_slot_catalog_xmin
>   is checked), not one for the initial snapshot - so afaics this whole
>   exercise doesn't guarantee much so far.
> - A later commit introduces need_full_snapshot as a
>   CreateInitDecodingContext, but you don't use it, not here.  That seems
>   wrong.

Ah yeah looks like that optimization is useful even here.

> - I remain unhappy with the handling of the reset of effective_xmin in
>   FreeDecodingContext(). What if we ERROR/FATAL out before that happens?
> 

Oh your problem was that I did it in FreeDecodingContext() instead of
slot release, that I didn't get, yeah sure it's possibly better place.

> What do you think about something like the attached?  I've not yet
> tested it any way except running the regression tests.
> 

> -    if (!already_locked)
> -        LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> +    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

Don't really understand this change much, but otherwise the patch looks
good to me.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Noah Misch
Date:
On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:
> On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
> > I've since the previous update reviewed Petr's patch, which he since has
> > updated over the weekend.  I'll do another round tomorrow, and will see
> > how it looks.  I think we might need some more tests for this to be
> > committable, so it might not become committable tomorrow.  I hope we'll
> > have something in tree by end of this week, if not I'll send an update.
> 
> I was less productive this week than I'd hoped, and creating a testsuite
> was more work than I'd anticipated, so I'm slightly lagging behind.  I
> hope to have a patchset tomorrow, aiming to commit something
> Monday/Tuesday.

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.  Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:

On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote:
>On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:
>> On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
>> > I've since the previous update reviewed Petr's patch, which he
>since has
>> > updated over the weekend.  I'll do another round tomorrow, and will
>see
>> > how it looks.  I think we might need some more tests for this to be
>> > committable, so it might not become committable tomorrow.  I hope
>we'll
>> > have something in tree by end of this week, if not I'll send an
>update.
>>
>> I was less productive this week than I'd hoped, and creating a
>testsuite
>> was more work than I'd anticipated, so I'm slightly lagging behind.
>I
>> hope to have a patchset tomorrow, aiming to commit something
>> Monday/Tuesday.
>
>This PostgreSQL 10 open item is past due for your status update.
>Kindly send
>a status update within 24 hours, and include a date for your subsequent
>status
>update.  Refer to the policy on open item ownership:
>https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I committed part of the series today, plan to continue doing so over the next few days.  Changes require careful review
&testing, this is easy to get wrong... 
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] snapbuild woes

From
Noah Misch
Date:
On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote:
> I've a bunch of tests, but I don't quite know whether we can expose all
> of them via classical tests.  There are several easy ones that I
> definitely want to add (import "empty" snapshot; import snapshot with
> running xacts; create snapshot, perform some ddl, import snapshot,
> perform some ddl, check things work reasonably crazy), but there's
> enough others that are just probabilistic.  I was wondering about adding
> a loop that simply runs for like 30s and then quits or such, but who
> knows.

If the probabilistic test catches the bug even 5% of the time in typical
configurations, the buildfarm will rapidly identify any regression.  I'd
choose a 7s test that detects the bug 5% of the time over a 30s test that
detects it 99% of the time.  (When I wrote src/bin/pgbench/t/001_pgbench.pl
for a probabilistic bug, I sized that test to finish in 1s and catch its bug
half the time.  In its case, only two buildfarm members were able to
demonstrate the original bug, so 5% detection would have been too low.)



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
Hi,

On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote:
> +    /*
> +     * c) we already seen the xl_running_xacts and tried to do the above.
> +     * However because of race condition in LogStandbySnapshot() there might
> +     * have been transaction reported as running but in reality has already
> +     * written commit record before the xl_running_xacts so decoding has
> +     * missed it. We now see xl_running_xacts that suggests all transactions
> +     * from the original one were closed but the consistent state wasn't
> +     * reached which means the race condition has indeed happened.
> +     *
> +     * Start tracking again as if this was the first xl_running_xacts we've
> +     * seen, with the advantage that because decoding was already running,
> +     * any transactions committed before the xl_running_xacts record will be
> +     * known to us so we won't hit with the same issue again.
> +     */

Unfortunately I don't think that's true, as coded.  You're using
information about committed transactions:

> +    else if (TransactionIdFollows(running->oldestRunningXid,
> +                                  builder->running.xmax))
> +    {
> +        int off;
> +
> +        SnapBuildStartXactTracking(builder, running);
> +
>          /*
> +         * Nark any transactions that are known to have committed before the
> +         * xl_running_xacts as finished to avoid the race condition in
> +         * LogStandbySnapshot().
>           *
> +         * We can use SnapBuildEndTxn directly as it only does the
> +         * transaction running check and handling without any additional
> +         * side effects.
>           */
> +        for (off = 0; off < builder->committed.xcnt; off++)
> +            SnapBuildEndTxn(builder, lsn, builder->committed.xip[off]);

but a transaction might just have *aborted* before the new snapshot, no?
Since we don't keep track of those, I don't think this guarantees anything?

ISTM, we need a xip_status array in SnapBuild->running.  Then, whenever
a xl_running_xacts is encountered while builder->running.xcnt > 0, loop
for i in 0 .. xcnt_space, and end SnapBuildEndTxn() if xip_status[i] but
the xid is not xl_running_xacts?  Does that make sense?

- Andres



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-30 17:59:21 -0700, Andres Freund wrote:
> ISTM, we need a xip_status array in SnapBuild->running.  Then, whenever
> a xl_running_xacts is encountered while builder->running.xcnt > 0, loop
> for i in 0 .. xcnt_space, and end SnapBuildEndTxn() if xip_status[i] but
> the xid is not xl_running_xacts?  Does that make sense?

A hasty implementation, untested besides check-world, of that approach
is attached.  I'm going out for dinner now, will subject it to mean
things afterwards.

Needs more testing, comment and commit message policing.  But I think
the idea is sound?

- Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 01/05/17 03:35, Andres Freund wrote:
> On 2017-04-30 17:59:21 -0700, Andres Freund wrote:
>> ISTM, we need a xip_status array in SnapBuild->running.  Then, whenever
>> a xl_running_xacts is encountered while builder->running.xcnt > 0, loop
>> for i in 0 .. xcnt_space, and end SnapBuildEndTxn() if xip_status[i] but
>> the xid is not xl_running_xacts?  Does that make sense?
> 
> A hasty implementation, untested besides check-world, of that approach
> is attached.  I'm going out for dinner now, will subject it to mean
> things afterwards.
> 
> Needs more testing, comment and commit message policing.  But I think
> the idea is sound?
> 

I agree with adding running, I think that's good thing even for the per
transaction tracking and snapshot exports - we could use the newly added
field to get rid of the issue we have with 'snapshot too large' when
there were many aborted transactions while we waited for running ones to
finish. Because so far only tracked committed transactions, any aborted
one would show as running inside the exported snapshot so it can grow
over maximum number of backends and can't be exported anymore. So +1 for
that part.

But, I still think we need to restart the tracking after new
xl_running_xacts. Reason for that is afaics any of the catalog snapshots
that we assigned to transactions at the end of SnapBuildCommitTxn might
be corrupted otherwise as they were built before we knew one of the
supposedly running txes was actually already committed and that
transaction might have done catalog changes.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Craig Ringer
Date:
On 1 May 2017 at 09:54, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

> But, I still think we need to restart the tracking after new
> xl_running_xacts. Reason for that is afaics any of the catalog snapshots
> that we assigned to transactions at the end of SnapBuildCommitTxn might
> be corrupted otherwise as they were built before we knew one of the
> supposedly running txes was actually already committed and that
> transaction might have done catalog changes.

Due to the race where LogStandbySnapshot() collects running-xacts info
while a concurrent xact commits, such that the xl_xact_commit appears
before the xl_running_xacts, but the xl_running_xacts still has the
commited xact listed as running, right? Because we update PGXACT only
after we write the commit to WAL, so there's a window where an xact is
committed in WAL but not shown as committed in shmem.

-- Craig Ringer                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 01/05/17 04:29, Craig Ringer wrote:
> On 1 May 2017 at 09:54, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> 
>> But, I still think we need to restart the tracking after new
>> xl_running_xacts. Reason for that is afaics any of the catalog snapshots
>> that we assigned to transactions at the end of SnapBuildCommitTxn might
>> be corrupted otherwise as they were built before we knew one of the
>> supposedly running txes was actually already committed and that
>> transaction might have done catalog changes.
> 
> Due to the race where LogStandbySnapshot() collects running-xacts info
> while a concurrent xact commits, such that the xl_xact_commit appears
> before the xl_running_xacts, but the xl_running_xacts still has the
> commited xact listed as running, right? Because we update PGXACT only
> after we write the commit to WAL, so there's a window where an xact is
> committed in WAL but not shown as committed in shmem.
> 

Yes that's what the patch at hand tries to fix, but Andres approached it
from too simplistic standpoint as we don't only care about the exported
snapshot but whatever catalog snapshots we made for the transactions we
already track, unless I am missing something.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote:
> I agree with adding running, I think that's good thing even for the per
> transaction tracking and snapshot exports - we could use the newly added
> field to get rid of the issue we have with 'snapshot too large' when
> there were many aborted transactions while we waited for running ones to
> finish.

I'm not sure of that - what I was proposing would only track this for
the ->running substructure.  How'd that help?


> But, I still think we need to restart the tracking after new
> xl_running_xacts. Reason for that is afaics any of the catalog snapshots
> that we assigned to transactions at the end of SnapBuildCommitTxn might
> be corrupted otherwise as they were built before we knew one of the
> supposedly running txes was actually already committed and that
> transaction might have done catalog changes.

I'm afraid you're right.  But I think this is even more complicated: The
argument in your version that this can only happen once, seems to also
be holey: Just imagine a pg_usleep(3000 * 1000000) right before
ProcArrayEndTransaction() and enjoy the picture.

Wonder if we should just (re-)add a stage between SNAPBUILD_START and
SNAPBUILD_FULL_SNAPSHOT.  Enter SNAPBUILD_BUILD_INITIAL_SNAPSHOT at the
first xl_running_xacts, wait for all transactions to end with my
approach, while populating SnapBuild->committed, only then start
collecting changes for transactions (i.e. return true from
SnapBuildProcessChange()), return true once all xacts have finished
again.  That'd presumably be a bit easier to understand, more robust -
and slower.

- Andres



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 01/05/17 10:03, Andres Freund wrote:
> On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote:
>> I agree with adding running, I think that's good thing even for the per
>> transaction tracking and snapshot exports - we could use the newly added
>> field to get rid of the issue we have with 'snapshot too large' when
>> there were many aborted transactions while we waited for running ones to
>> finish.
> 
> I'm not sure of that - what I was proposing would only track this for
> the ->running substructure.  How'd that help?
> 

Well not as is, but it's a building block for it.

> 
>> But, I still think we need to restart the tracking after new
>> xl_running_xacts. Reason for that is afaics any of the catalog snapshots
>> that we assigned to transactions at the end of SnapBuildCommitTxn might
>> be corrupted otherwise as they were built before we knew one of the
>> supposedly running txes was actually already committed and that
>> transaction might have done catalog changes.
> 
> I'm afraid you're right.  But I think this is even more complicated: The
> argument in your version that this can only happen once, seems to also
> be holey: Just imagine a pg_usleep(3000 * 1000000) right before
> ProcArrayEndTransaction() and enjoy the picture.
>

Well yes, transaction can in theory have written commit/abort xlog
record and stayed in proc for more than single xl_running_xacts write.
But then the condition which we test that the new xl_running_xacts has
bigger xmin than the previously tracked one's xmax would not be
satisfied and we would not enter the relevant code path yet. So I think
we should not be able to get any xids we didn't see. But we have to
restart tracking from beginning (after first checking if we didn't
already see anything that the xl_running_xacts considers as running),
that's what my code did.

> Wonder if we should just (re-)add a stage between SNAPBUILD_START and
> SNAPBUILD_FULL_SNAPSHOT.  Enter SNAPBUILD_BUILD_INITIAL_SNAPSHOT at the
> first xl_running_xacts, wait for all transactions to end with my
> approach, while populating SnapBuild->committed, only then start
> collecting changes for transactions (i.e. return true from
> SnapBuildProcessChange()), return true once all xacts have finished
> again.  That'd presumably be a bit easier to understand, more robust -
> and slower.
> 

That would also work, but per above, I don't understand why it's needed.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote:
>> ... I was wondering about adding
>> a loop that simply runs for like 30s and then quits or such, but who
>> knows.

> If the probabilistic test catches the bug even 5% of the time in typical
> configurations, the buildfarm will rapidly identify any regression.  I'd
> choose a 7s test that detects the bug 5% of the time over a 30s test that
> detects it 99% of the time.  (When I wrote src/bin/pgbench/t/001_pgbench.pl
> for a probabilistic bug, I sized that test to finish in 1s and catch its bug
> half the time.  In its case, only two buildfarm members were able to
> demonstrate the original bug, so 5% detection would have been too low.)

30sec is kind of a big lump from a buildfarm standpoint, especially if
you mean "it runs for 30s on my honkin' fast workstation".  I'm fine
with individual tests that run for ~ 1sec.

(This is top-of-mind for me right now because I've been looking around
for ways to speed up the regression tests.)
        regards, tom lane



Re: [HACKERS] snapbuild woes

From
Andrew Dunstan
Date:

On 05/01/2017 08:46 AM, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote:
>>> ... I was wondering about adding
>>> a loop that simply runs for like 30s and then quits or such, but who
>>> knows.
>> If the probabilistic test catches the bug even 5% of the time in typical
>> configurations, the buildfarm will rapidly identify any regression.  I'd
>> choose a 7s test that detects the bug 5% of the time over a 30s test that
>> detects it 99% of the time.  (When I wrote src/bin/pgbench/t/001_pgbench.pl
>> for a probabilistic bug, I sized that test to finish in 1s and catch its bug
>> half the time.  In its case, only two buildfarm members were able to
>> demonstrate the original bug, so 5% detection would have been too low.)
> 30sec is kind of a big lump from a buildfarm standpoint, especially if
> you mean "it runs for 30s on my honkin' fast workstation".  I'm fine
> with individual tests that run for ~ 1sec.
>
> (This is top-of-mind for me right now because I've been looking around
> for ways to speed up the regression tests.)
>
>             


Yes, me too. We're getting a bit lazy about that - see thread nearby
that will let us avoid unnecessary temp installs among other things.


cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-01 08:46:47 -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Fri, Apr 21, 2017 at 10:34:58PM -0700, Andres Freund wrote:
> >> ... I was wondering about adding
> >> a loop that simply runs for like 30s and then quits or such, but who
> >> knows.
> 
> > If the probabilistic test catches the bug even 5% of the time in typical
> > configurations, the buildfarm will rapidly identify any regression.  I'd
> > choose a 7s test that detects the bug 5% of the time over a 30s test that
> > detects it 99% of the time.  (When I wrote src/bin/pgbench/t/001_pgbench.pl
> > for a probabilistic bug, I sized that test to finish in 1s and catch its bug
> > half the time.  In its case, only two buildfarm members were able to
> > demonstrate the original bug, so 5% detection would have been too low.)
> 
> 30sec is kind of a big lump from a buildfarm standpoint, especially if
> you mean "it runs for 30s on my honkin' fast workstation".  I'm fine
> with individual tests that run for ~ 1sec.

I was more thinking of pgench -T$XX, rather than constant number of
iterations.  I currently can reproduce the issues within like 3-4
minutes, so 5s is probably not quite sufficient to get decent coverage.


- Andres



Re: [HACKERS] snapbuild woes

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-05-01 08:46:47 -0400, Tom Lane wrote:
>> 30sec is kind of a big lump from a buildfarm standpoint, especially if
>> you mean "it runs for 30s on my honkin' fast workstation".  I'm fine
>> with individual tests that run for ~ 1sec.

> I was more thinking of pgench -T$XX, rather than constant number of
> iterations.  I currently can reproduce the issues within like 3-4
> minutes, so 5s is probably not quite sufficient to get decent coverage.

Adding a five-minute pgbench run to the buildfarm sequence is definitely
going to get you ridden out of town on a rail.  But quite aside from the
question of whether we can afford the cycles, it seems like the wrong
approach.  IMO the buildfarm is mainly for verifying portability, not for
trying to prove that race-like conditions don't exist.  In most situations
we're going out of our way to ensure reproduceability of tests we add to
the buildfarm sequence; but it seems like this is looking for
irreproducible results.
        regards, tom lane



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-01 12:32:07 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-05-01 08:46:47 -0400, Tom Lane wrote:
> >> 30sec is kind of a big lump from a buildfarm standpoint, especially if
> >> you mean "it runs for 30s on my honkin' fast workstation".  I'm fine
> >> with individual tests that run for ~ 1sec.
> 
> > I was more thinking of pgench -T$XX, rather than constant number of
> > iterations.  I currently can reproduce the issues within like 3-4
> > minutes, so 5s is probably not quite sufficient to get decent coverage.
> 
> Adding a five-minute pgbench run to the buildfarm sequence is definitely
> going to get you ridden out of town on a rail.

Right - that was referring to Noah's comment upthread:

On 2017-04-29 14:42:01 -0700, Noah Misch wrote:
> If the probabilistic test catches the bug even 5% of the time in typical
> configurations, the buildfarm will rapidly identify any regression.  I'd
> choose a 7s test that detects the bug 5% of the time over a 30s test that
> detects it 99% of the time.  (When I wrote src/bin/pgbench/t/001_pgbench.pl
> for a probabilistic bug, I sized that test to finish in 1s and catch its bug
> half the time.  In its case, only two buildfarm members were able to
> demonstrate the original bug, so 5% detection would have been too low.)

and I suspect that you'd not find these within 5s within sufficient
time, because the detection rate would be too low.



> But quite aside from the question of whether we can afford the cycles,
> it seems like the wrong approach.  IMO the buildfarm is mainly for
> verifying portability, not for trying to prove that race-like
> conditions don't exist.  In most situations we're going out of our way
> to ensure reproduceability of tests we add to the buildfarm sequence;
> but it seems like this is looking for irreproducible results.

Yea, I wondered about that upthread as well.  But the tests are quite
useful nonetheless.  Wonder about adding them simply as a separate
target.

- Andres



Re: [HACKERS] snapbuild woes

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-05-01 12:32:07 -0400, Tom Lane wrote:
>> But quite aside from the question of whether we can afford the cycles,
>> it seems like the wrong approach.  IMO the buildfarm is mainly for
>> verifying portability, not for trying to prove that race-like
>> conditions don't exist.  In most situations we're going out of our way
>> to ensure reproduceability of tests we add to the buildfarm sequence;
>> but it seems like this is looking for irreproducible results.

> Yea, I wondered about that upthread as well.  But the tests are quite
> useful nonetheless.  Wonder about adding them simply as a separate
> target.

I have no objection to adding more tests as a non-default target.
        regards, tom lane



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-01 11:09:44 +0200, Petr Jelinek wrote:
> On 01/05/17 10:03, Andres Freund wrote:
> > On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote:

> >> But, I still think we need to restart the tracking after new
> >> xl_running_xacts. Reason for that is afaics any of the catalog snapshots
> >> that we assigned to transactions at the end of SnapBuildCommitTxn might
> >> be corrupted otherwise as they were built before we knew one of the
> >> supposedly running txes was actually already committed and that
> >> transaction might have done catalog changes.
> > 
> > I'm afraid you're right.  But I think this is even more complicated: The
> > argument in your version that this can only happen once, seems to also
> > be holey: Just imagine a pg_usleep(3000 * 1000000) right before
> > ProcArrayEndTransaction() and enjoy the picture.

> Well yes, transaction can in theory have written commit/abort xlog
> record and stayed in proc for more than single xl_running_xacts write.
> But then the condition which we test that the new xl_running_xacts has
> bigger xmin than the previously tracked one's xmax would not be
> satisfied and we would not enter the relevant code path yet. So I think
> we should not be able to get any xids we didn't see. But we have to
> restart tracking from beginning (after first checking if we didn't
> already see anything that the xl_running_xacts considers as running),
> that's what my code did.

But to get that correct, we'd have to not only track ->committed, but
also somehow maintain ->aborted, and not just for the transactions in
the original set of running transactions.  That'd be fairly complicated
and large.  The reason I was trying - and it's definitely not correct as
I had proposed - to use the original running_xacts record is that that
only required tracking as many transaction statuses as in the first
xl_running_xacts.  Am I missing something?



The probabilistic tests catch the issues here fairly quickly, btw, if
you run with synchronous_commit=on, while pgbench is running, because
the WAL flushes make this more likely.  Runs this query:

SELECT account_count, teller_count, account_sum - teller_sum s
FROM   (       SELECT count(*) account_count, SUM(abalance) account_sum       FROM pgbench_accounts   ) a,   (
SELECTcount(*) teller_count, SUM(tbalance) teller_sum       FROM pgbench_tellers   ) t
 

which, for my scale, should always return:
┌─────────┬─────┬───┐
│    a    │  t  │ s │
├─────────┼─────┼───┤
│ 2000000 │ 200 │ 0 │
└─────────┴─────┴───┘
but with my POC patch occasionally returns things like:
┌─────────┬─────┬───────┐
│    a    │  t  │   s   │
├─────────┼─────┼───────┤
│ 2000000 │ 212 │ 37358 │
└─────────┴─────┴───────┘

which obviously shouldn't be the case.



Re: [HACKERS] snapbuild woes

From
Peter Eisentraut
Date:
On 5/1/17 13:02, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-05-01 12:32:07 -0400, Tom Lane wrote:
>>> But quite aside from the question of whether we can afford the cycles,
>>> it seems like the wrong approach.  IMO the buildfarm is mainly for
>>> verifying portability, not for trying to prove that race-like
>>> conditions don't exist.  In most situations we're going out of our way
>>> to ensure reproduceability of tests we add to the buildfarm sequence;
>>> but it seems like this is looking for irreproducible results.
> 
>> Yea, I wondered about that upthread as well.  But the tests are quite
>> useful nonetheless.  Wonder about adding them simply as a separate
>> target.
> 
> I have no objection to adding more tests as a non-default target.

Well, the problem with nondefault targets is that they are hard to find
if you don't know them, and then they will rot.

Sure, we need a way to distinguish different classes of tests, but lets
think about the bigger scheme, too.  Ideas welcome.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] snapbuild woes

From
Noah Misch
Date:
On Mon, May 01, 2017 at 12:32:07PM -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-05-01 08:46:47 -0400, Tom Lane wrote:
> >> 30sec is kind of a big lump from a buildfarm standpoint, especially if
> >> you mean "it runs for 30s on my honkin' fast workstation".  I'm fine
> >> with individual tests that run for ~ 1sec.
> 
> > I was more thinking of pgench -T$XX, rather than constant number of
> > iterations.  I currently can reproduce the issues within like 3-4
> > minutes, so 5s is probably not quite sufficient to get decent coverage.

You might hit the race faster by adding a dedicated stress test function to
regress.c.

> IMO the buildfarm is mainly for verifying portability, not for
> trying to prove that race-like conditions don't exist.

Perhaps so, but it has excelled at both tasks.



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 01/05/17 21:14, Andres Freund wrote:
> On 2017-05-01 11:09:44 +0200, Petr Jelinek wrote:
>> On 01/05/17 10:03, Andres Freund wrote:
>>> On 2017-05-01 03:54:49 +0200, Petr Jelinek wrote:
> 
>>>> But, I still think we need to restart the tracking after new
>>>> xl_running_xacts. Reason for that is afaics any of the catalog snapshots
>>>> that we assigned to transactions at the end of SnapBuildCommitTxn might
>>>> be corrupted otherwise as they were built before we knew one of the
>>>> supposedly running txes was actually already committed and that
>>>> transaction might have done catalog changes.
>>>
>>> I'm afraid you're right.  But I think this is even more complicated: The
>>> argument in your version that this can only happen once, seems to also
>>> be holey: Just imagine a pg_usleep(3000 * 1000000) right before
>>> ProcArrayEndTransaction() and enjoy the picture.
> 
>> Well yes, transaction can in theory have written commit/abort xlog
>> record and stayed in proc for more than single xl_running_xacts write.
>> But then the condition which we test that the new xl_running_xacts has
>> bigger xmin than the previously tracked one's xmax would not be
>> satisfied and we would not enter the relevant code path yet. So I think
>> we should not be able to get any xids we didn't see. But we have to
>> restart tracking from beginning (after first checking if we didn't
>> already see anything that the xl_running_xacts considers as running),
>> that's what my code did.
> 
> But to get that correct, we'd have to not only track ->committed, but
> also somehow maintain ->aborted, and not just for the transactions in
> the original set of running transactions.  That'd be fairly complicated
> and large.  The reason I was trying - and it's definitely not correct as
> I had proposed - to use the original running_xacts record is that that
> only required tracking as many transaction statuses as in the first
> xl_running_xacts.  Am I missing something?

Aah, now I understand we talked about slightly different things, I
considered the running thing to be first step towards tracking aborted
txes everywhere. I am not sure if it's complicated, it would be exactly
the same as committed tracking, except we'd do it only before we reach
SNAPBUILD_CONSISTENT. It would be definitely larger patch I agree, but I
can give it at try.

If you think that adding the SNAPBUILD_BUILD_INITIAL_SNAPSHOT would be
less invasive/smaller patch I am okay with doing that for PG10. I think
we'll have to revisit tracking of aborted transactions in PG11 then
though because of the 'snapshot too large' issue when exporting, at
least I don't see any other way to fix that.

> 
> The probabilistic tests catch the issues here fairly quickly, btw, if
> you run with synchronous_commit=on, while pgbench is running, because
> the WAL flushes make this more likely.  Runs this query:
> 
> SELECT account_count, teller_count, account_sum - teller_sum s
> FROM
>     (
>         SELECT count(*) account_count, SUM(abalance) account_sum
>         FROM pgbench_accounts
>     ) a,
>     (
>         SELECT count(*) teller_count, SUM(tbalance) teller_sum
>         FROM pgbench_tellers
>     ) t
> 
> which, for my scale, should always return:
> ┌─────────┬─────┬───┐
> │    a    │  t  │ s │
> ├─────────┼─────┼───┤
> │ 2000000 │ 200 │ 0 │
> └─────────┴─────┴───┘
> but with my POC patch occasionally returns things like:
> ┌─────────┬─────┬───────┐
> │    a    │  t  │   s   │
> ├─────────┼─────┼───────┤
> │ 2000000 │ 212 │ 37358 │
> └─────────┴─────┴───────┘
> 
> which obviously shouldn't be the case.
> 

Very nice (the test, not the failures ;)) !

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Noah Misch
Date:
On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote:
> 
> 
> On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote:
> >On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:
> >> On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
> >> > I've since the previous update reviewed Petr's patch, which he
> >since has
> >> > updated over the weekend.  I'll do another round tomorrow, and will
> >see
> >> > how it looks.  I think we might need some more tests for this to be
> >> > committable, so it might not become committable tomorrow.  I hope
> >we'll
> >> > have something in tree by end of this week, if not I'll send an
> >update.
> >> 
> >> I was less productive this week than I'd hoped, and creating a
> >testsuite
> >> was more work than I'd anticipated, so I'm slightly lagging behind. 
> >I
> >> hope to have a patchset tomorrow, aiming to commit something
> >> Monday/Tuesday.
> >
> >This PostgreSQL 10 open item is past due for your status update. 
> >Kindly send
> >a status update within 24 hours, and include a date for your subsequent
> >status
> >update.  Refer to the policy on open item ownership:
> >https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> I committed part of the series today, plan to continue doing so over the next few days.  Changes require careful
review& testing, this is easy to get wrong...
 

This PostgreSQL 10 open item is past due for your status update.  Kindly send
a status update within 24 hours, and include a date for your subsequent status
update.

Also, this open item has been alive for three weeks, well above guideline.  I
understand it's a tricky bug, but I'm worried this isn't on track to end.
What is missing to make it end?

Refer to the policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 04/05/17 07:45, Noah Misch wrote:
> On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote:
>>
>>
>> On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com> wrote:
>>> On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:
>>>> On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
>>>>> I've since the previous update reviewed Petr's patch, which he
>>> since has
>>>>> updated over the weekend.  I'll do another round tomorrow, and will
>>> see
>>>>> how it looks.  I think we might need some more tests for this to be
>>>>> committable, so it might not become committable tomorrow.  I hope
>>> we'll
>>>>> have something in tree by end of this week, if not I'll send an
>>> update.
>>>>
>>>> I was less productive this week than I'd hoped, and creating a
>>> testsuite
>>>> was more work than I'd anticipated, so I'm slightly lagging behind. 
>>> I
>>>> hope to have a patchset tomorrow, aiming to commit something
>>>> Monday/Tuesday.
>>>
>>> This PostgreSQL 10 open item is past due for your status update. 
>>> Kindly send
>>> a status update within 24 hours, and include a date for your subsequent
>>> status
>>> update.  Refer to the policy on open item ownership:
>>> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
>>
>> I committed part of the series today, plan to continue doing so over the next few days.  Changes require careful
review& testing, this is easy to get wrong...
 
> 
> This PostgreSQL 10 open item is past due for your status update.  Kindly send
> a status update within 24 hours, and include a date for your subsequent status
> update.
> 
> Also, this open item has been alive for three weeks, well above guideline.  I
> understand it's a tricky bug, but I'm worried this isn't on track to end.
> What is missing to make it end?

It's tricky 5 bugs, and they are quite sensitive on rare
timing/concurrency events.

First two are fixed, we can live with 5th being done later (as it's not
correctness fix, but very bad performance one).

We haven't figured way for fixing the 4th one that we all agree is good
solution (everything proposed so far still had bugs).

I am not quite sure what happened to the 3rd one.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
Hi,

On 2017-05-02 08:55:53 +0200, Petr Jelinek wrote:
> Aah, now I understand we talked about slightly different things, I
> considered the running thing to be first step towards tracking aborted
> txes everywhere.
> I think
> we'll have to revisit tracking of aborted transactions in PG11 then
> though because of the 'snapshot too large' issue when exporting, at
> least I don't see any other way to fix that.

FWIW, that seems unnecessary - we can just check for that using the
clog.  Should be very simple to check for aborted xacts when exporting
the snapshot (like 2 lines + comments).  That should address your
concern, right?


> If you think that adding the SNAPBUILD_BUILD_INITIAL_SNAPSHOT would be
> less invasive/smaller patch I am okay with doing that for PG10.

Attached is a prototype patch for that.

What I decided is that essentially tracking the running xacts is too
unrealiable due to the race, so I decided that just relying on
oldestRunningXid and nextXid - which are solely in the procArray and
thus racefree - is better.

It's not perfect yet, primarily because we'd need to take a bit more
care about being ABI compatible for older releases, and because we'd
probably have to trigger LogStandbySnapshot() a bit more frequently
(presumably while waiting for WAL).  The change means we'll have to wait
a bit longer for slot creation, but it's considerably simpler / more
robust.

Could you have a look?

Regards,

Andres



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-04 17:00:04 -0700, Andres Freund wrote:
> Attached is a prototype patch for that.

Oops.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 05/05/17 02:00, Andres Freund wrote:
> Hi,
> 
> On 2017-05-02 08:55:53 +0200, Petr Jelinek wrote:
>> Aah, now I understand we talked about slightly different things, I
>> considered the running thing to be first step towards tracking aborted
>> txes everywhere.
>> I think
>> we'll have to revisit tracking of aborted transactions in PG11 then
>> though because of the 'snapshot too large' issue when exporting, at
>> least I don't see any other way to fix that.
> 
> FWIW, that seems unnecessary - we can just check for that using the
> clog.  Should be very simple to check for aborted xacts when exporting
> the snapshot (like 2 lines + comments).  That should address your
> concern, right?

Right, because there isn't practical difference between running and
aborted transaction for us so we don't mind if the abort has happened in
the future. Yeah the export code could do the check seems quite simple.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 05/05/17 02:42, Andres Freund wrote:
> On 2017-05-04 17:00:04 -0700, Andres Freund wrote:
>> Attached is a prototype patch for that.
> 

I am not sure I understand the ABI comment for started_collection_at.
What's the ABI issue? The struct is private to snapbuild.c module. Or
you want to store it in the ondisk snapshot as well?

About better name for it what about something like oldest_full_xact?

Otherwise the logic seems to be right on first read, will ponder it a
bit more

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-05 13:53:16 +0200, Petr Jelinek wrote:
> On 05/05/17 02:42, Andres Freund wrote:
> > On 2017-05-04 17:00:04 -0700, Andres Freund wrote:
> >> Attached is a prototype patch for that.
> > 
> 
> I am not sure I understand the ABI comment for started_collection_at.
> What's the ABI issue? The struct is private to snapbuild.c module. Or
> you want to store it in the ondisk snapshot as well?

It's stored on-disk already :(


> Otherwise the logic seems to be right on first read, will ponder it a
> bit more

Cool!

Greetings,

Andres Freund



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 05/05/17 18:18, Andres Freund wrote:
> On 2017-05-05 13:53:16 +0200, Petr Jelinek wrote:
>> On 05/05/17 02:42, Andres Freund wrote:
>>> On 2017-05-04 17:00:04 -0700, Andres Freund wrote:
>>>> Attached is a prototype patch for that.
>>>
>>
>> I am not sure I understand the ABI comment for started_collection_at.
>> What's the ABI issue? The struct is private to snapbuild.c module. Or
>> you want to store it in the ondisk snapshot as well?
> 
> It's stored on-disk already :(

Hmm okay, well then I guess we'll have to store it somewhere inside
running, we should not normally care about that since we only load
CONSISTENT snapshots where running don't matter anymore. Alternatively
we could bump the SNAPBUILD_VERSION but that would make it impossible to
downgrade minor version which is bad (I guess we'll want to do that for
master, but not back-branches).

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:

On May 3, 2017 10:45:16 PM PDT, Noah Misch <noah@leadboat.com> wrote:
>On Thu, Apr 27, 2017 at 09:42:58PM -0700, Andres Freund wrote:
>>
>>
>> On April 27, 2017 9:34:44 PM PDT, Noah Misch <noah@leadboat.com>
>wrote:
>> >On Fri, Apr 21, 2017 at 10:36:21PM -0700, Andres Freund wrote:
>> >> On 2017-04-17 21:16:57 -0700, Andres Freund wrote:
>> >> > I've since the previous update reviewed Petr's patch, which he
>> >since has
>> >> > updated over the weekend.  I'll do another round tomorrow, and
>will
>> >see
>> >> > how it looks.  I think we might need some more tests for this to
>be
>> >> > committable, so it might not become committable tomorrow.  I
>hope
>> >we'll
>> >> > have something in tree by end of this week, if not I'll send an
>> >update.
>> >>
>> >> I was less productive this week than I'd hoped, and creating a
>> >testsuite
>> >> was more work than I'd anticipated, so I'm slightly lagging
>behind.
>> >I
>> >> hope to have a patchset tomorrow, aiming to commit something
>> >> Monday/Tuesday.
>> >
>> >This PostgreSQL 10 open item is past due for your status update.
>> >Kindly send
>> >a status update within 24 hours, and include a date for your
>subsequent
>> >status
>> >update.  Refer to the policy on open item ownership:
>>
>>https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
>>
>> I committed part of the series today, plan to continue doing so over
>the next few days.  Changes require careful review & testing, this is
>easy to get wrong...
>
>This PostgreSQL 10 open item is past due for your status update.
>Kindly send
>a status update within 24 hours, and include a date for your subsequent
>status
>update.
>
>Also, this open item has been alive for three weeks, well above
>guideline.  I
>understand it's a tricky bug, but I'm worried this isn't on track to
>end.
>What is missing to make it end?
>
>Refer to the policy on open item ownership:
>https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I plan to commit the next pending patch after the back branch releases are cut - it's an invasive fix and the issue
doesn'tcause corruption "just" slow slot creation. So it seems better to wait for a few days, rather than hurry it into
therelease. 

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] snapbuild woes

From
Erik Rijkers
Date:
On 2017-05-05 02:00, Andres Freund wrote:
> 
> Could you have a look?

Running tests with these three patches:

> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
> fix-statistics-reporting-in-logical-replication-work.patch    (on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication 
connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where error means: at least 
1 of the 4 pgbench tables is not md5-equal). It seems better, faster 
machines yield less errors.

Normally I see in pg_stat_replication (on master) one process in state 
'streaming'.
 pid  |     wal     | replay_loc  |   diff   |   state   |   app   | 
sync_state
16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | 
async

Often there are another two processes in pg_stat_replication that remain 
in state 'startup'.

In the failing sessions the 'streaming'-state process is missing; in 
failing sessions there are only the two processes that are and remain in 
'startup'.

FWIW, below  the output of a succesful and a failed run:


------------------ successful run:
creating tables...
1590400 of 2500000 tuples (63%) done (elapsed 5.34 s, remaining 3.05 s)
2500000 of 2500000 tuples (100%) done (elapsed 9.63 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' 
publication pub1 with (disabled);
alter subscription sub1 enable;
-- pgbench -c 90 -j 8 -T 900 -P 180 -n   --  scale 25
progress: 180.0 s, 82.5 tps, lat 1086.845 ms stddev 3211.785
progress: 360.0 s, 25.4 tps, lat 3469.040 ms stddev 6297.440
progress: 540.0 s, 28.9 tps, lat 3131.438 ms stddev 4288.130
progress: 720.0 s, 27.5 tps, lat 3285.024 ms stddev 4113.841
progress: 900.0 s, 47.2 tps, lat 1896.698 ms stddev 2182.695
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 38175
latency average = 2128.606 ms
latency stddev = 3948.634 ms
tps = 42.151205 (including connections establishing)
tps = 42.151589 (excluding connections establishing)
-- waiting 0s... (always) port | pg_stat_replication | pid  |     wal     | replay_loc  | diff | 
?column? |   state   |   app   | sync_state 6972 | pg_stat_replication | 2545 | 18/432B2180 | 18/432B2180 |    0 | 
t        | streaming | derail2 | async

2017.05.08 23:19:22
-- getting md5 (cb)
6972 a,b,t,h:  2500000     25    250  38175  b2ba48b53 b3788a837 
d1afac950 d4abcc72e master
6973 a,b,t,h:  2500000     25    250  38175  b2ba48b53 b3788a837 
d1afac950 d4abcc72e replica ok  bee2312c7
2017.05.08 23:20:48
 port | pg_stat_replication | pid  |     wal     | replay_loc  |   diff   | ?column? |   state   |   app   | sync_state
6972| pg_stat_replication | 2545 | 18/4AEEC8C0 | 18/453FBD20 | 
 
95357856 | f        | streaming | derail2 | async
------------------------------------


------------------ failure:
creating tables...
1777100 of 2500000 tuples (71%) done (elapsed 5.06 s, remaining 2.06 s)
2500000 of 2500000 tuples (100%) done (elapsed 7.41 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972 application_name=derail2' 
publication pub1 with (disabled);
alter subscription sub1 enable; port | pg_stat_replication |  pid  |     wal     | replay_loc | diff | 
?column? |  state  |   app   | sync_state 6972 | pg_stat_replication | 11945 | 18/5E2913D0 |            |      |
 | catchup | derail2 | async
 

-- pgbench -c 90 -j 8 -T 900 -P 180 -n   --  scale 25
progress: 180.0 s, 78.4 tps, lat 1138.348 ms stddev 2884.815
progress: 360.0 s, 69.2 tps, lat 1309.716 ms stddev 2594.231
progress: 540.0 s, 59.0 tps, lat 1519.146 ms stddev 2033.400
progress: 720.0 s, 62.9 tps, lat 1421.854 ms stddev 1775.066
progress: 900.0 s, 57.0 tps, lat 1575.693 ms stddev 1681.800
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 25
query mode: simple
number of clients: 90
number of threads: 8
duration: 900 s
number of transactions actually processed: 58846
latency average = 1378.259 ms
latency stddev = 2304.159 ms
tps = 65.224168 (including connections establishing)
tps = 65.224788 (excluding connections establishing)
-- waiting 0s... (always) port | pg_stat_replication |  pid  |     wal     | replay_loc | diff | 
?column? |  state  |   app   | sync_state 6972 | pg_stat_replication | 11948 | 18/7469A038 |            |      |
 | startup | derail2 | async 6972 | pg_stat_replication | 12372 | 18/7469A038 |            |      |          | startup
|derail2 | async
 

------------------------------------

During my tests, I keep an eye on pg_stat_replication (refreshing every 
2s), and when I see those two 'startup'-lines in pg_stat_replication 
without any 'streaming'-line I know the test is going to fail.  I 
believe this has been true for all failure cases that I've seen (except 
the much more rare stuck-DROP-SUBSCRIPTION which is mentioned in another 
thread).

Sorry, I have not been able to get any thing more clear or definitive...


thanks,


Erik Rijkers











Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 09/05/17 00:03, Erik Rijkers wrote:
> On 2017-05-05 02:00, Andres Freund wrote:
>>
>> Could you have a look?
> 
> Running tests with these three patches:
> 
>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
>> fix-statistics-reporting-in-logical-replication-work.patch
>     (on top of 44c528810)
> 
> I test by 15-minute pgbench runs while there is a logical replication
> connection. Primary and replica are on the same machine.
> 
> I have seen errors on 3 different machines (where error means: at least
> 1 of the 4 pgbench tables is not md5-equal). It seems better, faster
> machines yield less errors.
> 
> Normally I see in pg_stat_replication (on master) one process in state
> 'streaming'.
> 
>  pid  |     wal     | replay_loc  |   diff   |   state   |   app   |
> sync_state
> 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async
> 
> Often there are another two processes in pg_stat_replication that remain
> in state 'startup'.
> 
> In the failing sessions the 'streaming'-state process is missing; in
> failing sessions there are only the two processes that are and remain in
> 'startup'.

Hmm, startup is the state where slot creation is happening. I wonder if
it's just taking long time to create snapshot because of the 5th issue
which is not yet fixed (and the original patch will not apply on top of
this change). Alternatively there is a bug in this patch.

Did you see high CPU usage during the test when there were those
"startup" state walsenders?

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Erik Rijkers
Date:
On 2017-05-09 10:50, Petr Jelinek wrote:
> On 09/05/17 00:03, Erik Rijkers wrote:
>> On 2017-05-05 02:00, Andres Freund wrote:
>>> 
>>> Could you have a look?
>> 
>> Running tests with these three patches:
>> 
>>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
>>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
>>> fix-statistics-reporting-in-logical-replication-work.patch
>>     (on top of 44c528810)
>> 
>> I test by 15-minute pgbench runs while there is a logical replication
>> connection. Primary and replica are on the same machine.
>> 
>> I have seen errors on 3 different machines (where error means: at 
>> least
>> 1 of the 4 pgbench tables is not md5-equal). It seems better, faster
>> machines yield less errors.
>> 
>> Normally I see in pg_stat_replication (on master) one process in state
>> 'streaming'.
>> 
>>  pid  |     wal     | replay_loc  |   diff   |   state   |   app   |
>> sync_state
>> 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 | 
>> async
>> 
>> Often there are another two processes in pg_stat_replication that 
>> remain
>> in state 'startup'.
>> 
>> In the failing sessions the 'streaming'-state process is missing; in
>> failing sessions there are only the two processes that are and remain 
>> in
>> 'startup'.
> 
> Hmm, startup is the state where slot creation is happening. I wonder if
> it's just taking long time to create snapshot because of the 5th issue
> which is not yet fixed (and the original patch will not apply on top of
> this change). Alternatively there is a bug in this patch.
> 
> Did you see high CPU usage during the test when there were those
> "startup" state walsenders?
> 

I haven't noticed but I didn't pay attention to that particularly.

I'll try to get some CPU-info logged...




Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 09/05/17 10:59, Erik Rijkers wrote:
> On 2017-05-09 10:50, Petr Jelinek wrote:
>> On 09/05/17 00:03, Erik Rijkers wrote:
>>> On 2017-05-05 02:00, Andres Freund wrote:
>>>>
>>>> Could you have a look?
>>>
>>> Running tests with these three patches:
>>>
>>>> 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
>>>> 0002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
>>>> fix-statistics-reporting-in-logical-replication-work.patch
>>>     (on top of 44c528810)
>>>
>>> I test by 15-minute pgbench runs while there is a logical replication
>>> connection. Primary and replica are on the same machine.
>>>
>>> I have seen errors on 3 different machines (where error means: at least
>>> 1 of the 4 pgbench tables is not md5-equal). It seems better, faster
>>> machines yield less errors.
>>>
>>> Normally I see in pg_stat_replication (on master) one process in state
>>> 'streaming'.
>>>
>>>  pid  |     wal     | replay_loc  |   diff   |   state   |   app   |
>>> sync_state
>>> 16495 | 11/EDBC0000 | 11/EA3FEEE8 | 58462488 | streaming | derail2 |
>>> async
>>>
>>> Often there are another two processes in pg_stat_replication that remain
>>> in state 'startup'.
>>>
>>> In the failing sessions the 'streaming'-state process is missing; in
>>> failing sessions there are only the two processes that are and remain in
>>> 'startup'.
>>
>> Hmm, startup is the state where slot creation is happening. I wonder if
>> it's just taking long time to create snapshot because of the 5th issue
>> which is not yet fixed (and the original patch will not apply on top of
>> this change). Alternatively there is a bug in this patch.
>>
>> Did you see high CPU usage during the test when there were those
>> "startup" state walsenders?
>>
> 
> I haven't noticed but I didn't pay attention to that particularly.
> 
> I'll try to get some CPU-info logged...
> 

I rebased the above mentioned patch to apply to the patches Andres sent,
if you could try to add it on top of what you have and check if it still
fails, that would be helpful.

Thanks!

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Erik Rijkers
Date:
On 2017-05-09 11:50, Petr Jelinek wrote:
> On 09/05/17 10:59, Erik Rijkers wrote:
>> On 2017-05-09 10:50, Petr Jelinek wrote:
>>> On 09/05/17 00:03, Erik Rijkers wrote:
>>>> On 2017-05-05 02:00, Andres Freund wrote:
>>>>> 
>>>>> Could you have a look?

[...]

> I rebased the above mentioned patch to apply to the patches Andres 
> sent,
> if you could try to add it on top of what you have and check if it 
> still
> fails, that would be helpful.

I suppose you mean these; but they do not apply anymore:

20170505/0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
20170505/0002-WIP-Possibly-more-robust-snapbuild-approach.patch

Andres, any change you could update them?

alternatively I could use the older version again..

thanks,

Erik Rijkers



Re: [HACKERS] snapbuild woes

From
Erik Rijkers
Date:
On 2017-05-09 11:50, Petr Jelinek wrote:

> I rebased the above mentioned patch to apply to the patches Andres 
> sent,
> if you could try to add it on top of what you have and check if it 
> still
> fails, that would be helpful.

It still fails.

With these patches

- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
- 00002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
- fix-statistics-reporting-in-logical-replication-work.patch +
- Skip-unnecessary-snapshot-builds.patch

built again on top of 44c528810a1 ( so I had to add the 
'fix-statistics-rep*' patch because without it I immediately got that 
Assertion failure again ).

As always most runs succeed (especially on this large 192GB 16-core 
server).

But attached is an output file of a number of runs of my 
pgbench_derail2.sh test.

Overal result:

-- out_20170509_1635.txt
       3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n   --  scale 25
       2 -- All is well.
       1 -- Not good, but breaking out of wait (21 times no change)

I broke it off after iteration 4, so 5 never ran, and
iteration 1 failed due to a mistake in the harness (somethind stupid I 
did) - not interesting.

iteration 2 succeeds. (eventually has 'replica ok')

iteration 3 succeeds. (eventually has 'replica ok')

iteration 4 fails.
   Just after 'alter subscription sub1 enable' I caught (as is usual) 
pg_stat_replication.state as 'catchup'. So far so good.
   After the 15-minute pgbench run pg_stat_replication has only 2 
'startup' lines (and none 'catchup' or 'streaming'):

  port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | 
?column? |  state  |   app   | sync_state
  6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      | 
          | startup | derail2 | async
  6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      | 
          | startup | derail2 | async

(that's from:
    select $port1 as port,'pg_stat_replication' as pg_stat_replication, 
pid
      , pg_current_wal_location() wal, replay_location replay_loc, 
pg_current_wal_location() - replay_location as diff
      , pg_current_wal_location() <= replay_location
      , state, application_name as app, sync_state
    from pg_stat_replication
)

This remains in this state for as long as my test-programs lets it 
(i.e., 20 x 30s, or something like that, and then the loop is exited); 
in the ouput file it says: 'Not good, but breaking out of wait'

Below is the accompanying ps (with the 2 'deranged senders' as Jeff 
Janes would surely call them):


UID         PID   PPID  C STIME TTY      STAT   TIME CMD
rijkers  107147      1  0 17:11 pts/35   S+     0:00 
/var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres 
-D /var/data1/pg_stuff/pg_installations
rijkers  107149 107147  0 17:11 ?        Ss     0:00  \_ postgres: 
logger process
rijkers  107299 107147  0 17:11 ?        Ss     0:01  \_ postgres: 
checkpointer process
rijkers  107300 107147  0 17:11 ?        Ss     0:00  \_ postgres: 
writer process
rijkers  107301 107147  0 17:11 ?        Ss     0:00  \_ postgres: wal 
writer process
rijkers  107302 107147  0 17:11 ?        Ss     0:00  \_ postgres: 
autovacuum launcher process
rijkers  107303 107147  0 17:11 ?        Ss     0:00  \_ postgres: stats 
collector process
rijkers  107304 107147  0 17:11 ?        Ss     0:00  \_ postgres: 
bgworker: logical replication launcher
rijkers  108348 107147  0 17:12 ?        Ss     0:01  \_ postgres: 
bgworker: logical replication worker for subscription 70310 sync 70293
rijkers  108350 107147  0 17:12 ?        Ss     0:00  \_ postgres: 
bgworker: logical replication worker for subscription 70310 sync 70298
rijkers  107145      1  0 17:11 pts/35   S+     0:02 
/var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres 
-D /var/data1/pg_stuff/pg_installations
rijkers  107151 107145  0 17:11 ?        Ss     0:00  \_ postgres: 
logger process
rijkers  107160 107145  0 17:11 ?        Ss     0:08  \_ postgres: 
checkpointer process
rijkers  107161 107145  0 17:11 ?        Ss     0:07  \_ postgres: 
writer process
rijkers  107162 107145  0 17:11 ?        Ss     0:02  \_ postgres: wal 
writer process
rijkers  107163 107145  0 17:11 ?        Ss     0:00  \_ postgres: 
autovacuum launcher process
rijkers  107164 107145  0 17:11 ?        Ss     0:02  \_ postgres: stats 
collector process
rijkers  107165 107145  0 17:11 ?        Ss     0:00  \_ postgres: 
bgworker: logical replication launcher
rijkers  108349 107145  0 17:12 ?        Ss     0:27  \_ postgres: wal 
sender process rijkers [local] idle
rijkers  108351 107145  0 17:12 ?        Ss     0:26  \_ postgres: wal 
sender process rijkers [local] idle

I have had no time to add (or view) any CPUinfo.


Erik Rijkers




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 09/05/17 19:54, Erik Rijkers wrote:
> On 2017-05-09 11:50, Petr Jelinek wrote:
> 
>> I rebased the above mentioned patch to apply to the patches Andres sent,
>> if you could try to add it on top of what you have and check if it still
>> fails, that would be helpful.
> 
> It still fails.
> 
> With these patches
> 
> - 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
> - 00002-WIP-Possibly-more-robust-snapbuild-approach.patch     +
> - fix-statistics-reporting-in-logical-replication-work.patch +
> - Skip-unnecessary-snapshot-builds.patch
> 
> built again on top of 44c528810a1 ( so I had to add the
> 'fix-statistics-rep*' patch because without it I immediately got that
> Assertion failure again ).
> 
> As always most runs succeed (especially on this large 192GB 16-core
> server).
> 
> But attached is an output file of a number of runs of my
> pgbench_derail2.sh test.
> 
> Overal result:
> 
> -- out_20170509_1635.txt
>       3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n   --  scale 25
>       2 -- All is well.
>       1 -- Not good, but breaking out of wait (21 times no change)
> 
> I broke it off after iteration 4, so 5 never ran, and
> iteration 1 failed due to a mistake in the harness (somethind stupid I
> did) - not interesting.
> 
> iteration 2 succeeds. (eventually has 'replica ok')
> 
> iteration 3 succeeds. (eventually has 'replica ok')
> 
> iteration 4 fails.
>   Just after 'alter subscription sub1 enable' I caught (as is usual)
> pg_stat_replication.state as 'catchup'. So far so good.
>   After the 15-minute pgbench run pg_stat_replication has only 2
> 'startup' lines (and none 'catchup' or 'streaming'):
> 
>  port | pg_stat_replication |  pid   |     wal     | replay_loc | diff |
> ?column? |  state  |   app   | sync_state
>  6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |
>          | startup | derail2 | async
>  6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |
>          | startup | derail2 | async
> 
> (that's from:
>    select $port1 as port,'pg_stat_replication' as pg_stat_replication, pid
>      , pg_current_wal_location() wal, replay_location replay_loc,
> pg_current_wal_location() - replay_location as diff
>      , pg_current_wal_location() <= replay_location
>      , state, application_name as app, sync_state
>    from pg_stat_replication
> )
> 
> This remains in this state for as long as my test-programs lets it
> (i.e., 20 x 30s, or something like that, and then the loop is exited);
> in the ouput file it says: 'Not good, but breaking out of wait'
> 
> Below is the accompanying ps (with the 2 'deranged senders' as Jeff
> Janes would surely call them):
> 
> 
> UID         PID   PPID  C STIME TTY      STAT   TIME CMD
> rijkers  107147      1  0 17:11 pts/35   S+     0:00
> /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres
> -D /var/data1/pg_stuff/pg_installations
> rijkers  107149 107147  0 17:11 ?        Ss     0:00  \_ postgres:
> logger process
> rijkers  107299 107147  0 17:11 ?        Ss     0:01  \_ postgres:
> checkpointer process
> rijkers  107300 107147  0 17:11 ?        Ss     0:00  \_ postgres:
> writer process
> rijkers  107301 107147  0 17:11 ?        Ss     0:00  \_ postgres: wal
> writer process
> rijkers  107302 107147  0 17:11 ?        Ss     0:00  \_ postgres:
> autovacuum launcher process
> rijkers  107303 107147  0 17:11 ?        Ss     0:00  \_ postgres: stats
> collector process
> rijkers  107304 107147  0 17:11 ?        Ss     0:00  \_ postgres:
> bgworker: logical replication launcher
> rijkers  108348 107147  0 17:12 ?        Ss     0:01  \_ postgres:
> bgworker: logical replication worker for subscription 70310 sync 70293
> rijkers  108350 107147  0 17:12 ?        Ss     0:00  \_ postgres:
> bgworker: logical replication worker for subscription 70310 sync 70298
> rijkers  107145      1  0 17:11 pts/35   S+     0:02
> /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres
> -D /var/data1/pg_stuff/pg_installations
> rijkers  107151 107145  0 17:11 ?        Ss     0:00  \_ postgres:
> logger process
> rijkers  107160 107145  0 17:11 ?        Ss     0:08  \_ postgres:
> checkpointer process
> rijkers  107161 107145  0 17:11 ?        Ss     0:07  \_ postgres:
> writer process
> rijkers  107162 107145  0 17:11 ?        Ss     0:02  \_ postgres: wal
> writer process
> rijkers  107163 107145  0 17:11 ?        Ss     0:00  \_ postgres:
> autovacuum launcher process
> rijkers  107164 107145  0 17:11 ?        Ss     0:02  \_ postgres: stats
> collector process
> rijkers  107165 107145  0 17:11 ?        Ss     0:00  \_ postgres:
> bgworker: logical replication launcher
> rijkers  108349 107145  0 17:12 ?        Ss     0:27  \_ postgres: wal
> sender process rijkers [local] idle
> rijkers  108351 107145  0 17:12 ?        Ss     0:26  \_ postgres: wal
> sender process rijkers [local] idle
> 
> I have had no time to add (or view) any CPUinfo.
> 
> 

Ah okay, so this is same issue that's reported by both Masahiko Sawada
[1] and Jeff Janes [2].

[1]
https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Erik Rijkers
Date:
On 2017-05-09 21:00, Petr Jelinek wrote:
> On 09/05/17 19:54, Erik Rijkers wrote:
>> On 2017-05-09 11:50, Petr Jelinek wrote:
>> 
> 
> Ah okay, so this is same issue that's reported by both Masahiko Sawada
> [1] and Jeff Janes [2].
> 
> [1]
> https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
> [2]
> https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com

I don't understand why you come to that conclusion: both Masahiko Sawada 
and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't.  
Isn't that a real difference?

( I do sometimes get that DROP-SUBSCRIPTION too, but much less often 
than the sync-failure. )






Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 09/05/17 22:11, Erik Rijkers wrote:
> On 2017-05-09 21:00, Petr Jelinek wrote:
>> On 09/05/17 19:54, Erik Rijkers wrote:
>>> On 2017-05-09 11:50, Petr Jelinek wrote:
>>>
>>
>> Ah okay, so this is same issue that's reported by both Masahiko Sawada
>> [1] and Jeff Janes [2].
>>
>> [1]
>> https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
>>
>> [2]
>> https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com
>>
> 
> I don't understand why you come to that conclusion: both Masahiko Sawada
> and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. 
> Isn't that a real difference?
> 

Because I only see the sync workers running in the output you shown, the
bug described in those threads can as one of side effects cause the sync
workers to wait forever for the apply that was killed, crashed, etc, in
the right moment, which I guess is what happened during your failed test
(there should be some info in the log about apply exiting).

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Stas Kelvich
Date:
> On 10 May 2017, at 11:43, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>
> On 09/05/17 22:11, Erik Rijkers wrote:
>> On 2017-05-09 21:00, Petr Jelinek wrote:
>>> On 09/05/17 19:54, Erik Rijkers wrote:
>>>> On 2017-05-09 11:50, Petr Jelinek wrote:
>>>>
>>>
>>> Ah okay, so this is same issue that's reported by both Masahiko Sawada
>>> [1] and Jeff Janes [2].
>>>
>>> [1]
>>> https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
>>>
>>> [2]
>>> https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com
>>>
>>
>> I don't understand why you come to that conclusion: both Masahiko Sawada
>> and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't.
>> Isn't that a real difference?
>>
>

Just noted another one issue/feature with snapshot builder — when logical decoding is in progress
and vacuum full command is being issued quite a big amount of files appears in pg_logical/mappings
and reside there till the checkpoint. Also having consequent vacuum full’s before checkpoint yields even
more files.

Having two pgbench-filled instances with logical replication between them:

for i in {1..100}; do psql -c 'vacuum full' && ls -la pg_logical/mappings | wc -l; done;
VACUUM     73
VACUUM    454
VACUUM   1146
VACUUM   2149
VACUUM   3463
VACUUM   5088
<...>
VACUUM  44708
<…> // checkpoint happens somewhere here
VACUUM  20921
WARNING:  could not truncate file "base/16384/61773": Too many open files in system
ERROR:  could not fsync file "pg_logical/mappings/map-4000-4df-0_A4EA29F8-5aa5-5ae6": Too many open files in system


I’m not sure whether this is boils down to some of the previous issues mentioned here or not, so posting
here as observation.


Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 11/05/17 16:33, Stas Kelvich wrote:
> 
>> On 10 May 2017, at 11:43, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
>>
>> On 09/05/17 22:11, Erik Rijkers wrote:
>>> On 2017-05-09 21:00, Petr Jelinek wrote:
>>>> On 09/05/17 19:54, Erik Rijkers wrote:
>>>>> On 2017-05-09 11:50, Petr Jelinek wrote:
>>>>>
>>>>
>>>> Ah okay, so this is same issue that's reported by both Masahiko Sawada
>>>> [1] and Jeff Janes [2].
>>>>
>>>> [1]
>>>>
https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
>>>>
>>>> [2]
>>>>
https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com
>>>>
>>>
>>> I don't understand why you come to that conclusion: both Masahiko Sawada
>>> and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. 
>>> Isn't that a real difference?
>>>
>>
> 
> Just noted another one issue/feature with snapshot builder — when logical decoding is in progress
> and vacuum full command is being issued quite a big amount of files appears in pg_logical/mappings
> and reside there till the checkpoint. Also having consequent vacuum full’s before checkpoint yields even
> more files.
> 
> Having two pgbench-filled instances with logical replication between them:
> 
> for i in {1..100}; do psql -c 'vacuum full' && ls -la pg_logical/mappings | wc -l; done;
> VACUUM
>       73
> VACUUM
>      454
> VACUUM
>     1146
> VACUUM
>     2149
> VACUUM
>     3463
> VACUUM
>     5088
> <...>
> VACUUM
>    44708
> <…> // checkpoint happens somewhere here
> VACUUM
>    20921
> WARNING:  could not truncate file "base/16384/61773": Too many open files in system
> ERROR:  could not fsync file "pg_logical/mappings/map-4000-4df-0_A4EA29F8-5aa5-5ae6": Too many open files in system
> 
> 
> I’m not sure whether this is boils down to some of the previous issues mentioned here or not, so posting
> here as observation.
> 

This has nothing to do with snapshot builder so this is not the thread
for it. See the comment section near the bottom of
src/backend/access/heap/rewriteheap.c for explanation of why it does
what it does.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
andres@anarazel.de (Andres Freund)
Date:
On 2017-05-08 00:10:12 -0700, Andres Freund wrote:
> I plan to commit the next pending patch after the back branch releases
> are cut - it's an invasive fix and the issue doesn't cause corruption
> "just" slow slot creation. So it seems better to wait for a few days,
> rather than hurry it into the release.

Now that that's done, here's an updated version of that patch.  Note the
new logic to trigger xl_running_xact's to be logged at the right spot.
Works well in my testing.

I plan to commit this fairly soon, unless somebody wants a bit more time
to look into it.

Unless somebody protests, I'd like to slightly revise how the on-disk
snapshots are stored on master - given the issues this bug/commit showed
with the current method - but I can see one could argue that that should
rather be done next release.

- Andres



Re: [HACKERS] snapbuild woes

From
Peter Geoghegan
Date:
On Thu, May 11, 2017 at 2:51 PM, Andres Freund <andres@anarazel.de> wrote:
> Now that that's done, here's an updated version of that patch.  Note the
> new logic to trigger xl_running_xact's to be logged at the right spot.
> Works well in my testing.

You forgot the patch. :-)


-- 
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-11 14:51:55 -0700,  wrote:
> On 2017-05-08 00:10:12 -0700, Andres Freund wrote:
> > I plan to commit the next pending patch after the back branch releases
> > are cut - it's an invasive fix and the issue doesn't cause corruption
> > "just" slow slot creation. So it seems better to wait for a few days,
> > rather than hurry it into the release.
> 
> Now that that's done, here's an updated version of that patch.  Note the
> new logic to trigger xl_running_xact's to be logged at the right spot.
> Works well in my testing.
> 
> I plan to commit this fairly soon, unless somebody wants a bit more time
> to look into it.
> 
> Unless somebody protests, I'd like to slightly revise how the on-disk
> snapshots are stored on master - given the issues this bug/commit showed
> with the current method - but I can see one could argue that that should
> rather be done next release.

As usual I forgot my attachement.

- Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-04-15 05:18:49 +0200, Petr Jelinek wrote:
> >>> From 3318a929e691870f3c1ca665bec3bfa8ea2af2a8 Mon Sep 17 00:00:00 2001
> >>> From: Petr Jelinek <pjmodos@pjmodos.net>
> >>> Date: Sun, 26 Feb 2017 01:07:33 +0100
> >>> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards
> >>
> >> A bit more commentary would be good. What does that protect us against?
> >>
> > 
> > I think I explained that in the email. We might export snapshot with
> > xmin smaller than global xmin otherwise.
> > 
> 
> Updated commit message with explanation as well.


> From ae60b52ae0ca96bc14169cd507f101fbb5dfdf52 Mon Sep 17 00:00:00 2001
> From: Petr Jelinek <pjmodos@pjmodos.net>
> Date: Sun, 26 Feb 2017 01:07:33 +0100
> Subject: [PATCH 3/5] Prevent snapshot builder xmin from going backwards
> 
> Logical decoding snapshot builder may encounter xl_running_xacts with
> older xmin than the xmin of the builder. This can happen because
> LogStandbySnapshot() sometimes sees already committed transactions as
> running (there is difference between "running" in terms for WAL and in
> terms of ProcArray). When this happens we must make sure that the xmin
> of snapshot builder won't go back otherwise the resulting snapshot would
> show some transaction as running even though they have already
> committed.
> ---
>  src/backend/replication/logical/snapbuild.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
> index ada618d..3e34f75 100644
> --- a/src/backend/replication/logical/snapbuild.c
> +++ b/src/backend/replication/logical/snapbuild.c
> @@ -1165,7 +1165,8 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
>       * looking, it's correct and actually more efficient this way since we hit
>       * fast paths in tqual.c.
>       */
> -    builder->xmin = running->oldestRunningXid;
> +    if (TransactionIdFollowsOrEquals(running->oldestRunningXid, builder->xmin))
> +        builder->xmin = running->oldestRunningXid;
>  
>      /* Remove transactions we don't need to keep track off anymore */
>      SnapBuildPurgeCommittedTxn(builder);
> -- 
> 2.7.4

I still don't understand.  The snapshot's xmin is solely managed via
xl_running_xacts, so I don't see how the WAL/procarray difference can
play a role here.  ->committed isn't pruned before xl_running_xacts
indicates it can be done, so I don't understand your explanation above.

I'd be ok with adding this as paranoia check, but I still don't
understand when it could practically be hit.

- Andres



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-11 14:54:26 -0700, Andres Freund wrote:
> On 2017-05-11 14:51:55 -0700,  wrote:
> > On 2017-05-08 00:10:12 -0700, Andres Freund wrote:
> > > I plan to commit the next pending patch after the back branch releases
> > > are cut - it's an invasive fix and the issue doesn't cause corruption
> > > "just" slow slot creation. So it seems better to wait for a few days,
> > > rather than hurry it into the release.
> > 
> > Now that that's done, here's an updated version of that patch.  Note the
> > new logic to trigger xl_running_xact's to be logged at the right spot.
> > Works well in my testing.
> > 
> > I plan to commit this fairly soon, unless somebody wants a bit more time
> > to look into it.
> > 
> > Unless somebody protests, I'd like to slightly revise how the on-disk
> > snapshots are stored on master - given the issues this bug/commit showed
> > with the current method - but I can see one could argue that that should
> > rather be done next release.
> 
> As usual I forgot my attachement.

This patch also seems to offer a way to do your 0005 in, possibly, more
efficient manner.  We don't ever need to assume a transaction is
timetravelling anymore.  Could you check whether the attached, hasty,
patch resolves the performance problems you measured?  Also, do you have
a "reference" workload for that?

Regards,

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 12/05/17 03:31, Andres Freund wrote:
> On 2017-05-11 14:54:26 -0700, Andres Freund wrote:
>> On 2017-05-11 14:51:55 -0700,  wrote:
>>> On 2017-05-08 00:10:12 -0700, Andres Freund wrote:
>>>> I plan to commit the next pending patch after the back branch releases
>>>> are cut - it's an invasive fix and the issue doesn't cause corruption
>>>> "just" slow slot creation. So it seems better to wait for a few days,
>>>> rather than hurry it into the release.
>>>
>>> Now that that's done, here's an updated version of that patch.  Note the
>>> new logic to trigger xl_running_xact's to be logged at the right spot.
>>> Works well in my testing.
>>>
>>> I plan to commit this fairly soon, unless somebody wants a bit more time
>>> to look into it.
>>>
>>> Unless somebody protests, I'd like to slightly revise how the on-disk
>>> snapshots are stored on master - given the issues this bug/commit showed
>>> with the current method - but I can see one could argue that that should
>>> rather be done next release.
>>
>> As usual I forgot my attachement.
> 
> This patch also seems to offer a way to do your 0005 in, possibly, more
> efficient manner.  We don't ever need to assume a transaction is
> timetravelling anymore.  Could you check whether the attached, hasty,
> patch resolves the performance problems you measured?  Also, do you have
> a "reference" workload for that?
> 

Hmm, well it helps but actually now that we don't track individual
running transactions anymore it got much less effective (my version of
0005 does as well).

The example workload I test with is:
session 1: open transaction, do a write, keep it open
session 2: pgbench  -M simple -N -c 10 -P 1 -T 5
session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
session 2: pgbench  -M simple -N -c 10 -P 1 -T 20
session 1: commit

And wait for session 3 to finish slot creation, takes about 20 mins on
my laptop without patches, minute and half with your patches for 0004
and 0005 (or with your 0004 and my 0005) and about 2s with my original
0004 and 0005.

What makes it slow is the constant qsorting IIRC.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 12/05/17 10:57, Petr Jelinek wrote:
> On 12/05/17 03:31, Andres Freund wrote:
>> On 2017-05-11 14:54:26 -0700, Andres Freund wrote:
>>> On 2017-05-11 14:51:55 -0700,  wrote:
>>>> On 2017-05-08 00:10:12 -0700, Andres Freund wrote:
>>>>> I plan to commit the next pending patch after the back branch releases
>>>>> are cut - it's an invasive fix and the issue doesn't cause corruption
>>>>> "just" slow slot creation. So it seems better to wait for a few days,
>>>>> rather than hurry it into the release.
>>>>
>>>> Now that that's done, here's an updated version of that patch.  Note the
>>>> new logic to trigger xl_running_xact's to be logged at the right spot.
>>>> Works well in my testing.
>>>>
>>>> I plan to commit this fairly soon, unless somebody wants a bit more time
>>>> to look into it.
>>>>
>>>> Unless somebody protests, I'd like to slightly revise how the on-disk
>>>> snapshots are stored on master - given the issues this bug/commit showed
>>>> with the current method - but I can see one could argue that that should
>>>> rather be done next release.
>>>
>>> As usual I forgot my attachement.
>>
>> This patch also seems to offer a way to do your 0005 in, possibly, more
>> efficient manner.  We don't ever need to assume a transaction is
>> timetravelling anymore.  Could you check whether the attached, hasty,
>> patch resolves the performance problems you measured?  Also, do you have
>> a "reference" workload for that?
>>
> 
> Hmm, well it helps but actually now that we don't track individual
> running transactions anymore it got much less effective (my version of
> 0005 does as well).
> 
> The example workload I test with is:
> session 1: open transaction, do a write, keep it open
> session 2: pgbench  -M simple -N -c 10 -P 1 -T 5
> session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
> session 2: pgbench  -M simple -N -c 10 -P 1 -T 20
> session 1: commit
> 
> And wait for session 3 to finish slot creation, takes about 20 mins on
> my laptop without patches, minute and half with your patches for 0004
> and 0005 (or with your 0004 and my 0005) and about 2s with my original
> 0004 and 0005.
> 
> What makes it slow is the constant qsorting IIRC.
> 

Btw I still think that we should pursue the approach you proposed. I
think we can deal with the qsorting in some other ways (ordered insert
perhaps?) later. What you propose fixes the correctness, makes
performance less awful in the above workload and also makes the
snapbuild bit easier to read.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-12 10:57:55 +0200, Petr Jelinek wrote:
> Hmm, well it helps but actually now that we don't track individual
> running transactions anymore it got much less effective (my version of
> 0005 does as well).
> 
> The example workload I test with is:
> session 1: open transaction, do a write, keep it open
> session 2: pgbench  -M simple -N -c 10 -P 1 -T 5
> session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
> session 2: pgbench  -M simple -N -c 10 -P 1 -T 20
> session 1: commit
> 
> And wait for session 3 to finish slot creation, takes about 20 mins on
> my laptop without patches, minute and half with your patches for 0004
> and 0005 (or with your 0004 and my 0005) and about 2s with my original
> 0004 and 0005.

Is that with assertions enabled or not?  With assertions all the time
post patches seems to be spent in some Asserts in reorderbuffer.c,
without it takes less than a second for me here.

I'm appylying these now.

Greetings,

Andres Freund



Re: [HACKERS] snapbuild woes

From
Petr Jelinek
Date:
On 13/05/17 22:23, Andres Freund wrote:
> On 2017-05-12 10:57:55 +0200, Petr Jelinek wrote:
>> Hmm, well it helps but actually now that we don't track individual
>> running transactions anymore it got much less effective (my version of
>> 0005 does as well).
>>
>> The example workload I test with is:
>> session 1: open transaction, do a write, keep it open
>> session 2: pgbench  -M simple -N -c 10 -P 1 -T 5
>> session 3: run CREATE_REPLICATION_SLOT LOGICAL in walsender
>> session 2: pgbench  -M simple -N -c 10 -P 1 -T 20
>> session 1: commit
>>
>> And wait for session 3 to finish slot creation, takes about 20 mins on
>> my laptop without patches, minute and half with your patches for 0004
>> and 0005 (or with your 0004 and my 0005) and about 2s with my original
>> 0004 and 0005.
> 
> Is that with assertions enabled or not?  With assertions all the time
> post patches seems to be spent in some Asserts in reorderbuffer.c,
> without it takes less than a second for me here.
>

Right you are, I always forget to switch that off before benchmarks.

> I'm appylying these now.

Cool. Just for posterity, this also fixes the issue number 3 as the
switch to consistent state is done purely based on xl_running_xacts and
not decoded commits/aborts.

So all done here, thanks!

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-05-14 14:54:37 +0200, Petr Jelinek wrote:
> On 13/05/17 22:23, Andres Freund wrote:
> >> And wait for session 3 to finish slot creation, takes about 20 mins on
> >> my laptop without patches, minute and half with your patches for 0004
> >> and 0005 (or with your 0004 and my 0005) and about 2s with my original
> >> 0004 and 0005.
> > 
> > Is that with assertions enabled or not?  With assertions all the time
> > post patches seems to be spent in some Asserts in reorderbuffer.c,
> > without it takes less than a second for me here.
> >
> 
> Right you are, I always forget to switch that off before benchmarks.

Phew ;)

> > I'm appylying these now.
> 
> Cool. Just for posterity, this also fixes the issue number 3 as the
> switch to consistent state is done purely based on xl_running_xacts and
> not decoded commits/aborts.

Cool.  Although I'm still not convinced, as noted somewhere in this
thread, that it actually did much to start with ;)

Greetings,

Andres Freund



Re: [HACKERS] snapbuild woes

From
Antonin Houska
Date:
Andres Freund <andres@anarazel.de> wrote:

> Looking at 0001:
> - GetOldestSafeDecodingTransactionId() only guarantees to return an xid
>   safe for decoding (note how procArray->replication_slot_catalog_xmin
>   is checked), not one for the initial snapshot - so afaics this whole
>   exercise doesn't guarantee much so far.

I happen to use CreateInitDecodingContext() in an extension, so I had to think
what the new argumen exactly means (as for the incompatibility between PG
9.6.2 and 9.6.3, I suppose preprocessor directives can handle it).

One thing I'm failing to understand is: if TRUE is passed for
need_full_snapshot, then slot->effective_xmin receives the result of
   GetOldestSafeDecodingTransactionId(need_full_snapshot)

but this does include "catalog xmin".

If the value returned is determined by an existing slot which has valid
effective_catalog_xmin and invalid effective_xmin (i.e. that slot only
protects catalog tables from VACUUM but not the regular ones), then the new
slot will think it (i.e. the new slot) protects even non-catalog tables, but
that's no true.

Shouldn't the xmin_horizon be computed by this call instead?
   GetOldestSafeDecodingTransactionId(!need_full_snapshot);

(If so, I think "considerCatalog" argument name would be clearer than
"catalogOnly".)

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: [HACKERS] snapbuild woes

From
Andres Freund
Date:
On 2017-06-09 09:25:34 +0200, Antonin Houska wrote:
> Andres Freund <andres@anarazel.de> wrote:
> 
> > Looking at 0001:
> > - GetOldestSafeDecodingTransactionId() only guarantees to return an xid
> >   safe for decoding (note how procArray->replication_slot_catalog_xmin
> >   is checked), not one for the initial snapshot - so afaics this whole
> >   exercise doesn't guarantee much so far.
> 
> I happen to use CreateInitDecodingContext() in an extension, so I had to think
> what the new argumen exactly means (as for the incompatibility between PG
> 9.6.2 and 9.6.3, I suppose preprocessor directives can handle it).
> 
> One thing I'm failing to understand is: if TRUE is passed for
> need_full_snapshot, then slot->effective_xmin receives the result of
> 
>     GetOldestSafeDecodingTransactionId(need_full_snapshot)
> 
> but this does include "catalog xmin".
> 
> If the value returned is determined by an existing slot which has valid
> effective_catalog_xmin and invalid effective_xmin (i.e. that slot only
> protects catalog tables from VACUUM but not the regular ones), then the new
> slot will think it (i.e. the new slot) protects even non-catalog tables, but
> that's no true.
> 
> Shouldn't the xmin_horizon be computed by this call instead?
> 
>     GetOldestSafeDecodingTransactionId(!need_full_snapshot);
> 
> (If so, I think "considerCatalog" argument name would be clearer than
> "catalogOnly".)

Good catch. Pushing a fix. Afaict it's luckily inconsequential atm
because fo the way we wait for concurrent snapshots when creating a
slot. But it obviously nevertheless needs tobe fixed.

Greetings,

Andres Freund