Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Nikhil Sontakke
Date:
Hi,

As of today, in the existing code, HeapTupleSatisfiesVacuum deems a
row inserted by an aborted transaction to be dead immediately, so it
can be cleaned up for re-use by HOT pruning, vacuum, etc.

With logical decoding, however, such a row, if it belongs to a system
catalog table or even a user catalog table
(https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=66abc2608c7c00fcd449e00a9e23f13f02e65d04),
may be in use for decoding at the very moment the abort comes in. If
the row is reclaimed immediately, the decoding happening alongside can
run into trouble.

The main issue here is that HeapTupleSatisfiesVacuum *assumes* that
rows belonging to an aborted transaction are not visible to anyone
else. As mentioned above, that assumption needs to change when logical
decoding is active and the row belongs to a catalog table. This was
not a problem before, but there are now at least two patches out there
which challenge the assumption. One patch adds logical decoding of
2PC, where a PREPARED transaction might be decoded while a concurrent
ROLLBACK PREPARED happens alongside. The other patch streams logical
changes before commit/abort.

The provided patch changes that assumption. We pass an additional
argument to HeapTupleSatisfiesVacuum() indicating whether the current
row belongs to a catalog table in a logical decoding environment. We
call the existing RelationIsAccessibleInLogicalDecoding() function to
ascertain whether the table is a catalog table worth considering for
this case. If it is, and the xmin of the row is newer than OldestXmin,
we return HEAPTUPLE_RECENTLY_DEAD. So for the following case we will
now return HEAPTUPLE_RECENTLY_DEAD where earlier we would have
returned HEAPTUPLE_DEAD:

BEGIN;
ALTER TABLE subscribed_table ADD COLUMN ...;
....
ABORT;
VACUUM pg_attribute;

This is essentially the same logic that is used today for rows that
got deleted (via UPDATE or DELETE): we return HEAPTUPLE_RECENTLY_DEAD
for such deleted rows if the xmax of the row is newer than OldestXmin,
because some open transactions could still see the tuple.

Along with the changes to HeapTupleSatisfiesVacuum, we also had to
tweak the logic in heap_prune_chain(). That function assumes that only
updated tuples can return HEAPTUPLE_RECENTLY_DEAD. That assumption no
longer holds, so we now explicitly check that the
HeapTupleHeaderGetUpdateXid value is valid before processing the tuple
as part of an update chain.

The patch had to make changes in all locations where
HeapTupleSatisfiesVacuum gets called, and I have done multiple "make
check-world" runs with cassert enabled; everything passes. Also, since
the flag is set only for system/user catalog tables, and ONLY in a
logical decoding environment, the patch does not cause any performance
regression in normal circumstances.

Regards,
Nikhils
-- 
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Simon Riggs
Date:
On 26 December 2017 at 14:21, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:

> With logical decoding, there might arise a case that such a row, if it
> belongs to a system catalog table or even a user catalog table
> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=66abc2608c7c00fcd449e00a9e23f13f02e65d04),
> then it might be being used for decoding at the very same moment that
> the abort came in. If the row is re-claimed immediately, then it's
> possible that the decoding happening alongside might face issues.

If we want to make this change, I'd like to see something that
explains exactly how the problem in logical decoding occurs, or
preferably a test case. I can't understand the problem by reading the
archives. I'm not suggesting this patch doesn't work, I'm thinking
about whether there are other ways.

Surely if you are decoding a transaction and a concurrent abort is
requested then decoding should be interrupted at the next sensible
point. Allowing the two actions to occur without interlock is an
issue, so I suggest we just don't do it, rather than allow it and fix
subsequent breakage, which is what I understand this patch to do.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Tomas Vondra
Date:
Hi,

On 01/19/2018 03:34 PM, Simon Riggs wrote:
> On 26 December 2017 at 14:21, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> 
>> With logical decoding, there might arise a case that such a row, if it
>> belongs to a system catalog table or even a user catalog table
>> (https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=66abc2608c7c00fcd449e00a9e23f13f02e65d04),
>> then it might be being used for decoding at the very same moment that
>> the abort came in. If the row is re-claimed immediately, then it's
>> possible that the decoding happening alongside might face issues.
> 
> If we want to make this change, I'd like to see something that
> explains exactly how the problem in logical decoding occurs, or
> preferably a test case. I can't understand the problem by reading the
> archives. I'm not suggesting this patch doesn't work, I'm thinking
> about whether there are other ways.
> 
> Surely if you are decoding a transaction and a concurrent abort is
> requested then decoding should be interrupted at the next sensible
> point. Allowing the two actions to occur without interlock is an
> issue, so I suggest we just don't do it, rather than allow it and fix
> subsequent breakage, which is what I understand this patch to do.
> 

I don't quite understand how the interlock would work. Can you
elaborate on how that would be implemented? Consider that logical
decoding

(a) is a read-only process, so it can't modify database state (e.g. by
    locking a row in a catalog)

(b) is an asynchronous process, i.e. it may run quite a bit of time
    after the changes actually happened

An example of a problematic transaction was already outlined in Nikhil's
post, i.e. it's a transaction that

   (1) does DDL, e.g. to add a column to the decoded table

   (2) inserts some data into the table (with the new column)

   (3) aborts (possibly hours / gigabytes of WAL in the future)

Currently, vacuum can just go and clean up the catalog rows added in
(1), because the transaction is aborted (and postgres already knows
that), and we ignore OldestXmin for such catalog rows.

But if you try to decode the changes (and hand them over to the plugin
for apply) without waiting for the final commit/abort, that will fail,
because decoding will read changes from WAL that reference a column
missing from every existing catalog row. Kabooooom!

Until now this was not an issue, because apply happens at commit time
(more precisely, we do decode the changes incrementally, but the apply
only happens on commit). But it is an issue for both Nikhil's patch
and mine.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Robert Haas
Date:
On Tue, Dec 26, 2017 at 9:21 AM, Nikhil Sontakke
<nikhils@2ndquadrant.com> wrote:
> The main issue here is that HeapTupleSatisfiesVacuum *assumes* that
> rows belonging to an aborted transaction are not visible to anyone
> else.

One problem here is that if a transaction aborts, it might have done
so after inserting or updating a tuple in the heap and before inserting
new index entries for the tuple, or after inserting only some of the
necessary new index entries.  Therefore, even if you prevent pruning,
a snapshot from the point of view of the aborted transaction may be
inconsistent.  Similarly, if it aborts during a DDL operation, it may
have made some but not all of the catalog changes involved, so that
for example pg_class and pg_attribute could be inconsistent with each
other or various pg_attribute rows could even be inconsistent among
themselves.  If you have a view of the catalog where these problems
exist, you can't rely on, for example, being able to build a relcache
entry without error.  It is possible that you can avoid these problems
if your snapshot is always using a command ID value that was reached
prior to the error, although I'm not 100% sure that idea has no holes.

Another problem is that CTID chains may be broken.  Suppose that a
transaction T1, using CID 1, does a HOT update of tuple A1 producing a
new version A2. Then, later on, when the CID counter is at least 2, it
aborts.  A snapshot taken from the point of view of T1 at CID 1 should
see A2.  That will work fine most of the time.  However, if
transaction T2 comes along after T1 aborts and before logical decoding
gets there and does its own HOT update of tuple A1 producing a new
version A3, then tuple A2 is inaccessible through the indexes even if
it still exists in the heap page.  I think this problem is basically
unsolvable and likely means that this whole approach needs to be
abandoned.

One other issue to consider is that the tuple freezing code assumes
that any tuple that does not get removed when a page is pruned is OK
to freeze.  Commit 9c2f0a6c3cc8bb85b78191579760dbe9fb7814ec was
necessary to repair a case where that assumption was violated.  You
might want to consider carefully whether there's any chance that this
patch could introduce a similar problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Tomas Vondra
Date:
On 01/19/2018 06:54 PM, Robert Haas wrote:
> On Tue, Dec 26, 2017 at 9:21 AM, Nikhil Sontakke
> <nikhils@2ndquadrant.com> wrote:
>> The main issue here is that HeapTupleSatisfiesVacuum *assumes* that
>> rows belonging to an aborted transaction are not visible to anyone
>> else.
> 
> One problem here is that if a transaction aborts, it might have done
> so after inserting or update a tuple in the heap and before inserting
> new index entries for the tuple, or after inserting only some of the
> necessary new index entries.  ... [snip] ...
>

I think an important piece of this puzzle is that we only really care
about catalog changes made in a transaction that aborts after doing
some additional changes with that catalog tuple in place, because only
then do we actually need the catalog tuple in order to interpret the
changes.

AFAICS that guarantees the catalog changes were not interrupted
half-way through, leaving some of the catalogs in an inconsistent
state.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Robert Haas
Date:
On Fri, Jan 19, 2018 at 2:16 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I think an important piece of this puzzle is that we only really care
> about catalog changes made in a transaction that aborts after doing some
> additional changes, with that catalog tuple in place. Because only then
> we actually need that catalog tuple in order to interpret the changes.
>
> AFAICS that guarantees the catalog changes were not interrupted half-way
> through, leaving some of the catalogs in inconsistent state.

Yeah, that may be true, and I alluded to it in the part you didn't
quote.  However, it doesn't help with the second problem I mentioned,
which looks to me to be a fatal problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Tomas Vondra
Date:

On 01/19/2018 08:33 PM, Robert Haas wrote:
> On Fri, Jan 19, 2018 at 2:16 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> I think an important piece of this puzzle is that we only really care
>> about catalog changes made in a transaction that aborts after doing some
>> additional changes, with that catalog tuple in place. Because only then
>> we actually need that catalog tuple in order to interpret the changes.
>>
>> AFAICS that guarantees the catalog changes were not interrupted half-way
>> through, leaving some of the catalogs in inconsistent state.
> 
> Yeah, that may be true, and I alluded to it in the part you didn't 
> quote. However, it doesn't help with the second problem I mentioned, 
> which looks to me to be a fatal problem.
> 

Ah, OK - I didn't realize "using a command ID value that was reached
prior to the error" refers to this scenario.

Regarding the HOT issue - I have to admit I don't quite see why A2
wouldn't be reachable through the index, but that's likely due to my
limited knowledge of the HOT internals.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Robert Haas
Date:
On Fri, Jan 19, 2018 at 5:19 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Regarding the HOT issue - I have to admit I don't quite see why A2
> wouldn't be reachable through the index, but that's likely due to my
> limited knowledge of the HOT internals.

The index entries only point to the root tuple in the HOT chain.  Any
subsequent entries can only be reached by following the CTID pointers
(that's why they are called "Heap Only Tuples").  After T1 aborts,
we're still OK because the CTID link isn't immediately cleared.  But
after T2 updates the tuple, it makes A1's CTID link point to A3,
leaving no remaining link to A2.

Although in most respects PostgreSQL treats commits and aborts
surprisingly symmetrically, CTID links are an exception.  When T2
comes to A1, it sees that A1's xmax is T1 and checks the status of T1.
If T1 is still in progress, it waits.  If T2 has committed, it must
either abort with a serialization error or update A2 instead under
EvalPlanQual semantics, depending on the isolation level.  If T2 has
aborted, it assumes that the CTID field of T1 is garbage nobody cares
about, adds A3 to the page, and makes A1 point to A3 instead of A2.
No record of the A1->A2 link is kept anywhere *precisely because* A2
can no longer be visible to anyone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Petr Jelinek
Date:
Hi,

On 20/01/18 00:52, Robert Haas wrote:
> On Fri, Jan 19, 2018 at 5:19 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Regarding the HOT issue - I have to admit I don't quite see why A2
>> wouldn't be reachable through the index, but that's likely due to my
>> limited knowledge of the HOT internals.
> 
> The index entries only point to the root tuple in the HOT chain.  Any
> subsequent entries can only be reached by following the CTID pointers
> (that's why they are called "Heap Only Tuples").  After T1 aborts,
> we're still OK because the CTID link isn't immediately cleared.  But
> after T2 updates the tuple, it makes A1's CTID link point to A3,
> leaving no remaining link to A2.
> 
> Although in most respects PostgreSQL treats commits and aborts
> surprisingly symmetrically, CTID links are an exception.  When T2
> comes to A1, it sees that A1's xmax is T1 and checks the status of T1.
> If T1 is still in progress, T2 waits.  If T1 has committed, T2 must
> either abort with a serialization error or update A2 instead under
> EvalPlanQual semantics, depending on the isolation level.  If T1 has
> aborted, T2 assumes that the CTID field of A1 is garbage nobody cares
> about, adds A3 to the page, and makes A1 point to A3 instead of A2.
> No record of the A1->A2 link is kept anywhere *precisely because* A2
> can no longer be visible to anyone.
> 

I think this is the only real problem from your list for logical
decoding catalog snapshots. But it's indeed quite a bad one. Is there
something preventing us from removing the assumption that the CTID of
T1 is garbage nobody cares about? I guess turning off HOT for catalogs
is not an option :)

The general problem is that we have a couple of assumptions
(HeapTupleSatisfiesVacuum being one, what you wrote being another)
about tuples from aborted transactions not being read by anybody. But
if we want to add decoding of 2PC or transaction streaming, that's no
longer true, so I think we should try to remove that assumption (even
if we do it only for catalogs, since that's what we care about).

The other option would be to make sure 2PC decoding/tx streaming does
not read an aborted transaction, but that would mean locking the
transaction every time we give control to the output plugin. Given
that the output plugin may do network writes, this would really mean
locking the transaction for an unbounded period of time. That does not
strike me as something we want to do; decoding should not interact
with frontend transaction management, and definitely not this badly.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Robert Haas
Date:
On Mon, Jan 22, 2018 at 10:40 AM, Petr Jelinek
<petr.jelinek@2ndquadrant.com> wrote:
> I think this is the only real problem from your list for logical
> decoding catalog snapshots. But it's indeed quite a bad one. Is there
> something preventing us to remove the assumption that the CTID of T1 is
> garbage nobody cares about? I guess turning off HOT for catalogs is not
> an option :)

It doesn't seem like a great idea.  For a lot of workloads it wouldn't
hurt, but for some it might cause a lot of catalog bloat.  Also, Tom
would probably hate it with a fiery passion, since as I understand it
he argued passionately for HOT to work for both catalog and
non-catalog tables back when it was first implemented.

CTID chain integrity is important for non-HOT updates too, at least if
any SELECT .. FOR UPDATE/SHARE or UPDATE operations are taking place.
Now, I suppose we're not going to do any of that during logical
decoding, but I guess I can't say I'm absolutely positive that there
are no other problem cases.  If there aren't, you can imagine
disallowing HOT updates only on catalogs and only when xmax is newer
than the newest XID we'll never need to decode, or more narrowly
still, only when it's in a list of XIDs currently being decoded.

Independently of CTID, I also wonder if there's a risk of us trying to
read from a multixact that's been truncated away.  I haven't checked
the multixact code in detail, but I bet there's no provision to keep
around multixacts that are only interesting to aborted transactions.
Conversely, what keeps us from freezing the xmin of a tuple that is
invisible only to some aborted transaction, but visible to all
committed transactions?  Or just marking the page all-visible?

> General problem is that we have couple of assumptions
> (HeapTupleSatisfiesVacuum being one, what you wrote is another) about
> tuples from aborted transactions not being read by anybody. But if we
> want to add decoding of 2PC or transaction streaming that's no longer
> true so I think we should try to remove that assumption (even if we do
> it only for catalogs since that what we care about).

I'm extremely skeptical about this idea.  I think that assumption is
fairly likely to be embedded in subtle ways in several more places
that we haven't thought about yet.  Unless we're very, very careful
about doing something like this, we could end up with a system that
mostly seems to work but is actually unreliable in ways that can't be
fixed without non-back-patchable changes.

> The other option would be to make sure 2PC decoding/tx streaming does
> not read aborted transaction but that would mean locking the transaction
> every time we give control to output plugin. Given that output plugin
> may do network write, this would really mean locking the transaction for
> and unbounded period of time. That does not strike me as something we
> want to do, decoding should not interact with frontend transaction
> management, definitely not this badly.

Yeah, I don't think it's acceptable to postpone abort indefinitely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Petr Jelinek
Date:
On 22/01/18 20:21, Robert Haas wrote:
> On Mon, Jan 22, 2018 at 10:40 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> I think this is the only real problem from your list for logical
>> decoding catalog snapshots. But it's indeed quite a bad one. Is there
>> something preventing us to remove the assumption that the CTID of T1 is
>> garbage nobody cares about? I guess turning off HOT for catalogs is not
>> an option :)
> 
> It doesn't seem like a great idea.  For a lot of workloads it wouldn't
> hurt, but for some it might cause a lot of catalog bloat.  Also, Tom
> would probably hate it with a fiery passion, since as I understand it
> he argued passionately for HOT to work for both catalog and
> non-catalog tables back when it was first implemented.
>

I wasn't really serious about that proposal.

> 
>> General problem is that we have couple of assumptions
>> (HeapTupleSatisfiesVacuum being one, what you wrote is another) about
>> tuples from aborted transactions not being read by anybody. But if we
>> want to add decoding of 2PC or transaction streaming that's no longer
>> true so I think we should try to remove that assumption (even if we do
>> it only for catalogs since that what we care about).
> 
> I'm extremely skeptical about this idea.  I think that assumption is
> fairly likely to be embedded in subtle ways in several more places
> that we haven't thought about yet.  Unless we're very, very careful
> about doing something like this, we could end up with a system that
> mostly seems to work but is actually unreliable in ways that can't be
> fixed without non-back-patchable changes.

I am worried about it too. I don't feel like I can judge whether we
just need to be very careful or whether it's too big a risk, so I'll
defer to you here.

>> The other option would be to make sure 2PC decoding/tx streaming does
>> not read aborted transaction but that would mean locking the transaction
>> every time we give control to output plugin. Given that output plugin
>> may do network write, this would really mean locking the transaction for
>> and unbounded period of time. That does not strike me as something we
>> want to do, decoding should not interact with frontend transaction
>> management, definitely not this badly.
> 
> Yeah, I don't think it's acceptable to postpone abort indefinitely.
> 

Hmm, maybe a less bad and potentially acceptable version of this would
be to make TransactionIdIsInProgress() treat a transaction as running
while it's being decoded; that might solve most of the issues. There
is still a potential interlocking issue with UPDATEs, but those would
have to change the same tuples as the aborted tx did.

What I mean concretely is that decoding would check whether the tx has
aborted; if not, it would record in shmem that it is working on the
tx, and once the plugin is done with the current call it would reset
the shmem info to an invalid tx. TransactionIdIsInProgress() would
then check this shmem info before checking for an abort. If the first
check showed the transaction had already aborted, decoding would not
bother decoding it further.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Simon Riggs
Date:
On 22 January 2018 at 20:03, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> On 22/01/18 20:21, Robert Haas wrote:

>>> The other option would be to make sure 2PC decoding/tx streaming does
>>> not read aborted transaction but that would mean locking the transaction
>>> every time we give control to output plugin. Given that output plugin
>>> may do network write, this would really mean locking the transaction for
>>> and unbounded period of time. That does not strike me as something we
>>> want to do, decoding should not interact with frontend transaction
>>> management, definitely not this badly.
>>
>> Yeah, I don't think it's acceptable to postpone abort indefinitely.
>>
>
> Hmm, maybe less bad and potentially acceptable version of this would be
> to make TransactionIdIsInProgress() treat transaction as running if it's
> being decoded, that might solve most of the issues. There is still
> potential interlocking issue with UPDATEs but they have to change same
> tuples as the aborted tx did.
>
> What I mean concretely is that decoding would check if tx has aborted,
> if not then it would put somewhere in shmem info about it working on the
> tx and once plugin is done with current call it would reset the shmem
> info to invalid tx. Then the TransactionIdIsInProgress() would check
> this shmem info before checking about abort. If the first check would
> show that transaction is aborted then decoding would not bother decoding
> it further.

I must be missing something in this discussion, cos I don't see any
problems with this "other option".

Surely we prepare the 2PCxact and then it is persistent. Yes,
potentially for an unbounded period of time. And it is (usually) up to
the XA manager to resolve that. 2PC interacts with transaction
management and yes, it can be slow. But the choice is slow and
consistent, or not. This would only be used with the full choice of
the user, just like synchronous_commit.

In this case, we call the decoding plugin's precommit hook which would
then prepare the 2PCxact and set a non-persistent flag saying it is
being decoded. If decoding completes normally we release the lock and
commit. If decoding fails or the DBA has another reason to do so, we
provide a function that allows the flag to be unlocked. While it is
locked the 2PCxact cannot be aborted or committed.

There is no danger of accidental abort because the prepare has persistent state.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Nikhil Sontakke
Date:
> I must be missing something in this discussion, cos I don't see any
> problems with this "other option".
>
> Surely we prepare the 2PCxact and then it is persistent. Yes,
> potentially for an unbounded period of time. And it is (usually) up to
> the XA manager to resolve that. 2PC interacts with transaction
> management and yes, it can be slow. But the choice is slow and
> consistent, or not. This would only be used with the full choice of
> the user, just like synchronous_commit.
>
> In this case, we call the decoding plugin's precommit hook which would
> then prepare the 2PCxact and set a non-persistent flag saying it is
> being decoded. If decoding completes normally we release the lock and
> commit. If decoding fails or the DBA has another reason to do so, we
> provide a function that allows the flag to be unlocked. While it is
> locked the 2PCxact cannot be aborted or committed.
>
> There is no danger of accidental abort because the prepare has persistent state.

This concurrent abort handling while decoding is ongoing is turning
out to be a complex affair.

Thinking more about this, to provide an example: as part of the 2PC
logical decoding patch, we have a decoding plugin hook to determine
whether a GID for a 2PC transaction was decoded at PREPARE time or at
COMMIT time. We need that to get the *same* static answer every time
we see a specific GID while decoding across restarts; the plugin
should know what it did the last time around and tell us the same
thing later as well. It just occurred to me that, as Simon also
mentioned, it could/should also be the decoding plugin's
responsibility to indicate whether it's OK to go ahead with the abort
of the transaction.

So, we could consider adding a preAbort hook. That preAbort hook would
get the GID, XID and other parameters as needed, and tell us whether
we can go ahead with the abort or need to wait (maybe we pass in an
ok_to_wait param as well). As an example, a plugin could look up some
shmem structure which points to the current transaction being decoded,
and do the related processing to ensure that it stops decoding at a
clean juncture, thus keeping the response time bounded to a maximum of
one change-record apply cycle. That passes the onus onto the plugin
writers and keeps the core code around this concurrent abort handling
clean.

I will cook up something along the above lines unless there are any
objections to this approach.

Regards,
Nikhils
-- 
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Petr Jelinek
Date:
On 23/01/18 16:01, Nikhil Sontakke wrote:
>> I must be missing something in this discussion, cos I don't see any
>> problems with this "other option".
>>
>> Surely we prepare the 2PCxact and then it is persistent. Yes,
>> potentially for an unbounded period of time. And it is (usually) up to
>> the XA manager to resolve that. 2PC interacts with transaction
>> management and yes, it can be slow. But the choice is slow and
>> consistent, or not. This would only be used with the full choice of
>> the user, just like synchronous_commit.

It's not about the transaction being persistent, but about the abort
command being blocked.

>>
>> In this case, we call the decoding plugin's precommit hook which would
>> then prepare the 2PCxact and set a non-persistent flag saying it is
>> being decoded. If decoding completes normally we release the lock and
>> commit. If decoding fails or the DBA has another reason to do so, we
>> provide a function that allows the flag to be unlocked. While it is
>> locked the 2PCxact cannot be aborted or committed.

The output plugin can't deal with precommit; that has to be handled
elsewhere, but in principle this is true.

>>
>> There is no danger of accidental abort because the prepare has persistent state.
> 
> This concurrent abort handling while decoding is ongoing is turning
> out to be complex affair.
> 
> Thinking more about this, just to provide an example, we have a
> decoding plugin hook to determine if a GID for a 2PC was decoded at
> PREPARE time or COMMIT time as part of the 2PC logical decoding patch.
> We need that to determine the *same* static answer every time we see a
> specific GID while decoding across restarts; the plugin should know
> what it had done the last time around and should tell us the same
> later as well. It just occurred to me that as Simon also mentioned, it
> could/should also be the decoding plugin's responsibility to indicate
> if it's ok to go ahead with the abort of the transaction.
> 
> So, we could consider adding a preAbort hook. That preAbort hook gets
> the GID, XID and other parameters as needed and tells us whether we
> can go ahead with the abort or if we need to wait out (maybe we pass
> in an ok_to_wait param as well). As an example, a plugin could lookup
> some shmem structure which points to the current transaction being
> decoded and does related processing to ensure that it stops decoding
> at a clean juncture, thus keeping the response time bounded to a
> maximum of one change record apply cycle. That passes the onus onto
> the plugin writers and keeps the core code around this concurrent
> abort handling clean.
> 

Having this as a responsibility of the plugin sounds interesting. It
certainly narrows the scope for which we need to solve the abort
issue. For 2PC that may be okay, as we need to somehow interact with
the transaction manager, as Simon noted. I am not sure this helps the
streaming use-case though, as there is not going to be any external
transaction management involved there.

In any case, all this interlocking could potentially be made less
impactful by only doing it when we know the transaction made catalog
changes prior to the currently decoded change (which we track during
decoding), since that is the only time we care whether it aborted or not.

This all leads me to another idea. What if logical decoding provided an
API for "locking/unlocking" the currently decoded transaction against
abort? This function would then be called by both the decoding machinery
and the output plugin before any catalog read. The function can be smart
enough to be a no-op if the transaction is not running (i.e., we are not
doing 2PC decoding or streaming) or when the transaction didn't make any
catalog modifications (we already have that info easily accessible as a
bool).

That would mean we'd never hold any kind of heavy lock for prolonged
periods of time (e.g., across network calls), but only during catalog
access, and only when needed. It would also solve this for both 2PC and
streaming, and it would be easy to use for plugin authors. We would just
document that the call must be made before catalog access in an output
plugin; we could probably even Assert that the call was done.

Thoughts?

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Nikhil Sontakke
Date:
Hi,

> Having this as responsibility of plugin sounds interesting. It certainly
> narrows the scope for which we need to solve the abort issue. For 2PC
> that may be okay as we need to somehow interact with transaction manager
> as Simon noted. I am not sure if this helps streaming use-case though as
> there is not going to be any external transaction management involved there.
>
> In any case all this interlocking could potentially be made less
> impact-full by only doing it when we know the transaction did catalog
> changes prior to currently decoded change (which we do during decoding)
> since that's the only time we are interested in if it aborted or not.
>
> This all leads me to another idea. What if logical decoding provided API
> for "locking/unlocking" the currently decoded transaction against abort.
> This function would then be called by both decoding and output plugin
> before any catalog read. The function can be smart enough to be NOOP if
> the transaction is not running (ie we are not doing 2PC decoding or
> streaming) or when the transaction didn't do any catalog modifications
> (we already have that info easily accessible as bool).
>
> That would mean we'd never do any kind of heavy locking for prolonged
> periods of time (ie network calls) but only during catalog access and
> only when needed. It would also solve this for both 2PC and streaming
> and it would be easy to use by plugin authors. Just document that some
> call should be done before catalog access when in output plugin, can
> even be Asserted that the call was done probably.
>
> Thoughts?
>

Yeah, this might work. We already have SET_LOCKTAG_TRANSACTION(), via
which every transaction takes an exclusive lock on its own
transaction ID when it starts, for example. We ideally want a single
solution that handles both 2PC and ongoing (streaming) transactions. We
could introduce a new SET_LOCKTAG_LOGICALTRANSACTION(). The logical
decoding process could take a SHARED lock on this, check if the XID is
still OK to decode, read the catalog, and unlock. Abort/commit
transaction processing could take this lock in EXCLUSIVE mode.

As mentioned above, the plugin API which takes this lock will be smart
enough to be a no-op if the transaction is not running (i.e., we are not
doing 2PC decoding or streaming) or when the transaction didn't make any
catalog modifications.

Thoughts?

Regards,
Nikhils
-- 
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Simon Riggs
Date:
On 29 January 2018 at 07:15, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:

>> Having this as responsibility of plugin sounds interesting. It certainly
>> narrows the scope for which we need to solve the abort issue. For 2PC
>> that may be okay as we need to somehow interact with transaction manager
>> as Simon noted. I am not sure if this helps streaming use-case though as
>> there is not going to be any external transaction management involved there.

I think we should recognize that the two cases are different.

2PC decoding patch is looking at what happens AFTER a PREPARE has
occurred on a transaction.

Streaming looks to have streaming occur right in the middle of a
transaction, so it has other concerns and likely other solutions as a
result.

>> In any case all this interlocking could potentially be made less
>> impact-full by only doing it when we know the transaction did catalog
>> changes prior to currently decoded change (which we do during decoding)
>> since that's the only time we are interested in if it aborted or not.
>>
>> This all leads me to another idea. What if logical decoding provided API
>> for "locking/unlocking" the currently decoded transaction against abort.
>> This function would then be called by both decoding and output plugin
>> before any catalog read. The function can be smart enough to be NOOP if
>> the transaction is not running (ie we are not doing 2PC decoding or
>> streaming) or when the transaction didn't do any catalog modifications
>> (we already have that info easily accessible as bool).
>>
>> That would mean we'd never do any kind of heavy locking for prolonged
>> periods of time (ie network calls) but only during catalog access and
>> only when needed. It would also solve this for both 2PC and streaming
>> and it would be easy to use by plugin authors. Just document that some
>> call should be done before catalog access when in output plugin, can
>> even be Asserted that the call was done probably.
>>
>> Thoughts?
>>
>
> Yeah, this might work. We already have SET_LOCKTAG_TRANSACTION() via
> which every transaction takes an exclusive lock on its own
> transactionid when it starts, for example. We ideally want a single
> solution to handle 2PC and ongoing (streaming) transactions. We could
> introduce a new SET_LOCKTAG_LOGICALTRANSACTION(). The logical decoding
> process could take a SHARED lock on this, check if the XID is still ok
> to decode, read the catalog and unlock. Abort/Commit transaction
> processing could take this in EXCLUSIVE mode.
>
> As mentioned above, the plugin API which takes this lock will be smart
> enough to be a NOOP if the transaction is not running (i.e we are not
> doing 2PC decoding or streaming) or when the transaction didn't do any
> catalog modifications.

I think it is enough to say that ROLLBACK PREPARED cannot abort a
transaction while decoding is taking place, if the plugin decides it
wishes to prevent that. So the plugin API must have a
pre-rollback-prepared call to allow the plugin to decide how to handle
that (Is that what you meant by the pre-abort hook?). That call will
just return when and if it is OK to abort. Any waiting or throwing of
ERRORs can occur inside the plugin.

The plugin can provide a function to allow that type of abort if it
wishes, but no new core functionality is required for that. If needed,
we would wait on a latch that the plugin provides.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Simon Riggs
Date:
On 23 January 2018 at 19:17, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:

> I am not sure if this helps streaming use-case though as
> there is not going to be any external transaction management involved there.

So, I think we need some specific discussion of what to do in that case.

Streaming happens only with big transactions and only for short periods.

The problem only occurs when we are decoding and we hit a catalog
table change. Processing of that is short, and then we continue. So it
seems perfectly fine to block aborts in those circumstances. We can
just mark that state in an in-memory array of
StreamingDecodedTransactions that has size SizeOf(TransactionId) *
MaxNumWalSenders.

We can add a check into RecordTransactionAbort(), just before the
critical section, to see if we are currently processing a
StreamingDecodedTransaction and, if so, poll until we're OK to abort.
The check will be quick, and the abort path is not one we need to
optimize.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Tomas Vondra
Date:
Hi,

On 01/29/2018 11:23 AM, Simon Riggs wrote:
> On 29 January 2018 at 07:15, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> 
>>> Having this as responsibility of plugin sounds interesting. It
>>> certainly narrows the scope for which we need to solve the abort
>>> issue. For 2PC that may be okay as we need to somehow interact
>>> with transaction manager as Simon noted. I am not sure if this
>>> helps streaming use-case though as there is not going to be any
>>> external transaction management involved there.
> 
> I think we should recognize that the two cases are different.
> 
> 2PC decoding patch is looking at what happens AFTER a PREPARE has
> occurred on a transaction.
> 
> Streaming looks to have streaming occur right in the middle of a
> transaction, so it has other concerns and likely other solutions as a
> result.
> 

I don't quite see how these cases are so different, or why would they
require different handling. Can you explain?

We need to deal with decoding a transaction that aborts while we're
decoding it - we must not continue decoding it (in the sense of passing
the changes to the plugin). I don't see any major difference between
ROLLBACK and ROLLBACK PREPARED.

So I think the pre-abort hook solution should work for the streaming
case too. At least - I don't see why it wouldn't.

While discussing this with Peter Eisentraut off-list, I think we came up
with an alternative idea for the streaming case, that does not require
the pre-abort hook. The important detail is that we only really care
about aborts in transactions that modified catalogs in some way (e.g. by
doing DDL). But we can safely decode (and stream) changes up to the
point when the catalogs get modified, so we can do two different things
at that point:

(1) Stop streaming changes from that transaction, and instead start
spilling it to disk (and then continue with the replay only when it
actually commits).

(2) Note the transaction ID somewhere and restart the decoding (that is,
notify the downstream to throw away the data and go back in WAL to read
all the data from scratch) but spilling that one transaction to disk
instead of streaming it.

Neither of these solutions is currently implemented; both would require
changes to ReorderBuffer (which currently does not support mixing
spill-to-disk and streaming) and possibly to the logical decoding
infrastructure (e.g., to stash the XIDs somewhere and allow restarting
from a previous LSN).

The good thing is that it does not need to know about aborted
transactions, and so does not require communication with the transaction
manager using pre-abort hooks etc. I think this would be a massive
advantage.

The main question (at least for me) is whether it's actually cheaper
compared to the pre-abort hook. In my experience, aborts tend to be
fairly rare in practice, maybe 1:1000 relative to commits. On the other
hand, temporary tables are a fairly common thing, and they count as
catalog changes, of course.

Maybe this is not that bad though, as it only really matters for large
transactions, which is when we start to spill to disk or stream. And
that should not be very often - if it happens very often, you probably
need a higher limit (similarly to work_mem).

The cases where this would matter are large ETL jobs, upgrade scripts,
and so on; these tend to be large and to mix in DDL (temporary tables,
ALTER TABLE, ...). That's unfortunate, as it's one of the cases
streaming was supposed to help.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Simon Riggs
Date:
On 29 January 2018 at 13:34, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> The important detail is that we only really care
> about aborts in transactions that modified catalogs in some way (e.g. by
> doing DDL). But we can safely decode (and stream) changes up to the
> point when the catalogs get modified, so we can do two different things
> at that point:

Safely? Hmm.

I think what we are missing here is a test case, or a detailed
description of what we think happens. I can't see why a concurrent
abort would be unsafe, as in, it would make decoding explode.

If all it does is give the wrong answer when we process DDL, then all
we have to do is check for an abort before and after we process any DDL
during decoding. Handling DDL is rare, so that test won't cost much. So
ISTM that we can continue decoding after we hit DDL; we just need some
care.

(My earlier comments were based on the idea that 2PC patch would
simply block aborts, which of course will not work for streaming,
hence the difference)

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Tomas Vondra
Date:

On 01/29/2018 02:49 PM, Simon Riggs wrote:
> On 29 January 2018 at 13:34, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> 
>> The important detail is that we only really care
>> about aborts in transactions that modified catalogs in some way (e.g. by
>> doing DDL). But we can safely decode (and stream) changes up to the
>> point when the catalogs get modified, so we can do two different things
>> at that point:
> 
> Safely? Hmm.
> 
> I think what we are missing here is a test case, or a detailed 
> description of what we think happens. I can't see why a concurrent 
> abort would be unsafe, as in, it would make decoding explode.
> 
> If all it does is give the wrong answer when we process DDL, then
> all we have to do is check for abort before and after we process any
> DDL during decoding. Handling DDL is rare, so that test won't cost
> much in terms of handling DDL. So ISTM that we can continue decoding
> after we hit DDL, we just need some care.
> 

We do not decode DDL, but maybe I misunderstood what you mean.

The issue is that we consult catalogs during the decoding for various
reasons (to interpret table structure, see if the table is part of a
replication set, and the plugin is free to use arbitrary user catalogs).

Hitting the issue is tricky, because of timing issues.

But imagine a transaction like this.

BEGIN;
INSERT INTO t(a,b) VALUES (1,2);
SAVEPOINT s;
ALTER TABLE t ADD COLUMN c INT;
INSERT INTO t(a,b,c) VALUES (1,2,3);
ROLLBACK TO s;
COMMIT;

Now assume that we start streaming the transaction, and it happens like
this (T refers to the steps of the transaction):

1) start decoding the subtransaction (not yet aborted)

2) T: ROLLBACK TO s;

3) VACUUM the catalogs, which may remove pg_attribute rows for "c" or
   make them inaccessible through indexes, etc.

4) inspect the new row (which we still have in reorderbuffer)

5) Kabooom! The row has column "c" which we don't see in the catalog.

> (My earlier comments were based on the idea that 2PC patch would
> simply block aborts, which of course will not work for streaming,
> hence the difference)
> 

I'm sorry, but I don't quite see why wouldn't that work? There are far
too many ideas flying around on this thread, so I'm probably missing
something. Can you elaborate?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Simon Riggs
Date:
On 29 January 2018 at 14:13, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> 4) inspect the new row (which we still have in reorderbuffer)
>
> 5) Kabooom! The row has column "c" which we don't see in the catalog.

We don't use caches? Why does a cache miss cause it to explode?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Tomas Vondra
Date:

On 01/29/2018 03:17 PM, Simon Riggs wrote:
> On 29 January 2018 at 14:13, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> 
>> 4) inspect the new row (which we still have in reorderbuffer)
>>
>> 5) Kabooom! The row has column "c" which we don't see in the catalog.
> 
> We don't use caches? Why does a cache miss cause it to explode?
> 

We do use caches (and we invalidate them), of course.

But the problem is that by the time we get to look up the row, it may
either have been removed by VACUUM (because the catalog cleanup is more
aggressive) or no longer be reachable via an index (which is the HOT
issue pointed out by Robert earlier in this thread).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Logical Decoding and HeapTupleSatisfiesVacuum assumptions

From
Nikhil Sontakke
Date:
Hi all,

On 29 January 2018 at 12:45, Nikhil Sontakke <nikhils@2ndquadrant.com> wrote:
> Hi,
>
>> Having this as responsibility of plugin sounds interesting. It certainly
>> narrows the scope for which we need to solve the abort issue. For 2PC
>> that may be okay as we need to somehow interact with transaction manager
>> as Simon noted. I am not sure if this helps streaming use-case though as
>> there is not going to be any external transaction management involved there.
>>
>> In any case all this interlocking could potentially be made less
>> impact-full by only doing it when we know the transaction did catalog
>> changes prior to currently decoded change (which we do during decoding)
>> since that's the only time we are interested in if it aborted or not.
>>
>> This all leads me to another idea. What if logical decoding provided API
>> for "locking/unlocking" the currently decoded transaction against abort.
>> This function would then be called by both decoding and output plugin
>> before any catalog read. The function can be smart enough to be NOOP if
>> the transaction is not running (ie we are not doing 2PC decoding or
>> streaming) or when the transaction didn't do any catalog modifications
>> (we already have that info easily accessible as bool).
>>
>> That would mean we'd never do any kind of heavy locking for prolonged
>> periods of time (ie network calls) but only during catalog access and
>> only when needed. It would also solve this for both 2PC and streaming
>> and it would be easy to use by plugin authors. Just document that some
>> call should be done before catalog access when in output plugin, can
>> even be Asserted that the call was done probably.
>>
>> Thoughts?
>>
>
> Yeah, this might work. We already have SET_LOCKTAG_TRANSACTION() via
> which every transaction takes an exclusive lock on its own
> transactionid when it starts, for example. We ideally want a single
> solution to handle 2PC and ongoing (streaming) transactions. We could
> introduce a new SET_LOCKTAG_LOGICALTRANSACTION(). The logical decoding
> process could take a SHARED lock on this, check if the XID is still ok
> to decode, read the catalog and unlock. Abort/Commit transaction
> processing could take this in EXCLUSIVE mode.
>
> As mentioned above, the plugin API which takes this lock will be smart
> enough to be a NOOP if the transaction is not running (i.e we are not
> doing 2PC decoding or streaming) or when the transaction didn't do any
> catalog modifications.
>
> Thoughts?
>

This latest version takes care of the abort-while-decoding issue along
with additional test cases and documentation changes.

We now maintain a list of processes that are decoding a specific
transaction ID, and make each of them a decode groupmember of a decode
groupleader process. The decode groupleader process is basically the
PGPROC entry that points to the prepared 2PC transaction or to an
ongoing regular transaction.

If the 2PC transaction is rolled back, then FinishPreparedTransactions
uses the decode groupleader process to let all the decode groupmember
processes know that it is aborting. Similar logic can be used for the
decoding of uncommitted transactions. The decode groupmember processes
are able to abort sanely in such a case. We also have two new APIs,
"LogicalLockTransaction" and "LogicalUnlockTransaction", that decoding
backends need to use while accessing system or user catalog tables. The
abort code interlocks with decoding backends that might be in the middle
of accessing catalog tables, and waits those few moments before aborting
the transaction.

The implementation uses LockHashPartitionLockByProc on the decode
groupleader process to control access to these additional fields in
the PGPROC structure amongst the decode groupleader and the other
decode groupmember processes, and does not need to take ProcArrayLock
at all. The implementation is inspired by the *existing*
lockGroupLeader solution, which uses a similar technique to track
processes waiting on a leader holding that lock. I believe it's an
optimal solution for this problem of ours.

I have added TAP tests that exercise multiple decoding backends working
on the same transaction. I used delays in the test-decoding plugin to
introduce waits after the LogicalLockTransaction call, and called
ROLLBACK to ensure that it interlocks with decoding backends that are
doing catalog access. The tests work as desired, and "make check-world"
passes with asserts enabled.

I have separated out the decode groupleader/groupmember changes from
the main logical-decoding-of-2PC patch into a separate patch, for REVIEW
only. Both the main patch and this review-only patch are attached to
this email.

I have posted this same explanation about abort handling on the other
thread for the main 2PC logical decoding patch
(http://www.postgresql-archive.org/logical-decoding-of-two-phase-transactions-td5936887.html).

Comments appreciated.

Regards,
Nikhils
-- 
 Nikhil Sontakke                   http://www.2ndQuadrant.com/
 PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

Attachment