Thread: vacuum vs heap_update_tuple() and multixactids

vacuum vs heap_update_tuple() and multixactids

From
Andres Freund
Date:
Hi,

In [1] I'd discovered an only mildly related bug while reading code to
make sure my fix [2] et al. was correct.

Quoting a couple messages by myself:
> Staring at the vacuumlazy hunk I think I might have found a related bug:
> heap_update_tuple() just copies the old xmax to the new tuple's xmax if
> it is a multixact and still running.  It does so without verifying the
> liveness of its members.  Isn't that buggy? Consider what happens if we
> have three blocks: block 1 has free space, block 2 is being vacuumed and
> is locked, and block 3 is full and has a tuple that's key-share locked
> by a live transaction and updated by a dead xmax from before the xmin
> horizon. In that case afaict the multi will be copied from the third
> page to the first one.  Which is quite bad, because vacuum has already
> processed it, and we'll set relfrozenxid accordingly.  I hope I'm
> missing something here?
>
> Trying to write a testcase for that now.
>
> This indeed happens, but I can't quite figure out a way to write an
> isolationtester test for this. The problem is that to have something
> reproducible one has to make vacuum block on a cleanup lock, but that
> currently doesn't register as waiting for the purpose of
> isolationtester's logic.

So what basically happens is that vacuum might be at block X while a
concurrent update moves a tuple from a block > X to a block < X,
preserving the multixactid in xmax. That can leave the table with a
multixactid from before relminmxid.

I reproduced the issue by hand, but doing so is pretty painful, and I've
not found any way to do it reliably in isolationtester so far.  Cleanup
locks make it possible to schedule this without race conditions, but
isolationtester currently won't switch sessions when blocked on a
cleanup lock.

Could I perhaps convince somebody to add that as a feature to
isolationtester? I'm willing to work on a bugfix for the bug itself, but
I've already spent tremendous amounts of time, energy and pain on
multixact bugs, and I'm at the moment feeling a bit unenthusiastic about
also working on test infrastructure for it...  If somebody else is
willing to work on both infrastructure *and* a bugfix, that's obviously
even better ;)

I think the bugfix is going to have to essentially be something similar
to FreezeMultiXactId(). I.e. when reusing an old tuple's xmax for a new
tuple version, we need to prune dead multixact members. I think we can
do so unconditionally and rely on the multixact id caching layer to
avoid unnecessarily creating multis when all members are the same.
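
To illustrate the shape I have in mind, here's a rough sketch (the
helper name is made up, and it deliberately ignores the full member-keep
rules - e.g. committed updaters - that FreezeMultiXactId() has to apply):

#include "postgres.h"
#include "access/multixact.h"
#include "access/xact.h"
#include "storage/procarray.h"

/*
 * Sketch only: keep just the still-interesting members of the old
 * tuple's multixact xmax before copying it into the new tuple version.
 */
static MultiXactId
prune_old_xmax_members(MultiXactId old_multi)
{
    MultiXactMember *members;
    MultiXactMember *keep;
    int         nmembers;
    int         nkeep = 0;
    int         i;

    /* false, false: not from pg_upgrade, not known to be lock-only */
    nmembers = GetMultiXactIdMembers(old_multi, &members, false, false);
    if (nmembers <= 0)
        return old_multi;

    keep = palloc(sizeof(MultiXactMember) * nmembers);
    for (i = 0; i < nmembers; i++)
    {
        /* deliberately simplified liveness check, for illustration only */
        if (TransactionIdIsCurrentTransactionId(members[i].xid) ||
            TransactionIdIsInProgress(members[i].xid))
            keep[nkeep++] = members[i];
    }

    if (nkeep == 0)
        return InvalidMultiXactId;  /* caller would just clear the xmax */

    /*
     * The cache in multixact.c returns the existing multi when the member
     * set is unchanged, so this shouldn't allocate multis unnecessarily.
     */
    return MultiXactIdCreateFromMembers(nkeep, keep);
}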

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20171103145330.5ycjoje5s6lfwxps%40alap3.anarazel.de
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9c2f0a6c3cc8bb85b78191579760dbe9fb7814ec


Re: vacuum vs heap_update_tuple() and multixactids

From
Alvaro Herrera
Date:
Andres Freund wrote:

> I think the bugfix is going to have to essentially be something similar
> to FreezeMultiXactId(). I.e. when reusing an old tuple's xmax for a new
> tuple version, we need to prune dead multixact members. I think we can
> do so unconditionally and rely on the multixact id caching layer to
> avoid unnecessarily creating multis when all members are the same.

Actually, isn't the cache subject to the very same problem?  If you use
a value from the cache, it could very well be below whatever cutoff
multi was chosen in the other process ...

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: vacuum vs heap_update_tuple() and multixactids

From
Andres Freund
Date:
On 2017-12-19 15:35:12 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > I think the bugfix is going to have to essentially be something similar
> > to FreezeMultiXactId(). I.e. when reusing an old tuple's xmax for a new
> > tuple version, we need to prune dead multixact members. I think we can
> > do so unconditionally and rely on the multixact id caching layer to
> > avoid unnecessarily creating multis when all members are the same.
> 
> Actually, isn't the cache subject to the very same problem?  If you use
> a value from the cache, it could very well be below whatever cutoff
> multi was chosen in the other process ...

That's a potential issue somewhat independent of this bug, though (IIRC
I also mentioned it in the other thread). I hit that problem a bunch in
manual testing, but I didn't manage to create an actual testcase for it
without pre-existing corruption.  I don't think this specific instance
would be more vulnerable than FreezeMultiXactId() itself - we'd just
take the live members and pass them to MultiXactIdCreateFromMembers(),
which is exactly what FreezeMultiXactId() does.

I think the cache issue ends up not quite being a live bug, because
every transaction that's a member of a multixact has also done
MultiXactIdSetOldestMember(). Which in turn probably, though I'm not
sure, prevents the cache from containing multis that have only live
members but are below the multi horizon. That relies on the fact that we
only create multis with live members, though, which seems fragile.

It'd be good if we added some assurances to
MultiXactIdCreateFromMembers() that it actually can't happen.

Hope I didn't miss a live version of the above problem?

Greetings,

Andres Freund


Re: vacuum vs heap_update_tuple() and multixactids

From
Robert Haas
Date:
On Tue, Dec 19, 2017 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
> Could I perhaps convince somebody to add that as a feature to
> isolationtester? I'm willing to work on a bugfix for the bug itself, but
> I've already spent tremendous amounts of time, energy and pain on
> multixact bugs, and I'm at the moment feeling a bit unenthusiastic about
> also working on test infrastructure for it...  If somebody else is
> willing to work on both infrastructure *and* a bugfix, that's obviously
> even better ;)

Hmm.  The problem with trying to make the isolation tester do this is
that pg_isolation_test_session_is_blocked(X, A) is documented to tell
us whether X is waiting for one of the PIDs listed in A.  It's easy
enough to tell whether X is blocked waiting for a cleanup lock by
looking at BackendPidGetProc(pid)->wait_event_info, but that gives us
no information about which processes are holding the buffer pins that
prevent us from acquiring that lock.  That's a hard problem to solve
because that data is not recorded in shared memory and doing so would
probably be prohibitively expensive.

<me thinks for a while>

What if we add a hook to vacuum that lets you invoke some arbitrary C
code after each block, and then write a test module that uses that
hook to acquire and release an advisory lock at that point?  Then you
could do:

S1: SELECT pg_advisory_lock(...);
S2: VACUUM;  -- blocks on S1, because of the test module
S3: UPDATE ...;
S1: COMMIT; -- unblocks VACUUM
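
For concreteness, the test module could look roughly like this
(vacuum_block_hook is hypothetical - it's exactly the hook that would
need to be added - and this is only a sketch of the shape, not a
worked-out patch):

#include "postgres.h"
#include "fmgr.h"
#include "utils/fmgrprotos.h"       /* pg_advisory_lock_int8() etc. */

PG_MODULE_MAGIC;

/*
 * Hypothetical hook, not in core today: called by lazy vacuum after each
 * heap block it finishes.  A real version would probably also pass the
 * relation and block number.
 */
typedef void (*vacuum_block_hook_type) (void);
extern PGDLLIMPORT vacuum_block_hook_type vacuum_block_hook;

static void
test_vacuum_block_hook(void)
{
    /*
     * Acquire and immediately release advisory lock 1, so that a test
     * session holding that lock stalls VACUUM at a block boundary of the
     * test's choosing.
     */
    DirectFunctionCall1(pg_advisory_lock_int8, Int64GetDatum(1));
    DirectFunctionCall1(pg_advisory_unlock_int8, Int64GetDatum(1));
}

void
_PG_init(void)
{
    vacuum_block_hook = test_vacuum_block_hook;
}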

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment


Re: vacuum vs heap_update_tuple() and multixactids

From
Andres Freund
Date:
Hi,

On 2017-12-19 15:01:03 -0500, Robert Haas wrote:
> On Tue, Dec 19, 2017 at 1:31 PM, Andres Freund <andres@anarazel.de> wrote:
> > Could I perhaps convince somebody to add that as a feature to
> > isolationtester? I'm willing to work on a bugfix for the bug itself, but
> > I've already spent tremendous amounts of time, energy and pain on
> > multixact bugs, and I'm at the moment feeling a bit unenthusiastic about
> > also working on test infrastructure for it...  If somebody else is
> > willing to work on both infrastructure *and* a bugfix, that's obviously
> > even better ;)
> 
> Hmm.  The problem with trying to make the isolation tester do this is
> that pg_isolation_test_session_is_blocked(X, A) is documented to tell
> us whether X is waiting for one of the PIDs listed in A.  It's easy
> enough to tell whether X is blocked waiting for a cleanup lock by
> looking at BackendPidGetProc(pid)->wait_event_info, but that gives us
> no information about which processes are holding the buffer pins that
> prevent us from acquiring that lock.  That's a hard problem to solve
> because that data is not recorded in shared memory and doing so would
> probably be prohibitively expensive.

Indeed. I kinda wonder whether we could hackishly solve this by simply
returning true for all pids if it's waiting for a cleanup lock. That's
not actually that far from the truth... The big problem I see with that
is very short waits that resolve themselves causing issues - but given
that waiting for cleanup locks is really rare, that might not be too
bad.
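
The check could be roughly along these lines (just a sketch; the helper
name is made up, and it assumes cleanup-lock waits keep being reported
under the PG_WAIT_BUFFER_PIN wait-event class):

#include "postgres.h"
#include "pgstat.h"                 /* PG_WAIT_BUFFER_PIN */
#include "storage/proc.h"
#include "storage/procarray.h"      /* BackendPidGetProc() */

/*
 * Sketch: is the given backend currently waiting for a cleanup lock,
 * i.e. waiting for buffer pins to go away?  In that case
 * pg_isolation_test_session_is_blocked() would just report it as blocked
 * on every interesting pid, because the conflicting pin holders aren't
 * tracked in shared memory.
 */
static bool
blocked_on_cleanup_lock(int pid)
{
    PGPROC     *proc = BackendPidGetProc(pid);

    if (proc == NULL)
        return false;           /* session already gone */

    /* compare only the wait-event class, ignoring the specific event */
    return (proc->wait_event_info & 0xFF000000) == PG_WAIT_BUFFER_PIN;
}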


> <me thinks for a while>
> 
> What if we add a hook to vacuum that lets you invoke some arbitrary C
> code after each block, and then write a test module that uses that
> hook to acquire and release an advisory lock at that point?  Then you
> could do:

I guess that'd work, albeit a bit invasive and specific to this
test. Trying to come up with a concept that'll be generalizable longer
term would be neat.

Greetings,

Andres Freund


Re: vacuum vs heap_update_tuple() and multixactids

From
Robert Haas
Date:
On Wed, Dec 20, 2017 at 9:05 AM, Andres Freund <andres@anarazel.de> wrote:
> Indeed. I kinda wonder whether we could hackishly solve this by simply
> returning true for all pids if it's waiting for a cleanup lock. That's
> not actually that far from the truth... The big problem I see with that
> is very short waits that resolve themselves causing issues - but given
> that waiting for cleanup locks is really rare, that might not be too
> bad.

I'm wondering if that might cause random, spurious failures.  We go on
to the next step if we detect that the current step is waiting... and
the decisions we make about whether to do that cause differences in
the output.

>> <me thinks for a while>
>>
>> What if we add a hook to vacuum that lets you invoke some arbitrary C
>> code after each block, and then write a test module that uses that
>> hook to acquire and release an advisory lock at that point?  Then you
>> could do:
>
> I guess that'd work, albeit a bit invasive and specific to this
> test. Trying to come up with a concept that'll be generalizable longer
> term would be neat.

True, but I'm not sure there's much hope of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: vacuum vs heap_update_tuple() and multixactids

From
Amit Kapila
Date:
On Wed, Dec 20, 2017 at 12:12 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-12-19 15:35:12 -0300, Alvaro Herrera wrote:
>> Andres Freund wrote:
>>
>> > I think the bugfix is going to have to essentially be something similar
>> > to FreezeMultiXactId(). I.e. when reusing an old tuple's xmax for a new
>> > tuple version, we need to prune dead multixact members. I think we can
>> > do so unconditionally and rely on the multixact id caching layer to
>> > avoid unnecessarily creating multis when all members are the same.
>>
>> Actually, isn't the cache subject to the very same problem?  If you use
>> a value from the cache, it could very well be below whatever cutoff
>> multi was chosen in the other process ...
>
> That's a potential issue somewhat independent of this bug, though (IIRC
> I also mentioned it in the other thread). I hit that problem a bunch in
> manual testing, but I didn't manage to create an actual testcase for it
> without pre-existing corruption.  I don't think this specific instance
> would be more vulnerable than FreezeMultiXactId() itself - we'd just
> take the live members and pass them to MultiXactIdCreateFromMembers(),
> which is exactly what FreezeMultiXactId() does.
>
> I think the cache issue ends up not quite being a live bug, because
> every transaction that's a member of a multixact has also done
> MultiXactIdSetOldestMember(). Which in turn probably, though I'm not
> sure, prevents the cache from containing multis that have only live
> members but are below the multi horizon. That relies on the fact that
> we only create multis with live members, though, which seems fragile.
>

Yeah, but as of now this assumption (that we only create multis with
live members) seems to hold.  There are three callers of
MultiXactIdCreateFromMembers(); below is my analysis.

a. MultiXactIdExpand() calls it after removing dead members, or calls it
to create a singleton MultiXact that is in progress.
b. FreezeMultiXactId() calls it for non-dead members.
c. MultiXactIdCreate() calls it for two members that are either both in
progress, or of which one could be committed.

> It'd be good if we added some assurances to
> MultiXactIdCreateFromMembers() that it actually can't happen.
>

+1.  Even though it is true now, I think it is better to have such an assurance.
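
For concreteness, such an assurance could be something like the below
sketch (the helper name is made up, and GetOldestMultiXactId() might be
too heavy even for an assert-only check; this only shows the idea):

#include "postgres.h"
#include "access/multixact.h"

/*
 * Sketch: cross-check that a multi about to be handed back from the
 * per-backend cache in MultiXactIdCreateFromMembers() isn't already
 * behind the oldest multixact the rest of the system may rely on.
 */
static void
AssertCachedMultiNotBehindHorizon(MultiXactId multi)
{
    Assert(MultiXactIdIsValid(multi));
    /* GetOldestMultiXactId() is what vacuum bases its cutoffs on */
    Assert(!MultiXactIdPrecedes(multi, GetOldestMultiXactId()));
}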

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

