Re: BUG #13681: Serialization failures caused by new multixact code of 9.3 (back-patch request) - Mailing list pgsql-bugs

From Alvaro Herrera
Subject Re: BUG #13681: Serialization failures caused by new multixact code of 9.3 (back-patch request)
Msg-id 20151217183153.GO2618@alvherre.pgsql
In response to BUG #13681: Serialization failures caused by new multixact code of 9.3 (back-patch request)  (odo@odoo.com)
Responses Re: BUG #13681: Serialization failures caused by new multixact code of 9.3 (back-patch request)  (Kevin Grittner <kgrittn@gmail.com>)
List pgsql-bugs
odo@odoo.com wrote:

> This is a back-patch request of 05315498012530d44cd89a209242a243374e274d to
> 9.3 and 9.4.
>
> As discussed in the -general list[1], both 9.3 and 9.4 show spurious
> serialization failures when faced with the use case included below.
>
> In 9.2, T2 used to block until T1's commit, but then continued without
> error, and in 9.5 both T1 and T2 proceed without blocking nor error.
>
> Kevin Grittner located[2] the root cause as a regression that was fixed by
> Álvaro at 0531549 [3].
>
> For what it's worth, our system uses many long-running transactions
> (background jobs, batch data imports, etc.) that are frequently interrupted
> and rolled back by micro-transactions coming from users who just happen to
> update minor data on their records (such as their last login date). So this
> bug appears to cause more than just a performance regression.

I would like to understand why that patch fixes the problem -- maybe
the fix is coincidental and the real reason is something different.  The
commit message states:
    Commit 0ac5ad5134f2 removed an optimization in multixact.c that skipped
    fetching members of MultiXactId that were older than our
    OldestVisibleMXactId value.  The reason this was removed is that it is
    possible for multixacts that contain updates to be older than that
    value.  However, if the caller is certain that the multi does not
    contain an update (because the infomask bits say so), it can pass this
    info down to GetMultiXactIdMembers, enabling it to use the old
    optimization.
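
To make the commit message concrete, here is a toy model of the
short-circuit it describes. This is only a sketch, not the code in
multixact.c: the names get_members, only_lock, members_on_disk and
Member are hypothetical, and the lookup of members is reduced to a
dictionary access. The one thing it tries to show is that the skip
applies only when the caller asserts the multi is lock-only and the
multi precedes the oldest visible value.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Member:
    """One member of a multixact (hypothetical, simplified)."""
    xid: int
    is_update: bool  # False means the member only holds a lock


def multixact_precedes(a: int, b: int) -> bool:
    """Wraparound-aware 'a is older than b' for 32-bit counters."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000  # same as a negative signed 32-bit difference


def get_members(
    multi: int,
    members_on_disk: Dict[int, List[Member]],
    oldest_visible_mxact: int,
    only_lock: bool,
) -> Optional[List[Member]]:
    """Return the members of `multi`, or None if the lookup was skipped.

    The skip is taken only when the caller says the multi cannot contain
    an update (only_lock) AND the multi is older than the oldest visible
    multixact. Without only_lock the members are always fetched, because
    a multi containing an update may be older than that value.
    """
    if only_lock and multixact_precedes(multi, oldest_visible_mxact):
        return None
    return members_on_disk[multi]
```

In this model, an old multi is skipped only when only_lock is true; the
same multi is fetched in full when the caller cannot rule out an update.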

In your test case,

>               T1                                T2
> |-----------------------------|----------------------------------|
>     BEGIN ISOLATION LEVEL
>           REPEATABLE READ;
>
>     UPDATE orders
>     SET name = 'order of foo',
>         user_id = 1
>     WHERE id = 1;
>
>                                       BEGIN ISOLATION LEVEL
>                                             REPEATABLE READ;
>
>                                       UPDATE users
>                                       SET date = now()
>                                       WHERE id = 1;
>
>                                       COMMIT;
>
>     UPDATE orders
>     SET name = 'order of foo (2)',
>         user_id = 1
>     WHERE id = 1;

we have a transaction that takes a lock-only multi in
table users, and then when we do the second update we don't look it up
because ...?  And then this causes the test case not to fail because
...?

I would like to understand the mechanism for this fix before declaring
that the fix is correct.

The patch doesn't apply cleanly because of other changes in the area, so
I attach the backpatched version here, as well as a test case for
isolationtester in a separate commit (with which it's easy to confirm
that the problem does exist and that the patch indeed fixes it).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
