From: Kevin Grittner
Subject: Re: BUG #11732: Non-serializable outcomes under serializable isolation
In response to: Re: BUG #11732: Non-serializable outcomes under serializable
  isolation  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-bugs
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:

>> Everything suggests a race condition.  I haven't been able to
>> pin down exactly where it happens, but that's just a matter of
>> time.  It is interesting that the data type of the column
>> changes the timing enough to have such a large effect on seeing
>> the failure.  I've set it aside for the moment but expect to get
>> back to it later this week.
>
> Any luck here?

Yes and no.  I have further evidence, and have narrowed the area of
investigation, but I haven't been able to pin down the cause yet.
I'll post what I have "for the record" and in case someone has an
idea that will help me find it.

I was able to create a bash script and a simple plpgsql function to
demonstrate the problem.  They're short enough to include in the
email, but I will attach them too, in case the email formatting
mangles them.

Bash file "stress":

#!/bin/bash
# Recreate a throwaway cluster with serializable as the default
# isolation level.
pg_ctl -D Debug/data -m fast -w stop
rm -fr Debug/data/*
initdb Debug/data
echo "default_transaction_isolation = 'serializable'" >> Debug/data/postgresql.conf
echo "shared_buffers = 1GB" >> Debug/data/postgresql.conf
echo "max_connections = 200" >> Debug/data/postgresql.conf
# Brace group (not a subshell) so a failed start aborts the script.
pg_ctl -D Debug/data -l Debug/data/logfile -w start || { cat Debug/data/logfile; exit 1; }
createdb test
psql -f ~/stress.sql test
psql -c "CREATE TABLE \"stress\" (\"id\" serial primary key, \"key\" text);" test
# 1000 rounds; each round fires 128 concurrent attempts to insert
# the same key value.
for i in {1..1000}
do
  echo $i
  for j in {1..128}
  do
    psql -c "select stress_fn('$i');" test >/dev/null 2>&1 &
  done
  wait
  sleep 0.3s
done
# Exactly one row per key should survive; the last query reports
# any duplicates.
psql -c "select count(*) from stress;" test
psql -c "select key, count(*) from stress group by key having count(*) > 1 order by key::int;" test

"stress.sql" file:

CREATE FUNCTION stress_fn(target_key text)
  RETURNS VOID
  LANGUAGE plpgsql
  VOLATILE
AS $$
BEGIN
  PERFORM 1 AS one FROM "stress" WHERE "stress"."key" = target_key;
  IF FOUND THEN
    RAISE EXCEPTION 'already inserted';
  END IF;
  INSERT INTO "stress" ("key")
    VALUES (target_key);
END;
$$;
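
Run by hand, the serial behavior is what you'd expect; it is only
the concurrent case that leans on the serializable machinery (the
key value here is arbitrary):

select stress_fn('42');   -- no match found; inserts the row
select stress_fn('42');   -- ERROR:  already inserted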

So far this has generated 1 to 3 duplicated values per run.  I have
changed the sleep between rounds to 10.0 seconds and still had an
occurrence.  Changing the "key" column and the corresponding
function parameter to integer causes it to run without generating
any duplicates (so far, at least).
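
(To be explicit, the integer variant is just the obvious
substitution in both files, e.g.:

CREATE TABLE "stress" ("id" serial primary key, "key" integer);

and declaring stress_fn(target_key integer) instead of text; the
function body itself is unchanged.)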

For each non-duplicated "key" value I get about 10 instances of this:

ERROR:  could not serialize access due to read/write dependencies among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during write.
HINT:  The transaction might succeed if retried.
CONTEXT:  SQL statement "INSERT INTO "stress" ("key")
        VALUES (target_key)"
    PL/pgSQL function stress_fn(text) line 8 at SQL statement
STATEMENT:  select stress_fn('1');

... followed by enough of these to add up to 127 errors per value:

ERROR:  already inserted
STATEMENT:  select stress_fn('1');

For the "key" values that generate one or more duplicates I get
different logging.  I see about 30 of these:

ERROR:  could not serialize access due to read/write dependencies among transactions
DETAIL:  Reason code: Canceled on conflict out to pivot 9137, during read.
HINT:  The transaction might succeed if retried.
CONTEXT:  SQL statement "SELECT 1 AS one FROM "stress" WHERE "stress"."key" = target_key"
    PL/pgSQL function stress_fn(text) line 3 at PERFORM
STATEMENT:  select stress_fn('339');

... before it switches to:

ERROR:  already inserted
STATEMENT:  select stress_fn('339');

We used a different reason code in the detail for each spot that
could generate a serialization failure based on what serializable
transactions add to the checking done for repeatable read
transactions.  This particular reason code can only be generated if
a "dangerous structure" of read-write dependencies is found where
the "pivot" has committed -- a fairly rare occurrence.  So far in
this test, I have not seen that reason code without the duplicates,
nor duplicates without that reason code.
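
To make the common case concrete, here is a minimal two-session
sketch (the session labels are mine, and exactly where the loser is
canceled depends on timing) of how two concurrent executions of the
check-then-insert pattern produce the "pivot, during write" code:

-- session A
begin isolation level serializable;
select 1 from stress where key = '7';   -- finds nothing
-- session B, concurrently
begin isolation level serializable;
select 1 from stress where key = '7';   -- also finds nothing
-- session A
insert into stress (key) values ('7');
commit;
-- session B
insert into stress (key) values ('7');
-- ERROR:  could not serialize access due to read/write dependencies
-- among transactions
-- DETAIL:  Reason code: Canceled on identification as a pivot,
-- during write.

Each SELECT takes a predicate lock covering the other session's
INSERT, so in this interleaving the session that writes second,
after the other has committed, is identified as the pivot and
canceled at its write.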

For those not intimately familiar with our serializable
implementation but who want to follow along to some degree or
other: the only way that serialization anomalies can occur under
snapshot isolation (what we have for repeatable read) is when a
"pivot" transaction (Tpivot) has a read-write dependency out to a
transaction (Tout) and a read-write dependency in from a
transaction (Tin), and Tout commits before either Tin or Tpivot
commits.  (In other words, Tout writes something that would have
been read by Tpivot if Tout had committed before Tpivot began, and
Tpivot writes something that would have been read by Tin if Tpivot
had committed before Tin began.)  As a further test to avoid false
positives, an anomaly cannot occur if Tin is read-only (or if it
commits without modifying data) and Tin started before Tout
committed.

Just for completeness, Tin and Tout can be the same transaction.
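
Schematically, with each arrow pointing from the reader to the
writer whose change it missed:

  Tin ---rw---> Tpivot ---rw---> Tout
                                 (Tout commits first)

When Tin and Tout are the same transaction, this folds into a cycle
of two transactions, as in the sketch above.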

So, to hit this code, Tin must be reading data written by Tpivot,
Tpivot must have committed (or at least be well into the commit
process), and Tout must have committed before Tpivot committed.
Tin is executing a SELECT, checks the MVCC visibility information
of a row, and finds that it was written by a transaction which is
committed (or at least past the point where it can throw a
serialization failure), *and* either Tpivot had read data that had
been written by Tout, or Tout had written data that predicate locks
should show had already been read by Tpivot.  Either way, what Tin
is seeing should not be possible, because what I just described
forms a dangerous structure which should have caused Tpivot to be
canceled.

Following from that, it appears that the different reason code Tin
is seeing is probably a symptom of a failure which has already
occurred -- otherwise how could both of the other transactions have
committed?  Since there is no index on the "key" column, we must do
a heap scan, and that means a predicate lock on the heap relation.
Somehow, two connections each get that predicate lock, scan the
relation, find that the key value doesn't yet exist, and insert
without seeing the other transaction's predicate lock.  But I'm not
seeing how that could happen.
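
One way to watch that from a third session while the test is
running is the lock view; with no index on "key", each scan should
show up as a relation-level predicate lock (SIReadLock) on
"stress":

select locktype, relation::regclass, pid, mode
  from pg_locks
 where mode = 'SIReadLock';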

Another possibility is that this is somehow happening as the result
of some bug in the "commit sequence number" used by the
serializable implementation.  In the absence of a real CSN, and to
avoid additional contention on the PROCARRAY locking, we created an
approximation of a CSN that plays some tricks which assume atomic
memory access where perhaps that isn't safe or a memory barrier is
needed.  If neither Tpivot nor Tout sees the other as having
committed first, we could see exactly this behavior.

Those aren't the only places this race condition bug could be
hiding, but they seem like the two most obvious candidates.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
