Thread: SSI and predicate locks - a non-trivial use case

From
Gianni Ceccarelli
Date:
Hello.

At work we have a program that seems to be stressing the SSI
implementation, and I thought that it could provide useful insights to
better tune it. In particular, there are a few parts that are
described as "chosen entirely arbitrarily (and without benchmarking)",
and we may provide some of that benchmarking.

First of all, we're running "PostgreSQL 9.2.4 on
x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red
Hat 4.1.2-52), 64-bit"

The program consumes messages from a message bus (ActiveMQ in our
case), and uses the data contained in them to update unstructured
documents; some values from those documents are extracted into an
attribute-value table to make it possible to search for them
later. The schema is essentially this::

  CREATE TABLE docs (
    id VARCHAR(255) PRIMARY KEY,
    contents TEXT NOT NULL
  );

  CREATE TABLE doc_attributes (
    document_id VARCHAR(255) NOT NULL REFERENCES docs(id)
                ON DELETE CASCADE,
    attribute_name VARCHAR(255) NOT NULL,
    value VARCHAR(255) NOT NULL
  );

  CREATE INDEX idx_attribute_doc
            ON doc_attributes(document_id);

  CREATE INDEX idx_attribute_name_str
            ON doc_attributes(attribute_name,value);

The interesting part of the program works like this:

* Figure out which documents to update::

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    SELECT id FROM docs WHERE ...;
    COMMIT;

* Update each of them in turn::

    BEGIN;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT contents FROM docs WHERE id=?;
    -- change the contents, in client code
    UPDATE docs SET contents=? WHERE id=?;
    DELETE FROM doc_attributes WHERE document_id=?;
    INSERT INTO doc_attributes(document_id,attribute_name,value)
           VALUES (?,?,?); -- for each attribute
    COMMIT;

  If we receive a serialisation error, we retry the whole transaction,
  applying the changes to the new version of the document. Each retry
  takes about 0.1 seconds.
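  The retry loop can be sketched roughly like this (a minimal
  illustration; the exception class and helper names are hypothetical,
  and real code would catch the database driver's error for SQLSTATE
  40001, serialization_failure, around the SERIALIZABLE transaction)::

    # Sketch of the retry-on-serialization-failure loop described
    # above.  SerializationFailure stands in for the driver error
    # carrying SQLSTATE 40001; apply_changes() is assumed to open a
    # SERIALIZABLE transaction, re-read the document, and commit.

    class SerializationFailure(Exception):
        """Stands in for the driver error with SQLSTATE 40001."""

    def update_with_retry(apply_changes, max_retries=10):
        """Run apply_changes(), retrying when the commit is rolled
        back with a serialization failure.  Each retry re-reads the
        document, so the changes are applied to the newest committed
        version."""
        for attempt in range(max_retries):
            try:
                return apply_changes()
            except SerializationFailure:
                continue  # re-read and re-apply on the new version
        raise RuntimeError(
            "gave up after %d serialization failures" % max_retries)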

We have a few processes doing this in parallel, to keep up with the
volume of messages being sent. We have an average of 30 rows in
``doc_attributes`` for each row in ``docs``. This is a typical
situation::

  SELECT pid, locktype,
         COUNT(*)/COUNT(DISTINCT virtualtransaction) AS tl,
         COUNT(*) AS total
  FROM pg_locks
  WHERE mode LIKE 'SI%'
  GROUP BY pid, locktype
  ORDER BY pid, locktype;

   pid  | locktype | tl  | total
  ------+----------+-----+-------
    445 | page     |   5 |  2706
    445 | tuple    |   1 |   767
    446 | page     |  14 |    28
    446 | tuple    |  37 |    74
    447 | page     |   1 |    19
    448 | page     |   1 |    19
    449 | page     |   5 |  2759
    449 | tuple    |   1 |   758
    454 | page     |  10 |  2209
    454 | tuple    |  37 |  7663
   1113 | page     |   5 |   604
   1113 | tuple    |   4 |   531
   1346 | page     |   6 |  1557
   1346 | tuple    |   1 |   454
        | page     | 174 |   174
        | tuple    | 236 |   236
  (16 rows)

Due to the large number of predicate locks, we have
``max_pred_locks_per_transaction = 10000`` and ``max_connections =
300`` (this is probably going to be reduced; we don't need more than
100).

Questions:

- What are locks without a pid? I thought they were left over from
  transactions of now-disconnected clients, waiting for all
  overlapping transactions to complete, but the numbers don't behave
  as I would expect in that case (i.e. they don't grow when a client
  disconnects)

- Is the large number of page locks to be expected? How long should
  we expect them to stay? Some seem to stay around for minutes.

- Can this be of any use to benchmarking / tuning the SSI logic?

--
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88

Well, I think Perl should run faster than C.  :-)
             -- Larry Wall in <199801200306.TAA11638@wall.org>


Re: SSI and predicate locks - a non-trivial use case

From
Kevin Grittner
Date:
Gianni Ceccarelli <dakkar@thenautilus.net> wrote:

> At work we have a program that seems to be stressing the SSI
> implementation, and I thought that it could provide useful
> insights to better tune it. In particular, there are a few parts
> that are described as "chosen entirely arbitrarily (and without
> benchmarking)", and we may provide some of that benchmarking.

Any insights on where the current algorithms don't perform well
would be welcome.  Of course, the subject line gives me some pause
-- I'm aware of many uses of SSI in non-trivial production
environments, including multi-terabyte databases with thousands of
concurrent users.  In some cases the performance hit compared to
REPEATABLE READ, which would allow anomalies, is about 1%.

> First of all, we're running "PostgreSQL 9.2.4 on
> x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704
> (Red Hat 4.1.2-52), 64-bit"
>
> The program consumes messages from a message bus (ActiveMQ in our
> case), and uses the data contained in them to update unstructured
> documents; some values from those documents are extracted into an
> attribute-value table to make it possible to search for them
> later.

Using SSI to manage a database queue is known to be a "worst case"
workload.  Since there is no blocking beyond what is present in
snapshot isolation level techniques (which is what you get at the
REPEATABLE READ level), all concurrent attempts will pull the same
item off the queue and attempt to process it, and all but one will
be rolled back with a serialization failure. Efficient management
of a queue in the database really requires blocking techniques to
actually serialize the accesses to the ends of the queue.

That said, it sounds like you are not using SSI for that, but
leaving the queue management to ActiveMQ, which presumably handles
this correctly, and your problems are with the access to the
documents based on the requests pulled from the queue.  Is that
correct?

> The schema is essentially this::
>
>   CREATE TABLE docs (
>     id VARCHAR(255) PRIMARY KEY,
>     contents TEXT NOT NULL
>   );
>
>   CREATE TABLE doc_attributes (
>     document_id VARCHAR(255) NOT NULL REFERENCES docs(id)
>                 ON DELETE CASCADE,
>     attribute_name VARCHAR(255) NOT NULL,
>     value VARCHAR(255) NOT NULL
>   );
>
>   CREATE INDEX idx_attribute_doc
>             ON doc_attributes(document_id);
>
>   CREATE INDEX idx_attribute_name_str
>             ON doc_attributes(attribute_name,value);
>
> The interesting part of the program works like this:
>
> * Figure out which documents to update::
>
>     BEGIN;
>     SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
>     SELECT id FROM docs WHERE ...;
>     COMMIT;
>
> * Update each of them in turn::
>
>     BEGIN;
>     SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
>     SELECT contents FROM docs WHERE id=?;
>     -- change the contents, in client code
>     UPDATE docs SET contents=? WHERE id=?;
>     DELETE FROM doc_attributes WHERE document_id=?;
>     INSERT INTO doc_attributes(document_id,attribute_name,value)
>           VALUES (?,?,?); -- for each attribute
>     COMMIT;

I'm afraid that one of the most interesting and pertinent bits of
the code is written as ellipses.  Can you show or describe what the
REPEATABLE READ transaction is doing in more detail?

>   If we receive a serialisation error, we retry the whole
>   transaction, applying the changes to the new version of the
>   document. Each retry takes about 0.1 seconds.

What percentage of transactions are rolled back?  What is the
processing time in such transactions as a percentage of the total
load?

> We have a few processes doing this in parallel, to keep up with
> the amount of messages that are sent. We have an average of 30
> rows in ``doc_attributes`` for each row in ``docs``. This is a
> typical situation::
>
>   SELECT pid, locktype,
>         COUNT(*)/COUNT(DISTINCT virtualtransaction) AS tl,
>         COUNT(*) AS total     
>   FROM pg_locks
>   WHERE mode LIKE 'SI%'
>   GROUP BY pid, locktype
>   ORDER BY pid, locktype;
>
>    pid  | locktype | tl  | total
>   ------+----------+-----+-------
>     445 | page     |   5 |  2706
>     445 | tuple    |   1 |   767
>     446 | page     |  14 |    28
>     446 | tuple    |  37 |    74
>     447 | page     |   1 |    19
>     448 | page     |   1 |    19
>     449 | page     |   5 |  2759
>     449 | tuple    |   1 |   758
>     454 | page     |  10 |  2209
>     454 | tuple    |  37 |  7663
>    1113 | page     |   5 |   604
>    1113 | tuple    |   4 |   531
>    1346 | page     |   6 |  1557
>    1346 | tuple    |   1 |   454
>         | page     | 174 |   174
>         | tuple    | 236 |   236
>   (16 rows)
>
> Due to the large number of predicate locks, we have
> ``max_pred_locks_per_transaction = 10000``, and ``max_connections
> = 300`` (this is probably going to be reduced, we don't need more
> than 100).
>
> Questions:
>
> - What are locks without a pid? I thought they were left over
>   from transactions of now-disconnected clients, waiting for all
>   overlapping transactions to complete, but the numbers don't
>   behave as I would expect in that case (i.e. they don't grow
>   when a client disconnects)

Those are predicate locks related to a transaction which has been
PREPARED (for two-phase commit) or committed, but which may still
be relevant because there are overlapping read-write transactions
which are still active.  One long-running SELECT statement, if it
is not declared to be in a read-only transaction, could cause a
large number of such locks to accumulate.

If you are not already doing so, make sure that any serializable
transaction which is not going to modify data is declared to be
READ ONLY at the start.  I have seen some shops set that to the
default and explicitly declare transactions to be READ WRITE when
that is needed.

> - Is the large number of page locks to be expected? How long
> should we expect them to stay? Some seem to stay around for
> minutes.

There is probably some database transaction (and that will count
individual statements not explicitly included in transactions)
which is running for minutes.

> - Can this be of any use to benchmarking / tuning the SSI logic?

As you probably noticed, the heuristics in
PredicateLockPromotionThreshold() are pretty simple.  We promote to
a page lock when a process acquires its third tuple lock on a page,
and we convert to a relation lock when a process uses half of
max_pred_locks_per_transaction on a single relation.  The thing
was, even those crude heuristics worked quite well in our testing,
as long as max_pred_locks_per_transaction and possibly
max_connections were adjusted.  (It is something of a hack that it
can be beneficial to increase max_connections if you get a lot of
page locks on a lot of tables, but something which rarely seems to
be seen in practice -- and you can always just go really high on
max_pred_locks_per_transaction instead.)
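To make those thresholds concrete, here is a toy model of the
promotion behavior (in Python, with made-up names; the real logic
lives in the server's C code around PredicateLockPromotionThreshold()
and differs in detail -- for instance, the relation-level promotion
counts all of a transaction's locks on the relation, not just pages,
and this model tracks a single transaction only):

```python
# Toy model of SSI predicate lock granularity promotion for one
# transaction: the third tuple lock on a page is promoted to a page
# lock, and locking half of max_pred_locks_per_transaction's worth of
# pages on one relation collapses them into a relation lock.

MAX_PRED_LOCKS_PER_XACT = 64  # PostgreSQL's default setting

class TxnPredicateLocks:
    def __init__(self):
        self.tuple_locks = {}     # (relation, page) -> tuple lock count
        self.page_locks = {}      # relation -> set of locked pages
        self.relation_locks = set()

    def lock_tuple(self, relation, page):
        """Record a tuple-level predicate lock; return the
        granularity actually held afterwards."""
        if relation in self.relation_locks:
            return "relation"     # already covered by a coarser lock
        if page in self.page_locks.get(relation, set()):
            return "page"
        n = self.tuple_locks.get((relation, page), 0) + 1
        if n >= 3:                # third tuple lock -> page lock
            self.tuple_locks.pop((relation, page))
            pages = self.page_locks.setdefault(relation, set())
            pages.add(page)
            # half the per-transaction budget on one relation
            # -> promote to a single relation lock
            if len(pages) >= MAX_PRED_LOCKS_PER_XACT // 2:
                del self.page_locks[relation]
                self.relation_locks.add(relation)
                return "relation"
            return "page"
        self.tuple_locks[(relation, page)] = n
        return "tuple"
```

Running this shows why narrow, densely-packed rows hit page
granularity quickly: any page touched three times is locked wholesale
from then on.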

We have no illusions that these are the best possible heuristics,
but they have worked so well for the workloads we have tried, that
there wasn't really a basis for testing an alternative.  If you
have a workload which you think would do better with something more
sophisticated, it would be great to have more details.  If you
wanted to benchmark your workload against a custom version with a
modified PredicateLockPromotionThreshold() function, that would be
fantastic.

There are many known ways in which SSI logic could be tuned.  I've
been meaning to itemize them on the TODO list, but will mention the
ones which come to mind here to get them "on the record".  If you
have an interest in working on, or sponsoring development of, any
of these -- that would be great.

 - A recent university study of various techniques for achieving
serializable transactions found the PostgreSQL implementation to
scale better than any others they tested, even with retries on
serialization failure, but at high concurrency they found about
half the CPU time going to these functions, which are O(N^2), and
should probably be reworked to scale better:
  CreatePredXact()
  ReleasePredXact()
  FirstPredXact()
  NextPredXact()
Perhaps a hash table instead of a doubly linked list is needed.

 - After the lightweight locking changes made in 9.2 to allow
scaling to more processes, the lightweight locking in SSI predicate
locking has become a bottleneck at higher concurrency.  This area
now needs tuning.  This may or may not be related to the prior
point.

 - btree locking does not go down to the "next-key" granularity
that most predicate locking systems do; its finest granularity is
page level.  We could reduce false positive serialization failures
by implementing next-key locking granularity.

 - Index types other than btree only support relation-level locks.
Finer granularity would reduce false positives involving queries
which select using other index types.

 - We don't distinguish between heap relation locks which need to
prohibit inserts (those caused by a table scan) and heap relation
locks which don't conflict with inserts (those caused by promotion
from finer granularity).  We would reduce false positives if we
did.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: SSI and predicate locks - a non-trivial use case

From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote:

>  - We don't distinguish between heap relation locks which need to
> prohibit inserts (those caused by a table scan) and heap relation
> locks which don't conflict with inserts (those caused by promotion
> from finer granularity).  We would reduce false positives if we
> did.

Correction: in the above point "prohibit" is misleading.

s/prohibit/cause read-write conflicts with/

A single read-write conflict does not cause blocking or transaction
rollback.


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: SSI and predicate locks - a non-trivial use case

From
Gianni Ceccarelli
Date:
On 2013-08-31 Kevin Grittner <kgrittn@ymail.com> wrote:
> Of course, the subject line gives me some pause -- I'm aware of many
> uses of SSI in non-trivial production environments, including
> multi-terabyte databases with thousands of concurrent users. In
> some cases the performance hit compared to REPEATABLE READ, which
> would allow anomalies, is about 1%.

The subject line was a product of my ignorance of such use
cases. Also, I'm not really talking about performance: I've seen it
working with no appreciable slow-down in updating documents, relative
to our previous usage (``select for update;update;commit``), which had
the distinct disadvantage of locking out all transactions that wanted
to touch the same documents. Performance-wise, it works brilliantly.

> Using SSI to manage a database queue is known to be a "worst case"
> workload.

Sorry, I was not clear enough. The queue is managed by ActiveMQ, each
concurrent consumer gets a different message and (usually) updates a
different set of documents.

> your problems are with the access to the documents based on the
> requests pulled from the queue.

It's not even much of a problem. I mean, it works and it works well.
I was looking for explanations about the details that I don't
understand, and I thought that providing a use case could help the
implementers tune some of the internal logic.

> I'm afraid that one of the most interesting and pertinent bits of
> the code is written as ellipses. Can you show or describe what the
> REPEATABLE READ transaction is doing in more detail?

It's really not much. Most of the time it's doing a ``SELECT id FROM
docs WHERE id IN (?,?,?,...)`` to check which documents to update
(missing ones may be created, or ignored, depending on the particular
message). Other times it's doing ``SELECT document_id FROM
doc_attributes WHERE attribute_name=? AND value=?`` to get a set of
ids of documents that have some attribute set to a particular value.

> > If we receive a serialisation error, we retry the whole
> > transaction, applying the changes to the new version of the
> > document. Each retry takes about 0.1 seconds.
>
> What percentage of transactions are rolled back? What is the
> processing time in such transactions as a percentage of the total
> load?

Depends on the message. Most times messages touch one or two
documents, and very rarely collide. Sometimes we get two messages that
touch the same 100 documents, and about a quarter of the 200 commits
will fail (200 because 2 consumers are updating 100 documents
each). The consumer processes spend most of their time waiting for
messages, and the rest inside a serialisable transaction. Under our
load tests, the consumers were force-fed messages, so they were
spending essentially all their time updating documents inside such
transactions.

> [Locks without PIDs] are predicate locks related to a transaction
> which has been PREPARED (for two-phase commit) or committed, but
> which may still be relevant because there are overlapping read-write
> transactions which are still active.  One long-running SELECT
> statement, if it is not declared to be in a read-only transaction,
> could cause a large number of such locks to accumulate.

So a long "read committed" transaction will cause locks from
"serialisable" transactions to accumulate? Good to know, I had not
realised that.

> If you are not already doing so, make sure that any serializable
> transaction which is not going to modify data is declared to be
> READ ONLY at the start.

All our serializable transactions modify data. Those that don't, don't
need to be isolated that much, so we declare them "read
committed". Should we declare them as "read committed, read only"?

> > - Is the large number of page locks to be expected?
>
> There is probably some database transaction (and that will count
> individual statements not explicitly included in transactions)
> which is running for minutes.

The slow transactions should only be "read committed". Or we may have
some bug in the code. I'll keep looking.

> As you probably noticed, the heuristics in
> PredicateLockPromotionThreshold() are pretty simple.

Yes, that's the main reason I decided to write to the list.

> and you can always just go really high on
> max_pred_locks_per_transaction instead.

Could it be useful to document that predicate locks are very small
(from the source, I'd say around 150 bytes), so that people don't get
scared to set max_pred_locks_per_transaction very high?

> If you have a workload which you think would do better with
> something more sophisticated, it would be great to have more
> details.

I'm no longer sure that our system is interesting, but I'll be glad to
provide as much detail as I can gather. What kind of details would be
useful?

> If you wanted to benchmark your workload against a custom version
> with a modified PredicateLockPromotionThreshold() function, that
> would be fantastic.

I don't know enough to write such a modified version, but I can run
it.

--
    Dakkar - <Mobilis in mobile>
    GPG public key fingerprint = A071 E618 DD2C 5901 9574
                                 6FE2 40EA 9883 7519 3F88
                        key id = 0x75193F88

Chicken Little was right.


Re: SSI and predicate locks - a non-trivial use case

From
Kevin Grittner
Date:
Gianni Ceccarelli <dakkar@thenautilus.net> wrote:
> On 2013-08-31 Kevin Grittner <kgrittn@ymail.com> wrote:

>> [Locks without PIDs] are predicate locks related to a
>> transaction which has been PREPARED (for two-phase commit) or
>> committed, but which may still be relevant because there are
>> overlapping read-write transactions which are still active.  One
>> long-running SELECT statement, if it is not declared to be in a
>> read-only transaction, could cause a large number of such locks
>> to accumulate.
>
> So a long "read committed" transaction will cause locks from
> "serialisable" transactions to accumulate? Good to know, I had
> not realised that.

I stated that poorly -- if I remember correctly, long-running
serializable read-write transactions should cause predicate locks
of committed overlapping serializable transactions to be retained;
transactions using other isolation levels, or which are read-only,
should not have this effect.  Predicate locks from a prepared
transaction, however, must be kept at least until commit of the
prepared transaction, at which point they must be kept until
completion of all serializable read-write transactions running at
the moment of commit.

There is one more special case of predicate locks without a pid,
although it seems rather unlikely to be in play here -- if a large
number of committed transactions exhausts the limit on predicate
locks, the locks will be summarized.  A summary predicate lock will
not only have a missing pid but a missing virtualtransaction in
pg_locks.  These should get cleaned up when the last of the
summarized transactions would have been.  Even if you are seeing
these, summarization should make the count of locks without a pid
lower, not higher -- unless there is a bug specific to cleanup of
summarized transactions.

If none of this explains the locks you are seeing without a pid, it
is possible there is an undiscovered bug in SSI.

>> If you are not already doing so, make sure that any serializable
>> transaction which is not going to modify data is declared to be
>> READ ONLY at the start.
>
> All our serializable transactions modify data. Those that don't,
> don't need to be isolated that much, so we declare them "read
> committed". Should we declare them as "read committed, read
> only"?

If they are not serializable, declaring them read-only should not
affect this issue.

This is a digression, but be sure to consider ways in which even a
read-only transaction might see anomalies.  You might not be
vulnerable to such problems, but I just wanted to point out the
possibility:

http://wiki.postgresql.org/wiki/SSI#Read_Only_Transactions

>>> - Is the large number of page locks to be expected?
>>
>> There is probably some database transaction (and that will count
>> individual statements not explicitly included in transactions)
>> which is running for minutes.
>
> The slow transactions should only be "read committed". Or we may
> have some bug in the code. I'll keep looking.

Unless I'm missing something, there is either a long running
serializable read-write transaction on your cluster or a bug in SSI
cleanup.  Be sure to consider transactions in other databases and
ad hoc queries.

>> As you probably noticed, the heuristics in
>> PredicateLockPromotionThreshold() are pretty simple.
>
> Yes, that's the main reason I decided to write to the list.
>
>> and you can always just go really high on
>> max_pred_locks_per_transaction instead.
>
> Could it be useful to document that predicate locks are very
> small (from the source, I'd say around 150 bytes), so that people
> don't get scared to set max_pred_locks_per_transaction very high?

Perhaps.  Increasing the default setting by a factor of 10 with the
default max_connections of 100 would reserve about 8.4 MB of
additional RAM for predicate locks. With heavy use of serializable
transactions, it is common to need to go that far or further, which
surprises many people; and I suspect there may be some trepidation
about the memory impact that the documentation could allay.  Do you
have suggested wording?
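As a back-of-envelope check of that figure, using Gianni's rough
150-bytes-per-lock estimate (the exact per-lock size varies by
version and platform, so treat this as an order-of-magnitude sketch):

```python
# Approximate shared memory reserved for the predicate lock table:
# max_pred_locks_per_transaction slots per allowed connection, at a
# rough 150 bytes each (an estimate read from the source, not an
# exact figure).

BYTES_PER_PRED_LOCK = 150

def pred_lock_memory(max_pred_locks_per_transaction, max_connections):
    """Approximate RAM reserved for the predicate lock table."""
    return (max_pred_locks_per_transaction
            * max_connections
            * BYTES_PER_PRED_LOCK)

default = pred_lock_memory(64, 100)    # default settings
tenfold = pred_lock_memory(640, 100)   # 10x the default
extra_mb = (tenfold - default) / (1024 * 1024)
# extra_mb comes out a little over 8 MiB, in the same ballpark as
# the ~8.4 MB mentioned above
```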

>> If you have a workload which you think would do better with
>> something more sophisticated, it would be great to have more
>> details.
>
> I'm no longer sure that our system is interesting, but I'll be
> glad to provide as much detail as I can gather. What kind of
> details would be useful?

When you see the high count of SIRead locks with null PID in
pg_locks, the complete output of pg_stat_activity and pg_locks
would be interesting.  You could attach those as a compressed
tarball, or if there is concern about posting that publicly you
could send it to me.  I think that might be enough to determine
whether there is a bug in predicate lock cleanup.

>> If you wanted to benchmark your workload against a custom
>> version with a modified PredicateLockPromotionThreshold()
>> function, that would be fantastic.
>
> I don't know enough to write such a modified version, but I can
> run it.

At this point I'm not sure how to tweak it to make anything better
for you.  If the rows are narrow and you are getting false positive
serialization failures because of the promotion of heap tuple locks
to page locks, it might be interesting to test a version of the
function which bases the number needed for promotion on the maximum
number of tuples which can fit on a page for that particular
relation.  Perhaps promote to a page lock when the page hits the
point where 50% of the maximum tuples for a page have been locked.
The question would be whether the performance gain from fewer
transaction retries would outweigh the cost of the more complex
calculation.
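Such a threshold function might look something like this (a sketch
only; the overhead constants below are approximations of the page
header, line pointer, and tuple header sizes, not the server's actual
accounting, and the server would derive the tuple width from the
relation's statistics):

```python
# Sketch of the alternative heuristic: instead of a fixed threshold
# of 3 tuple locks, promote to a page lock once 50% of the tuples
# that could fit on a page of this relation are locked.

BLOCK_SIZE = 8192      # PostgreSQL's default page size
PAGE_HEADER = 24       # approximate page header overhead
ITEM_POINTER = 4       # per-tuple line pointer
TUPLE_HEADER = 24      # approximate per-tuple header overhead

def max_tuples_per_page(avg_tuple_data_bytes):
    """Rough upper bound on tuples per page for a given row width."""
    per_tuple = ITEM_POINTER + TUPLE_HEADER + avg_tuple_data_bytes
    return (BLOCK_SIZE - PAGE_HEADER) // per_tuple

def promotion_threshold(avg_tuple_data_bytes):
    """Promote once half the tuples that fit on a page are locked,
    never promoting on fewer than 2 tuple locks."""
    return max(2, max_tuples_per_page(avg_tuple_data_bytes) // 2)
```

For a narrow row (say 32 bytes of data) this gives a threshold of
dozens of tuple locks per page rather than the current fixed 3, while
for rows nearly as wide as a page it degenerates to 2 -- which is
exactly the trade-off in question: fewer false positives for narrow
rows, at the cost of a more expensive calculation.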

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company