Exposing the lock manager's WaitForLockers() to SQL - Mailing list pgsql-hackers

From Will Mortensen
Subject Exposing the lock manager's WaitForLockers() to SQL
Msg-id CAMpnoC5SBCh7Ue+wO17_T-SEXN4U0oFQdXnBh9H5pLM8vjbUDQ@mail.gmail.com
Responses Re: Exposing the lock manager's WaitForLockers() to SQL  (Marco Slot <marco.slot@gmail.com>)
List pgsql-hackers
Hi there,

We'd like to be able to call the lock manager's WaitForLockers() and
WaitForLockersMultiple() from SQL. Below I describe our use case, but
basically I'm wondering if this:

    1. Seems like a reasonable thing to do

    2. Would be of interest upstream

    3. Should be done with a new pg_foo() function (taking an
       oid?), or a whole new SQL command, or something else

If this sounds promising, we may be able to code this up and submit it.
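To make option 3 a bit more concrete, one possible shape for a function-based interface might look like the following. This is purely illustrative: the function name pg_wait_for_lockers(), its signature, and the mode spelling are all assumptions, not a settled proposal.

    -- hypothetical: wait for current holders of conflicting locks
    -- on one table, without taking any lock ourselves
    SELECT pg_wait_for_lockers('tbl'::regclass, 'SHARE');

    -- hypothetical multi-table variant mirroring WaitForLockersMultiple()
    SELECT pg_wait_for_lockers(ARRAY['tbl1', 'tbl2']::regclass[], 'SHARE');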

The rest of this email describes our use cases and related observations:


==== Use Case Background ====

Our use case is inspired by this blog post by Marco Slot (CC'ed) at
Citus Data: https://www.citusdata.com/blog/2018/06/14/scalable-incremental-data-aggregation/
. This describes a scheme for correctly aggregating rows given minimal
coordination with an arbitrary number of writers while keeping minimal
additional state. It relies on two simple facts:

    1. INSERT/UPDATE take their ROW EXCLUSIVE lock on the target
       table before evaluating any column DEFAULT expressions,
       and thus before e.g. calling nextval() on a sequence in
       the DEFAULT expression. And of course, this lock is only
       released when the transaction commits or rolls back.

    2. pg_sequence_last_value() (still undocumented!) can be
       used to obtain an instantaneous upper bound on the
       sequence values that have been returned by nextval(), even
       if the transaction that called nextval() hasn't yet
       committed.
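To illustrate fact 2 (some_seq here is just a placeholder name): as we understand it, the function returns the last value handed out by nextval() regardless of whether the calling transaction has committed, and NULL if nextval() has never been called on the sequence.

    -- instantaneous upper bound on values nextval() has handed out,
    -- including to still-uncommitted transactions
    SELECT pg_sequence_last_value('some_seq'::regclass);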

So, assume we have a table:

    create table tbl (
        id bigserial,
        data text
    );

which is only ever modified by INSERTs that use DEFAULT for id. Then,
a client can process each row exactly once using a loop like this
(excuse the pseudo-SQL):

    min_id := 0;
    while true:
        max_id := pg_sequence_last_value('tbl_id_seq');
        wait_for_writers('tbl'::regclass);
        SELECT
            some_aggregation(data)
            FROM tbl
            WHERE id > min_id AND id <= max_id;
        min_id := max_id;

In the blog post, the equivalent of wait_for_writers() is implemented
by taking and immediately releasing a SHARE ROW EXCLUSIVE lock on tbl.
It's unclear why this can't be SHARE, since it just needs to conflict
with INSERT's ROW EXCLUSIVE, but in any case it's sufficient for
correctness.

(Note that this version only works if the rows committed by the
transactions that it waited for are actually visible to the SELECT, so
for example, the whole thing can't be within a Repeatable Read or
Serializable transaction.)


==== Why WaitForLockers()? ====

No new writer can acquire a ROW EXCLUSIVE lock as long as we're
waiting to obtain the SHARE lock, even if we only hold it for an
instant. If we have to wait a long time, because some existing writer
holds its ROW EXCLUSIVE lock for a long time, this could noticeably
reduce overall writer throughput.

But we don't actually need to obtain a lock at all--and waiting for
transactions that already hold conflicting locks is exactly what
WaitForLockers() / WaitForLockersMultiple() does. Using it instead
would prevent any interference with writers.


==== Appendix: Extensions and Observations ====

Aside from downgrading to SHARE mode and merely waiting instead of
locking, we propose a couple other extensions and observations related
to Citus' scheme. These only tangentially motivate our need for
WaitForLockers(), so you may stop reading here unless the overall
scheme is of interest.

== Separate client for reading sequences and waiting ==

First, in our use case each batch of rows might require extensive
processing as part of a larger operation that doesn't want to block
waiting for writers to commit. A simple extension is to separate the
processing from the determination of sequence values. In other words,
have a single client that sits in a loop:

    while true:
        seq_val := pg_sequence_last_value('tbl_id_seq');
        WaitForLockers('tbl'::regclass, 'SHARE');
        publish(seq_val);

and any number of other clients that use the series of published
sequence values to do their own independent processing (maintaining
their own additional state).

This can be extended to multiple tables with WaitForLockersMultiple():

    while true:
        seq_val1 := pg_sequence_last_value('tbl1_id_seq');
        seq_val2 := pg_sequence_last_value('tbl2_id_seq');
        WaitForLockersMultiple(
            ARRAY['tbl1', 'tbl2']::regclass[], 'SHARE');
        publish('tbl1', seq_val1);
        publish('tbl2', seq_val2);

This is clearly more efficient than locking or waiting for the tables
in sequence, hence the desire for that function as well.

== Latency ==

This brings us to a series of observations about latency. If some
writers take a long time to commit, some already-committed rows might
not be processed for a long time. To avoid exacerbating this when
using WaitForLockersMultiple(), which obviously has to wait for the
last writer of any specified table, it should be used with groups of
tables that are generally written by the same transactions.

Also, while in Citus' example the aggregation needs to process each
row exactly once, latency can be reduced if a row may be processed
more than once and if rows can be processed out of order by sequence
value (id), by simply removing the "id <= max_id" term from the WHERE
clause in the reader. This particularly reduces latency if waiting and
processing are separated as described in the above section.
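Concretely, the lower-latency reader loop would look like this (same pseudo-SQL as before, with the upper-bound term dropped):

    min_id := 0;
    while true:
        max_id := pg_sequence_last_value('tbl_id_seq');
        wait_for_writers('tbl'::regclass);
        SELECT
            some_aggregation(data)
            FROM tbl
            WHERE id > min_id;
            -- no "id <= max_id": rows committed after max_id was read
            -- are picked up now, but may be processed again next loop
        min_id := max_id;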

== Updates and latency ==

In our application we have some use cases with a table like:

    create table tbl (
        id bigint primary key,
        data text,
        mod_idx bigserial
    );

where writers do:

    INSERT INTO tbl (id, data) VALUES (1, 'foo')
    ON CONFLICT (id) DO UPDATE
    SET data = excluded.data, mod_idx = DEFAULT;

and where the reader's job is to continuously replicate rows within a
fixed range of id's in an eventually-consistent fashion. Since the
writer always bumps mod_idx by setting it to DEFAULT, superficially it
seems we can use this scheme with mod_idx:

    min_mod_idx := 0;
    while true:
        max_mod_idx := pg_sequence_last_value('tbl_mod_idx_seq');
        WaitForLockers('tbl'::regclass, 'SHARE');
        SELECT
            do_replicate(id, data, mod_idx)
            FROM tbl
            WHERE
                id >= my_min_id     -- app-specific
                AND id < my_max_id  -- app-specific
                AND mod_idx > min_mod_idx
                AND mod_idx <= max_mod_idx
            ORDER BY mod_idx;
        min_mod_idx := max_mod_idx;

This version replicates all rows eventually (if writers stop),
replicates each version of a row at most once (allowing updates to be
skipped if obviated by a later committed update), and replicates
changes in order by mod_idx, which may make bookkeeping easier. But
continuous overlapping updates could keep some rows *perpetually* out
of the reader's reach, leading to *unbounded* latency. If the
application can instead tolerate potentially replicating the same
version of a row more than once, and replicating changes to different
rows out of order by mod_idx, latency can be minimized by removing
"mod_idx <= max_mod_idx" from the WHERE clause. (The ORDER BY should
likely also be removed, since later batches may contain rows with a
lower mod_idx.) The remainder of the scheme still ensures that all
rows are eventually replicated, and limits redundant replication while
keeping minimal state.

== Latency tradeoff and advantages ==

In conclusion, with this scheme there is a tradeoff between minimizing
latency and avoiding redundant processing, where (depending on the
scenario) the amount of latency or redundant processing is related to
the maximum amount of time that a writer transaction holds a ROW
EXCLUSIVE lock on the table. Therefore, this time should be minimized
wherever possible.

This tradeoff seems to be an inherent consequence of the minimalist
advantages of this scheme:

    1. If we use WaitForLockers(), no additional locks are taken,
       so there's no impact on concurrency of writers

    2. If WaitForLockers() is separated from readers, there's no
       impact on concurrency/waiting of readers

    3. Can be used to guarantee eventual consistency as desired

    4. Keeps O(1) state per table (per reader)--no tracking of
       individual writers or individual row updates

    5. Requires minimal cooperation from writers (just use DEFAULT
       expressions that use nextval())


