Exposing the lock manager's WaitForLockers() to SQL - Mailing list pgsql-hackers
From: Will Mortensen
Subject: Exposing the lock manager's WaitForLockers() to SQL
Date:
Msg-id: CAMpnoC5SBCh7Ue+wO17_T-SEXN4U0oFQdXnBh9H5pLM8vjbUDQ@mail.gmail.com
Responses:
  Re: Exposing the lock manager's WaitForLockers() to SQL
  Re: Exposing the lock manager's WaitForLockers() to SQL
List: pgsql-hackers
Hi there,

We'd like to be able to call the lock manager's WaitForLockers() and
WaitForLockersMultiple() from SQL. Below I describe our use case, but
basically I'm wondering if this:

1. Seems like a reasonable thing to do
2. Would be of interest upstream
3. Should be done with a new pg_foo() function (taking an oid?), a whole
   new SQL command, or something else

If this sounds promising, we may be able to code this up and submit it.
The rest of this email describes our use cases and related observations:

==== Use Case Background ====

Our use case is inspired by this blog post by Marco Slot (CC'ed) at
Citus Data:
https://www.citusdata.com/blog/2018/06/14/scalable-incremental-data-aggregation/

It describes a scheme for correctly aggregating rows, given minimal
coordination with an arbitrary number of writers, while keeping minimal
additional state. It relies on two simple facts:

1. INSERT/UPDATE take their ROW EXCLUSIVE lock on the target table
   before evaluating any column DEFAULT expressions, and thus before
   e.g. calling nextval() on a sequence in the DEFAULT expression. And
   of course, this lock is only released when the transaction commits
   or rolls back.

2. pg_sequence_last_value() (still undocumented!) can be used to obtain
   an instantaneous upper bound on the sequence values that have been
   returned by nextval(), even if the transaction that called nextval()
   hasn't yet committed.

So, assume we have a table:

    create table tbl (
        id bigserial,
        data text
    );

which is only ever modified by INSERTs that use DEFAULT for id. Then a
client can process each row exactly once using a loop like this (excuse
the pseudo-SQL):

    min_id := 0;
    while true:
        max_id := pg_sequence_last_value('tbl_id_seq');
        wait_for_writers('tbl'::regclass);
        SELECT some_aggregation(data)
        FROM tbl
        WHERE id > min_id AND id <= max_id;
        min_id := max_id;

In the blog post, the equivalent of wait_for_writers() is implemented by
taking and immediately releasing a SHARE ROW EXCLUSIVE lock on tbl.
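As a sanity check on the exactly-once claim, here is a small Python
simulation (an illustration only: the sequence, the table, and the lock
waiting are modeled in memory, not real Postgres). Writers claim a
sequence value while their "transaction" is still open, and the reader
snapshots the sequence, waits for all in-flight writers, and then
processes the half-open id range:

```python
import random
import threading
import time

# In-memory model: `seq` is the sequence's last_value, `committed` holds
# the visible rows of tbl, and `inflight` holds ids claimed by nextval()
# inside still-open transactions (writers still holding ROW EXCLUSIVE).
state_lock = threading.Lock()
seq = 0
committed = []
inflight = set()
N_WRITERS = 50

def nextval():
    global seq
    with state_lock:
        seq += 1
        inflight.add(seq)
        return seq

def writer():
    rid = nextval()                     # DEFAULT runs under the table lock
    time.sleep(random.random() * 0.01)  # transaction stays open for a while
    with state_lock:
        committed.append(rid)           # commit: row visible, lock released
        inflight.discard(rid)

def wait_for_writers():
    # Models WaitForLockers(): wait until every transaction that has
    # already called nextval() has committed, without taking any lock.
    while True:
        with state_lock:
            if not inflight:
                return
        time.sleep(0.001)

threads = [threading.Thread(target=writer) for _ in range(N_WRITERS)]
for t in threads:
    t.start()

processed = []
min_id = 0
while len(processed) < N_WRITERS:
    with state_lock:
        max_id = seq                    # pg_sequence_last_value()
    wait_for_writers()
    with state_lock:
        batch = [r for r in committed if min_id < r <= max_id]
    processed.extend(batch)
    min_id = max_id
    time.sleep(0.001)

for t in threads:
    t.join()
```

Because the snapshot is taken before the wait, every row with id <=
max_id is committed and visible by the time the batch is read, and the
disjoint (min_id, max_id] ranges ensure no row is processed twice.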
It's unclear why this can't be SHARE, since it just needs to conflict
with INSERT's ROW EXCLUSIVE, but in any case it's sufficient for
correctness. (Note that this version only works if the rows committed by
the transactions that it waited for are actually visible to the SELECT,
so for example, the whole thing can't be within a Repeatable Read or
Serializable transaction.)

==== Why WaitForLockers()? ====

No new writer can acquire a ROW EXCLUSIVE lock as long as we're waiting
to obtain the SHARE lock, even if we only hold it for an instant. If we
have to wait a long time, because some existing writer holds its ROW
EXCLUSIVE lock for a long time, this could noticeably reduce overall
writer throughput. But we don't actually need to obtain a lock at
all--and waiting for transactions that already hold conflicting locks is
exactly what WaitForLockers() / WaitForLockersMultiple() does. Using it
instead would prevent any interference with writers.

==== Appendix: Extensions and Observations ====

Aside from downgrading to SHARE mode and merely waiting instead of
locking, we propose a couple of other extensions and observations
related to Citus's scheme. These only tangentially motivate our need for
WaitForLockers(), so you may stop reading here unless the overall scheme
is of interest.

== Separate client for reading sequences and waiting ==

First, in our use case each batch of rows might require extensive
processing as part of a larger operation that doesn't want to block
waiting for writers to commit. A simple extension is to separate the
processing from the determination of sequence values. In other words,
have a single client that sits in a loop:

    while true:
        seq_val := pg_sequence_last_value('tbl_id_seq');
        WaitForLockers('tbl'::regclass, 'SHARE');
        publish(seq_val);

and any number of other clients that use the series of published
sequence values to do their own independent processing (maintaining
their own additional state).
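The publish/consume split could be sketched like this (a hypothetical
model: publish() is stood in for by an in-process queue, where a real
deployment might use a table, NOTIFY, or an external channel):

```python
import queue
import threading

# Watermark channel standing in for publish(seq_val).
watermarks = queue.Queue()

def publisher(snapshots):
    # One watermark per iteration of the waiter loop; each value is a
    # pg_sequence_last_value() reading taken before WaitForLockers()
    # returned, so all rows with id <= watermark are committed.
    for s in snapshots:
        watermarks.put(s)
    watermarks.put(None)  # shut down

def consumer(results):
    # Independent reader: keeps only O(1) state (min_id) and processes
    # the half-open id range (min_id, max_id] for each watermark.
    min_id = 0
    while (max_id := watermarks.get()) is not None:
        results.append((min_id, max_id))  # process rows in this range
        min_id = max_id

results = []
t = threading.Thread(target=publisher, args=([10, 25, 40],))
t.start()
consumer(results)
t.join()
```

Each consumer tracks its own min_id, so the single waiter never blocks
any reader's processing, and consecutive batches cover disjoint,
contiguous id ranges.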
This can be extended to multiple tables with WaitForLockersMultiple():

    while true:
        seq_val1 := pg_sequence_last_value('tbl1_id_seq');
        seq_val2 := pg_sequence_last_value('tbl2_id_seq');
        WaitForLockersMultiple(ARRAY['tbl1', 'tbl2']::regclass[], 'SHARE');
        publish('tbl1', seq_val1);
        publish('tbl2', seq_val2);

which is clearly more efficient than locking or waiting for the tables
in sequence--hence the desire for that function as well.

== Latency ==

This brings us to a series of observations about latency. If some
writers take a long time to commit, some already-committed rows might
not be processed for a long time. To avoid exacerbating this when using
WaitForLockersMultiple(), which obviously has to wait for the last
writer of any specified table, it should be used with groups of tables
that are generally written by the same transactions.

Also, while in Citus's example the aggregation needs to process each row
exactly once, latency can be reduced if a row may be processed more than
once and rows can be processed out of order by sequence value (id), by
simply removing the "id <= max_id" term from the WHERE clause in the
reader. This particularly reduces latency if waiting and processing are
separated as described in the previous section.

== Updates and latency ==

In our application we have some use cases with a table like:

    create table tbl (
        id bigint primary key,
        data text,
        mod_idx bigserial
    );

where writers do:

    INSERT INTO tbl (id, data) VALUES (1, 'foo')
    ON CONFLICT (id) DO UPDATE
    SET data = excluded.data, mod_idx = DEFAULT;

and where the reader's job is to continuously replicate rows within a
fixed range of ids in an eventually-consistent fashion.
Since the writer always bumps mod_idx by setting it to DEFAULT,
superficially it seems we can use this scheme with mod_idx:

    min_mod_idx := 0;
    while true:
        max_mod_idx := pg_sequence_last_value('tbl_mod_idx_seq');
        WaitForLockers('tbl'::regclass, 'SHARE');
        SELECT do_replicate(id, data, mod_idx)
        FROM tbl
        WHERE id >= my_min_id       -- app-specific
          AND id < my_max_id        -- app-specific
          AND mod_idx > min_mod_idx
          AND mod_idx <= max_mod_idx
        ORDER BY mod_idx;
        min_mod_idx := max_mod_idx;

This version replicates all rows eventually (if writers stop),
replicates each version of a row at most once (allowing updates to be
skipped if obviated by a later committed update), and replicates changes
in order by mod_idx, which may make bookkeeping easier. But continuous
overlapping updates could keep some rows *perpetually* out of the
reader's reach, leading to *unbounded* latency.

If the application can instead tolerate potentially replicating the same
version of a row more than once, and replicating changes to different
rows out of order by mod_idx, latency can be minimized by removing
"mod_idx <= max_mod_idx" from the WHERE clause. (The ORDER BY should
likely also be removed, since later batches may contain rows with a
lower mod_idx.) The remainder of the scheme still ensures that all rows
are eventually replicated, and limits redundant replication while
keeping minimal state.

== Latency tradeoff and advantages ==

In conclusion, with this scheme there is a tradeoff between minimizing
latency and avoiding redundant processing, where (depending on the
scenario) the amount of latency or redundant processing is related to
the maximum amount of time that a writer transaction holds a ROW
EXCLUSIVE lock on the table. Therefore, this time should be minimized
wherever possible. This tradeoff seems to be an inherent consequence of
the minimalist advantages of this scheme:

1. If we use WaitForLockers(), no additional locks are taken, so there's
   no impact on concurrency of writers
2. If WaitForLockers() is separated from readers, there's no impact on
   concurrency/waiting of readers
3. Can be used to guarantee eventual consistency as desired
4. Keeps O(1) state per table (per reader)--no tracking of individual
   writers or individual row updates
5. Requires minimal cooperation from writers (just use DEFAULT
   expressions that use nextval())
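For illustration, the mod_idx variant from the "Updates and latency"
section can be simulated in a few lines of Python (an in-memory model
only: the table, sequence, and quiesced-writer wait are all stand-ins):

```python
# tbl maps id -> (data, mod_idx); upserts bump mod_idx from a shared
# counter, like "mod_idx = DEFAULT" drawing from tbl_mod_idx_seq.
tbl = {}
mod_seq = 0
replicated = []   # every (id, data, mod_idx) version the reader has seen
min_mod_idx = 0

def upsert(rid, data):
    # INSERT ... ON CONFLICT (id) DO UPDATE SET data = ..., mod_idx = DEFAULT
    global mod_seq
    mod_seq += 1
    tbl[rid] = (data, mod_seq)

def replicate_batch():
    # One iteration of the reader loop. Writers are assumed quiescent
    # here, standing in for pg_sequence_last_value() + WaitForLockers().
    global min_mod_idx
    max_mod_idx = mod_seq
    batch = sorted(((rid, data, mi) for rid, (data, mi) in tbl.items()
                    if min_mod_idx < mi <= max_mod_idx),
                   key=lambda r: r[2])   # ORDER BY mod_idx
    replicated.extend(batch)
    min_mod_idx = max_mod_idx

upsert(1, 'foo')
upsert(2, 'bar')
replicate_batch()   # picks up both initial versions
upsert(1, 'foo2')   # same id, higher mod_idx
replicate_batch()   # picks up only the updated row
```

Each committed version lands in `replicated` at most once and in
mod_idx order; with overlapping concurrent updates (not modeled here), a
row's latest version could keep exceeding max_mod_idx and stay out of
reach, which is the unbounded-latency hazard described above.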