From: Robert Haas
Subject: FlexLocks
Msg-id: CA+Tgmob4YE_k5dpO0T07PNf1SOKPybo+wj4m4FryOS7Z4_yOzg@mail.gmail.com
List: pgsql-hackers
I've been noodling around with various methods of reducing
ProcArrayLock contention and (after many false starts) I think I've
finally found one that works really well.  I apologize in advance if
this makes your head explode; I think that the design I have here is
solid, but it represents a significant and invasive overhaul of the
LWLock system - I think for the better, but you'll have to be the
judge.  I'll start with the performance numbers (from that good ol'
32-core Nate Boley system), where I built from commit
f1585362856d4da17113ba2e4ba46cf83cba0cf2, with and without the
attached patches, and then ran pgbench on logged and unlogged tables
with various numbers of clients, with shared_buffers = 8GB,
maintenance_work_mem = 1GB, synchronous_commit = off,
checkpoint_segments = 300, checkpoint_timeout = 15min,
checkpoint_completion_target = 0.9, wal_writer_delay = 20ms.  The
numbers below are (as usual) the median of three five-minute runs at
scale factor 100.  The lines starting with "m" and a number are that
number of clients on unpatched master, and the lines starting with "f"
are that number of clients with this patch set.

The really big win here is unlogged tables at 32 clients, where
throughput has *doubled* and now scales *better than linearly* as
compared with the single-client results.

== Unlogged Tables ==
m01 tps = 679.737639 (including connections establishing)
f01 tps = 668.275270 (including connections establishing)
m08 tps = 4771.757193 (including connections establishing)
f08 tps = 4867.520049 (including connections establishing)
m32 tps = 10736.232426 (including connections establishing)
f32 tps = 21303.295441 (including connections establishing)
m80 tps = 7829.989887 (including connections establishing)
f80 tps = 19835.231438 (including connections establishing)

== Permanent Tables ==
m01 tps = 634.424125 (including connections establishing)
f01 tps = 633.450405 (including connections establishing)
m08 tps = 4544.781551 (including connections establishing)
f08 tps = 4556.298219 (including connections establishing)
m32 tps = 9902.844302 (including connections establishing)
f32 tps = 11028.745881 (including connections establishing)
m80 tps = 7467.437442 (including connections establishing)
f80 tps = 11909.738232 (including connections establishing)

A couple of other interesting things buried in these numbers:

1. Permanent tables don't derive nearly as much benefit as unlogged
tables.  I believe that this is because, for permanent tables, the
major bottleneck is WALInsertLock.  Fixing ProcArrayLock squeezes out
a healthy 10%, but we'll have to make significant progress on
WALInsertLock to get anywhere close to linear scaling.
2. The drop-off between 32 clients and 80 clients is greatly reduced
with this patch set; indeed, for permanent tables, tps increased
slightly between 32 and 80 clients.  I believe the small decrease for
unlogged tables is likely due to the fact that by 80 clients,
WALInsertLock starts to become a contention point, due to the need to
insert the commit records.

In terms of the actual patches, it's been bugging me for a while that
the LWLock code contains a lot of infrastructure that's not easily
reusable by other parts of the system.  So the first of the two
attached patches, flexlock-v1.patch, separates the LWLock code into an
upper layer and a lower layer.  The lower layer I called "FlexLocks",
and it's designed to allow a variety of locking implementations to be
built on top of it and reuse as much of the basic infrastructure as I
could figure out how to make reusable without hurting performance too
much.  LWLocks become the anchor client of the FlexLock system; in
essence, most of flexlock.c is code that was removed from lwlock.c.
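
To give a rough idea of the shape of this, the layering looks more or
less like the sketch below.  The field names are illustrative rather
than lifted from the patch, though slock_t and PGPROC are the usual
PostgreSQL types:

#include "storage/proc.h"       /* PGPROC */
#include "storage/s_lock.h"     /* slock_t */

/* Common state managed by flexlock.c: the spinlock, the wait queue,
 * and what's needed for error handling at abort time. */
typedef struct FlexLock
{
    char        locktype;       /* which lock implementation this is */
    slock_t     mutex;          /* protects all fields below */
    bool        releaseOK;      /* ok to wake up waiters? */
    PGPROC     *head;           /* head of the wait queue */
    PGPROC     *tail;           /* tail of the wait queue */
} FlexLock;

/* An LWLock is then just a FlexLock plus the shared/exclusive counts;
 * other lock types embed a FlexLock the same way and add their own
 * state after it. */
typedef struct LWLock
{
    FlexLock    flex;           /* common infrastructure */
    char        exclusive;      /* # of exclusive holders (0 or 1) */
    int         shared;         /* # of shared holders */
} LWLock;
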
The second patch, procarraylock-v1.patch, uses that infrastructure to define
a new type of FlexLock specifically for ProcArrayLock.  It basically
works like a regular LWLock, except that it has a special operation to
optimize ProcArrayEndTransaction().  In the uncontended case, instead
of acquiring and releasing the lock, it just grabs the spinlock,
observes that there is no contention, clears the critical PGPROC
fields (which isn't noticeably slower than updating the state of the
lock would be), and releases the spinlock.  There's then no need to
reacquire the spinlock to "release" the lock; we're done.  In the
contended case,
the backend wishing to end adds itself to a queue of ending
transactions.  When ProcArrayLock is released, the last person out
clears the PGPROC structures for all the waiters and wakes them all
up; they don't need to reacquire the lock, because the work they
wished to perform while holding it is already done.  Thus, in the
*worst* case, ending transactions only need to acquire the spinlock
protecting ProcArrayLock half as often (once instead of twice), and in
the best case (under heavy contention, where backends previously kept
waking up only to repeatedly fail to get the lock) it's far better
than that.
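
In code form, the fast path works about like the sketch below.
ProcArrayLockStruct, nHolders, and the two queue helpers are
stand-ins rather than the patch's actual names, and it builds on the
FlexLock struct sketched earlier; SpinLockAcquire/SpinLockRelease and
InvalidTransactionId are the real primitives:

#include "access/transam.h"     /* InvalidTransactionId */
#include "storage/spin.h"       /* SpinLockAcquire/SpinLockRelease */

typedef struct ProcArrayLockStruct
{
    FlexLock    flex;           /* common FlexLock infrastructure */
    int         nHolders;       /* illustrative: # of current holders */
    PGPROC     *endingHead;     /* queue of ending transactions */
} ProcArrayLockStruct;

/* hypothetical helpers, declarations only */
static void push_ending_transaction(ProcArrayLockStruct *lock,
                                    PGPROC *proc);
static void wait_for_wakeup(PGPROC *proc);

static void
ProcArrayLockClearTransaction(ProcArrayLockStruct *lock, PGPROC *proc)
{
    SpinLockAcquire(&lock->flex.mutex);

    if (lock->nHolders == 0)
    {
        /* Uncontended: do the work while we already hold the
         * spinlock, then drop it; we never take the lock proper,
         * so there's nothing to release afterward. */
        proc->xid = InvalidTransactionId;
        proc->xmin = InvalidTransactionId;
        SpinLockRelease(&lock->flex.mutex);
        return;
    }

    /* Contended: queue ourselves as an ending transaction.  The last
     * lock holder out clears our PGPROC fields and wakes us, so we
     * never touch the spinlock a second time. */
    push_ending_transaction(lock, proc);
    SpinLockRelease(&lock->flex.mutex);
    wait_for_wakeup(proc);
}
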

Of course, there are ways that this could be implemented without the
FlexLock stuff, if people don't like this solution.  Myself, I find it
quite elegant (though there are certainly arguable points in there
where the code could probably be improved), but then again, I wrote
it.  For what it's worth, I believe that there are other places where
the FlexLock infrastructure could be helpful.  In this case, the new
ProcArrayLock is very specific to what ProcArrayLock actually does,
and can't be really reused for anything else.  But I've had a thought
that we might want to have a type of FlexLock that contains an LSN.
The lock holder advances the LSN and can then release everyone who was
waiting for a value <= that LSN without them needing to reacquire the
lock (sketched below).  This could be useful for things like
WALWriteLock and sync
rep.  Also, I think there might be interesting applications for buffer
locks, perhaps by having a lock type that manages both content locks
and pins.  Alternatively, if we want to support CRCs, it might be
useful to have a third buffer lock mode in between shared and
exclusive.  This "SX" mode would conflict with itself and with
exclusive but not with shared, and would be required in order to write
out the page or set hint bits, but not merely to examine tuples; this
could be used to ensure that
the page doesn't change (thus invalidating the CRC) while the write is
in progress.  I'm not necessarily saying that any of these particular
things are what we want to do, just throwing out the idea that we may
want a variety of lock types that are similar to lightweight locks but
with subtly different behavior, yet with common infrastructure for
error handling and wait queue management.
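
To make the LSN idea a bit more concrete, here's a rough sketch,
building on the structs above.  It assumes each waiting backend
advertises its target in a hypothetical waitLSN field in PGPROC and
that the wait queue is kept sorted by that value; XLByteLE,
PGSemaphoreUnlock, and the lwWaitLink/lwWaiting queue fields are the
existing primitives:

#include "access/xlogdefs.h"    /* XLogRecPtr, XLByteLE */
#include "storage/pg_sema.h"    /* PGSemaphoreUnlock */

typedef struct LSNLock
{
    FlexLock    flex;           /* common FlexLock infrastructure */
    XLogRecPtr  currentLSN;     /* the value waiters are watching */
} LSNLock;

/* Advance the LSN and release every waiter whose target has been
 * reached, without any of them reacquiring the lock.  (A real
 * implementation would also maintain flex.tail.) */
static void
LSNLockAdvance(LSNLock *lock, XLogRecPtr newLSN)
{
    PGPROC     *proc;
    PGPROC     *next;

    SpinLockAcquire(&lock->flex.mutex);
    lock->currentLSN = newLSN;

    proc = lock->flex.head;
    while (proc != NULL && XLByteLE(proc->waitLSN, newLSN))
    {
        /* This backend's condition is now satisfied, so just wake
         * it; it has no reason to touch the lock again. */
        next = proc->lwWaitLink;
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
        proc = next;
    }
    lock->flex.head = proc;

    SpinLockRelease(&lock->flex.mutex);
}

For sync rep, for instance, the WAL flush (or standby apply) position
could be advanced this way, waking every backend whose commit LSN has
been reached in a single pass.
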

Anyway, this is all up for discussion, argument, etc. - but here are
the patches.  Comments, ideas, thoughts, code review, and/or testing
are appreciated.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
