Thread: Experimenting with transactional memory for SERIALIZABLE
Hello hackers,

Here's a *highly* experimental patch set that tries to skip the LWLock protocol in predicate.c and use HTM[1] instead. HTM is itself a sort of hardware-level implementation of SSI for shared memory. My thinking was that if your workload already suits the optimistic nature of SSI, perhaps it could make sense to go all-in and remove the rather complicated pessimistic locking it's built on top of. It falls back to an LWLock-based path at compile time if you don't build with --enable-htm, or at runtime if a startup test discovers that your CPU doesn't have the Intel TSX instruction set (microarchitectures older than Skylake, and some mobile and low power variants of current ones), or if a hardware transaction is aborted for various reasons.

The good news is that it seems to produce correct results in simple tests (well, some lock-held-by-me assertions can fail in an --enable-cassert build; that's trivial to fix). The bad news is that it doesn't perform very well yet, and I think the reason for that is that there are some inherently serial parts of the current design that cause frequent conflicts. In particular, the FinishedSerializableTransactions list, protected by SerializableFinishedListLock, produces a stream of conflicts and falls back to the traditional behaviour, which involves long lock wait queues and thereby more HTM conflicts. I think we probably need a more concurrent way to release SSI transactions, entirely independent of this HTM experiment. There's another point of serialisation at snapshot acquisition time, which may be less avoidable; I don't know.

For much of the code that runs between snapshot acquisition and transaction release, we really only care about touching memory directly related to the SQL objects we touch in our SQL transaction, and the other SQL transactions that have also touched them. The question is whether it's possible to get to a situation where non-overlapping read/write sets at the SQL level don't cause conflicts at the memory level and everything goes faster, or whether the SSI algorithm is somehow inherently unsuitable for running on top of, erm, SSI-like technology. It seems like a potentially interesting research project.

Here's my one paragraph introduction to HTM programming: Using the wrapper macros from my 0001 patch, you call pg_htm_begin(), and if that returns true you're in a memory transaction and should eventually call pg_htm_commit() or pg_htm_abort(); if it returns false, your transaction has aborted and you need to fall back to some other strategy. (Retrying is also an option, but the reason codes are complicated, and progress is not guaranteed, so introductions to the topic often advise going straight to a fallback.) No other thread is allowed to see your changes to memory until you commit, and if you abort (explicitly, due to lack of cache for uncommitted changes, due to a serialisation conflict, or due to other internal details possibly known only to Intel), all queued changes to memory are abandoned and control returns at pg_htm_begin(), a bit like the way setjmp() returns non-locally when you call longjmp(). There are plenty of sources to read about this stuff in detail, but for a very gentle introduction I recommend Maurice Herlihy's two-part talk[2][3] (he invented this stuff at DEC in the early 90s), despite some strange claims he makes about database hackers. In theory this should work on POWER and future ARM systems too, with a bit more work, but I haven't looked into that.
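To make that pattern concrete, here is a minimal sketch using the pg_htm_begin()/pg_htm_commit() wrappers described above (so it assumes the 0001 patch is applied). The SharedThing struct, bump_counter() function and its fields are made up purely for illustration; the real patch of course wires this into predicate.c rather than a toy counter.

/*
 * Minimal sketch of the pg_htm_begin()/pg_htm_commit() usage pattern.
 * Assumes the 0001 patch; SharedThing/bump_counter are hypothetical.
 */
#include "postgres.h"
#include "storage/lwlock.h"

typedef struct SharedThing
{
	LWLock	   *lock;			/* protects counter on the fallback path */
	uint64		counter;
} SharedThing;

static void
bump_counter(SharedThing *thing)
{
	if (pg_htm_begin())
	{
		/*
		 * Inside a hardware transaction: no other backend sees this store
		 * until pg_htm_commit(); a conflict or capacity overflow abandons
		 * all queued stores and returns control at pg_htm_begin() with
		 * false, a bit like longjmp() returning to setjmp().
		 */
		thing->counter++;
		pg_htm_commit();
	}
	else
	{
		/* Transaction aborted (or HTM unavailable): fall back to the lock. */
		LWLockAcquire(thing->lock, LW_EXCLUSIVE);
		thing->counter++;
		LWLockRelease(thing->lock);
	}
}

One subtlety a sketch like this glosses over: for the two paths to exclude each other, the transactional path generally has to observe the fallback lock's state inside the transaction, so that a backend on the lock path aborts concurrent hardware transactions; how the patch arranges that isn't shown here.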
There are doubtless many other applications for this type of technology within PostgreSQL. Perhaps some more fruitful.

[1] https://en.wikipedia.org/wiki/Transactional_memory
[2] https://www.youtube.com/watch?v=S3Fx-7avfs4
[3] https://www.youtube.com/watch?v=94ieceVxSHs
> On Thu, Feb 20, 2020 at 04:55:12PM +1300, Thomas Munro wrote:
> Hello hackers,
>
> Here's a *highly* experimental patch set that tries to skip the LWLock
> protocol in predicate.c and use HTM[1] instead. HTM is itself a sort
> of hardware-level implementation of SSI for shared memory. My
> thinking was that if your workload already suits the optimistic nature
> of SSI, perhaps it could make sense to go all-in and remove the rather
> complicated pessimistic locking it's built on top of. It falls back
> to an LWLock-based path at compile time if you don't build with
> --enable-htm, or at runtime if a startup test discovers that your CPU
> doesn't have the Intel TSX instruction set (microarchitectures older
> than Skylake, and some mobile and low power variants of current ones),
> or if a hardware transaction is aborted for various reasons.

Thanks, that sounds cool!

> The good news is that it seems to produce correct results in simple
> tests (well, some lock-held-by-me assertions can fail in an
> --enable-cassert build; that's trivial to fix). The bad news is that
> it doesn't perform very well yet, and I think the reason for that is
> that there are some inherently serial parts of the current design that
> cause frequent conflicts.

Can you share some numbers about how poorly it performs and how many hardware transactions were aborted with a fallback? I'm curious because from this paper [1] I've got the impression that the bigger (in terms of memory) and longer a transaction is, the higher the chances of it getting aborted. So I wonder whether that needs to be taken into account, or whether using it for SSI as presented in the patch somehow implicitly minimizes those chances? Otherwise not only conflicting transactions will cause a fallback, but also those that e.g. span too much memory.

Another question that interests me is how much this is affected by the TAA vulnerability [2], and what the prospects of this approach are, given that many suggest disabling TSX because of it (there are mitigations of course, but if I understand correctly, e.g. on Linux the mitigation is similar to MDS, where CPU buffers are flushed on entering kernel space, but in between speculative accesses could still be performed).

[1]: https://db.in.tum.de/~leis/papers/HTM.pdf
[2]: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html
On Thu, Feb 20, 2020 at 11:38 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Can you share some numbers about how poorly it performs and how many
> hardware transactions were aborted with a fallback? I'm curious because
> from this paper [1] I've got the impression that the bigger (in terms
> of memory) and longer a transaction is, the higher the chances of it
> getting aborted. So I wonder whether that needs to be taken into
> account, or whether using it for SSI as presented in the patch somehow
> implicitly minimizes those chances? Otherwise not only conflicting
> transactions will cause a fallback, but also those that e.g. span too
> much memory.

Good questions, and I don't have good enough numbers to share right now; to be clear, the stage this work is at is: "wow, I think this new alien technology might actually be producing the right answers at least some of the time, now maybe we could start to think about analysing its behaviour some more", and I wanted to share early and see if anyone else was interested in the topic too :-) Thanks for that paper, added to my reading list.

The HTM transactions are not tied to the size of database transactions, which would certainly be too large; they're just used for lower level operations that need to be atomic and serializable, replacing a bunch of LWLocks. I see from skimming the final paragraph of that paper that they're also not mapping database transactions directly to HTM. So, the amount of memory you touch depends on the current size of various lists in SSI's internal bookkeeping, and I haven't done the work to figure out at which point space runs out (_XABORT_CAPACITY) in any test workloads etc, or to consider whether some operations that I covered with one HTM transaction could be safely broken up into multiple transactions to minimise transaction size, though I am aware of at least one opportunity like that.

> Another question that interests me is how much this is affected by the
> TAA vulnerability [2], and what the prospects of this approach are,
> given that many suggest disabling TSX because of it (there are
> mitigations of course, but if I understand correctly, e.g. on Linux the
> mitigation is similar to MDS, where CPU buffers are flushed on entering
> kernel space, but in between speculative accesses could still be
> performed).

Yeah, the rollout of TSX has been a wild ride since the beginning. I didn't want to comment on that aspect because I just don't know enough about it and at this point it's frankly pretty confusing. As far as I know from limited reading, as of late last year a few well known OSes are offering easy ways to disable TSX due to Zombieload v2 if you would like to, but not doing so by default. I tested with the Debian intel-microcode package version 3.20191115.2~deb10u1 installed, which I understand to be the latest and greatest, made no relevant modifications, and the instructions were available. I haven't read anywhere that TSX itself is ending. Other ISAs have comparable technology[1][2], and the concept has been worked on for over 20 years, so... I just don't know.

[1] https://developer.arm.com/docs/101028/0008/transactional-memory-extension-tme-intrinsics
[2] https://www.ibm.com/developerworks/aix/library/au-aix-ibm-xl-compiler-built-in-functions/index.html
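For anyone who wants to collect the abort-reason breakdown discussed above, here is a small standalone sketch (not part of the patch) that classifies aborts with the raw Intel RTM intrinsics from <immintrin.h>. It assumes GCC or Clang built with -mrtm and a CPU with TSX enabled; the counters, the do_work_with_stats() function and the toy workload are placeholders for whatever you actually want to measure.

/*
 * Sketch: count hardware transaction aborts by reason.
 * Build with: cc -O2 -mrtm htm_stats.c
 * Run only on a CPU with TSX/RTM enabled.
 */
#include <immintrin.h>
#include <stdio.h>

static long aborts_capacity;	/* read/write set exceeded cache capacity */
static long aborts_conflict;	/* another thread touched the same lines */
static long aborts_other;		/* explicit abort, interrupt, etc. */
static long fallbacks;
static volatile long shared_counter;

static void
do_work_with_stats(void)
{
	unsigned int status = _xbegin();

	if (status == _XBEGIN_STARTED)
	{
		shared_counter++;		/* transactional work goes here */
		_xend();
		return;
	}

	/* Transaction aborted: classify the reason, then take a fallback path. */
	if (status & _XABORT_CAPACITY)
		aborts_capacity++;
	else if (status & _XABORT_CONFLICT)
		aborts_conflict++;
	else
		aborts_other++;

	fallbacks++;
	/* A real program would acquire its ordinary lock here. */
	shared_counter++;
}

int
main(void)
{
	for (int i = 0; i < 1000000; i++)
		do_work_with_stats();
	printf("capacity=%ld conflict=%ld other=%ld fallbacks=%ld\n",
		   aborts_capacity, aborts_conflict, aborts_other, fallbacks);
	return 0;
}

Exposing counters like these per abort reason would be one way to see how much of the fallback traffic is capacity-driven versus conflict-driven in the SSI workloads being discussed.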