Rewrite sinval messaging to reduce contention

From: Tom Lane
In this thread back in January
http://archives.postgresql.org/pgsql-performance/2008-01/msg00001.php
we saw evidence that contention for SInvalLock was a problem,
and we (or at least I) concluded that the sinval messaging system
needed a nontrivial rewrite to get rid of its habit of sending everyone
to read the queue at the same time.  Here is the proposed rewrite.

The major changes are:

* When the queue overflows, we only issue a cache reset to the specific
backend or backends that still haven't read the oldest message, rather
than resetting everyone as in the original coding.  The reset-everyone
approach forced unnecessary cache-rebuilding work in backends that
weren't actually behind.

* When we observe backend(s) falling well behind, we signal SIGUSR1
to only one backend, the one that is furthest behind and doesn't
already have a signal outstanding for it.  When it finishes catching
up, it will "pass on" the SIGUSR1 to the next-furthest-back guy,
if there is one that is far enough behind to justify a signal.
This changes the former behavior where every backend in the system
might try to hit the queue at the same time into a "daisy chain"
where only one backend is catching up at a time; furthermore,
backends that aren't well behind aren't forced to catch up too.

* We don't attempt to clean out dead messages after every
message-receipt operation; rather, we do it on the insertion side,
and only when the queue fullness passes certain thresholds.
(As presented, the thresholds are 8/16, 9/16, 10/16, etc, but this
could be adjusted by changing a couple of #defines.)  This greatly
reduces the amount of contention for exclusive hold on SInvalLock.

* Readers attempt to pull multiple messages (as presented, up to 32)
out of the queue for each acquisition of SInvalLock.  Although they
take the lock in shared mode and thus aren't contending at the
LWLock level, I thought this was important to do to reduce contention
for the lock's spinlock --- the prevalence of s_lock() in the "before"
oprofile convinces me that there was just too much traffic through
the lock itself.

Note: PMSIGNAL_WAKEN_CHILDREN is now unused and should be removed,
but I didn't include that change in the patch.

The test case I'm using looks like this:

$ cat createdrop.sql
create temp table register_reqs(full_key integer, register_index integer) WITHOUT OIDS ON COMMIT DROP;
$ pgbench -n -f createdrop.sql -c 40 -t 1000 bench

On my machine CVS HEAD gets about 485 transactions/sec on this; with the
patch that improves to 700 tps.

oprofile says that the routines above 1% of CPU in CVS HEAD are

CPU: P4 / Xeon with 2 hyper-threads, speed 2792.99 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory)
count 100000
samples  %        image name               symbol name
363116   11.2433  postgres                 CatalogCacheIdInvalidate
316608    9.8032  postgres                 s_lock
305881    9.4711  postgres                 LWLockAcquire
272717    8.4442  postgres                 LWLockRelease
234812    7.2706  postgres                 CatalogCacheFlushRelation
215445    6.6709  postgres                 hash_search_with_hash_value
78544     2.4320  postgres                 _bt_compare
62007     1.9199  postgres                 SIGetDataEntries
48614     1.5052  postgres                 AllocSetAlloc
48033     1.4873  postgres                 XLogInsert
41280     1.2782  postgres                 PinBuffer
40659     1.2589  postgres                 LocalExecuteInvalidationMessage
36906     1.1427  postgres                 ResetCatalogCache
35708     1.1056  postgres                 hash_any
34151     1.0574  postgres                 PrepareToInvalidateCacheTuple

and with the patch

CPU: P4 / Xeon with 2 hyper-threads, speed 2792.99 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory)
count 100000
samples  %        image name               symbol name
550342   13.5424  postgres                 CatalogCacheIdInvalidate
362684    8.9246  postgres                 hash_search_with_hash_value
199291    4.9040  postgres                 LWLockAcquire
198277    4.8790  postgres                 CatalogCacheFlushRelation
149716    3.6841  postgres                 _bt_compare
114006    2.8054  postgres                 LWLockRelease
103138    2.5379  postgres                 XLogInsert
84471     2.0786  postgres                 PrepareToInvalidateCacheTuple
78464     1.9308  postgres                 LocalExecuteInvalidationMessage
75336     1.8538  postgres                 PinBuffer
74837     1.8415  postgres                 AllocSetAlloc
63764     1.5691  postgres                 SIGetDataEntries
60468     1.4879  postgres                 hash_any
51662     1.2713  libc-2.7.so              memcpy
43999     1.0827  libc-2.7.so              memcmp
43407     1.0681  postgres                 _bt_checkpage

so this is quite successful at eliminating spinlock-level contention
and does a reasonable job at reducing the amount of LWLock processing.

Comments?  Does anyone have more realistic test cases for this problem,
or want to fiddle with the tuning #defines in sinvaladt.c?

            regards, tom lane



Re: Rewrite sinval messaging to reduce contention

From: Tom Lane
I wrote:
> In this thread back in January
> http://archives.postgresql.org/pgsql-performance/2008-01/msg00001.php
> we saw evidence that contention for SInvalLock was a problem,
> and we (or at least I) concluded that the sinval messaging system
> needed a nontrivial rewrite to get rid of its habit of sending everyone
> to read the queue at the same time.  Here is the proposed rewrite.

After further thought, I came up with two more improvements:

* It turns out not to be too hard to apply the idea of moving multiple
messages per LWLock acquisition on the writing side too, so I did that.

* There is not actually any good reason why writers have to block
readers or vice versa, except in the infrequent case that SICleanupQueue
is needed.  Writers don't look at the ProcStates that the readers are
changing.  The only point of overlap is that a writer will change
maxMsgNum which readers want to look at --- but we are already assuming
in several places that reading or writing an int is atomic, and if we
assume that here too, it seems to work fine.  To implement this, I split
SInvalLock into SInvalReadLock and SInvalWriteLock.

This seems to be reaching a point of diminishing returns for the
particular test case I'm using --- the TPS rate only went up from 700
to 730 or so.  However, enabling LWLOCK_STATS shows that the contention
rate on the sinval locks is now completely negligible --- one block
per thousand acquisitions on SInvalWriteLock, and less than one in
10000 on SInvalReadLock.  The vast majority of the LWLock contention
now comes from WALInsertLock and the LockMgr locks:

    Lock            # acquisitions    # times blocked

    SInvalReadLock        6469840        380
    SInvalWriteLock        240567        163
    WALInsertLock        2388805        89142
    LockMgr partition locks    8253142        177715

So I think this patch accomplishes the goal of making sinval not be a
cause of contention.

Patch version 2 attached.

            regards, tom lane


Re: Rewrite sinval messaging to reduce contention

From: Tom Lane
I wrote:
> ... enabling LWLOCK_STATS shows that the contention
> rate on the sinval locks is now completely negligible --- one block
> per thousand acquisitions on SInvalWriteLock, and less than one in
> 10000 on SInvalReadLock.  The vast majority of the LWLock contention
> now comes from WALInsertLock and the LockMgr locks:

>     Lock            # acquisitions    # times blocked

>     SInvalReadLock        6469840        380
>     SInvalWriteLock        240567        163
>     WALInsertLock        2388805        89142
>     LockMgr partition locks    8253142        177715

For comparison's sake I rebuilt CVS HEAD with LWLOCK_STATS enabled
and repeated the same test.  I got

    SInvalLock        81090044    505750
    WALInsertLock        2382254        62747
    LockMgr locks        10657480    171799

The change in sinval numbers is gratifying, but for a while I didn't
believe these results because of the discrepancy in LockMgr acquisition
figures.  I think though that what we are seeing here is that CVS HEAD
has the reset-everyone-on-sinval-queue-overflow behavior, which results
in a whole lot of useless cache resets, which results in a lot of
unnecessary cache reloads, and every one of those requires taking
AccessShareLock on one or more system catalogs in order to suck the data
back in.  So the reduction in LockMgr traffic is explained by not doing
so many cache resets.

            regards, tom lane