From: Ryan Johnson
Subject: SSI slows down over time
Msg-id: 534BDB66.9020000@cs.utoronto.ca
List: pgsql-performance

Hi all,

(Referred here from pgsql-performance)

tl;dr: every time I shut down a database and bring it back up, SSI seems
to go slower. In order to avoid thousands of SSI aborts due to running
out of shared memory, I've had to set max_pred_locks_per_transaction to
several thousand (2000 is tolerable, 8000 required to avoid all errors);
this
seems excessively high considering how short TPC-C transactions are, and
how aggressively SSI reclaims storage. I've also found what appears to
be a bug, where the SSI SLRU storage (pg_serial) sometimes jumps from
~200kB to 8GB within a matter of seconds. The 8GB persists through later
runs and seems to be part of the performance problem; deleting the
pg_serial directory after each database shutdown seems to resolve most
of the problem.
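
Concretely, the workaround amounts to nothing more than this between
runs (with $PGDATA standing in for the actual data directory):

  pg_ctl -D $PGDATA stop
  rm -f $PGDATA/pg_serial/*
  pg_ctl -D $PGDATA start

The SLRU segments are recreated on demand once the server is back up.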

Excruciatingly long and detailed version follows...

This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure
--prefix=...' and gcc-4.7.
24-core Intel box with hyperthreading (48 hardware contexts).
TPC-C courtesy of oltpbenchmark.com; 12 warehouses, 24 clients.

I get a strange behavior across repeated runs: each 100-second run is a
bit slower than the one preceding it, when run with SSI (SERIALIZABLE).
Switching to SI (REPEATABLE_READ) removes the problem, so it's
apparently not due to the database growing. The database is completely
shut down (pg_ctl stop) between runs, but the data lives in tmpfs, so
there's no I/O problem here. 64GB RAM, so no paging, either.
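
For reference, each iteration of my test loop looks roughly like this
(a sketch; the oltpbenchmark flags are from memory and may need
adjusting for your checkout):

  pg_ctl -D $PGDATA -w start
  ./oltpbenchmark -b tpcc -c tpcc_config.xml --execute=true  # 100-second run
  pg_ctl -D $PGDATA -w stop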

Note that this slowdown is in addition to the 30% performance gap
between SI and SSI on my 24-core machine. I understand that the latter
is a known bottleneck [1]; my question is why the bottleneck should get
worse over time:

With SI, I get ~4.4ktps, consistently.
With SSI, I get 3.9, 3.8, 3.4, 3.3, 3.1, 2.9, ...

So the question: what should I look for to diagnose/triage this problem?
I've done some legwork already, but have no idea where to go next.

Looking through the logs, abort rates due to SSI aren't changing in any
obvious way. I've been hacking on SSI for over a month now as part of a
research project, and am fairly familiar with predicate.c, but I don't
see any obvious reason this behavior should arise (in particular, SLRU
storage seems to be re-initialized every time the postmaster restarts,
so there shouldn't be any particular memory effect due to SIREAD locks).
I'm also familiar with both Cahill's and Ports/Grittner's published
descriptions of SSI, but again, nothing obvious jumps out.
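
A cheap sanity check, for anyone who wants to verify that: pg_locks
exposes predicate locks with mode 'SIREAD', so their count can be
watched from a second session while the benchmark runs (database name
illustrative):

  psql -d tpcc -c "SELECT count(*) FROM pg_locks WHERE mode = 'SIREAD';"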

Top reports only 50-60% CPU usage for most clients, and linux perf shows
5-10% of time going to various lwlock calls. Compiling with
-DLWLOCK_STATS and comparing results for SI vs. SSI shows that the
WALInsertLock (7) is white-hot in all cases, followed by almost equally
white-hot SSI lwlocks (28,30,29) and the 16 PredicateLockMgrLocks
(65-80). Other than the WAL bottleneck, this aligns precisely with
previous results reported by others [1]; unfortunately, I don't see
anything obvious in the lock stats to say why the problem is getting
worse over time.
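
If anyone wants to reproduce the lock stats: LWLOCK_STATS is a
compile-time option, so the build is something like

  ./configure --prefix=... CFLAGS="-O2 -DLWLOCK_STATS"
  make -j48 && make install

and the per-lock counts are dumped to stderr as each backend exits.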

In my experience this sort of behavior indicates a bug, and fixing it
could have a significant impact on performance (the early "damage" is
done so quickly after start-up that even the very first run doesn't
live up to its true potential).

I also strongly suspect a bug because the SLRU storage (pg_serial)
occasionally jumps from the 100-200kB range to 8GB. It's rare, but when
it happens, performance tanks to tens of tps for the rest of the run,
and the 8GB persists into subsequent runs. I saw some code comments in
predicate.c suggesting that SLRU pages which fall out of range would not
be reclaimed until the next SLRU wraparound. Deleting pg_serial/* before
each `pg_ctl start' seems to remove most of the problem (and, from my
understanding of SSI, should be harmless, because no serialization
conflicts can persist across a database shutdown).
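
To catch the blow-up in the act, something as crude as watching the
directory size from another shell during a run does the job:

  while true; do du -sh $PGDATA/pg_serial; sleep 5; done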

I tried to repro, and a series of 30-second runs gave the following
throughputs (tps):
*4615
3155 3149 3115 3206 3162 3069 3005 2978 2953 **308
2871 2876 2838 2853 2817 2768 2736 2782 2732 2833
2749 2675 2771 2700 2675 2682 2647 2572 2626 2567
*4394

That ** entry was an 8GB blow-up. All files in the directory had been
created at the same time (= not during a previous run), and persisted
through the runs that followed. There was also a run where abort rates
jumped through the roof (~40k aborts rather than the usual 2000 or so),
with a huge number of "out of shared memory" errors; apparently
max_pred_locks_per_transaction=2000 wasn't high enough.
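
For anyone counting along: SSI aborts show up in the server log as
"could not serialize access due to read/write dependencies among
transactions", so a rough tally is just a grep (log path depends on
your logging settings):

  grep -c "could not serialize access" $PGDATA/pg_log/*.log
  grep -c "out of shared memory" $PGDATA/pg_log/*.log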

$ cat pgsql.conf
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 64
max_pred_locks_per_transaction = 2000
default_statistics_target = 100
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
effective_cache_size = 40GB
work_mem = 1920MB
wal_buffers = 16MB


[1] http://www.postgresql.org/message-id/CA+TgmoYAiSM2jWEndReY5PL0sKbhgg7dbDH6r=oXKYzi9B7KJA@mail.gmail.com

Thoughts?
Ryan


