SSI slows down over time - Mailing list pgsql-performance

From Ryan Johnson
Subject SSI slows down over time
Msg-id 5340BB09.5010101@cs.utoronto.ca

Hi all,

Disclaimer: this question probably belongs on the hackers list, but the
instructions say you have to try somewhere else first... toss-up between
this list and a bug report; list seemed more appropriate as a starting
point. Happy to file a bug if that's more appropriate, though.

This is with pgsql-9.3.4, x86_64-linux, home-built with `./configure
--prefix=...' and gcc-4.7.
The workload is TPC-C, courtesy of oltpbenchmark.com: 12 warehouses,
24 clients.

I see strange behavior across repeated runs: when run with SSI
(SERIALIZABLE), each 100-second run is a bit slower than the one
preceding it.
Switching to SI (REPEATABLE_READ) removes the problem, so it's
apparently not due to the database growing. The database is completely
shut down (pg_ctl stop) between runs, but the data lives in tmpfs, so
there's no I/O problem here. 64GB RAM, so no paging, either.

Note that this slowdown is in addition to the 30% performance hit from
using SSI on my 24-core machine. I understand that the latter is a known
bottleneck; my question is why the bottleneck should get worse over time:

With SI, I get ~4.4ktps, consistently.
With SSI, I get 3.9, 3.8, 3.4, 3.3, 3.1, 2.9, ... ktps over successive runs.
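
To put numbers on the trend, here is a minimal Python sketch that works
out the run-over-run and cumulative change from the SSI figures above.
The function names are illustrative only and are not part of any
benchmark tooling; the inputs are just the rounded ktps values quoted in
this message, so the percentages are approximate.

```python
# Throughput figures (ktps) copied from the message above.
ssi_ktps = [3.9, 3.8, 3.4, 3.3, 3.1, 2.9]


def run_over_run_change(series):
    """Return the fractional change of each run relative to the previous one."""
    return [(cur - prev) / prev for prev, cur in zip(series, series[1:])]


def cumulative_change(series):
    """Return the fractional change from the first run to the last."""
    return (series[-1] - series[0]) / series[0]


changes = run_over_run_change(ssi_ktps)
for run_number, change in enumerate(changes, start=2):
    print(f"run {run_number}: {change:+.1%} vs. previous run")
print(f"first to last: {cumulative_change(ssi_ktps):+.1%}")
```

Every run-over-run change is negative, and the last listed run is about
a quarter slower than the first.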

So the question: what should I look for to diagnose/triage this problem?
I'm willing to do some legwork, but have no idea where to go next.

I've tried linux perf, but all it says is that lots of time is going to
LWLock (but callgraph tracing doesn't work in my not-bleeding-edge
kernel). Looking through the logs, the abort rates due to SSI aren't
changing in any obvious way. I've been hacking on SSI for over a month
now as part of a research project, and am fairly familiar with
predicate.c, but I don't see any obvious reason this behavior should
arise (in particular, SLRU storage seems to be re-initialized every time
the postmaster restarts, so there shouldn't be any particular memory
effect due to SIREAD locks). I'm also familiar with both Cahill's and
Ports/Grittner's published descriptions of SSI, but again, nothing
obvious jumps out.

In my experience this sort of behavior indicates a type of bug where
fixing it would have a large impact on performance (because the early
"damage" is done so quickly that even the very first run doesn't live up
to its true potential).

$ cat pgsql.conf
shared_buffers = 8GB
synchronous_commit = off
checkpoint_segments = 64
max_pred_locks_per_transaction = 2000
default_statistics_target = 100
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
effective_cache_size = 40GB
work_mem = 1920MB
wal_buffers = 16MB

Thanks,
Ryan


