CLOG contention - Mailing list pgsql-hackers

From Robert Haas
Subject CLOG contention
Date
Msg-id CA+TgmoYngU+RjLrZDn2vhWOVFpcg5z+jXZZsJXwPw7K6s8C11A@mail.gmail.com
Responses Re: CLOG contention  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: CLOG contention  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: CLOG contention  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-hackers
A few weeks ago I posted some performance results showing that
increasing NUM_CLOG_BUFFERS was improving pgbench performance.

http://archives.postgresql.org/pgsql-hackers/2011-12/msg00095.php

I spent some time today looking at this in a bit more detail.
Somewhat obviously in retrospect, it turns out that the problem
becomes more severe the longer you run the test.  CLOG lookups are
induced when we go to update a row that we've previously updated.
When the test first starts, just after pgbench -i, all the rows are
hinted and, even if they weren't, they all have the same XID.  So no
problem.  But, as the fraction of rows that have been updated
increases, it becomes progressively more likely that the next update
will hit a row that's already been updated.  Initially, that's OK,
because we can keep all the CLOG pages of interest in the 8 available
buffers.  But once we've eaten through enough XIDs - specifically, 8
buffers * 8192 bytes/buffer * 4 xids/byte = 256k - we can't keep all
the necessary pages in memory at the same time, and so we have to keep
replacing CLOG pages.  This effect is not difficult to see even on my
2-core laptop, although I'm not sure whether it causes any material
performance degradation.
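
To sanity-check that arithmetic, here's a trivial standalone program;
the constants are redeclared locally so it builds on its own, but the
values mirror the in-tree ones discussed above:

    #include <stdio.h>

    #define NUM_CLOG_BUFFERS     8      /* CLOG buffer pool size */
    #define BLCKSZ               8192   /* bytes per CLOG page */
    #define CLOG_XACTS_PER_BYTE  4      /* two status bits per XID */

    int
    main(void)
    {
        long    xids = (long) NUM_CLOG_BUFFERS * BLCKSZ * CLOG_XACTS_PER_BYTE;

        /* prints 262144, i.e. 256k transactions before pages start thrashing */
        printf("XIDs covered by the CLOG buffer cache: %ld\n", xids);
        return 0;
    }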

If you have enough concurrent tasks, a probably-more-serious form of
starvation can occur.  As SlruSelectLRUPage notes:
               /*
                * We need to wait for I/O.  Normal case is that it's dirty and we
                * must initiate a write, but it's possible that the page is already
                * write-busy, or in the worst case still read-busy.  In those cases
                * we wait for the existing I/O to complete.
                */

On Nate Boley's 32-core box, after running pgbench for a few minutes,
that "in the worst case" scenario starts happening quite regularly,
apparently because the number of people who simultaneously wish to
read different CLOG pages exceeds the number of available buffers
into which they can be read.  The ninth and following backends to come
along have to wait until the least-recently-used page is no longer
read-busy before starting their reads.
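
Here's a greatly simplified, self-contained model of that selection
loop - none of these types or helpers are the real slru.c code, the
names are invented, and it only captures the shape of the problem:
once all eight slots are busy with someone else's I/O, the next
backend has nothing to do but wait.

    #include <stdio.h>

    #define NUM_SLOTS 8                     /* stands in for NUM_CLOG_BUFFERS */

    typedef enum
    {
        PAGE_EMPTY,
        PAGE_READ_IN_PROGRESS,              /* the "worst case" in the comment above */
        PAGE_VALID,
        PAGE_WRITE_IN_PROGRESS
    } PageStatus;

    typedef struct
    {
        PageStatus  page_status[NUM_SLOTS];
        int         page_dirty[NUM_SLOTS];
        int         recency[NUM_SLOTS];     /* lower = less recently used */
    } SlruSketch;

    /* Toy stand-ins for the real I/O machinery. */
    static void
    wait_for_io(SlruSketch *s, int slot)
    {
        printf("slot %d: waiting for someone else's I/O\n", slot);
        s->page_status[slot] = PAGE_VALID;  /* pretend the other backend finished */
    }

    static void
    start_write(SlruSketch *s, int slot)
    {
        printf("slot %d: writing out dirty page\n", slot);
        s->page_dirty[slot] = 0;
    }

    static int
    find_lru_slot(const SlruSketch *s)
    {
        int     best = 0;

        for (int i = 1; i < NUM_SLOTS; i++)
            if (s->recency[i] < s->recency[best])
                best = i;
        return best;                        /* linear scan, strict LRU */
    }

    static int
    select_victim_slot(SlruSketch *s)
    {
        for (;;)
        {
            int     slot = find_lru_slot(s);

            if (s->page_status[slot] == PAGE_EMPTY ||
                (s->page_status[slot] == PAGE_VALID && !s->page_dirty[slot]))
                return slot;                /* free, or clean and idle: reuse it */

            if (s->page_status[slot] == PAGE_READ_IN_PROGRESS ||
                s->page_status[slot] == PAGE_WRITE_IN_PROGRESS)
                wait_for_io(s, slot);       /* the contended path on the 32-core box */
            else
                start_write(s, slot);       /* dirty: flush before reuse */
        }
    }

    int
    main(void)
    {
        SlruSketch  s = {{0}};

        /* all eight slots already busy reading other pages */
        for (int i = 0; i < NUM_SLOTS; i++)
            s.page_status[i] = PAGE_READ_IN_PROGRESS;

        printf("ninth reader got slot %d\n", select_victim_slot(&s));
        return 0;
    }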

So, what do we do about this?  The obvious answer is "increase
NUM_CLOG_BUFFERS", and I'm not sure that's a bad idea.  64kB is a
pretty small cache on anything other than an embedded system, these
days.  We could either increase the hard-coded value, or make it
configurable - but it would have to be PGC_POSTMASTER, since there's
no way to allocate more shared memory later on.  The downsides of this
approach are:

1. If we make it configurable, nobody will have a clue what value to set.
2. If we just make it bigger, people laboring under the default 32MB
shared memory limit will conceivably suffer even more than they do now
if they just initdb and go.
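
If we did go the configurable route, I imagine the entry in guc.c's
ConfigureNamesInt[] would look roughly like the sketch below - the
name "clog_buffers", the variable, the bounds, and the wording are all
invented, so this is just the shape of the thing, not working code:

    {
        {"clog_buffers", PGC_POSTMASTER, RESOURCES_MEM,
            gettext_noop("Sets the number of shared memory buffers used for the transaction status cache."),
            NULL,
            GUC_UNIT_BLOCKS
        },
        &num_clog_buffers,          /* hypothetical backing variable */
        8, 4, 1024,                 /* boot value, min, max - all guesses */
        NULL, NULL, NULL            /* check/assign/show hooks */
    },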

A more radical approach would be to try to merge the buffer arenas for
the various SLRUs either with each other or with shared_buffers, which
would presumably allow a lot more flexibility to ratchet the number of
CLOG buffers up or down depending on overall memory pressure.  Merging
the buffer arenas into shared_buffers seems like the most flexible
solution, but it also seems like a big, complex, error-prone behavior
change, because the SLRU machinery does things quite differently from
shared_buffers: we look up buffers with a linear array search rather
than a hash table probe; we have only a per-SLRU lock and a per-page
lock, rather than separate mapping locks, content locks,
io-in-progress locks, and pins; and while the main buffer manager is
content with some loosey-goosey approximation of recency, the SLRU
code makes a fervent attempt at strict LRU (slightly compromised for
the sake of reduced locking in SimpleLruReadPage_Readonly).
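
To make the first of those differences concrete, here's a toy,
standalone contrast of the two lookup styles; the names and structures
are invented for illustration and neither function is actual
PostgreSQL code:

    #include <stdio.h>

    #define SLRU_SLOTS 8

    /* SLRU-style lookup: a linear scan over a small array of page numbers. */
    static int
    slru_lookup(const int page_number[SLRU_SLOTS], int target_page)
    {
        for (int slot = 0; slot < SLRU_SLOTS; slot++)
            if (page_number[slot] == target_page)
                return slot;
        return -1;                          /* not cached: caller must read it in */
    }

    /* Buffer-manager-style lookup: hash the page tag, probe a bucket chain. */
    typedef struct BufEntry
    {
        int              page_number;
        int              buffer_id;
        struct BufEntry *next;
    } BufEntry;

    static int
    bufmgr_lookup(BufEntry *buckets[], int nbuckets, int target_page)
    {
        for (BufEntry *e = buckets[(unsigned) target_page % nbuckets]; e; e = e->next)
            if (e->page_number == target_page)
                return e->buffer_id;
        return -1;
    }

    int
    main(void)
    {
        int         slru_pages[SLRU_SLOTS] = {40, 41, 42, 43, 44, 45, 46, 47};
        BufEntry    entry = {42, 7, NULL};
        BufEntry   *buckets[16] = {NULL};

        buckets[42 % 16] = &entry;

        printf("slru slot for page 42: %d\n", slru_lookup(slru_pages, 42));
        printf("bufmgr buffer for page 42: %d\n", bufmgr_lookup(buckets, 16, 42));
        return 0;
    }

The scan is obviously fine while the array stays at 8 entries; the
hash-table approach is what lets the main buffer manager scale to many
thousands of buffers.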

Any thoughts on what makes most sense here?  I find it fairly tempting
to just crank up NUM_CLOG_BUFFERS and call it good, but the siren song
of refactoring is whispering in my other ear.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

