Re: CLOG contention - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: CLOG contention
Msg-id CA+U5nM+8wNw2H6sKNLwFfj2Zvy_yXLd6s8-YgGs1mhhw-ZDoUw@mail.gmail.com
In response to Re: CLOG contention  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Wed, Dec 21, 2011 at 7:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> I am not arguing

This seems like a normal and cool technical discussion to me.

>  I'm merely saying
> that the specific plan of having multiple SLRUs for CLOG doesn't
> appeal to me -- mostly because I think it will make life difficult for
> pg_upgrade without any compensating advantage.  If we're going to go
> that route, I'd rather build something into the SLRU machinery
> generally that allows for the cache to be less than fully-associative,
> with all of the savings in terms of lock contention that this entails.
>  Such a system could be used by any SLRU, not just CLOG, if it proved
> to be helpful; and it would avoid any on-disk changes, with, as far as
> I can see, basically no downside.

Partitioning will give us more buffers and more LWLocks to spread the
contention when we access the buffers. I use that word because it's
what we call the technique already used in the buffer manager and lock
manager. If you wish to call this "less than fully-associative" I
really don't mind, as long as we're discussing the same overall
concept; we can then focus on an implementation of that concept,
which no doubt can be done in many ways.

More buffers per lock does reduce the lock contention somewhat, but
not by much. So for me, it seems essential that we have more LWLocks
to solve the problem, which is where partitioning comes in.
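
To be concrete about the analogy, here's a minimal sketch of the
technique as the buffer manager already uses it - hash the lookup key,
then contend only for the LWLock that guards that partition. The names
here (N_LOCK_PARTITIONS, PartitionLocks, part_lock_for) are
illustrative, not proposed identifiers:

#include <stdint.h>

#define N_LOCK_PARTITIONS 16    /* a power of two, like NUM_BUFFER_PARTITIONS */

typedef struct LWLock LWLock;   /* stand-in for the real lock type */

static LWLock *PartitionLocks[N_LOCK_PARTITIONS];

static inline LWLock *
part_lock_for(uint32_t hashcode)
{
    /* mask rather than modulo, since the partition count is a power of two */
    return PartitionLocks[hashcode & (N_LOCK_PARTITIONS - 1)];
}

Contention then scales down roughly with the number of partitions,
because unrelated lookups no longer queue on one lock.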

My perspective is that there is clog contention in many places, not
just in the ones you identified. Main places I see are:

* Access to older pages (identified by you upthread). More buffers
address this problem.

* Committing requires us to hold an exclusive lock on a page, so there
is contention from nearly all sessions for the same page. The only way
to solve that is by striping pages, so that one page in the current
clog architecture would be striped across N pages, with consecutive
xids in separate partitions (sketched in code after this list).
Notably, this addresses Tom's concern that there is a much higher
request rate on very recent pages: each page would be split into N
pages, reducing contention.

* We allocate a new clog page every 32k xids. At the rates you have
now measured, we will do this every 1-2 seconds. When we do, we must
allocate a new page, which means writing the LRU page, which will be
dirty, since we fill 8 buffers in 16 seconds (or even 32 buffers in
about a minute), yet only flush buffers at checkpoint every 5 minutes.
We then need to write an XLogRecord for the new page. All of that
happens while XidGenLock is held, and while it is happening nothing
can commit or check clog. That causes nearly all work to halt for
about a second, perhaps longer while the traffic queue clears. It is
more obvious when writing to logged tables, since the XLogInsert for
the new clog page is then very badly contended. If we partition, we
will still be able to access most of the clog pages while this is
going on. (The page geometry is spelled out in the sketch below.)
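
To make the striping idea concrete, here's a minimal sketch of the
xid-to-page mapping I have in mind. The geometry restates the current
clog layout (2 status bits per xact, 4 xacts per byte, 8192-byte pages,
hence the 32k xids per page mentioned above); N_PARTITIONS and the
function names are illustrative only:

#include <stdint.h>

#define CLOG_XACTS_PER_PAGE 32768   /* 8192-byte page * 4 xacts per byte */
#define N_PARTITIONS        8       /* number of striped SLRUs; illustrative */

/* Consecutive xids are dealt round-robin across the N SLRUs ... */
static inline int
xid_to_partition(uint32_t xid)
{
    return xid % N_PARTITIONS;
}

/* ... and the page number is computed within that partition's xid stream. */
static inline int
xid_to_striped_page(uint32_t xid)
{
    return (xid / N_PARTITIONS) / CLOG_XACTS_PER_PAGE;
}

Two sessions committing xids n and n+1 then touch pages in different
partitions, under different locks, instead of fighting over the
exclusive lock on one hot page.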

So I think we need
* more buffers
* clog page striping
* partitioning

And I would characterise what I am suggesting as "partitioning +
striping" with the free benefit that we increase the number of buffers
as well via partitioning.

With all of that in mind, it's relatively easy to rewrite the clog code
so we allocate N SLRUs rather than just 1. That means we touch only
the clog code. Striping adjacent xids onto separate pages in other
ways would gut the SLRU code. We could just partition, but that won't
address Tom's concern, as you say. This is based upon code analysis
and hacking something together while thinking - if it helps discussion
I'll post that hack here, but it's not working yet. I don't think
reusing code from bufmgr/lockmgr would help either.
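
For flavour, the initialization would end up looking something like
this - a sketch, not the hack mentioned above, assuming the existing
SimpleLruInit() signature, the mapping sketched earlier, and a control
lock per partition obtained from LWLockAssign():

static SlruCtlData ClogPartCtlData[N_PARTITIONS];

#define ClogPartCtl(i)  (&ClogPartCtlData[(i)])

void
CLOGShmemInit(void)
{
    int     i;
    char    name[32];
    char    subdir[32];

    for (i = 0; i < N_PARTITIONS; i++)
    {
        /* one SLRU, one control lock and one subdirectory per partition */
        snprintf(name, sizeof(name), "CLOG Ctl %d", i);
        snprintf(subdir, sizeof(subdir), "pg_clog/%d", i);
        ClogPartCtl(i)->PagePrecedes = CLOGPagePrecedes;
        SimpleLruInit(ClogPartCtl(i), name, NUM_CLOG_BUFFERS,
                      CLOG_LSNS_PER_PAGE, LWLockAssign(), subdir);
    }
}

Callers like TransactionIdSetPageStatus() would then select the
partition's SlruCtl via the xid mapping instead of the single ClogCtl,
and CLOGShmemSize() would multiply by N.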

Yes, you're right that I'm suggesting we change the clog data
structures and that therefore we'd need to change pg_upgrade as well.
But that seems like a relatively simple piece of code given the clear
mapping between old and new structures. It would be able to run
quickly at upgrade time.
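
As a sketch of that mapping, reusing the definitions above
(read_old_status() and write_new_status() are hypothetical stand-ins
for the actual file I/O):

extern int  read_old_status(uint32_t xid);                /* hypothetical */
extern void write_new_status(int partition, int page,
                             uint32_t xid, int status);   /* hypothetical */

void
convert_clog(uint32_t oldest_xid, uint32_t next_xid)
{
    uint32_t xid;

    /*
     * Each xid's status is 2 bits of the old flat clog: byte
     * (xid % 32768) / 4 of its page, shifted by (xid % 4) * 2.
     * Striping means adjacent xids land in different partitions, so
     * the copy works at that 2-bit granularity; a real implementation
     * would batch per stripe rather than loop per xid.
     */
    for (xid = oldest_xid; xid != next_xid; xid++)
        write_new_status(xid_to_partition(xid),
                         xid_to_striped_page(xid),
                         xid, read_old_status(xid));
}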

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

