Re: measuring lwlock-related latency spikes - Mailing list pgsql-hackers

From	Kevin Grittner
Subject	Re: measuring lwlock-related latency spikes
Date	April 2, 2012 13:58:45
Msg-id	4F7994600200002500046A9A@gw.wicourts.gov
In response to	Re: measuring lwlock-related latency spikes (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: measuring lwlock-related latency spikes
List	pgsql-hackers

Robert Haas <robertmhaas@gmail.com> wrote:
> This particular example shows the above chunk of code taking >13s
> to execute.  Within 3s, every other backend piles up behind that,
> leading to the database getting no work at all done for a good ten
> seconds.
> 
> My guess is that what's happening here is that one backend needs
> to read a page into CLOG, so it calls SlruSelectLRUPage to evict
> the oldest SLRU page, which is dirty.  For some reason, that I/O
> takes a long time.  Then, one by one, other backends come along
> and also need to read various SLRU pages, but the oldest SLRU page
> hasn't changed, so SlruSelectLRUPage keeps returning the exact
> same page that it returned before, and everybody queues up waiting
> for that I/O, even though there might be other buffers available
> that aren't even dirty.
> 
> I am thinking that SlruSelectLRUPage() should probably do
> SlruRecentlyUsed() on the selected buffer before calling
> SlruInternalWritePage, so that the next backend that comes along
> looking for a buffer doesn't pick the same one.

That, or something else which prevents the same page from being
targeted by all processes, sounds like a good idea.

> Possibly we should go further and try to avoid replacing dirty
> buffers in the first place, but sometimes there may be no choice,
> so doing SlruRecentlyUsed() is still a good idea.

I can't help thinking that the "background hinter" I had ideas about
writing would prevent many of the reads of old CLOG pages, taking a
lot of pressure off of this area.  It just occurred to me that the
difference between that idea and having an autovacuum thread which
just did first-pass work on dirty heap pages is slim to none.  I
know how much time good benchmarking can take, so I hesitate to
suggest another permutation, but it might be interesting to see what
it does to the throughput if autovacuum is configured to what would
otherwise be considered insanely aggressive values (just for vacuum,
not analyze).  To give this a fair shot, the whole database would
need to be vacuumed between initial load and the start of the
benchmark.

-Kevin
