
From Robert Haas
Subject Re: limiting hint bit I/O
Msg-id AANLkTikPDXtY8P7QmGZ4VNe2c6feHKCQJKaJ54CDATqB@mail.gmail.com
In response to Re: limiting hint bit I/O  (Merlin Moncure <mmoncure@gmail.com>)
Responses Re: limiting hint bit I/O  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Wed, Jan 19, 2011 at 11:18 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Jan 19, 2011 at 10:44 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Here's a new version of the patch based on some experimentation with
>> ideas I posted yesterday.  At least on my Mac laptop, this is pretty
>> effective at blunting the response time spike for the first table
>> scan, and it converges to steady-state after about 20 table scans.
>> Rather than write every 20th page, what I've done here is make every
>> 2000th buffer allocation grant an allowance of 100 "hint bit only"
>> writes.  All dirty pages and the next 100 pages that are
>> dirty-only-for-hint-bits get written out.  Then we stop writing the
>> dirty-only-for-hint-bits pages until we get our next allowance of
>> writes.  The idea is to try to avoid creating a lot of random writes
>> on each scan through the table.  At least here, that seems to work
>> pretty well - the initial scan is only about 25% slower than the
>> steady state (rather than 6x or more slower).
>
> Does this only impact the scan case?  In OLTP scenarios you want to
> write out the bits ASAP, I would imagine.  What about time-based
> flushing, so that only x dirty hint bit pages can be written out per
> time unit y?

No, it doesn't only affect the scan case.  But I don't think that's
bad.  The goal is for the background writer to provide enough clean
pages that backends don't have to write anything at all.  If that's
not happening, the backends will be slowed by the need to write out
pages themselves in order to create a sufficient supply of clean pages
to satisfy their allocation needs.  The easiest way for that situation
to occur is if the backend is doing a large sequential scan of a table
- in that case, it's by definition cycling through pages at top speed,
and the fact that it's cycling through them in a ring buffer rather
than using all of shared_buffers makes the loop even tighter.  But if
it's possible under some other set of circumstances, the behavior is
still reasonable.  This behavior kicks in if more than 100 out of some
set of 2000 page allocations would require a write only for the
purpose of flushing hint bits.
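
To make that concrete, the per-allocation decision looks roughly like
this (a sketch with made-up names, not the patch itself - the real code
hooks into the buffer allocation path and has to deal with locking and
the actual BufferDesc plumbing):

    #include <stdbool.h>

    #define ALLOC_CYCLE  2000   /* allocations per allowance grant */
    #define HINT_WRITES   100   /* hint-bit-only writes per cycle */

    static int  allocs_since_grant = 0;
    static int  hint_writes_left = HINT_WRITES;

    /* Called once per buffer allocation, for the eviction victim. */
    static bool
    should_write_victim(bool dirty, bool hint_bits_only)
    {
        if (++allocs_since_grant >= ALLOC_CYCLE)
        {
            allocs_since_grant = 0;
            hint_writes_left = HINT_WRITES;  /* grant a new allowance */
        }

        if (dirty && !hint_bits_only)
            return true;                     /* truly dirty: always write */

        if (hint_bits_only && hint_writes_left > 0)
        {
            hint_writes_left--;              /* spend one allowance slot */
            return true;
        }

        /* Hint-bit-only and the allowance is spent: evict without writing. */
        return false;
    }

Since each cycle writes out at most 100 of the hint-bit-only pages,
repeated scans of the same data still get everything written
eventually, which is consistent with the roughly 2000/100 = 20 scans
to reach steady state mentioned above.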

Time-based flushing would be problematic in several respects.  First,
it would require a kernel call, which would be vastly more expensive
than what I'm doing now, and might have undesirable performance
implications for that reason.  Second, I don't think it would be the
right way to tune it even if that were not an issue.  It doesn't
really matter whether the system takes a millisecond or a microsecond
or a nanosecond to write each buffer - what matters is that writing
all the buffers is a lot slower than writing none of them.  So what we
want to do is write a percentage of them, in a way that guarantees
that they'll all eventually get written if people continue to access
the same data.  This does that, and a time-based setting would not; it
would also almost certainly require tuning based on the I/O capacities
of the system it's running on, which isn't necessary with this
approach.

Before we get too deeply involved in theory, can you give this a test
drive on your system and see how it looks?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

