Re: Bug: Buffer cache is not scan resistant - Mailing list pgsql-hackers

From Pavan Deolasee
Subject Re: Bug: Buffer cache is not scan resistant
Msg-id 45EC6577.7050402@enterprisedb.com
In response to Re: Bug: Buffer cache is not scan resistant  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Bug: Buffer cache is not scan resistant  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Tom Lane wrote:
>
> Nope, Pavan's nailed it: the problem is that after using a buffer, the
> seqscan leaves it with usage_count = 1, which means it has to be passed
> over once by the clock sweep before it can be re-used.  I was misled in
> the 32-buffer case because catalog accesses during startup had left the
> buffer state pretty confused, so that there was no long stretch before
> hitting something available.  With a large number of buffers, the
> behavior is that the seqscan fills all of shared memory with buffers
> having usage_count 1.  Once the clock sweep returns to the first of
> these buffers, it will have to pass over all of them, reducing all of
> their counts to 0, before it returns to the first one and finds it now
> usable.  Subsequent tries find a buffer immediately, of course, until we
> have again filled shared_buffers with usage_count 1 everywhere.  So the
> problem is not so much the clock sweep overhead as that it's paid in a
> very nonuniform fashion: with N buffers you pay O(N) once every N reads
> and O(1) the rest of the time.  This is no doubt slowing things down
> enough to delay that one read, instead of leaving it nicely I/O bound
> all the time.  Mark, can you detect "hiccups" in the read rate using
> your setup?
>

Cool. You posted the same analysis before I could hit the "send" button :)

I am wondering whether the seqscan would set the usage_count to 1 or to a
higher value. usage_count is incremented when the buffer is unpinned. Even if
we use page-at-a-time mode, won't the buffer itself get pinned/unpinned every
time the seqscan returns a tuple? If that's the case, the overhead would be
O(BM_MAX_USAGE_COUNT * N) for every N reads.
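
To make the cost pattern concrete, here is a minimal simulation sketch of the clock sweep under a sequential scan. It is not PostgreSQL code: the function name `simulate_seqscan` and the parameter `usage_after_use` are made up for illustration, and it assumes every buffer is simply left at a fixed usage_count once the scan is done with it.

```python
def simulate_seqscan(nbuffers, nreads, usage_after_use=1):
    """Return the number of clock-sweep steps needed for each page read.

    Simplified model (an assumption, not the real buffer manager):
    a buffer is a victim only when its usage count is 0, the sweep
    decrements every non-zero count it passes, and the scan leaves
    each buffer it used at `usage_after_use`.
    """
    usage = [0] * nbuffers      # all buffers start out free
    hand = 0                    # clock hand position
    steps_per_read = []
    for _ in range(nreads):
        steps = 0
        while True:
            steps += 1
            victim = hand
            hand = (hand + 1) % nbuffers
            if usage[victim] == 0:
                break           # found a reusable buffer
            usage[victim] -= 1  # pass over it, aging it by one
        usage[victim] = usage_after_use
        steps_per_read.append(steps)
    return steps_per_read


if __name__ == "__main__":
    # usage_count left at 1: one expensive sweep every N reads
    print(simulate_seqscan(4, 12, usage_after_use=1))
    # usage_count left at 5: the expensive sweep is about 5 times longer
    print(simulate_seqscan(4, 12, usage_after_use=5))
```

In this model most reads cost a single step, and one read in every N costs roughly usage_after_use * N steps, which is the nonuniform behaviour described above.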

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea.  But at least with large
> shared_buffers it doesn't sound like such a bad idea.
>
How about a smaller value for BM_MAX_USAGE_COUNT?
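
For comparison, a rough sketch of the predecrement idea quoted above, set against the existing test-then-decrement order. Again this is only a toy model, not the actual buffer manager: `sweep_steps` and its `predecrement` flag are hypothetical names, and each buffer is assumed to be left at usage_count 1 after use.

```python
def sweep_steps(nbuffers, nreads, predecrement):
    """Count clock-sweep steps per read for the two decrement orders.

    predecrement=False: test for 0 first, then decrement (a buffer
    with count 1 must be passed over once before it is reusable).
    predecrement=True: decrement first, then test for 0 (a buffer
    with count 1 is reusable on the first sweep that reaches it).
    """
    usage = [0] * nbuffers
    hand = 0
    result = []
    for _ in range(nreads):
        steps = 0
        while True:
            steps += 1
            victim = hand
            hand = (hand + 1) % nbuffers
            if predecrement:
                if usage[victim] > 0:
                    usage[victim] -= 1
                if usage[victim] == 0:
                    break
            else:
                if usage[victim] == 0:
                    break
                usage[victim] -= 1
        usage[victim] = 1       # assumed: the scan leaves the count at 1
        result.append(steps)
    return result


if __name__ == "__main__":
    print(sweep_steps(4, 12, predecrement=False))
    print(sweep_steps(4, 12, predecrement=True))
```

Under these assumptions the predecrement variant finds a victim in one step on every read, while the postdecrement variant pays a full pass once per N reads. A smaller maximum count would shorten that pass but not remove it.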

> Another issue nearby to this is whether to avoid selecting buffers that
> are dirty --- IIRC someone brought that up again recently.  Maybe
> predecrement for clean buffers, postdecrement for dirty ones would be a
> cute compromise.
Can we use a 2-bit counter where the higher bit is set if the buffer is dirty
and the lower bit is set whenever the buffer is used? The clock sweep then
decrements this counter and chooses a victim with counter value 0.
ISTM that we should optimize for the large shared buffer pool case,
because that will be more common in the coming days. RAM is
getting cheaper every day.
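
A rough sketch of how the 2-bit counter might behave, under my own assumptions: the names `DIRTY_BIT`, `USED_BIT`, `touch` and `pick_victim` are hypothetical, the sweep treats the two bits as a plain integer when decrementing, and the counter is only an eviction priority (whether a page really is dirty would still have to be tracked separately).

```python
DIRTY_BIT = 0b10   # higher bit: set when the buffer is dirtied
USED_BIT = 0b01    # lower bit: set whenever the buffer is used


def touch(counters, i, dirty=False):
    """Record a use of buffer i, optionally marking it dirty."""
    counters[i] |= USED_BIT
    if dirty:
        counters[i] |= DIRTY_BIT


def pick_victim(counters, hand):
    """Advance the clock hand until a buffer with counter 0 is found.

    Every non-zero counter passed over is decremented by one.
    Returns (victim_index, new_hand_position).
    """
    n = len(counters)
    while True:
        i = hand
        hand = (hand + 1) % n
        if counters[i] == 0:
            return i, hand
        counters[i] -= 1


if __name__ == "__main__":
    counters = [0, 0, 0]
    touch(counters, 0, dirty=True)   # used and dirty -> 3
    touch(counters, 1)               # used, clean    -> 1
    touch(counters, 2)               # used, clean    -> 1
    print(pick_victim(counters, 0), counters)
```

In this toy model a clean, recently used buffer survives one pass of the sweep, while a dirty one survives up to three, so clean buffers are chosen as victims first.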

Thanks,
Pavan



