Thread: Bug: Buffer cache is not scan resistant

Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
I'm putting this out there before we publish a fix so that we can discuss
how best to fix it.

Doug and Sherry recently found the source of an important performance issue
with the Postgres shared buffer cache.

The issue is summarized like this: the buffer cache in PGSQL is not "scan
resistant" as advertised.  A sequential scan of a table larger than cache
will pollute the buffer cache in almost all circumstances.

Here is performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500
(thumper-3) with 4 cores where "bigtable" is a table 2x the size of RAM and
"memtable" is a table that fits into I/O cache:

With our default setting of shared_buffers (16MB):

Operation               memtable        bigtable
---------------------------------------------------
SELECT COUNT(*)         1221 MB/s       973 MB/s
VACUUM                  1709 MB/s       1206 MB/s

We had observed that VACUUM would perform better when done right after a
SELECT.  In the above example, the faster rate from disk was 1608 MB/s,
compared to the normal rate of 1206 MB/s.

We verified this behavior on Postgres 8.2 as well.  The buffer selection
algorithm is choosing buffer pages scattered throughout the buffer cache in
almost all circumstances.

Sherry traced the behavior to the processor repeatedly flushing the L2
cache.  Doug found that we weren't using the Postgres buffer cache the way
we expected, instead we were loading the scanned data from disk into the
cache even though there was no possibility of reusing it.  In addition to
pushing other, possibly useful pages from the cache, it has the additional
behavior of invalidating the L2 cache for the remainder of the executor path
that uses the data.

To prove that the buffer cache was the source of the problem, we dropped the
shared buffer size to fit into L2 cache (1MB per Opteron core).  With
shared_buffers reduced to 680KB, this is what we saw:

Operation               memtable        bigtable
---------------------------------------------------
SELECT COUNT(*)         1320 MB/s       1059 MB/s
VACUUM                  3033 MB/s       1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm?  Ideally,
one page would be used in the buffer cache in circumstances where the table
to be scanned is (significantly?) larger than the size of the buffer cache.
- Luke




Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> The issue is summarized like this: the buffer cache in PGSQL is not "scan
> resistant" as advertised.

Sure it is.  As near as I can tell, your real complaint is that the
bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
which is hardly surprising considering it doesn't know the size of L2
cache.  That's not a consideration that we've ever taken into account.

I'm also less than convinced that it'd be helpful for a big seqscan:
won't reading a new disk page into memory via DMA cause that memory to
get flushed from the processor cache anyway?  I wonder whether your
numbers are explained by some other consideration than you think.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
When we instrument the page selections made within the buffer cache, they
are sequential and span the entire address space of the cache.

With respect to whether it's L2, it's a conclusion based on the
experimental results.  It's not the TLB, as we also tested for the 512
entries for each L2.

One thing I left out of the previous post: the difference between fast and
slow behavior was that in the fast case, the buffer selection alternated
between two buffer pages.  This was the case only when the preceding
statement was a SELECT and the statement was VACUUM.

- Luke

Msg is shrt cuz m on ma treo

Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
One more thing: the L2 is invalidated when re-written from the kernel IO
cache, but the pages addressed in L2 retain their values when 'written
thru', which allows the new data to be re-used up the executor chain.

- Luke

Msg is shrt cuz m on ma treo

Re: Bug: Buffer cache is not scan resistant

From
Mark Kirkwood
Date:
Tom Lane wrote:
> "Luke Lonergan" <llonergan@greenplum.com> writes:
>> The issue is summarized like this: the buffer cache in PGSQL is not "scan
>> resistant" as advertised.
> 
> Sure it is.  As near as I can tell, your real complaint is that the
> bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
> which is hardly surprising considering it doesn't know the size of L2
> cache.  That's not a consideration that we've ever taken into account.
> 

To add a little to this - forgetting the scan resistant point for the
moment... cranking down shared_buffers to be smaller than the L2 cache
seems to help *any* sequential scan immensely, even on quite modest HW:

e.g: PIII 1.26Ghz 512Kb L2 cache, 2G ram,

SELECT count(*) FROM lineitem (which is about 11GB) performance:

Shared_buffers  Elapsed
--------------  -------
400MB           101 s
128KB            74 s

When I've profiled this activity, I've seen a lot of time spent
searching for/allocating a new buffer for each page being fetched.
Obviously having less of them to search through will help, but having
less than the L2 cache-size worth of 'em seems to help a whole lot!

Cheers

Mark






Re: Bug: Buffer cache is not scan resistant

From
Gavin Sherry
Date:
On Mon, 5 Mar 2007, Mark Kirkwood wrote:

> To add a little to this - forgetting the scan resistant point for the
> moment... cranking down shared_buffers to be smaller than the L2 cache
> seems to help *any* sequential scan immensely, even on quite modest HW:
>
> e.g: PIII 1.26Ghz 512Kb L2 cache, 2G ram,
>
> SELECT count(*) FROM lineitem (which is about 11GB) performance:
>
> Shared_buffers  Elapsed
> --------------  -------
> 400MB           101 s
> 128KB            74 s
>
> When I've profiled this activity, I've seen a lot of time spent
> searching for/allocating a new buffer for each page being fetched.
> Obviously having less of them to search through will help, but having
> less than the L2 cache-size worth of 'em seems to help a whole lot!

Could you demonstrate that point by showing us timings for shared_buffers
sizes from 512K up to, say, 2 MB? The two numbers you give there might
just have to do with managing a large buffer.

Thanks,

Gavin


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Gavin, Mark,

> Could you demonstrate that point by showing us timings for
> shared_buffers sizes from 512K up to, say, 2 MB? The two
> numbers you give there might just have to do with managing a
> large buffer.

I suggest two experiments that we've already done:
1) increase shared buffers to double the L2 cache size, you should see
that the behavior reverts to the "slow" performance and is constant at
larger sizes

2) instrument the calls to BufferGetPage() (a macro) and note that the
buffer block numbers returned increase sequentially during scans of
tables larger than the buffer size
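
For reference, a rough sketch of the kind of instrumentation meant here,
assuming an 8.1/8.2-era source tree -- the wrapper function and the DEBUG1
message are illustrative additions, not anything that exists in the Postgres
code:

/* Hypothetical logging wrapper around the real BufferGetPage() macro,
 * used only to observe which shared-buffer slot each page access lands in.
 * Shared buffers are numbered 1..NBuffers, so (buffer - 1) is the slot. */
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

static Page
BufferGetPageLogged(Buffer buffer)
{
    if (BufferIsValid(buffer) && !BufferIsLocal(buffer))
        elog(DEBUG1, "BufferGetPage: shared buffer slot %d", buffer - 1);

    return BufferGetPage(buffer);
}

During a seqscan of a table larger than shared_buffers, the logged slot
numbers walk sequentially through the whole buffer array, which is the
behavior described above.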

- Luke



Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Gavin Sherry <swm@alcove.com.au> writes:
> Could you demonstrate that point by showing us timings for shared_buffers
> sizes from 512K up to, say, 2 MB? The two numbers you give there might
> just have to do with managing a large buffer.

Using PG CVS HEAD on 64-bit Intel Xeon (1MB L2 cache), Fedora Core 5,
I don't measure any noticeable difference in seqscan speed for
shared_buffers set to 32MB or 256kB.  I note that the code would
not let me choose the latter setting without a large decrease in
max_connections, which might be expected to cause some performance
changes in itself.

Now this may only prove that the disk subsystem on this machine is
too cheap to let the system show any CPU-related issues.  I'm seeing
a scan rate of about 43MB/sec for both count(*) and plain ol' "wc",
which is a factor of 4 or so less than Mark's numbers suggest...
but "top" shows CPU usage of less than 5%, so even with a 4x faster
disk I'd not really expect that CPU speed would become interesting.

(This is indeed a milestone, btw, because it wasn't so long ago that
count(*) was nowhere near disk speed.)
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Grzegorz Jaskiewicz <gj@pointblue.com.pl> writes:
> On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:
>> I'm also less than convinced that it'd be helpful for a big seqscan:
>> won't reading a new disk page into memory via DMA cause that memory to
>> get flushed from the processor cache anyway?

> Nope. DMA is writing directly into main memory. If the area was in  
> the L2/L1 cache, it will get invalidated. But if it isn't there, it  
> is okay.

So either way, it isn't in processor cache after the read.  So how can
there be any performance benefit?
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
> So either way, it isn't in processor cache after the read.
> So how can there be any performance benefit?

It's the copy from kernel IO cache to the buffer cache that is L2
sensitive.  When the shared buffer cache is polluted, it thrashes the L2
cache.  When the number of pages being written to in the kernel->user
space writes fits in L2, then the L2 lines are "written through" (see
the link below on page 264 for the write combining features of the
opteron for example) and the writes to main memory are deferred.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/
25112.PDF

- Luke



Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Hi Tom,

> Now this may only prove that the disk subsystem on this
> machine is too cheap to let the system show any CPU-related
> issues.

Try it with a warm IO cache.  As I posted before, we see double the
performance of a VACUUM from a table in IO cache when the shared buffer
cache isn't being polluted.  The speed with large buffer cache should be
about 450 MB/s and the speed with a buffer cache smaller than L2 should
be about 800 MB/s.

The real issue here isn't the L2 behavior, though that's important when
trying to reach very high IO speeds; the issue is that we're seeing the
buffer cache pollution in the first place.  When we instrument the
blocks selected by the buffer page selection algorithm, we see that they
iterate sequentially, filling the shared buffer cache.  That's the
source of the problem here.

Do we have a regression test somewhere for this?

- Luke



Re: Bug: Buffer cache is not scan resistant

From
Grzegorz Jaskiewicz
Date:
On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:
> n into account.
>
> I'm also less than convinced that it'd be helpful for a big seqscan:
> won't reading a new disk page into memory via DMA cause that memory to
> get flushed from the processor cache anyway?

Nope. DMA is writing directly into main memory. If the area was in  
the L2/L1 cache, it will get invalidated. But if it isn't there, it  
is okay.

-- 
Grzegorz Jaskiewicz
gj@pointblue.com.pl





Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Luke Lonergan" <LLonergan@greenplum.com> writes:
>> So either way, it isn't in processor cache after the read.  
>> So how can there be any performance benefit?

> It's the copy from kernel IO cache to the buffer cache that is L2
> sensitive.  When the shared buffer cache is polluted, it thrashes the L2
> cache.  When the number of pages being written to in the kernel->user
> space writes fits in L2, then the L2 lines are "written through" (see
> the link below on page 264 for the write combining features of the
> opteron for example) and the writes to main memory are deferred.

That makes absolutely zero sense.  The data coming from the disk was
certainly not in processor cache to start with, and I hope you're not
suggesting that it matters whether the *target* page of a memcpy was
already in processor cache.  If the latter, it is not our bug to fix.

> http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/
> 25112.PDF

Even granting that your conclusions are accurate, we are not in the
business of optimizing Postgres for a single CPU architecture.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Hi Tom,

> Even granting that your conclusions are accurate, we are not
> in the business of optimizing Postgres for a single CPU architecture.

I think you're missing my/our point:

The Postgres shared buffer cache algorithm appears to have a bug.  When
there is a sequential scan the blocks are filling the entire shared
buffer cache.  This should be "fixed".

My proposal for a fix: ensure that when relations larger (much larger?)
than buffer cache are scanned, they are mapped to a single page in the
shared buffer cache.

- Luke



Re: Bug: Buffer cache is not scan resistant

From
Heikki Linnakangas
Date:
Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug.  When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache.  This should be "fixed".
> 
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.

It's not that simple. Using the whole buffer cache for a single seqscan 
is ok, if there's currently no better use for the buffer cache. Running 
a single select will indeed use the whole cache, but if you run any 
other smaller queries, the pages they need should stay in cache and the 
seqscan will loop through the other buffers.

In fact, the pages that are left in the cache after the seqscan finishes 
would be useful for the next seqscan of the same table if we were smart 
enough to read those pages first. That'd make a big difference for 
seqscanning a table that's, say, 1.5x your RAM size. Hmm, I wonder if 
Jeff's sync seqscan patch addresses that.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Bug: Buffer cache is not scan resistant

From
Hannu Krosing
Date:
On Mon, 2007-03-05 at 03:51, Luke Lonergan wrote:
>  Hi Tom,
> 
> > Even granting that your conclusions are accurate, we are not 
> > in the business of optimizing Postgres for a single CPU architecture.
> 
> I think you're missing my/our point:
> 
> The Postgres shared buffer cache algorithm appears to have a bug.  When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache.  This should be "fixed".
> 
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.

How will this approach play together with synchronized scan patches ?

Or should synchronized scan rely on systems cache only ?

> - Luke
> 
> 
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com



Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Luke Lonergan" <LLonergan@greenplum.com> writes:
> I think you're missing my/our point:

> The Postgres shared buffer cache algorithm appears to have a bug.  When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache.  This should be "fixed".

No, this is not a bug; it is operating as designed.  The point of the
current bufmgr algorithm is to replace the page least recently used,
and that's what it's doing.

If you want to lobby for changing the algorithm, then you need to
explain why one test case on one platform justifies de-optimizing
for a lot of other cases.  In almost any concurrent-access situation
I think that what you are suggesting would be a dead loss --- for
instance we might as well forget about Jeff Davis' synchronized-scan
work.

In any case, I'm still not convinced that you've identified the problem
correctly, because your explanation makes no sense to me.  How can the
processor's L2 cache improve access to data that it hasn't got yet?
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Florian Weimer
Date:
* Tom Lane:

> That makes absolutely zero sense.  The data coming from the disk was
> certainly not in processor cache to start with, and I hope you're not
> suggesting that it matters whether the *target* page of a memcpy was
> already in processor cache.  If the latter, it is not our bug to fix.

Uhm, if it's not in the cache, you typically need to evict some cache
lines to make room for the data, so I'd expect an indirect performance
hit.  I could be mistaken, though.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


Re: Bug: Buffer cache is not scan resistant

From
Hannu Krosing
Date:
On Mon, 2007-03-05 at 04:15, Tom Lane wrote:
> "Luke Lonergan" <LLonergan@greenplum.com> writes:
> > I think you're missing my/our point:
> 
> > The Postgres shared buffer cache algorithm appears to have a bug.  When
> > there is a sequential scan the blocks are filling the entire shared
> > buffer cache.  This should be "fixed".
> 
> No, this is not a bug; it is operating as designed.  

Maybe he means that there is an oversight (aka "bug") in the design ;)

> The point of the
> current bufmgr algorithm is to replace the page least recently used,
> and that's what it's doing.
> 
> If you want to lobby for changing the algorithm, then you need to
> explain why one test case on one platform justifies de-optimizing
> for a lot of other cases. 

If you know beforehand that you will definitely overflow the cache and not
reuse it anytime soon, then it seems quite reasonable to not even start
polluting the cache, especially if you get a noticeable boost in
performance while doing so.

> In almost any concurrent-access situation
> I think that what you are suggesting would be a dead loss 

Only if the concurrent access pattern is over data mostly fitting in
the buffer cache. If we can avoid polluting the buffer cache with data we
know we will use only once, more useful data will be available.

> --- for
> instance we might as well forget about Jeff Davis' synchronized-scan
> work.

Depends on the ratio of system cache to shared buffer cache. I don't think
Jeff's patch is anywhere near the point where it needs to start worrying
about data swapping between the system cache and shared buffers, or L2
cache usage.

> In any case, I'm still not convinced that you've identified the problem
> correctly, because your explanation makes no sense to me.  How can the
> processor's L2 cache improve access to data that it hasn't got yet?
> 
>             regards, tom lane
> 
-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:

> > The Postgres shared buffer cache algorithm appears to have a bug.
> > When there is a sequential scan the blocks are filling the entire
> > shared buffer cache.  This should be "fixed".
>
> No, this is not a bug; it is operating as designed.  The
> point of the current bufmgr algorithm is to replace the page
> least recently used, and that's what it's doing.

At least we've established that for certain.
> If you want to lobby for changing the algorithm, then you
> need to explain why one test case on one platform justifies
> de-optimizing for a lot of other cases.  In almost any
> concurrent-access situation I think that what you are
> suggesting would be a dead loss --- for instance we might as
> well forget about Jeff Davis' synchronized-scan work.

Instead of forgetting about it, we'd need to change it.
> In any case, I'm still not convinced that you've identified
> the problem correctly, because your explanation makes no
> sense to me.  How can the processor's L2 cache improve access
> to data that it hasn't got yet?

The evidence seems to clearly indicate reduced memory writing due to an
L2 related effect.  The actual data shows a dramatic reduction in main
memory writing when the destination of the written data fits in the L2
cache.

I'll try to fit a hypothesis to explain it.  Assume you've got a warm IO
cache in the OS.

The heapscan algorithm now works like this:
0) select a destination user buffer
1) uiomove->kcopy memory from the IO cache to the user buffer
1A) step 1: read from kernel space
1B) step 2: write to user space
2) the user buffer is accessed many times by the executor nodes above
Repeat

There are two situations we are evaluating: one where the addresses of
the user buffer are scattered over a space larger than the size of L2
(caseA) and one where they are confined to the size of L2 (caseB).  Note
that we could also consider another situation where the addresses are
scattered over a space smaller than the TLB entries mapped by the L2
cache (512 max) and larger than the size of L2, but we've tried that and
it proved uninteresting.

For both cases step 1A is the same: for each 8KB block copied in (1), the
read from the IO cache pulls the data into 128 L2 lines (64B each),
evicting the previous data there.

In step 1B for caseA the destination for the writes is mostly an address
not currently mapped into L2 cache, so 128 victim L2 lines are found
(LRU), stored into, and writes are flushed to main memory.

In step 1B for caseB, the destination for the writes is located in L2
already.  The 128 L2 lines are stored into, and the write to main memory
is delayed under the assumption that these lines are "hot" as they were
already in L2.

I don't know enough to be sure this is the right answer, but it does fit
the experimental data.
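
To make the hypothesis testable outside the database, here is a standalone
micro-benchmark sketch (plain C, not Postgres code; the region sizes are
assumptions chosen for a ~1MB L2 and would need adjusting for the machine
under test).  It copies 8KB pages from a large source area into destinations
either scattered over a region much larger than L2 (case A) or confined to a
region that fits in L2 (case B):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ  8192
#define N_COPIES (256 * 1024)              /* ~2GB worth of 8KB page copies */

/* copy successive pages from src into dst, wrapping around both regions */
static double
run_case(const char *src, size_t src_sz, char *dst, size_t dst_sz)
{
    clock_t start = clock();

    for (size_t i = 0; i < N_COPIES; i++)
    {
        const char *s = src + (i * PAGE_SZ) % src_sz;
        char       *d = dst + (i * PAGE_SZ) % dst_sz;

        memcpy(d, s, PAGE_SZ);             /* the step 1B kernel->user style write */
    }
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}

int
main(void)
{
    size_t src_sz   = 512UL * 1024 * 1024; /* stand-in for the kernel IO cache */
    size_t big_sz   = 128UL * 1024 * 1024; /* case A: destinations >> L2 */
    size_t small_sz = 512UL * 1024;        /* case B: destinations fit in L2 */
    char  *src   = malloc(src_sz);
    char  *big   = malloc(big_sz);
    char  *small = malloc(small_sz);

    if (!src || !big || !small)
        return 1;

    memset(src, 1, src_sz);                /* touch everything once up front */
    memset(big, 0, big_sz);
    memset(small, 0, small_sz);

    printf("case A (dest much larger than L2): %.2f s\n",
           run_case(src, src_sz, big, big_sz));
    printf("case B (dest fits in L2):          %.2f s\n",
           run_case(src, src_sz, small, small_sz));

    free(src);
    free(big);
    free(small);
    return 0;
}

If the hypothesis holds, case B should run measurably faster because the
destination lines stay resident in L2 and the writes to main memory can be
deferred.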

- Luke



Re: Bug: Buffer cache is not scan resistant

From
Mark Kirkwood
Date:
Gavin Sherry wrote:
> On Mon, 5 Mar 2007, Mark Kirkwood wrote:
> 
>> To add a little to this - forgetting the scan resistant point for the
>> moment... cranking down shared_buffers to be smaller than the L2 cache
>> seems to help *any* sequential scan immensely, even on quite modest HW:
>>
> (snipped)
>> When I've profiled this activity, I've seen a lot of time spent
>> searching for/allocating a new buffer for each page being fetched.
>> Obviously having less of them to search through will help, but having
>> less than the L2 cache-size worth of 'em seems to help a whole lot!
> 
> Could you demonstrate that point by showing us timings for shared_buffers
> sizes from 512K up to, say, 2 MB? The two numbers you give there might
> just have to do with managing a large buffer.

Yeah - good point:

PIII 1.26 Ghz 512Kb L2 cache 2G RAM

Test is elapsed time for: SELECT count(*) FROM lineitem

lineitem has 1535724 pages (11997 MB)

Shared Buffers  Elapsed  IO rate (from vmstat)
--------------  -------  ---------------------
400MB           101 s    122 MB/s

2MB             100 s
1MB              97 s
768KB            93 s
512KB            86 s
256KB            77 s
128KB            74 s    166 MB/s

I've added the observed IO rate for the two extreme cases (the rest can 
be pretty much deduced via interpolation).

Note that the system will do about 220 MB/s with the now (in)famous dd 
test, so we have a bit of headroom (not too bad for a PIII).

Cheers

Mark


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Hi Mark,

> lineitem has 1535724 pages (11997 MB)
>
> Shared Buffers  Elapsed  IO rate (from vmstat)
> --------------  -------  ---------------------
> 400MB           101 s    122 MB/s
>
> 2MB             100 s
> 1MB              97 s
> 768KB            93 s
> 512KB            86 s
> 256KB            77 s
> 128KB            74 s    166 MB/s
>
> I've added the observed IO rate for the two extreme cases
> (the rest can be pretty much deduced via interpolation).
>
> Note that the system will do about 220 MB/s with the now
> (in)famous dd test, so we have a bit of headroom (not too bad
> for a PIII).

What's really interesting: try this with a table that fits into I/O
cache (say half your system memory), and run VACUUM on the table.  That
way the effect will stand out more dramatically.

- Luke



Re: Bug: Buffer cache is not scan resistant

From
Gregory Stark
Date:
"Luke Lonergan" <LLonergan@greenplum.com> writes:

> The evidence seems to clearly indicate reduced memory writing due to an
> L2 related effect.  

You might try using valgrind's cachegrind tool which I understand can actually
emulate various processors' cache to show how efficiently code uses it. I
haven't done much with it though so I don't know how applicable it would be to
a large-scale effect like this. 

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:
> Shared Buffers  Elapsed  IO rate (from vmstat)
> --------------  -------  ---------------------
> 400MB           101 s    122 MB/s
> 2MB             100 s
> 1MB              97 s
> 768KB            93 s
> 512KB            86 s
> 256KB            77 s
> 128KB            74 s    166 MB/s

Hm, that seems to blow the "it's an L2 cache effect" theory out of the
water.  If it were a cache effect then there should be a performance
cliff at the point where the cache size is exceeded.  I see no such
cliff, in fact the middle part of the curve is darn near a straight
line on a log scale ...

So I'm back to asking what we're really measuring here.  Buffer manager
inefficiency of some sort, but what?  Have you tried oprofile?
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Pavan Deolasee"
Date:
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>   
>> Shared Buffers  Elapsed  IO rate (from vmstat)
>> --------------  -------  ---------------------
>> 400MB           101 s    122 MB/s
>> 2MB             100 s
>> 1MB              97 s
>> 768KB            93 s
>> 512KB            86 s
>> 256KB            77 s
>> 128KB            74 s    166 MB/s
>>     
>
> So I'm back to asking what we're really measuring here.  Buffer manager
> inefficiency of some sort, but what?  Have you tried oprofile?
>   
Isn't the size of the shared buffer pool itself acting as a performance
penalty in this case? Maybe StrategyGetBuffer() needs to make multiple
passes over the buffers before the usage_count of any buffer is reduced
to zero and the buffer is chosen as the replacement victim.

There is no real advantage to having a larger shared buffer pool in this
particular test. A heap buffer is hardly accessed again once the seqscan
passes over it. Can we try a smaller value for BM_MAX_USAGE_COUNT
and see if that has any positive impact for a large shared pool in this
case?

Thanks,
Pavan



Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Hi Tom,

On 3/5/07 8:53 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
> water.  If it were a cache effect then there should be a performance
> cliff at the point where the cache size is exceeded.  I see no such
> cliff, in fact the middle part of the curve is darn near a straight
> line on a log scale ...
> 
> So I'm back to asking what we're really measuring here.  Buffer manager
> inefficiency of some sort, but what?  Have you tried oprofile?

How about looking at the CPU performance counters directly using cpustat:

  cpustat -c BU_fill_into_L2,umask=0x1 1

This shows us how many L2 fills there are on all four cores (we use all
four).  Below are the traces of L2 fills with and without buffer cache
pollution.  In the case without pollution we fill 27 million lines; in the
pollution case we fill 44 million lines.

VACUUM orders (no buffer pollution):

  51.006   1  tick   2754293
  51.006   2  tick   3159565
  51.006   3  tick   2971929
  51.007   0  tick   3577487
  52.006   1  tick   4214179
  52.006   3  tick   3650193
  52.006   2  tick   3905828
  52.007   0  tick   3465261
  53.006   1  tick   1818766
  53.006   3  tick   1546018
  53.006   2  tick   1709385
  53.007   0  tick   1483371

And here is the case with buffer pollution:

VACUUM orders (with buffer pollution):

  22.006   0  tick   1576114
  22.006   1  tick   1542604
  22.006   2  tick   1987366
  22.006   3  tick   1784567
  23.006   3  tick   2706059
  23.006   2  tick   2362048
  23.006   0  tick   2190719
  23.006   1  tick   2088827
  24.006   0  tick   2247473
  24.006   1  tick   2153850
  24.006   2  tick   2422730
  24.006   3  tick   2758795
  25.006   0  tick   2419436
  25.006   1  tick   2229602
  25.006   2  tick   2619333
  25.006   3  tick   2712332
  26.006   1  tick   1827923
  26.006   2  tick   1886556
  26.006   3  tick   2909746
  26.006   0  tick   1467164




Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Pavan Deolasee" <pavan@enterprisedb.com> writes:
> Isn't the size of the shared buffer pool itself acting as a performance
> penalty in this case? Maybe StrategyGetBuffer() needs to make multiple
> passes over the buffers before the usage_count of any buffer is reduced
> to zero and the buffer is chosen as the replacement victim.

I read that and thought you were onto something, but it's not acting
quite the way I expect.  I made a quick hack in StrategyGetBuffer() to
count the number of buffers it looks at before finding a victim.
Running it with just 32 buffers on a large count(*), the behavior after
the initial startup transient is quite odd:

got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries
got buffer 3 after 1 tries
got buffer 4 after 1 tries
got buffer 5 after 1 tries
got buffer 6 after 1 tries
got buffer 7 after 1 tries
got buffer 8 after 1 tries
got buffer 12 after 4 tries
got buffer 14 after 2 tries
got buffer 21 after 7 tries
got buffer 26 after 5 tries
got buffer 27 after 1 tries
got buffer 31 after 4 tries
got buffer 0 after 1 tries
got buffer 1 after 1 tries
got buffer 9 after 8 tries
got buffer 10 after 1 tries
got buffer 11 after 1 tries
got buffer 13 after 2 tries
got buffer 15 after 2 tries
got buffer 16 after 1 tries
got buffer 17 after 1 tries
got buffer 18 after 1 tries
got buffer 19 after 1 tries
got buffer 20 after 1 tries
got buffer 22 after 2 tries
got buffer 23 after 1 tries
got buffer 24 after 1 tries
got buffer 25 after 1 tries
got buffer 28 after 3 tries
got buffer 29 after 1 tries
got buffer 30 after 1 tries
got buffer 2 after 4 tries

Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
do either, so I'm a bit at a loss what's going on --- but in any case,
it doesn't look like we are cycling through the entire buffer space
for each fetch.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Josh Berkus
Date:
Tom,

> Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
> do either, so I'm a bit at a loss what's going on --- but in any case,
> it doesn't look like we are cycling through the entire buffer space
> for each fetch.

I'd be happy to DTrace it, but I'm a little lost as to where to look in the 
kernel.   I'll see if I can find someone who knows more about memory 
management than me (that ought to be easy).

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Tom,

On 3/5/07 8:53 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
> water.  If it were a cache effect then there should be a performance
> cliff at the point where the cache size is exceeded.  I see no such
> cliff, in fact the middle part of the curve is darn near a straight
> line on a log scale ...

Here's that cliff you were looking for:

Size of Orders table: 7178MB
Blocksize: 8KB

Shared_buffers  Select Count    Vacuum
(KB)            (s)             (s)
=======================================
248             5.52            2.46
368             4.77            2.40
552             5.82            2.40
824             6.20            2.43
1232            5.60            3.59
1848            6.02            3.14
2768            5.53            4.56

All of these were run three times and the *lowest* time reported.  Also, the
behavior of "fast VACUUM after SELECT" begins abruptly at 1232KB of
shared_buffers.

These are Opterons with 2MB of L2 cache shared between two cores.

- Luke   




Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
I wrote:
> "Pavan Deolasee" <pavan@enterprisedb.com> writes:
>> Isn't the size of the shared buffer pool itself acting as a performance
>> penalty in this case? Maybe StrategyGetBuffer() needs to make multiple
>> passes over the buffers before the usage_count of any buffer is reduced
>> to zero and the buffer is chosen as the replacement victim.

> I read that and thought you were onto something, but it's not acting
> quite the way I expect.  I made a quick hack in StrategyGetBuffer() to
> count the number of buffers it looks at before finding a victim.
> ...
> Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
> do either, so I'm a bit at a loss what's going on --- but in any case,
> it doesn't look like we are cycling through the entire buffer space
> for each fetch.

Nope, Pavan's nailed it: the problem is that after using a buffer, the
seqscan leaves it with usage_count = 1, which means it has to be passed
over once by the clock sweep before it can be re-used.  I was misled in
the 32-buffer case because catalog accesses during startup had left the
buffer state pretty confused, so that there was no long stretch before
hitting something available.  With a large number of buffers, the
behavior is that the seqscan fills all of shared memory with buffers
having usage_count 1.  Once the clock sweep returns to the first of
these buffers, it will have to pass over all of them, reducing all of
their counts to 0, before it returns to the first one and finds it now
usable.  Subsequent tries find a buffer immediately, of course, until we
have again filled shared_buffers with usage_count 1 everywhere.  So the
problem is not so much the clock sweep overhead as that it's paid in a
very nonuniform fashion: with N buffers you pay O(N) once every N reads
and O(1) the rest of the time.  This is no doubt slowing things down
enough to delay that one read, instead of leaving it nicely I/O bound
all the time.  Mark, can you detect "hiccups" in the read rate using
your setup?
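
To illustrate the pattern being described, here is a toy simulation of the
sweep (not the real StrategyGetBuffer()/freelist.c code; the struct and
function names are made up).  Each simulated read claims a victim and leaves
it at usage_count = 1, so most reads find a victim on the first try but one
read per cycle pays an O(N) search:

#include <stdio.h>

typedef struct
{
    int usage_count;
} FakeBufferDesc;

/* simplified clock sweep: advance the hand, decrementing usage counts,
 * until a buffer with usage_count == 0 is found; return how many buffers
 * were examined */
static int
clock_sweep_victim(FakeBufferDesc *buf, int nbuffers, int *hand)
{
    int tries = 0;

    for (;;)
    {
        FakeBufferDesc *b = &buf[*hand];

        *hand = (*hand + 1) % nbuffers;
        tries++;

        if (b->usage_count == 0)
            return tries;              /* this slot becomes the victim */
        b->usage_count--;              /* give it one more trip around */
    }
}

int
main(void)
{
    enum { NBUFFERS = 32 };
    FakeBufferDesc buf[NBUFFERS] = {{0}};
    int hand = 0;

    /* simulate a seqscan: every page read takes a victim and leaves it
     * with usage_count = 1 */
    for (int read = 0; read < 100; read++)
    {
        int tries  = clock_sweep_victim(buf, NBUFFERS, &hand);
        int victim = (hand + NBUFFERS - 1) % NBUFFERS;

        buf[victim].usage_count = 1;
        printf("read %3d: victim %2d after %2d tries\n", read, victim, tries);
    }
    return 0;
}

The output shows long runs of one-try reads punctuated by a single
(NBUFFERS+1)-try search each time the hand wraps; scaled up to tens of
thousands of buffers, that periodic O(N) search is the nonuniform cost
described above.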

I seem to recall that we've previously discussed the idea of letting the
clock sweep decrement the usage_count before testing for 0, so that a
buffer could be reused on the first sweep after it was initially used,
but that we rejected it as being a bad idea.  But at least with large
shared_buffers it doesn't sound like such a bad idea.

Another issue nearby to this is whether to avoid selecting buffers that
are dirty --- IIRC someone brought that up again recently.  Maybe
predecrement for clean buffers, postdecrement for dirty ones would be a
cute compromise.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Here are four more points on the curve - I'd use a "Dirac delta function" for
your curve fit ;-)

Shared_buffers  Select Count    Vacuum
(KB)            (s)             (s)
=======================================
248             5.52            2.46
368             4.77            2.40
552             5.82            2.40
824             6.20            2.43
1232            5.60            3.59
1848            6.02            3.14
2768            5.53            4.56
5536            6.05            3.95
8304            5.80            4.37
12456           5.86            4.12
18680           5.83            4.10
28016           6.11            4.46

WRT what you found on the selection algorithm, I think it might also explain
the L2 effects.

I'm also still of the opinion that polluting the shared buffer cache for a
seq scan does not make sense.

- Luke

On 3/5/07 10:21 AM, "Luke Lonergan" <llonergan@greenplum.com> wrote:

> Tom,
> 
> On 3/5/07 8:53 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> 
>> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
>> water.  If it were a cache effect then there should be a performance
>> cliff at the point where the cache size is exceeded.  I see no such
>> cliff, in fact the middle part of the curve is darn near a straight
>> line on a log scale ...
> 
> Here's that cliff you were looking for:
> 
> Size of Orders table: 7178MB
> Blocksize: 8KB
> 
> Shared_buffers  Select Count    Vacuum
> (KB)            (s)             (s)
> =======================================
> 248             5.52            2.46
> 368             4.77            2.40
> 552             5.82            2.40
> 824             6.20            2.43
> 1232            5.60            3.59
> 1848            6.02            3.14
> 2768            5.53            4.56
> 
> All of these were run three times and the *lowest* time reported.  Also, the
> behavior of "fast VACUUM after SELECT" begins abruptly at 1232KB of
> shared_buffers.
> 
> These are Opterons with 2MB of L2 cache shared between two cores.
> 
> - Luke
>     
> 
> 
> 




Re: Bug: Buffer cache is not scan resistant

From
Josh Berkus
Date:
Tom,

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea.  But at least with large
> shared_buffers it doesn't sound like such a bad idea.

We did discuss a number of formulas for setting buffers with different
clock-sweep numbers, including ones with higher usage_count for indexes and
starting numbers of 0 for large seq scans as well as vacuums.  However, we
didn't have any way to prove that any of these complex algorithms would
result in higher performance, so went with the simplest formula, with the
idea of tinkering with it when we had more data.  So maybe now's the time.

Note, though, that the current algorithm is working very, very well for OLTP
benchmarks, so we'd want to be careful not to gain performance in one area at
the expense of another.  In TPCE testing, we've been able to increase
shared_buffers to 10GB with beneficial performance effect (numbers posted
when I have them) and even found that "taking over RAM" with the
shared_buffers (ala Oracle) gave us equivalent performance to using the FS
cache.  (yes, this means with a little I/O management engineering we could
contemplate discarding use of the FS cache for a net performance gain.  Maybe
for 8.4)

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


Re: Bug: Buffer cache is not scan resistant

From
"Pavan Deolasee"
Date:
Tom Lane wrote:
>
> Nope, Pavan's nailed it: the problem is that after using a buffer, the
> seqscan leaves it with usage_count = 1, which means it has to be passed
> over once by the clock sweep before it can be re-used.  I was misled in
> the 32-buffer case because catalog accesses during startup had left the
> buffer state pretty confused, so that there was no long stretch before
> hitting something available.  With a large number of buffers, the
> behavior is that the seqscan fills all of shared memory with buffers
> having usage_count 1.  Once the clock sweep returns to the first of
> these buffers, it will have to pass over all of them, reducing all of
> their counts to 0, before it returns to the first one and finds it now
> usable.  Subsequent tries find a buffer immediately, of course, until we
> have again filled shared_buffers with usage_count 1 everywhere.  So the
> problem is not so much the clock sweep overhead as that it's paid in a
> very nonuniform fashion: with N buffers you pay O(N) once every N reads
> and O(1) the rest of the time.  This is no doubt slowing things down
> enough to delay that one read, instead of leaving it nicely I/O bound
> all the time.  Mark, can you detect "hiccups" in the read rate using
> your setup?
>
>   

Cool. You posted the same analysis before I could hit the "send" button :)

I am wondering whether a seqscan would set the usage_count to 1 or to a
higher value. usage_count is incremented while unpinning the buffer. Even
if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
every time the seqscan returns a tuple? If that's the case, the overhead
would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea.  But at least with large
> shared_buffers it doesn't sound like such a bad idea.
>
>   
How about smaller value for BM_MAX_USAGE_COUNT ?

> Another issue nearby to this is whether to avoid selecting buffers that
> are dirty --- IIRC someone brought that up again recently.  Maybe
> predecrement for clean buffers, postdecrement for dirty ones would be a
> cute compromise.
>   
Can we use a 2-bit counter where the higher bit is set if the buffer is
dirty and the lower bit is set whenever the buffer is used? The clock sweep
then decrements this counter and chooses a victim with counter value 0.
ISTM that we should optimize for the large shared buffer pool case,
because that will be more common in the coming days. RAM is getting
cheaper every day.
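
A rough sketch of what that combined counter could look like -- purely
illustrative, with invented names, no locking, and no interaction with the
real freelist:

/* Illustrative 2-bit state per buffer: bit 1 = dirty, bit 0 = recently used.
 * The sweep decrements the combined value, so a dirty buffer needs one
 * extra pass before it can reach zero and be chosen as the victim. */
#define BUF_USED  0x01                  /* set whenever the buffer is unpinned */
#define BUF_DIRTY 0x02                  /* set when the page is modified */

typedef struct
{
    unsigned char state;                /* combined 2-bit usage/dirty counter */
} MiniBufDesc;

static int
sweep_for_victim(MiniBufDesc *bufs, int nbuffers, int *hand)
{
    for (;;)
    {
        MiniBufDesc *b    = &bufs[*hand];
        int          slot = *hand;

        *hand = (*hand + 1) % nbuffers;

        if (b->state == 0)
            return slot;                /* clean and unused: evict this one */
        b->state--;                     /* dirty buffers take longer to reach 0 */
    }
}

Callers would OR in BUF_USED on unpin and BUF_DIRTY on modification; the net
effect is that dirty pages survive one extra sweep compared to clean ones.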

Thanks,
Pavan




Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Pavan Deolasee" <pavan@enterprisedb.com> writes:
> I am wondering whether a seqscan would set the usage_count to 1 or to a
> higher value. usage_count is incremented while unpinning the buffer. Even
> if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned
> every time the seqscan returns a tuple? If that's the case, the overhead
> would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

No, it's only once per page.  There's a good deal of PrivateRefCount
thrashing that goes on while examining the individual tuples, but the
shared state only changes when we leave the page, because we hold pin
continuously on the current page of a seqscan.  If you don't believe
me, insert some debug printouts for yourself.

> How about smaller value for BM_MAX_USAGE_COUNT ?

This is not relevant to the problem: we are concerned about usage count
1 versus 0, not the other end of the range.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea.  But at least with large
> shared_buffers it doesn't sound like such a bad idea.

I seem to recall the classic clock sweep algorithm was to just use a single
bit. Either a buffer was touched recently or it wasn't.

I also vaguely recall a refinement involving keeping a bitmask and shifting it
right each time the clock hand comes around. So you have a precise history of
which recent clock sweeps the buffer was used and which it wasn't.

I think the coarseness of not caring how heavily it was used is a key part of
the algorithm. By not caring whether it was lightly or heavily used during the
clock sweep, just that it was used at least once, it avoids sticking with
incorrect deductions about things like sequential scans, even if multiple
sequential scans pass by. As soon as they stop touching the buffer it
immediately reacts and discards the buffer.
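
Something like the following toy version of that refinement -- illustrative
only, not taken from any OS or Postgres source; the one-byte history depth
and the "reusable only if untouched for 8 sweeps" policy are arbitrary
choices of mine:

#include <stdint.h>

/* One history byte per buffer: the top bit means "used since the clock hand
 * last passed"; each pass shifts the byte right, so the byte records usage
 * over the last 8 sweeps. */
#define USED_THIS_SWEEP 0x80

typedef struct
{
    uint8_t history;
} HistBufDesc;

/* called whenever the buffer is accessed (e.g. on unpin) */
static void
mark_used(HistBufDesc *b)
{
    b->history |= USED_THIS_SWEEP;
}

/* called when the clock hand passes the buffer: age the history one sweep
 * and report whether the buffer is now reusable under this simple policy */
static int
sweep_and_check(HistBufDesc *b)
{
    b->history >>= 1;
    return b->history == 0;             /* untouched in any recorded sweep */
}

Other policies could weight the bits instead of requiring all of them to be
zero; the point is only that the byte gives a per-sweep usage history.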

I would check out my OS book from school but it's on another continent :(

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
> Tom,
> 
> > I seem to recall that we've previously discussed the idea of letting the
> > clock sweep decrement the usage_count before testing for 0, so that a
> > buffer could be reused on the first sweep after it was initially used,
> > but that we rejected it as being a bad idea.  But at least with large
> > shared_buffers it doesn't sound like such a bad idea.

> Note, though, that the current algorithm is working very, very well for OLTP 
> benchmarks, so we'd want to be careful not to gain performance in one area at 
> the expense of another. 

Agreed.

What we should also add to the analysis is that this effect only occurs
when only uniform workloads are present, like SeqScan, VACUUM or COPY.
When you have lots of indexed access, the scan workloads don't have as
much effect on the cache pollution as we are seeing in these tests.

Itakgaki-san and I were discussing in January the idea of cache-looping,
whereby a process begins to reuse its own buffers in a ring of ~32
buffers. When we cycle back round, if usage_count==1 then we assume that
we can reuse that buffer. This avoids cache swamping for read and write
workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it
with a hint StrategyHintCyclicBufferReuse() in a similar manner to the
hint VACUUM provides now.

This would maintain the beneficial behaviour for OLTP, while keeping
data within the L2 cache for DSS and bulk workloads.
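
A sketch of how such a ring might look -- hypothetical code:
StrategyHintCyclicBufferReuse() is only proposed above, the structure and
field names are invented, and locking plus the real freelist interaction are
ignored:

#define SCAN_RING_SIZE 32

typedef struct
{
    int buf_id[SCAN_RING_SIZE];   /* shared-buffer slots this scan has used */
    int nfilled;                  /* how many slots are valid so far */
    int next;                     /* oldest entry, i.e. next reuse candidate */
    int enabled;                  /* set via the proposed StrategyHintCyclicBufferReuse() */
} ScanRing;

/* Decide which buffer to load the scan's next page into.  Returns a slot to
 * recycle, or -1 meaning "fall back to the normal clock sweep".
 * usage_count_of() stands in for reading the real buffer header. */
static int
ring_next_buffer(ScanRing *ring, int (*usage_count_of)(int buf_id))
{
    if (!ring->enabled || ring->nfilled < SCAN_RING_SIZE)
        return -1;                          /* ring disabled or not yet full */

    int candidate = ring->buf_id[ring->next];

    if (usage_count_of(candidate) <= 1)
    {
        /* nobody else touched it since we did: reuse our own buffer, keeping
         * the scan's footprint to ~32 pages (small enough to stay in L2) */
        ring->next = (ring->next + 1) % SCAN_RING_SIZE;
        return candidate;
    }
    return -1;                              /* someone else wants it; leave it be */
}

/* remember a buffer handed out by the normal path so it joins the ring */
static void
ring_remember(ScanRing *ring, int buf_id)
{
    if (ring->nfilled < SCAN_RING_SIZE)
        ring->buf_id[ring->nfilled++] = buf_id;
    else
    {
        ring->buf_id[ring->next] = buf_id;  /* replace the entry we skipped */
        ring->next = (ring->next + 1) % SCAN_RING_SIZE;
    }
}

The interesting part is the usage_count <= 1 test: a buffer the scan itself
filled sits at 1, so anything higher means another backend touched it and
the ring leaves it to the shared cache, which is how the OLTP working set
stays protected.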

-- 
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> Itakgaki-san and I were discussing in January the idea of cache-looping,
> whereby a process begins to reuse its own buffers in a ring of ~32
> buffers. When we cycle back round, if usage_count==1 then we assume that
> we can reuse that buffer. This avoids cache swamping for read and write
> workloads, plus avoids too-frequent WAL writing for VACUUM.

> This would maintain the beneficial behaviour for OLTP,

Justify that claim.  It sounds to me like this would act very nearly the
same as having shared_buffers == 32 ...
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
This sounds like a good idea.

- Luke

Msg is shrt cuz m on ma treo

Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > Itagaki-san and I were discussing in January the idea of cache-looping,
> > whereby a process begins to reuse its own buffers in a ring of ~32
> > buffers. When we cycle back round, if usage_count==1 then we assume that
> > we can reuse that buffer. This avoids cache swamping for read and write
> > workloads, plus avoids too-frequent WAL writing for VACUUM.
> 
> > This would maintain the beneficial behaviour for OLTP,
> 
> Justify that claim.  It sounds to me like this would act very nearly the
> same as having shared_buffers == 32 ...

Sure. We wouldn't set the hint for IndexScans or Inserts, only for
SeqScans, VACUUM and COPY.

So OLTP-only workloads would be entirely unaffected. In the presence of
a mixed workload the scan tasks would have only a limited effect on the
cache, maintaining performance for the response time critical tasks. So
it's an OLTP benefit because of cache protection and WAL-flush reduction
during VACUUM.

As we've seen, the scan tasks look like they'll go faster with this.

The assumption that we can reuse the buffer if usage_count<=1 seems
valid. If another user had requested the block, then the usage_count
would be > 1, unless the buffer has been used, unpinned and then a cycle
of the buffer cache had spun round, all within the time taken to process
32 blocks sequentially. We do have to reuse one of the buffers, so
cyclical reuse seems like a better bet most of the time than more
arbitrary block reuse, as we see in a larger cache.

Best way is to prove it though. Seems like not too much work to have a
private ring data structure when the hint is enabled. The extra
bookkeeping is easily going to be outweighed by the reduction in mem->L2
cache fetches. I'll do it tomorrow, if no other volunteers.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Mon, 2007-03-05 at 03:51 -0500, Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug.  When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache.  This should be "fixed".
> 
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.
> 

I don't see why we should strictly limit sequential scans to one buffer
per scan (which I assume is what you mean). That raises two questions:

(1) What happens when there are more seq scans than cold buffers
available?
(2) What happens when two sequential scans need the same page, do we
double-up?

Also, the first time we access any heap page of a big table, we are very
unsure whether we will access it again, regardless of whether it's part
of a seq scan or not.

In our current system of 4 LRU lists (depending on how many times a
buffer has been referenced), we could start "more likely" (e.g. system
catalogs, index pages) pages in higher list, and heap reads from big
tables in the lowest possible list. Assuming, of course, that has any
benefit (frequently accessed cache pages are likely to move up in the
lists very quickly anyway).

But I don't think we should eliminate caching of heap pages in big
tables all together. A few buffers might be polluted during the scan,
but most of the time they will be replacing other low-priority pages
(perhaps from another seq scan) and probably be replaced again quickly.
If that's not happening, and it's polluting frequently-accessed pages, I
agree that's a bug.

Regards,
Jeff Davis



Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Mon, 2007-03-05 at 11:10 +0200, Hannu Krosing wrote:
> > My proposal for a fix: ensure that when relations larger (much larger?)
> > than buffer cache are scanned, they are mapped to a single page in the
> > shared buffer cache.
> 
> How will this approach play together with synchronized scan patches ?
> 

Thanks for considering my patch in this discussion. I will test by
turning shared_buffers down as low as I can, and see if that makes a big
difference.

> Or should synchronized scan rely on systems cache only ?
> 

I don't know what the performance impact of that will be; still good
compared to reading from disk, but I assume much more CPU time.

Regards,
Jeff Davis



Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> Best way is to prove it though. Seems like not too much work to have a
> private ring data structure when the hint is enabled. The extra
> bookkeeping is easily going to be outweighed by the reduction in mem->L2
> cache fetches. I'll do it tomorrow, if no other volunteers.

[ shrug... ]  No one has yet proven to my satisfaction that L2 cache has
anything to do with this.  The notion that you can read a new disk page
into a shared buffer and have that buffer still be live in the processor
cache is so obviously bogus that I think there must be some other effect
at work.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Mon, 2007-03-05 at 09:09 +0000, Heikki Linnakangas wrote:
> In fact, the pages that are left in the cache after the seqscan finishes 
> would be useful for the next seqscan of the same table if we were smart 
> enough to read those pages first. That'd make a big difference for 
> seqscanning a table that's say 1.5x your RAM size. Hmm, I wonder if 
> Jeff's sync seqscan patch adresses that.
> 

Absolutely. I've got a parameter in my patch "sync_scan_offset" that
starts a seq scan N pages before the position of the last seq scan
running on that table (or a current seq scan if there's still a scan
going). 

If the last scan is not still in progress, the pages are less likely to
be in cache. If the pages are in cache, great; if not, it doesn't matter
where we start anyway.

If the last scan is still in progress, those recently-read pages are
very likely to be in cache (shared buffers or OS buffer cache).
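
For illustration only, a tiny stand-alone sketch of that arithmetic
(the parameter and function names here are made up, not the patch itself):

    #include <stdio.h>

    /*
     * Hypothetical sketch of choosing a start block for a synchronized scan:
     * begin sync_scan_offset pages behind the last reported scan position,
     * wrapping around the start of the table if necessary.
     */
    static unsigned int
    sync_scan_start_block(unsigned int last_reported_block,
                          unsigned int sync_scan_offset,
                          unsigned int table_nblocks)
    {
        if (sync_scan_offset >= table_nblocks)
            return 0;           /* offset covers the whole table: just start at 0 */
        return (last_reported_block + table_nblocks - sync_scan_offset)
               % table_nblocks;
    }

    int
    main(void)
    {
        /* e.g. the last scan reported block 120 of a 100000-block table */
        printf("start scanning at block %u\n",
               sync_scan_start_block(120, 1000, 100000));
        return 0;
    }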

Regards,
Jeff Davis



Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> Absolutely. I've got a parameter in my patch "sync_scan_offset" that
> starts a seq scan N pages before the position of the last seq scan
> running on that table (or a current seq scan if there's still a scan
> going). 

Strikes me that expressing that parameter as a percentage of
shared_buffers might make it less in need of manual tuning ...
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Absolutely. I've got a parameter in my patch "sync_scan_offset" that
> > starts a seq scan N pages before the position of the last seq scan
> > running on that table (or a current seq scan if there's still a scan
> > going). 
> 
> Strikes me that expressing that parameter as a percentage of
> shared_buffers might make it less in need of manual tuning ...
> 

The original patch was a percentage of effective_cache_size, because in
theory it may be helpful to have this parameter larger than shared
buffers. Synchronized Scanning can take advantage of OS buffer cache as
well.

Someone convinced me to change it to be an independent variable.

I don't have a strong opinion, but now I have three different opinions:
(1) Independent parameter
(2) Percentage of shared_buffers
(3) Percentage of effective_cache_size

Regards,
Jeff Davis



Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
>> Strikes me that expressing that parameter as a percentage of
>> shared_buffers might make it less in need of manual tuning ...

> The original patch was a percentage of effective_cache_size, because in
> theory it may be helpful to have this parameter larger than shared
> buffers. Synchronized Scanning can take advantage of OS buffer cache as
> well.

I didn't say you couldn't allow it to be more than 100% ;-).  But basing
it on effective_cache_size strikes me as a bad idea because that parameter
is seldom better than a wild guess.  shared_buffers at least means
something.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Heikki Linnakangas
Date:
Jeff Davis wrote:
> On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
>> Jeff Davis <pgsql@j-davis.com> writes:
>>> Absolutely. I've got a parameter in my patch "sync_scan_offset" that
>>> starts a seq scan N pages before the position of the last seq scan
>>> running on that table (or a current seq scan if there's still a scan
>>> going). 
>> Strikes me that expressing that parameter as a percentage of
>> shared_buffers might make it less in need of manual tuning ...
>>
> 
> The original patch was a percentage of effective_cache_size, because in
> theory it may be helpful to have this parameter larger than shared
> buffers. Synchronized Scanning can take advantage of OS buffer cache as
> well.
> 
> Someone convinced me to change it to be an independent variable.
> 
> I don't have a strong opinion, but now I have three different opinions:
> (1) Independent parameter
> (2) Percentage of shared_buffers
> (3) Percentage of effective_cache_size

Another approach I proposed back in December is to not have a variable 
like that at all, but scan the buffer cache for pages belonging to the 
table you're scanning to initialize the scan. Scanning all the 
BufferDescs is a fairly CPU and lock heavy operation, but it might be ok 
given that we're talking about large I/O bound sequential scans. It 
would require no DBA tuning and would work more robustly in varying 
conditions. I'm not sure where you would continue after scanning the 
in-cache pages. At the highest in-cache block number, perhaps.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Bug: Buffer cache is not scan resistant

From
Mark Kirkwood
Date:
Tom Lane wrote:

> So the
> problem is not so much the clock sweep overhead as that it's paid in a
> very nonuniform fashion: with N buffers you pay O(N) once every N reads
> and O(1) the rest of the time.  This is no doubt slowing things down
> enough to delay that one read, instead of leaving it nicely I/O bound
> all the time.  Mark, can you detect "hiccups" in the read rate using
> your setup?
> 

I think so, here's the vmstat output for 400MB of shared_buffers during
the scan:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0    764  51772      0 1990688    0    0 120422     2 1546  1755 16 37 46  1
 1  0    764  53640      0 1988792    0    0 120422     2 1544  1446 14 40 46  1
 1  0    788  54900      0 1987564    0    0 116746    15 1470  3067 15 39 44  2
 1  0    788  52800      0 1989552    0    0 119199    20 1488  2216 14 37 47  1
 1  0    788  52372      0 1990000    0    0 122880     7 1532  1203 15 39 45  1
 1  0    788  54592      0 1987872    0    5 124928     5 1557  1058 17 38 46  0
 2  0    788  54052      0 1987836    0    0 118787     0 1500  2469 16 36 47  1
 1  0    788  52552      0 1989892    0    0 120419     0 1506  2531 15 36 48  1
 1  0    788  53452      0 1989356    0    0 119195     2 1501  1698 15 37 47  1
 1  0    788  52680      0 1989796    0    0 120424     2 1521  1610 16 37 47  1


Cheers

Mark





Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> Mark, can you detect "hiccups" in the read rate using
>> your setup?

> I think so, here's the vmstat output for 400MB of shared_buffers during
> the scan:

Hm, not really a smoking gun there.  But just for grins, would you try
this patch and see if the numbers change?

            regards, tom lane

*** src/backend/storage/buffer/freelist.c.orig    Fri Jan  5 18:02:12 2007
--- src/backend/storage/buffer/freelist.c    Mon Mar  5 16:33:11 2007
***************
*** 104,116 ****
           * it; decrement the usage_count and keep scanning.
           */
          LockBufHdr(buf);
-         if (buf->refcount == 0 && buf->usage_count == 0)
-             return buf;
          if (buf->usage_count > 0)
          {
              buf->usage_count--;
!             trycounter = NBuffers;
          }
          else if (--trycounter == 0)
          {
              /*
--- 104,116 ----
           * it; decrement the usage_count and keep scanning.
           */
          LockBufHdr(buf);
          if (buf->usage_count > 0)
          {
              buf->usage_count--;
!             trycounter = NBuffers + 1;
          }
+         if (buf->refcount == 0 && buf->usage_count == 0)
+             return buf;
          else if (--trycounter == 0)
          {
              /*

Re: Bug: Buffer cache is not scan resistant

From
Mark Kirkwood
Date:
Tom Lane wrote:

> 
> Hm, not really a smoking gun there.  But just for grins, would you try
> this patch and see if the numbers change?
> 

Applied to 8.2.3 (don't have lineitem loaded in HEAD yet) - no change 
that I can see:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0    764  55300      0 1988032    0    0 118797    32 1532  1775 15 39 44  2
 1  0    764  54476      0 1989720    0    0 115507     9 1456  3970 15 39 45  1
 1  0    788  54540      0 1989592    0    0 121651     0 1508  3221 16 37 47  0
 1  0    788  52808      0 1991320    0    0 124109     0 1532  1236 15 38 46  0
 1  0    788  52504      0 1991784    0    0 124518     0 1547  1005 16 39 45  0
 2  0    788  54544      0 1989740    0    5 117965     5 1491  2012 15 36 47  2
 1  0    788  53596      0 1991184    0    0 120424     0 1504  1910 16 37 46  1

Elapsed time is exactly the same (101 s). Is it expected that HEAD would 
behave differently?

Cheers

Mark


Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:
> Elapsed time is exactly the same (101 s). Is it expected that HEAD would 
> behave differently?

Offhand I don't think so.  But what I wanted to see was the curve of
elapsed time vs shared_buffers?
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Mon, 2007-03-05 at 21:03 +0000, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a variable 
> like that at all, but scan the buffer cache for pages belonging to the 
> table you're scanning to initialize the scan. Scanning all the 
> BufferDescs is a fairly CPU and lock heavy operation, but it might be ok 
> given that we're talking about large I/O bound sequential scans. It 
> would require no DBA tuning and would work more robustly in varying 
> conditions. I'm not sure where you would continue after scanning the 
> in-cache pages. At the highest in-cache block number, perhaps.
> 

I assume you're referring to this:

"each backend keeps a bitmap of pages it has processed during the scan,
and read the pages in the order they're available in cache."

which I think is a great idea. However, I was unable to devise a good
answer to all these questions at once:

* How do we attempt to maintain sequential reads on the underlying I/O
layer?

* My current implementation takes advantage of the OS buffer cache, how
could we maintain that advantage from PostgreSQL-specific cache logic?

* How do I test to see whether it actually helps in a realistic
scenario? It seems like it would help the most when scans are
progressing at different rates, but how often do people have CPU-bound
queries on tables that don't fit into physical memory (and how long
would it take for me to benchmark such a query)?

It seems like your idea is more analytical, and my current
implementation is more guesswork. I like the analytical approach, but I
don't know that we have enough information to pull it off because we're
missing what's in the OS buffer cache. The OS buffer cache is crucial to
Synchronized Scanning, because shared buffers are evicted based on a
more complex set of circumstances, whereas the OS buffer cache is
usually LRU and forms a nicer "cache trail" (upon which Synchronized
Scanning is largely based). 

If you have some tests you'd like me to run, I'm planning to do some
benchmarks this week and next. I can see if my current patch holds up
under the scenarios you're worried about.

Regards,
Jeff Davis





Re: Bug: Buffer cache is not scan resistant

From
"Florian G. Pflug"
Date:
Simon Riggs wrote:
> On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>> Itagaki-san and I were discussing in January the idea of cache-looping,
>>> whereby a process begins to reuse its own buffers in a ring of ~32
>>> buffers. When we cycle back round, if usage_count==1 then we assume that
>>> we can reuse that buffer. This avoids cache swamping for read and write
>>> workloads, plus avoids too-frequent WAL writing for VACUUM.
>>> This would maintain the beneficial behaviour for OLTP,
>> Justify that claim.  It sounds to me like this would act very nearly the
>> same as having shared_buffers == 32 ...
> 
> Sure. We wouldn't set the hint for IndexScans or Inserts, only for
> SeqScans, VACUUM and COPY.
> 
> So OLTP-only workloads would be entirely unaffected. In the presence of
> a mixed workload the scan tasks would have only a limited effect on the
> cache, maintaining performance for the response time critical tasks. So
> it's an OLTP benefit because of cache protection and WAL-flush reduction
> during VACUUM.
> 
> As we've seen, the scan tasks look like they'll go faster with this.

But it would break the idea of letting a second seqscan follow in the
first's hot cache trail, no?

greetings, Florian Pflug





Re: Bug: Buffer cache is not scan resistant

From
Mark Kirkwood
Date:
Tom Lane wrote:
> 
> But what I wanted to see was the curve of
> elapsed time vs shared_buffers?
> 

Of course! (let's just write that off to me being pre-coffee...).

With the patch applied:

Shared Buffers  Elapsed  vmstat IO rate
--------------  -------  --------------
400MB           101 s    122 MB/s
2MB             101 s
1MB              97 s
768KB            94 s
512KB            84 s
256KB            79 s
128KB            75 s    166 MB/s

Looks *very* similar.

Cheers

Mark


Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself.  I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so).  Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can.  Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
  count
----------
 10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior.  (I'm
keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26368 13.7984 /home/tgl/testversion/bin/postgres
    12765  6.6799 /libata
     2238  1.1711 /lib64/libc-2.4.so
     1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26284 16.3070 /home/tgl/testversion/bin/postgres
    12004  7.4475 /libata
     2093  1.2985 /lib64/libc-2.4.so
      996  0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
57779    40.9591  copy_user_generic
18175    12.8841  __delay
3994      2.8313  _raw_spin_lock
2388      1.6928  put_page
2184      1.5482  mwait_idle
2083      1.4766  _raw_write_unlock
1909      1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
37437    33.0783  copy_user_generic
17891    15.8080  __delay
3372      2.9794  _raw_spin_lock
2218      1.9598  mwait_idle
2067      1.8263  _raw_write_unlock
1837      1.6231  _raw_write_lock
1531      1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache.  This strikes me as something that probably
could/should be fixed in the kernel; I don't see any good reason why
overwriting a whole cache line oughtn't be the same speed either way.
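
Purely as an illustration of the kind of user-space experiment that can show
this effect (it exercises memcpy, not the kernel's copy_user_generic, and is
no substitute for oprofile): copy 8KB "pages" either into one small
destination that stays resident in the processor cache, or into a 128MB pool
that does not.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define PAGE_SZ   8192
    #define NPAGES    (512 * 1024 * 1024 / PAGE_SZ)   /* 512MB worth of copies */
    #define BIG_BUFS  (128 * 1024 * 1024 / PAGE_SZ)   /* 128MB destination pool */

    /* Copy NPAGES pages from src into a rotating set of dst_slots destinations. */
    static double
    run(char *src, char *dst_pool, size_t dst_slots)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < NPAGES; i++)
            memcpy(dst_pool + (i % dst_slots) * PAGE_SZ, src, PAGE_SZ);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int
    main(void)
    {
        char   *src = malloc(PAGE_SZ);
        char   *pool = malloc((size_t) BIG_BUFS * PAGE_SZ);

        if (src == NULL || pool == NULL)
            return 1;
        memset(src, 'x', PAGE_SZ);
        memset(pool, 0, (size_t) BIG_BUFS * PAGE_SZ);

        printf("small (cache-resident) destination: %.3f s\n", run(src, pool, 2));
        printf("large (cache-busting) destination:  %.3f s\n",
               run(src, pool, BIG_BUFS));
        free(src);
        free(pool);
        return 0;
    }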

The other thing that was bothering me is why does the SELECT change
VACUUM's behavior?  A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist.  In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited.  But immediately
after postmaster start, (nearly) all the buffers are in the freelist and
so they all cycle through VACUUM's usage.  In any real-world situation,
of course, the freelist is going to be nearly empty most of the time and
so I don't think this part is worth changing.

So I now concede Luke's argument that this behavior is related to L2
cache usage.  But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM.  I'm very worried about
pessimizing other cases if we do.  ISTM there's a fairly clear case that
this might be fixable at the kernel level.  Moreover, the issue only
arises because of the way that the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway.  Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> I don't see any good reason why overwriting a whole cache line oughtn't be
> the same speed either way.

I can think of a couple theories, but I don't know if they're reasonable. The
one that comes to mind is the inter-processor cache coherency protocol. When
writing to a cache line the processor already owns maybe it can skip having to
check for other processors owning that cache line?

What happens if VACUUM comes across buffers that *are* already in the buffer
cache. Does it throw those on the freelist too? That seems like it would be
dangerous if they were in the buffer cache for a reason.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
<p><font size="2">Hi Tom,<br /><br /> Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote
it).<br/><br /><br /> - Luke<br /><br /> Msg is shrt cuz m on ma treo<br /><br />  -----Original Message-----<br />
From:  Tom Lane [<a href="mailto:tgl@sss.pgh.pa.us">mailto:tgl@sss.pgh.pa.us</a>]<br /> Sent:   Monday, March 05, 2007
07:43PM Eastern Standard Time<br /> To:     Mark Kirkwood<br /> Cc:     Pavan Deolasee; Gavin Sherry; Luke Lonergan;
PGSQLHackers; Doug Rady; Sherry Moore<br /> Subject:        Re: [HACKERS] Bug: Buffer cache is not scan resistant<br
/><br/> Mark Kirkwood <markir@paradise.net.nz> writes:<br /> > Tom Lane wrote:<br /> >> But what I
wantedto see was the curve of<br /> >> elapsed time vs shared_buffers?<br /> > ...<br /> > Looks *very*
similar.<br/><br /> Yup, thanks for checking.<br /><br /> I've been poking into this myself.  I find that I can
reproducethe<br /> behavior to some extent even with a slow disk drive (this machine is a<br /> dual 2.8GHz Xeon EM64T
runningFedora Core 5; the dd-to-dev-null test<br /> shows the disk read speed as 43MB/sec or so).  Test case is a<br />
several-gigtable, no indexes, fully vacuumed so that neither VACUUM nor<br /> COUNT(*) have to do anything but seqscan
asfast as they can.  Given a<br /> *freshly started* postmaster, I see<br /><br /> regression=# show shared_buffers;<br
/> shared_buffers<br /> ----------------<br />  128MB<br /> (1 row)<br /><br /> regression=# \timing<br /> Timing is
on.<br/> regression=# vacuum lineitem;<br /> VACUUM<br /> Time: 63652.333 ms<br /> regression=# vacuum lineitem;<br />
VACUUM<br/> Time: 63562.303 ms<br /> regression=# select count(*) from lineitem;<br />   count<br /> ----------<br />
 10240000<br/> (1 row)<br /><br /> Time: 63142.174 ms<br /> regression=# vacuum lineitem;<br /> VACUUM<br /> Time:
61638.421ms<br /> regression=# vacuum lineitem;<br /> VACUUM<br /> Time: 61785.905 ms<br /><br /> I didn't show it
here,but you can repeat the VACUUM all you want before<br /> the SELECT, and its times are stable; and you can repeat
allyou want<br /> after the SELECT, and the times are stable but a couple seconds lower.<br /> Restart the postmaster
andit goes back to the slower behavior.  (I'm<br /> keeping autovac off so it doesn't change the results.)<br /><br />
Idecided to get down and dirty with oprofile, and soon found that the<br /> user-space CPU consumption is
indistinguishablein both states:<br /><br /> CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)<br />
CountedGLOBAL_POWER_EVENTS events (time during which processor is not stopped)<br /> with a unit mask of 0x01
(mandatory)count 240000<br /> GLOBAL_POWER_E...|<br />   samples|      %|<br /> ------------------<br />    141065
73.8193/usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux<br />     26368 13.7984
/home/tgl/testversion/bin/postgres<br/>     12765  6.6799 /libata<br />      2238  1.1711 /lib64/libc-2.4.so<br />     
1112 0.5819 /dm_mod<br /><br /> CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)<br /> Counted
GLOBAL_POWER_EVENTSevents (time during which processor is not stopped)<br /> with a unit mask of 0x01 (mandatory) count
240000<br/> GLOBAL_POWER_E...|<br />   samples|      %|<br /> ------------------<br />    113177 70.2169
/usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux<br/>     26284 16.3070 /home/tgl/testversion/bin/postgres<br />
   12004  7.4475 /libata<br />      2093  1.2985 /lib64/libc-2.4.so<br />       996  0.6179 /dm_mod<br /><br /> Inside
thekernel, there's only one routine that's significantly different:<br /><br /> CPU: P4 / Xeon with 2 hyper-threads,
speed2793.08 MHz (estimated)<br /> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)<br
/>with a unit mask of 0x01 (mandatory) count 240000<br /> samples  %        symbol name<br /> 57779    40.9591 
copy_user_generic<br/> 18175    12.8841  __delay<br /> 3994      2.8313  _raw_spin_lock<br /> 2388      1.6928 
put_page<br/> 2184      1.5482  mwait_idle<br /> 2083      1.4766  _raw_write_unlock<br /> 1909      1.3533 
_raw_write_lock<br/><br /> CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)<br /> Counted
GLOBAL_POWER_EVENTSevents (time during which processor is not stopped)<br /> with a unit mask of 0x01 (mandatory) count
240000<br/> samples  %        symbol name<br /> 37437    33.0783  copy_user_generic<br /> 17891    15.8080  __delay<br
/>3372      2.9794  _raw_spin_lock<br /> 2218      1.9598  mwait_idle<br /> 2067      1.8263  _raw_write_unlock<br />
1837     1.6231  _raw_write_lock<br /> 1531      1.3527  put_page<br /><br /> So that's part of the mystery: apparently
copy_user_genericis coded in<br /> such a way that it's faster to copy into memory that's already in<br /> processor
cache. This strikes me as something that probably<br /> could/should be fixed in the kernel; I don't see any good
reasonwhy<br /> overwriting a whole cache line oughtn't be the same speed either way.<br /><br /> The other thing that
wasbothering me is why does the SELECT change<br /> VACUUM's behavior?  A debugging printout added to ReadBuffer gave
the<br/> answer: after postmaster start, we see things like<br /><br /> read block 353094 into buffer 11386<br /> read
block353095 into buffer 11387<br /> read block 353096 into buffer 11388<br /> read block 353097 into buffer 11389<br />
readblock 353098 into buffer 11390<br /> read block 353099 into buffer 11391<br /> read block 353100 into buffer
11392<br/> read block 353101 into buffer 11393<br /> read block 353102 into buffer 11394<br /> read block 353103 into
buffer11395<br /><br /> and after the SELECT it behaves like<br /><br /> read block 336761 into buffer 9403<br /> read
block336762 into buffer 9402<br /> read block 336763 into buffer 9403<br /> read block 336764 into buffer 9402<br />
readblock 336765 into buffer 9403<br /> read block 336766 into buffer 9402<br /> read block 336767 into buffer 9403<br
/>read block 336768 into buffer 9402<br /> read block 336769 into buffer 9403<br /> read block 336770 into buffer
9402<br/><br /> What's going on is that VACUUM puts each buffer it's finished with on<br /> the tail of the freelist. 
Inthe post-SELECT state, there are just two<br /> buffers cycling through the freelist (not sure why not only one, but
it<br/> doesn't matter) and so the cache footprint is limited.  But immediately<br /> after postmaster start, (nearly)
allthe buffers are in the freelist and<br /> so they all cycle through VACUUM's usage.  In any real-world situation,<br
/>of course, the freelist is going to be nearly empty most of the time and<br /> so I don't think this part is worth
changing.<br/><br /> So I now concede Luke's argument that this behavior is related to L2<br /> cache usage.  But the
nextquestion is whether we ought to change<br /> regular seqscan's behavior to mimic VACUUM.  I'm very worried about<br
/>pessimizing other cases if we do.  ISTM there's a fairly clear case that<br /> this might be fixable at the kernel
level. Moreover, the issue only<br /> arises because of the way that the copy-from-kernel-buffer-to-userspace<br />
routinebehaves, which means that if we go to a regime where we stop<br /> relying on OS caching and ask for direct DMA
intoour buffers, the<br /> advantage would disappear anyway.  Lastly, I think the case where a win<br /> is possible is
fairlynarrow --- as soon as you've got anything but the<br /> one seqscan going on, it's not going to help.<br /><br />
                       regards, tom lane<br /><br /></font> 

Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> What happens if VACUUM comes across buffers that *are* already in the buffer
> cache. Does it throw those on the freelist too?

Not unless they have usage_count 0, in which case they'd be subject to
recycling by the next clock sweep anyway.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Luke Lonergan" <LLonergan@greenplum.com> writes:
> Good info - it's the same in Solaris, the routine is uiomove (Sherry
> wrote it).

Cool.  Maybe Sherry can comment on the question whether it's possible
for a large-scale-memcpy to not take a hit on filling a cache line
that wasn't previously in cache?

I looked a bit at the Linux code that's being used here, but it's all
x86_64 assembler which is something I've never studied :-(.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Tom,

On 3/5/07 7:58 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler which is something I've never studied :-(.

Here's the C wrapper routine in Solaris:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/
move.c

Here's the x86 assembler routine for Solaris: 
http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32
/ml/copy.s

The actual uiomove routine is a simple wrapper that calls the assembler
kcopy or xcopyout routines.  There are two versions (for Opteron), one that
uses the NTA instructions that bypass the L2 cache on writing to avoid L2
cache pollution, and the second writes normally - through the L2 cache.
Which one is used depends on a parameter (global) based on the size of the
I/O. It is tuned to identify operations that might pollute the L2 cache
(sound familiar?) 

I think what we're seeing is a generic artifact of the write-through
behavior of the cache.  I wouldn't expect this to get any better with
DIRECTIO to the shared_buffers in pgsql - if we iterate over a large number
of user space buffers we'll still hit the increased L2 thrashing.

I think we're best off with a hybrid approach - when we "detect" a seq scan
larger (much larger?) than buffer cache, we can switch into the "cache
bypass" behavior, much like the above code uses the NTA instruction when
appropriate.
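
Sketched as code, the detection step might be no more than a size comparison
at scan start (the threshold and names below are purely illustrative, not a
settled design):

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Hypothetical detection step for the hybrid approach: decide at scan
     * start whether to use the normal buffer strategy or a small cache-bypass
     * ring, based on table size relative to the shared buffer pool.  The
     * factor of 4 is an arbitrary placeholder for "much larger".
     */
    static bool
    use_bypass_strategy(unsigned long table_nblocks,
                        unsigned long shared_buffer_nblocks)
    {
        const unsigned long factor = 4;

        return table_nblocks > factor * shared_buffer_nblocks;
    }

    int
    main(void)
    {
        /* e.g. a 16GB table (8KB blocks) against 16MB of shared_buffers */
        printf("bypass? %d\n",
               use_bypass_strategy(16UL * 1024 * 1024 * 1024 / 8192,
                                   16UL * 1024 * 1024 / 8192));
        return 0;
    }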

We can handle syncscan using a small buffer space.

- Luke




Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:
> Here's the x86 assembler routine for Solaris:
> http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32
> /ml/copy.s
> The actual uiomove routine is a simple wrapper that calls the assembler
> kcopy or xcopyout routines.  There are two versions (for Opteron), one that
> uses the NTA instructions that bypass the L2 cache on writing to avoid L2
> cache pollution, and the second writes normally - through the L2 cache.

Hm.  If it were bypassing the L2 cache then the hit would be taken when
PG later tries to read the data.  This is clearly not what's happening
in the Linux system, but I've not seen any oprofile-equivalent data for
Solaris?
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Jim Nasby
Date:
On Mar 5, 2007, at 11:46 AM, Josh Berkus wrote:
> Tom,
>
>> I seem to recall that we've previously discussed the idea of  
>> letting the
>> clock sweep decrement the usage_count before testing for 0, so that a
>> buffer could be reused on the first sweep after it was initially  
>> used,
>> but that we rejected it as being a bad idea.  But at least with large
>> shared_buffers it doesn't sound like such a bad idea.
>
> We did discuss an number of formulas for setting buffers with  
> different
> clock-sweep numbers, including ones with higher usage_count for  
> indexes and
> starting numbers of 0 for large seq scans as well as vacuums.   
> However, we
> didn't have any way to prove that any of these complex algorithms  
> would
> result in higher performance, so went with the simplest formula,  
> with the
> idea of tinkering with it when we had more data.  So maybe now's  
> the time.
>
> Note, though, that the current algorithm is working very, very well  
> for OLTP
> benchmarks, so we'd want to be careful not to gain performance in  
> one area at
> the expense of another.  In TPCE testing, we've been able to increase
> shared_buffers to 10GB with beneficial performance effect (numbers  
> posted
> when I have them) and even found that "taking over RAM" with the
> shared_buffers (ala Oracle) gave us equivalent performance to using  
> the FS
> cache.  (yes, this means with a little I/O management engineering  
> we could
> contemplate discarding use of the FS cache for a net performance  
> gain.  Maybe
> for 8.4)

An idea I've been thinking about would be to have the bgwriter or  
some other background process actually try and keep the free list  
populated, so that backends needing to grab a page would be much more  
likely to find one there (and not have to wait to scan through the  
entire buffer pool, perhaps multiple times).

My thought is to keep track of how many page requests occurred during  
a given interval, and use that value (probably averaged over time) to  
determine how many pages we'd like to see on the free list. The  
background process would then run through the buffers decrementing  
usage counts until it found enough for the free list. Before putting  
a buffer on the 'free list', it would write the buffer out; I'm not  
sure if it would make sense to de-associate the buffer with whatever  
it had been storing or not, though. If we don't do that, that would  
mean that we could pull pages back off the free list if we wanted to.  
That would be helpful if the background process got a bit over-zealous.
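
As a rough sketch of that loop (all structures and names invented here for
illustration; nothing like the real bgwriter code), see below.  It assumes
the target free-list length never exceeds the pool size:

    #include <stdio.h>

    /*
     * Hypothetical background free-list filler: sweep buffer headers,
     * decrement usage counts, write out a victim if it is dirty, and append
     * it to the free list until the list reaches a target length derived
     * from recent demand.
     */
    typedef struct SketchBuffer
    {
        int     usage_count;
        int     dirty;
        int     on_freelist;
    } SketchBuffer;

    static void
    flush_buffer(SketchBuffer *b)
    {
        b->dirty = 0;           /* stand-in for writing the page out */
    }

    static int
    refill_freelist(SketchBuffer *pool, int npool, int *sweep,
                    int *freelist, int nfree, int target)
    {
        while (nfree < target)
        {
            SketchBuffer *b = &pool[*sweep];

            if (b->on_freelist)
                ;                       /* already handed over, skip it */
            else if (b->usage_count > 0)
                b->usage_count--;       /* the same decrement the clock sweep does */
            else
            {
                if (b->dirty)
                    flush_buffer(b);    /* clean it before offering it */
                b->on_freelist = 1;
                freelist[nfree++] = *sweep;     /* now cheap for a backend to grab */
            }
            *sweep = (*sweep + 1) % npool;
        }
        return nfree;
    }

    int
    main(void)
    {
        SketchBuffer pool[8] = {
            {2, 1, 0}, {0, 0, 0}, {1, 1, 0}, {0, 1, 0},
            {3, 0, 0}, {0, 0, 0}, {1, 0, 0}, {2, 1, 0}
        };
        int     freelist[8];
        int     sweep = 0;
        int     n = refill_freelist(pool, 8, &sweep, freelist, 0, 4);

        printf("free list now holds %d buffers\n", n);
        return 0;
    }
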
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Bug: Buffer cache is not scan resistant

From
Jim Nasby
Date:
On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a  
> variable like that at all, but scan the buffer cache for pages  
> belonging to the table you're scanning to initialize the scan.  
> Scanning all the BufferDescs is a fairly CPU and lock heavy  
> operation, but it might be ok given that we're talking about large  
> I/O bound sequential scans. It would require no DBA tuning and  
> would work more robustly in varying conditions. I'm not sure where  
> you would continue after scanning the in-cache pages. At the  
> highest in-cache block number, perhaps.

If there was some way to do that, it'd be what I'd vote for.

Given the partitioning of the buffer lock that Tom did it might not  
be that horrible for many cases, either, since you'd only need to  
scan through one partition.

We also don't need an exact count, either. Perhaps there's some way  
we could keep a counter or something...
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Jim Nasby <decibel@decibel.org> writes:
> An idea I've been thinking about would be to have the bgwriter or  
> some other background process actually try and keep the free list  
> populated,

The bgwriter already tries to keep pages "just in front" of the clock
sweep pointer clean.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote:
> Simon Riggs wrote:

> But it would break the idea of letting a second seqscan follow in the
> first's hot cache trail, no?

No, but it would make it somewhat harder to achieve without direct
synchronization between scans. It could still work; let's see.

I'm not sure that's an argument against fixing the problem with the
buffer strategy though. We really want both, not just one or the other.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
Sherry Moore
Date:
Hi Tom,

Sorry about the delay.  I have been away from computers all day.

In the current Solaris release in development (Code name Nevada,
available for download at http://opensolaris.org), I have implemented
non-temporal access (NTA) which bypasses L2 for most writes, and reads
larger than copyout_max_cached (patchable, default to 128K).  The block
size used by Postgres is 8KB.  If I patch copyout_max_cached to 4KB to
trigger NTA for reads, the access time with 16KB buffer or 128MB buffer
are very close.
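
For anyone curious what "bypassing L2 on the write side" looks like from user
space, here is a hedged sketch using SSE2 non-temporal stores.  It only
illustrates the mechanism; the Solaris uiomove/kcopy path is kernel assembler
and chooses between modes at run time.

    #include <emmintrin.h>      /* SSE2: _mm_stream_si128, _mm_load_si128 */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Illustrative non-temporal copy: the streamed stores go around the cache
     * hierarchy rather than filling L2 with destination lines.  Assumes
     * 16-byte-aligned buffers and a length that is a multiple of 64 bytes.
     */
    static void
    copy_nontemporal(void *dst, const void *src, size_t len)
    {
        __m128i        *d = (__m128i *) dst;
        const __m128i  *s = (const __m128i *) src;

        for (size_t i = 0; i < len / 16; i += 4)
        {
            _mm_stream_si128(&d[i],     _mm_load_si128(&s[i]));
            _mm_stream_si128(&d[i + 1], _mm_load_si128(&s[i + 1]));
            _mm_stream_si128(&d[i + 2], _mm_load_si128(&s[i + 2]));
            _mm_stream_si128(&d[i + 3], _mm_load_si128(&s[i + 3]));
        }
        _mm_sfence();           /* make the streamed stores globally visible */
    }

    int
    main(void)
    {
        size_t  len = 8192;     /* one PostgreSQL-sized block */
        char   *src = aligned_alloc(16, len);
        char   *dst = aligned_alloc(16, len);

        if (src == NULL || dst == NULL)
            return 1;
        memset(src, 'x', len);
        copy_nontemporal(dst, src, len);
        printf("copied %zu bytes, first byte = %c\n", len, dst[0]);
        free(src);
        free(dst);
        return 0;
    }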

I wrote readtest to simulate the access pattern of VACUUM (attached).
tread is a 4-socket dual-core Opteron box.

<81 tread >./readtest -h
Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
        -v:             Verbose mode
        -N:             Normalize results by number of reads
        -s <size>:      Working set size (may specify K,M,G suffix)
        -n iter:        Number of test iterations
        -f filename:    Name of the file to read from
        -d [+|-]delta:  Distance between subsequent reads
        -c count:       Number of reads
        -h:             Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

<82 tread >./readtest -s 16k -f boot_archive
46445262
<83 tread >./readtest -s 128M -f boot_archive
118294230
<84 tread >./readtest -s 16k -f boot_archive -n 100
4230210856
<85 tread >./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

<89 tread >./readtest -s 16k -f boot_archive
43606882
<90 tread >./readtest -s 128M -f boot_archive
100547909
<91 tread >./readtest -s 16k -f boot_archive -n 100
4251823995
<92 tread >./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration is 1 (the default), the timing difference between
using 16k buffer and 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses.  When
the iteration count is bigger, most of the page tables would be in Page
Descriptor Cache for the later page accesses so the overhead of TLB
misses become smaller.  As you can see, when we do bypass L2, the
performance with either buffer size is comparable.

I am sure your next question is why the 128K limitation for reads.
Here are the main reasons:

    - Based on a lot of the benchmarks and workloads I traced, the
      target buffer of read operations are typically accessed again
      shortly after the read, while writes are usually not.  Therefore,
      the default operation mode is to bypass L2 for writes, but not
      for reads.

    - The Opteron's L1 cache size is 64K.  If reads are larger than
      128KB, it would have displacement flushed itself anyway, so for
      large reads, I will also bypass L2. I am working on dynamically
      setting copyout_max_cached based on the L1 D-cache size on the
      system.

The above heuristic should have worked well in Luke's test case.
However, because the reads were done as 16,000 8K reads
rather than one 128MB read, the NTA code was not triggered.

Since the OS code has to be general enough to handle most
workloads, we have to pick some defaults that might not work best for
some specific operations.  It is a calculated balance.

Thanks,
Sherry


On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> "Luke Lonergan" <LLonergan@greenplum.com> writes:
> > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > wrote it).
>
> Cool.  Maybe Sherry can comment on the question whether it's possible
> for a large-scale-memcpy to not take a hit on filling a cache line
> that wasn't previously in cache?
>
> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler which is something I've never studied :-(.
>
>             regards, tom lane

--
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym

Attachment

Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > Another approach I proposed back in December is to not have a  
> > variable like that at all, but scan the buffer cache for pages  
> > belonging to the table you're scanning to initialize the scan.  
> > Scanning all the BufferDescs is a fairly CPU and lock heavy  
> > operation, but it might be ok given that we're talking about large  
> > I/O bound sequential scans. It would require no DBA tuning and  
> > would work more robustly in varying conditions. I'm not sure where  
> > you would continue after scanning the in-cache pages. At the  
> > highest in-cache block number, perhaps.
> 
> If there was some way to do that, it'd be what I'd vote for.
> 

I still don't know how to make this take advantage of the OS buffer
cache. 

However, no DBA tuning is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go. Can you guys give me some pointers about
how to walk through the shared buffers, reading the pages that I need,
while being sure not to read a page that's been evicted, and also not
potentially causing a performance regression somewhere else?

> Given the partitioning of the buffer lock that Tom did it might not  
> be that horrible for many cases, either, since you'd only need to  
> scan through one partition.
> 
> We also don't need an exact count, either. Perhaps there's some way  
> we could keep a counter or something...

Exact count of what? The pages already in cache?

Regards,
Jeff Davis



Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go.

I think that's a good way to guarantee that you'll not finish in time
for 8.3.  Heikki's idea is just at the handwaving stage at this point,
and I'm not even convinced that it will offer any win.  (Pages in
cache will be picked up by a seqscan already.)
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > If I were to implement this idea, I think Heikki's bitmap of pages
> > already read is the way to go.
> 
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3.  Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win.  (Pages in
> cache will be picked up by a seqscan already.)
> 

I agree that it's a good idea to stick with the current implementation,
which is, as far as I can see, meeting all of my performance goals.

Regards,
Jeff Davis





Re: Bug: Buffer cache is not scan resistant

From
Heikki Linnakangas
Date:
Jeff Davis wrote:
> On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
>> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
>>> Another approach I proposed back in December is to not have a
>>> variable like that at all, but scan the buffer cache for pages
>>> belonging to the table you're scanning to initialize the scan.
>>> Scanning all the BufferDescs is a fairly CPU and lock heavy
>>> operation, but it might be ok given that we're talking about large
>>> I/O bound sequential scans. It would require no DBA tuning and
>>> would work more robustly in varying conditions. I'm not sure where
>>> you would continue after scanning the in-cache pages. At the
>>> highest in-cache block number, perhaps.
>> If there was some way to do that, it'd be what I'd vote for.
>>
>
> I still don't know how to make this take advantage of the OS buffer
> cache.

Yep, I don't see any way to do that. I think we could live with that,
though. If we went with the sync_scan_offset approach, you'd have to
leave a lot of safety margin in that as well.

> However, no DBA tuning is a huge advantage, I agree with that.
>
> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go. Can you guys give me some pointers about
> how to walk through the shared buffers, reading the pages that I need,
> while being sure not to read a page that's been evicted, and also not
> potentially causing a performance regression somewhere else?

You could take a look at BufferSync, for example. It walks through the
buffer cache, syncing all dirty buffers.

FWIW, I've attached a function I wrote some time ago when I was playing
with the same idea for vacuums. A call to the new function loops through
the buffer cache and returns the next buffer that belongs to a certain
relation. I'm not sure that it's correct and safe, and there aren't many
comments, but it should work if you want to play with it...

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.214
diff -c -r1.214 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    5 Jan 2007 22:19:37 -0000    1.214
--- src/backend/storage/buffer/bufmgr.c    22 Jan 2007 16:38:37 -0000
***************
*** 97,102 ****
--- 97,134 ----
  static void AtProcExit_Buffers(int code, Datum arg);


+ Buffer
+ ReadAnyBufferForRelation(Relation reln)
+ {
+     static int last_buf_id = 0;
+     int new_buf_id;
+     volatile BufferDesc *bufHdr;
+
+     /* Make sure we will have room to remember the buffer pin */
+     ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+     new_buf_id = last_buf_id;
+     do
+     {
+         if (++new_buf_id >= NBuffers)
+             new_buf_id = 0;
+
+         bufHdr = &BufferDescriptors[new_buf_id];
+         LockBufHdr(bufHdr);
+
+         if ((bufHdr->flags & BM_VALID) && RelFileNodeEquals(bufHdr->tag.rnode, reln->rd_node))
+         {
+             PinBuffer_Locked(bufHdr);
+             last_buf_id = new_buf_id;
+             return BufferDescriptorGetBuffer(bufHdr);
+         }
+         UnlockBufHdr(bufHdr);
+     } while(new_buf_id != last_buf_id);
+     last_buf_id = new_buf_id;
+     return InvalidBuffer;
+ }
+
+
  /*
   * ReadBuffer -- returns a buffer containing the requested
   *        block of the requested relation.  If the blknum

Re: Bug: Buffer cache is not scan resistant

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> If I were to implement this idea, I think Heikki's bitmap of pages
>> already read is the way to go.
> 
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3.  Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win.  (Pages in
> cache will be picked up by a seqscan already.)

The scenario that I'm worried about is that you have a table that's 
slightly larger than RAM. If you issue many seqscans on that table, one 
at a time, every seqscan will have to read the whole table from disk, 
even though say 90% of it is in cache when the scan starts.

This can be alleviated by using a large enough sync_scan_offset, but a 
single setting like that is tricky to tune, especially if your workload 
is not completely constant. Tune it too low, and you don't get much 
benefit, tune it too high and your scans diverge and you lose all benefit.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote:

>     - Based on a lot of the benchmarks and workloads I traced, the
>       target buffer of read operations are typically accessed again
>       shortly after the read, while writes are usually not.  Therefore,
>       the default operation mode is to bypass L2 for writes, but not
>       for reads.

Hi Sherry,

I'm trying to relate what you've said to how we should proceed from
here. My understanding of what you've said is:

- Tom's assessment that the observed performance quirk could be fixed in
the OS kernel is correct and you have the numbers to prove it

- currently Solaris only does NTA for 128K reads, which we don't
currently do. If we were to request 16 blocks at a time, we would get this
benefit on Solaris, at least. The copyout_max_cached parameter can be
patched, but isn't a normal system tunable.

- other workloads you've traced *do* reuse the same buffer again very
soon afterwards when reading sequentially (not writes). Reducing the
working set size is an effective technique in improving performance if
we don't have a kernel that does NTA or we don't read in big enough
chunks (we need both to get NTA to kick in).

and what you haven't said

- all of this is orthogonal to the issue of buffer cache spoiling in
PostgreSQL itself. That issue does still exist as a non-OS issue, but
we've been discussing in detail the specific case of L2 cache effects
with specific kernel calls. All of the test results have been
stand-alone, so we've not done any measurements in that area. I say this
because you make the point that reducing the working set size of write
workloads has no effect on the L2 cache issue, but ISTM it's still
potentially a cache spoiling issue.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Tue, 2007-03-06 at 18:47 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> >> If I were to implement this idea, I think Heikki's bitmap of pages
> >> already read is the way to go.
> > 
> > I think that's a good way to guarantee that you'll not finish in time
> > for 8.3.  Heikki's idea is just at the handwaving stage at this point,
> > and I'm not even convinced that it will offer any win.  (Pages in
> > cache will be picked up by a seqscan already.)
> 
> The scenario that I'm worried about is that you have a table that's 
> slightly larger than RAM. If you issue many seqscans on that table, one 
> at a time, every seqscan will have to read the whole table from disk, 
> even though say 90% of it is in cache when the scan starts.
> 

If you're issuing sequential scans one at a time, that 90% of the table
that was cached is probably not cached any more, unless the scans are
close together in time without overlapping (serial sequential scans).
And the problem you describe is no worse than current behavior, where
you have exactly the same problem.

> This can be alleviated by using a large enough sync_scan_offset, but a 
> single setting like that is tricky to tune, especially if your workload 
> is not completely constant. Tune it too low, and you don't get much 
> benefit, tune it too high and your scans diverge and you lose all benefit.
> 

I see why you don't want to manually tune this setting; however, it's
really not that tricky. You can be quite conservative and still use a
good fraction of your physical memory. I will come up with some numbers
and see how much we have to gain.

Regards,
Jeff Davis



Re: Bug: Buffer cache is not scan resistant

From
Jim Nasby
Date:
On Mar 6, 2007, at 12:17 AM, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
>> An idea I've been thinking about would be to have the bgwriter or
>> some other background process actually try and keep the free list
>> populated,
>
> The bgwriter already tries to keep pages "just in front" of the clock
> sweep pointer clean.

True, but that still means that each backend has to run the
clock-sweep. AFAICT that's something that backends will serialize on (due
to BufFreelistLock), so it would be best to make StrategyGetBuffer as  
fast as possible. It certainly seems like grabbing a buffer off the  
free list is going to be a lot faster than running the clock sweep.  
That's why I think it'd be better to have the bgwriter run the clock  
sweep and put enough buffers on the free list to try and keep up with  
demand.
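
As a toy sketch of the two paths being compared (invented names, none of the
real locking or pin/dirty handling), a free-list pop is O(1) while the sweep
may visit many buffer headers before finding a victim:

#define NBUFFERS 1024

typedef struct
{
    int usage_count;    /* decremented by the sweep, bumped on access */
    int pinned;         /* in use by some backend */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int     free_list[NBUFFERS];
static int     free_count = 0;      /* filled by a background process */
static int     clock_hand = 0;

static int
get_victim_buffer(void)
{
    /* Fast path: a background writer kept the free list populated. */
    if (free_count > 0)
        return free_list[--free_count];

    /* Slow path: run the clock sweep ourselves.  (A real implementation
     * must bail out if every buffer is pinned.) */
    for (;;)
    {
        BufDesc *buf = &buffers[clock_hand];
        int      victim = clock_hand;

        clock_hand = (clock_hand + 1) % NBUFFERS;
        if (!buf->pinned && buf->usage_count == 0)
            return victim;
        if (buf->usage_count > 0)
            buf->usage_count--;
    }
}

In PostgreSQL both paths currently run under BufFreelistLock, so shortening
the time spent there is what keeping the free list populated would buy.
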
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Bug: Buffer cache is not scan resistant

From
Jim Nasby
Date:
On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
>> We also don't need an exact count, either. Perhaps there's some way
>> we could keep a counter or something...
>
> Exact count of what? The pages already in cache?

Yes. The idea being if you see there's 10k pages in cache, you can  
likely start 9k pages behind the current scan point and still pick  
everything up.

But this is nowhere near as useful as the bitmap idea, so I'd only
look at it if it's impossible to make the bitmaps work. And like
others have said, that should wait until there's at least a
first-generation patch that's going to make it into 8.3.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)




Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote:
> On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
> >> We also don't need an exact count, either. Perhaps there's some way
> >> we could keep a counter or something...
> >
> > Exact count of what? The pages already in cache?
> 
> Yes. The idea being if you see there's 10k pages in cache, you can  
> likely start 9k pages behind the current scan point and still pick  
> everything up.
> 
> But this is nowhere near as useful as the bitmap idea, so I'd only  
> look at it if it's impossible to make the bitmaps work. And like  
> others have said, that should wait until there's at least a first- 
> generation patch that's going to make it into 8.3.

You still haven't told me how we take advantage of the OS buffer cache
with the bitmap idea. What makes you think that my current
implementation is "nowhere near as useful as the bitmap idea"?

My current implementation is making use of OS buffers + shared memory;
the bitmap idea can only make use of shared memory, and is likely
throwing the OS buffers away completely.

I also suspect that the bitmap idea relies too much on the idea that
there's a contiguous cache trail in the shared buffers alone. Any
deviation from that -- which could be caused by PG's page replacement
algorithm, especially in combination with a varied load pattern -- would
negate any benefit from the bitmap idea. I feel much more confident that
there will exist a trail of pages that are cached in *either* the PG
shared buffers *or* the OS buffer cache. There may be holes/gaps in
either one, but it's much more likely that they combine into a
contiguous series of cached pages. Do you have an idea how I might test
this claim?

Regards,
Jeff Davis




Re: Bug: Buffer cache is not scan resistant

From
Jeff Davis
Date:
On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> Jeff Davis wrote:
> > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> >>> Another approach I proposed back in December is to not have a  
> >>> variable like that at all, but scan the buffer cache for pages  
> >>> belonging to the table you're scanning to initialize the scan.  
> >>> Scanning all the BufferDescs is a fairly CPU and lock heavy  
> >>> operation, but it might be ok given that we're talking about large  
> >>> I/O bound sequential scans. It would require no DBA tuning and  
> >>> would work more robustly in varying conditions. I'm not sure where  
> >>> you would continue after scanning the in-cache pages. At the  
> >>> highest in-cache block number, perhaps.
> >> If there was some way to do that, it'd be what I'd vote for.
> >>
> > 
> > I still don't know how to make this take advantage of the OS buffer
> > cache. 
> 
> Yep, I don't see any way to do that. I think we could live with that, 
> though. If we went with the sync_scan_offset approach, you'd have to 
> leave a lot of safety margin in that as well.
> 

Right, there would certainly have to be a safety margin with
sync_scan_offset. However, your plan only works when the shared buffers
are dominated by this sequential scan. Let's say you have 40% of
physical memory for shared buffers, and say that 50% are being used for
hot pages in other parts of the database. That means you have access to
only 20% of physical memory to optimize for this sequential scan, and
20% of the physical memory is basically unavailable (being used for
other parts of the database).

In my current implementation, you could set sync_scan_offset to 1.0
(meaning 1.0 x shared_buffers), giving you 40% of physical memory that
would be used for starting this sequential scan. In this case, that
should be a good margin of error, considering that as much as 80% of the
physical memory might actually be in cache (OS or PG cache).

This all needs to be backed up by testing, of course. I'm just
extrapolating some numbers that look vaguely reasonable to me.

Regards,
Jeff Davis



Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally
and in various patterns and did not see the substantial gain as with
reducing the working set size.

My conclusion: Fixing the OS is not sufficient to alleviate the issue.
We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates
due to this effect.

- Luke

Msg is shrt cuz m on ma treo

 -----Original Message-----
From:   Sherry Moore [mailto:sherry.moore@sun.com]
Sent:   Tuesday, March 06, 2007 10:05 PM Eastern Standard Time
To:     Simon Riggs
Cc:     Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady
Subject:        Re: [HACKERS] Bug: Buffer cache is not scan resistant

Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM its still
> potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),

    - My test was simply to demonstrate that the observed performance
      difference with VACUUM was caused by whether the size of the
      user buffer caused L2 thrashing.

    - In general, application should reduce the size of the working set
      to reduce the penalty of TLB misses and cache misses.

    - If the application access pattern meets the NTA trigger condition,
      the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development        http://blogs.sun.com/sherrym

Re: Bug: Buffer cache is not scan resistant

From
Hannu Krosing
Date:
On one fine day, Tue, 2007-03-06 at 18:28, Jeff Davis wrote:
> On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> > Jeff Davis wrote:
> > > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > >>> Another approach I proposed back in December is to not have a  
> > >>> variable like that at all, but scan the buffer cache for pages  
> > >>> belonging to the table you're scanning to initialize the scan.  
> > >>> Scanning all the BufferDescs is a fairly CPU and lock heavy  
> > >>> operation, but it might be ok given that we're talking about large  
> > >>> I/O bound sequential scans. It would require no DBA tuning and  
> > >>> would work more robustly in varying conditions. I'm not sure where  
> > >>> you would continue after scanning the in-cache pages. At the  
> > >>> highest in-cache block number, perhaps.
> > >> If there was some way to do that, it'd be what I'd vote for.
> > >>
> > > 
> > > I still don't know how to make this take advantage of the OS buffer
> > > cache. 

Maybe it should not ?

Mostly the OS cache is of use only if it is much bigger than the
shared buffer cache. It may make sense to forget about the OS cache and
just tell those who can make use of sync scans to set most of memory
aside for shared buffers.

Then we can do better predictions/lookups of how much of a table is
actually in memory.

Dual caching is usually not very beneficial anyway, not to mention the
difficulty of predicting any dual-caching effects.

> > Yep, I don't see any way to do that. I think we could live with that, 
> > though. If we went with the sync_scan_offset approach, you'd have to 
> > leave a lot of safety margin in that as well.
> > 
> 
> Right, there would certainly have to be a safety margin with
> sync_scan_offset. However, your plan only works when the shared buffers
> are dominated by this sequential scan. Let's say you have 40% of
> physical memory for shared buffers, and say that 50% are being used for
> hot pages in other parts of the database. That means you have access to
> only 20% of physical memory to optimize for this sequential scan, and
> 20% of the physical memory is basically unavailable (being used for
> other parts of the database).

The simplest thing, in case the table is much bigger than the buffer
cache usable for it, is to start the second scan at the point the first
scan is traversing *now*, and hope that the scans will stay together. Or
start at some fixed lag, which makes the first scan always the one
issuing reads while the second just free-rides on buffers already in
cache. It may even be a good idea to throttle the second scan to stay N
pages behind if the OS readahead gets confused when the same file is
read from multiple processes.

If the table is smaller than the cache, then just scan it without
syncing.

Trying to read buffers in the same order, starting from near the point
where pages are still in the shared buffer cache, seems good mostly for
the case where the table is as big as or just a little larger than the
cache.

> In my current implementation, you could set sync_scan_offset to 1.0
> (meaning 1.0 x shared_buffers), giving you 40% of physical memory that
> would be used for starting this sequential scan. In this case, that
> should be a good margin of error, considering that as much as 80% of the
> physical memory might actually be in cache (OS or PG cache).
> 
> This all needs to be backed up by testing, of course. I'm just
> extrapolating some numbers that look vaguely reasonable to me.

If there is an easy way to tell PG "give me this page only if it is in
shared cache already", then a good approach might be to start the 2nd scan
at the point where the 1st is now, and move in both directions
simultaneously, like this:

First scan is at page N.

Second scan:

M=N-1
WHILE NOT ALL PAGES ARE READ:
    IF PAGE N IS IN CACHE:                  -- FOLLOW FIRST READER
        READ PAGE N
        N++
    ELSE IF M >= 0 AND PAGE M IS IN CACHE:  -- READ OLDER CACHED PAGES
        READ PAGE M
        M--
    ELSE IF FIRST READER STILL GOING:       -- NO OLDER PAGES, WAIT FOR 1st
        WAIT FOR PAGE N TO BECOME AVAILABLE
        READ PAGE N
        N++
    ELSE:                                   -- BECOME 1st READER
        READ PAGE N
        N++
    PROCESS PAGE
    IF N > PAGES_IN_TABLE: N = 0
    IF M < 0: M = PAGES_IN_TABLE


This should work reasonably well for LRU caches and it may be made to
work with clock sweep scheme if the sweep arranges pages to purge in
file order.

If we could make the IF PAGE x IS IN CACHE part also know about the OS
cache, this could also make use of the OS cache.


Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
systems ?

-- 
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com




Re: Bug: Buffer cache is not scan resistant

From
"Marko Kreen"
Date:
On 3/7/07, Hannu Krosing <hannu@skype.net> wrote:
> Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
> systems ?

Supposedly you could mmap() a file and then do mincore() on the
area to see which pages are cached.
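
A minimal sketch of that approach (Linux-style mincore() prototype assumed;
real code would have to walk the relation's segment files):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
    long            pagesize = sysconf(_SC_PAGESIZE);
    struct stat     st;
    size_t          npages, resident = 0, i;
    unsigned char  *vec;
    void           *map;
    int             fd;

    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
    {
        perror("open/fstat");
        return 1;
    }
    npages = (st.st_size + pagesize - 1) / pagesize;
    map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    vec = malloc(npages);
    if (map == MAP_FAILED || vec == NULL || mincore(map, st.st_size, vec) != 0)
    {
        perror("mmap/mincore");
        return 1;
    }
    for (i = 0; i < npages; i++)
        if (vec[i] & 1)                 /* low bit set => page resident */
            resident++;
    printf("%zu of %zu pages resident in OS cache\n", resident, npages);
    return 0;
}

That gives the OS-side half of the picture; the PostgreSQL-side half would
come from checking the shared buffer hash table.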

But you were talking about postgres cache before, there it should
be easily implementable.

-- 
marko


Re: Bug: Buffer cache is not scan resistant

From
Sherry Moore
Date:
Hi Simon,

> and what you haven't said
> 
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM its still
> potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting),
   - My test was simply to demonstrate that the observed performance
     difference with VACUUM was caused by whether the size of the
     user buffer caused L2 thrashing.

   - In general, application should reduce the size of the working set
     to reduce the penalty of TLB misses and cache misses.

   - If the application access pattern meets the NTA trigger condition,
     the benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you
guys have been discussing, but I haven't read all the email exchange
on the subject.

Thanks,
Sherry
-- 
Sherry Moore, Solaris Kernel Development    http://blogs.sun.com/sherrym
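
As a user-space illustration of the write-side L2 bypass Sherry describes
(the principle only, not what the Solaris copyout path actually does; it
assumes x86 SSE2, 16-byte-aligned pointers and a length that is a multiple
of 16):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Copy using non-temporal (streaming) stores, so the destination is
 * written around the cache instead of displacing useful L2 lines. */
static void
copy_nontemporal(void *dst, const void *src, size_t len)
{
    const __m128i *s = (const __m128i *) src;
    __m128i       *d = (__m128i *) dst;
    size_t         i;

    for (i = 0; i < len / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));

    _mm_sfence();               /* order the streaming stores */
}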


Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:
> Incidentally, we tried triggering NTA (L2 cache bypass)
> unconditionally and in various patterns and did not see the
> substantial gain as with reducing the working set size.
> 
> My conclusion: Fixing the OS is not sufficient to alleviate the issue.
> We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data
> rates due to this effect.
> 
I've implemented buffer recycling, as previously described, patch being
posted now to -patches as "scan_recycle_buffers".

This version includes buffer recycling

- for SeqScans larger than shared buffers, with the objective of
improving L2 cache efficiency *and* reducing the effects of shared
buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing
whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
originally suggested by Itagaki-san, just with a different
implementation).

Behaviour is not activated by default in this patch. To request buffer
recycling, set the USERSET GUC
        SET scan_recycle_buffers = N
tested with 1,4,8,16, but only > 8 seems sensible, IMHO.

Patch affects StrategyGetBuffer, so only affects the disk->cache path.
The idea is that if it's already in shared buffer cache then we get
substantial benefit already and nothing else is needed. So for the
general case, the patch adds a single if test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM
(currently). Any change mid-scan will be ignored.
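
To make the shape of the mechanism concrete, here is a rough sketch of a
recycle ring (invented names; this is not the actual scan_recycle_buffers
patch):

#define SCAN_RECYCLE_BUFFERS 16

/* Stand-in for the normal clock-sweep allocation path. */
static int next_free = 0;
static int get_free_buffer(void) { return next_free++; }

typedef struct
{
    int nbuffers;       /* 0 => recycling disabled, behave as today */
    int nallocated;
    int current;
    int buf_id[SCAN_RECYCLE_BUFFERS];
} ScanRing;

static int
ring_get_buffer(ScanRing *ring)
{
    if (ring->nbuffers == 0)
        return get_free_buffer();       /* unchanged I/O path */

    if (ring->nallocated < ring->nbuffers)
    {
        /* First pass: fill the ring from the normal allocator. */
        ring->buf_id[ring->nallocated] = get_free_buffer();
        return ring->buf_id[ring->nallocated++];
    }

    /* Steady state: recycle our own buffers round-robin, so a large
     * scan touches only nbuffers buffers instead of the whole pool. */
    ring->current = (ring->current + 1) % ring->nbuffers;
    return ring->buf_id[ring->current];
}

A real version also has to cope with a recycled buffer being dirty, which is
exactly where the WAL flush percentages in Itagaki-san's tests come from.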

IMHO it's possible to do this and to allow Synch Scans at the same time,
with some thought. There is no need for us to rely on cache spoiling
behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this
objectively.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Cool!

- Luke

Msg is shrt cuz m on ma treo

 -----Original Message-----
From:   Simon Riggs [mailto:simon@2ndquadrant.com]
Sent:   Friday, March 09, 2007 02:32 PM Eastern Standard Time
To:     Luke Lonergan; ITAGAKI Takahiro
Cc:     Sherry Moore; Tom Lane; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady
Subject:        Re: [HACKERS] Bug: Buffer cache is not scan resistant

On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:
> Incidentally, we tried triggering NTA (L2 cache bypass)
> unconditionally and in various patterns and did not see the
> substantial gain as with reducing the working set size.
>
> My conclusion: Fixing the OS is not sufficient to alleviate the issue.
> We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data
> rates due to this effect.
>
I've implemented buffer recycling, as previously described, patch being
posted now to -patches as "scan_recycle_buffers".

This version includes buffer recycling

- for SeqScans larger than shared buffers, with the objective of
improving L2 cache efficiency *and* reducing the effects of shared
buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing
whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
originally suggested by Itagaki-san, just with a different
implementation).

Behaviour is not activated by default in this patch. To request buffer
recycling, set the USERSET GUC
        SET scan_recycle_buffers = N
tested with 1,4,8,16, but only > 8 seems sensible, IMHO.

Patch effects StrategyGetBuffer, so only effects the disk->cache path.
The idea is that if its already in shared buffer cache then we get
substantial benefit already and nothing else is needed. So for the
general case, the patch adds a single if test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM
(currently). Any change mid-scan will be ignored.

IMHO its possible to do this and to allow Synch Scans at the same time,
with some thought. There is no need for us to rely on cache spoiling
behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this
objectively.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com

Re: Bug: Buffer cache is not scan resistant

From
ITAGAKI Takahiro
Date:
"Simon Riggs" <simon@2ndquadrant.com> wrote:

> I've implemented buffer recycling, as previously described, patch being
> posted now to -patches as "scan_recycle_buffers".
> 
> - for VACUUMs of any size, with the objective of reducing WAL thrashing
> whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
> originally suggested by Itagaki-san, just with a different
> implementation).

I tested your patch with VACUUM FREEZE. The performance was improved when
I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
but this patch should be useful for normal VACUUMs with background jobs!
   N | time  | WAL flush(*)
-----+-------+-----------
   0 | 58.7s |  0.01%
   1 | 80.3s | 81.76%
   8 | 73.4s | 16.73%
  16 | 64.2s |  9.24%
  32 | 59.0s |  4.88%
  64 | 56.7s |  2.63%
 128 | 55.1s |  1.41%

(*) WAL flush is the fraction of buffer recycles that needed a WAL fsync.

# SET scan_recycle_buffers = 0;
# UPDATE accounts SET aid=aid WHERE random() < 0.005;
# CHECKPOINT;
# SET scan_recycle_buffers = <N>;
# VACUUM FREEZE accounts;


BTW, does the patch change the default usage of buffer in vacuum? From what
I've seen, scan_recycle_buffers = 1 is the same as before. With the default
value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
just like existing sequential scans. Is this intended?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
> 
> > I've implemented buffer recycling, as previously described, patch being
> > posted now to -patches as "scan_recycle_buffers".
> > 
> > - for VACUUMs of any size, with the objective of reducing WAL thrashing
> > whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
> > originally suggested by Itagaki-san, just with a different
> > implementation).
> 
> I tested your patch with VACUUM FREEZE. The performance was improved when
> I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> but this patch should be useful for normal VACUUMs with backgrond jobs!

Thanks.

>    N | time  | WAL flush(*)
> -----+-------+-----------
>    0 | 58.7s |  0.01%
>    1 | 80.3s | 81.76%
>    8 | 73.4s | 16.73%
>   16 | 64.2s |  9.24%
>   32 | 59.0s |  4.88%
>   64 | 56.7s |  2.63%
>  128 | 55.1s |  1.41%
> 
> (*) WAL flush is the ratio of the need of fsync to buffer recycle.

Do you have the same measurement without patch applied? I'd be
interested in the current state also (the N=0 path is modified as well
for VACUUM, in this patch).

> # SET scan_recycle_buffers = 0;
> # UPDATE accounts SET aid=aid WHERE random() < 0.005;
> # CHECKPOINT;
> # SET scan_recycle_buffers = <N>;
> # VACUUM FREEZE accounts;

Very good results, thanks. I'd be interested in the same thing for just
VACUUM and for varying ratios of dirty/clean blocks during vacuum.

> BTW, does the patch change the default usage of buffer in vacuum? From what
> I've seen, scan_recycle_buffers = 1 is the same as before. 

That was the intention.

> With the default
> value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> just like existing sequential scans. Is this intended?

Yes, but its not very useful for testing to have done that. I'll do
another version within the hour that sets N=0 (only) back to current
behaviour for VACUUM.

One of the objectives of the patch was to prevent VACUUM from tripping
up other backends. I'm confident that it will improve that situation for
OLTP workloads running regular concurrent VACUUMs, but we will need to
wait a couple of days to get those results in also.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:

> > With the default
> > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > just like existing sequential scans. Is this intended?
>
> Yes, but its not very useful for testing to have done that. I'll do
> another version within the hour that sets N=0 (only) back to current
> behaviour for VACUUM.

New test version enclosed, where scan_recycle_buffers = 0 doesn't change
existing VACUUM behaviour.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com


Attachment

Re: Bug: Buffer cache is not scan resistant

From
Tom Lane
Date:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I tested your patch with VACUUM FREEZE. The performance was improved when
> I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> but this patch should be useful for normal VACUUMs with backgrond jobs!

Proving that you can see a difference in a worst-case scenario is not the
same as proving that the patch is useful in normal cases.
        regards, tom lane


Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Mon, 2007-03-12 at 10:30 -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > I tested your patch with VACUUM FREEZE. The performance was improved when
> > I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic,
> > but this patch should be useful for normal VACUUMs with backgrond jobs!
> 
> Proving that you can see a different in a worst-case scenario is not the
> same as proving that the patch is useful in normal cases.

I agree, but I think that this VACUUM FREEZE test does actually
represent the normal case, here's why:

The poor buffer manager behaviour occurs if the block is dirty as a
result of WAL-logged changes. It only takes the removal of a single row
for us to have WAL logged this and dirtied the block.

If we invoke VACUUM from autovacuum, we do this by default when 20% of
the rows have been updated, which means with many distributions of
UPDATEs we'll have touched a very large proportion of blocks before we
VACUUM. That isn't true for *all* distributions of UPDATEs, but it will
soon be. Dead Space Map will deliver only dirty blocks for us.

So running a VACUUM FREEZE is a reasonable simulation of running a large
VACUUM on a real production system with default autovacuum enabled, as
will be the case for 8.3. 

It is possible that we run VACUUM when fewer dirty blocks are generated,
but this won't be the common situation and not something we should
optimise for.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
ITAGAKI Takahiro
Date:
"Simon Riggs" <simon@2ndquadrant.com> wrote:

> > > With the default
> > > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > > just like existing sequential scans. Is this intended?
> > 
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.

This is a result with scan_recycle_buffers.v3.patch. I used normal VACUUM
with background load using slowdown-ed pgbench in this instance. I believe
the patch is useful in normal cases, not only for VACUUM FREEZE.
   N |  time  | WAL flush(*)
-----+--------+-----------
   0 | 112.8s | 44.3%
   1 | 148.9s | 52.1%
   8 | 105.1s | 17.6%
  16 |  96.9s |  8.7%
  32 | 103.9s |  6.3%
  64 |  89.4s |  6.6%
 128 |  80.0s |  3.8%

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center




Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Simon,

You may know we've built something similar and have seen similar gains.
We're planning a modification that I think you should consider: when there
is a sequential scan of a table larger than the size of shared_buffers, we
are allowing the scan to write through the shared_buffers cache.

The hypothesis is that if a relation is of a size equal to or less than the
size of shared_buffers, it is "cacheable" and should use the standard LRU
approach to provide for reuse.
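
Expressed as code, the policy Luke describes is roughly this (hypothetical
names, not a real API):

typedef unsigned int BlockNumber;

typedef enum
{
    SCAN_USE_POOL,      /* normal clock/LRU behaviour: relation is "cacheable" */
    SCAN_USE_RING       /* recycle a handful of buffers round-robin */
} ScanBufferPolicy;

static ScanBufferPolicy
choose_scan_policy(BlockNumber rel_nblocks, BlockNumber pool_nblocks)
{
    return (rel_nblocks <= pool_nblocks) ? SCAN_USE_POOL : SCAN_USE_RING;
}

Where exactly the threshold should sit (equal to shared_buffers, or some
fraction of it) is the part that needs testing.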

- Luke

On 3/12/07 3:08 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
>> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> 
>>> With the default
>>> value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in
>>> pool,
>>> just like existing sequential scans. Is this intended?
>> 
>> Yes, but its not very useful for testing to have done that. I'll do
>> another version within the hour that sets N=0 (only) back to current
>> behaviour for VACUUM.
> 
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.




Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Mon, 2007-03-12 at 22:16 -0700, Luke Lonergan wrote:

> You may know we've built something similar and have seen similar gains.

Cool

> We're planning a modification that I think you should consider: when there
> is a sequential scan of a table larger than the size of shared_buffers, we
> are allowing the scan to write through the shared_buffers cache.

Write? For which operations?

I was thinking to do this for bulk writes also, but it would require
changes to bgwriter's cleaning sequence. Are you saying to write say ~32
buffers then fsync them, rather than letting bgwriter do that? Then
allow those buffers to be reused?

> The hypothesis is that if a relation is of a size equal to or less than the
> size of shared_buffers, it is "cacheable" and should use the standard LRU
> approach to provide for reuse.

Sounds reasonable. Please say more.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
"Simon Riggs"
Date:
On Tue, 2007-03-13 at 13:40 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
> 
> > > > With the default
> > > > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > > > just like existing sequential scans. Is this intended?
> > > 
> > New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> > existing VACUUM behaviour.
> 
> This is a result with scan_recycle_buffers.v3.patch. I used normal VACUUM
> with background load using slowdown-ed pgbench in this instance. I believe
> the patch is useful in normal cases, not only for VACUUM FREEZE.
> 
>    N |  time  | WAL flush(*)
> -----+--------+-----------
>    0 | 112.8s | 44.3%
>    1 | 148.9s | 52.1%
>    8 | 105.1s | 17.6%
>   16 |  96.9s |  8.7%
>   32 | 103.9s |  6.3%
>   64 |  89.4s |  6.6%
>  128 |  80.0s |  3.8%

Looks good.

Not sure what value of N to pick for normal use. The objectives are
i) don't let VACUUMs spoil the cache
ii) speed up standalone VACUUMs
iii) don't let VACUUM cause others to repeatedly WAL flush

I'm thinking N=16 meets all 3 objectives. We could make VACUUM go faster
still, but by knocking more blocks out of cache that someone doing real
work might need. That will slow them down almost as much as forcing them
to flush WAL, so I'd want to be conservative with VACUUM.

Does anybody think we need a new parameter for this, or are we happy at
16 buffers in the recycle loop for VACUUM?

At this point I should note something I haven't mentioned before.
VACUUMs force other backends to flush out WAL only when we have an I/O
bound workload. If the database already fits in memory then BufferAlloc
never needs to run and therefore we don't need to flush WAL. So I can
understand that the effect of WAL flushing may not have been noticed by
many testers. 
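
For reference, the rule at work here is WAL-before-data: a dirty buffer
cannot be written out and reused until WAL has been flushed up to the page's
LSN. Roughly, with invented names standing in for the real bufmgr/xlog
routines:

#include <stdint.h>
#include <stdbool.h>

typedef uint64_t XLogRecPtr;

typedef struct
{
    bool        dirty;
    XLogRecPtr  page_lsn;    /* LSN of the last WAL record touching the page */
} BufHdr;

static XLogRecPtr flushed_wal_lsn = 0;    /* how far WAL has been fsync'd */

/* Stubs standing in for the real WAL flush and data write. */
static void xlog_flush(XLogRecPtr upto)    { flushed_wal_lsn = upto; }
static void write_page_to_disk(int buf_id) { (void) buf_id; }

static void
flush_buffer_for_reuse(BufHdr *buf, int buf_id)
{
    if (!buf->dirty)
        return;                            /* clean buffer: free to reuse */

    if (buf->page_lsn > flushed_wal_lsn)
        xlog_flush(buf->page_lsn);         /* WAL-before-data rule */

    write_page_to_disk(buf_id);
    buf->dirty = false;
}

With a small recycle ring, VACUUM keeps hitting this path for its own dirty
buffers, which is what the WAL flush column in the earlier tables is counting.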

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com




Re: Bug: Buffer cache is not scan resistant

From
"Luke Lonergan"
Date:
Simon,

On 3/13/07 2:37 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

>> We're planning a modification that I think you should consider: when there
>> is a sequential scan of a table larger than the size of shared_buffers, we
>> are allowing the scan to write through the shared_buffers cache.
> 
> Write? For which operations?

I'm actually just referring to the sequential scan "writing into" the shared
buffers cache, sorry for the "write through" :-) 
> I was thinking to do this for bulk writes also, but it would require
> changes to bgwriter's cleaning sequence. Are you saying to write say ~32
> buffers then fsync them, rather than letting bgwriter do that? Then
> allow those buffers to be reused?

Off topic, but we think we just found the reason(s) for the abysmal heap
insert performance of pgsql and are working on a fix to that as well.  It
involves two main things: the ping-ponging seeks used to extend a relfile
and the bgwriter not flushing aggressively enough.  We're hoping to move the
net heap insert rate from 12MB/s for a single stream to something more like
100 MB/s per stream, but it may take a week to get some early results and
find out if we're on the right track.  We've been wrong on this before ;-)

- Luke   




Re: Bug: Buffer cache is not scan resistant

From
Bruce Momjian
Date:
Simon, is this patch ready to be added to the patch queue? I assume not.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> > On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> 
> > > With the default
> > > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > > just like existing sequential scans. Is this intended?
> > 
> > Yes, but its not very useful for testing to have done that. I'll do
> > another version within the hour that sets N=0 (only) back to current
> > behaviour for VACUUM.
> 
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.
> 
> -- 
>   Simon Riggs             
>   EnterpriseDB   http://www.enterprisedb.com
> 

[ Attachment, skipping... ]


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Bug: Buffer cache is not scan resistant

From
Bruce Momjian
Date:
"test" version, but I am putting in the queue so we can track it there.

Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Simon Riggs wrote:
> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> > On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> 
> > > With the default
> > > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool,
> > > just like existing sequential scans. Is this intended?
> > 
> > Yes, but its not very useful for testing to have done that. I'll do
> > another version within the hour that sets N=0 (only) back to current
> > behaviour for VACUUM.
> 
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.
> 
> -- 
>   Simon Riggs             
>   EnterpriseDB   http://www.enterprisedb.com
> 

[ Attachment, skipping... ]


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +