Re: Bug: Buffer cache is not scan resistant - Mailing list pgsql-hackers

From Luke Lonergan
Subject Re: Bug: Buffer cache is not scan resistant
Date
Msg-id C3E62232E3BCF24CBA20D72BFDCB6BF802CFC0D6@MI8NYCMAIL08.Mi8.com
In response to Re: Bug: Buffer cache is not scan resistant  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Bug: Buffer cache is not scan resistant  (Gregory Stark <stark@enterprisedb.com>)
List pgsql-hackers

> > The Postgres shared buffer cache algorithm appears to have a bug.
> > When there is a sequential scan the blocks are filling the entire
> > shared buffer cache.  This should be "fixed".
>
> No, this is not a bug; it is operating as designed.  The
> point of the current bufmgr algorithm is to replace the page
> least recently used, and that's what it's doing.

At least we've established that for certain.

> If you want to lobby for changing the algorithm, then you
> need to explain why one test case on one platform justifies
> de-optimizing for a lot of other cases.  In almost any
> concurrent-access situation I think that what you are
> suggesting would be a dead loss --- for instance we might as
> well forget about Jeff Davis' synchronized-scan work.

Instead of forgetting about it, we'd need to change it.

> In any case, I'm still not convinced that you've identified
> the problem correctly, because your explanation makes no
> sense to me.  How can the processor's L2 cache improve access
> to data that it hasn't got yet?

The evidence clearly indicates reduced memory writing due to an L2-related
effect: the data show a dramatic reduction in main memory writes when the
destination of the copied data fits in the L2 cache.

I'll try to fit a hypothesis to explain it.  Assume you've got a warm IO
cache in the OS.

The heapscan algorithm now works like this (sketched in C below):
0) select a destination user buffer
1) uiomove->kcopy memory from the IO cache to the user buffer
   1A) read from kernel space
   1B) write to user space
2) the user buffer is accessed many times by the executor nodes above
Repeat
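
For concreteness, here's a rough, self-contained C sketch of that loop
shape.  It is not the actual Postgres heapscan or bufmgr code; the
round-robin buffer pool, the pool size, and the executor_consume()
stand-in are placeholders of mine, just to show where the kernel-to-user
copy (1A/1B) and the later accesses (2) land:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>

    #define BLCKSZ   8192
    #define NBUFFERS 256                    /* placeholder pool size */

    static char pool[NBUFFERS][BLCKSZ];

    /* step 0: pick a destination user buffer (trivial round-robin here,
     * standing in for the real buffer replacement policy) */
    static char *select_destination_buffer(long i) { return pool[i % NBUFFERS]; }

    /* step 2: stand-in for the executor nodes reading the buffer */
    static long executor_consume(const char *buf)
    {
        long sum = 0;
        for (int j = 0; j < BLCKSZ; j++)
            sum += buf[j];
        return sum;
    }

    int main(int argc, char **argv)
    {
        int  fd;
        long total = 0;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
            return 1;

        for (long i = 0; ; i++)
        {
            char *dst = select_destination_buffer(i);   /* step 0 */
            if (read(fd, dst, BLCKSZ) <= 0)             /* step 1: 1A read kernel IO
                                                           cache, 1B write user buffer */
                break;
            total += executor_consume(dst);             /* step 2 */
        }
        printf("checksum %ld\n", total);
        return 0;
    }

With NBUFFERS * BLCKSZ larger than L2 you get the caseA write pattern
described below; shrink it to fit in L2 and you get caseB.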

There are two situations we are evaluating: one where the addresses of
the user buffers are scattered over a space larger than L2 (caseA), and
one where they are confined to a space the size of L2 (caseB).  Note
that we could also consider a third situation, where the addresses are
scattered over a space larger than L2 but smaller than the region
covered by the TLB entries mapped by the L2 cache (512 max); we tried
that, and it proved uninteresting.

Step 1A is the same in both cases: each 8KB block copied in step (1) is
read from the IO cache into 128 L2 lines (64B each), evicting whatever
data was there before.

In step 1B for caseA, the destination of the writes is mostly at
addresses not currently resident in L2, so 128 victim lines are chosen
(LRU), stored into, and the writes are flushed to main memory.

In step 1B for caseB, the destination of the writes is already resident
in L2.  The 128 lines are stored into, and the write-back to main memory
is deferred on the assumption that these lines are "hot", since they
were already in L2.

I don't know enough to be sure this is the right answer, but it does fit
the experimental data.
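
To make the hypothesis testable outside Postgres, here's a rough
microbenchmark sketch.  The 2MB L2 size, the region sizes, and the
iteration count are assumptions of mine, not numbers from our test; all
it does is memcpy the same 8KB block into destinations cycled over a
region much larger than L2 (caseA) versus one that fits in L2 (caseB).
If the hypothesis is right, caseB should run noticeably faster, because
the destination lines stay resident and write-back to main memory is
deferred:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BLCKSZ      8192
    #define L2_BYTES    (2 * 1024 * 1024)     /* assumed L2 size */
    #define CASEA_BYTES (128 * 1024 * 1024)   /* much larger than L2 */
    #define ITERS       200000L

    static double run_case(char *region, size_t region_bytes, const char *src)
    {
        size_t nslots = region_bytes / BLCKSZ;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; i++)                         /* step 1B only */
            memcpy(region + (i % nslots) * BLCKSZ, src, BLCKSZ);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        char *src   = malloc(BLCKSZ);
        char *caseA = malloc(CASEA_BYTES);  /* destinations scattered beyond L2 */
        char *caseB = malloc(L2_BYTES);     /* destinations confined to ~L2     */

        memset(src, 1, BLCKSZ);
        memset(caseA, 0, CASEA_BYTES);
        memset(caseB, 0, L2_BYTES);

        printf("caseA (> L2):  %.3f s\n", run_case(caseA, CASEA_BYTES, src));
        printf("caseB (<= L2): %.3f s\n", run_case(caseB, L2_BYTES, src));
        printf("ignore: %d %d\n", caseA[0], caseB[0]);  /* keep the stores live */
        return 0;
    }

Other hardware may well behave differently; this only isolates the step
1B write pattern from the rest of the scan.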

- Luke


