Re: [HACKERS] Clock with Adaptive Replacement - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: [HACKERS] Clock with Adaptive Replacement
Date
Msg-id CAEepm=1hFjHs=pzFwvAFP4wfiF8+33LxuXwigugXDY3kpVy_Sg@mail.gmail.com
In response to Re: [HACKERS] Clock with Adaptive Replacement  (Andrey Borodin <x4mmm@yandex-team.ru>)
Responses Re: [HACKERS] Clock with Adaptive Replacement
List pgsql-hackers
On Tue, Apr 24, 2018 at 5:04 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Here's what we have about the size of shared buffers now [1] (taken
> from [2]). I believe the right hill must be big enough to reduce the
> central pit to zero and make the function monotonic: the OS page
> cache knows less about data blocks and is expected to be less
> efficient.

> [1] https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png

Interesting curve.  Can you explain it?  Here is my guess, assuming
that pgbench is generating uniformly distributed random access, and
assuming 26GB dataset and 32GB physical RAM:

1.  In the range 128MB to 6GB, only a small fraction of the dataset
fits into shared_buffers, but it doesn't matter much: you get a 100%
hit ratio from the kernel's page cache.

2.  In the range 6GB to 16GB, the dataset doesn't fit in
shared_buffers OR the kernel's page cache, so you begin to risk misses
in both caches.  Every byte you give to one you take away from the
other, but it might be worse than zero-sum: at the lowest point in
your graph, x = 16GB, I suppose you have a ~61% hit ratio in shared
buffers (16GB/26GB), but when you miss you're also likely to miss in
the OS page cache, because that probably holds the most recently read
16GB (= the other half of physical RAM) of data... in other words the
same data, assuming uniform random access.  (If the access pattern
were not random, some pages would have a better chance of staying in
PG's cache while falling out of the kernel's cache, but essentially
random replacement prevents that.)

3.  In the range 16GB to 26GB, the shared_buffers hit ratio climbs
from ~61% back to 100%.

I guess if you had 48GB of RAM the dip between 6GB and 26GB wouldn't happen?

Am I close?
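
To convince myself that shape makes sense, here's a toy Monte Carlo
sketch of the model above -- not PostgreSQL code, just the
back-of-the-envelope reasoning from points 1-3, with both caches doing
random replacement (the real replacement policies differ, of course)
and all sizes invented to roughly match your 26GB/32GB numbers:

#include <stdio.h>
#include <stdlib.h>

#define DATASET  26000          /* dataset size, in 1MB "pages" (~26GB) */
#define RAM      32000          /* physical memory (~32GB) */
#define ACCESSES 2000000L       /* uniform random accesses per run */

typedef struct
{
    int         capacity;
    int        *slot;           /* slot -> resident page, or -1 */
    int        *where;          /* page -> slot it occupies, or -1 */
} Cache;

static void
cache_init(Cache *c, int capacity)
{
    c->capacity = capacity;
    c->slot = malloc(capacity * sizeof(int));
    c->where = malloc(DATASET * sizeof(int));
    for (int i = 0; i < capacity; i++)
        c->slot[i] = -1;
    for (int i = 0; i < DATASET; i++)
        c->where[i] = -1;
}

/*
 * Random replacement: put the page in a random slot, evicting the
 * previous occupant.  Crude, but good enough to model point 2.
 */
static void
cache_insert(Cache *c, int page)
{
    int         s = random() % c->capacity;

    if (c->slot[s] != -1)
        c->where[c->slot[s]] = -1;
    c->slot[s] = page;
    c->where[page] = s;
}

int
main(void)
{
    srandom(1234);
    for (int gb = 2; gb <= 26; gb += 2)
    {
        Cache       shared,
                    os;
        long        disk = 0;

        cache_init(&shared, gb * 1000);         /* shared_buffers */
        cache_init(&os, RAM - gb * 1000);       /* kernel page cache */

        for (long i = 0; i < ACCESSES; i++)
        {
            int         page = random() % DATASET;

            if (shared.where[page] != -1)
                continue;                       /* shared_buffers hit */
            if (os.where[page] == -1)
            {
                disk++;                         /* missed both caches */
                cache_insert(&os, page);        /* kernel caches the read */
            }
            cache_insert(&shared, page);
        }
        printf("shared_buffers = %2dGB: %.1f%% of accesses hit disk\n",
               gb, 100.0 * disk / ACCESSES);
        free(shared.slot);
        free(shared.where);
        free(os.slot);
        free(os.where);
    }
    return 0;
}

If the model is right, the disk-read percentage should be near zero at
both ends and peak when the two caches are close in size -- your dip,
inverted.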

> I'm not sure CART is the best possibility, though.

I don't know either, but LRU and ARC don't seem promising for various
reasons.  Aside from the scan resistance claim, it sounds like CART
should avoid long searches for a buffer.  In my primitive attempts to
understand that a bit better, I tried instrumenting freelist.c to tell
dtrace how far it had to move the clock hand to find a buffer, and I
got histograms like this.  This example is from a 1GB dataset, 512MB
shared buffers, pgbench select-only; the most frequent distances were
in the 4-7 bucket, but there were 3 freak cases in the 16384-32767
bucket:

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@                                     65535
               1 |@@@@@                                    89190
               2 |@@@@@@@@@                                139114
               4 |@@@@@@@@@@@                              170881
               8 |@@@@@@@@                                 132579
              16 |@@@                                      47731
              32 |                                         5530
              64 |                                         137
             128 |                                         1
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |                                         0
            8192 |                                         1
           16384 |                                         3
           32768 |                                         0

You can see just a few outliers there.  With smaller shared buffers
the counts are bigger but the average hand movement distance gets
smaller (here 128MB shared buffers):

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         16383
               1 |@@@@@@@@@@@@@@@                          1289265
               2 |@@@@@@@@@@@@@@@                          1270531
               4 |@@@@@@@@                                 659389
               8 |@                                        111091
              16 |                                         2815
              32 |                                         19
              64 |                                         1
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         2
            4096 |                                         2
            8192 |                                         0
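
For anyone who wants to play with the shape of those distributions
without dtrace, here's a standalone toy model of the clock sweep done
by StrategyGetBuffer() -- a sketch only, with an invented NBUFFERS,
warm-up and usage pattern; the real code in freelist.c differs in
detail:

#include <stdio.h>
#include <stdlib.h>

#define NBUFFERS  16384         /* 128MB of 8kB buffers */
#define MAX_USAGE 5             /* like BM_MAX_USAGE_COUNT */

static int  usage[NBUFFERS];
static int  hand;

/*
 * Sweep until a buffer with usage count 0 turns up, decrementing
 * counts as we pass; return the distance the hand travelled --
 * the quantity the histograms above are measuring.
 */
static int
clock_sweep(void)
{
    int         distance = 0;

    for (;;)
    {
        int         b = hand;

        hand = (hand + 1) % NBUFFERS;
        if (usage[b] == 0)
        {
            usage[b] = 1;       /* victim found; new page used once */
            return distance;
        }
        usage[b]--;             /* spared for another lap */
        distance++;
    }
}

int
main(void)
{
    long        total = 0;
    int         max = 0;

    /* Warm the pool with random usage counts... */
    srandom(42);
    for (int i = 0; i < NBUFFERS; i++)
        usage[i] = random() % (MAX_USAGE + 1);

    /* ...then measure a lot of victim searches. */
    for (int i = 0; i < 1000000; i++)
    {
        int         d = clock_sweep();

        total += d;
        if (d > max)
            max = d;
    }
    printf("avg distance %.2f, max %d\n", total / 1000000.0, max);
    return 0;
}

Note that if you warm every buffer to MAX_USAGE instead, the first
search has to lap the whole pool five times (~80k hand movements at
this size) before any buffer reaches zero, which is the flavour of
pathological case I mean below.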


I think pgbench isn't a very interesting workload, though.  I don't
have time right now, but it would be fun to dig further and try to
construct workloads that generate pathological searches for buffers (I
believe such workloads exist, based on anecdotal reports).

-- 
Thomas Munro
http://www.enterprisedb.com

