Thread: Optimize kernel readahead using buffer access strategy

Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

14 November 2013, 12:04:08

Hi,

I create a patch that is improvement of disk-read and OS file caches. It can
optimize kernel readahead parameter using buffer access strategy and
posix_fadvice() in various disk-read situations.

In general OS, readahead parameter was dynamically decided by disk-read
situations. If long time disk-read was happened, readahead parameter becomes big.
However it is based on experienced or heuristic algorithm, it causes waste
disk-read and throws out useful OS file caches in some case. It is bad for
disk-read performance a lot.

My proposed method is controlling OS readahead parameter by using buffer access
strategy in PostgreSQL and posix_fadvice() system call which can control OS
readahead parameter. Though, it is a general method in database.

For your information of effect of this patch, I got results of pgbench which are
in-memory-size database and out-memory-size database, and postgresql.conf
settings are always used by us. It seems to improve performance to a better. And
I think that this feature is going to be necessary for business intelligence
which will be realized at PostgreSQL version 10. I seriously believe Simon's
presentation in PostgreSQL conference Europe 2013! It was very exciting!!!

PostgreSQL have a lot of kind of disk-read method that are selected by planner,
however. I think that we need to discuss more other situations except pgbench,
and other cache cold situations. I think that optimizing kernel readahead
parameter with considering planner in PostgreSQL seems to be quite difficult, so
I seriously recruit co-author in this patch:-)

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment

Re: Optimize kernel readahead using buffer access strategy

From

Claudio Freire

Date:

14 November 2013, 13:53:45

On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> I create a patch that is improvement of disk-read and OS file caches. It can
> optimize kernel readahead parameter using buffer access strategy and
> posix_fadvice() in various disk-read situations.
>
> In general OS, readahead parameter was dynamically decided by disk-read
> situations. If long time disk-read was happened, readahead parameter becomes big.
> However it is based on experienced or heuristic algorithm, it causes waste
> disk-read and throws out useful OS file caches in some case. It is bad for
> disk-read performance a lot.

It would be relevant to know which kernel did you use for those tests.

@@ -677,6 +677,7 @@ mdread(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum,                 errmsg("could not seek to block %u in file \"%s\": %m",
blocknum,FilePathName(v->mdfd_vfd))));

+    BufferHintIOAdvise(v->mdfd_vfd, buffer, BLCKSZ, strategy);    nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
    TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,

A while back, I tried to use posix_fadvise to prefetch index pages. I
ended up finding out that interleaving posix_fadvise with I/O like
that severly hinders (ie: completely disables) the kernel's read-ahead
algorithm.

How exactly did you set up those benchmarks? pg_bench defaults?

pg_bench does not exercise heavy sequential access patterns, or long
index scans. It performs many single-page index lookups per
transaction and that's it. You may want to try your patch with more
real workloads, and maybe you'll confirm what I found out last time I
messed with posix_fadvise. If my experience is still relevant, those
patterns will have suffered a severe performance penalty with this
patch, because it will disable kernel read-ahead on sequential index
access. It may still work for sequential heap scans, because the
access strategy will tell the kernel to do read-ahead, but many other
access methods will suffer.

Try OLAP-style queries.

Re: Optimize kernel readahead using buffer access strategy

From

Fujii Masao

Date:

14 November 2013, 17:03:16

On Thu, Nov 14, 2013 at 9:09 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> Hi,
>
> I create a patch that is improvement of disk-read and OS file caches. It can
> optimize kernel readahead parameter using buffer access strategy and
> posix_fadvice() in various disk-read situations.

When I compiled the HEAD code with this patch on MacOS, I got the following
error and warnings.

gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -I../../../../src/include   -c -o fd.o fd.c
fd.c: In function 'BufferHintIOAdvise':
fd.c:1182: error: 'POSIX_FADV_SEQUENTIAL' undeclared (first use in
this function)
fd.c:1182: error: (Each undeclared identifier is reported only once
fd.c:1182: error: for each function it appears in.)
fd.c:1185: error: 'POSIX_FADV_RANDOM' undeclared (first use in this function)
make[4]: *** [fd.o] Error 1
make[3]: *** [file-recursive] Error 2
make[2]: *** [storage-recursive] Error 2
make[1]: *** [install-backend-recurse] Error 2
make: *** [install-src-recurse] Error 2

tablecmds.c:9120: warning: passing argument 5 of 'smgrread' makes
pointer from integer without a cast
bufmgr.c:455: warning: passing argument 5 of 'smgrread' from
incompatible pointer type

Regards,

-- 
Fujii Masao

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

15 November 2013, 02:07:42

Hi Claudio,

(2013/11/14 22:53), Claudio Freire wrote:
> On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>> I create a patch that is improvement of disk-read and OS file caches. It can
>> optimize kernel readahead parameter using buffer access strategy and
>> posix_fadvice() in various disk-read situations.
>>
>> In general OS, readahead parameter was dynamically decided by disk-read
>> situations. If long time disk-read was happened, readahead parameter becomes big.
>> However it is based on experienced or heuristic algorithm, it causes waste
>> disk-read and throws out useful OS file caches in some case. It is bad for
>> disk-read performance a lot.
>
> It would be relevant to know which kernel did you use for those tests.
I use CentOS 6.4 which kernel version is 2.6.32-358.23.2.el6.x86_64 in this test.


> A while back, I tried to use posix_fadvise to prefetch index pages.
I search your past work. Do you talk about this ML-thread? Or is there another 
latest discussion? I see your patch is interesting, but it wasn't submitted to CF 
and stopping discussions.
http://www.postgresql.org/message-id/CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2TXmY+SkJD6G_vjC5QNg@mail.gmail.com

>I ended up finding out that interleaving posix_fadvise with I/O like
> that severly hinders (ie: completely disables) the kernel's read-ahead
> algorithm.
Your patch becomes maximum readahead, when a sql is selected index range scan. Is 
it right? I think that your patch assumes that pages are ordered by index-data. 
This assumption is partially wrong. If your assumption is true, we don't need 
CLUSTER command. In actuary, CLUSTER command becomes better performance than nothing.

> How exactly did you set up those benchmarks? pg_bench defaults?
My detail test setting is under following,
* Server info  CPU: Intel(R) Xeon(R) CPU E5645  @ 2.40GHz (2U/12C)  RAM: 6GB    -> I reduced it intentionally in OS
paraemter,because large memory tests       have long time.  HDD: SEAGATE  Model: ST2000NM0001 @ 7200rpm * 1  RAID:
none.

* postgresql.conf(summarized)  shared_buffers = 600MB (10% of RAM = 6GB)  work_mem = 1MB  maintenance_work_mem = 64MB
wal_level= archive  fsync = on  archive_mode = on  checkpoint_segments = 300  checkpoint_timeout = 15min
checkpoint_completion_target= 0.7
 

* pgbench settings
pgbench -j 4 -c 32 -T 600 pgbench


> pg_bench does not exercise heavy sequential access patterns, or long
> index scans. It performs many single-page index lookups per
> transaction and that's it.
Yes, your argument is right. And it is also a fact that performance becomes 
better in these situations.

> You may want to try your patch with more
> real workloads, and maybe you'll confirm what I found out last time I
> messed with posix_fadvise. If my experience is still relevant, those
> patterns will have suffered a severe performance penalty with this
> patch, because it will disable kernel read-ahead on sequential index
> access. It may still work for sequential heap scans, because the
> access strategy will tell the kernel to do read-ahead, but many other
> access methods will suffer.
The decisive difference with your patch is that my patch uses buffer hint control 
architecture, so it can control readahaed smarter in some cases.
However, my patch is on the way and needed to more improvement. I am going to add 
method of controlling readahead by GUC, for user can freely select readahed 
parameter in their transactions.

> Try OLAP-style queries.
I have DBT-3(TPC-H) benchmark tools. If you don't like TPC-H, could you tell me 
good OLAP benchmark tools?

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

15 November 2013, 02:13:20

(2013/11/15 2:03), Fujii Masao wrote:
> On Thu, Nov 14, 2013 at 9:09 PM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>> Hi,
>>
>> I create a patch that is improvement of disk-read and OS file caches. It can
>> optimize kernel readahead parameter using buffer access strategy and
>> posix_fadvice() in various disk-read situations.
>
> When I compiled the HEAD code with this patch on MacOS, I got the following
> error and warnings.
>
> gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels
> -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
> -fwrapv -g -I../../../../src/include   -c -o fd.o fd.c
> fd.c: In function 'BufferHintIOAdvise':
> fd.c:1182: error: 'POSIX_FADV_SEQUENTIAL' undeclared (first use in
> this function)
> fd.c:1182: error: (Each undeclared identifier is reported only once
> fd.c:1182: error: for each function it appears in.)
> fd.c:1185: error: 'POSIX_FADV_RANDOM' undeclared (first use in this function)
> make[4]: *** [fd.o] Error 1
> make[3]: *** [file-recursive] Error 2
> make[2]: *** [storage-recursive] Error 2
> make[1]: *** [install-backend-recurse] Error 2
> make: *** [install-src-recurse] Error 2
>
> tablecmds.c:9120: warning: passing argument 5 of 'smgrread' makes
> pointer from integer without a cast
> bufmgr.c:455: warning: passing argument 5 of 'smgrread' from
> incompatible pointer type
Thanks you for your report!
I will fix it. Could you tell me your Mac OS version and gcc version? I have only 
mac book air with Maverick OS(10.9).

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Optimize kernel readahead using buffer access strategy

From

Peter Geoghegan

Date:

15 November 2013, 02:17:21

On Thu, Nov 14, 2013 at 6:18 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> I will fix it. Could you tell me your Mac OS version and gcc version? I have
> only mac book air with Maverick OS(10.9).

I have an idea that Mac OSX doesn't have posix_fadvise at all. Didn't
you use the relevant macros so that the code at least builds on those
platforms?

-- 
Peter Geoghegan

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

15 November 2013, 02:26:24

(2013/11/15 11:17), Peter Geoghegan wrote:
> On Thu, Nov 14, 2013 at 6:18 PM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>> I will fix it. Could you tell me your Mac OS version and gcc version? I have
>> only mac book air with Maverick OS(10.9).
>
> I have an idea that Mac OSX doesn't have posix_fadvise at all. Didn't
> you use the relevant macros so that the code at least builds on those
> platforms?
Thank you for your nice advice, too.
I try to fix macro program.

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Optimize kernel readahead using buffer access strategy

From

Claudio Freire

Date:

15 November 2013, 04:48:29

On Thu, Nov 14, 2013 at 11:13 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> Hi Claudio,
>
>
> (2013/11/14 22:53), Claudio Freire wrote:
>>
>> On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa
>> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>>>
>>> I create a patch that is improvement of disk-read and OS file caches. It
>>> can
>>> optimize kernel readahead parameter using buffer access strategy and
>>> posix_fadvice() in various disk-read situations.
>>>
>>> In general OS, readahead parameter was dynamically decided by disk-read
>>> situations. If long time disk-read was happened, readahead parameter
>>> becomes big.
>>> However it is based on experienced or heuristic algorithm, it causes
>>> waste
>>> disk-read and throws out useful OS file caches in some case. It is bad
>>> for
>>> disk-read performance a lot.
>>
>>
>> It would be relevant to know which kernel did you use for those tests.
>
> I use CentOS 6.4 which kernel version is 2.6.32-358.23.2.el6.x86_64 in this
> test.

That's close to the kernel version I was using, so you should see the
same effect.

>> A while back, I tried to use posix_fadvise to prefetch index pages.
>
> I search your past work. Do you talk about this ML-thread? Or is there
> another latest discussion? I see your patch is interesting, but it wasn't
> submitted to CF and stopping discussions.
> http://www.postgresql.org/message-id/CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2TXmY+SkJD6G_vjC5QNg@mail.gmail.com

Yes, I didn't, exactly because of that bad interaction with the
kernel. It needs either more smarts to only do fadvise on known-random
patterns (what you did mostly), or an accompanying kernel patch (which
I was working on, but ran out of test machines).

>> I ended up finding out that interleaving posix_fadvise with I/O like
>> that severly hinders (ie: completely disables) the kernel's read-ahead
>> algorithm.
>
> Your patch becomes maximum readahead, when a sql is selected index range
> scan. Is it right?

Ehm... sorta.

> I think that your patch assumes that pages are ordered by
> index-data.

No. It just knows which pages will be needed, and fadvises them. No
guessing involved, except the guess that the scan will not be aborted.
There's a heuristic to stop limited scans from attempting to fadvise,
and that's that prefetch strategy is applied only from the Nth+ page
walk.

It improves index-only scans the most, but I also attempted to handle
heap prefetches. That's where the kernel started conspiring against
me, because I used many naturally-clustered indexes, and THERE
performance was adversely affected because of that kernel bug.

>> You may want to try your patch with more
>> real workloads, and maybe you'll confirm what I found out last time I
>> messed with posix_fadvise. If my experience is still relevant, those
>> patterns will have suffered a severe performance penalty with this
>> patch, because it will disable kernel read-ahead on sequential index
>> access. It may still work for sequential heap scans, because the
>> access strategy will tell the kernel to do read-ahead, but many other
>> access methods will suffer.
>
> The decisive difference with your patch is that my patch uses buffer hint
> control architecture, so it can control readahaed smarter in some cases.

Indeed, but it's not enough. See my above comment about naturally
clustered indexes. The planner expects that, and plans accordingly. It
will notice correlation between a PK and physical location, and will
treat an index scan over PK to be almost sequential. With your patch,
that assumption will be broken I believe.

> However, my patch is on the way and needed to more improvement. I am going
> to add method of controlling readahead by GUC, for user can freely select
> readahed parameter in their transactions.

Rather, I'd try to avoid fadvising consecutive or almost-consecutive
blocks. Detecting that is hard at the block level, but maybe you can
tie that detection into the planner, and specify a sequential strategy
when the planner expects index-heap correlation?

>> Try OLAP-style queries.
>
> I have DBT-3(TPC-H) benchmark tools. If you don't like TPC-H, could you tell
> me good OLAP benchmark tools?

I don't really know. Skimming the specs, I'm not sure if those queries
generate large index range queries. You could try, maybe with
autoexplain?

Re: Optimize kernel readahead using buffer access strategy

From

Peter Eisentraut

Date:

15 November 2013, 17:01:56

On 11/14/13, 7:09 AM, KONDO Mitsumasa wrote:
> I create a patch that is improvement of disk-read and OS file caches. It can
> optimize kernel readahead parameter using buffer access strategy and
> posix_fadvice() in various disk-read situations.

Various compiler warnings:

tablecmds.c: In function ‘copy_relation_data’:
tablecmds.c:9120:3: warning: passing argument 5 of ‘smgrread’ makes pointer from integer without a cast [enabled by
default]
In file included from tablecmds.c:79:0:
../../../src/include/storage/smgr.h:94:13: note: expected ‘char *’ but argument is of type ‘int’

bufmgr.c: In function ‘ReadBuffer_common’:
bufmgr.c:455:4: warning: passing argument 5 of ‘smgrread’ from incompatible pointer type [enabled by default]
In file included from ../../../../src/include/storage/buf_internals.h:22:0,                from bufmgr.c:45:
../../../../src/include/storage/smgr.h:94:13: note: expected ‘char *’ but argument is of type ‘BufferAccessStrategy’

md.c: In function ‘mdread’:
md.c:680:2: warning: passing argument 2 of ‘BufferHintIOAdvise’ makes integer from pointer without a cast [enabled by
default]
In file included from md.c:34:0:
../../../../src/include/storage/fd.h:79:12: note: expected ‘off_t’ but argument is of type ‘char *’

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

18 November 2013, 01:57:42

(2013/11/15 13:48), Claudio Freire wrote:
> On Thu, Nov 14, 2013 at 11:13 PM, KONDO Mitsumasa
>> I use CentOS 6.4 which kernel version is 2.6.32-358.23.2.el6.x86_64 in this
>> test.
>
> That's close to the kernel version I was using, so you should see the
> same effect.
OK. You proposed readahead maximum patch, I think it seems to get benefit for 
perofomance and your part of argument is really true.

>> Your patch becomes maximum readahead, when a sql is selected index range
>> scan. Is it right?
>
> Ehm... sorta.
>
>> I think that your patch assumes that pages are ordered by
>> index-data.
>
> No. It just knows which pages will be needed, and fadvises them. No
> guessing involved, except the guess that the scan will not be aborted.
> There's a heuristic to stop limited scans from attempting to fadvise,
> and that's that prefetch strategy is applied only from the Nth+ page
> walk.
We may completely optimize kernel readahead in PostgreSQL in the future,
however it is very difficult and takes long time that it completely comes true 
from a beginning. So I propose GUC switch that can use in their transactions.(I　 
will create this patch in this CF.). If someone off readahed for using file cache 
more efficient in his transactions, he can set "SET readahead = off". PostgreSQL 
is open source, and I think that it becomes clear which case it is effective for, 
by using many people.

> It improves index-only scans the most, but I also attempted to handle
> heap prefetches. That's where the kernel started conspiring against
> me, because I used many naturally-clustered indexes, and THERE
> performance was adversely affected because of that kernel bug.
I also create gaussinan-distributed pgbench now and submit this CF. It can clear 
which situasion is effective, partially we will know.

>>> You may want to try your patch with more
>>> real workloads, and maybe you'll confirm what I found out last time I
>>> messed with posix_fadvise. If my experience is still relevant, those
>>> patterns will have suffered a severe performance penalty with this
>>> patch, because it will disable kernel read-ahead on sequential index
>>> access. It may still work for sequential heap scans, because the
>>> access strategy will tell the kernel to do read-ahead, but many other
>>> access methods will suffer.
>>
>> The decisive difference with your patch is that my patch uses buffer hint
>> control architecture, so it can control readahaed smarter in some cases.
>
> Indeed, but it's not enough. See my above comment about naturally
> clustered indexes. The planner expects that, and plans accordingly. It
> will notice correlation between a PK and physical location, and will
> treat an index scan over PK to be almost sequential. With your patch,
> that assumption will be broken I believe.
~
>> However, my patch is on the way and needed to more improvement. I am going
>> to add method of controlling readahead by GUC, for user can freely select
>> readahed parameter in their transactions.
>
> Rather, I'd try to avoid fadvising consecutive or almost-consecutive
> blocks. Detecting that is hard at the block level, but maybe you can
> tie that detection into the planner, and specify a sequential strategy
> when the planner expects index-heap correlation?
I think we had better to develop these patches in step by step each patches, 
because it is difficult that readahead optimizetion is completely come true from 
a beginning of one patch. We need flame-work in these patches, first.

>>> Try OLAP-style queries.
>>
>> I have DBT-3(TPC-H) benchmark tools. If you don't like TPC-H, could you tell
>> me good OLAP benchmark tools?
>
> I don't really know. Skimming the specs, I'm not sure if those queries
> generate large index range queries. You could try, maybe with
> autoexplain?
OK, I do. And, I will use simple large index range queries with explain command.

Regards,
-- 
Mitsuamsa KONDO
NTT Open Source Software Center

Re: Optimize kernel readahead using buffer access strategy

From

Claudio Freire

Date:

18 November 2013, 02:25:26

On Sun, Nov 17, 2013 at 11:02 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
>>> However, my patch is on the way and needed to more improvement. I am
>>> going
>>> to add method of controlling readahead by GUC, for user can freely select
>>> readahed parameter in their transactions.
>>
>>
>> Rather, I'd try to avoid fadvising consecutive or almost-consecutive
>> blocks. Detecting that is hard at the block level, but maybe you can
>> tie that detection into the planner, and specify a sequential strategy
>> when the planner expects index-heap correlation?
>
> I think we had better to develop these patches in step by step each patches,
> because it is difficult that readahead optimizetion is completely come true
> from a beginning of one patch. We need flame-work in these patches, first.

Well, problem is, that without those smarts, I don't think this patch
can be enabled by default. It will considerably hurt common use cases
for postgres.

But I guess we'll have a better idea about that when we see how much
of a performance impact it makes when you run those tests, so no need
to guess in the dark.

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

18 November 2013, 04:02:32

(2013/11/18 11:25), Claudio Freire wrote:
> On Sun, Nov 17, 2013 at 11:02 PM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>>>> However, my patch is on the way and needed to more improvement. I am
>>>> going
>>>> to add method of controlling readahead by GUC, for user can freely select
>>>> readahed parameter in their transactions.
>>>
>>>
>>> Rather, I'd try to avoid fadvising consecutive or almost-consecutive
>>> blocks. Detecting that is hard at the block level, but maybe you can
>>> tie that detection into the planner, and specify a sequential strategy
>>> when the planner expects index-heap correlation?
>>
>> I think we had better to develop these patches in step by step each patches,
>> because it is difficult that readahead optimizetion is completely come true
>> from a beginning of one patch. We need flame-work in these patches, first.
>
> Well, problem is, that without those smarts, I don't think this patch
> can be enabled by default. It will considerably hurt common use cases
> for postgres.
Yes. I have thought as much you that defalut setting is false.
(use normal readahead as before). Next version of my patch will become these.

> But I guess we'll have a better idea about that when we see how much
> of a performance impact it makes when you run those tests, so no need
> to guess in the dark.
Yes, sure.

Regards,
-- 
Mitsumasa KONDO
NTT Open Source Software Center

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

10 December 2013, 07:58:27

Hi,

I revise this patch and re-run performance test, it can work collectry in Linux
and no complile wanings. I add GUC about enable_kernel_readahead option in new
version. When this GUC is on(default), it works in POSIX_FADV_NORMAL which is
general readahead in OS. And when it is off, it works in POSXI_FADV_RANDOM or
POSIX_FADV_SEQUENTIAL which is judged by buffer hint in Postgres, readahead
parameter is optimized by postgres. We can change this parameter in their
transactions everywhere and everytime.

* Test server
   Server: HP Proliant DL360 G7
   CPU:    Xeon E5640 2.66GHz (1P/4C)
   Memory: 18GB(PC3-10600R-9)
   Disk:   146GB(15k)*4 RAID1+0
   RAID controller: P410i/256MB
   OS: RHEL 6.4(x86_64)
   FS: Ext4

* Test setting
   I use "pgbench -c 8 -j 4 -T 2400 -S -P 10 -a"
   I also use my accurate patch in this test. So I exexuted under following
command before each benchmark.
     1. cluster all database
     2. truncate pgbench_history
     3. checkpoint
     4. sync
     5. checkpoint

* postresql.conf
shared_buffers = 2048MB
maintenance_work_mem = 64MB
wal_level = minimal
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7

* Performance test result
** In memory database size
s=1000        |   1   |   2   |   3   |  avg
---------------------------------------------
readahead=on  | 39836 | 40229 | 40055 | 40040
readahead=off | 31259 | 29656 | 30693 | 30536
ratio         |  78%  |  74%  |  77%  |   76%

** Over memory database size
s=2000        |   1  |   2  |    3    |  avg
---------------------------------------------
readahead=on  | 1288 | 1370 |   1367  | 1341
readahead=off | 1683 | 1688 |   1395  | 1589
ratio         | 131% | 123% |   102%  | 118%

s=3000        |   1  |   2  |    3    |  avg
---------------------------------------------
readahead=on  |  965 |  862 |   993   |  940
readahead=off | 1113 | 1098 |   935   | 1049
ratio         | 115% | 127% |   94%   | 112%


It seems good performance expect scale factor=1000. When readahead parameter is
off, disk IO keep to a minimum or necessary, therefore it is faster than
"readahead=on". "readahead=on" uses useless diskIO. For example, which is faster
8KB random read or 12KB random read from disks in many times transactions? It is
self-evident that the former is faster.

In scale factor 1000, it becomes to slower buffer-is-hot than "readahead=on". So
it seems to less performance. But it is essence in measuring perfomance. And you
can confirm it by attached benchmark graphs. We can use this parameter when
buffer is reratively hot. If you want to see other trial graphs, I will send.

And I will support to MacOS and create document about this patch in this week.
#MacOS is in my house.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment

Re: Optimize kernel readahead using buffer access strategy

From

Claudio Freire

Date:

10 December 2013, 13:56:08

On Tue, Dec 10, 2013 at 5:03 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> I revise this patch and re-run performance test, it can work collectry in
> Linux and no complile wanings. I add GUC about enable_kernel_readahead
> option in new version. When this GUC is on(default), it works in
> POSIX_FADV_NORMAL which is general readahead in OS. And when it is off, it
> works in POSXI_FADV_RANDOM or POSIX_FADV_SEQUENTIAL which is judged by
> buffer hint in Postgres, readahead parameter is optimized by postgres. We
> can change this parameter in their transactions everywhere and everytime.

I'd change the naming to

enable_readahead=os|fadvise

with os = on, fadvise = off

And, if you want to keep the on/off values, I'd reverse them. Because
off reads more like "I don't do anything special", and in your patch
it's quite the opposite.

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

11 December 2013, 06:09:09

(2013/12/10 22:55), Claudio Freire wrote:
> On Tue, Dec 10, 2013 at 5:03 AM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>> I revise this patch and re-run performance test, it can work collectry in
>> Linux and no complile wanings. I add GUC about enable_kernel_readahead
>> option in new version. When this GUC is on(default), it works in
>> POSIX_FADV_NORMAL which is general readahead in OS. And when it is off, it
>> works in POSXI_FADV_RANDOM or POSIX_FADV_SEQUENTIAL which is judged by
>> buffer hint in Postgres, readahead parameter is optimized by postgres. We
>> can change this parameter in their transactions everywhere and everytime.
>
> I'd change the naming to
OK. I think "on" or "off" naming is not good, too.

> enable_readahead=os|fadvise
>
> with os = on, fadvise = off
Hmm. fadvise is method and is not a purpose. So I consider another idea of this GUC.

1)readahead_strategy=os|pg  This naming is good for future another implements. If we will want to set  maximum
readaheadparaemeter which is always use POSIX_FADV_SEQUENTIAL, we can 
 
set "max".

2)readahead_optimizer=os|pg or readahaed_strategist=os|pg  This naming is easy to understand to who is opitimized
readahead. But it isn't extensibility for future another implements.
 

> And, if you want to keep the on/off values, I'd reverse them. Because
> off reads more like "I don't do anything special", and in your patch
> it's quite the opposite.
I understand your feeling. If we adopt "on|off" setting, I would like to set GUC
optimized_readahead=off|on.


Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Re: Optimize kernel readahead using buffer access strategy

From

Claudio Freire

Date:

12 December 2013, 00:30:41

On Wed, Dec 11, 2013 at 3:14 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
>
>> enable_readahead=os|fadvise
>>
>> with os = on, fadvise = off
>
> Hmm. fadvise is method and is not a purpose. So I consider another idea of
> this GUC.


Yeah, I was thinking of opening the door for readahead=aio, but
whatever clearer than on-off would work ;)

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

12 December 2013, 10:52:25

(2013/12/12 9:30), Claudio Freire wrote:
> On Wed, Dec 11, 2013 at 3:14 AM, KONDO Mitsumasa
> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>>
>>> enable_readahead=os|fadvise
>>>
>>> with os = on, fadvise = off
>>
>> Hmm. fadvise is method and is not a purpose. So I consider another idea of
>> this GUC.
>
> Yeah, I was thinking of opening the door for readahead=aio, but
> whatever clearer than on-off would work ;)

I'm very interested in Postgres with libaio, and I'd like to see the perfomance 
improvements. I'm not sure about libaio, however, it will face 
exclusive-buffer-lock problem in asynchronous IO.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Re: Optimize kernel readahead using buffer access strategy

From

Simon Riggs

Date:

12 December 2013, 13:18:07

On 14 November 2013 12:09, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

> For your information of effect of this patch, I got results of pgbench which are
> in-memory-size database and out-memory-size database, and postgresql.conf
> settings are always used by us. It seems to improve performance to a better. And
> I think that this feature is going to be necessary for business intelligence
> which will be realized at PostgreSQL version 10. I seriously believe Simon's
> presentation in PostgreSQL conference Europe 2013! It was very exciting!!!

Thank you.

I like the sound of this patch, sorry I've not been able to help as yet.

Your tests seem to relate to pgbench. Do we have tests on more BI related tasks?

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Optimize kernel readahead using buffer access strategy

From

Mitsumasa KONDO

Date:

12 December 2013, 13:43:09

2013/12/12 Simon Riggs <simon@2ndquadrant.com>

On 14 November 2013 12:09, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

> For your information of effect of this patch, I got results of pgbench which are
> in-memory-size database and out-memory-size database, and postgresql.conf
> settings are always used by us. It seems to improve performance to a better. And
> I think that this feature is going to be necessary for business intelligence
> which will be realized at PostgreSQL version 10. I seriously believe Simon's
> presentation in PostgreSQL conference Europe 2013! It was very exciting!!!

Thank you.

I like the sound of this patch, sorry I've not been able to help as yet.

Your tests seem to relate to pgbench. Do we have tests on more BI related tasks?

Yes, off-course! We will need another benchmark test before conclusion of this patch.

What kind of benchmark do you have?

Regards,

Mitsumasa KONDO

NTT Open Source Software Center

Re: Optimize kernel readahead using buffer access strategy

From

Simon Riggs

Date:

12 December 2013, 14:05:18

On 12 December 2013 13:43, Mitsumasa KONDO <kondo.mitsumasa@gmail.com> wrote:

>> Your tests seem to relate to pgbench. Do we have tests on more BI related
>> tasks?
>
> Yes, off-course!  We will need another benchmark test before conclusion of
> this patch.
> What kind of benchmark do you have?

I suggest isolating SeqScan and IndexScan and BitmapIndex/HeapScan
examples, as well as some of the simpler TPC-H queries.

But start with some SeqScan and VACUUM cases.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

17 December 2013, 11:44:51

Hi,

I fixed the patch to improve followings.

   - Can compile in MacOS.
   - Change GUC name enable_kernel_readahead to readahead_strategy.
   - Change POSIX_FADV_SEQUNENTIAL to POISX_FADV_NORMAL when we select sequential
     access strategy, this reason is later...

I tested simple two access paterns which are followings in pgbench tables scale
size is 1000.

   A) SELECT count(bid) FROM pgbench_accounts; (Index only scan)
   B) SELECT count(bid) FROM pgbench_accounts; (Seq scan)

  In each test, I restart postgres and drop file cache before each test.

Unpatched PG is faster than patched in A and B query. It was about 1.3 times
faster. Result of A query as expected, because patched PG cannot execute
readahead at all. So cache cold situation is bad for patched PG. However, it
might good for cache hot situation, because it doesn't read disk IO at all and
can calculate file cache usage and know which cache is important.

However, result of B query as unexpected, because my patch select
POSIX_FADV_SEQUNENTIAL collectry, but it slow. I cannot understand that,
nevertheless I read kernel source code... Next, I change POSIX_FADV_SEQUNENTIAL
to POISX_FADV_NORMAL in my patch. B query was faster as unpatched PG.

In heavily random access benchmark tests which are pgbench and DBT-2, my patched
PG is about 1.1 - 1.3 times faster than unpatched PG. But postgres buffer hint
strategy algorithm have not optimized for readahead strategy yet, and I don't fix
it. It is still only for ring buffer algorithm in shared_buffer.


Attached printf-debug patch will show you inside postgres buffer strategy. When
you see "S" it selects sequential access strategy, on the other hands, when you
see "R" it selects random access strategy. It might interesting for you. It's
very visual.

Example output is here.
> [mitsu-ko@localhost postgresql]$ bin/vacuumdb
> SSSSSSSSSSSSSSSSSSSSSSSSSSS~~SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
> [mitsu-ko@localhost postgresql]$ bin/psql -c "EXPLAIN ANALYZE SELECT count(aid) FROM pgbench_accounts"
>                                                                            QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=2854.29..2854.30 rows=1 width=4) (actual time=33.438..33.438 rows=1 loops=1)
>    ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.29..2604.29 rows=100000 width=4)
(actualtime=0.072..20.912 rows=100000 loops=1) 
>          Heap Fetches: 0
>  Total runtime: 33.552 ms
> (4 rows)
>
> RRRRRRRRRRRRRRRRRRRRRRRRRRR~~RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
> [mitsu-ko@localhost postgresql]$ bin/psql -c "EXPLAIN ANALYZE SELECT count(bid) FROM pgbench_accounts"
> SSSSSSSSSSSSSSSSSSSSSSSSSSS~~SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
>
------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=2890.00..2890.01 rows=1 width=4) (actual time=40.315..40.315 rows=1 loops=1)
>    ->  Seq Scan on pgbench_accounts  (cost=0.00..2640.00 rows=100000 width=4) (actual time=0.112..23.001 rows=100000
loops=1)
>  Total runtime: 40.472 ms
> (3 rows)

Thats's all now.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment

Re: Optimize kernel readahead using buffer access strategy

From

Simon Riggs

Date:

17 December 2013, 12:29:47

On 17 December 2013 11:50, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

> Unpatched PG is faster than patched in A and B query. It was about 1.3 times
> faster. Result of A query as expected, because patched PG cannot execute
> readahead at all. So cache cold situation is bad for patched PG. However, it
> might good for cache hot situation, because it doesn't read disk IO at all
> and can calculate file cache usage and know which cache is important.
>
> However, result of B query as unexpected, because my patch select
> POSIX_FADV_SEQUNENTIAL collectry, but it slow. I cannot understand that,
> nevertheless I read kernel source code... Next, I change
> POSIX_FADV_SEQUNENTIAL to POISX_FADV_NORMAL in my patch. B query was faster
> as unpatched PG.
>
> In heavily random access benchmark tests which are pgbench and DBT-2, my
> patched PG is about 1.1 - 1.3 times faster than unpatched PG. But postgres
> buffer hint strategy algorithm have not optimized for readahead strategy
> yet, and I don't fix it. It is still only for ring buffer algorithm in
> shared_buffer.

These are interesting results. Good research.

They also show that the benefit of this is very specific to the exact
task being performed. I can't see any future for a setting that
applies to everything or nothing. We must be more selective.

We also need much better benchmark results, clearly laid out, so they
can be reproduced and discussed.

Please keep working on this.

-- Simon Riggs                   http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

18 December 2013, 02:23:56

(2013/12/17 21:29), Simon Riggs wrote:
> These are interesting results. Good research.
Thanks!

> They also show that the benefit of this is very specific to the exact
> task being performed. I can't see any future for a setting that
> applies to everything or nothing. We must be more selective.
This patch is still needed some human judgement whether readahead is on or off.
But it might have been already useful for clever users. However, I'd like to 
implement adding more the minimum optimization.

> We also need much better benchmark results, clearly laid out, so they
> can be reproduced and discussed.
I think this feature is big benefit for OLTP, and it might useful for BI now.
BI queries are mostly compicated, so we will need to test more in some
situations. Printf debug is very useful for debugging my patch, and it will 
accelerate the optimization.

> Please keep working on this.
OK. I do it patiently.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Re: Optimize kernel readahead using buffer access strategy

From

KONDO Mitsumasa

Date:

14 January 2014, 11:52:50

Hi,

I fix and submit this patch in CF4.

In my past patch, it is significant bug which is mistaken caluculation of
offset in posix_fadvise():-( However it works well without problem in pgbench.
Because pgbench transactions are always random access...

And I test my patch in DBT-2 benchmark. Results are under following.

* Test server
   Server: HP Proliant DL360 G7
   CPU:    Xeon E5640 2.66GHz (1P/4C)
   Memory: 18GB(PC3-10600R-9)
   Disk:   146GB(15k)*4 RAID1+0
   RAID controller: P410i/256MB
   OS: RHEL 6.4(x86_64)
   FS: Ext4


* DBT-2 result(WH400, SESSION=100, ideal_score=5160)
Method     | score | average | 90%tile | Maximum
------------------------------------------------
plain      | 3589  | 9.751   | 33.680  | 87.8036
option=off | 3670  | 9.107   | 34.267  | 79.3773
option=on  | 4222  | 5.140   | 7.619   | 102.473

  "option" is "readahead_strategy" option, and "on" is my proposed.
  "average", "90%tile", and Maximum represent latency.
  Average_latency is 2 times faster than plain!


* Detail results (uploading now. please wait for a hour...)
[plain]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
[option=off]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_off_20140109/HTML/index_thput.html
[option=on]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_on_20140109/HTML/index_thput.html

We can see part of super slow latency in my proposed method test.
Part of transaction active is 20%, and part of slow transactions is 80%.
It might be Pareto principle in CHECKPOINT;-)
#It's joke.

I will test some join sqls performance and TPC-3 benchmark in this or next week.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachment

optimizing_kernel-readahead_using_buffer-access-strategy_v5.patch

Re: Optimize kernel readahead using buffer access strategy

From

Claudio Freire

Date:

14 January 2014, 14:35:12

On Tue, Jan 14, 2014 at 8:58 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
>
> In my past patch, it is significant bug which is mistaken caluculation of
> offset in posix_fadvise():-( However it works well without problem in
> pgbench.
> Because pgbench transactions are always random access...

Did you notice any difference?

AFAIK, when specifying read patterns (ie, RANDOM, SEQUENTIAL and stuff
like that), the offset doesn't matter. At least in linux.

Re: Optimize kernel readahead using buffer access strategy

From

Claudio Freire

Date:

14 January 2014, 14:35:19

On Tue, Jan 14, 2014 at 8:58 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
>
> In my past patch, it is significant bug which is mistaken caluculation of
> offset in posix_fadvise():-( However it works well without problem in
> pgbench.
> Because pgbench transactions are always random access...

Did you notice any difference?

AFAIK, when specifying read patterns (ie, RANDOM, SEQUENTIAL and stuff
like that), the offset doesn't matter. At least in linux.

Re: Optimize kernel readahead using buffer access strategy

From

Andres Freund

Date:

04 April 2014, 14:11:04

Hi,

On 2014-01-14 20:58:20 +0900, KONDO Mitsumasa wrote:
> I will test some join sqls performance and TPC-3 benchmark in this or next week.

This patch has been marked as "Waiting For Author" for nearly two months
now. Marked as "Returned with Feedback".

Greetings,

Andres Freund