Re: Optimize kernel readahead using buffer access strategy - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: Optimize kernel readahead using buffer access strategy
Date
Msg-id CAGTBQpZU6kBvq0gyeOjYOBO9aXnyEda_Q2qEOjMW4dTqyDZXNA@mail.gmail.com
In response to Re: Optimize kernel readahead using buffer access strategy  (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
Responses Re: Optimize kernel readahead using buffer access strategy  (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
List pgsql-hackers
On Thu, Nov 14, 2013 at 11:13 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:
> Hi Claudio,
>
>
> (2013/11/14 22:53), Claudio Freire wrote:
>>
>> On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa
>> <kondo.mitsumasa@lab.ntt.co.jp> wrote:
>>>
>>> I created a patch that improves disk reads and the use of the OS file
>>> cache. It can optimize the kernel readahead parameter using the buffer
>>> access strategy and posix_fadvise() in various disk-read situations.
>>>
>>> In a general-purpose OS, the readahead parameter is decided dynamically
>>> from the disk-read pattern. If disk reads continue for a long time, the
>>> readahead parameter becomes large.
>>> However, because this is based on a heuristic algorithm, in some cases
>>> it causes wasted disk reads and throws out useful OS file cache pages,
>>> which hurts disk-read performance a lot.
>>
>>
>> It would be relevant to know which kernel you used for those tests.
>
> I used CentOS 6.4, whose kernel version is 2.6.32-358.23.2.el6.x86_64, in
> this test.

That's close to the kernel version I was using, so you should see the
same effect.

>> A while back, I tried to use posix_fadvise to prefetch index pages.
>
> I searched for your past work. Are you talking about this ML thread, or is
> there a more recent discussion? Your patch looks interesting, but it wasn't
> submitted to a CF and the discussion stopped.
> http://www.postgresql.org/message-id/CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2TXmY+SkJD6G_vjC5QNg@mail.gmail.com

Yes, and I didn't submit it, exactly because of that bad interaction
with the kernel. It needs either more smarts to only do fadvise on
known-random patterns (which is mostly what you did), or an accompanying
kernel patch (which I was working on, but ran out of test machines).

>> I ended up finding out that interleaving posix_fadvise with I/O like
>> that severely hinders (i.e., completely disables) the kernel's read-ahead
>> algorithm.
>
> Your patch sets readahead to the maximum when the SQL selects an index
> range scan. Is that right?

Ehm... sorta.

> I think that your patch assumes that pages are ordered by
> index-data.

No. It just knows which pages will be needed, and fadvises them. No
guessing involved, except the guess that the scan will not be aborted.
There's a heuristic to stop limited scans from attempting to fadvise:
the prefetch strategy is applied only from the Nth page walk onward.

It improves index-only scans the most, but I also attempted to handle
heap prefetches. That's where the kernel started conspiring against
me, because I used many naturally-clustered indexes, and THERE
performance was adversely affected because of that kernel bug.

>> You may want to try your patch with more
>> real workloads, and maybe you'll confirm what I found out last time I
>> messed with posix_fadvise. If my experience is still relevant, those
>> patterns will have suffered a severe performance penalty with this
>> patch, because it will disable kernel read-ahead on sequential index
>> access. It may still work for sequential heap scans, because the
>> access strategy will tell the kernel to do read-ahead, but many other
>> access methods will suffer.
>
> The decisive difference from your patch is that my patch uses a buffer hint
> control architecture, so it can control readahead more smartly in some cases.

Indeed, but it's not enough. See my above comment about naturally
clustered indexes. The planner expects that, and plans accordingly. It
will notice correlation between a PK and physical location, and will
treat an index scan over the PK as almost sequential. With your patch,
I believe that assumption will be broken.
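
To illustrate the concern, here is a sketch of how a hint could follow
the planner's expectation instead of the access method alone. The
function name and the 0.9 threshold are hypothetical, and using
POSIX_FADV_RANDOM as the "index scan" default is an assumption about
what a strategy-based patch would pick; the correlation input is meant
to be the kind of value exposed in pg_stats.correlation.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose the POSIX_FADV_* constants */
#endif
#include <fcntl.h>

#define CORRELATION_THRESHOLD 0.9 /* hypothetical cut-off */

/*
 * Pick an fadvise hint for the heap side of a forward index scan.
 * `correlation` is the statistical correlation between index order and
 * physical row order, in the range -1..1.
 */
static int
advice_for_correlation(double correlation)
{
    /* Heap visits follow physical order closely: keep read-ahead on. */
    if (correlation >= CORRELATION_THRESHOLD)
        return POSIX_FADV_SEQUENTIAL;

    /* Otherwise treat the heap visits as scattered (assumed default). */
    return POSIX_FADV_RANDOM;
}
```

With a naturally clustered PK the correlation is close to 1, so this
sketch keeps kernel read-ahead enabled for exactly the scans the
planner already costs as nearly sequential.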

> However, my patch is still in progress and needs more improvement. I am
> going to add a way to control readahead through a GUC, so that users can
> freely select the readahead parameter in their transactions.

Rather, I'd try to avoid fadvising consecutive or almost-consecutive
blocks. Detecting that is hard at the block level, but maybe you can
tie that detection into the planner, and specify a sequential strategy
when the planner expects index-heap correlation?
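
For reference, a rough sketch of what the block-level check could look
like, with its limits. The tracker struct, the function name and the
window of 8 blocks are hypothetical. It only remembers the previous
block, which is part of why block-level detection is hard: interleaved
scans on the same file would defeat it.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_WINDOW 8              /* hypothetical "almost consecutive" gap */

/* Hypothetical per-file tracker; not a PostgreSQL structure. */
typedef struct BlockTracker
{
    bool     have_last;           /* false until the first request */
    uint32_t last_blkno;          /* previously requested block */
} BlockTracker;

/*
 * Return true if `blkno` looks like a random jump that is worth an
 * fadvise call, false if it is the same block or a short forward step
 * that kernel read-ahead is expected to cover by itself.
 */
static bool
should_fadvise(BlockTracker *t, uint32_t blkno)
{
    bool near_sequential = false;

    if (t->have_last && blkno >= t->last_blkno)
        near_sequential = (blkno - t->last_blkno) <= SEQ_WINDOW;

    t->have_last = true;
    t->last_blkno = blkno;
    return !near_sequential;
}
```

Backward steps are treated as random here, since forward read-ahead
would not have fetched them.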

>> Try OLAP-style queries.
>
> I have the DBT-3 (TPC-H) benchmark tools. If you don't like TPC-H, could
> you recommend a good OLAP benchmark tool?

I don't really know. Skimming the specs, I'm not sure if those queries
generate large index range scans. You could try, maybe with
auto_explain?


