Thread: Introducing a new linux readahead framework

Introducing a new linux readahead framework

From: Wu Fengguang
Greetings,

I'd like to introduce a new readahead framework for the linux kernel:
http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/1021.html

HOW IT WORKS

In adaptive readahead, the context based method may be of particular
interest to postgresql users. It works by peeking into the file cache
and checking whether any history pages are present or accessed. In this
way it can detect almost all forms of sequential / semi-sequential read
patterns, e.g.
    - parallel / interleaved sequential scans on one file
    - sequential reads across file open/close
    - mixed sequential / random accesses
    - sparse / skimming sequential read

It also has methods to detect some less common cases:
    - reading backward
    - seeking all over the file and reading N pages at each position
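
The context based idea can be illustrated with a small sketch. This is not
the patch's code: the function names, the doubling rule and the window
limits are assumptions made up for illustration. It only shows how looking
at which pages are already cached lets each of two interleaved streams find
its own history.

```python
def history_run_length(cached_pages, index, max_lookback=32):
    """Count consecutive cached pages immediately before `index`."""
    run = 0
    while run < max_lookback and (index - 1 - run) in cached_pages:
        run += 1
    return run


def context_readahead_size(cached_pages, index, max_window=64):
    """Return a hypothetical readahead size (in pages) for a read at `index`.

    No history right before the page means the read looks random, so nothing
    is read ahead. Otherwise the window grows with the history found.
    """
    run = history_run_length(cached_pages, index, max_lookback=max_window)
    if run == 0:
        return 0
    return min(2 * run, max_window)


# Two interleaved sequential streams on one file, plus one isolated read.
cache = set(range(0, 8)) | set(range(1000, 1004))
print(context_readahead_size(cache, 8))     # 16: stream A has 8 history pages
print(context_readahead_size(cache, 1004))  # 8: stream B has 4 history pages
print(context_readahead_size(cache, 5000))  # 0: no history, treated as random
```

No per-file-descriptor state is used here, which is why the same lookup also
works across file open/close.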

WAYS TO BENEFIT FROM IT

As we know, postgresql relies on the kernel to do proper readahead.
The adaptive readahead might help performance in the following cases:
    - concurrent sequential scans
    - sequential scan on a fragmented table
      (some DBs suffer from this problem, not sure for pgsql)
    - index scan with clustered matches
    - index scan on majority rows (in case the planner goes wrong)

TUNABLE PARAMETERS

There are two parameters which are described in this email:
http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/1024.html

Here are guidelines aimed more specifically at postgresql users:

- /proc/sys/vm/readahead_ratio
Since most DB servers have plenty of memory, the danger of readahead
thrashing is close to zero. In this case, you can set readahead_ratio to
100 (or even 200 :), which helps the readahead window scale up rapidly.

- /proc/sys/vm/readahead_hit_rate
Sparse sequential reads are read patterns like {0, 2, 4, 5, 8, 11, ...}.
In this case we might prefer to do readahead to get good I/O performance
at the cost of reading some useless pages. If you prefer not to do so,
setting readahead_hit_rate to 1 will disable this feature.

- /sys/block/sd<X>/queue/read_ahead_kb
Set it to a large value (e.g. 4096) as you used to do.
RAID users might want to use a bigger number.
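
To make the sparse-read trade-off concrete, here is a rough sketch that
computes how many pages a contiguous readahead would fetch per page actually
used. Treating that ratio as the quantity limited by readahead_hit_rate is
an assumption for illustration; the function names are hypothetical and the
patch's real accounting may differ.

```python
def readahead_to_access_ratio(accessed_pages):
    """Pages a contiguous readahead would fetch, divided by pages actually read."""
    pages = sorted(set(accessed_pages))
    if not pages:
        raise ValueError("need at least one accessed page")
    span = pages[-1] - pages[0] + 1   # contiguous range covering every access
    return span / len(pages)


def worth_reading_ahead(accessed_pages, max_ratio):
    """Hypothetical policy: read ahead only if the waste stays within max_ratio."""
    return readahead_to_access_ratio(accessed_pages) <= max_ratio


sparse = [0, 2, 4, 5, 8, 11]                       # the pattern quoted above
print(readahead_to_access_ratio(sparse))           # 2.0: 12 pages fetched, 6 used
print(worth_reading_ahead(sparse, max_ratio=1))    # False
print(worth_reading_ahead(sparse, max_ratio=2))    # True
```

With a limit of 1 only fully dense patterns qualify, which matches the
statement that a value of 1 turns sparse readahead off.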

TRYING IT OUT

The latest patch for stable kernels can be downloaded here:
http://www.vanheusden.com/ara/

Before compiling, make sure that the following options are enabled:
Processor type and features -> Adaptive file readahead
Processor type and features ->   Readahead debug and accounting

HELPING AND CONTRIBUTING

The patch is open to fine-tuning advice :)
Comments and benchmarking results are highly appreciated.

Thanks,
Wu

Re: Introducing a new linux readahead framework

From: "Jim C. Nasby"
On Fri, Apr 21, 2006 at 09:38:26AM +0800, Wu Fengguang wrote:
> Greetings,
>
> I'd like to introduce a new readahead framework for the linux kernel:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0603.2/1021.html
>
> HOW IT WORKS
>
> In adaptive readahead, the context based method may be of particular
> interest to postgresql users. It works by peeking into the file cache
> and checking whether any history pages are present or accessed. In this
> way it can detect almost all forms of sequential / semi-sequential read
> patterns, e.g.
>     - parallel / interleaved sequential scans on one file
>     - sequential reads across file open/close
>     - mixed sequential / random accesses
>     - sparse / skimming sequential read
>
> It also has methods to detect some less common cases:
>     - reading backward
>     - seeking all over the file and reading N pages at each position

Are there any ways to inform the kernel that you either are or aren't
doing a sequential read? It seems that in some cases it would be better
to bypass a bunch of tricky logic trying to determine that it's doing a
sequential read. A sequential scan in PostgreSQL would be such a case.

The opposite example would be an index scan of a highly uncorrelated
index, which would produce mostly random reads from the table. In that
case, reading ahead probably makes very little sense, though your logic
might have a better idea of the access pattern than PostgreSQL does.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: Introducing a new linux readahead framework

From: Wu Fengguang
On Thu, Apr 20, 2006 at 11:31:47PM -0500, Jim C. Nasby wrote:
> Are there any ways to inform the kernel that you either are or aren't
> doing a sequential read? It seems that in some cases it would be better

This call will disable readahead totally for fd:
        posix_fadvise(fd, any, any, POSIX_FADV_RANDOM);

This one will reenable it:
        posix_fadvise(fd, any, any, POSIX_FADV_NORMAL);

This one will enable readahead _and_ set max readahead window to
2*max_readahead_kb:
        posix_fadvise(fd, any, any, POSIX_FADV_SEQUENTIAL);
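
For reference, a minimal sketch of issuing these hints from user space,
written in Python, whose os.posix_fadvise wraps the same system call. The
helper names are made up for this example, offset 0 with length 0 covers the
whole file, and what the kernel does with each hint depends on the kernel in
use (the behaviour listed above is as described for the patched kernel).

```python
import os
import tempfile


def advise(fd, advice):
    """Apply an access-pattern hint to the whole file (offset 0, length 0)."""
    os.posix_fadvise(fd, 0, 0, advice)


def read_all(path, sequential=True, chunk=8192):
    """Read a file in 8 kB chunks after hinting the expected access pattern."""
    fd = os.open(path, os.O_RDONLY)
    try:
        advise(fd, os.POSIX_FADV_SEQUENTIAL if sequential else os.POSIX_FADV_RANDOM)
        total = 0
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            total += len(data)
        return total
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Throwaway file so the example runs anywhere os.posix_fadvise exists.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 100_000)
    try:
        print(read_all(tmp.name, sequential=True))
        print(read_all(tmp.name, sequential=False))
    finally:
        os.unlink(tmp.name)
```

Passing os.POSIX_FADV_NORMAL to advise() would restore the default behaviour
for the descriptor.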

> to bypass a bunch of tricky logic trying to determine that it's doing a
> sequential read. A sequential scan in PostgreSQL would be such a case.

You do not need to worry about the detection overhead on sequential
scans :) The adaptive readahead framework has a fast code path (the
stateful method) to handle normal sequential reads, for which the
detection is really trivial.
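
Why the sequential case is cheap can be shown with a toy model. This is only
a sketch under assumptions: the class, the doubling rule and the window
sizes are invented for illustration and are not taken from the patch.

```python
class StatefulReadahead:
    """Toy per-file state: remembers where the previous read ended."""

    def __init__(self, initial_window=4, max_window=64):
        self.initial_window = initial_window
        self.max_window = max_window
        self.next_expected = None   # page index right after the previous read
        self.window = 0             # current readahead size in pages

    def on_read(self, index, count=1):
        """Return the readahead size (pages) chosen for this read."""
        if index == self.next_expected:
            # Fast path: one comparison confirms the sequential stream.
            self.window = min(max(self.window * 2, self.initial_window),
                              self.max_window)
        else:
            # Not a continuation: reset and leave it to a slower fallback.
            self.window = 0
        self.next_expected = index + count
        return self.window


ra = StatefulReadahead()
print([ra.on_read(i) for i in range(5)])   # [0, 4, 8, 16, 32]
print(ra.on_read(500))                     # 0: a seek resets the window
```

The whole check is a single integer comparison per read, with no page cache
lookup involved.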

> The opposite example would be an index scan of a highly uncorrelated
> index, which would produce mostly random reads from the table. In that
> case, reading ahead probably makes very little sense, though your logic
> might have a better idea of the access pattern than PostgreSQL does.

As for index scans, the failsafe code path (i.e. the context based
one) will normally be used, and it does have a little overhead from
looking up the page cache (about 0.4% more CPU time). However, the
penalty of random disk access is so large that if it saves even a
small fraction of disk accesses, you win.

Thanks,
Wu

Re: Introducing a new linux readahead framework

From: Markus Schaber
Hi, Wu,

Wu Fengguang wrote:

>>>In adaptive readahead, the context based method may be of particular
>>>interest to postgresql users. It works by peeking into the file cache
>>>and checking whether any history pages are present or accessed. In this
>>>way it can detect almost all forms of sequential / semi-sequential read
>>>patterns, e.g.
>>>    - parallel / interleaved sequential scans on one file
>>>    - sequential reads across file open/close
>>>    - mixed sequential / random accesses
>>>    - sparse / skimming sequential read
>>>
>>>It also has methods to detect some less common cases:
>>>    - reading backward
>>>    - seeking all over the file and reading N pages at each position

Great news, thanks!

> This call will disable readahead totally for fd:
>         posix_fadvise(fd, any, any, POSIX_FADV_RANDOM);
>
> This one will reenable it:
>         posix_fadvise(fd, any, any, POSIX_FADV_NORMAL);
>
> This one will enable readahead _and_ set max readahead window to
> 2*max_readahead_kb:
>         posix_fadvise(fd, any, any, POSIX_FADV_SEQUENTIAL);

I think that this is an easy, understandable and useful interpretation
of posix_fadvise() hints.


Are there any rough estimates when this will get into mainline kernel
(if you intend to submit)?

Thanks,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: Introducing a new linux readahead framework

From: Wu Fengguang
Hi Markus,

On Fri, Apr 21, 2006 at 09:53:34AM +0200, Markus Schaber wrote:
> Are there any rough estimates when this will get into mainline kernel
> (if you intend to submit)?

I'm not quite sure :)

The patch itself has been pretty stable.  To get it accepted, we must
back it with good benchmarking results for some important applications.
I have confirmed that file service via FTP/HTTP/NFS benefits from it to
a greater or lesser degree. However, database services have not been
tested yet. Oracle/DB2 seem to bypass the readahead code path, while
postgresql relies totally on the kernel readahead logic. So if postgresql
is shown to work well with this patch, it will have a good chance of
going into mainline :)

Thanks,
Wu

Re: Introducing a new linux readahead framework

From: "Jim C. Nasby"
On Fri, Apr 21, 2006 at 08:20:28PM +0800, Wu Fengguang wrote:
> Hi Markus,
>
> On Fri, Apr 21, 2006 at 09:53:34AM +0200, Markus Schaber wrote:
> > Are there any rough estimates when this will get into mainline kernel
> > (if you intend to submit)?
>
> I'm not quite sure :)
>
> The patch itself has been pretty stable.  To get it accepted, we must
> back it with good benchmarking results for some important applications.
> I have confirmed that file service via FTP/HTTP/NFS benefits from it to
> a greater or lesser degree. However, database services have not been
> tested yet. Oracle/DB2 seem to bypass the readahead code path, while
> postgresql relies totally on the kernel readahead logic. So if postgresql
> is shown to work well with this patch, it will have a good chance of
> going into mainline :)

IIRC Mark from OSDL said he'd try testing this when he gets a chance,
but you could also try running dbt2 and dbt3 against it.

Re: Introducing a new linux readahead framework

From: Wu Fengguang
On Fri, Apr 21, 2006 at 01:34:24PM -0500, Jim C. Nasby wrote:
> IIRC Mark from OSDL said he'd try testing this when he gets a chance,
> but you could also try running dbt2 and dbt3 against it.

Thanks for the info, I'll look into them.

Regards,
wu

Re: Introducing a new linux readahead framework

From: Michael Stone
From my initial testing this is very promising for a postgres server.
Benchmark-wise, a simple dd with an 8k blocksize gets ~200MB/s as
compared to ~140MB/s on the same hardware without the patch. Also, that
200MB/s seems to be unaffected by the dd blocksize, whereas without the
patch a 512k blocksize would get ~100MB/s. I'm now watching to see how
it does over a couple of days on real-world workloads.
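
A dd-style sequential read test of this kind can be approximated with a short
script. This is a sketch, not the command used above: the function name and
the temporary test file are assumptions for illustration.

```python
import os
import tempfile
import time


def sequential_read_rate(path, block_size):
    """Read `path` front to back in `block_size` chunks; return (bytes, MB/s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:   # unbuffered, like dd's read() loop
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += len(block)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval
    return total, total / elapsed / 1e6


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
    try:
        for bs in (8 * 1024, 512 * 1024):
            nbytes, rate = sequential_read_rate(tmp.name, bs)
            print(f"bs={bs}: {nbytes} bytes at {rate:.0f} MB/s")
    finally:
        os.unlink(tmp.name)
```

A freshly written file is served from the page cache, so a meaningful
comparison of readahead behaviour needs a file much larger than RAM or a
cold cache before each run.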

Mike Stone

Re: Introducing a new linux readahead framework

From: Steve Poe
I found an average 14% improvement using Pg 7.4.11 with odbc-bench as my
test bed with Wu's kernel patch. I have not tried version 8.x yet.

Thanks Wu.

Steve Poe

Using Postgresql 7.4.11, on a dual Opteron with 4GB

Re: Introducing a new linux readahead framework

From: "Jim C. Nasby"
(including bizgres-general)

Has anyone done any testing on bizgres? It's got some patches that
eliminate a lot of IO bottlenecks, so it might present even larger
gains.

On Wed, Apr 26, 2006 at 03:08:59PM -0500, Steve Poe wrote:
> I found an average 14% improvement using Pg 7.4.11 with odbc-bench as my
> test bed with Wu's kernel patch. I have not tried version 8.x yet.

Re: [Bizgres-general] Introducing a new linux

From: "Luke Lonergan"
Jim,

I'm thinking about it. We're already using a fixed read-ahead of 16MB, set with blockdev on the stock Redhat 2.6.9 kernel. It would be nice not to have to set this, so we may try it.
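
For anyone reproducing a fixed setting like this, the unit conversion is easy
to get wrong: blockdev --setra counts 512-byte sectors, while read_ahead_kb
is in kilobytes. A small sketch of the arithmetic, assuming 16MB means
16 * 1024 * 1024 bytes (the function name is made up for this example):

```python
SECTOR_BYTES = 512  # blockdev --setra counts 512-byte sectors


def readahead_settings(megabytes):
    """Convert a readahead size in MB to blockdev sectors and read_ahead_kb."""
    size_bytes = megabytes * 1024 * 1024
    return {
        "setra_sectors": size_bytes // SECTOR_BYTES,
        "read_ahead_kb": size_bytes // 1024,
    }


print(readahead_settings(16))  # {'setra_sectors': 32768, 'read_ahead_kb': 16384}
```

So a 16MB readahead corresponds to a --setra value of 32768, or 16384 in
read_ahead_kb.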

- Luke


On 4/26/06 3:28 PM, "Jim C. Nasby" <jnasby@pervasive.com> wrote:

> (including bizgres-general)
>
> Has anyone done any testing on bizgres? It's got some patches that
> eliminate a lot of IO bottlenecks, so it might present even larger
> gains.

Re: [Bizgres-general] Introducing a new linux

From: Michael Stone
On Wed, Apr 26, 2006 at 04:33:40PM -0700, Luke Lonergan wrote:
>I'm thinking about it, we're already using a fixed read-ahead of 16MB using
>blockdev on the stock Redhat 2.6.9 kernel, it would be nice to not have to
>set this so we may try it.

FWIW, I never saw much performance difference from doing that. Wu's
patch, OTOH, gave a big boost.

Mike Stone

Re: Introducing a new linux readahead framework

From: Michael Stone
On Wed, Apr 26, 2006 at 10:43:48AM -0400, Michael Stone wrote:
>patch a 512k blocksize would get ~100MB/s. I'm now watching to see how
>it does over a couple of days on real-world workloads.

I've got one DB where the VACUUM ANALYZE generally takes 11M-12M ms;
with the patch the job took 1.7M ms. Another VACUUM that normally takes
between 300k and 500k ms took 150k ms. Definitely a promising addition.
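
To put those millisecond figures in perspective, here is the arithmetic as a
short script. It uses only the numbers quoted above; the helper names are
made up for this example.

```python
def speedup(before_ms, after_ms):
    """How many times faster the run became."""
    return before_ms / after_ms


def to_minutes(ms):
    """Convert milliseconds to minutes."""
    return ms / 60_000


# First VACUUM ANALYZE: 11M-12M ms before, 1.7M ms after.
print(round(speedup(11_000_000, 1_700_000), 1))   # 6.5
print(round(speedup(12_000_000, 1_700_000), 1))   # 7.1
print(round(to_minutes(11_000_000)), round(to_minutes(1_700_000)))  # 183 28

# Second VACUUM: 300k-500k ms before, 150k ms after.
print(round(speedup(300_000, 150_000), 1))        # 2.0
print(round(speedup(500_000, 150_000), 1))        # 3.3
```

In other words, the first job went from roughly three hours to under half an
hour, and the second ran two to three times faster.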

Mike Stone