Re: Sequential Scan Read-Ahead - Mailing list pgsql-hackers

From Curt Sampson
Subject Re: Sequential Scan Read-Ahead
Msg-id Pine.NEB.4.43.0204251118040.445-100000@angelic.cynic.net
In response to Re: Sequential Scan Read-Ahead  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses Re: Sequential Scan Read-Ahead  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Sequential Scan Read-Ahead  (Bruce Momjian <pgman@candle.pha.pa.us>)
Re: Sequential Scan Read-Ahead  (Lincoln Yeoh <lyeoh@pop.jaring.my>)
List pgsql-hackers
On Wed, 24 Apr 2002, Bruce Momjian wrote:

> >     1. Not all systems do readahead.
>
> If they don't, that isn't our problem.  We expect it to be there, and if
> it isn't, the vendor/kernel is at fault.

It is your problem when another database kicks Postgres' ass
performance-wise.

And at that point, *you're* at fault. You're the one who's knowingly
decided to do things inefficiently.

Sorry if this sounds harsh, but this, "Oh, someone else is to blame"
attitude gets me steamed. It's one thing to say, "We don't support
this." That's fine; there are often good reasons for that. It's a
completely different thing to say, "It's an unrelated entity's fault we
don't support this."

At any rate, relying on the kernel to guess how to optimise for
the workload will never work as well as having the software that
knows the workload do the optimization.

The lack of support thing is no joke. Sure, lots of systems nowadays
support unified buffer cache and read-ahead. But how many, besides
Solaris, support free-behind, which is also very important to avoid
blowing out your buffer cache when doing sequential reads? And who
at all supports read-ahead for reverse scans? (Or does Postgres
not do those, anyway? I can see the support is there.)
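
To make read-ahead and free-behind concrete, here is a minimal sketch
of an application driving both itself during a sequential scan. It
assumes a POSIX system where Python exposes os.posix_fadvise; that
call is my choice for illustration, not something proposed in this
thread, and the kernel is free to ignore the hints. The function name
and the 64K window are hypothetical.

```python
import os

WINDOW = 64 * 1024  # hypothetical read-ahead window, chosen for illustration


def _advise(fd, offset, length, advice_name):
    """Pass a hint to the kernel if the platform supports it; hints are optional."""
    if not hasattr(os, "posix_fadvise"):
        return
    try:
        os.posix_fadvise(fd, offset, length, getattr(os, advice_name))
    except OSError:
        pass  # advisory only: a refused hint must not break the scan


def scan_with_hints(path, window=WINDOW):
    """Read `path` front to back, hinting read-ahead and free-behind.

    Returns the total number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            # Read-ahead: ask for the window after the one we are about to read.
            _advise(fd, offset + window, window, "POSIX_FADV_WILLNEED")
            data = os.pread(fd, window, offset)
            if not data:
                break
            total += len(data)
            # Free-behind: the data just consumed will not be needed again.
            _advise(fd, offset, len(data), "POSIX_FADV_DONTNEED")
            offset += len(data)
    finally:
        os.close(fd)
    return total
```

Note that this still only hints at the kernel's cache; it does not
give the application control over what actually stays in memory.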

And even when the facilities are there, you create problems by
using them.  Look at the OS buffer cache, for example. Not only do
we lose efficiency by using two layers of caching, but (as people
have pointed out recently on the lists), the optimizer can't even
know how much or what is being cached, and thus can't make decisions
based on that.

> Yes, seek() in file will turn off read-ahead.  Grabbing bigger chunks
> would help here, but if you have two people already reading from the
> same file, grabbing bigger chunks of the file may not be optimal.

Grabbing bigger chunks is always optimal, AFAICT, if they're not
*too* big and you use the data. A single 64K read takes very little
longer than a single 8K read.
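
As a rough illustration of the difference in read calls, the sketch
below reads a whole file with a fixed chunk size and counts how many
read() calls that takes. The function name is hypothetical, and it
counts calls only; it does not measure time, so it says nothing by
itself about how long a 64K read takes compared with an 8K one.

```python
import os


def read_file(path, chunk_size):
    """Read all of `path` using reads of `chunk_size` bytes.

    Returns (bytes_read, read_calls). The final call that returns
    no data (end of file) is included in the count.
    """
    total = 0
    calls = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk_size)
            calls += 1
            if not data:
                break
            total += len(data)
    finally:
        os.close(fd)
    return total, calls
```

Calling read_file(path, 8192) and read_file(path, 65536) on the same
file returns the same byte count, with the 64K version needing about
an eighth as many calls.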

> >     3. Even when the read-ahead does occur, you're still doing more
> >     syscalls, and thus more expensive kernel/userland transitions, than
> >     you have to.
>
> I would guess the performance impact is minimal.

If it were minimal, people wouldn't work so hard to build multi-level
thread systems, where multiple userland threads are scheduled on
top of kernel threads.

However, it does depend on how much CPU your particular application
is using. You may have it to spare.

>     http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html

Well, this message has some points in it that I feel are just incorrect.
   1. It is *not* true that you have no idea where data is when
   using a storage array or other similar system. While you
   certainly ought not worry about things such as head positions
   and so on, it's been a given for a long, long time that two
   blocks that have close index numbers are going to be close
   together in physical storage.

   2. Raw devices are quite standard across Unix systems (except
   in the unfortunate case of Linux, which I think has been
   remedied, hasn't it?). They're very portable, and have just as
   well--if not better--defined write semantics as a filesystem.

   3. My observations of OS performance tuning over the past six
   or eight years contradict the statement, "There's a considerable
   cost in complexity and code in using "raw" storage too, and
   it's not a one off cost: as the technologies change, the "fast"
   way to do things will change and the code will have to be
   updated to match." While optimizations have been removed over
   the years, the basic optimizations (order reads by block number,
   do larger reads rather than smaller, cache the data) have
   remained unchanged for a long, long time.

   4. "Better to leave this to the OS vendor where possible, and
   take advantage of the tuning they do." Well, sorry guys, but
   have a look at the tuning they do. It hasn't changed in years,
   except to remove now-unnecessary complexity related to really,
   really old and slow disk devices, and to add a few things that
   guess workload but still do a worse job than if the workload
   generator just did its own optimisations in the first place.
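
Two of the basic optimizations named in point 3 (order reads by block
number, do larger reads rather than smaller) can be sketched in a few
lines. This is only an illustration under my own assumptions: the
function name and the max_run limit are hypothetical, and it plans
reads without issuing any I/O.

```python
def coalesce_blocks(block_numbers, max_run=8):
    """Sort requested block numbers and merge adjacent ones into runs.

    Returns a list of (start_block, block_count) tuples, each of which
    could be served by one larger read. `max_run` caps the run length
    so that no single read gets too big.
    """
    runs = []
    for block in sorted(set(block_numbers)):
        if runs:
            start, count = runs[-1]
            # Extend the current run if this block directly follows it.
            if block == start + count and count < max_run:
                runs[-1] = (start, count + 1)
                continue
        runs.append((block, 1))
    return runs
```

For example, coalesce_blocks([7, 3, 4, 5, 20, 6]) returns
[(3, 5), (20, 1)]: six block requests become two reads, issued in
ascending block order.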
 

>     http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html

Well, this one, with statements like "Postgres does have control
over its buffer cache," I don't know what to say. You can interpret
the statement however you like, but in the end Postgres has very
little control at all over how data is moved between memory and disk.

BTW, please don't take me as saying that all control over physical
IO should be done by Postgres. I just think that Postgres could do
a better job of managing data transfer between disk and memory than
the OS can. The rest of the things (using raw partitions, read-ahead,
free-behind, etc.) just drop out of that one idea.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
 


