Re: Sequential Scan Read-Ahead - Mailing list pgsql-hackers

From Bruce Momjian
Subject Re: Sequential Scan Read-Ahead
Date
Msg-id 200204250404.g3P44OI19061@candle.pha.pa.us
Whole thread Raw
In response to Re: Sequential Scan Read-Ahead  (Curt Sampson <cjs@cynic.net>)
Responses Re: Sequential Scan Read-Ahead  (Curt Sampson <cjs@cynic.net>)
List pgsql-hackers
Well, this is a very interesting email.  Let me comment on some points.


---------------------------------------------------------------------------

Curt Sampson wrote:
> On Wed, 24 Apr 2002, Bruce Momjian wrote:
> 
> > >     1. Not all systems do readahead.
> >
> > If they don't, that isn't our problem.  We expect it to be there, and if
> > it isn't, the vendor/kernel is at fault.
> 
> It is your problem when another database kicks Postgres' ass
> performance-wise.
> 
> And at that point, *you're* at fault. You're the one who's knowingly
> decided to do things inefficiently.

It is just hard to imagine an OS not doing read-ahead, at least in
simple cases.

> Sorry if this sounds harsh, but this, "Oh, someone else is to blame"
> attitude gets me steamed. It's one thing to say, "We don't support
> this." That's fine; there are often good reasons for that. It's a
> completely different thing to say, "It's an unrelated entity's fault we
> don't support this."

Well, we are guilty of trying to push as much as possible on to other
software.  We do this for portability reasons, and because we think our
time is best spent dealing with db issues, not issues that can be dealt
with by other existing software, as long as that software is decent.

> At any rate, relying on the kernel to guess how to optimise for
> the workload will never work as well as having the software that
> knows the workload do the optimization.

Sure, that is certainly true.  However, it is hard to know what the
future will hold even if we had perfect knowledge of what was happening
in the kernel.  We don't know who else is going to start doing I/O once
our I/O starts.  We may have a better idea with kernel knowledge, but we
still don't know 100% what will be cached.

> The lack of support thing is no joke. Sure, lots of systems nowadays
> support unified buffer cache and read-ahead. But how many, besides
> Solaris, support free-behind, which is also very important to avoid

We have free-behind on our list.  I think LRU-K will do this quite well
and be a nice general solution for more than just sequential scans.
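Roughly, LRU-K evicts the page whose K-th most recent reference is oldest; pages a sequential scan touches only once never acquire a second reference, so they are evicted first and cannot flush the hot set -- which is the free-behind effect.  A toy LRU-2 sketch (illustrative only, nothing like the actual buffer manager code):

```python
class LRU2Cache:
    """Toy LRU-2 replacement: evict the page whose second-most-recent
    access time is oldest.  Pages seen only once have no second access
    (recorded as time 0, i.e. infinitely old), so a long sequential
    scan recycles its own pages instead of displacing hot ones."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = 0
        self.hist = {}  # page -> (last_access, second_last_access)

    def access(self, page):
        self.clock += 1
        if page in self.hist:
            last, _ = self.hist[page]
            self.hist[page] = (self.clock, last)
            return True                       # cache hit
        if len(self.hist) >= self.capacity:
            # Victim = oldest second-to-last access; once-read pages
            # (second_last == 0) always lose to twice-read pages.
            victim = min(self.hist, key=lambda p: self.hist[p][1])
            del self.hist[victim]
        self.hist[page] = (self.clock, 0)
        return False                          # cache miss
```

With a capacity of 3, touching pages 1 and 2 twice each and then scanning ten fresh pages once apiece leaves 1 and 2 resident: each scan page evicts the previous once-read scan page, never the reused ones.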

> blowing out your buffer cache when doing sequential reads? And who
> at all supports read-ahead for reverse scans? (Or does Postgres
> not do those, anyway? I can see the support is there.)
> 
> And even when the facilities are there, you create problems by
> using them.  Look at the OS buffer cache, for example. Not only do
> we lose efficiency by using two layers of caching, but (as people
> have pointed out recently on the lists), the optimizer can't even
> know how much or what is being cached, and thus can't make decisions
> based on that.

Again, are you going to know 100% anyway?

> 
> > Yes, seek() in file will turn off read-ahead.  Grabbing bigger chunks
> > would help here, but if you have two people already reading from the
> > same file, grabbing bigger chunks of the file may not be optimal.
> 
> Grabbing bigger chunks is always optimal, AFAICT, if they're not
> *too* big and you use the data. A single 64K read takes very little
> longer than a single 8K read.

There may be validity in this.  It is easy to do (I think) and could be
a win.

> > >     3. Even when the read-ahead does occur, you're still doing more
> > >     syscalls, and thus more expensive kernel/userland transitions, than
> > >     you have to.
> >
> > I would guess the performance impact is minimal.
> 
> If it were minimal, people wouldn't work so hard to build multi-level
> thread systems, where multiple userland threads are scheduled on
> top of kernel threads.
> 
> However, it does depend on how much CPU your particular application
> is using. You may have it to spare.

I assume those apps are doing tons of kernel calls.  I don't think we
really do that many.

> >     http://candle.pha.pa.us/mhonarc/todo.detail/performance/msg00009.html
> 
> Well, this message has some points in it that I feel are just incorrect.
> 
>     1. It is *not* true that you have no idea where data is when
>     using a storage array or other similar system. While you
>     certainly ought not worry about things such as head positions
>     and so on, it's been a given for a long, long time that two
>     blocks that have close index numbers are going to be close
>     together in physical storage.

SCSI drivers, for example, are pretty smart.  Not sure we can take
advantage of that from user-land I/O.
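That said, the locality point can be exploited even from user land: if blocks with close numbers are close on disk, then sorting a batch of block requests and coalescing consecutive runs hands the driver mostly sequential work, much like an elevator scheduler does.  A sketch (the helper and its (start, count) interface are made up for illustration):

```python
def schedule_reads(block_numbers):
    """Order a batch of block requests by ascending block number, the
    way a disk elevator would, and coalesce runs of consecutive
    blocks into single larger (start_block, count) requests."""
    runs = []
    for b in sorted(set(block_numbers)):
        if runs and b == runs[-1][0] + runs[-1][1]:
            # Extends the previous run: grow it instead of seeking.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs
```

A scattered request list like [9, 3, 4, 5, 17, 10] becomes three ordered runs, (3, 3), (9, 2), (17, 1), so six random reads collapse into three mostly sequential ones without the application knowing anything about head positions.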

>     2. Raw devices are quite standard across Unix systems (except
>     in the unfortunate case of Linux, which I think has been
>     remedied, hasn't it?). They're very portable, and have just as
>     well--if not better--defined write semantics as a filesystem.

Yes, but we are seeing some db's moving away from raw I/O.  Our
performance numbers beat most of the big db's already, so we must be
doing something right.  In fact, our big failing is more the missing
features and limitations of our db, rather than performance.

>     3. My observations of OS performance tuning over the past six
>     or eight years contradict the statement, "There's a considerable
>     cost in complexity and code in using "raw" storage too, and
>     it's not a one off cost: as the technologies change, the "fast"
>     way to do things will change and the code will have to be
>     updated to match." While optimizations have been removed over
>     the years the basic optimizations (order reads by block number,
>     do larger reads rather than smaller, cache the data) have
>     remained unchanged for a long, long time.

Yes, but should we spend our time doing that?  Is the payoff worth it,
vs. working on other features?  Sure, it would be great to have all
these fancy things, but is this where our time should be spent,
considering other items on the TODO list?

>     4. "Better to leave this to the OS vendor where possible, and
>     take advantage of the tuning they do." Well, sorry guys, but
>     have a look at the tuning they do. It hasn't changed in years,
>     except to remove now-unnecessary complexity related to really,
>     really old and slow disk devices, and to add a few things that
>     guess workload but still do a worse job than if the workload
>     generator just did its own optimisations in the first place.
> 
> >     http://candle.pha.pa.us/mhonarc/todo.detail/optimizer/msg00011.html
> 
> Well, this one, with statements like "Postgres does have control
> over its buffer cache," I don't know what to say. You can interpret
> the statement however you like, but in the end Postgres has very
> little control at all over how data is moved between memory and disk.
> 
> BTW, please don't take me as saying that all control over physical
> IO should be done by Postgres. I just think that Postgres could do
> a better job of managing data transfer between disk and memory than
> the OS can. The rest of the things (using raw partitions, read-ahead,
> free-behind, etc.) just drop out of that one idea.

Yes, clearly there is benefit in these, and some of them, like
free-behind, have already been tested, though not committed.

Jumping in and doing the I/O ourselves is a big undertaking, and looking
at our TODO list, I am not sure if it is worth it right now.

Of course, if we had 4 TODO items, I would be much more interested in at
least trying to see how much gain we could get.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 

