Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans - Mailing list pgsql-general

From Gerhard Wiesinger
Subject Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
Msg-id alpine.LFD.2.00.0910091747480.13746@bbs.intern
In response to Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans  (Greg Smith <gsmith@gregsmith.com>)
List pgsql-general
On Fri, 9 Oct 2009, Greg Smith wrote:

> On Sat, 3 Oct 2009, Gerhard Wiesinger wrote:
>
>> I wouldn't read 128k blocks all the time. I would do the following:
>> When e.g. B0, B127, B256 are needed, I would read them as 8k random block
>> I/O.
>>
>> When B1, B2, B3, B4, B5, B7, B8, B9, B10 are needed I would make 2 requests
>> with the largest possible blocksize:
>> 1.) B1-B5: 5*8k=40k
>> 2.) B7-B10: 4*8k=32k
>
> I see what you mean now.  This is impossible in the current buffer manager
> implementation because blocks are requested one at a time, and there are few
> situations where you can predict which are going to be wanted next. The hash
> index and sequential scan are two that were possible to predict in that way.
>
> The fadvise patches already committed didn't change the way blocks were read
> in, they just used knowledge about what was coming next to advise the OS.
> That's quite a bit different from actually asking for things in larger chunks
> and only using what you need.
>
> Implementing larger chunking reads or similar asynchronous batch I/O is a big
> project, because you'd have to rearchitect the whole way buffers are managed
> in the database to do it right.  Greg Stark's earliest proof of concept
> prototype for async I/O included a Solaris implementation that used the AIO
> library.  It wasn't feasible to actually use that underlying implementation in
> the database in the end though, because the model AIO uses expects you'll
> fire off a bunch of I/O and then retrieve blocks as they come in.  That's
> really not easy to align with the model for how blocks are read into
> shared_buffers right now.  He had some ideas for that and I've thought
> briefly about the problem, but it would be a major overhaul to some
> scary-to-touch database internals to pull off.
>
> Given that the OS and/or RAID implementations tend to do what we want in a
> lot of these cases, where smarter/chunkier read-ahead is what you need,
> the payback on accelerating those cases hasn't been perceived as that great.
> There is a major win for the hash index reads, which Solaris systems can't
> take advantage of, so somebody who uses those heavily on that OS might be
> motivated enough to produce improvements for that use case. Once the buffer
> cache at large understood how to handle batching async reads, Solaris AIO
> would be possible, fancier stuff with Linux AIO would be possible, and the
> type of chunking reads you're suggesting would be too.  But none of that is
> happening without some major rearchitecting first.
>
> Unfortunately there aren't that many people with the right knowledge and
> motivation to start tinkering around with the buffer cache internals to the
> extent that would be required to do better here, and pretty much all of them I'm
> aware of are hacking on projects with a much clearer payback instead.
>

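For illustration, the advisory approach Greg describes (the committed fadvise patches) amounts to handing the kernel a WILLNEED hint for blocks that will be wanted soon, without changing how they are read in later. A minimal sketch, assuming an already-open file descriptor and the 8k block size; prefetch_blocks() is a made-up helper name, not the actual PostgreSQL code:

```c
/* Sketch of an fadvise-style prefetch hint: tell the kernel which 8k
 * blocks will be wanted soon, without changing how the blocks are
 * actually read in afterwards.  prefetch_blocks() is a hypothetical
 * helper for illustration, not PostgreSQL's code. */
#define _XOPEN_SOURCE 600       /* for posix_fadvise() */
#include <fcntl.h>

#define BLCKSZ 8192

/* Advise the OS that blocks [first, last] of fd will be needed soon. */
static int prefetch_blocks(int fd, unsigned first, unsigned last)
{
    return posix_fadvise(fd,
                         (off_t) first * BLCKSZ,
                         (off_t) (last - first + 1) * BLCKSZ,
                         POSIX_FADV_WILLNEED);
}
```

The call is purely advisory: the later 8k reads still go through the normal one-block-at-a-time path, they just find the data already in the OS cache.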
I have one idea which is not ideal, but may work and shouldn't take much
effort to implement:
As in the example above, we would read B1-B5 and B7-B10 at a higher level,
outside of normal buffer management, with large request sizes (e.g. where
hash index scans and sequential scans are done). Since the blocks are then
in the OS cache, normal buffer management is very fast:
1.) B1-B5: 5*8k=40k
2.) B7-B10: 4*8k=32k

So for 1.) we read B1-B5 in one 40k request (typically from disk);
afterwards we read B1, B2, B3, B4, B5 again in 8k chunks, this time from
cache.

The disadvantage is of course that we issue more read requests, but
performance may still be better because the normal buffer requests are
served from cache. A second disadvantage is the "bad" design.

But I think performance will be even better. And a configuration option to
enable this might also be interesting.

Maybe I will try that with pgiosim to see whether performance is better or not.

What do you think about it?
Is the idea clear?

Thnx.

Ciao
Gerhard

--
http://www.wiesinger.com/
