Thread: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans

From: Gerhard Wiesinger
Hello,

Since blocksize, random I/O, and sequential I/O are critical I/O
performance parameters, I had a look at PostgreSQL and at a commercial
database vendor's product.

Therefore I enhanced the following SystemTap script:
http://www.wiesinger.com/opensource/systemtap/disktop_gw.stp

Output per 5 seconds during a sequential scan:
     UID      PID     PPID                       CMD   DEVICE    T        BYTES     REQUESTS    BYTES/REQ
      26     4263     4166                postmaster     dm-1    R    168542208        20574         8192
=> 32MB/s

So I saw that even on sequential reads (and also on bitmap heap scan
access) PostgreSQL issues only 8k requests. I think that's a major I/O
bottleneck.

A commercial database vendor solved the problem by combining multiple
contiguous 8k blocks into one larger read, up to a maximum threshold.
Output per 5 seconds on an equivalent "sequential scan":
     UID      PID     PPID                       CMD   DEVICE    T        BYTES     REQUESTS    BYTES/REQ
    1001     5381        1                   process     dm-1    R    277754638         2338       118800
=> 53 MB/s
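The MB/s figures follow directly from the BYTES column divided by the
5-second sampling interval (a quick sanity check, assuming 1 MB = 2^20
bytes):

```python
# Throughput implied by the systemtap samples above: BYTES per 5-second
# interval, converted to MiB/s.
def mb_per_s(bytes_per_sample, interval_s=5):
    return bytes_per_sample / interval_s / 2**20

print(round(mb_per_s(168542208)))  # PostgreSQL, 8k requests -> 32
print(round(mb_per_s(277754638)))  # commercial DB, ~118k requests -> 53
```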

A Google search shows that Gregory Stark has already worked on this issue
(see references below), but as far as I can see only for bitmap heap
scans.

I think this is one of the most critical performance showstoppers for
PostgreSQL on the I/O side.

What's the current status of Gregory Stark's patch? Is there a timeframe
for integrating it?
Does it also work for sequential scans? Are there plans for a generic
"multi block read count" solution?

Any comments?

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/

http://wiki.postgresql.org/wiki/Todo#Concurrent_Use_of_Resources
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00027.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00395.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00092.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00098.php

http://archives.postgresql.org/pgsql-hackers/2006-10/msg00820.php

http://markmail.org/message/a5osy4qptxk2jgu3#query:+page:1+mid:hz7uzhwxtkbzncy2+state:results
http://markmail.org/message/a5osy4qptxk2jgu3#query:+page:1+mid:a5osy4qptxk2jgu3+state:results

On Sun, 27 Sep 2009, Gerhard Wiesinger wrote:

> I think this is one of the most critical performance showstoppers for
> PostgreSQL on the I/O side.

I wish; this is an easy problem compared to the really important ones that
need to be resolved.  Situations where the OS is capable of faster
sequential I/O than PostgreSQL appears to deliver on reads are often
caused by something other than what the person doing the benchmarking
believes.  For example, the last time I thought I had a smoking-gun
situation just like the one you're describing, it turned out the
background operation I didn't know was going on that slowed things down
was hint bit updates:  http://wiki.postgresql.org/wiki/Hint_Bits

Background checkpoints can also cause this; typically, if you set
checkpoint_segments really high and watch when checkpoints happen, you can
keep them from interfering with results too.

It's hard to isolate the cause of issues like this.  Since most people
seem to get something close to raw disk speed from sequential scans when
measured properly, I would suggest starting with the assumption that
there's something wrong with your test case rather than with PostgreSQL.
The best way to do that is to construct a test case others can run that
shows the same problem on other systems, using the database itself.  The
easiest way to build one is to use generate_series to create some bogus
test data, SELECT everything in the table with \timing on, and then use
the size of the relation on disk to estimate MB/s.
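The last step is simple arithmetic; as a sketch with made-up example
numbers (pg_relation_size reports the on-disk size in bytes, \timing
reports milliseconds):

```python
# Estimate sequential-scan throughput from relation size and \timing
# output.  The inputs here are illustrative values, not measured results.
def scan_mb_per_s(relation_bytes, timing_ms):
    return relation_bytes / (timing_ms / 1000) / 2**20

# e.g. a 1 GiB relation scanned in 10 seconds -> ~102 MiB/s
print(round(scan_mb_per_s(2**30, 10_000)))
```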

Regardless, it's easy enough to build PostgreSQL with larger block sizes
if you think that really matters for your situation.  You're never going
to see that in the mainstream version though, because there are plenty of
downsides to using larger blocks.  And since the database doesn't actually
know where on disk things are, it's not really in a good position to make
decisions about I/O scheduling anyway.  More on that below.

> What's the current status of the patch of Gregory Stark? Any timeframes to
> integrate?

There needs to be a fairly major rearchitecting of how PostgreSQL handles
incoming disk I/O before that can go anywhere, and I don't believe that's
expected to be ready in the near future.

> Does it also work for sequential scans? Any plans for a generic "multi block
> read count" solution?

There was a similar patch for sequential scans submitted by someone else
based on that work.  It was claimed to help performance on a Linux system
with a rather poor disk I/O setup.  No one else was able to replicate any
performance improvement using the patch though.  As far as I've been able
to tell, the read-ahead logic being done by the Linux kernel and in some
hardware is already doing this sort of optimization for you on that OS,
whether or not your app knows enough to recognize it's sequentially
scanning the disk it's working against.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

On Fri, 2 Oct 2009, Greg Smith wrote:

> [Greg Smith's message above, quoted in full; trimmed]

I've enhanced the pgiosim project http://pgfoundry.org/projects/pgiosim/
with a patch for larger blocksizes independent from PostgreSQL:
http://www.wiesinger.com/opensource/pgiosim/pgiosim-0.2-blocksizes.diff

You'll find detailed results below and can verify them on your platforms
with the patch above; maybe someone can reproduce this on different HW/SW
platforms.  If you have any questions about pgiosim or the patch, feel
free to ask.

Summary:
Random I/O with larger block sizes, e.g. 128k (as in a BITMAP HEAP SCAN),
performs better than reading the same blocks with 8k requests (factor 1.5).

Conclusion:
In the test scenario the proposed solution would give a performance gain
of a factor of 1.5 for typical BITMAP HEAP SCANs.  For the other scenarios
no performance gain from reading contiguous blocks with larger block sizes
could be measured.  Therefore I assume that prefetching works well on
Linux with sequential I/O but not with random I/O.
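The factor of 1.5 can be read straight off the two random-I/O throughput
numbers in the results below (128k read as one request vs. the same 128k
read as sixteen 8k chunks):

```python
# Ratio of the two random-I/O throughputs measured below.
whole = 7972.15    # kB/s, blocksize=131072 read as a whole
chunked = 5209.81  # kB/s, blocksize=131072 read as 8k chunks
print(round(whole / chunked, 2))  # -> 1.53
```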

I hope I can convince someone that such optimizations make sense, as
commercial database vendors have implemented such features for performance
reasons.

BTW: Prefetching is enabled on the RAID and block devices.

Ciao,
Gerhard

--
http://www.wiesinger.com/

# RANDOM I/O 8k blocksize
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 test.txt
Arg: 1
Added test.txt
blocksize=8192, reading block as a whole
Elapsed: 135.92
Read 10000 blocks Wrote 0 blocks
73.57 op/sec, 588.60kB/sec

# RANDOM I/O 8k blocksize (for verification only), in fact same test as below
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 -r test.txt
Arg: 1
Added test.txt
blocksize=8192, doing single read requests with chunk size of 8192 bytes
Elapsed: 136.30
Read 10000 blocks Wrote 0 blocks
73.37 op/sec, 586.94kB/sec

# RANDOM I/O 128k blocksize, read as one 128k block
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 -o 131072 test.txt
Arg: 1
Added test.txt
blocksize=131072, reading block as a whole
Elapsed: 160.56
Read 10000 blocks Wrote 0 blocks
62.28 op/sec, 7972.15kB/sec

# RANDOM I/O 128k blocksize, read as multiple 8k blocks sequentially
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 -o 131072 -r test.txt
Arg: 1
Added test.txt
blocksize=131072, doing single read requests with chunk size of 8192 bytes
Elapsed: 245.69
Read 10000 blocks Wrote 0 blocks
651.20 op/sec, 5209.81kB/sec

# SEQUENTIAL I/O 128k blocksize, read as multiple 128k blocks
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -s -b 100000 -o 131072 test.txt
Seq Scan
Arg: 1
Added test.txt
blocksize=131072, reading block as a whole
Elapsed: 112.78
Read 100000 blocks Wrote 0 blocks
886.70 op/sec, 113497.67kB/sec

# SEQUENTIAL I/O 128k blocksize, read as multiple 8k blocks sequentially
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -s -b 100000 -o 131072 -r test.txt
Seq Scan
Arg: 1
Added test.txt
blocksize=131072, doing single read requests with chunk size of 8192 bytes
Elapsed: 112.45
Read 100000 blocks Wrote 0 blocks
14228.81 op/sec, 113830.49kB/sec

On Fri, 2 Oct 2009, Greg Smith wrote:

> [Greg Smith's message of 2 Oct 2009, quoted in full above; trimmed]

I forgot to mention:
Larger blocksizes also reduce IOPS (I/Os per second), which might hit a
critical threshold on storage systems (e.g. Fibre Channel systems).  With
large block sizes you would get the full throughput from the storage
(fewer IOPS), but with small block sizes the IOPS limit is reached first
and throughput goes down.

Example:
With 100MB/s and 8k blocks you need 12500 IOPS, which is a lot (e.g. at
least 90 disks at 140 IOPS each)!
When the same data can be read with a 128k block size, 781 IOPS are
sufficient (6 disks)!

So this makes a major difference.
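The arithmetic above, using decimal units (100 MB/s = 100,000 kB/s) and
140 IOPS per disk:

```python
import math

# IOPS needed to sustain a given throughput at a given request size, and
# the number of 140-IOPS disks that implies (decimal units, as above).
def iops_needed(kb_per_s, req_kb):
    return kb_per_s / req_kb

def disks_needed(iops, iops_per_disk=140):
    return math.ceil(iops / iops_per_disk)

print(iops_needed(100_000, 8))                 # 12500.0 IOPS at 8k
print(disks_needed(12500))                     # 90 disks
print(round(iops_needed(100_000, 128)))        # ~781 IOPS at 128k
print(disks_needed(iops_needed(100_000, 128))) # 6 disks
```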

Ciao,
Gerhard

--
http://www.wiesinger.com/

On Fri, 2 Oct 2009, Gerhard Wiesinger wrote:

> Larger blocksizes also reduce IOPS (I/Os per second) which might be a critical
> threshold on storage systems (e.g. Fibre Channel systems).

True to some extent, but don't forget that IOPS is always relative to a
block size in the first place.  If you're getting 200 IOPS with 8K blocks,
increasing your block size to 128K will not result in your getting 200
IOPS at that larger size; the IOPS number at the larger block size is
going to drop too.  And you'll pay the penalty for that IOPS number
dropping every time you're accessing something that would have only been
an 8K bit of I/O before.

The trade-off is very application dependent.  The position you're
advocating, preferring larger blocks, only makes sense if your workload
consists mainly of larger scans.  Someone who is pulling scattered records
from throughout a larger table will suffer from that same change, because
they'll be reading a minimum of 128K even if all they really needed was a
few bytes.  That penalty ripples all the way from the disk I/O upwards
through the buffer cache.

It's easy to generate a synthetic benchmark workload that models some
real-world applications and see performance plunge with a larger block
size.  There certainly are others where a larger block would work better.
Testing either way is complicated by the way RAID devices usually have
their own stripe sizes to consider on top of the database block size.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

On Fri, 2 Oct 2009, Greg Smith wrote:

> On Fri, 2 Oct 2009, Gerhard Wiesinger wrote:
>
>> Larger blocksizes also reduce IOPS (I/Os per second) which might be a
>> critical threshold on storage systems (e.g. Fibre Channel systems).
>
> True to some extent, but don't forget that IOPS is always relative to a block
> size in the first place.  If you're getting 200 IOPS with 8K blocks,
> increasing your block size to 128K will not result in your getting 200 IOPS
> at that larger size; the IOPS number at the larger block size is going to
> drop too.  And you'll pay the penalty for that IOPS number dropping every
> time you're accessing something that would have only been an 8K bit of I/O
> before.
>

Yes, there will be some (very small) drop in IOPS when the blocksize is
larger, but today's disks have plenty of throughput when IOPS*128k is
compared to e.g. 100MB/s.  I've done some spreadsheet calculations which
support this.

> The trade-off is very application dependent.  The position you're advocating,
> preferring larger blocks, only makes sense if your workload consists mainly
> of larger scans.  Someone who is pulling scattered records from throughout a
> larger table will suffer with that same change, because they'll be reading a
> minimum of 128K even if all they really needed was a few bytes.  That
> penalty ripples all the way from the disk I/O upwards through the buffer
> cache.
>

I wouldn't read 128k blocks all the time. I would do the following:
When e.g. B0, B127, B256 should be read, I would issue 8k random block
I/O.

When B1, B2, B3, B4, B5, B7, B8, B9, B10 are needed, I would make 2
requests with the largest possible blocksize:
1.) B1-B5: 5*8k=40k
2.) B7-B10: 4*8k=32k

In this case, since B5 and B7 are only one block apart, we could also
discuss reading B1-B10=10*8k=80k in one read request and simply discarding
B6.

That would reduce the IOPS by a factor of 4-5 in this scenario, and
therefore throughput would go up.
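The coalescing rule can be sketched as follows (a hypothetical helper for
illustration, not PostgreSQL code): merge a sorted list of 8k block
numbers into contiguous read ranges, optionally bridging gaps of up to
max_gap blocks whose contents are read and discarded:

```python
def coalesce(blocks, max_gap=0, max_blocks=128):
    """Merge sorted block numbers into (start, end) read ranges.

    Gaps of up to max_gap blocks are bridged (the extra blocks are read
    and discarded); ranges are capped at max_blocks (128 * 8k = 1MB).
    """
    ranges = []
    start = prev = blocks[0]
    for b in blocks[1:]:
        if b - prev <= max_gap + 1 and b - start < max_blocks:
            prev = b  # extend the current range
        else:
            ranges.append((start, prev))
            start = prev = b
    ranges.append((start, prev))
    return ranges

# The example from the text: B1-B5 and B7-B10 become two requests...
print(coalesce([1, 2, 3, 4, 5, 7, 8, 9, 10]))             # [(1, 5), (7, 10)]
# ...or, bridging the one-block gap at B6, one 80k request:
print(coalesce([1, 2, 3, 4, 5, 7, 8, 9, 10], max_gap=1))  # [(1, 10)]
# Isolated blocks like B0, B127, B256 stay plain 8k random reads:
print(coalesce([0, 127, 256]))  # [(0, 0), (127, 127), (256, 256)]
```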

> It's easy to generate a synthetic benchmark workload that models some
> real-world applications and see performance plunge with a larger block size.
> There certainly are others where a larger block would work better. Testing
> either way is complicated by the way RAID devices usually have their own
> stripe sizes to consider on top of the database block size.
>

Yes, there are block-device read-ahead buffers and also RAID stripe
caches.  But neither seems to work well in the tested HEAP BITMAP SCAN
scenario, nor in practical PostgreSQL performance measurement scenarios.

But the modelled pgiosim workload isn't a purely synthetic benchmark; it
matches a real-world HEAP BITMAP SCAN scenario in PostgreSQL, where some
blocks are directly consecutive, at least logically in the filesystem (and
with some probability also physically on disk), but are currently read
with one request per 8k block even when 2 or more blocks could be read
with a single request.

BTW: I would also cap the blocksize of such requests at some upper limit
(e.g. 1MB).

Ciao,
Gerhard

On Sat, 3 Oct 2009, Gerhard Wiesinger wrote:

> I wouldn't read 128k blocks all the time. I would do the following:
> When e.g. B0, B127, B256 should be read I would read in 8k random block I/O.
>
> When B1, B2, B3, B4, B5, B7, B8, B9, B10 are needed I would make 2 requests
> with the largest possible blocksize:
> 1.) B1-B5: 5*8k=40k
> 2.) B7-B10: 4*8k=32k

I see what you mean now.  This is impossible in the current buffer manager
implementation, because blocks are requested one at a time and there are
few situations where you can predict which blocks are going to be wanted
next.  Hash index scans and sequential scans are two cases where that
prediction was possible.

The fadvise patches already committed didn't change the way blocks were
read in, they just used knowledge about what was coming next to advise the
OS.  That's quite a bit different from actually asking for things in
larger chunks and only using what you need.
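The advise-only approach looks roughly like this at the syscall level (a
minimal Python sketch of posix_fadvise usage for illustration, not the
actual PostgreSQL code):

```python
import os

def read_with_advice(path, offsets, block_size=8192):
    """Advise the kernel about upcoming block reads, then read them one by one.

    POSIX_FADV_WILLNEED only kicks off readahead in the background; each
    block is still fetched by its own read call, exactly as described above.
    """
    with open(path, "rb") as f:
        fd = f.fileno()
        if hasattr(os, "posix_fadvise"):  # Unix-only advisory hint
            for off in offsets:
                os.posix_fadvise(fd, off, block_size, os.POSIX_FADV_WILLNEED)
        # The read pattern itself is unchanged: one request per 8k block.
        return [os.pread(fd, block_size, off) for off in offsets]
```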

Implementing larger chunked reads or similar asynchronous batch I/O is a
big project, because you'd have to rearchitect the whole way buffers are
managed in the database to do it right.  Greg Stark's earliest proof of
concept prototype for async I/O included a Solaris implementation that
used the AIO library.  It wasn't feasible to actually use that underlying
implementation in the database in the end though, because the model AIO
uses expects you to fire off a bunch of I/O and then retrieve blocks as
they come in.  That's really not easy to align with the model for how
blocks are read into shared_buffers right now.  He had some ideas for
that, and I've thought briefly about the problem, but it would be a major
overhaul of some scary-to-touch database internals to pull off.

Given that the OS and/or RAID implementations tend to do what we want in a
lot of these cases, where smarter/chunkier read-ahead is what you need,
the payback on accelerating those cases hasn't been perceived as that
great.  There is a major win for the hash index reads, which Solaris
systems can't take advantage of, so somebody who uses those heavily on
that OS might be motivated enough to produce improvements for that use
case.  Once the buffer cache at large understood how to handle batched
async reads, Solaris AIO would be possible, fancier stuff with Linux AIO
would be possible, and the kind of chunked reads you're suggesting would
be too.  But none of that is happening without some major rearchitecting
first.

Unfortunately there aren't that many people with the right knowledge and
motivation to start tinkering with the buffer cache internals to the
extent that would be required to do better here, and pretty much all of
the ones I'm aware of are hacking on projects with a much clearer payback
instead.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

On Fri, 9 Oct 2009, Greg Smith wrote:

> On Sat, 3 Oct 2009, Gerhard Wiesinger wrote:
>
>> I wouldn't read 128k blocks all the time. I would do the following:
>> When e.g. B0, B127, B256 should be read I would read in 8k random block
>> I/O.
>>
>> When B1, B2, B3, B4, B5, B7, B8, B9, B10 are needed I would make 2 requests
>> with the largest possible blocksize:
>> 1.) B1-B5: 5*8k=40k
>> 2.) B7-B10: 4*8k=32k
>
> [rest of Greg Smith's reply, quoted in full above; trimmed]

I have one idea which is not ideal, but may work and shouldn't be much
effort to implement:
As in the example above, we read B1-B5 and B7-B10 at a higher level,
outside of normal buffer management, with large request sizes (e.g. where
hash index scans and sequential scans are done).  Since the blocks are
then in the OS cache, normal buffer management is very fast:
1.) B1-B5: 5*8k=40k
2.) B7-B10: 4*8k=32k

So for 1.) we read B1-B5 in one 40k request (typically from disk);
afterwards we read B1, B2, B3, B4, B5 in 8k chunks, which now come from
the cache again.
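At the syscall level, this double-read idea can be sketched like so (a
hypothetical helper using Unix-only os.pread for illustration, not
PostgreSQL code):

```python
import os

BLOCK = 8192  # PostgreSQL's block size

def prewarm_then_read(fd, first_block, nblocks):
    """Read a contiguous run once as one large request, then again in 8k
    chunks; the second pass is served from the OS page cache."""
    # 1.) one large read, e.g. B1-B5 as a single 40k request (warms the cache)
    chunk = os.pread(fd, nblocks * BLOCK, first_block * BLOCK)
    # 2.) the normal buffer-manager-style 8k reads, now cache hits
    blocks = [os.pread(fd, BLOCK, (first_block + i) * BLOCK)
              for i in range(nblocks)]
    assert b"".join(blocks) == chunk  # both passes see the same data
    return blocks
```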

The disadvantage is of course that we issue more read requests, but
performance may still be better because the normal buffer-manager reads
are served from the cache.  A second disadvantage is the "bad" design.

Still, I think overall performance will be better.  A configuration option
to enable this might also be interesting.

Maybe I'll test with pgiosim whether performance is better or not.

What do you think about it?
Is the idea clear?

Thnx.

Ciao
Gerhard

--
http://www.wiesinger.com/

Gerhard Wiesinger <lists@wiesinger.com> writes:
> I've one idea, which is not ideal, but may work and shouldn't be much
> effort to implement:
> As in the example above we read B1-B5 and B7-B10 on a higher level outside
> of normal buffer management with large request sizes (e.g. where hash
> index scans and sequential scans are done). As the blocks are now in cache
> normal buffer management is very fast:
> 1.) B1-B5: 5*8k=40k
> 2.) B7-B10: 4*8k=32k

> So we are reading for 1.):
> B1-B5 in one 40k block (typically from disk), afterwards we read B1, B2,
> B3, B4, B5 in 8k chunks from cache again.

Is this really different from, or better than, telling the OS we'll need
those blocks soon via posix_fadvise?

            regards, tom lane