Thread: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans

PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans

From: Gerhard Wiesinger

Hello,

Since block sizes, random I/O and sequential I/O are critical I/O performance
parameters, I had a look at PostgreSQL and a commercial database vendor.

To do so, I enhanced the SystemTap script:
http://www.wiesinger.com/opensource/systemtap/disktop_gw.stp

Output per 5 seconds during a sequential scan:
     UID      PID     PPID                       CMD   DEVICE    T        BYTES REQUESTS    BYTES/REQ
      26     4263     4166                postmaster     dm-1    R    168542208 20574         8192
=> 32MB/s

So I saw that even on sequential reads (and also on bitmap heap scan access)
PostgreSQL uses only 8k blocks. I think that's a major I/O bottleneck.

A commercial database vendor solved the problem by reading multiple contiguous
8k blocks in a single request, up to a maximum threshold. Output per 5
seconds on an equivalent "sequential scan":
     UID      PID     PPID                       CMD   DEVICE    T        BYTES REQUESTS    BYTES/REQ
    1001     5381        1                   process     dm-1    R    277754638 2338       118800
=> 53 MB/s
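
To illustrate what I mean (this is only a sketch I put together for this mail,
not PostgreSQL source and not the vendor's code): when a run of logically
adjacent 8k blocks is requested, the reads can be coalesced into one larger
pread() up to a threshold. The function name and the MAX_COALESCE constant are
made up for the example:

#define _XOPEN_SOURCE 600
#include <stdint.h>
#include <unistd.h>

#define BLCKSZ       8192
#define MAX_COALESCE 16                /* 16 x 8k = 128k per physical read */

/* Read blocks [first, first + nblocks) into dst with as few syscalls as possible. */
ssize_t
read_block_run(int fd, uint32_t first, uint32_t nblocks, char *dst)
{
    ssize_t total = 0;

    while (nblocks > 0)
    {
        uint32_t chunk = (nblocks < MAX_COALESCE) ? nblocks : MAX_COALESCE;
        ssize_t  got = pread(fd, dst + total,
                             (size_t) chunk * BLCKSZ,
                             (off_t) first * BLCKSZ);

        if (got <= 0)
            return got;                /* error or end of file */
        total += got;
        if (got < (ssize_t) ((size_t) chunk * BLCKSZ))
            break;                     /* short read near end of file */
        first += chunk;
        nblocks -= chunk;
    }
    return total;
}

On the systemtap output above, that is exactly the difference between 8192 and
~118800 bytes per request.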

A Google search showed that Gregory Stark has already worked on that issue
(see references below), but as far as I saw only for bitmap heap scans.

I think this is one of the most critical performance showstoppers for PostgreSQL
on the I/O side.

What's the current status of Gregory Stark's patch? Is there a timeframe for
integrating it?
Does it also work for sequential scans? Are there any plans for a generic
"multi block read count" solution?

Any comments?

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/

http://wiki.postgresql.org/wiki/Todo#Concurrent_Use_of_Resources
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00027.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00395.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00092.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00098.php

http://archives.postgresql.org/pgsql-hackers/2006-10/msg00820.php

http://markmail.org/message/a5osy4qptxk2jgu3#query:+page:1+mid:hz7uzhwxtkbzncy2+state:results
http://markmail.org/message/a5osy4qptxk2jgu3#query:+page:1+mid:a5osy4qptxk2jgu3+state:results

On Sun, Sep 27, 2009 at 06:05:51PM +0200, Gerhard Wiesinger wrote:
> A Google search showed that Gregory Stark has already worked on that issue
> (see references below), but as far as I saw only for bitmap heap scans.

Greg Stark's patches are about giving the IO subsystem enough
information about where the random accesses will be ending up next.
This is important, but almost completely independent from the case
where you know you're doing sequential IO, which is what you seem to be
talking about.

> I think this is one of the most critical performance showstoppers for
> PostgreSQL on the I/O side.

PG's been able to handle data as fast as it can come back from the disk
in my tests.  When you start doing calculations then it will obviously
slow down, but what you were talking about wouldn't help here either.

Then again, I don't have a particularly amazing IO subsystem.  What
sort of performance do your disks give you and at what rate is PG doing
sequential scans for you?

--
  Sam  http://samason.me.uk/

Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans

From: Gerhard Wiesinger

On Sun, 27 Sep 2009, Sam Mason wrote:

> On Sun, Sep 27, 2009 at 06:05:51PM +0200, Gerhard Wiesinger wrote:
>> A Google search showed that Gregory Stark has already worked on that issue
>> (see references below), but as far as I saw only for bitmap heap scans.
>
> Greg Stark's patches are about giving the IO subsystem enough
> information about where the random accesses will be ending up next.
> This is important, but almost completely independent from the case
> where you know you're doing sequential IO, which is what you seem to be
> talking about.
>

I'm talking about 2 cases:
1.) Sequential scans
2.) Bitmap index scans
Both hopefully end up in physically contiguous blocks, where larger block
sizes can benefit.

>> I think this is one of the most critical performance showstoppers for
>> PostgreSQL on the I/O side.
>
> PG's been able to handle data as fast as it can come back from the disk
> in my tests.  When you start doing calculations then it will obviously
> slow down, but what you were talking about wouldn't help here either.
>
> Then again, I don't have a particularly amazing IO subsystem.  What
> sort of performance do your disks give you and at what rate is PG doing
> sequential scans for you?
>

Hello Sam,

Detailed benchmarks are below; the original ones from PostgreSQL have
already been posted. So I would expect PostgreSQL to reach at least about
60-80 MB/s for reads on sequential scans in practice (if larger block sizes
were read), but it is at about 30 MB/s. See also the pgiosim results below.

Setup is:
Disk Setup: SW RAID 5 with 3x1TB SATA 7200 RPM disks
Linux Kernel: 2.6.30.5-43.fc11.x86_64
CPU: Quad Core: AMD Phenom(tm) II X4 940 Processor, 3GHz
RAM: 8GB

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/


http://pgfoundry.org/projects/pgiosim/
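
Roughly, the random-read mode of such a tool boils down to a loop like the
following (my own simplified sketch, not the actual pgiosim source; it assumes
the file is at least one 8k block long):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLCKSZ 8192

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <nreads>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t blocks = (uint64_t) lseek(fd, 0, SEEK_END) / BLCKSZ;
    long     nreads = atol(argv[2]);
    char     buf[BLCKSZ];

    srandom((unsigned) time(NULL));
    for (long i = 0; i < nreads; i++) {
        uint64_t blkno = (uint64_t) random() % blocks;     /* random 8k block */
        if (pread(fd, buf, BLCKSZ, (off_t) (blkno * BLCKSZ)) != BLCKSZ) {
            perror("pread");
            return 1;
        }
    }
    close(fd);
    return 0;
}

Each iteration costs one seek plus one 8k transfer, which is why the random
results below are dominated by seek time (ops/sec) rather than by bandwidth.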

#################################################
# Performance benchmarks:
#################################################
dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
1310720+0 records in
1310720+0 records out
10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s

dd if=test.txt of=/dev/null bs=8192
1310720+0 records in
1310720+0 records out
10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s

#################################################
# RANDOM
#################################################

# Random 8k block reads
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 test.txt
Arg: 1
Added test.txt
Elapsed: 148.22
Read 10000 blocks Wrote 0 blocks
67.47 op/sec, 539.75kB/sec

# Random 8k block reads & writes
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 -w 100 test.txt
Write Mode: 100%
Arg: 1
Added test.txt
Elapsed: 201.44
Read 10000 blocks Wrote 10000 blocks
49.64 op/sec, 397.14kB/sec

# Random 8k block reads & writes, sync after each block
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -b 10000 -w 100 -y test.txt
Write Mode: 100%
fsync after each write
Arg: 1
Added test.txt
Elapsed: 282.30
Read 10000 blocks Wrote 10000 blocks
35.42 op/sec, 283.39kB/sec

#################################################
# SEQUENTIAL
#################################################

# Sequential 8k block reads
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -s -b 1000000 test.txt
Seq Scan
Arg: 1
Added test.txt
Elapsed: 71.88
Read 1000000 blocks Wrote 0 blocks
13911.40 op/sec, 111291.17kB/sec

# Sequential 8k block reads & writes
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -s -b 1000000 -w 100 test.txt
Seq Scan
Write Mode: 100%
Arg: 1
Added test.txt
Elapsed: 261.24
Read 1000000 blocks Wrote 1000000 blocks
3827.90 op/sec, 30623.18kB/sec

# Sequential 8k block reads & writes, sync after each block
echo 3 > /proc/sys/vm/drop_caches;./pgiosim -s -b 10000 -w 100 -y test.txt
Seq Scan
Write Mode: 100%
fsync after each write
Arg: 1
Added test.txt
Elapsed: 27.03
Read 10000 blocks Wrote 10000 blocks
369.96 op/sec, 2959.68kB/sec

#################################################


On Sun, Sep 27, 2009 at 09:04:31PM +0200, Gerhard Wiesinger wrote:
> I'm talking about 2 cases:
> 1.) Sequential scans
> 2.) Bitmap index scans
> Both hopefully end up in physically contiguous blocks, where larger block
> sizes can benefit.

Unfortunately it's all a bit more complicated than you hope :(
Sequential scans *may* benefit from larger block sizes, but not much.
Your testing below doesn't seem to test this at all though.

Bitmap index scan will still be accessing blocks in a somewhat random
order (depending on how much correlation there is between the index and
physical rows, and what the selectivity is like).  The result of any
index scan (bitmap or otherwise) must come back in the correct order
(PG is designed around this) and the best idea to speed this up has
been Greg's read ahead patch.  This pushes more information down into
the kernel so it can start reading the blocks back before PG actually
gets to them.  They are still going to be somewhat out of order so, in
the general case, you're going to be limited by the seek speed of your
disks.
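
The underlying call such a patch can use is posix_fadvise(); here is a minimal
sketch (mine, not Greg's actual patch) of hinting the kernel about the heap
blocks a bitmap scan is about to fetch:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>

#define BLCKSZ 8192

/* Hint the kernel about blocks we will want soon, so it can fetch them early. */
void
prefetch_blocks(int fd, const uint32_t *blknos, int n)
{
    for (int i = 0; i < n; i++)
        (void) posix_fadvise(fd,
                             (off_t) blknos[i] * BLCKSZ,   /* block offset    */
                             BLCKSZ,                       /* length          */
                             POSIX_FADV_WILLNEED);         /* async readahead */
}

The blocks still come back in whatever order the disk can deliver them, so the
seek cost doesn't disappear; it just overlaps better with the processing.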


> Detailed benchmarks are below; the original ones from PostgreSQL have
> already been posted.

Which was saying what?  You were getting 32MB/s and 53MB/s from what?

As a quick test, maybe:

  create table benchmark ( i integer, j text, k text );
  begin;
  truncate benchmark;
  insert into benchmark
    select generate_series(1,1024*1024*10), '0123456789abcdef', '0123456789abcdef';
  commit;

The first run of:

  select count(*) from benchmark;

Will cause the "hint" bits to get set and will cause a lot of writing to
happen.  Subsequent runs will be testing read performance.  My simple
SATA disk at home gets ~90MB/s when tested with hdparm, which I'm taking as
the upper performance limit.  When I perform the above query, I see the
disk pulling data back at 89.60MB/s (stddev of 2.27) which is actually
above what I was expecting (there's a filesystem in the way). CPU usage
is around 10%.  Tested by turning on "\timing" mode in psql, dropping
caches and running:

  SELECT 715833344 / 7597.216 / 1024;

Where 715833344 is the size of the file backing the benchmark table
above and 7597.216 is the time taken in ms.

> http://pgfoundry.org/projects/pgiosim/

This seems to just be testing seek performance, not sequential
performance.

> dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
> 10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s
>
> dd if=test.txt of=/dev/null bs=8192
> 10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s

These look slow.  RAID5 isn't going to be amazing, but it should be
better than this.  I'd spend some more time optimizing your system
config before worrying about PG.  If I can read at 90MB/s from a single
stock SATA drive you should be almost hitting 200MB/s with this, or
300MB/s in a RAID1 across three drives.

--
  Sam  http://samason.me.uk/

On Sun, Sep 27, 2009 at 10:01:27PM +0100, Sam Mason wrote:
> Tested by turning on "\timing" mode in psql, dropping
> caches and running:
>
>   SELECT 715833344 / 7597.216 / 1024;

Help, I can't do maths!  This is overestimating the performance and
should be:

  SELECT 715833344 / 7597.216 / 1024 / 1024 * 1000;

After a few more runs to increase confidence, the read performance is
87.17 MB/s with a stddev of 2.8.  That seems more reasonable; it should *not*
have been going above 90MB/s as often as it was.

--
  Sam  http://samason.me.uk/

>
>> dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
>> 10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s
>>
>> dd if=test.txt of=/dev/null bs=8192
>> 10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s
>
> These look slow.  RAID5 isn't going to be amazing, but it should be
> better than this.  I'd spend some more time optimizing your system
> config before worrying about PG.  If I can read at 90MB/s from a single
> stock SATA drive you should be almost hitting 200MB/s with this, or
> 300MB/s in a RAID1 across three drives.

They are slow, but they are not atypical for RAID5; slow writes in
particular are typical of SW RAID-5.

I'd try a simple test on a 2 or 3 disk RAID-0 for testing purposes
only to see how much faster a RAID-10 array of n*2 disks could be.
The increase in random write performance for RAID-10 will be even more
noticeable.

On Sun, Sep 27, 2009 at 07:22:47PM -0600, Scott Marlowe wrote:
> >> dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
> >> 10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s
> >>
> >> dd if=test.txt of=/dev/null bs=8192
> >> 10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s
> >
> > These look slow.

> They are slow, but they are not atypical for RAID5; slow writes in
> particular are typical of SW RAID-5.

Wow, no wonder it's shunned so much here!  I'd not realized before that
it incurred such a hit.

> I'd try a simple test on a 2 or 3 disk RAID-0 for testing purposes
> only to see how much faster a RAID-10 array of n*2 disks could be.
> The increase in random write performance for RAID-10 will be even more
> noticeable.

I was thinking that the higher the bandwidth the IO subsystem can push
the data through, the more important a larger block size would be--less
to and fro between the kernel and userspace.  If the OP reported
considerably higher CPU usage than expected then he could try rebuilding
with larger block sizes to see if it helps.

I'm assuming that PG only issues block sized reads?  How does changing
block size affect index access performance; does it slow it down because
it has to pull the whole block in?

--
  Sam  http://samason.me.uk/

On Mon, Sep 28, 2009 at 5:53 AM, Sam Mason <sam@samason.me.uk> wrote:
> On Sun, Sep 27, 2009 at 07:22:47PM -0600, Scott Marlowe wrote:
>> >> dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
>> >> 10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s
>> >>
>> >> dd if=test.txt of=/dev/null bs=8192
>> >> 10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s
>> >
>> > These look slow.
>
>> They are slow, but they are not atypical for RAID5; slow writes in
>> particular are typical of SW RAID-5.
>
> Wow, no wonder it's shunned so much here!  I'd not realized before that
> it incurred such a hit.
>
>> I'd try a simple test on a 2 or 3 disk RAID-0 for testing purposes
>> only to see how much faster a RAID-10 array of n*2 disks could be.
>> The increase in random write performance for RAID-10 will be even more
>> noticeable.
>
> I was thinking that the higher the bandwidth the IO subsystem can push
> the data through, the more important a larger block size would be--less
> to and fro between the kernel and userspace.  If the OP reported
> considerably higher CPU usage than expected then he could try rebuilding
> with larger block sizes to see if it helps.
>
> I'm assuming that PG only issues block sized reads?  How does changing
> block size affect index access performance; does it slow it down because
> it has to pull the whole block in?

My experience has been that the stripe size of the RAID array is what
matters. It's about a compromise between something that works well
with sequential scans and something that works well with random access
at the same time.  On a purely transactional db, having a stripe size
in the 8k to 64k range seems optimal, depending on the RAID setup /
hardware.  For something on a reporting database, getting the stripe
size in the 32k to 512k range is usually best.  Most of the time I hit
32 or 64k stripes and it's decent at both.

Random IO is the killer.  If you've got at least 10 to 20% random IO,
it's what to tune for usually.

I haven't played much with larger pg block sizes in recent years.  I
remember someone doing so and seeing better performance up to 32k, but
that's the largest block size pg supports, I believe, and since that code
path isn't as well tested as 8k, I just stick to 8k.

On Sun, 27 Sep 2009, Sam Mason wrote:

> The first run of:
>  select count(*) from benchmark;
> Will cause the "hint" bits to get set and will cause a lot of writing to
> happen.  Subsequent runs will be testing read performance.

You just need to be careful of caching effects here.  Either stop the
database and clear the system caches before doing the second count(*), or
pick a table size that's much larger than total system RAM so it's can't
possibly cache everything.  Otherwise you'll just be reading back from
cached memory instead on the second read (and the first one, too, but
because of hint bits that result doesn't mean much anyway).

Sam's results represent the status quo as I'm used to seeing it:  you
should see about the "wire speed" of the disk when pulling in data this
way, but both hint bits and checkpoints can slow results if you're not
careful to account for them.  I keep meaning to add something just like
this as a second level example on top of dd/bonnie++ on my disk testing
page.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

On Sun, 2009-09-27 at 18:05 +0200, Gerhard Wiesinger wrote:

> So I saw that even on sequential reads (and also on bitmap heap scan access)
> PostgreSQL uses only 8k blocks. I think that's a major I/O bottleneck.
>
> A commercial database vendor solved the problem by reading multiple contiguous
> 8k blocks in a single request, up to a maximum threshold. Output per 5
> seconds on an equivalent "sequential scan":

Is systemtap counting actual I/Os, or just requests to access 8192-byte
blocks that may already be in the OS cache? Postgres doesn't read more than
one block at a time into its buffer pool, so those request counts look about
right.

There is belief here that multi-block I/O was introduced prior to OS
doing this as a standard mechanism. Linux expands its read ahead window
in response to sequential scans and so this seems like something we
don't want to do in the database.

It's possible this is wrong. Is the table being scanned fairly sizable
and was it allocated contiguously? i.e. was it a large table loaded via
COPY?

I also wonder if more L2 cache effects exist.

--
 Simon Riggs           www.2ndQuadrant.com


On Fri, 2 Oct 2009, Simon Riggs wrote:

>
> On Sun, 2009-09-27 at 18:05 +0200, Gerhard Wiesinger wrote:
>
>> So I saw that even on sequential reads (and also on bitmap heap scan access)
>> PostgreSQL uses only 8k blocks. I think that's a major I/O bottleneck.
>>
>> A commercial database vendor solved the problem by reading multiple contiguous
>> 8k blocks in a single request, up to a maximum threshold. Output per 5
>> seconds on an equivalent "sequential scan":
>
> Is systemtap counting actual I/Os, or just requests to access 8192-byte
> blocks that may already be in the OS cache? Postgres doesn't read more than
> one block at a time into its buffer pool, so those request counts look about
> right.
>

As far as I know these are VFS reads. So some reads might come from the cache,
but since I did the following before each run, all requests should come from
disk:
echo 3 > /proc/sys/vm/drop_caches; service postgresql restart
(then run the benchmark)

The same applies to all benchmarks, because I don't want to measure the cache
performance of the OS or of the DB.

Therefore all requests come from disk and not from the cache (apart from
blocks that are read twice or more, but that shouldn't be the case here and
would be a legitimate cache hit anyway).

> There is belief here that multi-block I/O was introduced prior to OS
> doing this as a standard mechanism. Linux expands its read ahead window
> in response to sequential scans and so this seems like something we
> don't want to do in the database.

I even played with large values for the block device readahead of /dev/md*,
/dev/sd* and /dev/dm-*, as well as the stripe_cache_size of /dev/md*, but
without any performance improvement in the benchmark scenarios.

=> Readahead and caching don't seem to help, at least in the BITMAP HEAP
SCAN scenarios, on a nearly-latest Linux kernel.

But I think such block issues (reading in blocks as large as possible) have
to be optimized at the application level (in our case the DB level) because:
1.) We can't assume that the OS and even the storage handle such scenarios
well.
2.) We can't assume that the OS/storage is intelligent enough to notice when
two "random" blocks happen to be two sequential blocks and to merge them so
that the number of IOPS is reduced.
3.) I think such logic should be very easy to integrate and has even been
done in some patches.

>
> It's possible this is wrong. Is the table being scanned fairly sizable
> and was it allocated contiguously? i.e. was it a large table loaded via
> COPY?
>
> I also wonder if more L2 cache effects exist.
>

What do you mean by "table being scanned fairly sizable"? I don't get
it.

The table was filled with a lot of inserts, but all at one point in time.

Ciao,
Gerhard

On Sun, Sep 27, 2009 at 11:18 AM, Sam Mason <sam@samason.me.uk> wrote:
> On Sun, Sep 27, 2009 at 06:05:51PM +0200, Gerhard Wiesinger wrote:
>> A Google search showed that Gregory Stark has already worked on that issue
>> (see references below), but as far as I saw only for bitmap heap scans.
>
> Greg Stark's patches are about giving the IO subsystem enough
> information about where the random accesses will be ending up next.
> This is important, but almost completely independent from the case
> where you know you're doing sequential IO, which is what you seem to be
> talking about.

FWIW, I did write code to use FADV_SEQUENTIAL and FADV_RANDOM but
couldn't demonstrate any performance improvement.  Basically
Postgres was already capable of saturating any raid controller I could
test doing a normal sequential scan with 8k block sizes and no special
read-ahead advice.
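
For anyone who wants to repeat the experiment, the advice amounts to a call
like this before scanning the relation's file (a simplified illustration, not
the actual patch):

#define _XOPEN_SOURCE 600
#include <fcntl.h>

/* Advise the kernel about the expected access pattern for the whole file.
 * A length of 0 means "to the end of the file". */
void
advise_scan_pattern(int fd, int sequential)
{
    (void) posix_fadvise(fd, 0, 0,
                         sequential ? POSIX_FADV_SEQUENTIAL
                                    : POSIX_FADV_RANDOM);
}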


--
greg