Thread: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Gerhard Wiesinger
Date:
Hello,

As block sizes, random I/O and linear I/O are critical I/O performance parameters, I had a look at PostgreSQL and at a commercial database. For that I enhanced the SystemTap script: http://www.wiesinger.com/opensource/systemtap/disktop_gw.stp

Output per 5 seconds during a sequential scan:
UID  PID   PPID  CMD        DEVICE  T  BYTES      REQUESTS  BYTES/REQ
26   4263  4166  postmaster dm-1    R  168542208  20574     8192
=> 32 MB/s

So I saw that even on sequential reads (and also on bitmap heap scan access) PostgreSQL uses only 8k blocks. I think that's a major I/O bottleneck. A commercial database vendor solved the problem by merging reads of multiple contiguous 8k blocks into one larger request, up to a maximum threshold. Output per 5 seconds on an equivalent "sequence scan":
UID   PID   PPID  CMD      DEVICE  T  BYTES      REQUESTS  BYTES/REQ
1001  5381  1     process  dm-1    R  277754638  2338      118800
=> 53 MB/s

Some googling showed that Gregory Stark has already worked on this issue (see references below), but as far as I can see only for bitmap heap scans. I think this is one of the most critical performance showstoppers of PostgreSQL on the I/O side.

What's the current status of Gregory Stark's patch? Any timeframe for integration? Does it also work for sequential scans? Any plans for a generic "multi block read count" solution? Any comments?

Thnx.
Ciao, Gerhard
--
http://www.wiesinger.com/

References:
http://wiki.postgresql.org/wiki/Todo#Concurrent_Use_of_Resources
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00027.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00395.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00088.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00092.php
http://archives.postgresql.org/pgsql-hackers/2007-12/msg00098.php
http://archives.postgresql.org/pgsql-hackers/2006-10/msg00820.php
http://markmail.org/message/a5osy4qptxk2jgu3#query:+page:1+mid:hz7uzhwxtkbzncy2+state:results
http://markmail.org/message/a5osy4qptxk2jgu3#query:+page:1+mid:a5osy4qptxk2jgu3+state:results
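A quick arithmetic sanity check of the BYTES/REQ column in the two SystemTap samples quoted above (plain shell arithmetic, using the figures from the output):

```shell
# BYTES / REQUESTS for the two systemtap samples quoted above
echo $((168542208 / 20574))   # PostgreSQL backend: 8192 bytes per request
echo $((277754638 / 2338))    # commercial DB:      118800 bytes per request
```

So the PostgreSQL backend really is issuing exactly one request per 8k block, while the other engine averages roughly 14-15 contiguous blocks per request.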
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Sam Mason
Date:
On Sun, Sep 27, 2009 at 06:05:51PM +0200, Gerhard Wiesinger wrote:
> A google research has shown that Gregory Stark already worked on that issue
> (see references below) but as far as I saw only on bitmap heap scans.

Greg Stark's patches are about giving the IO subsystem enough information about where the random accesses will be ending up next. This is important, but almost completely independent from the case where you know you're doing sequential IO, which is what you seem to be talking about.

> I think this is one of the most critical performance showstopper of
> PostgreSQL on the I/O side.

PG's been able to handle data as fast as it can come back from the disk in my tests. When you start doing calculations then it will obviously slow down, but what you were talking about wouldn't help here either.

Then again, I don't have a particularly amazing IO subsystem. What sort of performance do your disks give you and at what rate is PG doing sequential scans for you?

--
Sam  http://samason.me.uk/
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Gerhard Wiesinger
Date:
On Sun, 27 Sep 2009, Sam Mason wrote:
> On Sun, Sep 27, 2009 at 06:05:51PM +0200, Gerhard Wiesinger wrote:
>> A google research has shown that Gregory Stark already worked on that issue
>> (see references below) but as far as I saw only on bitmap heap scans.
>
> Greg Stark's patches are about giving the IO subsystem enough
> information about where the random accesses will be ending up next.
> This is important, but almost completely independent from the case
> where you know you're doing sequential IO, which is what you seem to be
> talking about.

I'm talking about 2 cases:
1.) Sequential scans
2.) Bitmap index scans
Both hopefully end up in physically contiguous blocks, where larger block sizes can benefit.

>> I think this is one of the most critical performance showstopper of
>> PostgreSQL on the I/O side.
>
> PG's been able to handle data as fast as it can come back from the disk
> in my tests. When you start doing calculations then it will obviously
> slow down, but what you were talking about wouldn't help here either.
>
> Then again, I don't have a particularly amazing IO subsystem. What
> sort of performance do your disks give you and at what rate is PG doing
> sequential scans for you?

Hello Sam,

Detailed benchmarks are below; the original numbers from PostgreSQL have already been posted. So on sequential scans I would expect at least about 60-80MB/s in reads for PostgreSQL in practice (if larger blocks were read), but it is at about 30MB/s. See also pgiosim below.

Setup is:
Disk setup: SW RAID 5 with 3x1TB SATA 7200 RPM disks
Linux kernel: 2.6.30.5-43.fc11.x86_64
CPU: Quad core AMD Phenom(tm) II X4 940 Processor, 3GHz
RAM: 8GB

Thnx.
Ciao, Gerhard
--
http://www.wiesinger.com/
http://pgfoundry.org/projects/pgiosim/

#################################################
# Performance benchmarks:
#################################################
dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
1310720+0 records in
1310720+0 records out
10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s

dd if=test.txt of=/dev/null bs=8192
1310720+0 records in
1310720+0 records out
10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s

#################################################
# RANDOM
#################################################
# Random 8k block reads
echo 3 > /proc/sys/vm/drop_caches; ./pgiosim -b 10000 test.txt
Arg: 1
Added test.txt
Elapsed: 148.22
Read 10000 blocks Wrote 0 blocks
67.47 op/sec, 539.75kB/sec

# Random 8k block reads & writes
echo 3 > /proc/sys/vm/drop_caches; ./pgiosim -b 10000 -w 100 test.txt
Write Mode: 100%
Arg: 1
Added test.txt
Elapsed: 201.44
Read 10000 blocks Wrote 10000 blocks
49.64 op/sec, 397.14kB/sec

# Random 8k block reads & writes, sync after each block
echo 3 > /proc/sys/vm/drop_caches; ./pgiosim -b 10000 -w 100 -y test.txt
Write Mode: 100% fsync after each write
Arg: 1
Added test.txt
Elapsed: 282.30
Read 10000 blocks Wrote 10000 blocks
35.42 op/sec, 283.39kB/sec

#################################################
# SEQUENTIAL
#################################################
# Sequential 8k block reads
echo 3 > /proc/sys/vm/drop_caches; ./pgiosim -s -b 1000000 test.txt
Seq Scan
Arg: 1
Added test.txt
Elapsed: 71.88
Read 1000000 blocks Wrote 0 blocks
13911.40 op/sec, 111291.17kB/sec

# Sequential 8k block reads & writes
echo 3 > /proc/sys/vm/drop_caches; ./pgiosim -s -b 1000000 -w 100 test.txt
Seq Scan
Write Mode: 100%
Arg: 1
Added test.txt
Elapsed: 261.24
Read 1000000 blocks Wrote 1000000 blocks
3827.90 op/sec, 30623.18kB/sec

# Sequential 8k block reads & writes, sync after each block
echo 3 > /proc/sys/vm/drop_caches; ./pgiosim -s -b 10000 -w 100 -y test.txt
Seq Scan
Write Mode: 100% fsync after each write
Arg: 1
Added test.txt
Elapsed: 27.03
Read 10000 blocks Wrote 10000 blocks
369.96 op/sec, 2959.68kB/sec
#################################################
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Sam Mason
Date:
On Sun, Sep 27, 2009 at 09:04:31PM +0200, Gerhard Wiesinger wrote:
> I'm talking about 2 cases
> 1.) Sequential scans
> 2.) Bitmap index scans
> which both hopefully end physically in blocks which are after each other
> and were larger block sizes can benefit.

Unfortunately it's all a bit more complicated than you hope :(

Sequential scans *may* benefit from larger block sizes, but not much. Your testing below doesn't seem to test this at all though.

Bitmap index scans will still be accessing blocks in a somewhat random order (depending on how much correlation there is between the index and the physical rows, and what the selectivity is like). The result of any index scan (bitmap or otherwise) must come back in the correct order (PG is designed around this), and the best idea to speed this up has been Greg's read-ahead patch. This pushes more information down into the kernel so it can start reading the blocks back before PG actually gets to them. They are still going to be somewhat out of order so, in the general case, you're going to be limited by the seek speed of your disks.

> Detailed benchmarks are below, the original one from PostgreSQL have
> already been posted.

Which was saying what? You were getting 32MB/s and 53MB/s from what? As a quick test, maybe:

create table benchmark ( i integer, j text, k text );
begin;
truncate benchmark;
insert into benchmark
  select generate_series(1,1024*1024*10), '0123456789abcdef', '0123456789abcdef';
commit;

The first run of:

select count(*) from benchmark;

will cause the "hint" bits to get set and will cause a lot of writing to happen. Subsequent runs will be testing read performance. My simple SATA disk at home gets ~90MB/s when tested with hdparm, which I'm taking as the upper performance limit. When I perform the above query, I see the disk pulling data back at 89.60MB/s (stddev of 2.27), which is actually above what I was expecting (there's a filesystem in the way). CPU usage is around 10%.

Tested by turning on "\timing" mode in psql, dropping caches and running:

SELECT 715833344 / 7597.216 / 1024;

where 715833344 is the size of the file backing the benchmark table above and 7597.216 is the time taken in ms.

> http://pgfoundry.org/projects/pgiosim/

This seems to just be testing seek performance, not sequential performance.

> dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
> 10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s
>
> dd if=test.txt of=/dev/null bs=8192
> 10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s

These look slow. RAID5 isn't going to be amazing, but it should be better than this. I'd spend some more time optimizing your system config before worrying about PG. If I can read at 90MB/s from a single stock SATA drive you should be almost hitting 200MB/s with this, or 300MB/s in a RAID1 across three drives.

--
Sam  http://samason.me.uk/
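Sam's cold-cache procedure can be sketched end to end as a script (the database name `testdb` is an assumption; the `benchmark` table is the one created in his message):

```shell
# Hedged sketch of the cold-cache read test described above; assumes the
# benchmark table exists in database "testdb" (drop_caches needs root)
sync
echo 3 > /proc/sys/vm/drop_caches
psql testdb <<'EOF'
\timing
SELECT count(*) FROM benchmark;
EOF
```

The reported time can then be divided into the size of the table's backing file, as in the SELECT above.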
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Sam Mason
Date:
On Sun, Sep 27, 2009 at 10:01:27PM +0100, Sam Mason wrote:
> Tested by turning on "\timing" mode in psql, dropping
> caches and running:
>
> SELECT 715833344 / 7597.216 / 1024;

Help, I can't do maths! That was overestimating the performance and should be:

SELECT 715833344 / 7597.216 / 1024 / 1024 * 1000;

After a few more runs to increase confidence, the read performance is 87.17MB/s with a stddev of 2.8. That seems more reasonable; it should *not* have been going above 90MB/s as often as it was.

--
Sam  http://samason.me.uk/
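The corrected arithmetic can be checked outside the database as well; a one-liner with awk (same figures: bytes read divided by elapsed ms, converted to MB/s):

```shell
# Corrected calculation: 715833344 bytes read in 7597.216 ms, as MB/s
awk 'BEGIN { printf "%.2f MB/s\n", 715833344 / 7597.216 / 1024 / 1024 * 1000 }'
# -> 89.86 MB/s for the single run quoted; the multi-run mean is lower
```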
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Scott Marlowe
Date:
>> dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
>> 10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s
>>
>> dd if=test.txt of=/dev/null bs=8192
>> 10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s
>
> These look slow. RAID5 isn't going to be amazing, but it should be
> better than this. I'd spend some more time optimizing your system
> config before worrying about PG. If I can read at 90MB/s from a single
> stock SATA drive you should be almost hitting 200MB/s with this, or
> 300MB/s in a RAID1 across three drives.

They are slow, but they are not atypical for RAID-5; the slow writes in particular are typical for SW RAID-5. I'd try a simple test on a 2 or 3 disk RAID-0, for testing purposes only, to see how much faster a RAID-10 array of n*2 disks could be. The increase in random write performance for RAID-10 will be even more noticeable.
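Scott's throw-away RAID-0 test might look like the following (device and partition names are assumptions, and this destroys any data on them):

```shell
# Hypothetical 3-disk RAID-0, purely to estimate the upper bound a RAID-10
# of the same drives could reach; NOT for data you care about
mdadm --create /dev/md9 --level=0 --raid-devices=3 /dev/sda2 /dev/sdb2 /dev/sdc2
mkfs.ext3 /dev/md9
mount /dev/md9 /mnt/raid0test
# then repeat the same dd benchmark used earlier in the thread
dd if=/dev/zero of=/mnt/raid0test/test.txt bs=8192 count=1310720 conv=fdatasync
```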
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Sam Mason
Date:
On Sun, Sep 27, 2009 at 07:22:47PM -0600, Scott Marlowe wrote:
> >> dd if=/dev/zero of=test.txt bs=8192 count=1310720 conv=fdatasync
> >> 10737418240 bytes (11 GB) copied, 169.482 s, 63.4 MB/s
> >>
> >> dd if=test.txt of=/dev/null bs=8192
> >> 10737418240 bytes (11 GB) copied, 86.4457 s, 124 MB/s
> >
> > These look slow.
>
> They are slow, they are not atypical for RAID5; especially the slow
> writes with SW RAID-5 are typical.

Wow, no wonder it's shunned so much here! I'd not realized before that it incurred such a hit.

> I'd try a simple test on a 2 or 3 disk RAID-0 for testing purposes
> only to see how much faster a RAID-10 array of n*2 disks could be.
> The increase in random write performance for RAID-10 will be even more
> noticeable.

I was thinking that the higher the bandwidth the IO subsystem can push the data through, the more important a larger block size would be--less to and fro between the kernel and userspace. If the OP reported considerably higher CPU usage than expected then he could try rebuilding with larger block sizes to see if it helps.

I'm assuming that PG only issues block-sized reads? How does changing the block size affect index access performance; does it slow down because it has to pull the whole block in?

--
Sam  http://samason.me.uk/
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Scott Marlowe
Date:
On Mon, Sep 28, 2009 at 5:53 AM, Sam Mason <sam@samason.me.uk> wrote:
> On Sun, Sep 27, 2009 at 07:22:47PM -0600, Scott Marlowe wrote:
>> They are slow, they are not atypical for RAID5; especially the slow
>> writes with SW RAID-5 are typical.
>
> Wow, no wonder it's shunned so much here! I'd not realized before that
> it incurred such a hit.
>
>> I'd try a simple test on a 2 or 3 disk RAID-0 for testing purposes
>> only to see how much faster a RAID-10 array of n*2 disks could be.
>> The increase in random write performance for RAID-10 will be even more
>> noticeable.
>
> I was thinking that the higher the bandwidth the IO subsystem could push
> the data though the more important a larger block size would be--less
> to and fro between the kernel and userspace. If the OP reported
> considerably higher CPU usage than expected then he could try rebuilding
> with larger block sizes to see if it helps.
>
> I'm assuming that PG only issues block sized reads? How does changing
> block size affect index access performance; does it slow it down because
> it has to pull the whole block in?

My experience has been that the stripe size of the RAID array is what matters. It's a compromise between something that works well with sequential scans and something that works well with random access at the same time. On a purely transactional db, a stripe size in the 8k to 64k range seems optimal, depending on the RAID setup / hardware. For a reporting database, a stripe size in the 32k to 512k range is usually best. Most of the time I pick 32k or 64k stripes, and that's decent at both. Random IO is the killer: if you've got at least 10 to 20% random IO, it's usually what to tune for.

I haven't played much with larger pg block sizes in recent years. I remember someone doing so and seeing better performance up to 32k, but that's the largest block size pg supports I believe, and since that code path isn't as well tested as 8k, I just stick to 8k.
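For anyone wanting to repeat that experiment, the block size is a compile-time option; a hedged sketch (`--with-blocksize` takes the size in kB, defaults to 8, and 32 is the maximum):

```shell
# Rebuilding PostgreSQL with 32k blocks instead of the default 8k.
# The block size is baked into the on-disk format, so a fresh initdb
# is required afterwards; existing clusters cannot be reused.
./configure --with-blocksize=32
make
make install
```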
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Greg Smith
Date:
On Sun, 27 Sep 2009, Sam Mason wrote:
> The first run of:
> select count(*) from benchmark;
> Will cause the "hint" bits to get set and will cause a lot of writing to
> happen. Subsequent runs will be testing read performance.

You just need to be careful of caching effects here. Either stop the database and clear the system caches before doing the second count(*), or pick a table size that's much larger than total system RAM so it can't possibly cache everything. Otherwise, on the second read you'll just be reading back from cached memory (and on the first one, too, but because of hint bits that result doesn't mean much anyway).

Sam's results represent the status quo as I'm used to seeing it: you should see about the "wire speed" of the disk when pulling in data this way, but both hint bits and checkpoints can skew results if you're not careful to account for them. I keep meaning to add something just like this as a second-level example on top of dd/bonnie++ on my disk testing page.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Simon Riggs
Date:
On Sun, 2009-09-27 at 18:05 +0200, Gerhard Wiesinger wrote:
> So I saw, that even on sequential reads (and also on bitmap heap scan acces)
> PostgreSQL uses only 8k blocks. I think that's a major I/O bottleneck.
>
> A commercial software database vendor solved the problem by reading multiple
> continuous blocks by multiple 8k blocks up to a maximum threshold. Output per 5
> seconds on an equivalent "sequence scan":

Is systemtap counting actual I/Os, or just requests to access 8192-byte blocks that are already in the OS cache? Postgres doesn't read more than one block at a time into its buffer pool, so those numbers of requests look about right.

There is a belief here that multi-block I/O was introduced before OSes did this as a standard mechanism. Linux expands its readahead window in response to sequential scans, so this seems like something we don't want to do in the database. It's possible this is wrong.

Is the table being scanned fairly sizable, and was it allocated contiguously? i.e. was it a large table loaded via COPY?

I also wonder if more L2 cache effects exist.

--
Simon Riggs  www.2ndQuadrant.com
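The kernel readahead window mentioned above can be inspected and raised per block device; a sketch (the device name is an assumption, and values are in 512-byte sectors):

```shell
# Check and enlarge Linux readahead on an md device (512-byte sectors);
# needs root
blockdev --getra /dev/md0        # current setting, often 256 (= 128kB)
blockdev --setra 4096 /dev/md0   # 4096 sectors = 2MB readahead
```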
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Gerhard Wiesinger
Date:
On Fri, 2 Oct 2009, Simon Riggs wrote:
> On Sun, 2009-09-27 at 18:05 +0200, Gerhard Wiesinger wrote:
>> So I saw, that even on sequential reads (and also on bitmap heap scan acces)
>> PostgreSQL uses only 8k blocks. I think that's a major I/O bottleneck.
>>
>> A commercial software database vendor solved the problem by reading multiple
>> continuous blocks by multiple 8k blocks up to a maximum threshold. Output per 5
>> seconds on an equivalent "sequence scan":
>
> Is systemtap counting actual I/Os or just requests to access 8192 blocks
> once in OS cache? Postgres doesn't read more than one block at a time
> into its buffer pool, so those numbers of requests look about right.

As far as I know these are VFS reads, so some reads might come from cache. But since I did the following before every benchmark, all requests should come from disk:

echo 3 > /proc/sys/vm/drop_caches; service postgresql restart
do benchmark

Same for all benchmarks, because I don't want to measure the cache performance of the OS or of the DB. Therefore all requests (except blocks read twice or more, but that shouldn't be the case and would also be fine as a cache hit) come from disk and not from the cache.

> There is belief here that multi-block I/O was introduced prior to OS
> doing this as a standard mechanism. Linux expands its read ahead window
> in response to sequential scans and so this seems like something we
> don't want to do in the database.

I even played with large values for the block device readahead on /dev/md*, /dev/sd* and /dev/dm-*, as well as stripe_cache_size on /dev/md*, but without any performance improvement in the benchmark scenarios.

=> None of the readahead/cache settings seem to help, at least in the HEAP BITMAP SCAN scenarios, on a nearly current Linux kernel. But I think such block issues (reading in blocks as large as possible) have to be optimized at the application level (in our case the DB level) because:
1.) We can't assume that the OS and even the storage work well in such scenarios.
2.) We can't assume that the OS/storage is intelligent enough to reduce the number of IOPS when 2 random blocks happen to be 2 sequential blocks.
3.) I think such logic would be very easy to integrate, and it has even been done in some patches.

> It's possible this is wrong. Is the table being scanned fairly sizable
> and was it allocated contiguously? i.e. was it a large table loaded via
> COPY?
>
> I also wonder if more L2 cache effects exist.

What do you mean by "table being scanned fairly sizable"? I don't get it. The table was filled with a lot of inserts, but all at one point in time.

Ciao,
Gerhard
Re: PostgreSQL reads each 8k block - no larger blocks are used - even on sequential scans
From
Greg Stark
Date:
On Sun, Sep 27, 2009 at 11:18 AM, Sam Mason <sam@samason.me.uk> wrote:
> On Sun, Sep 27, 2009 at 06:05:51PM +0200, Gerhard Wiesinger wrote:
>> A google research has shown that Gregory Stark already worked on that issue
>> (see references below) but as far as I saw only on bitmap heap scans.
>
> Greg Stark's patches are about giving the IO subsystem enough
> information about where the random accesses will be ending up next.
> This is important, but almost completely independent from the case
> where you know you're doing sequential IO, which is what you seem to be
> talking about.

FWIW I did write code to use FADV_SEQUENTIAL and FADV_RANDOM but couldn't demonstrate any performance improvement. Basically Postgres was already capable of saturating any raid controller I could test with a normal sequential scan using 8k block sizes and no special read-ahead advice.

--
greg