Patch : seq scan readahead (WIP) - Mailing list pgsql-hackers
From: Pierre Frédéric Caillaud
Subject: Patch : seq scan readahead (WIP)
Date:
Msg-id: op.uyb6s4gacke6l8@soyouz
Responses: Re: Patch : seq scan readahead (WIP)
List: pgsql-hackers
This is a spinoff of the current work on compression... I've discovered that Linux doesn't apply readahead to sparse files, so I added a little readahead to seq scans. Then I realized this might also be beneficial for standard Postgres: on my RAID1 it has a pretty drastic effect.

The PC:
- RAID1 of 2x SATA disks, reads at about 60 MB/s
- RAID5 of 3x SATA disks, reads at about 210 MB/s
Both RAIDs are Linux Software RAID.

Test data: a 9.3 GB table with not-too-small rows, so count(*) doesn't use much CPU.

The problem:
- On the RAID5 there is no problem; count(*) maxes out the disks.
- On the RAID1, count(*) also maxes out the disk, but there are 2 disks: one works, the other sits idle and does nothing. Linux Software RAID cannot use both disks for a sequential read, at least on my kernel version. What do your boxes do in this situation?

For standard Postgres, iostat says:

Device:  tps     Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda      3,00    0,00        40,00       0         40
sdb      727,00  116600,00   40,00       116600    40

Device:  tps     Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda      124,00  23408,00    0,00        23408     0
sdb      628,00  101640,00   0,00        101640    0

Device:  tps     Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda      744,00  124536,00   0,00        124536    0
sdb      0,00    0,00        0,00        0         0

Basically it reads the disks in turn, but never both at the same time.

The solution: somehow coerce Linux Software RAID into striping reads across the 2 mirrors to get more throughput. After a bit of fiddling, this seems to do it. For each page read in a seq scan:

Strategy 0: do nothing (this is the current behaviour)
Strategy 1: issue a Prefetch call 4096 pages (32 MB) ahead of the current position
Strategy 2: if (the current page & 4096) == 1, issue a Prefetch call 4096 pages (32 MB) ahead of the current position
Strategy 3: issue a Prefetch call 32 MB * ((the current page & 4096) ? 1 : 2) ahead of the current position

(A rough standalone sketch of strategy 2 follows at the end of this mail.)

Results of seq scanning 9.3 GB of data on the RAID5:

Strategy 0: 46.4 s

It maxes out the disks anyway, so I didn't try the others. However, RAID1 is better for not-so-read-only databases...

Results of seq scanning 9.3 GB of data on the RAID1:

Strategy 0: 162.8 s
Strategy 1: 152.9 s
Strategy 2: 105.2 s
Strategy 3: 152.3 s

Strategy 2 cuts the seq scan duration by 35%, i.e. disk bandwidth gets a +54% boost. For strategy 2, iostat says:

Device:  tps     Blk_read/s  Blk_wrtn/s  Blk_read  Blk_wrtn
sda      625,00  105288,00   0,00        105288    0
sdb      820,00  105968,00   0,00        105968    0

Both disks of the RAID1 are used at the same time.

I guess it would need some experimenting with the values, and perhaps a per-tablespace setting, but since lots of people run Linux Software RAID1 on servers, this might be interesting... You guys want to try it? Patch attached.
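For readers who want to reproduce the effect outside Postgres: below is a minimal standalone sketch of strategy 2, not the attached patch. It assumes that the "Prefetch call" boils down to posix_fadvise(POSIX_FADV_WILLNEED) on Linux (which is what PostgreSQL's prefetch support uses), that pages are 8 kB, and that "(the current page & 4096) == 1" is meant as a simple bit test so the prefetch only fires on alternating 32 MB chunks. The real patch may wire this differently into the seq scan code.

/*
 * Standalone sketch of "strategy 2" readahead (NOT the patch itself):
 * read a file sequentially in 8 kB blocks and, on alternating 32 MB
 * chunks, ask the kernel to prefetch the block 4096 pages (32 MB)
 * ahead with posix_fadvise(POSIX_FADV_WILLNEED).
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ  8192        /* PostgreSQL page size (assumed) */
#define WINDOW  4096        /* prefetch distance in pages = 32 MB */

int main(int argc, char **argv)
{
    if (argc != 2)
    {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    char   *buf  = malloc(BLCKSZ);
    long    page = 0;
    ssize_t n;

    while ((n = read(fd, buf, BLCKSZ)) > 0)
    {
        /* strategy 2: prefetch 32 MB ahead, but only on alternating 32 MB chunks */
        if (page & WINDOW)
            posix_fadvise(fd,
                          (off_t) (page + WINDOW) * BLCKSZ,
                          (off_t) BLCKSZ,
                          POSIX_FADV_WILLNEED);
        page++;
    }

    free(buf);
    close(fd);
    return 0;
}

Run it against a large file on the RAID1 and watch iostat; if the hypothesis above is right, both mirrors should show reads at the same time, as in the strategy 2 numbers above.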