Patch : seq scan readahead (WIP) - Mailing list pgsql-hackers

From Pierre Frédéric Caillaud
Subject Patch : seq scan readahead (WIP)
Msg-id op.uyb6s4gacke6l8@soyouz
Responses Re: Patch : seq scan readahead (WIP)  (Albert Cervera i Areny <albert@nan-tic.com>)
List pgsql-hackers
This is a spinoff of the current work on compression...
I've discovered that Linux doesn't apply readahead to sparse files.
So I added a little readahead in seq scans.

Then I realized this might also be beneficial for standard Postgres.
On my RAID1 it shows some pretty drastic effects.

The PC :

- RAID1 of 2xSATA disks, reads at about 60 MB/s
- RAID5 of 3xSATA disks, reads at about 210 MB/s

Both RAIDs are Linux Software RAID.

Test data :

A 9.3GB table with fairly large rows, so count(*) doesn't use much CPU.

The problem :

- On the RAID5 there is no problem, count(*) maxes out the disks.
- On the RAID1, count(*) also maxes out a disk, but only one of the two:
one works while the other sits idle.
Linux Software RAID does not spread a sequential read across both disks,
at least on my kernel version. What do your boxes do in such a situation ?

For standard postgres, iostat says :

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               3,00         0,00        40,00          0         40
sdb             727,00    116600,00        40,00     116600         40

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             124,00     23408,00         0,00      23408          0
sdb             628,00    101640,00         0,00     101640          0

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             744,00    124536,00         0,00     124536          0
sdb               0,00         0,00         0,00          0          0

Basically it is reading the disks in turn, but not at the same time.
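
To read the iostat figures above as bandwidth, here is a minimal sketch. It assumes Blk_read/s counts 512-byte blocks, which is the usual unit for Linux iostat but is not stated in this mail; the helper name is hypothetical.

```c
#include <assert.h>

/* Assumption: iostat reports Blk_read/s in 512-byte blocks (sectors). */
#define IOSTAT_BLOCK_BYTES 512.0

/* Convert an iostat Blk_read/s figure to MB/s (1 MB = 10^6 bytes). */
static double blocks_to_mb_per_sec(double blocks_per_sec)
{
    assert(blocks_per_sec >= 0.0);
    return blocks_per_sec * IOSTAT_BLOCK_BYTES / 1e6;
}
```

Under that assumption, 116600 blocks/s is about 59.7 MB/s, and the summed sda + sdb rates of the other two samples come to about 64 MB/s each. In every sample the total is roughly one disk's worth, never two.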

The solution :

Somehow coerce Linux Software RAID to stripe reads across the 2 mirrors to
get more throughput.

After a bit of fiddling, this seems to do it :

For each page read in a seq scan, apply one of these strategies :

Strategy 0 : do nothing (this is the current behaviour)
Strategy 1 : issue a Prefetch call 4096 pages ahead (32MB) of current
position
Strategy 2 : if (the current page & 4096) != 0, issue a Prefetch call 4096
pages ahead (32MB) of current position
Strategy 3 : issue a Prefetch call 32MB * ((the current page & 4096) ? 1 :
2) ahead of current position
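
The four strategies can be sketched as one pure function that, for the page being read, returns which page to prefetch (or none). This is only an illustration, not the code from the patch: the function and macro names are hypothetical, the actual prefetch call is left out, pages are assumed to be 8 kB (so 4096 pages = 32MB), and the `& 4096` test is read as "bit 12 of the page number is set", since that masked value can only be 0 or 4096.

```c
#include <assert.h>

#define PREFETCH_PAGES 4096L  /* 4096 pages * 8 kB = 32 MB (assumed page size) */
#define NO_PREFETCH    (-1L)  /* hypothetical marker: issue no prefetch */

/*
 * Return the page number to prefetch while reading `page`,
 * or NO_PREFETCH if the strategy issues nothing for this page.
 */
static long prefetch_target(int strategy, long page)
{
    assert(strategy >= 0 && strategy <= 3);
    assert(page >= 0);

    /* True when the page lies in an odd-numbered 32 MB chunk. */
    int odd_chunk = (page & PREFETCH_PAGES) != 0;

    switch (strategy) {
    case 1:  /* always prefetch 32 MB ahead */
        return page + PREFETCH_PAGES;
    case 2:  /* prefetch 32 MB ahead, but only from odd chunks */
        return odd_chunk ? page + PREFETCH_PAGES : NO_PREFETCH;
    case 3:  /* prefetch 32 MB ahead from odd chunks, 64 MB from even ones */
        return page + PREFETCH_PAGES * (odd_chunk ? 1 : 2);
    default: /* strategy 0: current behaviour, no prefetch */
        return NO_PREFETCH;
    }
}
```

With this reading, strategy 2 prefetches nothing for pages 0..4095, and for pages 4096..8191 it requests pages 8192..12287. Every second 32MB chunk is therefore requested ahead of time while the chunk before it is being read normally.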

Time to seq scan the 9.3GB of data on the RAID5 :

Strategy 0 : 46.4 s
It maxes out the disks anyway, so I didn't try the others.
However RAID1 is the better choice for databases that also take a fair
amount of writes...

Time to seq scan the 9.3GB of data on the RAID1 :

Strategy 0 : 162.8 s
Strategy 1 : 152.9 s
Strategy 2 : 105.2 s
Strategy 3 : 152.3 s

Strategy 2 cuts the seq scan duration by 35%, i.e. effective disk
bandwidth goes up by about 55%.
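
The two percentages follow directly from the timings for strategy 0 and strategy 2. As a small worked check (helper names are hypothetical, and the same 9.3GB is assumed to be read in both runs):

```c
#include <assert.h>

/* Fraction by which the scan time shrinks: (t_old - t_new) / t_old. */
static double time_reduction(double t_old, double t_new)
{
    assert(t_old > 0.0 && t_new > 0.0);
    return (t_old - t_new) / t_old;
}

/* Relative bandwidth gain for the same amount of data: t_old / t_new - 1. */
static double bandwidth_gain(double t_old, double t_new)
{
    assert(t_old > 0.0 && t_new > 0.0);
    return t_old / t_new - 1.0;
}
```

Calling these with 162.8 and 105.2 gives roughly 0.354 for the time reduction and roughly 0.548 for the bandwidth gain.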

For strategy 2, iostat says :

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             625,00    105288,00         0,00     105288          0
sdb             820,00    105968,00         0,00     105968          0

Both disks of the RAID1 are now read at the same time.

I guess it would need some experimenting with the values, and a
per-tablespace setting, but since lots of people use Linux Software RAID1
on servers, this might be interesting...

You guys want to try it ?

Patch attached.