Subject: Initial prefetch performance testing
From: Greg Smith <gsmith@gregsmith.com>
Msg-id: Pine.GSO.4.64.0809220317320.20434@westnet.com
List: pgsql-hackers

The complicated patch I've been working with for a while now is labeled 
"sequential scan posix fadvise" in the CommitFest queue.  There are a lot 
of parts to that, going back to last December, and I've added the most 
relevant links to the September CommitFest page.

The first message there on this topic is 
http://archives.postgresql.org/message-id/87ve7egxow.fsf@oxford.xeocode.com 
which is a program from Greg Stark that measures how much prefetching 
advisory information improves the overall transfer speed on a synthetic 
random read benchmark.  The idea is that you advise the OS about up to n 
requests at a time, where n goes from 1 (no prefetch at all) to 8192.  As 
n goes up, the total net bandwidth usually goes up as well.  You can 
basically divide the bandwidth at any prefetch level by the baseline (1=no 
prefetch) to get a speedup multiplier.  The program allows you to submit 
both unsorted and sorted requests, and the speedup is pretty large in both 
cases, with a similar distribution but a different magnitude.

While not a useful PostgreSQL patch on its own, this program lets one 
figure out whether the basic idea here (advise the OS about blocks ahead 
of time to speed up the whole read) works on a particular system, without 
having to cope with a larger test.  What I have to report here are results 
from many systems running both Linux and Solaris with various numbers of 
disk spindles.  The Linux systems use the posix fadvise call, while the 
Solaris ones use the aio library.
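
To make the mechanism concrete, here's a minimal sketch of the Linux 
technique (my own illustration, not code from Greg Stark's prefetch.c; 
BLOCKSZ, WINDOW, NREADS, and the test file default are placeholders I 
picked, and the timing/bandwidth measurement is omitted):

/*
 * Sliding-window fadvise prefetch sketch: keep an advisory queued for
 * the request WINDOW positions ahead of the one currently being read.
 */
#define _XOPEN_SOURCE 600       /* for posix_fadvise() and pread() */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCKSZ 8192
#define WINDOW  256             /* prefetch working set: the "n" above */
#define NREADS  100000          /* number of random block reads */

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : "testfile", O_RDONLY);
    off_t nblocks, *blocks;
    char buf[BLOCKSZ];
    long i;

    if (fd < 0) { perror("open"); return 1; }
    nblocks = lseek(fd, 0, SEEK_END) / BLOCKSZ;
    blocks = malloc(NREADS * sizeof(off_t));

    /* build an unsorted (random) request stream; sort this array
     * to test the sorted case instead */
    for (i = 0; i < NREADS; i++)
        blocks[i] = random() % nblocks;

    /* prime the window: advise the first WINDOW requests up front */
    for (i = 0; i < WINDOW && i < NREADS; i++)
        posix_fadvise(fd, blocks[i] * BLOCKSZ, BLOCKSZ,
                      POSIX_FADV_WILLNEED);

    for (i = 0; i < NREADS; i++)
    {
        /* advise the request WINDOW positions ahead of this one */
        if (i + WINDOW < NREADS)
            posix_fadvise(fd, blocks[i + WINDOW] * BLOCKSZ, BLOCKSZ,
                          POSIX_FADV_WILLNEED);
        /* then do the real read; with luck it hits the OS cache */
        if (pread(fd, buf, BLOCKSZ, blocks[i] * BLOCKSZ) != BLOCKSZ)
        {
            perror("pread");
            break;
        }
    }
    free(blocks);
    close(fd);
    return 0;
}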

Using the maximum prefetch working set tested, 8192, here's the speedup 
multiplier on this benchmark for both sorted and unsorted requests using 
an 8GB file:

    OS              Spindles   Unsorted X   Sorted X
 1: Linux               1         2.3         2.1
 2: Linux               1         1.5         1.0
 3: Solaris             1         2.6         3.0
 4: Linux               3         6.3         2.8
 5: Linux (Stark)       3         5.3         3.6
 6: Linux              10         5.4         4.9
 7: Solaris*           48        16.9         9.2

Systems (1)-(3) are standard single-disk workstations with various speed 
and size disks.  (4) is a 3-disk software RAID0 (on an Areca card in JBOD 
mode).  (5) is the system Greg Stark originally reported his results on, 
which is also a 3-disk array of some sort.  (6) uses a Sun 2640 disk array 
with a 10 disk RAID0+1 setup, while (7) is a Sun Fire X4500 with 48 disks 
in a giant RAID-Z array.

The Linux systems drop the OS cache after each run; they're all running 
kernel 2.6.18 or higher, which has that feature.  Solaris system (3) is 
using the UFS filesystem with the default tuning, which doesn't cache 
enough information for that to be necessary[1]--the results look very 
similar to the Linux case even without explicitly dropping the cache.
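
For anyone replicating the Linux part of that, here's a small sketch of 
the cache drop done from C (assuming root access and the standard 
/proc/sys/vm/drop_caches interface; an "echo 3" into that file from the 
shell works the same way, and the included script may do it differently):

/* Sketch: drop the Linux page cache between benchmark runs.
 * Requires root; fails harmlessly on kernels without drop_caches. */
#include <stdio.h>
#include <unistd.h>

int drop_os_cache(void)
{
    FILE *f;

    sync();                         /* flush dirty pages first */
    f = fopen("/proc/sys/vm/drop_caches", "w");
    if (f == NULL)
        return -1;                  /* not root, or no drop_caches */
    fputs("3\n", f);                /* 3 = page cache + dentries/inodes */
    fclose(f);
    return 0;
}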

* For (7) the results showed obvious caching (>150MB/s), as I expected 
from Solaris's ZFS, which does cache aggressively by default.  In order 
to get useful results with the server's 16GB of RAM, I increased the 
test file to 64GB, at which point the results looked reasonable.

Comparing with a prefetch working set of 256, which eyeballing my results 
spreadsheet suggested was the best return on prefetch effort before 
improvements leveled off, the speedups looked like this:

    OS              Spindles   Unsorted X   Sorted X
 1: Linux               1         2.3         2.0
 2: Linux               1         1.5         0.9
 3: Solaris             1         2.5         3.3
 4: Linux               3         5.8         2.6
 5: Linux (Stark)       3         5.6         3.7
 6: Linux              10         5.7         5.1
 7: Solaris            48        10.0         7.8

Observations:

-For the most part, using the fadvise/aio technique was a significant win 
even on single-disk systems.  The worst result, on system (2) with sorted 
blocks, was basically break-even within the measurement tolerance here: 
94% of the no-prefetch rate was the worst result I saw, but all of these 
bounced around by about +/- 5%, so I wouldn't read too much into that. 
In every other case, there was at least a 50% speed increase even with a 
single disk.

-As Greg Stark suggested, the larger the spindle count the larger the 
speedup, and the larger the prefetch size that might make sense.  His 
suggestion to model the user GUC as "effective_spindle_count" looks like a 
good one.  The submitted sequential scan fadvise implementation patch 
uses the earlier preread_pages name for that parameter, which I agree 
seems less friendly.

-The Solaris aio implementation seems to perform a bit better, relative 
to no prefetch, than the Linux fadvise one.  I'm left wondering whether 
that's just a Solaris vs. Linux thing: in particular, whether it's some 
lucky caching on Solaris where the cache isn't completely cleared, or 
whether Linux's aio library might work better than its fadvise call does.
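
For anyone curious, the general shape of the aio approach looks something 
like the sketch below (my guess at the idea only; the actual prefetch.c 
Solaris path may differ, and BLOCKSZ/WINDOW are the same placeholders as 
above).  You queue aio_read() calls into scratch buffers for upcoming 
blocks, purely so the data is sitting in the OS cache by the time the 
real read asks for it.  Link with -lrt on Linux/Solaris.

#include <sys/types.h>
#include <aio.h>
#include <errno.h>
#include <string.h>

#define BLOCKSZ 8192
#define WINDOW  256

static struct aiocb cbs[WINDOW];
static char bufs[WINDOW][BLOCKSZ];
static int used[WINDOW];

/* queue a prefetch of the block at 'offset'; slots cycle 0..WINDOW-1 */
int aio_prefetch(int fd, off_t offset, int slot)
{
    struct aiocb *cb = &cbs[slot];

    if (used[slot])
    {
        /* busy-wait keeps the sketch short; aio_suspend() is politer */
        while (aio_error(cb) == EINPROGRESS)
            ;
        (void) aio_return(cb);      /* reap the old request */
    }
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_offset = offset;
    cb->aio_buf = bufs[slot];       /* throwaway buffer; data is unused */
    cb->aio_nbytes = BLOCKSZ;
    used[slot] = 1;
    return aio_read(cb);            /* returns immediately; I/O is async */
}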

The attached archive file includes a couple of useful bits for anyone who 
wants to try this test on their hardware.  I think I've filed off all the 
rough edges here, and it should be real easy for someone else to run this 
test now.  It includes:

-prefetch.c is a slightly modified version of the original test program. 
I fixed a couple of minor bugs in the parameter input/output code that 
only showed up under some platform combinations; the actual prefetch 
implementation is untouched.

-prefetchtest is a shell script that compiles the program and runs it 
against a full range of prefetch sizes.  Just run it and tell it where you 
want the test data file to go (with an optional size that defaults to 
8GB), and it produces an output file named prefetch-results.csv with all 
the results in it.

-I included all of the raw data for the various systems I tested so other 
testers have baselines to compare against.  An OpenOffice spreadsheet 
that compares all the results and computes the ratios shown above is also 
included.

Conclusion:  on all the systems I tested, this approach gave excellent 
results, which makes me confident that I should see a corresponding 
speedup on database-level tests that use this same basic technique.  I'm 
not sure whether it might make sense to bundle this test program up 
somehow so others can use it for similar compatibility tests (I'm thinking 
of something similar to contrib/test_fsync); I'll revisit that after the 
rest of the review.

Next step:  I've got two data sets (one generated, one real-world sample) 
that should demonstrate a useful heap scan prefetch speedup, and one test 
program I think will demonstrate whether the sequential scan prefetch code 
works right.  Now that I've vetted all the hardware/OS combinations, I 
hope I can squeeze that in this week; I don't need to test all of them 
now that I know which systems are the interesting ones.

As far as other platforms go, I should get a Mac OS system in the near 
future to test on as well (once I have the database tests working; not 
worth scheduling yet), but as it will only have a single disk, that will 
basically just be a compatibility test rather than a serious performance 
one.  It would be nice to get a report from someone running FreeBSD to 
see what's needed to make the test script run on that OS.

[1] http://blogs.sun.com/jkshah/entry/postgresql_east_2008_talk_best : 
Page 8 of the presentation covers just how limited the default UFS cache 
tuning is.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
