Thread: Initial prefetch performance testing
The complicated patch I've been working with for a while now is labeled "sequential scan posix fadvise" in the CommitFest queue. There are a lot of parts to that, going back to last December, and I've added the most relevant links to the September CommitFest page. The first message there on this topic is http://archives.postgresql.org/message-id/87ve7egxow.fsf@oxford.xeocode.com which is a program from Greg Stark that measures how much prefetching advisory information improves the overall transfer speed on a synthetic random read benchmark.

The idea is that you advise the OS about up to n requests at a time, where n goes from 1 (no prefetch at all) to 8192. As n goes up, the total net bandwidth usually goes up as well. You can basically divide the bandwidth at any prefetch level by the baseline (1 = no prefetch) to get a speedup multiplier. The program allows you to submit both unsorted and sorted requests, and the speedup is pretty large and similarly distributed (but of different magnitude) in both cases. While not a useful PostgreSQL patch on its own, this program lets one figure out whether the basic idea here (advising about blocks ahead of time to speed up the whole read) works on a particular system, without having to cope with a larger test.

What I have to report here are some results from many systems running both Linux and Solaris with various numbers of disk spindles. The Linux systems use the posix_fadvise call, while the Solaris ones use the aio library.

Using the maximum prefetch working set tested, 8192, here's the speedup multiplier on this benchmark for both sorted and unsorted requests using an 8GB file:

   OS               Spindles   Unsorted X   Sorted X
1: Linux            1          2.3          2.1
2: Linux            1          1.5          1.0
3: Solaris          1          2.6          3.0
4: Linux            3          6.3          2.8
5: Linux (Stark)    3          5.3          3.6
6: Linux            10         5.4          4.9
7: Solaris*         48         16.9         9.2

Systems (1)-(3) are standard single-disk workstations with various speed and size disks. (4) is a 3-disk software RAID0 (on an Areca card in JBOD mode). (5) is the system Greg Stark originally reported his results on, which is also a 3-disk array of some sort. (6) uses a Sun 2640 disk array with a 10-disk RAID0+1 setup, while (7) is a Sun Fire X4500 with 48 disks in a giant RAID-Z array.

The Linux systems drop the OS cache after each run; they're all running kernel 2.6.18 or higher, which has that feature. Solaris system (3) is using the UFS filesystem with the default tuning, which doesn't cache enough information for that to be necessary[1]; the results look very similar to the Linux case even without explicitly dropping the cache.

* For (7) the initial results showed obvious caching (>150MB/s), as I expected from Solaris's ZFS, which does cache aggressively by default. In order to get useful results with the server's 16GB of RAM, I increased the test file to 64GB, at which point the results looked reasonable.

Comparing with a prefetch working set of 256, which I eyeballed on the results spreadsheet I made as the best return on prefetch effort before improvements leveled off, the speedups looked like this:

   OS               Spindles   Unsorted X   Sorted X
1: Linux            1          2.3          2.0
2: Linux            1          1.5          0.9
3: Solaris          1          2.5          3.3
4: Linux            3          5.8          2.6
5: Linux (Stark)    3          5.6          3.7
6: Linux            10         5.7          5.1
7: Solaris          48         10.0         7.8

Observations:

- For the most part, using the fadvise/aio technique was a significant win even on single-disk systems. The worst result, on system (2) with sorted blocks, was basically break-even within the measurement tolerance here: 94% of the no-prefetch rate was the worst result I saw, but all of these numbers bounced around by about +/- 5%, so I wouldn't read too much into that. In every other case, there was at least a 50% speed increase even with a single disk.

- As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.

- The Solaris aio implementation seems to perform a bit better relative to no prefetch than the Linux fadvise one does. I'm left wondering whether that's just a Solaris vs. Linux thing, in particular whether it's just some lucky caching on Solaris where the cache isn't completely cleared, or whether Linux's aio library might work better than its fadvise call does.

The attached archive file includes a couple of useful bits for anyone who wants to try this test on their hardware. I think I filed off all the rough edges here and it should be really easy for someone else to run this test now. It includes:

- prefetch.c is a slightly modified version of the original test program. I fixed a couple of minor bugs in the parameter input/output code that only showed up under some platform combinations; the actual prefetch implementation is untouched.

- prefetchtest is a shell script that compiles the program and runs it against a full range of prefetch sizes. Just run it and tell it where you want the test data file to go (with an optional size that defaults to 8GB), and it produces an output file named prefetch-results.csv with all the results in it.

- I included all of the raw data for the various systems I tested so other testers have baselines to compare against. An OpenOffice spreadsheet that compares all the results and computes the ratios shown above is also included.

Conclusion: on all the systems I tested, this approach gave excellent results, which makes me feel confident that I should see a corresponding speedup on database-level tests that use this same basic technique. I'm not sure whether it might make sense to bundle this test program up somehow so others can use it for similar compatibility tests (I'm thinking of something similar to contrib/test_fsync); I'll revisit that after the rest of the review.

Next step: I've got two data sets (one generated, one real-world sample) that should demonstrate a useful heap scan prefetch speedup, and one test program I think will demonstrate whether the sequential scan prefetch code works right. Now that I've vetted all the hardware/OS combinations, I hope I can squeeze that in this week; I don't need to test all of them now that I know which are the interesting systems.

As far as other platforms go, I should get a Mac OS system to test on in the near future as well (once I have the database tests working; not worth scheduling yet), but as it will only have a single disk, that will basically just be a compatibility test rather than a serious performance one. It would be nice to get a report from someone running FreeBSD to see what's needed to make the test script run on that OS.

[1] http://blogs.sun.com/jkshah/entry/postgresql_east_2008_talk_best : Page 8 of the presentation covers just how limited the default UFS cache tuning is.
-- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
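For reference, the core advise-then-read loop the benchmark above exercises looks roughly like the sketch below. This is an illustration only, not the actual prefetch.c: the file name, block size, and window size are arbitrary assumptions, and all timing/bandwidth measurement is omitted.

    /* Sketch of the advise-then-read idea: announce a window of upcoming
     * random block reads with posix_fadvise(POSIX_FADV_WILLNEED), then
     * read them back with pread().  Assumes "testfile" exists and is
     * non-empty.  Illustration only, not the benchmark program itself. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK_SIZE 8192
    #define WINDOW     256          /* "prefetch working set" size */

    int main(void)
    {
        int fd = open("testfile", O_RDONLY);
        if (fd < 0)
            return 1;

        off_t nblocks = lseek(fd, 0, SEEK_END) / BLOCK_SIZE;
        off_t offsets[WINDOW];
        char  buf[BLOCK_SIZE];

        /* pick a window of random block offsets */
        for (int i = 0; i < WINDOW; i++)
            offsets[i] = (random() % nblocks) * BLOCK_SIZE;

    #ifdef POSIX_FADV_WILLNEED
        /* tell the kernel about all of them before reading any */
        for (int i = 0; i < WINDOW; i++)
            posix_fadvise(fd, offsets[i], BLOCK_SIZE, POSIX_FADV_WILLNEED);
    #endif

        /* now actually read them; with the advice issued, the kernel can
         * schedule the seeks across the whole window */
        for (int i = 0; i < WINDOW; i++)
            pread(fd, buf, BLOCK_SIZE, offsets[i]);

        close(fd);
        return 0;
    }

The speedup numbers in the tables come from running this kind of loop with different WINDOW sizes and comparing the resulting bandwidth against the WINDOW = 1 (no prefetch) case.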
On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:

> -As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.

Good news about the testing.

I'd prefer to set this as a tablespace-level storage parameter, since that is where it would need to live when we have multiple tablespaces. Specifically as a storage parameter, so we have the same syntax for table-level and tablespace-level storage parameters. That would also allow us to have tablespace-level defaults for table-level settings.

prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage. Plus I don't ever want to discover that the best setting for effective_spindles is 7 (or 5) when I have 6 disks, because of some technology shift or Postgres behaviour change in the future.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Sep 22, 2008, at 12:02 PM, Simon Riggs wrote:

> On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
>
>> -As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.
>
> Good news about the testing.

Absolutely; we ran tests and got similar figures. Also, I/O is much more stable and steady with the patch.

> I'd prefer to set this as a tablespace-level storage parameter, since that is where it would need to live when we have multiple tablespaces. Specifically as a storage parameter, so we have the same syntax for table-level and tablespace-level storage parameters. That would also allow us to have tablespace-level defaults for table-level settings.

+1

> prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage. Plus I don't ever want to discover that the best setting for effective_spindles is 7 (or 5) when I have 6 disks, because of some technology shift or Postgres behaviour change in the future.

I would definitely avoid the use of "spindles"; I totally agree with Simon here. Once mature SSD storage or some in-memory stuff becomes available for the masses, this won't be suitable anymore. The best thing would be to simply use the parameter as it was in the original patch.

Maybe we should simply make the parameter adjustable per table and per index. This would automatically cover 95% of all cases, such as clustered tables and so on.

Many thanks and best regards,

hans

--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: www.postgresql-support.de
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
>
>> -As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.
>
> Good news about the testing.
>
> I'd prefer to set this as a tablespace-level storage parameter.

Sounds like a good idea, except... what's a tablespace-level storage parameter?

> prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage. Plus I don't ever want to discover that the best setting for effective_spindles is 7 (or 5) when I have 6 disks, because of some technology shift or Postgres behaviour change in the future.

In principle I quite strongly disagree with this. Someone might very well want to set spindle_count to 6 when he actually has 7, but at least he can have an intuitive feel for what he's doing: he's setting it to slightly less than what Postgres thinks is optimal. Number of blocks to prefetch is an internal implementation detail for which the DBA has absolutely no way to know the correct value. That's how we get the cargo-cult configuration tweaks we've seen in the past, where people follow recommendations with no idea what the consequences are or whether they apply.

In an ideal world we would have a half-dozen parameters to tell Postgres how much memory is available, how many disks are available, etc., and Postgres would know how best to use the resources. I think if we expose internal knobs like you propose then we end up with hundreds of parameters, and to adjust them you'll have to be an expert in Postgres internals.

That said, there is a place for these internal knobs when we don't really know how to best make use of resources. At this point we only have results from a few systems, and the results don't seem to jibe with the theory.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!
On Mon, 2008-09-22 at 16:46 +0100, Gregory Stark wrote:

> Simon Riggs <simon@2ndQuadrant.com> writes:
>
>> On Mon, 2008-09-22 at 04:57 -0400, Greg Smith wrote:
>>
>>> -As Greg Stark suggested, the larger the spindle count the larger the speedup, and the larger the prefetch size that might make sense. His suggestion to model the user GUC as "effective_spindle_count" looks like a good one. The sequential scan fadvise implementation patch submitted uses the earlier preread_pages name for that parameter, which I agree seems less friendly.
>>
>> Good news about the testing.
>>
>> I'd prefer to set this as a tablespace-level storage parameter.
>
> Sounds like a good idea, except... what's a tablespace-level storage parameter?

A storage parameter, just at the tablespace level:

    WITH (storage_parameter = value)

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Mon, 22 Sep 2008, Simon Riggs wrote:

> I'd prefer to set this as a tablespace-level storage parameter.

That seems reasonable, but I'm not working at that level yet. There's still a larger open question about how the buffer manager interaction will work here, and I'd like to have a better view of that first before getting into the exact syntax used to set the parameter. For now, a GUC works well enough, but you're right that something finer-grained may make sense before this actually hits the codebase.

> prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage.

It's possible to make a case for exposing the internal number that's getting varied here, naming the parameter something like prefetch_depth, and letting people set that to whatever they want. Based on the current data I might suggest a default of 256, using 0 to turn the feature off altogether, and a maximum of at least 8192 and possibly more.

In practice I expect there to only be a couple of popular values, and the idea of fine-tuning is a bit questionable. I think that's what Greg Stark was driving at with how the value was re-spun. Instead of using effective_spindle_count, you could just as easily make a case for an enum like [off,low,medium,high] mapping to [0,16,256,8192]. From what I've seen so far, that would reduce tweaking time in the field considerably while not really changing the range of available behavior very much.

I will be running a set of tests on a fast SSD device before I'm done; that's another one I'll try once I've got the database-level tests ready to run, too. What I expect is that it will favor 0: presumably you might as well just read the blocks rather than advise about them when the seek overhead is close to zero. I should be able to do a RAM disk run as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
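For illustration, the coarse setting sketched above could map onto an internal prefetch depth along these lines. The names and numbers come from the message itself; the code is hypothetical and not part of any submitted patch.

    /* Hypothetical mapping from a coarse user-facing setting to an
     * internal prefetch depth, using the [off,low,medium,high] ->
     * [0,16,256,8192] values suggested above.  Illustration only. */
    typedef enum
    {
        PREFETCH_OFF,
        PREFETCH_LOW,
        PREFETCH_MEDIUM,
        PREFETCH_HIGH
    } PrefetchSetting;

    static int
    prefetch_depth(PrefetchSetting setting)
    {
        switch (setting)
        {
            case PREFETCH_OFF:    return 0;
            case PREFETCH_LOW:    return 16;
            case PREFETCH_MEDIUM: return 256;   /* suggested default above */
            case PREFETCH_HIGH:   return 8192;  /* suggested maximum above */
        }
        return 256;
    }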
On Mon, 2008-09-22 at 13:06 -0400, Greg Smith wrote:

>> prefetch_... is a much better name since it's an existing industry term. I'm not in favour of introducing the concept of spindles, since I can almost hear the questions about ramdisks and memory-based storage.
>
> It's possible to make a case for exposing the internal number that's getting varied here, naming the parameter something like prefetch_depth, and letting people set that to whatever they want. Based on the current data I might suggest a default of 256, using 0 to turn the feature off altogether, and a maximum of at least 8192 and possibly more.
>
> In practice I expect there to only be a couple of popular values, and the idea of fine-tuning is a bit questionable. I think that's what Greg Stark was driving at with how the value was re-spun. Instead of using effective_spindle_count, you could just as easily make a case for an enum like [off,low,medium,high] mapping to [0,16,256,8192]. From what I've seen so far, that would reduce tweaking time in the field considerably while not really changing the range of available behavior very much.

Tuning Postgres I/O already involves quite a few parameters called buffersize, segment width, stripe size, etc. I've never heard a disk manufacturer say this is wrong and we should just have "spindle equivalents".

I don't think we should dress this up too much, that's all. We aren't going to make anybody's life any easier, but we will probably generate lots of annoying phone calls to disk manufacturers asking "so how many spindles is your subsystem worth in Postgres terms?", to which they will shrug and say "no idea".

Is the behaviour of this sufficiently linear to be able to say that 3 spindles = 3 effective_spindles, 6 = 6, etc.? I would guess it won't be, and then you're left with a name more misleading than useful.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Gregory Stark wrote:

> Simon Riggs <simon@2ndQuadrant.com> writes:
>> I'm not in favour of introducing the concept of spindles....
>
> In principle I quite strongly disagree with this.... Number of blocks to prefetch is an internal implementation detail for which the DBA has absolutely no way to know the correct value.

Even more often on the systems I see these days, "spindles" is an implementation detail for which the DBA has no way to know the correct value.

For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to. For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.

I'd rather have a parameter that expressed things more in terms of measurable quantities -- perhaps seeks/second? Perhaps random-access/sequential-access times?

[1] http://www.amazon.com/gp/browse.html?node=201590011
[2] http://www.amazon.com/b/ref=sc_fe_c_0_201590011_1?ie=UTF8&node=689343011&no=201590011&me=A36L942TSJ2AJA
On Mon, 22 Sep 2008, Gregory Stark wrote:

> Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out at only 10x the bandwidth of one drive. I would expect more like 24x or more.

The ZFS RAID-Z implementation doesn't really scale that linearly. It's rather hard to get the full bandwidth out of an X4500 with any single process, and I haven't done any filesystem tuning to improve things; everything is at the defaults.

> I'm quite surprised Solaris doesn't support posix_fadvise -- perhaps it's in some other version of Solaris?

Both the systems I used were standard Solaris 10 boxes, and I'm not aware of any changes in this area in the later OpenSolaris releases (which is where I'd expect something like this to change first). The test program I tried failed its #ifdef POSIX_FADV_WILLNEED check, and the message I saw from you at http://archives.postgresql.org/message-id/877imua265.fsf@oxford.xeocode.com suggested you didn't find any fadvise either, so I didn't look much further. The above is a cue for someone from Sun to chime in on this subject.

> I have an updated patch I'll be sending along shortly. You might want to test with that?

Obviously I've got everything set up to test right now; I'm currently analyzing your earlier patch and the sequential scan fork that derived from it. If you've got a later version of the bitmap heap scan one as well, I'll replace the one I had been planning to test (your bitmap-preread-v9) with that one when it's available.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:

> For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to.

Well, I don't see how you're going to guess how much prefetching is optimal for those environments either...

> For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.

Sounds to me like you're finding it pretty intuitive. Actually you would want "1", because it can handle one request at a time. And if you have a multipath array, I imagine you would want to think of each interface as a spindle, because that's the bottleneck and you'll want to keep all the interfaces busy.

> I'd rather have a parameter that expressed things more in terms of measurable quantities -- perhaps seeks/second? Perhaps random-access/sequential-access times?

Well, that's precisely what I'm saying. Simon et al want a parameter to control how much prefetching to do. That's *not* a measurable quantity. I'm suggesting effective_spindle_count, which *is* a measurable quantity, even if it might be a bit harder to measure in some environments than others.

The two other quantities you describe are both currently represented by our random_page_cost (or random_page_cost/sequential_page_cost). What we're dealing with now is an entirely orthogonal property of your system: how many concurrent requests the system can handle.

If you have ten spindles then you really want to send enough requests to ensure there are ten concurrent requests being processed on ten different drives (assuming you want each scan to make maximum use of the resources, which is primarily true in DSS but might not be true in OLTP). That's a lot more than ten requests, though, because if you sent ten requests many of them would end up on the same devices. In theory my logic led me to think that for ten drives it would be about 30. Experiments seem to show it's more like 300-400. That discrepancy might be a reason to put this debate aside for now anyways and expose the internal implementation until we understand better what's going on there.

Ironically, I'm pretty happy to lose this argument because EDB is interested in rolling this into its dynamic tuning module. If there's a consensus -- by my count three people have spoken up already, which is more than usual -- then I'll gladly concede. Anyone object to going back to preread_pages? Or should it be prefetch_pages? prefetch_blocks? Something else?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
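One plausible way to arrive at the "about 30" figure above, assuming a simple balls-in-bins model (this is my own reading of the argument, not necessarily the author's actual reasoning): if k outstanding random requests land independently on d drives, the expected number of busy drives is d*(1 - ((d-1)/d)^k).

    /* Expected number of busy drives when k outstanding random requests
     * are spread over d drives: d * (1 - ((d-1)/d)^k).  Rough model only. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double d = 10.0;

        for (int k = 10; k <= 40; k += 10)
            printf("%2d outstanding requests -> %.1f of %.0f drives busy\n",
                   k, d * (1.0 - pow((d - 1.0) / d, (double) k)), d);
        /* 10 -> 6.5, 20 -> 8.8, 30 -> 9.6, 40 -> 9.9: roughly 30 requests
         * are needed before all ten drives are almost always busy */
        return 0;
    }

This simple model does not explain the 300-400 figure seen in the experiments, which is the discrepancy mentioned above.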
Gregory Stark wrote:

> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>> I'd rather have a parameter that expressed things more in terms of measurable quantities [...]
>
> ...What we're dealing with now is an entirely orthogonal property of your system: how many concurrent requests the system can handle.

Really? I'd have thought you'd want to give the OS enough guesses about the future that its elevator algorithms for the drive heads don't keep seeking back and forth, but rather do as much per sweep across a device as they can.

> Ironically, I'm pretty happy to lose this argument because EDB is interested in rolling this into its dynamic tuning module. If there's a consensus -- by my count three people have spoken up already, which is more than usual -- then I'll gladly concede. Anyone object to going back to preread_pages? Or should it be prefetch_pages? prefetch_blocks? Something else?

Well, as you pointed out, I'm not on their side of the debate either. I'm not sure what a relevant measurable parameter would be, so I'm not being too helpful in the conversation either.
[resending due to the attachment being too large for the -hackers list -- weren't we going to raise it when we killed -patches?]

Greg Smith <gsmith@gregsmith.com> writes:

> Using the maximum prefetch working set tested, 8192, here's the speedup multiplier on this benchmark for both sorted and unsorted requests using an 8GB file:
>
>    OS               Spindles   Unsorted X   Sorted X
> 1: Linux            1          2.3          2.1
> 2: Linux            1          1.5          1.0
> 3: Solaris          1          2.6          3.0
> 4: Linux            3          6.3          2.8
> 5: Linux (Stark)    3          5.3          3.6
> 6: Linux            10         5.4          4.9
> 7: Solaris*         48         16.9         9.2

Incidentally, I've been looking primarily at the sorted numbers because they parallel bitmap heap scans. (Note that the heap scan is only about half the I/O of a bitmap index scan + heap scan, so even if it's infinitely faster it'll only halve the time spent in the two nodes.)

Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out at only 10x the bandwidth of one drive. I would expect more like 24x or more. I wonder if Solaris's aio has an internal limit on how many pending I/O requests it can handle. Perhaps it's a tunable?

Unfortunately I don't see a convenient, low-invasive way to integrate aio into Postgres. With posix_fadvise we can just issue the advice and then forget about it. But with aio we would pretty much have to pick a target buffer, pin it, issue the aio, and then remember the pin later when we need to read the buffer. That would require restructuring the code significantly.

I'm quite surprised Solaris doesn't support posix_fadvise -- perhaps it's in some other version of Solaris?

Here's a graph of results from this program for various sized arrays on a single machine:

http://wiki.postgresql.org/images/a/a3/Results.svg

Each colour corresponds to an array of a different number of spindles, ranging from 1 to 15 drives. The X axis is how much prefetching was done and the Y axis is the bandwidth obtained. There is a distinct maximum and then a dropoff, and it would be great to get some data points for larger arrays to understand where that maximum goes as the array gets larger.

> Conclusion: on all the systems I tested, this approach gave excellent results, which makes me feel confident that I should see a corresponding speedup on database-level tests that use this same basic technique. I'm not sure whether it might make sense to bundle this test program up somehow so others can use it for similar compatibility tests (I'm thinking of something similar to contrib/test_fsync); I'll revisit that after the rest of the review.
>
> Next step: I've got two data sets (one generated, one real-world sample) that should demonstrate a useful heap scan prefetch speedup, and one test program I think will demonstrate whether the sequential scan prefetch code works right. Now that I've vetted all the hardware/OS combinations, I hope I can squeeze that in this week; I don't need to test all of them now that I know which are the interesting systems.

I have an updated patch I'll be sending along shortly. You might want to test with that?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
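To make the fadvise-vs-aio contrast above concrete, here is a minimal sketch of what the two interfaces look like from the caller's side. This is an illustration only: buffer management, error handling, and any resemblance to how a patch would actually plumb this through the buffer manager are all omitted, and BLCKSZ is just assumed to be the usual 8192.

    /* Sketch of the two prefetch styles discussed above.  Illustration
     * only.  Link with -lrt on Linux for the aio functions. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>

    #define BLCKSZ 8192

    /* posix_fadvise: fire and forget -- no buffer to hold onto */
    static void
    hint_block(int fd, off_t blkno)
    {
        posix_fadvise(fd, blkno * BLCKSZ, BLCKSZ, POSIX_FADV_WILLNEED);
    }

    /* aio: the caller must supply a buffer that stays valid (i.e. a
     * pinned shared buffer) and must keep the control block around
     * until the read completes */
    static void
    start_read(struct aiocb *cb, int fd, off_t blkno, void *pinned_buffer)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_offset = blkno * BLCKSZ;
        cb->aio_buf    = pinned_buffer;
        cb->aio_nbytes = BLCKSZ;
        aio_read(cb);
    }

    /* ... later, when the block is actually needed ... */
    static void
    finish_read(struct aiocb *cb)
    {
        const struct aiocb *list[1] = {cb};

        while (aio_error(cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);     /* wait for completion */
        (void) aio_return(cb);              /* collect the result */
    }

The point of the sketch is the bookkeeping difference: hint_block() needs nothing remembered afterwards, while the aio path has to carry both the pinned buffer and the aiocb from the point where the prefetch is issued to the point where the block is read.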
Greg Smith wrote:

> On Mon, 22 Sep 2008, Gregory Stark wrote:
>
>> I'm quite surprised Solaris doesn't support posix_fadvise -- perhaps it's in some other version of Solaris?

Solaris has only a fake variant of posix_fadvise. See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/port/gen/posix_fadvise.c

UFS has its own optimizations. For example, if it detects a sequential scan, the file cache is limited to 80kB. It also has prefetching. ZFS has intelligent read-pattern-recognition algorithms and other improvements.

Zdenek
Greg Smith <gsmith@gregsmith.com> writes:

> On Mon, 22 Sep 2008, Gregory Stark wrote:
>
>> Hm, I'm disappointed with the 48-drive array here. I wonder why it maxed out at only 10x the bandwidth of one drive. I would expect more like 24x or more.
>
> The ZFS RAID-Z implementation doesn't really scale that linearly. It's rather hard to get the full bandwidth out of an X4500 with any single process, and I haven't done any filesystem tuning to improve things; everything is at the defaults.

Well, random access I/O will fall pretty far short of the full bandwidth. Actually this is a major issue: our sequential_page_cost vs. random_page_cost dichotomy doesn't really work when we're prefetching pages. In my experiments, an array capable of supplying about 1.4GB/s in sequential I/O could only muster about 40MB/s of random I/O with prefetching, and only about 5MB/s without. For this machine we would have quite a dilemma setting random_page_cost -- do we set it to 280 (the sequential-to-random throughput ratio without prefetching) or 35 (the ratio with prefetching)?

Perhaps access paths which expect to be able to prefetch most of their accesses should use random_page_cost / effective_spindle_count for their I/O costs? But then if people don't set random_page_cost high enough they could easily find themselves with random fetches being costed as less expensive than sequential fetches. And I have a feeling it'll be a hard sell to get people to set random_page_cost in the double digits, let alone triple digits.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Gregory Stark <stark@enterprisedb.com> writes:

> Perhaps access paths which expect to be able to prefetch most of their accesses should use random_page_cost / effective_spindle_count for their I/O costs?
>
> But then if people don't set random_page_cost high enough they could easily find themselves with random fetches being costed as less expensive than sequential fetches. And I have a feeling it'll be a hard sell to get people to set random_page_cost in the double digits, let alone triple digits.

Well, we could use something like

    Max(random_page_cost / effective_spindle_count, seq_page_cost)

to ensure the result remains somewhat sane.

            regards, tom lane
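As a worked illustration of that clamp, with made-up parameter values (my own example numbers, not anything from the planner's actual costing code): the clamp only matters when the spindle count is large enough to push the divided random cost below the sequential cost.

    /* Illustration of the Max() clamp suggested above, with hypothetical
     * parameter values; not the planner's actual costing code. */
    #include <stdio.h>

    static double
    prefetch_random_cost(double random_page_cost, double seq_page_cost,
                         int effective_spindle_count)
    {
        double cost = random_page_cost / effective_spindle_count;

        /* never let the prefetch-adjusted random cost drop below the
         * sequential cost */
        return (cost > seq_page_cost) ? cost : seq_page_cost;
    }

    int main(void)
    {
        /* 4.0 / 48 = 0.083, clamped up to seq_page_cost = 1.0 */
        printf("%.3f\n", prefetch_random_cost(4.0, 1.0, 48));
        /* 4.0 / 3 = 1.333, above seq_page_cost so used as-is */
        printf("%.3f\n", prefetch_random_cost(4.0, 1.0, 3));
        return 0;
    }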
Greg Smith <gsmith@gregsmith.com> writes:

>> I have an updated patch I'll be sending along shortly. You might want to test with that?
>
> Obviously I've got everything set up to test right now; I'm currently analyzing your earlier patch and the sequential scan fork that derived from it. If you've got a later version of the bitmap heap scan one as well, I'll replace the one I had been planning to test (your bitmap-preread-v9) with that one when it's available.

Well, here you go. It includes:

 . Bitmap heap scans (as before)
 . Index scans
 . Setting POSIX_FADV_SEQUENTIAL for bulk sequential scans
 . Improved (ie, debugged) autoconf tests for posix_fadvise (and posix_fallocate, though I don't have any code using it yet)

The bitmap heap scans are still prefetching the number of buffers I estimate based on effective_spindle_count according to my magic formula. We've been discussing throwing that out; I just haven't modified this to do that yet.

Index scans prefetch all pages for matching index tuples on the leaf page when we do page-at-a-time scans. I haven't bothered doing the gradual ramp-up or keeping a ring of the optimal size prefetched.

The sequential scan stuff is based on Zoltan's posts but done in a different way. It passes an I/O access strategy to smgr and fd.c, which keeps track of what the previous strategy was and calls posix_fadvise if it's changed. This will correctly handle queries which reference the same table twice, even if one reference is a sequential scan and the other is an index lookup.

I have *not* been able to observe any significant effect from POSIX_FADV_SEQUENTIAL, but I'm not sure under what circumstances it was a problem. It sounds like it's a peculiar situation which is not easy to reliably reproduce.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Attachment
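A minimal sketch of the "only re-advise when the access strategy changes" idea described in the message above. This is my own simplification, with a single cached advice value instead of the patch's smgr/fd.c plumbing and per-file tracking, so it is only meant to show the shape of the technique.

    /* Sketch of switching kernel readahead advice only when the access
     * pattern for a file changes.  Simplified illustration, not the patch. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    static int current_advice = -1;     /* last advice issued */

    static void
    set_access_pattern(int fd, int advice)
    {
        if (advice == current_advice)
            return;                     /* unchanged, no syscall needed */

        /* offset 0, len 0 means "apply to the whole file" */
        posix_fadvise(fd, 0, 0, advice);
        current_advice = advice;
    }

    /*
     * Usage, e.g.:
     *   set_access_pattern(fd, POSIX_FADV_SEQUENTIAL);  before a bulk seq scan
     *   set_access_pattern(fd, POSIX_FADV_RANDOM);      before index lookups
     * so a query touching the same table both ways gets the right hint
     * for each kind of access.
     */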
On Tue, 23 Sep 2008, Gregory Stark wrote:

> I have *not* been able to observe any significant effect from POSIX_FADV_SEQUENTIAL, but I'm not sure under what circumstances it was a problem. It sounds like it's a peculiar situation which is not easy to reliably reproduce.

Zoltan, Hans-Juergen: would it be possible for you to try the latest bitmap-preread-v18.diff.gz patch Greg Stark just sent over to the list? It's at http://archives.postgresql.org/message-id/87ljxjudu7.fsf@oxford.xeocode.com as well.

That's a refinement of the original strategy you used, and I'd be curious to hear whether it still works usefully on the troublesome workload you submitted your original patch against. Since none of the rest of us have so far been successful at replicating the large speed-up on multiple concurrent sequential scans you reported, I think you're the best candidate to see whether there was any regression because of how the patch was refactored.

I'm excited to see index scans in the new patch as well, since I've got 1TB of test data, navigated that way, that I can test with.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Ron Mayer wrote:

> Even more often on the systems I see these days, "spindles" is an implementation detail for which the DBA has no way to know the correct value.
>
> For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to. For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.
>
> I'd rather have a parameter that expressed things more in terms of measurable quantities -- perhaps seeks/second? Perhaps random-access/sequential-access times?

I assume SAN users might not know the number of spindles either.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Gregory Stark wrote:

> Ron Mayer <rm_pg@cheapcomplexdevices.com> writes:
>
>> For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to.
>
> Well, I don't see how you're going to guess how much prefetching is optimal for those environments either...
>
>> For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.
>
> Sounds to me like you're finding it pretty intuitive. Actually you would want "1", because it can handle one request at a time. And if you have a multipath array, I imagine you would want to think of each interface as a spindle, because that's the bottleneck and you'll want to keep all the interfaces busy.

I assume everyone would want at least one, because you would want the next page to be prefetched while you are processing the existing page. I think a larger problem is that the page prefetch will likely take longer than processing the existing page, which might mean that you need to prefetch a few pages ahead to allow read reordering for better performance.

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian wrote:

> Ron Mayer wrote:
>> Even more often on the systems I see these days, "spindles" is an implementation detail for which the DBA has no way to know the correct value.
>>
>> For example, on our sites hosted with Amazon's compute cloud (a great place to host web sites), I know nothing about spindles, but I do know about Amazon Elastic Block Store[2] and Instance Store[1] volumes. I have some specs and am able to run benchmarks on them, but I couldn't guess how many spindles my X% of the N-disk device corresponds to. For another example, some of our sales guys with SSD drives have 0 spindles on their demo machines.
>>
>> I'd rather have a parameter that expressed things more in terms of measurable quantities -- perhaps seeks/second? Perhaps random-access/sequential-access times?
>
> I assume SAN users might not know the number of spindles either.

Yeah. Nevertheless, I like the way effective_spindle_count works, as opposed to an unintuitive "number of blocks to prefetch" (assuming the formula we use to turn the former into the latter works). Perhaps we should keep the meaning the same, but call it "effective_io_concurrency"? Something that conveys the idea of "how many simultaneous I/O requests the I/O subsystem can handle", without referring to any specific technology. That concept applies to SANs and RAM drives as well.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Wed, 2008-09-24 at 17:42 +0300, Heikki Linnakangas wrote:

> Yeah. Nevertheless, I like the way effective_spindle_count works, as opposed to an unintuitive "number of blocks to prefetch" (assuming the formula we use to turn the former into the latter works). Perhaps we should keep the meaning the same, but call it "effective_io_concurrency"? Something that conveys the idea of "how many simultaneous I/O requests the I/O subsystem can handle", without referring to any specific technology. That concept applies to SANs and RAM drives as well.

You've spoiled all the fun now with a good suggestion. I was looking forward to the Jules Verne-like nostalgia of the other suggestion over the years to come.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Mon, 2008-09-22 at 16:46 +0100, Gregory Stark wrote:
>
>> Simon Riggs <simon@2ndQuadrant.com> writes:
>>
>>> I'd prefer to set this as a tablespace-level storage parameter.
>>
>> Sounds like a good idea, except... what's a tablespace-level storage parameter?
>
> A storage parameter, just at the tablespace level:
>
>     WITH (storage_parameter = value)

I still think this is a good idea, but there are unanswered questions about it. Surely whatever we do with this parameter also holds for random_page_cost and sequential_page_cost as well? Should they remain GUC parameters at all? If so, how would setting any of them locally interact with the tablespace parameter for tables used in the query?

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!