Re: RAID arrays and performance - Mailing list pgsql-performance

From Mark Mielke
Subject Re: RAID arrays and performance
Date
Msg-id 4755F6E7.2090701@mark.mielke.cc
In response to Re: RAID arrays and performance  (James Mansion <james@mansionfamily.plus.com>)
Responses Re: RAID arrays and performance
List pgsql-performance
James Mansion wrote:
> Mark Mielke wrote:
>> PostgreSQL or the kernel should already have the hottest pages in
>> memory, so the value of doing async I/O is very likely the cooler
>> pages that are unique to the query. We don't know what the cooler
>> pages are until we follow the tree down.
>>
> I'm assuming that at the time we start to search the index, we have
> some idea of the value or values that
> we are looking for.  Or, as you say, we are applying a function to
> 'all of it'.
>
> Think of a 'between' query.  The subset of the index that can be a
> match can be bounded by the leaf
> pages that contain the end point(s).  Similarly if we have a merge
> with a sorted intermediate set from
> a prior step then we also have bounds on the values.

How do you find the bounding points for these pages? Does this not
require descending through the tree in a more traditional way?

> I'm not convinced that your assertion that the index leaf pages must
> necessarily be processed in-order
> is true - it depends what sort of operation is under way.  I am
> assuming that we try hard to keep
> interior index nodes and data in memory and that having identified the
> subset of these that we want, we
> can immediately infer the set of leaves that are potentially of interest.

It is because you are missing my point. In order to find a reduced set
of pages to load, one must descend through the B-Tree. Identifying the
subset requires loading the pages sequentially. There is some theoretical
potential for async I/O here, but I doubt the gain is significant in
practice. I ask you again - how do you know which lower-level pages to
load before you have loaded the higher-level pages? The only time the
B-Tree can be processed out of order in this regard is when you are doing
a bitmap index scan or some similar operation that scans the entire tree
and does not care what order the pages arrive in. If you are looking for
a specific key, one parent page leads to one child page, and the
operation is sequential.
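
To make the data dependency concrete, here is a rough sketch (not the
actual nbtree code - the page layout and read_page() are invented for
illustration) of why a single-key descent cannot issue its reads in
parallel:

    /* Each read_page() depends on the child pointer found in the page
     * read before it, so the I/Os form a chain that cannot overlap. */
    typedef struct BTreePage {
        int  is_leaf;
        int  nkeys;
        int  keys[64];          /* separator keys                    */
        long children[65];      /* child page numbers, or heap TIDs  */
    } BTreePage;

    extern BTreePage *read_page(long pageno);   /* blocking disk read */

    long btree_descend(long root, int search_key)
    {
        long pageno = root;
        for (;;)
        {
            BTreePage *page = read_page(pageno);  /* must wait here    */
            int i = 0;
            while (i < page->nkeys && search_key >= page->keys[i])
                i++;
            if (page->is_leaf)
                return page->children[i];         /* e.g. heap location */
            pageno = page->children[i];           /* only now known     */
        }
    }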

>> The difference between preload and handling async I/O in terms of
>> performance is debatable. Greg reports that async I/O on Linux sucks,
>> but posix_fadvise*() has substantial benefits. posix_fadvise*() is
>> preload not async I/O (he also reported that async I/O on Solaris
>> seems to work well). Being able to do work as the first page is
>> available is a micro-optimization as far as I am concerned at this
>> point (one that may not yet work on Linux), as the real benefit comes
>> from utilizing all 12 disks in Matthew's case, not from guaranteeing
>> that data is processed as soon as possible.
> I see it as part of the same problem.  You can partition the data
> across all the disks and run queries in parallel
> against the partitions, or you can lay out the data in the RAID array
> in which case the optimiser has very little idea
> how the data will map to physical data layout - so its best bet is to
> let the systems that DO know decide the
> access strategy.  And those systems can only do that if you give them
> a lot of requests that CAN be reordered,
> so they can choose a good ordering.

You can see it however you like - what remains is that the 12X speed-up
is going to come from using 12 disks and loading those 12 disks in
parallel. The more theoretical improvements - the ability of an
individual hard disk to schedule reads and return results out of order -
have not, in my reading, been shown to reliably improve performance to a
significant degree. Unfortunately, Linux doesn't support NCQ on my SATA
drives yet, so I haven't been able to experiment myself. Gregory provided
numbers showing a 2X - 3X performance improvement when using three disks.
This has the potential for significant improvement with real hardware,
modest cost, and perhaps trivial changes to PostgreSQL. What you describe
is a redesign of the I/O strategy around asynchronous I/O, with some sort
of state machine able to deal efficiently with index pages or table pages
arriving out of order. Do you have numbers to show that such a
significant change would result in a reasonable return on the investment?
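
For what it's worth, the kind of "trivial change" I have in mind is along
these lines (only a sketch - the function and its callers are invented,
and I assume an 8K block size): before a scan reads its list of blocks in
order, hand the whole batch to the kernel with posix_fadvise(), so that
all 12 spindles can start seeking at once:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>

    #define BLCKSZ 8192

    /* Hint the kernel about a batch of blocks we are about to read.
     * The subsequent read()s still happen one by one, in order, but
     * the disks can service the underlying requests in parallel. */
    void prefetch_blocks(int fd, const long *blocks, int nblocks)
    {
        for (int i = 0; i < nblocks; i++)
            (void) posix_fadvise(fd,
                                 (off_t) blocks[i] * BLCKSZ,
                                 BLCKSZ,
                                 POSIX_FADV_WILLNEED);
    }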

>> Micro-optimization.
> Well, you like to assert this - but why?

I'll quote from:

    http://en.wikipedia.org/wiki/Native_Command_Queuing

Most reading I have done shows NCQ to have limited gains, with most
benchmarks being suspect. Also, the latency benefit applies only to the
first page. One presumes that async I/O will be valuable mostly in the
case where many pages can be scheduled for reading at the same time. When
many pages are scheduled for reading, the first page will eventually be
served, and the overall bandwidth is still the same. In your theoretical
model, you are presuming the CPU is the bottleneck, either for processing
or for scheduling the next batch of reads. I think you are hand-waving,
and given that Linux doesn't yet support async I/O well, Gregory's model
will serve more PostgreSQL users than yours, with less complexity.
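
To put rough numbers on the latency-versus-bandwidth point (the figures
below are assumptions for illustration, not measurements):

    #include <stdio.h>

    int main(void)
    {
        double seek_ms = 8.0;    /* assumed random-read service time    */
        int    nreads  = 1200;   /* scattered pages a query must fetch  */
        int    ndisks  = 12;

        /* One outstanding request at a time: every read eats a full seek. */
        printf("serial:   %.1f s\n", nreads * seek_ms / 1000.0);

        /* Enough requests queued to keep every spindle busy: the win is
         * the 12X parallelism, whether or not any single drive reorders
         * its own queue (NCQ). */
        printf("parallel: %.1f s\n", nreads * seek_ms / 1000.0 / ndisks);
        return 0;
    }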

> Clearly a hint to preload is better than nothing.  But it seems to me
> that the worst case is that we wait for
> the slowest page to load and then start processing hoping that the
> rest of the data stays in the buffer cache
> and is 'instant', while AIO and evaluate-when-ready means that process
> is still bound by the slowest
> data to arrive, but at that point there is little processing still to
> do, and the already-processed buffers can be
> reused earlier.  In the case where there is significant pressure on the
> buffer cache, this can be significant.

It seems to me that you have yet to prove that there will be substantial
gains in overall performance from preload. Leaping from there to
presuming that PostgreSQL should switch to a fully async I/O model seems
a greater stretch. By the sounds of it, Gregory could have the former
implemented very soon. When will you have yours? :-)

>> In your hand waving, you have assumed that PostgreSQL B-Tree index
>> might need to be replaced? :-)
> Sure, if the intent is to make the system thread-hot or AIO-hot, then
> the change is potentially very
> invasive.  The strategy to evaluate queries based on parallel
> execution and async IO is not necessarily
> very like a strategy where you delegate to the OS buffer cache.
>
> I'm not too bothered for the purpose of this discussion whether the way
> that postgres currently
> navigates indexes is amenable to this.  This is bikeshed land, right?

I am only interested in juicy projects that have a hope of success. This
subject does interest me - I am hoping my devil's advocate participation
encourages people to seek a practical implementation that will benefit me.

> I think it is foolish to disregard strategies that will allow
> overlapping IO and processing - and we want to
> keep disks reading and writing rather than seeking.  To me that
> suggests AIO and disk-native queuing
> are quite a big deal.  And parallel evaluation will be too as the
> number of cores goes up and there is
> an expectation that this should reduce the latency of an individual query,
> not just allow throughput with lots
> of concurrent demand.

I am more amenable to multi-threaded index processing for the same query
than to async I/O taking advantage of NCQ. Guess we each come from a
different background. :-)

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
