Re: RAID arrays and performance - Mailing list pgsql-performance

From James Mansion
Subject Re: RAID arrays and performance
Date
Msg-id 4755E9A7.7060906@mansionfamily.plus.com
In response to Re: RAID arrays and performance  (Mark Mielke <mark@mark.mielke.cc>)
Responses Re: RAID arrays and performance  (Mark Mielke <mark@mark.mielke.cc>)
List pgsql-performance
Mark Mielke wrote:
> PostgreSQL or the kernel should already have the hottest pages in
> memory, so the value of doing async I/O is very likely the cooler
> pages that are unique to the query. We don't know what the cooler
> pages are until we follow the tree down.
>
I'm assuming that at the time we start to search the index, we have
some idea of the value or values we are looking for.  Or, as you say,
we are applying a function to 'all of it'.

Think of a 'between' query.  The subset of the index that can match
is bounded by the leaf pages that contain the end point(s).  Similarly,
if we have a merge with a sorted intermediate set from a prior step,
then we also have bounds on the values.

I'm not convinced that your assertion that the index leaf pages must
necessarily be processed in order is true - it depends what sort of
operation is under way.  I am assuming that we try hard to keep interior
index nodes and data in memory, and that having identified the subset
of these that we want, we can immediately infer the set of leaves that
are potentially of interest, as in the sketch below.
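
A minimal sketch of what I mean (the Node layout and field names are
hypothetical stand-ins, not PostgreSQL's actual B-Tree structures, and
it assumes for simplicity that leaf pages are numbered in key order):
with the interior levels cached, two descents bound the leaf range for
a 'between' before any leaf is read, so the whole candidate set is
known up front.

    #include <stddef.h>

    typedef struct Node {
        int          nkeys;          /* number of separator keys */
        long         keys[64];       /* sorted separator keys */
        struct Node *child[65];      /* NULL on the level above leaves */
        long         leaf_blk[65];   /* leaf block, used when child is NULL */
    } Node;

    /* Descend the cached interior nodes to the leaf that would hold
     * 'key'; no I/O happens here if the interior levels are resident. */
    static long leaf_for(const Node *n, long key)
    {
        for (;;) {
            int i = 0;
            while (i < n->nkeys && key >= n->keys[i])
                i++;
            if (n->child[i] == NULL)
                return n->leaf_blk[i];
            n = n->child[i];
        }
    }

    /* The candidate leaves for BETWEEN lo AND hi are then every block
     * from leaf_for(root, lo) through leaf_for(root, hi) - known up
     * front, so all of them can be requested before any is processed. */
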
> The difference between preload and handling async I/O in terms of
> performance is debatable. Greg reports that async I/O on Linux sucks,
> but posix_fadvise*() has substantial benefits. posix_fadvise*() is
> preload not async I/O (he also reported that async I/O on Solaris
> seems to work well). Being able to do work as the first page is
> available is a micro-optimization as far as I am concerned at this
> point (one that may not yet work on Linux), as the real benefit comes
> from utilizing all 12 disks in Matthew's case, not from guaranteeing
> that data is processed as soon as possible.
>
I see it as part of the same problem.  You can partition the data
across all the disks and run queries in parallel against the partitions,
or you can lay the data out in the RAID array, in which case the
optimiser has very little idea how the data maps to the physical layout
- so its best bet is to let the systems that DO know decide the access
strategy.  And those systems can only do that if you give them a lot of
requests that CAN be reordered, so they can choose a good ordering.
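
As a sketch of the sort of thing I mean (fd, blocks[] and PAGESZ are
illustrative assumptions, not PostgreSQL internals): announce the whole
batch with posix_fadvise() before consuming any of it, and let the
kernel and the drive pick the seek order.

    #include <fcntl.h>
    #include <unistd.h>

    #define PAGESZ 8192

    void batched_fetch(int fd, const long *blocks, int n, char *buf)
    {
        /* Tell the kernel everything we will need; it and the disk's
         * native queuing are free to choose the ordering. */
        for (int i = 0; i < n; i++)
            posix_fadvise(fd, (off_t) blocks[i] * PAGESZ, PAGESZ,
                          POSIX_FADV_WILLNEED);

        /* By the time we read, most pages should already be in the
         * page cache instead of being fetched one seek at a time in
         * our (arbitrary) request order. */
        for (int i = 0; i < n; i++)
            pread(fd, buf, PAGESZ, (off_t) blocks[i] * PAGESZ);
    }
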

> Micro-optimization.
>
Well, you like to assert this - but why?  If the concern is latency
(and my experience suggests that latency, not throughput per se, is the
biggest issue in practice), then overlapping processing with waits for
'distant' data is important - and we have no information about the
physical layout of the data that lets us assert that forward pre-read
of a file is the fastest way to access it.  We have to allow the OS
(and, to an increasing extent, the disks themselves) to manage the
elevator I/O to best effect.  It's clear that the streaming read and
write speeds of modern disks are very high compared to random access,
so anything we can do to help the disks run in that mode is worthwhile,
even if the physical streaming matches no obvious logical ordering of
the OS files or of the logical data pages within them.  If you have a
function to apply to a set of independent data elements, then requiring
that it be applied in a fixed order, rather than conceptually in
parallel, puts a heavy constraint on how the hardware can optimise it.

Clearly a hint to preload is better than nothing.  But it seems to me
that with a pure preload hint the worst case is that we wait for the
slowest page to load and only then start processing, hoping that the
rest of the data stays in the buffer cache and is 'instant'.  With AIO
and evaluate-when-ready, the process is still bound by the slowest data
to arrive, but at that point there is little processing left to do, and
the already-processed buffers can be reused earlier.  Where there is
significant pressure on the buffer cache, that can matter a lot - see
the sketch below.
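
A sketch of evaluate-when-ready with POSIX AIO (NPAGES, PAGESZ and
process_page() are illustrative assumptions, not PostgreSQL code; link
with -lrt on Linux, and per Greg's report Linux may not reward this
while Solaris does): issue all the reads at once, then process each
page in completion order rather than request order.

    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    #define NPAGES 12
    #define PAGESZ 8192

    extern void process_page(const char *buf);  /* hypothetical consumer */

    void scatter_read(int fd, const long *blocks)
    {
        struct aiocb cb[NPAGES];
        char         bufs[NPAGES][PAGESZ];
        int          done[NPAGES] = {0};
        int          remaining = NPAGES;

        /* Queue every read up front so the OS elevator and the drive's
         * native command queuing can order the seeks to best effect. */
        for (int i = 0; i < NPAGES; i++) {
            memset(&cb[i], 0, sizeof cb[i]);
            cb[i].aio_fildes = fd;
            cb[i].aio_buf    = bufs[i];
            cb[i].aio_nbytes = PAGESZ;
            cb[i].aio_offset = (off_t) blocks[i] * PAGESZ;
            aio_read(&cb[i]);
        }

        while (remaining > 0) {
            const struct aiocb *pending[NPAGES];
            int np = 0;

            for (int i = 0; i < NPAGES; i++)
                if (!done[i])
                    pending[np++] = &cb[i];
            aio_suspend(pending, np, NULL);  /* sleep until one arrives */

            /* Evaluate whatever is ready; the slowest page no longer
             * delays work on the others. */
            for (int i = 0; i < NPAGES; i++) {
                if (done[i] || aio_error(&cb[i]) == EINPROGRESS)
                    continue;
                if (aio_return(&cb[i]) == PAGESZ)
                    process_page(bufs[i]);
                done[i] = 1;
                remaining--;
            }
        }
    }
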

Of course, a couple of decades bullying Sybase systems on Sun
Enterprise boxes may have left me somewhat jaundiced - but Sybase can
at least parallelise things.  Sometimes.  When it does, it's quite a
big win.

> In your hand waving, you have assumed that the PostgreSQL B-Tree index
> might need to be replaced? :-)
>
Sure, if the intent is to make the system thread-hot or AIO-hot, then
the change is potentially very invasive.  A strategy that evaluates
queries with parallel execution and async I/O is not necessarily much
like one that delegates to the OS buffer cache.

I'm not too bothered, for the purpose of this discussion, whether the
way that Postgres currently navigates indexes is amenable to this.
This is bikeshed land, right?

I think it is foolish to disregard strategies that allow overlapping
I/O and processing - we want to keep disks reading and writing rather
than seeking.  To me that suggests AIO and disk-native queuing are
quite a big deal.  Parallel evaluation will be too, as the number of
cores goes up and there is an expectation that this should reduce the
latency of an individual query, not just sustain throughput under lots
of concurrent demand.

