Re: Configuration Recommendations - Mailing list pgsql-performance

From Robert Klemme
Subject Re: Configuration Recommendations
Date
Msg-id CAM9pMnOQWT4OqTMyRCJuXXV5OYvcDMK=kKucpKe8QUcQep_ioA@mail.gmail.com
Whole thread Raw
In response to Re: Configuration Recommendations  ("Greg Sabino Mullane" <greg@turnstep.com>)
List pgsql-performance
On Tue, May 15, 2012 at 7:53 PM, Greg Sabino Mullane <greg@turnstep.com> wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: RIPEMD160
>
>
>>>> Is it established practice in the Postgres world to separate indexes
>>>> from tables? I would assume that the reasoning of Richard Foote -
>>>> albeit for Oracle databases - is also true for Postgres:
>>
>>> Yes, it's an established practice. I'd call it something just short of
>>> a best practice though, as it really depends on your situation.
>>
>> What are the benefits?
>
> Disk seeks, basically. Yes, there are a lot of complications regarding
> all the various hardware/OS/PG level cachings, but at the end of the
> day, it's less work to have each drive concentrate on a single area

Hmm...  I see your point.  OTOH, the whole purpose of using NAS or SAN
with cache, logical volumes and multiple spindles per volume is to
reduce the impact of slow disk operations like seeks.  If in such a
situation your database operations are impacted by those seek
operations then the setup does not seem optimal anyway.  Bottom line
is: with a setup properly tailored to the workload there should be no
seeks "visible" to the database.

> (especially as we always require a heap scan at the moment).

Are you referring to the scan along tuple versions?
http://facility9.com/2011/03/postgresql-row-storage-fundamentals/

>>> I also find his examples a bit contrived, and the whole "multi-user"
>>> argument irrelevant for common cases.
>>
>> Why is that?
>
> Because most Postgres servers are dedicated to serving the same data
> or sets of data, and the number of "other users" calling ad-hoc queries
> against lots of different tables (per his example) is small.

I don't see how it should be relevant for this discussion whether
selects are "ad hoc" or other.  The mere fact that concurrent accesses
to the same set of tables and indexes albeit to different data (keys)
is sufficient to have a potential for seeks - even if disks for index
and table are separated.  And this will typically happen in a
multiuser application - even if all users use the same set of queries.

> So this sentence just doesn't ring true to me:
>
>    " ... by the time weâve read the index leaf block, processed and
>    read all the associated table blocks referenced by the index leaf
>    block, the chances of there being no subsequent physical activity
>    in the index tablespace due to another user session is virtually
>    nil. We would still need to re-scan the disk to physically access
>    the next index leaf block (or table block) anyways."
>
> That's certainly not true for Postgres servers, and I doubt if it
> is quite that bad on Oracle either.

I don't think this has much to do with the brand.  Richard just
describes logical consequences of concurrent access (see my attempt at
explanation above).  Fact remains that concurrent accesses rarely
target for the same data and because of that you would see quite
erratic access patterns to blocks.  How they translate to actual disk
accesses depends on various caching mechanisms in place and the
physical distribution of data across disks (RAID).  But I think we
cannot ignore the fact that the data requested by concurrent queries
most likely resides on different blocks.

>>> I lean towards using separate tablespaces in Postgres, as the
>>> performance outweighs the additional>> complexity.
>
>> What about his argument with regards to access patterns (i.e.
>> interleaving index and table access during an index scan)?  Also,
>> Shaun's advice to have more spindles available sounds convincing to
>> me, too.
>
> I don't buy his arguments. To do so, you'd have to buy a key point:
>
>    "when most physical I/Os in both index and table segments are
>     effectively random, single block reads"
>
> They are not; hence, the rest of his argument falls apart. Certainly,
> if things were as truly random and uncached as he states, there would
> be no benefit to separation.

Your argument with seeks also only works in absence of caching (see
above).  I think Richard was mainly pointing out that /in absence of
caching/ different blocks need to be accessed here.

> As far as spindles, yes: like RAM, it's seldom the case to have
> too litte :) But as with all things, one should get some benchmarks
> on your specific workload before making hardware changes. (Well, RAM
> may be an exception to that, up to a point).

Can you share some measurement data which backs the thesis that the
distribution of index and table to different disks is advantageous?
That would be interesting to see.  Then one could also balance
performance benefits against other effects (manageability etc.) and
see on which side the advantage comes out.

Even though I'm not convinced: Thank you for the interesting discussion!

Cheers

robert


--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

pgsql-performance by date:

Previous
From: Merlin Moncure
Date:
Subject: Re: [pgsql-performance] Daily digest v1.3606 (10 messages)
Next
From: Merlin Moncure
Date:
Subject: Re: SSD selection