
From Thomas Munro
Subject Re: WIP: [[Parallel] Shared] Hash
Date
Msg-id CAEepm=0hUD+JfGLeFrdLU+80zQEqHA1SC7bu84CMbLERVLTCag@mail.gmail.com
In response to Re: [HACKERS] WIP: [[Parallel] Shared] Hash  (Andres Freund <andres@anarazel.de>)
Responses Re: WIP: [[Parallel] Shared] Hash  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
On Mon, Mar 27, 2017 at 12:12 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Sun, Mar 26, 2017 at 3:41 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>>> 1.  Segments are what buffile.c already calls the individual
>>> capped-at-1GB files that it manages.  They are an implementation
>>> detail that is not part of buffile.c's user interface.  There seems to
>>> be no reason to change that.
>>
>> After reading your next email I realised this is not quite true:
>> BufFileTell and BufFileSeek expose the existence of segments.
>
> Yeah, that's something that tuplestore.c itself relies on.
>
> I always thought that the main practical reason why we have BufFile
> multiplex 1GB segments concerns use of temp_tablespaces, rather than
> considerations that matter only when using obsolete file systems:
>
> /*
>  * We break BufFiles into gigabyte-sized segments, regardless of RELSEG_SIZE.
>  * The reason is that we'd like large temporary BufFiles to be spread across
>  * multiple tablespaces when available.
>  */
>
> Now, I tend to think that most installations that care about
> performance would be better off using RAID to stripe their one temp
> tablespace file system. But, I suppose this still makes sense when you
> have a number of file systems that happen to be available, and disk
> capacity is the main concern. PHJ uses one temp tablespace per worker,
> which I further suppose might not be as effective in balancing disk
> space usage.

I was thinking about IO bandwidth balance rather than size.  If you
rotate through tablespaces segment-by-segment, won't you be exposed to
phasing effects that could leave disk arrays idle for periods of
time?  Whereas if you assign them to participants, you can only get
idle arrays if you have fewer participants than tablespaces.
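
To make the phasing point concrete, here is a toy sketch of the two
policies (invented names, not the patch's actual code), given
ntablespaces temp tablespaces to spread segments across:

    /*
     * Rotate per segment: every participant writing its k-th segment
     * at about the same time lands on the same tablespace.
     */
    static int
    tablespace_by_segment(int segno, int ntablespaces)
    {
        return segno % ntablespaces;
    }

    /*
     * Assign per participant: each worker sticks to one tablespace,
     * so arrays sit idle only if nparticipants < ntablespaces.
     */
    static int
    tablespace_by_participant(int participant, int ntablespaces)
    {
        return participant % ntablespaces;
    }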

This seems like a fairly complex subtopic and I don't have a strong
view on it.  Clearly you could rotate through tablespaces on the basis
of participant, partition, segment, some combination, or something
else.  Doing it by participant seemed to me to be the least prone to
IO imbalance caused by phasing effects (= segment based) or data
distribution (= partition based), of the options I considered when I
wrote it that way.
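
As a throwaway sanity check on that reasoning (purely illustrative,
not from the patch): if all participants write their k-th segment in
lockstep, per-segment rotation keeps exactly one array busy at any
instant, while per-participant assignment keeps
min(nparticipants, ntablespaces) busy.

    #include <stdio.h>

    #define NPARTICIPANTS 4
    #define NTABLESPACES  3

    int
    main(void)
    {
        for (int k = 0; k < 6; k++)
        {
            int busy_seg[NTABLESPACES] = {0};
            int busy_part[NTABLESPACES] = {0};
            int nseg = 0, npart = 0;

            /* All participants write segment k at the same moment. */
            for (int p = 0; p < NPARTICIPANTS; p++)
            {
                busy_seg[k % NTABLESPACES] = 1;   /* rotate per segment */
                busy_part[p % NTABLESPACES] = 1;  /* assign per participant */
            }
            for (int t = 0; t < NTABLESPACES; t++)
            {
                nseg += busy_seg[t];
                npart += busy_part[t];
            }
            printf("segment %d: per-segment busy=%d, per-participant busy=%d\n",
                   k, nseg, npart);
        }
        return 0;
    }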

Like you, I also tend to suspect that people would be more likely to
use RAID type technologies to stripe things like this for both
bandwidth and space reasons these days.  Tablespaces seem to make more
sense as a way of separating different classes of storage
(fast/expensive, slow/cheap etc), not as an IO or space striping
technique.  I may be way off base there though...

-- 
Thomas Munro
http://www.enterprisedb.com


