Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Parallel tuplesort (for parallel B-Tree index creation)
Msg-id 0939f43f-226b-ee57-4a27-2d1b8e18185c@iki.fi
In response to Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Parallel tuplesort (for parallel B-Tree index creation)  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
I'm reviewing patches 1-3 in this series, i.e. those patches that are 
not directly related to parallelism, but are independent improvements to 
merging.

Let's begin with patch 1:

On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
> Cap the number of tapes used by external sorts
>
> Commit df700e6b set merge order based on available buffer space (the
> number of tapes was as high as possible while still allowing at least 32
> * BLCKSZ buffer space per tape), rejecting Knuth's theoretically
> justified "sweet spot" of 7 tapes (a merge order of 6 -- Knuth's P),
> improving performance when the sort thereby completed in one pass.
> However, it's still true that there are unlikely to be benefits from
> increasing the number of tapes past 7 once the amount of data to be
> sorted significantly exceeds available memory; that commit probably
> mostly just improved matters where it enabled all merging to be done in
> a final on-the-fly merge.
>
> One problem with the merge order logic established by that commit is
> that with large work_mem settings and data volumes, the tapes previously
> wasted as much as 8% of the available memory budget; tens of thousands
> of tapes could be logically allocated for a sort that will only benefit
> from a few dozen.

Yeah, wasting 8% of the memory budget on this seems like a bad idea. If 
I understand correctly, that makes the runs shorter than necessary, 
leading to more runs.
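
Just to check where that 8% figure comes from (rough arithmetic, using
the constants as I read them in tuplesort.c; a sketch only, the exact
numbers don't matter):

#include <stdio.h>

/* Constants as (I believe) defined in tuplesort.c, with 8 kB blocks. */
#define BLCKSZ                8192
#define TAPE_BUFFER_OVERHEAD  (BLCKSZ * 3)   /* per-tape buffering overhead */
#define MERGE_BUFFER_SIZE     (BLCKSZ * 32)  /* preread workspace per input tape */

int
main(void)
{
    long        work_mem = 1024L * 1024 * 1024;     /* 1 GB, for illustration */

    /* Merge order derived purely from the memory budget, as today. */
    long        max_tapes = (work_mem - TAPE_BUFFER_OVERHEAD) /
                            (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);

    /* Memory reserved up front for the tape buffers alone. */
    long        reserved = max_tapes * TAPE_BUFFER_OVERHEAD;

    printf("max tapes: %ld\n", max_tapes);
    printf("reserved:  %ld bytes (%.1f%% of work_mem)\n",
           reserved, 100.0 * reserved / work_mem);
    return 0;
}

At 1 GB of work_mem that gives ~3700 tapes and ~90 MB (about 8.6%)
reserved for tape buffers before a single tuple has been sorted.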

> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
> tuplesort will ever use (i.e.  merge order is capped at 500 inclusive).
> This is a conservative estimate of the number of runs at which doing all
> merging on-the-fly no longer allows greater overlapping of I/O and
> computation.

Hmm. Surely there are cases where, with > 501 tapes, you could do it in 
one merge pass, but now you need two? And that would hurt performance, no?
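
For example (made-up numbers, and assuming roughly one quicksorted run
per work_mem of input): with work_mem = 1 GB and ~700 GB of input you
get ~700 runs. Today those fit in a single ~700-way merge, but with the
order capped at 500 you need an extra intermediate pass:

#include <math.h>
#include <stdio.h>

/* Idealized pass count for nruns initial runs at a given merge order. */
static int
merge_passes(double nruns, double merge_order)
{
    return (int) ceil(log(nruns) / log(merge_order));
}

int
main(void)
{
    double      nruns = 700.0;      /* ~700 GB of input at 1 GB work_mem */

    printf("order 3700: %d pass(es)\n", merge_passes(nruns, 3700.0));
    printf("order  500: %d pass(es)\n", merge_passes(nruns, 500.0));
    return 0;
}

That prints 1 pass vs. 2 passes, and the extra pass means re-reading and
re-writing a sizable chunk of the data.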

Why do we reserve the buffer space for all the tapes right at the 
beginning? Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD) 
call in inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every 
time we start a new run, until we reach maxTapes?
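
To illustrate what I mean (a toy model of the accounting only, not
actual tuplesort code, reusing the made-up 1 GB figures from above):

#include <stdio.h>

#define BLCKSZ                8192
#define TAPE_BUFFER_OVERHEAD  (BLCKSZ * 3)

int
main(void)
{
    long        work_mem = 1024L * 1024 * 1024;     /* 1 GB */
    long        max_tapes = 3744;       /* roughly what 1 GB allows today */
    long        nruns = 10;             /* runs the input actually produces */

    /* Current accounting: all tape buffers charged up front in inittapes(). */
    long        eager = work_mem - max_tapes * TAPE_BUFFER_OVERHEAD;

    /* Suggested accounting: charge one tape's overhead per run started. */
    long        lazy = work_mem - nruns * TAPE_BUFFER_OVERHEAD;

    printf("left for run formation, charged up front: %ld MB\n",
           eager / (1024 * 1024));
    printf("left for run formation, charged per run:  %ld MB\n",
           lazy / (1024 * 1024));
    return 0;
}

With only a handful of runs, that's close to 90 MB of work_mem that
could instead have gone into making the runs longer.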

- Heikki



