Re: Tuplesort merge pre-reading - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Tuplesort merge pre-reading
Date
Msg-id 1541ef84-c2e4-26b6-03a3-19d22589caa2@iki.fi
Whole thread Raw
In response to Re: Tuplesort merge pre-reading  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Tuplesort merge pre-reading  (Peter Geoghegan <pg@heroku.com>)
Re: Tuplesort merge pre-reading  (Peter Geoghegan <pg@heroku.com>)
Re: Tuplesort merge pre-reading  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
On 09/15/2016 10:12 PM, Peter Geoghegan wrote:
> On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Spotted another issue with this code just now. Shouldn't it be based
>>> on state->tapeRange? You don't want the destTape to get memory, since
>>> you don't use batch memory for tapes that are written to (and final
>>> on-the-fly merges don't use their destTape at all).
>
>>> Wait, you're using a local variable maxTapes here, which potentially
>>> differs from state->maxTapes:
>
>> I changed that so that it does actually change state->maxTapes. I considered
>> having a separate numTapes field, that can be smaller than maxTapes, but we
>> don't need the original maxTapes value after that point anymore, so it
>> would've been just pro forma to track them separately. I hope the comment
>> now explains that better.
>
> I still don't get why you're doing all of this within mergeruns() (the
> beginning of when we start merging -- we merge all quicksorted runs),
> rather than within beginmerge() (the beginning of one particular merge
> pass, of which there are potentially more than one). As runs are
> merged in a non-final merge pass, fewer tapes will remain active for
> the next merge pass. It doesn't do to do all that up-front when we
> have multiple merge passes, which will happen from time to time.

Now that the pre-reading is done in logtape.c, it doesn't stop at a run 
boundary. For example, when we read the last 1 MB of the first run on a 
tape, and we're using a 10 MB read buffer, we will merrily also read the 
first 9 MB from the next run. You cannot un-read that data, even if the 
tape is inactive in the next merge pass.

I don't think it makes much difference in practice, because most merge 
passes use all, or almost all, of the available tapes. BTW, I think the 
polyphase algorithm prefers to do all the merges that don't use all 
tapes upfront, so that the last final merge always uses all the tapes. 
I'm not 100% sure about that, but that's my understanding of the 
algorithm, and that's what I've seen in my testing.

> Correct me if I'm wrong, but I think that you're more skeptical of the
> need for polyphase merge than I am. I at least see no reason to not
> keep it around. I also think it has some value. It doesn't make this
> optimization any harder, really.

We certainly still need multi-pass merges.

BTW, does a 1-way merge make any sense? I was surprised to see this in 
the log, even without this patch:

LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;

- Heikki



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Implement targetlist SRFs using ROWS FROM() (was Changed SRF in targetlist handling)
Next
From: Robert Haas
Date:
Subject: Re: Renaming some binaries