Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date
Msg-id 6f918bd0-2a8c-c8c0-664f-78b25135c033@iki.fi
Whole thread Raw
In response to Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 12/21/2016 12:53 AM, Robert Haas wrote:
>> That leaves one problem, though: reusing space in the final merge phase. If
>> the tapes being merged belong to different LogicalTapeSets, and create one
>> new tape to hold the result, the new tape cannot easily reuse the space of
>> the input tapes because they are on different tape sets.
>
> If the worker is always completely finished with the tape before the
> leader touches it, couldn't the leader's LogicalTapeSet just "adopt"
> the tape and overwrite it like any other?

Currently, the logical tape code assumes that all tapes in a single 
LogicalTapeSet are allocated from the same BufFile. The logical tape's 
on-disk format contains block numbers, to point to the next/prev block 
of the tape [1], and they're assumed to refer to the same file. That 
allows reusing space efficiently during the merge. After you have read 
the first block from tapes A, B and C, you can immediately reuse those 
three blocks for output tape D.

Now, if you read multiple tapes, from different LogicalTapeSet, hence 
backed by different BufFiles, you cannot reuse the space from those 
different tapes for a single output tape, because the on-disk format 
doesn't allow referring to blocks in other files. You could reuse the 
space of *one* of the input tapes, by placing the output tape in the 
same LogicalTapeSet, but not all of them.

We could enhance that, by using "filename + block number" instead of 
just block number, in the pointers in the logical tapes. Then you could 
spread one logical tape across multiple files. Probably not worth it in 
practice, though.


[1] As the code stands, there are no next/prev pointers, but a tree of 
"indirect" blocks. But I'm planning to change that to simpler next/prev 
pointers, in 
https://www.postgresql.org/message-id/flat/55b3b7ae-8dec-b188-b8eb-e07604052351%40iki.fi

- Heikki




pgsql-hackers by date:

Previous
From: Joel Jacobson
Date:
Subject: Re: [HACKERS] SET NOT NULL [NOT VALID / CONCURRENTLY]?
Next
From: Craig Ringer
Date:
Subject: Re: [HACKERS] Replication slot xmin is not reset if HS feedback isturned off while standby is shut down