Thread: Logical tape pause/resume

Logical tape pause/resume

From
Heikki Linnakangas
Date:
One of the patches in Peter Geoghegan's Parallel tuplesort patch set [1]
is to put a cap on the number of tapes to use for the sort. The reason
is that each tape consumes 3 * BLCKSZ worth of memory, for the buffers.
We decide the max number of tapes upfront, so if we decide that e.g. the
max number of tapes is 1000, we reserve 1000 * 3 * BLCKSZ bytes of
memory for the buffers, and that memory cannot then be used for the
quicksorts to build the initial runs, even if it turns out later that we
needed only a few tapes.

That amounts to about 8% of the available memory. That's quite wasteful.
Peter's approach of putting an arbitrary cap on the max number of tapes
works, but I find it a bit hackish. And you still waste that 8% with
smaller work_mem settings.
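
To see where that 8% comes from: tuplesort_merge_order() budgets roughly
MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD of memory per tape, and only the
TAPE_BUFFER_OVERHEAD part is reserved up front for the buffers. A
back-of-the-envelope sketch as a standalone program, using the 9.6-era
constants from tuplesort.c (illustrative only, not PostgreSQL code):

/*
 * Rough check of the "about 8%" figure.  The tape-buffer share of the
 * per-tape budget is the memory that is reserved up front and therefore
 * unavailable to the run-building quicksorts.
 */
#include <stdio.h>

#define BLCKSZ                  8192
#define TAPE_BUFFER_OVERHEAD    (BLCKSZ * 3)    /* 3 buffers kept per tape */
#define MERGE_BUFFER_SIZE       (BLCKSZ * 32)   /* workspace per input tape */

int
main(void)
{
	double		share;

	share = (double) TAPE_BUFFER_OVERHEAD /
		(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
	printf("tape buffers take %.1f%% of the memory budget\n", share * 100.0);
	/* prints: tape buffers take 8.6% of the memory budget */
	return 0;
}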

When we finish writing an initial run to a tape, we keep the tape
buffers around. That's the reason we need to reserve that memory. But
there's no fundamental reason we have to do that. As I suggested in [2],
we could flush the buffers to disk, and only keep in memory the location
of the last block we wrote. If we need to write another run to the tape,
we can reload the last incomplete block from the disk, and continue writing.

Reloading the last block requires an extra I/O. That's OK. It's quite
rare to have a multi-pass sort these days, so it's a good bet that you
don't need to do it. And if you have a very small work_mem, so that you
need to do a multi-pass sort, having some more memory available for
building the initial runs is probably worth it, and there's also a good
chance that the block is still in the OS cache.

So, here are two patches to that end:

1. The first patch changes the way we store the logical tapes on disk.
Instead of using indirect blocks, HFS style, the blocks form a linked
list. Each block has a trailer, with the block numbers of the previous
and next block of the tape. That eliminates the indirect blocks, which
simplifies the code quite a bit, and makes it simpler to implement the
pause/resume functionality in the second patch. It also immediately
reduces the memory needed for the buffers, from 3 to 1 block per tape,
as we don't need to hold the indirect blocks in memory.
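
To make that concrete, the trailer could look roughly like this (a sketch;
the struct and macro names are illustrative, not necessarily what the
patch uses):

/*
 * Each BLCKSZ-sized block ends with a trailer pointing at its neighbours,
 * so the blocks of a tape form a doubly-linked list and no indirect
 * blocks are needed.
 */
typedef struct TapeBlockTrailer
{
	long		prev;	/* previous block on this tape, or -1 on the first */
	long		next;	/* next block on this tape, or -1 on the last */
} TapeBlockTrailer;

#define TapeBlockPayloadSize  (BLCKSZ - sizeof(TapeBlockTrailer))
#define TapeBlockGetTrailer(buf) \
	((TapeBlockTrailer *) ((char *) (buf) + TapeBlockPayloadSize))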

2. The second patch adds a new function LogicalTapePause(). A paused
tape can be written to, but it doesn't have an in-memory buffer. The
last, incomplete block is written to disk, but if you call
LogicalTapeWrite(), a new buffer is allocated and the last block is read
back into memory. If you don't write to the tape anymore, and call
LogicalTapeRewind() after LogicalTapePause(), nothing needs to be done,
as the last block is already on disk.
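
The intended call pattern is roughly the following (a sketch: the
LogicalTape* calls are the patch's functions, while the wrapper and its
arguments are made up for illustration):

/*
 * Write one quicksorted run to a tape, then pause the tape so its buffer
 * memory can be reused for building the next run.
 */
static void
write_run_and_pause(LogicalTapeSet *lts, int destTape,
					char **tuples, size_t *lens, int ntuples)
{
	int			i;

	for (i = 0; i < ntuples; i++)
		LogicalTapeWrite(lts, destTape, tuples[i], lens[i]);

	/*
	 * Flush the last, incomplete block to disk and free the buffer,
	 * remembering only the last block's number.  A later LogicalTapeWrite()
	 * reads that block back in and continues appending; if the tape is
	 * never written to again, LogicalTapeRewind() needs no extra work.
	 */
	LogicalTapePause(lts, destTape);
}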

In addition to saving a little bit of memory, I'd like to do this
refactoring because it simplifies the code. It's code that has stayed
largely unchanged for the past 15 years, so I'm not too eager to touch
it, but looking at the changes coming with Peter's parallel tuplesort
patch set, I think this makes sense. The parallel tuplesort patch set
adds code for "serializing" and "deserializing" a tape, which means
writing the top indirect block to disk, so that it can be read back in
by the leader process. That's awfully similar to the "pause/resume"
functionality here, so by adding it now, we won't have to do it in the
parallel tuplesort patch. (Admittedly, it's not a whole lot of code, but
still.)

[1] CAM3SWZQKM=Pzc=CAHzRixKjp2eO5Q0Jg1SoFQqeXFQ647JiwqQ@mail.gmail.com
[2] e8f44b63-4745-b855-7772-e8201906a4a1@iki.fi

- Heikki

Attachment

Re: Logical tape pause/resume

From
Simon Riggs
Date:
On 4 October 2016 at 11:47, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> When we finish writing an initial run to a tape, we keep the tape buffers
> around. That's the reason we need to reserve that memory. But there's no
> fundamental reason we have to do that. As I suggested in [2], we could flush
> the buffers to disk, and only keep in memory the location of the last block
> we wrote. If we need to write another run to the tape, we can reload the
> last incomplete block from the disk, and continue writing.
>
> Reloading the last block requires an extra I/O. That's OK. It's quite rare
> to have a multi-pass sort these days, so it's a good bet that you don't need
> to do it. And if you have a very small work_mem, so that you need to do a
> multi-pass sort, having some more memory available for building the initial
> runs is probably worth it, and there's also a good chance that the block is
> still in the OS cache.

Sounds like a good idea.

Why not just make each new run start at a block boundary?
That way we waste on average BLCKSZ/2 disk space per run, which is
negligible but we avoid any need to have code to read back in the last
block.

That also makes it easier to put each run in its own file, which will
surely help with parallelism and compression since we can spread
multiple run files across multiple temp tablespaces.

Can we also please read in the value from the last tuple in a run, so
we have min and max values in memory for each run? That can be used
during the merge step to avoid merging runs unless the value ranges
overlap.
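
For illustration, with simple integer keys the test could be as trivial as
this (a sketch of the idea only; nothing like it exists in the patches):

/* Per-run key range, kept in memory (illustrative, integer keys only). */
typedef struct RunRange
{
	int			min;		/* smallest key in the run */
	int			max;		/* largest key in the run */
} RunRange;

/*
 * Two runs need an actual merge only if their key ranges overlap;
 * disjoint runs can simply be concatenated in key order.
 */
static bool
runs_need_merge(const RunRange *a, const RunRange *b)
{
	return a->max >= b->min && b->max >= a->min;
}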

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Logical tape pause/resume

From
Heikki Linnakangas
Date:
On 10/04/2016 02:09 PM, Simon Riggs wrote:
> On 4 October 2016 at 11:47, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> When we finish writing an initial run to a tape, we keep the tape buffers
>> around. That's the reason we need to reserve that memory. But there's no
>> fundamental reason we have to do that. As I suggested in [2], we could flush
>> the buffers to disk, and only keep in memory the location of the last block
>> we wrote. If we need to write another run to the tape, we can reload the
>> last incomplete block from the disk, and continue writing.
>>
>> Reloading the last block requires an extra I/O. That's OK. It's quite rare
>> to have a multi-pass sort these days, so it's a good bet that you don't need
>> to do it. And if you have a very small work_mem, so that you need to do a
>> multi-pass sort, having some more memory available for building the initial
>> runs is probably worth it, and there's also a good chance that the block is
>> still in the OS cache.
>
> Sounds like a good idea.
>
> Why not just make each new run start at a block boundary?
> That way we waste on average BLCKSZ/2 disk space per run, which is
> negligible but we avoid any need to have code to read back in the last
> block.

Hmm. You'd still have to read back the last block, so that you can 
update its next-pointer.

What you could do, though, is to always store only one run on each tape. 
If we can make the in-memory representation of a tape small enough, that 
might be acceptable. You only need 4 bytes to store the starting block 
number. Or we could dump the starting block numbers of the tapes to yet 
another tape. The number of tapes you merge in one pass would still be 
limited by the amount of memory you have, but the number of tapes would 
be unlimited.
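
Purely to illustrate, the in-memory state of an idle tape could then
shrink to something like this (hypothetical, not in the patches):

/*
 * With one run per tape and the linked-list block format of patch 1, a
 * tape that is not currently being read or written only needs the first
 * block of its chain: 4 bytes per tape.  (BlockNumber is PostgreSQL's
 * uint32 block number type.)
 */
typedef struct MinimalTape
{
	BlockNumber	firstBlock;		/* head of the run's block chain */
} MinimalTape;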

I don't want to do that right now; it gets more invasive. Something to 
keep in mind for the future, though. This is also related to your other 
suggestion below:

> That also makes it easier to put each run in its own file, which will
> surely help with parallelism and compression since we can spread
> multiple run files across multiple temp tablespaces.
>
> Can we also please read in the value from the last tuple in a run, so
> we have min and max values in memory for each run? That can be used
> during the merge step to avoid merging runs unless the value ranges
> overlap.

This gets more advanced... Yeah, stuff like that would make sense. You 
could also split the tapes into smaller parts, and store the min/max for 
each part, to make the merging even more efficient and easier to 
parallelize. Or to take that to the extreme, you could make each tape a 
little B-tree.

But that's for another day and another patch :-). What I have now is a 
drop-in replacement for the current logical tape implementation. These 
more advanced schemes would build on top of that.

- Heikki



Re: Logical tape pause/resume

From
Simon Riggs
Date:
On 4 October 2016 at 12:47, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

>> Why not just make each new run start at a block boundary?
>> That way we waste on average BLCKSZ/2 disk space per run, which is
>> negligible but we avoid any need to have code to read back in the last
>> block.
>
>
> Hmm. You'd still have to read back the last block, so that you can update
> its next-pointer.

If each run is in its own file, then you can skip that bit.

And we do want the sort to disk to use multiple files so we can
parallelize I/O as well as CPU.

So since we know we'll want multiple files, we should be thinking
about how to split things up between files.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Logical tape pause/resume

From
Heikki Linnakangas
Date:
On 10/04/2016 05:58 PM, Simon Riggs wrote:
> On 4 October 2016 at 12:47, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>>> Why not just make each new run start at a block boundary?
>>> That way we waste on average BLCKSZ/2 disk space per run, which is
>>> negligible but we avoid any need to have code to read back in the last
>>> block.
>>
>> Hmm. You'd still have to read back the last block, so that you can update
>> its next-pointer.
>
> If each run is in its own file, then you can skip that bit.

Then you need to remember the names of the files, in memory. That's 
closer to my idea of having only one run on each tape, but you're taking 
it further by also saying that each tape is a separate file.

That might be a good idea for building the initial runs. Each file would 
then be written and read sequentially, so we could perhaps get rid of 
the whole pre-reading code, and rely completely on OS read-ahead.

However, we can't really do that for multi-pass merges, because you want 
to reuse the space of the input tapes for the output tape, as you go.

> And we do want the sort to disk to use multiple files so we can
> parallelize I/O as well as CPU.

Huh? Why can't you parallelize I/O on a single file?

- Heikki




Re: Logical tape pause/resume

From
Claudio Freire
Date:
On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> 1. The first patch changes the way we store the logical tapes on disk.
> Instead of using indirect blocks, HFS style, the blocks form a linked list.
> Each block has a trailer, with the block numbers of the previous and next
> block of the tape. That eliminates the indirect blocks, which simplifies the
> code quite a bit, and makes it simpler to implement the pause/resume
> functionality in the second patch. It also immediately reduces the memory
> needed for the buffers, from 3 to 1 block per tape, as we don't need to hold
> the indirect blocks in memory.

Doesn't that make prefetching more than a buffer ahead harder?



Re: Logical tape pause/resume

From
Heikki Linnakangas
Date:
On 10/04/2016 07:06 PM, Claudio Freire wrote:
> On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> 1. The first patch changes the way we store the logical tapes on disk.
>> Instead of using indirect blocks, HFS style, the blocks form a linked list.
>> Each block has a trailer, with the block numbers of the previous and next
>> block of the tape. That eliminates the indirect blocks, which simplifies the
>> code quite a bit, and makes it simpler to implement the pause/resume
>> functionality in the second patch. It also immediately reduces the memory
>> needed for the buffers, from 3 to 1 block per tape, as we don't need to hold
>> the indirect blocks in memory.
>
> Doesn't that make prefetching more than a buffer ahead harder?

Yes, indeed. That's a potential downside, if we wanted to do that. We 
rely wholly on OS readahead for that currently, so it doesn't matter. I 
don't think we want to start issuing explicit pre-fetching calls here 
any time soon. But if we did, we could presume that the pages are in 
sequential order for prefetching purposes, and that would work pretty 
well in practice (that's what the OS readahead is doing for us now). Or 
we could try to enforce that by allocating blocks in larger chunks.

In short, I'm OK with that.

- Heikki




Re: Logical tape pause/resume

From
Peter Geoghegan
Date:
Apologies for the delayed response to this.

On Tue, Oct 4, 2016 at 3:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> One of the patches in Peter Geoghegan's Parallel tuplesort patch set [1] is
> to put a cap on the number of tapes to use for the sort.

> That amounts to about 8% of the available memory. That's quite wasteful.
> Peter's approach of putting an arbitrary cap on the max number of tapes
> works, but I find it a bit hackish. And you still waste that 8% with smaller
> work_mem settings.

I don't think it's hackish -- there was never a theoretical
justification for using as many tapes as possible from Knuth, or
anyone else. I think that Simon's 2006 work on allowing for a number
of tapes much greater than Knuth's "sweet spot" of 7 was useful only
because it sometimes enabled final on-the-fly merging where we are not
merging a huge number of runs (i.e. when merging only somewhat more
than 7 runs). Your recent work on tape preloading has probably greatly
diminished the value of not doing all merging on-the-fly in a single,
final merge, though. Knuth's sweet spot of 7 had little or nothing to
do with the economics of buying many tape drives.

You shouldn't really waste 8% of the budget with low work_mem settings
with my cap patch applied, because the new cap never limits the number
of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you
have about 1GB of work_mem. So, if there is still any waste at the low
end, that can only be solved by tweaking the main calculation within
tuplesort_merge_order(). (Also, to be clear to others following along:
that memory is never actually allocated, so it's only "wasted" from
the perspective of one sort operation alone).

The cost of multiple passes is paid in sequential I/O of tape temp
files, which is now clearly a relatively low cost. OTOH, the cost of a
larger merge heap is paid in more CPU cache misses, which is a cost we
can feel quite severely. While it's really hard to come up with a
generic break-even point, I do think that there is value in capping
the number of tapes somewhere in the hundreds. It's not like a cap on
the number of tapes along the lines I've proposed was not thought
about from day one, by both Tom and Simon. Noah's relatively recent
MaxAllocSize work has given the issue new importance, though (the same
might have been said during the 9.6 replacement selection vs.
quicksort discussions, actually).

> When we finish writing an initial run to a tape, we keep the tape buffers
> around. That's the reason we need to reserve that memory. But there's no
> fundamental reason we have to do that. As I suggested in [2], we could flush
> the buffers to disk, and only keep in memory the location of the last block
> we wrote. If we need to write another run to the tape, we can reload the
> last incomplete block from the disk, and continue writing.

Okay. But, you haven't actually addressed the problem with non-trivial
amounts of memory being logically allocated ahead of time for no good
reason -- you've only addressed the constant factor (the overhead
per-tape). Can't we also fix the general problem, by applying a cap?
Better to have a cap that is approximately right (in the hundreds or
so) than one that is exactly wrong (infinity -- no cap).

> Reloading the last block requires an extra I/O. That's OK. It's quite rare
> to have a multi-pass sort these days, so it's a good bet that you don't need
> to do it. And if you have a very small work_mem, so that you need to do a
> multi-pass sort, having some more memory available for building the initial
> runs is probably worth it, and there's also a good chance that the block is
> still in the OS cache.

That analysis does seem sound to me.

> In addition to saving a little bit of memory, I'd like to do this
> refactoring because it simplifies the code. It's code that has stayed
> largely unchanged for the past 15 years, so I'm not too eager to touch it,
> but looking at the changes coming with Peter's parallel tuplesort patch set,
> I think this makes sense.

I can definitely see value in refactoring, to make that code less
complicated -- I just don't think it's justified by the amount of
memory that is wasted on tapes.

That said, I'm not especially worried about the complexity within the
logtape.c changes of the parallel CREATE INDEX patch alone. I'm much
more interested in having a logtape.c that could be more easily made
to support binary searching, etc, to find *partition* boundaries,
which my CREATE INDEX patch doesn't use or care about. This is needed
for tuplesort.c partition-based sorting. When parallel sort isn't just
used by CREATE INDEX, partitioning becomes important. And, with
partitioning, dynamic sampling is essential (I think that this will
end up living in logtape.c).

To recap on what I went into in the partitioning-to-parallel-tuplesort
thread [1], I think that partitioning will come in a future release,
and will be of more value to parallel queries, where much more can be
pushed down within the executor; we really want to remove
co-dependencies across workers early. I'm much less excited about how
much that will benefit parallel CREATE INDEX, though, where
partition-to-sort would need to deal with the complexity of *actually
writing* the finished index in parallel (at least, to see any
benefit), fixing up internal pages, etc. And, any benefit would be
limited, because writing itself is I/O bound -- there is never too
much to push down, unlike with real parallel query. (I feel fairly
confident about this because my CREATE INDEX patch seems to make
Postgres parallel CREATE INDEX have comparable scalability to the
implementations in other major systems.)

[1] https://www.postgresql.org/message-id/CAM3SWZR%2BATYAzyMT%2Bhm-Bo%3D1L1smtJbNDtibwBTKtYqS0dYZVg%40mail.gmail.com
-- 
Peter Geoghegan



Re: Logical tape pause/resume

From
Heikki Linnakangas
Date:
On 10/09/2016 03:27 AM, Peter Geoghegan wrote:
> You shouldn't really waste 8% of the budget with low work_mem settings
> with my cap patch applied, because the new cap never limits the number
> of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you
> have about 1GB of work_mem. So, if there is still any waste at the low
> end, that can only be solved by tweaking the main calculation within
> tuplesort_merge_order(). (Also, to be clear to others following along:
> that memory is never actually allocated, so it's only "wasted" from
> the perspective of one sort operation alone).

Regardless of the number of tapes, the memory used for the tape buffers, 
while building the initial runs, is wasted. It's not entirely wasted 
when you have multiple merge passes, because without the buffer you need 
to read the last partial page back into memory, when you start to output 
the next run on it. But even in that case, I believe that memory would 
be better used for the quicksort, to create larger runs.

> The cost of multiple passes is paid in sequential I/O of tape temp
> files, which is now clearly a relatively low cost. OTOH, the cost of a
> larger merge heap is paid in more CPU cache misses, which is a cost we
> can feel quite severely. While it's really hard to come up with a
> generic break-even point, I do think that there is value in capping
> the number of tapes somewhere in the hundreds. It's not like a cap on
> the number of tapes along the lines I've proposed was not thought
> about from day one, by both Tom and Simon. Noah's relatively recent
> MaxAllocSize work has given the issue new importance, though (the same
> might have been said during the 9.6 replacement selection vs.
> quicksort discussions, actually).

Yeah, there might be value in having a cap, because operating the merge 
heap becomes more expensive when it becomes larger. Mainly because of 
cache misses. We should perhaps have a limit on the size of the merge 
heap, and use the same limit for the replacement selection. Although, 
the heap behaves slightly differently during replacement selection: 
During replacement selection, new values added to the heap tend to go to 
the bottom of the heap, but during merge, new values tend to go close to 
the top. The latter pattern incurs fewer cache misses.
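
To illustrate with a generic min-heap sift-down (not PostgreSQL's actual
heap code): the cost of inserting or replacing an element is the number of
levels the value travels, and in a large heap each level touched is likely
a separate cache miss.

static void
sift_down(int *heap, int nelems, int i)
{
	int			val = heap[i];

	while (2 * i + 1 < nelems)
	{
		int			child = 2 * i + 1;

		if (child + 1 < nelems && heap[child + 1] < heap[child])
			child++;			/* pick the smaller child */
		if (val <= heap[child])
			break;				/* stops near the top: few cache misses */
		heap[i] = heap[child];	/* descends one level: ~one cache miss */
		i = child;
	}
	heap[i] = val;
}

During the merge, the replacement (the next tuple from the same tape) is
usually close in sort order to the tuple just removed, so the loop stops
after a level or two; during replacement selection, the new tuple is
usually larger than most of the heap and sinks all the way down.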

But that's orthogonal to the wasting-memory aspect of this. Even if we 
have a cap of 100, it's good to not spend memory for the tape buffers of 
those 100 tapes, when building the initial runs.

>> When we finish writing an initial run to a tape, we keep the tape buffers
>> around. That's the reason we need to reserve that memory. But there's no
>> fundamental reason we have to do that. As I suggested in [2], we could flush
>> the buffers to disk, and only keep in memory the location of the last block
>> we wrote. If we need to write another run to the tape, we can reload the
>> last incomplete block from the disk, and continue writing.
>
> Okay. But, you haven't actually addressed the problem with non-trivial
> amounts of memory being logically allocated ahead of time for no good
> reason -- you've only addressed the constant factor (the overhead
> per-tape).

I thought I did. Can you elaborate?

Are you referring to the fact that the array of LogicalTapes is still 
allocated ahead of time, with size maxTapes? True, it'd be nice to 
allocate the LogicalTape structs only when needed. But that's peanuts, 
compared to the tape buffers.

>> In addition to saving a little bit of memory, I'd like to do this
>> refactoring because it simplifies the code. It's code that has stayed
>> largely unchanged for the past 15 years, so I'm not too eager to touch it,
>> but looking at the changes coming with Peter's parallel tuplesort patch set,
>> I think this makes sense.
>
> I can definitely see value in refactoring, to make that code less
> complicated -- I just don't think it's justified by the amount of
> memory that is wasted on tapes.

Ok, good. I think we're in agreement on doing this, then, even if we 
don't agree on the justification :-).

> That said, I'm not especially worried about the complexity within the
> logtape.c changes of the parallel CREATE INDEX patch alone. I'm much
> more interested in having a logtape.c that could be more easily made
> to support binary searching, etc, to find *partition* boundaries,
> which my CREATE INDEX patch doesn't use or care about. This is needed
> for tuplesort.c partition-based sorting. When parallel sort isn't just
> used by CREATE INDEX, partitioning becomes important. And, with
> partitioning, dynamic sampling is essential (I think that this will
> end up living in logtape.c).

Yeah. I'm not sure how partitioning and all that would be done here. But 
I'm pretty sure this simpler logtape.c code is easier to work with, for 
partitioning too.

We might want to have each partition on a separate tape, for example, 
and therefore have a lot more tapes than currently. Not wasting memory 
on the tape buffers becomes important in that case.

- Heikki




Re: Logical tape pause/resume

From
Heikki Linnakangas
Date:
Here are freshly rebased versions of these patches. No big changes, but
edited some comments slightly.

- Heikki


Attachment

Re: Logical tape pause/resume

From
Peter Geoghegan
Date:
On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Regardless of the number of tapes, the memory used for the tape buffers,
> while building the initial runs, is wasted. It's not entirely wasted when
> you have multiple merge passes, because without the buffer you need to read
> the last partial page back into memory, when you start to output the next
> run on it. But even in that case, I believe that memory would be better used
> for the quicksort, to create larger runs.

Okay. Somehow, I was confused on that point.

> Yeah, there might be value in having a cap, because operating the merge heap
> becomes more expensive when it becomes larger. Mainly because of cache
> misses. We should perhaps have a limit on the size of the merge heap, and
> use the same limit for the replacement selection. Although, the heap behaves
> slightly differently during replacement selection: During replacement
> selection, new values added to the heap tend to go to the bottom of the
> heap, but during merge, new values tend to go close to the top. The latter
> pattern incurs fewer cache misses.

I don't think it makes sense to bring replacement selection into it.
I'd vote for completely killing replacement selection at this point.
The merge heap work committed a few weeks back really weakened the
case for it, a case that the 9.6 work left hanging by a thread to
begin with. In general, having a replacement selection heap that is
larger can only be helpful when that means that we have enough
"juggling capacity" to reorder things into one (or relatively few)
long runs. For RS, the optimal size of the heap is good enough to do
any juggling that is possible to make runs as long as possible, but no
larger. The default replacement_sort_tuples setting (150,000) seems
very high to me, considered from that point of view, actually.

Multiple passes just don't seem to be that bad on modern hardware, in
the cases where they still happen. You don't see a big drop when
tuning down work_mem to just below the threshold at which the sort can
complete in one pass. And, multi-pass sorts still won't happen very
often in practice. Now, this work of yours risks slowing down multiple
pass sorts when that does happen, but I think it could still be worth
it.

> But that's orthogonal to the wasting-memory aspect of this. Even if we
> have a cap of 100, it's good to not spend memory for the tape buffers of
> those 100 tapes, when building the initial runs.

You're right about that.

>> Okay. But, you haven't actually addressed the problem with non-trivial
>> amounts of memory being logically allocated ahead of time for no good
>> reason -- you've only addressed the constant factor (the overhead
>> per-tape).
>
>
> I thought I did. Can you elaborate?
>
> Are you referring to the fact that the array of LogicalTapes is still
> allocated ahead of time, with size maxTapes? True, it'd be nice to allocate
> the LogicalTape structs only when needed. But that's peanuts, compared to
> the tape buffers.

Is that all that is left to remove, in terms of wasteful USEMEM()
overhead? Then, yeah, I guess I am just talking about "peanuts", plus
the more significant issue of merge heap CPU cache efficiency.

>> I can definitely see value in refactoring, to make that code less
>> complicated -- I just don't think it's justified by the amount of
>> memory that is wasted on tapes.
>
>
> Ok, good. I think we're in agreement on doing this, then, even if we don't
> agree on the justification :-).

Agreed. :-)

> Yeah. I'm not sure how partitioning and all that would be done here. But I'm
> pretty sure this simpler logtape.c code is easier to work with, for
> partitioning too.

That's my intuition, too. Obviously my thoughts on partitioning are
still very hand-wavy, but I'm pretty sure that partitioning is the
future for the parallelization of sorting in the executor. Pushing
*everything* down is more important than actually having the sort be
as fast as possible there.

-- 
Peter Geoghegan



Re: Logical tape pause/resume

From
Heikki Linnakangas
Date:
On 10/10/2016 10:31 PM, Peter Geoghegan wrote:
> On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Ok, good. I think we're in agreement on doing this, then, even if we don't
>> agree on the justification :-).
>
> Agreed. :-)

Attached are new patch versions. Rebased them over current head, fixed a
number of bugs in the seek and backspace code, and did a bunch of
stylistic cleanups and comment fixes.

I changed the API of LogicalTapeBackspace() slightly. If you try to back
up beyond the beginning of tape, it used to return FALSE and do nothing.
Now it backspaces as much as it can, to the very beginning of the tape,
and returns the amount backspaced. That was more convenient than the old
behaviour with this new implementation.
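
For callers, the difference looks roughly like this (the signature is
inferred from the description above, not copied from the patch):

/*
 * Previously LogicalTapeBackspace() returned false and did nothing if
 * asked to back up past the beginning of the tape.  Now it backs up as
 * far as it can and reports how far that was.
 */
static void
back_up(LogicalTapeSet *lts, int tapenum, size_t len)
{
	size_t		nmoved = LogicalTapeBackspace(lts, tapenum, len);

	if (nmoved < len)
	{
		/* hit the beginning of the tape; only nmoved bytes were undone */
	}
}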

- Heikki


Attachment

Re: [HACKERS] Logical tape pause/resume

From
Heikki Linnakangas
Date:
On 10/12/2016 05:30 PM, Heikki Linnakangas wrote:
> On 10/10/2016 10:31 PM, Peter Geoghegan wrote:
>> On Sun, Oct 9, 2016 at 11:52 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Ok, good. I think we're in agreement on doing this, then, even if we don't
>>> agree on the justification :-).
>>
>> Agreed. :-)
>
> Attached are new patch versions. Rebased them over current head, fixed a
> number of bugs in the seek and backspace code, and did a bunch of
> stylistic cleanups and comment fixes.

A rebased set of patches attached.

In the meantime, Robert committed the cap on the number of tapes. Since 
that's in, I'm not sure if the pause/resume part of this is worthwhile. 
Maybe it is.

For now, barring objections, I'm going to commit the first patch. It 
seems like a worthwhile simplification in any case, especially for 
Peter's Parallel tuplesort patch set.

- Heikki



Attachment

Re: [HACKERS] Logical tape pause/resume

From
Robert Haas
Date:
On Wed, Dec 21, 2016 at 7:25 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> For now, barring objections, I'm going to commit the first patch. It seems
> like a worthwhile simplification in any case, especially for Peter's
> Parallel tuplesort patch set.

Well, it removes more code than it adds.  That's definitely something.
And saving memory per-empty-tape is good, too.  A few random comments:

"seeked" is kind of a lame variable name.  How about "seekpos" or
"newpos" or something like that?
     /*
      * Even in minimum memory, use at least a MINORDER merge.  On the other
      * hand, even when we have lots of memory, do not use more than a MAXORDER
-     * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
-     * additional tape reduces the amount of memory available to build runs,
-     * which in turn can cause the same sort to need more runs, which makes
-     * merging slower even if it can still be done in a single pass.  Also,
-     * high order merges are quite slow due to CPU cache effects; it can be
-     * faster to pay the I/O cost of a polyphase merge than to perform a single
-     * merge pass across many hundreds of tapes.
+     * merge.  Tapes are pretty cheap, but they're not entirely free.  High order
+     * merges are quite slow due to CPU cache effects; it can be faster to pay
+     * the I/O cost of a polyphase merge than to perform a single merge pass
+     * across many hundreds of tapes.
      */

I think you could leave this comment as-is.  You haven't zeroed out
the overhead of a tape, and I like those additional bits I crammed in
there.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Logical tape pause/resume

From
Heikki Linnakangas
Date:
On 12/21/2016 03:22 PM, Robert Haas wrote:
> On Wed, Dec 21, 2016 at 7:25 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> For now, barring objections, I'm going to commit the first patch. It seems
>> like a worthwhile simplification in any case, especially for Peter's
>> Parallel tuplesort patch set.
>
> Well, it removes more code than it adds.  That's definitely something.
> And saving memory per-empty-tape is good, too.  A few random comments:
>
> "seeked" is kind of a lame variable name.  How about "seekpos" or
> "newpos" or something like that?

Ok.

>      /*
>       * Even in minimum memory, use at least a MINORDER merge.  On the other
>       * hand, even when we have lots of memory, do not use more than a MAXORDER
> -     * merge.  Tapes are pretty cheap, but they're not entirely free.  Each
> -     * additional tape reduces the amount of memory available to build runs,
> -     * which in turn can cause the same sort to need more runs, which makes
> -     * merging slower even if it can still be done in a single pass.  Also,
> -     * high order merges are quite slow due to CPU cache effects; it can be
> -     * faster to pay the I/O cost of a polyphase merge than to perform a single
> -     * merge pass across many hundreds of tapes.
> +     * merge.  Tapes are pretty cheap, but they're not entirely free.  High order
> +     * merges are quite slow due to CPU cache effects; it can be faster to pay
> +     * the I/O cost of a polyphase merge than to perform a single merge pass
> +     * across many hundreds of tapes.
>       */
>
> I think you could leave this comment as-is.  You haven't zeroed out
> the overhead of a tape, and I like those additional bits I crammed in
> there.  :-)

Yes, quite right. That was a mishap in rebasing; that change to remove 
the comment really belongs to the pause/resume patch rather than the 
first one.

Thanks for the review!

- Heikki




Re: [HACKERS] Logical tape pause/resume

From
Peter Geoghegan
Date:
On Wed, Dec 21, 2016 at 4:25 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> In the meanwhile, Robert committed the cap on the number of tapes. Since
> that's in, I'm not sure if the pause/resume part of this is worthwhile.
> Maybe it is.

I rebased my parallel tuplesort patch on top of what you committed a
few days back (your 0001-* patch). It definitely made my changes to
logtape.c significantly simpler, which was a big improvement.

I would be inclined to not go forward with 0002-* though, because I
think it's cleaner for the parallel tuplesort patch to have workers
rely on the tape freezing code to flush the last block out to make
state available in temp files for the leader to process/merge. The
memory savings that remain on the table are probably not measurable if
we were to fix them, given the work we've already done, palloc()
fragmentation, and so on.

-- 
Peter Geoghegan