Re: Logical tape pause/resume - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: Logical tape pause/resume
Date:
Msg-id: b549252d-be9c-eccd-1a56-f44ce2f45a10@iki.fi
In response to: Re: Logical tape pause/resume (Peter Geoghegan <pg@heroku.com>)
Responses: Re: Logical tape pause/resume (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
On 10/09/2016 03:27 AM, Peter Geoghegan wrote:
> You shouldn't really waste 8% of the budget with low work_mem settings
> with my cap patch applied, because the new cap never limits the number
> of tapes. IIRC, the cap of 500 tapes doesn't start to matter until you
> have about 1GB of work_mem. So, if there is still any waste at the low
> end, that can only be solved by tweaking the main calculation within
> tuplesort_merge_order(). (Also, to be clear to others following along:
> that memory is never actually allocated, so it's only "wasted" from
> the perspective of one sort operation alone).

Regardless of the number of tapes, the memory used for the tape buffers 
while building the initial runs is wasted. It's not entirely wasted when 
you have multiple merge passes, because without the buffer you would need 
to read the last partial page back into memory when you start to output 
the next run on the tape. But even in that case, I believe that memory 
would be better used for the quicksort, to create larger runs.
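
To make the scale of that concrete, here's a back-of-the-envelope sketch 
in C. The per-tape buffer size and the rule for deriving the number of 
tapes from the budget are illustrative stand-ins, not the actual 
constants and formula from tuplesort_merge_order():

#include <stdio.h>

/*
 * Illustrative only: rough fraction of the sort memory budget that
 * stays reserved for tape buffers while the initial runs are built.
 * PER_TAPE_BUFFER and the tapes-per-budget rule are made up for the
 * example; they are not the exact values used by tuplesort.c.
 */
#define PER_TAPE_BUFFER 8192		/* one block-sized buffer per tape */

int
main(void)
{
	long		work_mem_kb[] = {4 * 1024, 64 * 1024, 1024 * 1024};

	for (int i = 0; i < 3; i++)
	{
		long		budget = work_mem_kb[i] * 1024L;
		int			ntapes = (int) (budget / (256 * 1024));	/* made-up rule */
		long		reserved;

		if (ntapes > 500)
			ntapes = 500;		/* the proposed cap */
		reserved = (long) ntapes * PER_TAPE_BUFFER;

		printf("work_mem=%ldkB tapes=%d reserved=%ldkB (%.1f%%)\n",
			   work_mem_kb[i], ntapes, reserved / 1024,
			   100.0 * reserved / budget);
	}
	return 0;
}

Whatever that reserved fraction comes out to, it sits idle for the whole 
run-building phase.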

> The cost of multiple passes is paid in sequential I/O of tape temp
> files, which is now clearly a relatively low cost. OTOH, the cost of a
> larger merge heap is paid in more CPU cache misses, which is a cost we
> can feel quite severely. While it's really hard to come up with a
> generic break-even point, I do think that there is value in capping
> the number of tapes somewhere in the hundreds. It's not like a cap on
> the number of tapes along the lines I've proposed was not thought
> about from day one, by both Tom and Simon. Noah's relatively recent
> MaxAllocSize work has given the issue new importance, though (the same
> might have been said during the 9.6 replacement selection vs.
> quicksort discussions, actually).

Yeah, there might be value in having a cap, because operating the merge 
heap becomes more expensive as it grows larger, mainly because of cache 
misses. We should perhaps have a limit on the size of the merge heap, 
and use the same limit for replacement selection. Although the heap 
behaves slightly differently in the two cases: during replacement 
selection, new values added to the heap tend to sink to the bottom of 
the heap, but during the merge, new values tend to stay close to the 
top. The latter pattern incurs fewer cache misses.
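
To illustrate, here's a minimal sift-down loop for a min-heap; this is a 
sketch, not the actual tuplesort.c code. During the merge, the value 
placed at the root is the next tuple from the run whose tuple was just 
consumed, and it's usually close in sort order to the old root, so the 
loop exits after a step or two and touches only the top few cache lines. 
During replacement selection, the incoming tuple is often larger than 
most of the heap and sinks all the way down, touching ~log2(n) cache 
lines:

/*
 * Place 'val' at the root of a min-heap of 'nmemb' elements and let
 * it sink to its correct position.  Each level visited is another
 * potential cache miss.
 */
static void
sift_down(int *heap, int nmemb, int val)
{
	int			i = 0;

	for (;;)
	{
		int			child = 2 * i + 1;

		if (child >= nmemb)
			break;
		if (child + 1 < nmemb && heap[child + 1] < heap[child])
			child++;			/* pick the smaller child */
		if (val <= heap[child])
			break;				/* heap property restored, stop early */
		heap[i] = heap[child];	/* pull child up, descend one level */
		i = child;
	}
	heap[i] = val;
}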

But that's orthogonal to the wasting-memory aspect of this. Even if we 
have a cap of 100, it's good not to spend memory on the tape buffers of 
those 100 tapes while building the initial runs.

>> When we finish writing an initial run to a tape, we keep the tape buffers
>> around. That's the reason we need to reserve that memory. But there's no
>> fundamental reason we have to do that. As I suggested in [2], we could flush
>> the buffers to disk, and only keep in memory the location of the last block
>> we wrote. If we need to write another run to the tape, we can reload the
>> last incomplete block from the disk, and continue writing.
>
> Okay. But, you haven't actually addressed the problem with non-trivial
> amounts of memory being logically allocated ahead of time for no good
> reason -- you've only addressed the constant factor (the overhead
> per-tape).

I thought I did. Can you elaborate?

Are you referring to the fact that the array of LogicalTapes is still 
allocated ahead of time, with size maxTapes? True, it'd be nice to 
allocate the LogicalTape structs only when needed. But that's peanuts, 
compared to the tape buffers.
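
For reference, the pause/resume idea quoted above could look roughly 
like this. The names and the struct layout are hypothetical, error 
handling is omitted, and the real logtape.c interface differs; this is 
just to show how little state a paused tape needs:

#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

/*
 * Hypothetical tape state: while paused, only the location of the
 * tape's last (incomplete) block is kept in memory, not the block.
 */
typedef struct
{
	int			fd;			/* temp file backing the tape */
	long		lastBlock;	/* block number of the partial last block */
	int			nbytes;		/* valid bytes in that block */
	char	   *buffer;		/* BLCKSZ buffer, NULL while paused */
} SketchTape;

static void
tape_pause(SketchTape *lt)
{
	/* flush the partial block; its tail past nbytes is just garbage */
	pwrite(lt->fd, lt->buffer, BLCKSZ, (off_t) lt->lastBlock * BLCKSZ);
	free(lt->buffer);		/* memory goes back to the sort's budget */
	lt->buffer = NULL;
}

static void
tape_resume(SketchTape *lt)
{
	/* reload the incomplete block and continue writing where we left off */
	lt->buffer = malloc(BLCKSZ);
	pread(lt->fd, lt->buffer, BLCKSZ, (off_t) lt->lastBlock * BLCKSZ);
}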

>> In addition to saving a little bit of memory, I'd like to do this
>> refactoring because it simplifies the code. It's code that has stayed
>> largely unchanged for the past 15 years, so I'm not too eager to touch it,
>> but looking at the changes coming with Peter's parallel tuplesort patch set,
>> I think this makes sense.
>
> I can definitely see value in refactoring, to make that code less
> complicated -- I just don't think it's justified by the amount of
> memory that is wasted on tapes.

Ok, good. I think we're in agreement on doing this, then, even if we 
don't agree on the justification :-).

> That said, I'm not especially worried about the complexity within the
> logtape.c changes of the parallel CREATE INDEX patch alone. I'm much
> more interested in having a logtape.c that could be more easily made
> to support binary searching, etc, to find *partition* boundaries,
> which my CREATE INDEX patch doesn't use or care about. This is needed
> for tuplesort.c partition-based sorting. When parallel sort isn't just
> used by CREATE INDEX, partitioning becomes important. And, with
> partitioning, dynamic sampling is essential (I think that this will
> end up living in logtape.c).

Yeah. I'm not sure how partitioning and all that would be done here. But 
I'm pretty sure this simpler logtape.c code is easier to work with, for 
partitioning too.

We might want to have each partition on a separate tape, for example, 
and therefore have a lot more tapes than currently. Not wasting memory 
on the tape buffers becomes important in that case.

- Heikki



