Re: Tuplesort merge pre-reading - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Tuplesort merge pre-reading
Msg-id: CAM3SWZQrU12mog=TG9jvNOpeUL5Krk9N1Y8GmsLGTGRBJui-aQ@mail.gmail.com
In response to: Re: Tuplesort merge pre-reading (Heikki Linnakangas <hlinnaka@iki.fi>)
List: pgsql-hackers
On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here's a new version of these patches, rebased over current master. I
> squashed the two patches into one, there's not much point to keep them
> separate.

I think I have my head fully around this now. For some reason, I initially thought that this patch was a great deal more radical than it actually is. (Like Greg, I somehow initially thought that you were rejecting the idea of batch memory in general, and somehow (over) leveraging the filesystem cache. I think I misunderstood your remarks when we talked on IM about it early on.)

I don't know what the difference is between accessing 10 pages randomly, and accessing a random set of 10 single pages sequentially, in close succession. As Tom would say, that's above my pay grade. I suppose it comes down to how close "close" actually is (but in any case, it's all very fudged). I mention this because I think that cost_sort() should be updated to consider sequential I/O the norm, alongside this patch of yours (your patch strengthens the argument [1] for that general idea).

The reason that this new approach produces more sequential I/O, apart from the simple fact that you effectively have much more memory available and so fewer rounds of preloading are needed, is that the indirection block stuff can make I/O less sequential in order to support eager reclamation of space. For example, maybe there is interleaving of blocks as logtape.c manages to reclaim blocks in the event of multiple merge steps. I care about that second factor a lot more now than I would have a year ago, when a final on-the-fly merge generally avoided multiple passes (and the associated logtape.c block fragmentation), because parallel CREATE INDEX is usually affected by that (workers will often want to do their own merge ahead of the leader's final merge), and because I want to cap the number of tapes used, which will make multiple passes a bit more common in practice.

I was always suspicious of the fact that memtuples is so large during merging, and had a vague plan to fix that. (Although I was the one responsible for growing memtuples even more for the merge in 9.6, that was just because, under the status quo of having many memtuples during the merge, the size of memtuples should at least be in balance with the remaining memory for caller tuples -- it wasn't an endorsement of the status quo.) However, it never occurred to me to do that by pushing batch memory into the head of logtape.c, which now seems like an excellent idea.

To summarize my understanding of this patch: if it wasn't for my work on parallel CREATE INDEX, I would consider this patch to give us only a moderate improvement to user-visible performance, since it doesn't really help memory-rich cases very much (cases that are not likely to have much random I/O anyway). In that universe, I'd be more appreciative of the patch as a simplifying exercise, since while it's nice to get a boost for more memory-constrained cases, it's not a huge win. However, that's not the universe we live in -- I *am* working on parallel CREATE INDEX, of course. And so, I really am very excited about this patch, because it really helps with the particular challenges I have there, even though it's fair to assume that we have a reasonable amount of memory available when parallelism is used.
If workers are going to do their own merges, as they often will, then multiple merge pass cases become far more important, and any additional I/O is a real concern, *especially* additional random I/O (say from logtape.c fragmentation). The patch directly addresses that, which is great. Your patch and my just-committed patch concerning how we maintain the heap invariant together attack the merge bottleneck from two different directions: they address the I/O costs and the CPU costs, respectively.

Other things I noticed:

* You should probably point out that typically, access to batch memory will still be sequential, despite your block-based scheme. The preloading will now more or less make that the normal case. Any fragmentation will now be essentially in memory, not on disk, which is a big win.

* I think that logtape.c header comments are needed for this. Maybe that's where you should point out that memory access is largely sequential. But it's surely true that logtape.c needs to take responsibility for being the place where the vast majority of memory is allocated during merging.

* I think you should move "bool *mergeactive; /* active input run source? */" within Tuplesortstate to be next to the other batch memory stuff. No point in having separate merge and batch "sections" there anymore.

* You didn't carry over my 0002-* batch memory patch modifications to comments, even though you should have in a few cases. There remain some references in comments to batch memory as a thing exclusively usable by final on-the-fly merges. That's not true anymore -- it's usable by final merges, too. For example, readtup_alloc() still references the final on-the-fly merge.

* You also fail to take credit in the commit message for making batch memory usable when returning caller tuples to callers that happen to request "randomAccess" (so, I guess the aforementioned comments above routines like readtup_alloc() shouldn't even refer to merging, unless it's to say that non-final merges are not supported due to their unusual requirements). My patch didn't go that far (I only addressed the final merge itself, not the subsequent access to tuples when reading from that materialized final output tape in the TSS_SORTEDONTAPE case). But that's actually really useful for randomAccess callers, above and beyond what I proposed (which in any case was mostly written with parallel workers in mind, which never do TSS_SORTEDONTAPE processing).

* Furthermore, readtup_alloc() will not just be called in WRITETUP() routines -- please update comments.

* There is a very subtle issue here:

> +     /*
> +      * We no longer need a large memtuples array, only one slot per tape. Shrink
> +      * it, to make the memory available for other use. We only need one slot per
> +      * tape.
> +      */
> +     pfree(state->memtuples);
> +     FREEMEM(state, state->memtupsize * sizeof(SortTuple));
> +     state->memtupsize = state->maxTapes;
> +     state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
> +     USEMEM(state, state->memtupsize * sizeof(SortTuple));

The FREEMEM() call needs to count the chunk overhead in both cases. In short, I think you need to copy the GetMemoryChunkSpace() stuff that you see within grow_memtuples().
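To be concrete, what I mean is something like the following (an untested sketch, following the pattern in grow_memtuples() -- note that GetMemoryChunkSpace() must be called on the old array *before* it is pfree()'d):

    /* Measure what we actually give back, including palloc chunk overhead */
    FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
    pfree(state->memtuples);
    state->memtupsize = state->maxTapes;
    state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
    /* Likewise, account for chunk overhead in the new, small array */
    USEMEM(state, GetMemoryChunkSpace(state->memtuples));

That keeps availMem honest on both sides of the swap, rather than trusting memtupsize * sizeof(SortTuple) to match what palloc actually consumed.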
* Whitespace issue here:

> @@ -2334,7 +2349,8 @@ inittapes(Tuplesortstate *state)
>  #endif
>
>      /*
> -     * Decrease availMem to reflect the space needed for tape buffers; but
> +     * Decrease availMem to reflect the space needed for tape buffers, when
> +     * writing the initial runs; but
>       * don't decrease it to the point that we have no room for tuples. (That
>       * case is only likely to occur if sorting pass-by-value Datums; in all
>       * other scenarios the memtuples[] array is unlikely to occupy more than
> @@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
>      state->tapeset = LogicalTapeSetCreate(maxTapes);

* I think that you need to comment on why state->tuplecontext is not used for batch memory now. It is still useful, for multiple merge passes, but the situation has notably changed for it.

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:

> @@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
>          Assert(lt->frozen);
>          datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
>      }
> +
> +    /* Allocate a read buffer */
> +    if (lt->buffer)
> +        pfree(lt->buffer);
> +    lt->buffer = palloc(lt->read_buffer_size);
> +    lt->buffer_size = lt->read_buffer_size;

* Typo:

> +
> +    /*
> +     * from this point on, we no longer use the usemem()/lackmem() mechanism to
> +     * track memory usage of indivitual tuples.
> +     */
> +    state->batchused = true;

* Please make this use the "..., + 1023" natural rounding trick that is used in the similar traces that were removed:

> +#ifdef TRACE_SORT
> +    if (trace_sort)
> +        elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
> +             (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
> +#endif

* It couldn't hurt to make this code paranoid about LACKMEM() becoming true, which will cause havoc (we saw this recently in 9.6; a patch of mine to fix that just went in):

> +    /*
> +     * Use all the spare memory we have available for read buffers. Divide it
> +     * memory evenly among all the tapes.
> +     */
> +    avail_blocks = state->availMem / BLCKSZ;
> +    per_tape = avail_blocks / maxTapes;
> +    cutoff = avail_blocks % maxTapes;
> +    if (per_tape == 0)
> +    {
> +        per_tape = 1;
> +        cutoff = 0;
> +    }
> +    for (tapenum = 0; tapenum < maxTapes; tapenum++)
> +    {
> +        LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
> +                                        (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
> +    }

In other words, we really don't want availMem to become < 0, since it's int64, but a derived value is passed to LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now, if LACKMEM() did happen it would be a bug anyway, but I recommend that defensive code also be added. Per grow_memtuples(), "We need to be sure that we do not cause LACKMEM to become true, else the space management algorithm will go nuts". Let's be sure that we get that right since, as we saw recently, grow_memtuples() will not actually have the chance to save us here (if there is a bug along these lines, let's at least make the right "can't happen" error complaint to the user when it pops up).
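Something as simple as this, just ahead of the division, would show the shape of what I mean (a sketch only; the exact error message text is illustrative):

    /*
     * Defensive: availMem should never be negative by the time we divide it
     * up for read buffers.  If it somehow is, the Size arithmetic that
     * follows would pass a garbage buffer size down to logtape.c, so make
     * the "can't happen" complaint here instead.
     */
    if (LACKMEM(state))
        elog(ERROR, "unexpected out-of-memory situation in tuplesort");
    avail_blocks = state->availMem / BLCKSZ;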
* It looks like your patch makes us less eager about freeing per-tape batch memory, now held as preload buffer space within logtape.c. ISTM that there should be some way to have the "tape exhausted" code path within tuplesort_gettuple_common() (as well as the similar spot within mergeonerun()) instruct logtape.c that we're done with that tape. In other words, when mergeprereadone() (now renamed to mergereadnext()) detects that the tape is exhausted, it should have logtape.c free its huge tape buffer immediately. Think about cases where input is presorted, and earlier tapes can be freed quite early on. That behavior is worth keeping (you removed the old way that this happened, through mergebatchfreetape()).
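Sketching what I mean (note that the logtape.c entry point named here is invented purely for illustration -- no such function exists yet):

    /* In mergereadnext(), upon hitting end of the source tape: */
    if ((tuplen = getlen(state, srcTape, true)) == 0)
    {
        state->mergeactive[srcTape] = false;
        /*
         * Hypothetical call: tell logtape.c that this tape is exhausted, so
         * that its large read buffer can be pfree()'d right away, rather
         * than lingering until the whole merge finishes.
         */
        LogicalTapeReleaseReadBuffer(state->tapeset, srcTape);
        return false;
    }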
That's all I have right now. I like the direction this is going in, but I think this needs more polishing.

[1] https://www.postgresql.org/message-id/CAM3SWZQLP6e=1si1NcQjYft7R-VYpprrf_i59tZOZX5m7VFK-w@mail.gmail.com

--
Peter Geoghegan