Re: Tuplesort merge pre-reading - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Tuplesort merge pre-reading
Date
Msg-id b5d82d34-80ee-5c3d-97a0-67ada4f3cf3e@iki.fi
Whole thread Raw
In response to Re: Tuplesort merge pre-reading  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Tuplesort merge pre-reading  (Heikki Linnakangas <hlinnaka@iki.fi>)
List pgsql-hackers
On 09/29/2016 01:52 PM, Peter Geoghegan wrote:
> * Variables like maxTapes have a meaning that is directly traceable
> back to Knuth's description of polyphase merge. I don't think that you
> should do anything to them, on general principle.

Ok. I still think that changing maxTapes would make sense, when there
are fewer runs than tapes, but that is actually orthogonal to the rest
of the patch, so let's discuss that separately. I've changed the patch
to not do that.

> * Everything or almost everything that you've added to mergeruns()
> should probably be in its own dedicated function. This function can
> have a comment where you acknowledge that it's not perfectly okay that
> you claim memory per-tape, but it's simpler and faster overall.

Ok.

> * I think you should be looking at the number of active tapes, and not
> state->Level or state->currentRun in this new function. Actually,
> maybe this wouldn't be the exact definition of an active tape that we
> establish at the beginning of beginmerge() (this considers tapes with
> dummy runs to be inactive for that merge), but it would look similar.
> The general concern I have here is that you shouldn't rely on
> state->Level or state->currentRun from a distance for the purposes of
> determining which tapes need some batch memory. This is also where you
> might say something like: "we don't bother with shifting memory around
> tapes for each merge step, even though that would be fairer. That's
> why we don't use the beginmerge() definition of activeTapes --
> instead, we use our own broader definition of the number of active
> tapes that doesn't exclude dummy-run-tapes with real runs, making the
> allocation reasonably sensible for every merge pass".

I'm not sure I understood what your concern was, but please have a look
at this new version, if the comments in initTapeBuffers() explain that
well enough.

> * The "batchUsed" terminology really isn't working here, AFAICT. For
> one thing, you have two separate areas where caller tuples might
> reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
> tape), and the logtape.c controlled preread buffers (sized up to
> MaxAllocSize per tape). Which of these two things is batch memory? I
> think it might just be the first one, but KiBs of memory do not
> suggest "batch" to me. Isn't that really more like what you might call
> double buffer memory, used not to save overhead from palloc (having
> many palloc headers in memory), but rather to recycle memory
> efficiently? So, these two things should have two new names of their
> own, I think, and neither should be called "batch memory" IMV. I see
> several assertions remain here and there that were written with my
> original definition of batch memory in mind -- assertions like:

Ok. I replaced "batch" terminology with "slab allocator" and "slab
slots", I hope this is less confusing. This isn't exactly like e.g. the
slab allocator in the Linux kernel, as it has a fixed number of slots,
so perhaps an "object pool" might be more accurate. But I like "slab"
because it's not used elsewhere in the system. I also didn't use the
term "cache" for the "slots", as might be typical for slab allocators,
because "cache" is such an overloaded term.

>         case TSS_SORTEDONTAPE:
>             Assert(forward || state->randomAccess);
>             Assert(!state->batchUsed);
>
> (Isn't state->batchUsed always true now?)

Good catch. It wasn't, because mergeruns() set batchUsed only after
checking for the TSS_SORTEDONTAPE case, even though it set up the batch
memory arena before it. So if replacement selection was able to produce
a single run batchUsed was false. Fixed, the slab allocator (as it's now
called) is now always used in TSS_SORTEDONTAPE case.

> And:
>
>         case TSS_FINALMERGE:
>             Assert(forward);
>             Assert(state->batchUsed || !state->tuples);
>
> (Isn't state->tuples only really of interest to datum-tuple-case
> routines, now that you've made everything use large logtape.c preread
> buffers?)

Yeah, fixed.

> * Is is really necessary for !state->tuples cases to do that
> MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
> there for pass-by-value datum cases to have this scratch/double buffer
> memory at all, since the value is returned to caller by-value, not
> by-reference? This is related to the problem of it not being entirely
> clear what batch memory now is, I think.

True, fixed. I still set slabAllocatorUsed (was batchUsed), but it's
initialized as a dummy 0-slot arena when !state->tuples.

Here's a new patch version, addressing the points you made. Please have
a look!

- Heikki


Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Bug in to_timestamp().
Next
From: Robert Haas
Date:
Subject: Re: "Re: Question about grant create on database and pg_dump/pg_dumpall