tuplesort memory usage: grow_memtuples - Mailing list pgsql-hackers

From: Jeff Janes
Subject: tuplesort memory usage: grow_memtuples
Msg-id: CAMkU=1zE-d2HAU=MqmeVjBFv577FzWRzHmm6e3CQ9Os7-jRv7A@mail.gmail.com
When sorting small tuples, the memtuples array can use a substantial
fraction of the total per-tuple memory.  (In the pass-by-value case, it
is all of it.)
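
To make that concrete, the per-slot cost of the array itself is easy to
see with a rough stand-in for tuplesort.c's SortTuple (the layout below is
illustrative only, not copied from the source):

#include <stdio.h>
#include <stdbool.h>

/*
 * Rough stand-in for the per-tuple array entry (SortTuple in tuplesort.c).
 * Field names and layout here are approximate, for illustration only.
 */
typedef struct
{
    void           *tuple;     /* palloc'd tuple, or NULL for a pass-by-value datum */
    unsigned long   datum1;    /* value of the first key column (a Datum in PostgreSQL) */
    bool            isnull1;   /* is the first key column NULL? */
    int             tupindex;  /* run/tape bookkeeping used while merging */
} ExampleSortTuple;

int
main(void)
{
    /* Typically prints 24 on a 64-bit build: ~24 bytes of array per tuple. */
    printf("per-slot cost: %zu bytes\n", sizeof(ExampleSortTuple));
    return 0;
}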

The way it grows leads to sub-optimal memory usage.

In one case, if the array could grow by 1.999x, but not by 2x, we just
give up and use half the memory we could have.  This is what actually
happens in pass-by-value sorting when *work_mem is set to a power of
two.

While I understand we don't want to thrash memory, it seems like we
could at least grow by less than 2x just one last time.
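
For illustration, the difference between the two policies can be sketched
like this (standalone code with invented names, not the actual
grow_memtuples logic):

#include <stdio.h>

/*
 * Given the bytes still available in the sort's memory budget, decide how
 * large the slot array may become.  The current code effectively refuses
 * to grow unless a full doubling fits; the alternative is one final,
 * smaller growth step.
 */
static size_t
choose_new_slot_count(size_t cur_slots, size_t slot_size, size_t avail_bytes)
{
    size_t extra_slots_that_fit = avail_bytes / slot_size;

    if (extra_slots_that_fit >= cur_slots)
        return cur_slots * 2;               /* the usual doubling fits */

    /* Doubling does not fit: grow by whatever does, just this once. */
    return cur_slots + extra_slots_that_fit;
}

int
main(void)
{
    /* e.g. 30000 slots of 24 bytes each, with only 300000 budget bytes left */
    size_t cur = 30000;
    size_t grown = choose_new_slot_count(cur, 24, 300000);

    /* Giving up would leave us at 30000 slots; this grows to 42500. */
    printf("grow from %zu to %zu slots\n", cur, grown);
    return 0;
}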

Also, if there is just barely room to double memtuples (and it is
pass-by-reference sorting), then the number of slots is doubled, but
that leaves no room for the tuples themselves to be stored, so most of
the new slots can't be used.
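
To put rough numbers on that (everything below is made up but plausible;
it is not taken from the patch):

#include <stdio.h>

/*
 * Illustrative arithmetic: doubling the slot array when there is just
 * barely room for the array itself leaves almost no budget for the tuples
 * that the new slots are supposed to point to.
 */
int
main(void)
{
    long    budget = 16L * 1024 * 1024;     /* allowedMem, e.g. work_mem = 16MB */
    long    slot_size = 24;                 /* per-slot cost of the array */
    long    avg_tuple = 78;                 /* average palloc'd tuple size so far */
    long    slots = 131072;                 /* current array size, currently full */

    long    array_bytes = slots * slot_size;
    long    tuple_bytes = slots * avg_tuple;
    long    avail = budget - array_bytes - tuple_bytes;

    /* Doubling the array is (barely) affordable ... */
    long    avail_after_doubling = avail - array_bytes;

    /* ... but the leftover budget covers only a sliver of the new slots. */
    long    usable_new_slots = avail_after_doubling / avg_tuple;

    printf("new slots added: %ld, new slots actually usable: %ld\n",
           slots, usable_new_slots);
    return 0;
}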

A solution to both would be to detect when we are about to do one of
the two things above, and instead use the historical average tuple size
to compute how much growth we can afford, so that the memtuples slots
and the cumulative palloc'd tuples exhaust the memory budget at the
same time.  If availMem is less than half of allowedMem, then the next
doubling of memtuples would either fail, or would succeed but leave
behind too little allowedMem to hold the full complement of new
tuples.  So either way, extrapolate from current usage.  The patch
assumes that nothing other than memtuples and individual tuple storage
has been deducted from availMem.  Currently that is true, but it may
not always be so (we recently discussed pre-deducting the tape buffer
overhead so that it doesn't cause a memory overrun).
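
Roughly, the computation looks like this (a simplified, standalone sketch,
not the patch itself; it ignores palloc chunk overhead, integer overflow,
and the clamping a real implementation needs):

#include <stdio.h>

/*
 * Everything charged against the budget so far is either the slot array or
 * an individual tuple, so the average cost per tuple to date is
 * (budget - avail) / ntuples.  Extrapolating, the whole budget holds about
 * budget / per_tuple tuples, and that is the slot count to grow to, so the
 * slots and the tuple storage run out at the same time.
 */
static long
target_slot_count(long budget, long avail, long cur_slots, long ntuples)
{
    double  per_tuple = (double) (budget - avail) / (double) ntuples;
    long    target = (long) ((double) budget / per_tuple);

    return (target > cur_slots) ? target : cur_slots;   /* never shrink */
}

int
main(void)
{
    /* e.g. a 1 MB budget and 8192 24-byte slots, all full of 100-byte tuples */
    long    budget = 1024 * 1024;
    long    slots = 8192;
    long    used = slots * (24 + 100);
    long    target = target_slot_count(budget, budget - used, slots, slots);

    /* A full doubling would not fit here, but growing to 8456 slots does. */
    printf("grow from %ld to %ld slots\n", slots, target);
    return 0;
}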

As the XXXX comment in the patch indicates, I am not currently able to
prove to myself that there are no conditions under which this change
could cause LACKMEM(state) to become true.

This patch gives about a 2-fold window of input sizes over which a sort
that would otherwise spill to a tape sort will instead be done in
memory, which is 2-3 times faster.  That might not be very impressive,
because after all, if you wanted to use more memory you could have just
increased work_mem.  But it still seems worthwhile to use the memory we
are told we can use.  It may be hard to fine-tune the *work_mem
settings, but not using the amount we are given surely doesn't make it
any easier.

Beyond the tape-versus-in-memory improvement, this also brings other
gains.  For example, it gives me about a 35% speed-up when using the
default work_mem to do a "select count(distinct k) from t", where k is
a random integer and t has 512 million rows.  Granted, it is silly to
intentionally do such a large sort with such small memory.  But it
still seems worth improving, and the wins should show up at other
sizes as well, though probably not as big.

In my test case, the improvement mostly comes from turning random IO
reads into sequential ones.

The prevalence of random IO is due to a cascade of issues:

1) As discussed above, we are using only half the memory we could be.

2) Due to MINORDER, we are spreading that reduced memory over twice as
many tapes as MERGE_BUFFER_SIZE would suggest.  Changing this would
obviously have a trade-off.

3) MERGE_BUFFER_SIZE itself describes the in-RAM footprint, not the
on-tape footprint, because we unpack tuples as we read them from tape.
Perhaps mergeprereadone could stash the preread data still packed and
unpack it only as needed.  But that is a much more invasive change.

4) Pre-reading all tapes whenever just one becomes empty causes blocks
to be freed, and thus re-used, in smaller contiguous chunks than they
otherwise would be.  (This is easy to change, but there is a trade-off
to changing it.)

5) Even without 4, logtape.c still makes a jumble of the free list on
multi-level merges.  I'm not entirely sure why yet--I think maybe it
is the way indirect blocks are handled.

Add it all up, and instead of pre-reading 32 consecutive 8K blocks, it
pre-reads only about 1 or 2 consecutive ones on the final merge.  Some
of that could be salvaged by the kernel keeping track of multiple
interleaved read-ahead opportunities, but in my hands vmstat shows a
lot of IO wait, and the reads look closer to random IO than to large
read-ahead.  If the read-ahead were truly efficient, the CPU would
probably be the limiting factor.
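
Putting rough numbers on that cascade (illustrative arithmetic only, using
the factors described above rather than the actual constants in the
source):

#include <stdio.h>

int
main(void)
{
    double  blocks = 32.0;      /* ideal preread: 32 consecutive 8K blocks */

    blocks /= 2;                /* (1) only half of the budget is in use */
    blocks /= 2;                /* (2) MINORDER spreads it over ~2x as many tapes */
    printf("best case after (1) and (2): ~%.0f consecutive blocks\n", blocks);

    /*
     * Issues (3) through (5) -- the smaller on-tape footprint, pre-reading
     * every tape when one runs dry, and the jumbled logtape.c free list --
     * fragment things further, down to the observed 1 or 2 consecutive
     * blocks on the final merge.
     */
    return 0;
}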

I'll add this to the open commitfest.  When doing performance testing,
I find it far easier to alter a guc.c parameter than to keep multiple
compiled binaries around.  Would people like a testing patch that
switches between this behavior and the original one under the control
of a parameter setting?

Cheers,

Jeff

