Re: Tuplesort merge pre-reading - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Tuplesort merge pre-reading
Date
Msg-id 95f5a75d-b6ad-f175-57f2-6e32f745e908@iki.fi
In response to Re: Tuplesort merge pre-reading  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Tuplesort merge pre-reading
List pgsql-hackers
On 09/06/2016 10:26 PM, Peter Geoghegan wrote:
> On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Offhand, I would think that taken together this is very important. I'd
>> certainly want to see cases in the hundreds of megabytes or gigabytes
>> of work_mem alongside your 4MB case, even just to be able to talk
>> informally about this. As you know, the default work_mem value is very
>> conservative.

I spent some more time polishing this up, and also added some code to
logtape.c to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead and make the I/O more sequential, which was the
purpose of the old pre-reading code, but it's simpler. I haven't tested
that part much yet, but I plan to run some tests on larger data sets
that don't fit in RAM, to make the I/O effects visible.

I wrote a little testing toolkit; see the third patch. I'm not
proposing to commit that, but that's what I used for testing. It
creates four tables, about 1GB in size each (it also creates smaller
and larger tables, but I used the "medium" sized ones for these tests).
Two of the tables contain integers, and two contain text strings. Two
of the tables are completely ordered, and two are in random order. To
measure, it runs ORDER BY queries on the tables with different work_mem
settings.

Attached are the full results. In summary, these patches improve
performance in some of the tests, and are a wash on others. The patches
help in particular in the randomly ordered cases, with up to 512 MB of
work_mem.

For example, with work_mem=256MB, which is enough to get a single merge
pass:

with patches:

ordered_ints: 7078 ms, 6910 ms, 6849 ms
random_ints: 15639 ms, 15575 ms, 15625 ms
ordered_text: 11121 ms, 12318 ms, 10824 ms
random_text: 53462 ms, 53420 ms, 52949 ms

unpatched master:

ordered_ints: 6961 ms, 7040 ms, 7044 ms
random_ints: 19205 ms, 18797 ms, 18955 ms
ordered_text: 11045 ms, 11377 ms, 11203 ms
random_text: 57117 ms, 54489 ms, 54806 ms

(Each query was run three times in a row; that's what the three numbers
on each row mean. The differences between the numbers on the same row
are just noise.)

> It looks like your benchmark relies on multiple passes, which can be
> misleading. I bet it suffers some amount of problems from palloc()
> fragmentation. When very memory constrained, that can get really bad.
>
> Non-final merge passes (merges with more than one run -- real or dummy
> -- on any given tape) can have uneven numbers of runs on each tape.
> So, tuplesort.c needs to be prepared to doll out memory among tapes
> *unevenly* there (same applies to memtuples "slots"). This is why
> batch memory support is so hard for those cases (the fact that they're
> so rare anyway also puts me off it). As you know, I wrote a patch that
> adds batch memory support to cases that require randomAccess (final
> output on a materialized tape), for their final merge. These final
> merges happen to not be a final on-the-fly merge only due to this
> randomAccess requirement from caller. It's possible to support these
> cases in the future, with that patch, only because I am safe to assume
> that each run/tape is the same size there (well, the assumption is
> exactly as safe as it was for the 9.6 final on-the-fly merge, at
> least).
>
> My point about non-final merges is that you have to be very careful
> that you're comparing apples to apples, memory accounting wise, when
> looking into something like this. I'm not saying that you didn't, but
> it's worth considering.

I'm not 100% sure I'm accounting for all the memory correctly. But I
didn't touch the way the initial quicksort works, nor the way the runs
are built. And the merge passes don't actually need or benefit from a
lot of memory, so I doubt it's very sensitive to that.

In this patch, the memory available for the read buffers is simply
divided evenly across maxTapes. The buffers for tapes that are not
currently active are wasted. It could be made smarter, by freeing the
buffers of the tapes that are not active at the moment. I might do that
later, but this is what I'm going to benchmark for now. I don't think
adding buffer space is helpful beyond a certain point, so this is
probably good enough in practice. It would be nice, though, to free the
memory we don't need earlier, in case there are other processes that
could make use of it.
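To make the "divided evenly" part concrete, here's a toy version of the
calculation (just a sketch; per_tape_buffer_size and the constants are
illustrative, not the actual code in the patch):

#include <stdio.h>

#define BLCKSZ 8192				/* PostgreSQL block size */

/*
 * Split the memory left over for read buffers evenly across all tapes.
 * Inactive tapes still get a slice, which is the waste mentioned above.
 */
static long
per_tape_buffer_size(long avail_mem, int max_tapes)
{
	long		per_tape = avail_mem / max_tapes;

	per_tape -= per_tape % BLCKSZ;	/* round down to whole blocks */
	if (per_tape < BLCKSZ)
		per_tape = BLCKSZ;		/* always give at least one block */
	return per_tape;
}

int
main(void)
{
	/* e.g. 256 MB of buffer memory split over 28 tapes */
	printf("per-tape read buffer: %ld kB\n",
		   per_tape_buffer_size(256L * 1024 * 1024, 28) / 1024);
	return 0;
}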

> FWIW, I did try an increase in the buffer size in LogicalTape at one
> time several months back, and saw no benefit there (at least, with no
> other changes).

Yeah, unless you get rid of the pre-reading in tuplesort.c, you're just
double-buffering.

- Heikki


