Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
While reviewing Peter's latest round of sorting patches, and trying to
understand the new "batch allocation" mechanism, I started to wonder how
useful the pre-reading in the merge stage is in the first place.

I'm talking about the code that reads a bunch of tuples from each tape, loading
them into the memtuples array. That code was added by Tom Lane, back in
1999:

commit cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Oct 30 17:27:15 1999 +0000

     Further performance improvements in sorting: reduce number of
comparisons
     during initial run formation by keeping both current run and next-run
     tuples in the same heap (yup, Knuth is smarter than I am).  And, during
     merge passes, make use of available sort memory to load multiple tuples
     from any one input 'tape' at a time, thereby improving locality of
     access to the temp file.

So apparently there was a benefit back then, but is it still worthwhile?
The LogicalTape code buffers one block at a time anyway, so how much gain
are we getting from parsing the tuples into SortTuple format in batches?

I wrote a quick patch to test that, attached. It seems to improve
performance, at least in this small test case:

create table lotsofints(i integer);
insert into lotsofints select random() * 1000000000.0 from
generate_series(1, 10000000);
vacuum freeze;

select count(*) FROM (select * from lotsofints order by i) t;

On my laptop, with default work_mem=4MB, that select takes 7.8 s on
unpatched master, and 6.2 s with the attached patch.

So, at least in some cases, the pre-loading hurts. I think we should get
rid of it. This patch probably needs some polishing: I replaced the
batch allocations with a simpler scheme with a buffer to hold just a
single tuple for each tape, and that might need some more work to allow
downsizing those buffers if you have a few very large tuples in an
otherwise narrow table. And perhaps we should free and reallocate a
smaller memtuples array for the merging, now that we're not making use
of the whole of it. And perhaps we should teach LogicalTape to use
larger buffers, if we can't rely on the OS to do the buffering for us.
But overall, this seems to make the code both simpler and faster.
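
To make that single-buffer-per-tape scheme concrete, here's a rough sketch
of the shape I have in mind (the names are made up for illustration, not
what the attached patch actually uses; it just assumes the usual
palloc/repalloc):

/* Hypothetical per-tape state: one reusable buffer for the current tuple */
typedef struct TapeTupleBuffer
{
    void   *tuple;      /* holds the most recently read tuple from this tape */
    Size    allocated;  /* currently allocated size of 'tuple' */
} TapeTupleBuffer;

/*
 * Make sure the tape's buffer can hold a tuple of 'tuplen' bytes, growing
 * it if needed.  It is never shrunk here, which is the part that would need
 * more work for the occasional very large tuple in a narrow table.
 */
static void *
tape_buffer_reserve(TapeTupleBuffer *buf, Size tuplen)
{
    if (buf->tuple == NULL)
    {
        buf->tuple = palloc(tuplen);
        buf->allocated = tuplen;
    }
    else if (tuplen > buf->allocated)
    {
        buf->tuple = repalloc(buf->tuple, tuplen);
        buf->allocated = tuplen;
    }
    return buf->tuple;
}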

Am I missing something?

- Heikki


Attachment

Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 5:20 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I wrote a quick patch to test that, attached. It seems to improve
> performance, at least in this small test case:
>
> create table lotsofints(i integer);
> insert into lotsofints select random() * 1000000000.0 from
> generate_series(1, 10000000);
> vacuum freeze;
>
> select count(*) FROM (select * from lotsofints order by i) t;
>
> On my laptop, with default work_mem=4MB, that select takes 7.8 s on
> unpatched master, and 6.2 s with the attached patch.

The benefits have a lot to do with OS read-ahead, and efficient use of
memory bandwidth during the merge, where we want to access the caller
tuples sequentially per tape (i.e. that's what the batch memory stuff
added -- it also made much better use of available memory). Note that
I've been benchmarking the parallel CREATE INDEX patch on a server
with many HDDs, since sequential performance is mostly all that
matters. I think that in 1999, the preloading had a lot more to do
with logtape.c's ability to aggressively recycle blocks during merges,
such that the total storage overhead does not exceed the original size
of the caller tuples (plus what it calls "trivial bookkeeping
overhead" IIRC). That's less important these days, but still matters
some (it's more of an issue when you can't complete the sort in one
pass, which is rare these days).

Offhand, I would think that taken together this is very important. I'd
certainly want to see cases in the hundreds of megabytes or gigabytes
of work_mem alongside your 4MB case, even just to be able to talk
informally about this. As you know, the default work_mem value is very
conservative.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Offhand, I would think that taken together this is very important. I'd
> certainly want to see cases in the hundreds of megabytes or gigabytes
> of work_mem alongside your 4MB case, even just to be able to talk
> informally about this. As you know, the default work_mem value is very
> conservative.

It looks like your benchmark relies on multiple passes, which can be
misleading. I bet it suffers some amount of problems from palloc()
fragmentation. When very memory constrained, that can get really bad.

Non-final merge passes (merges with more than one run -- real or dummy
-- on any given tape) can have uneven numbers of runs on each tape.
So, tuplesort.c needs to be prepared to dole out memory among tapes
*unevenly* there (same applies to memtuples "slots"). This is why
batch memory support is so hard for those cases (the fact that they're
so rare anyway also puts me off it). As you know, I wrote a patch that
adds batch memory support to cases that require randomAccess (final
output on a materialized tape), for their final merge. These final
merges happen to not be a final on-the-fly merge only due to this
randomAccess requirement from caller. It's possible to support these
cases in the future, with that patch, only because I am safe to assume
that each run/tape is the same size there (well, the assumption is
exactly as safe as it was for the 9.6 final on-the-fly merge, at
least).

My point about non-final merges is that you have to be very careful
that you're comparing apples to apples, memory accounting wise, when
looking into something like this. I'm not saying that you didn't, but
it's worth considering.

FWIW, I did try an increase in the buffer size in LogicalTape at one
time several months back, and saw no benefit there (at least, with no
other changes).

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/06/2016 10:26 PM, Peter Geoghegan wrote:
> On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Offhand, I would think that taken together this is very important. I'd
>> certainly want to see cases in the hundreds of megabytes or gigabytes
>> of work_mem alongside your 4MB case, even just to be able to talk
>> informally about this. As you know, the default work_mem value is very
>> conservative.

I spent some more time polishing this up, and also added some code to
logtape.c, to use larger read buffers, to compensate for the fact that
we don't do pre-reading from tuplesort.c anymore. That should trigger
the OS read-ahead, and make the I/O more sequential, like was the
purpose of the old pre-reading code. But simpler. I haven't tested that
part much yet, but I plan to run some tests on larger data sets that
don't fit in RAM, to make the I/O effects visible.

I wrote a little testing toolkit, see third patch. I'm not proposing to
commit that, but that's what I used for testing. It creates four tables,
about 1GB in size each (it also creates smaller and larger tables, but I
used the "medium" sized ones for these tests). Two of the tables contain
integers, and two contain text strings. Two of the tables are completely
ordered, two are in random order. To measure, it runs ORDER BY queries
on the tables, with different work_mem settings.

Attached are the full results. In summary, these patches improve
performance in some of the tests, and are a wash on others. The patches
help in particular in the randomly ordered cases, with up to 512 MB of
work_mem.

For example, with work_mem=256MB, which is enough to get a single merge
pass:

with patches:

ordered_ints: 7078 ms, 6910 ms, 6849 ms
random_ints: 15639 ms, 15575 ms, 15625 ms
ordered_text: 11121 ms, 12318 ms, 10824 ms
random_text: 53462 ms, 53420 ms, 52949 ms

unpatched master:

ordered_ints: 6961 ms, 7040 ms, 7044 ms
random_ints: 19205 ms, 18797 ms, 18955 ms
ordered_text: 11045 ms, 11377 ms, 11203 ms
random_text: 57117 ms, 54489 ms, 54806 ms

(Each query was run three times in a row; that's what the three numbers
on each row mean, i.e. the differences between the numbers on the same
row are just noise.)

> It looks like your benchmark relies on multiple passes, which can be
> misleading. I bet it suffers some amount of problems from palloc()
> fragmentation. When very memory constrained, that can get really bad.
>
> Non-final merge passes (merges with more than one run -- real or dummy
> -- on any given tape) can have uneven numbers of runs on each tape.
> So, tuplesort.c needs to be prepared to dole out memory among tapes
> *unevenly* there (same applies to memtuples "slots"). This is why
> batch memory support is so hard for those cases (the fact that they're
> so rare anyway also puts me off it). As you know, I wrote a patch that
> adds batch memory support to cases that require randomAccess (final
> output on a materialized tape), for their final merge. These final
> merges happen to not be a final on-the-fly merge only due to this
> randomAccess requirement from caller. It's possible to support these
> cases in the future, with that patch, only because I am safe to assume
> that each run/tape is the same size there (well, the assumption is
> exactly as safe as it was for the 9.6 final on-the-fly merge, at
> least).
>
> My point about non-final merges is that you have to be very careful
> that you're comparing apples to apples, memory accounting wise, when
> looking into something like this. I'm not saying that you didn't, but
> it's worth considering.

I'm not 100% sure I'm accounting for all the memory correctly. But I
didn't touch the way the initial quicksort works, nor the way the runs
are built. And the merge passes don't actually need or benefit from a
lot of memory, so I doubt it's very sensitive to that.

In this patch, the memory available for the read buffers is just divided
evenly across maxTapes. The buffers for the tapes that are not currently
active are wasted. It could be made smarter, by freeing all the
currently-unused buffers for tapes that are not active at the moment.
Might do that later, but this is what I'm going to benchmark for now. I
don't think adding buffers is helpful beyond a certain point, so this is
probably good enough in practice. Although it would be nice to free the
memory we don't need earlier, in case there are other processes that
could make use of it.

> FWIW, I did try an increase in the buffer size in LogicalTape at one
> time several months back, and saw no benefit there (at least, with no
> other changes).

Yeah, unless you get rid of the pre-reading in tuplesort.c, you're just
double-buffering.

- Heikki


Attachment

Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:
> On 09/06/2016 10:26 PM, Peter Geoghegan wrote:
>> On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> Offhand, I would think that taken together this is very important. I'd
>>> certainly want to see cases in the hundreds of megabytes or gigabytes
>>> of work_mem alongside your 4MB case, even just to be able to talk
>>> informally about this. As you know, the default work_mem value is very
>>> conservative.
>
> I spent some more time polishing this up, and also added some code to
> logtape.c, to use larger read buffers, to compensate for the fact that
> we don't do pre-reading from tuplesort.c anymore. That should trigger
> the OS read-ahead, and make the I/O more sequential, like was the
> purpose of the old pre-reading code. But simpler. I haven't tested that
> part much yet, but I plan to run some tests on larger data sets that
> don't fit in RAM, to make the I/O effects visible.

Ok, I ran a few tests with 20 GB tables. I thought this would show any
differences in I/O behaviour, but in fact it was still completely CPU
bound, like the tests on smaller tables I posted yesterday. I guess I
need to point temp_tablespaces to a USB drive or something. But here we go.

It looks like there was a regression when sorting random text, with 256
MB work_mem. I suspect that was a fluke - I only ran these tests once
because they took so long. But I don't know for sure.

Claudio, if you could also repeat the tests you ran on Peter's patch set
on the other thread, with these patches, that'd be nice. These patches
are effectively a replacement for
0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
would be much appreciated too, of course.

Attached are new versions. Compared to last set, they contain a few
comment fixes, and a change to the 2nd patch to not allocate tape
buffers for tapes that were completely unused.

- Heikki


Attachment

Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
(Resending with compressed files, to get below the message size limit of
the list)

On 09/09/2016 02:13 PM, Heikki Linnakangas wrote:
> On 09/08/2016 09:59 PM, Heikki Linnakangas wrote:
>> On 09/06/2016 10:26 PM, Peter Geoghegan wrote:
>>> On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>>> Offhand, I would think that taken together this is very important. I'd
>>>> certainly want to see cases in the hundreds of megabytes or gigabytes
>>>> of work_mem alongside your 4MB case, even just to be able to talk
>>>> informally about this. As you know, the default work_mem value is very
>>>> conservative.
>>
>> I spent some more time polishing this up, and also added some code to
>> logtape.c, to use larger read buffers, to compensate for the fact that
>> we don't do pre-reading from tuplesort.c anymore. That should trigger
>> the OS read-ahead, and make the I/O more sequential, like was the
>> purpose of the old pre-reading code. But simpler. I haven't tested that
>> part much yet, but I plan to run some tests on larger data sets that
>> don't fit in RAM, to make the I/O effects visible.
>
> Ok, I ran a few tests with 20 GB tables. I thought this would show any
> differences in I/O behaviour, but in fact it was still completely CPU
> bound, like the tests on smaller tables I posted yesterday. I guess I
> need to point temp_tablespaces to a USB drive or something. But here we go.

I took a different tack at demonstrating the I/O pattern effects. I
added some instrumentation code to logtape.c, that prints a line to a
file whenever it reads a block, with the block number. I ran the same
query with master and with these patches, and plotted the access pattern
with gnuplot.
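
The instrumentation itself is trivial, roughly something like this in
logtape.c's block-read path (the file name and exact hook point here are
only illustrative; see the attached patch for the real thing):

/* in ltsReadBlock(), or wherever a block is read from the underlying file */
{
    static FILE *trace_fp = NULL;

    if (trace_fp == NULL)
        trace_fp = fopen("/tmp/logtape-trace.txt", "a");
    fprintf(trace_fp, "%ld\n", blocknum);
    fflush(trace_fp);
}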

I'm happy with what it looks like. We are in fact getting a more
sequential access pattern with these patches, because we're not
expanding the pre-read tuples into SortTuples. Keeping densely-packed
blocks in memory, instead of SortTuples, allows caching more data overall.

Attached is the patch I used to generate these traces, the gnuplot
script, and the traces I got from sorting a 1 GB table of random
integers, with work_mem=16MB.

Note that in the big picture, what appear to be individual dots are
actually clusters of a bunch of dots. So the access pattern is a lot
more sequential than it looks like at first glance, with or without
these patches. The zoomed-in pictures show that. If you want to inspect
these in more detail, I recommend running gnuplot in interactive mode,
so that you can zoom in and out easily.

I'm happy with the amount of testing I've done now, and the results.
Does anyone want to throw out any more test cases where there might be a
regression? If not, let's get these reviewed and committed.

- Heikki

Attachment

Re: Tuplesort merge pre-reading

From
Greg Stark
Date:
On Fri, Sep 9, 2016 at 1:01 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm happy with what it looks like. We are in fact getting a more sequential
> access pattern with these patches, because we're not expanding the pre-read
> tuples into SortTuples. Keeping densely-packed blocks in memory, instead of
> SortTuples, allows caching more data overall.


Wow, this is really cool. We should do something like this for query
execution too.

I still didn't follow exactly why removing the prefetching allows more
sequential i/o. I thought the whole point of prefetching was to reduce
the random i/o from switching tapes.

-- 
greg



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/09/2016 03:25 PM, Greg Stark wrote:
> On Fri, Sep 9, 2016 at 1:01 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I'm happy with what it looks like. We are in fact getting a more sequential
>> access pattern with these patches, because we're not expanding the pre-read
>> tuples into SortTuples. Keeping densely-packed blocks in memory, instead of
>> SortTuples, allows caching more data overall.
>
>
> Wow, this is really cool. We should do something like this for query
> execution too.
>
> I still didn't follow exactly why removing the prefetching allows more
> sequential i/o. I thought the whole point of prefetching was to reduce
> the random i/o from switching tapes.

The first patch removed prefetching, but the second patch re-introduced 
it, in a different form. The prefetching is now done in logtape.c, by 
reading multiple pages at a time. The on-tape representation of tuples 
is more compact than having them in memory as SortTuples, so you can fit 
more data in memory overall, which makes the access pattern more sequential.
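
Schematically, the read-ahead in logtape.c now amounts to something like
this (a heavily simplified sketch: ltsRefillReadBuffer() and
ltsNextBlockToRead() are made-up names, and the real code has to follow the
tape's block chain and cope with a partially-filled last block):

/* Refill the tape's read buffer with as many upcoming blocks as fit. */
static void
ltsRefillReadBuffer(LogicalTapeSet *lts, LogicalTape *lt)
{
    Size    nblocks = lt->buffer_size / BLCKSZ;
    Size    i;

    lt->pos = 0;
    lt->nbytes = 0;
    for (i = 0; i < nblocks; i++)
    {
        /* ltsNextBlockToRead() stands in for the block-chain bookkeeping */
        long    blocknum = ltsNextBlockToRead(lts, lt);

        if (blocknum == -1L)
            break;              /* no more data on this tape */
        ltsReadBlock(lts, blocknum, (char *) lt->buffer + i * BLCKSZ);
        lt->nbytes += BLCKSZ;
    }
}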

There's one difference between these approaches that I didn't point out 
earlier: We used to prefetch tuples from each *run*, and stopped 
pre-reading when we reached the end of the run. Now that we're doing the 
prefetching as raw tape blocks, we don't stop at run boundaries. I don't 
think that makes any big difference one way or another, but I thought 
I'd mention it.

- Heikki





Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Fri, Sep 9, 2016 at 4:55 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm happy with the amount of testing I've done now, and the results. Does
> anyone want to throw out any more test cases where there might be a
> regression? If not, let's get these reviewed and committed.

I'll try to look at this properly tomorrow. Currently still working
away at creating a new revision of my sorting patchset. Obviously this
is interesting, but it raises certain questions for the parallel
CREATE INDEX patch in particular that I'd like to get straight, aside
from everything else.

I've been using an AWS d2.4xlarge instance for testing all my recent
sort patches, with 16 vCPUs, 122 GiB RAM, 12 x 2 TB disks. It worked
well to emphasize I/O throughput and parallelism over latency. I'd
like to investigate how this pre-reading stuff does there. I recall
that for one very large case, it took a full minute to do just the
first round of preloading during the leader's final merge (this was
with something like 50GB of maintenance_work_mem). So, it will be
interesting.

BTW, noticed a typo here:

> + * track memory usage of indivitual tuples.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Fri, Sep 9, 2016 at 5:25 AM, Greg Stark <stark@mit.edu> wrote:
> Wow, this is really cool. We should do something like this for query
> execution too.

We should certainly do this for tuplestore.c, too. I've been meaning
to adapt it to use batch memory. I did look at it briefly, and recall
that it was awkward because a surprisingly large number
of callers want to have memory that they can manage independently of
the lifetime of their tuplestore.


-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Claudio Freire
Date:
On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Claudio, if you could also repeat the tests you ran on Peter's patch set on
> the other thread, with these patches, that'd be nice. These patches are
> effectively a replacement for
> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
> would be much appreciated too, of course.
>
> Attached are new versions. Compared to last set, they contain a few comment
> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
> that were completely unused.


Will do so



Re: Tuplesort merge pre-reading

From
Claudio Freire
Date:
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>> the other thread, with these patches, that'd be nice. These patches are
>> effectively a replacement for
>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>> would be much appreciated too, of course.
>>
>> Attached are new versions. Compared to last set, they contain a few comment
>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>> that were completely unused.
>
>
> Will do so

It seems both 1 and 1+2 break make check.

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Peter's patch needs some rebasing on top of those but nothing major.



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/10/2016 04:21 AM, Claudio Freire wrote:
> On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>
>>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>>> the other thread, with these patches, that'd be nice. These patches are
>>> effectively a replacement for
>>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>>> would be much appreciated too, of course.
>>>
>>> Attached are new versions. Compared to last set, they contain a few comment
>>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>>> that were completely unused.
>>
>> Will do so

Thanks!

> It seems both 1 and 1+2 break make check.

Oh. Works for me. What's the failure you're getting?

> Did I misunderstand something? I'm applying the first patch of Peter's
> series (cap number of tapes), then your first one (remove prefetch)
> and second one (use larger read buffers).

Yes. I have been testing without Peter's first patch, with just the two 
patches I posted. But it should work together (and does, I just tested) 
with that one as well.

- Heikki




Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Sat, Sep 10, 2016 at 12:04 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Did I misunderstand something? I'm applying the first patch of Peter's
>> series (cap number of tapes), then your first one (remove prefetch)
>> and second one (use larger read buffers).
>
>
> Yes. I have been testing without Peter's first patch, with just the two
> patches I posted. But it should work together (and does, I just tested) with
> that one as well.

You're going to need to rebase, since the root displace patch is based
on top of my patch 0002-*, not Heikki's alternative. But, that should
be very straightforward.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
Here's a new version of these patches, rebased over current master. I
squashed the two patches into one, there's not much point to keep them
separate.

- Heikki


Attachment

Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here's a new version of these patches, rebased over current master. I
> squashed the two patches into one, there's not much point to keep them
> separate.

I think I have my head fully around this now. For some reason, I
initially thought that this patch was a great deal more radical than
it actually is. (Like Greg, I somehow initially thought that you were
rejecting the idea of batch memory in general, and somehow (over)
leveraging the filesystem cache. I think I misunderstood your remarks
when we talked on IM about it early on.)

I don't know what the difference is between accessing 10 pages
randomly, and accessing a random set of 10 single pages sequentially,
in close succession. As Tom would say, that's above my pay grade. I
suppose it comes down to how close "close" actually is (but in any
case, it's all very fudged).

I mention this because I think that cost_sort() should be updated to
consider sequential I/O the norm, alongside this patch of yours (your
patch strengthens the argument [1] for that general idea). The reason
that this new approach produces more sequential I/O, apart from the
simple fact that you effectively have much more memory available and
so fewer rounds of preloading, is that the indirection block stuff can
make I/O less sequential in order to support eager reclamation of
space. For example, maybe there is interleaving of blocks as logtape.c
manages to reclaim blocks in the event of multiple merge steps. I care
about that second factor a lot more now than I would have a year ago,
when a final on-the-fly merge generally avoids multiple passes (and
associated logtape.c block fragmentation), because parallel CREATE
INDEX is usually affected by that (workers will often want to do their
own merge ahead of the leader's final merge), and because I want to
cap the number of tapes used, which will make multiple passes a bit
more common in practice.

I was always suspicious of the fact that memtuples is so large during
merging, and had a vague plan to fix that (although I was the one
responsible for growing memtuples even more for the merge in 9.6, that
was just because under the status quo of having many memtuples during
the merge, the size of memtuples should at least be in balance with
remaining memory for caller tuples -- it wasn't an endorsement of the
status quo). However, it never occurred to me to do that by pushing
batch memory into the head of logtape.c, which now seems like an
excellent idea.

To summarize my understanding of this patch: If it wasn't for my work
on parallel CREATE INDEX, I would consider this patch to give us only
a moderate improvement to user-visible performance, since it doesn't
really help memory rich cases very much (cases that are not likely to
have much random I/O anyway). In that universe, I'd be more
appreciative of the patch as a simplifying refactoring exercise; while
it's nice to get a boost for more memory-constrained cases, it's not a
huge win. However, that's not the universe we live
in -- I *am* working on parallel CREATE INDEX, of course. And so, I
really am very excited about this patch, because it really helps with
the particular challenges I have there, even though it's fair to
assume that we have a reasonable amount of memory available when
parallelism is used. If workers are going to do their own merges, as
they often will, then multiple merge pass cases become far more
important, and any additional I/O is a real concern, *especially*
additional random I/O (say from logtape.c fragmentation). The patch
directly addresses that, which is great. Your patch, alongside my
just-committed patch concerning how we maintain the heap invariant,
together attack the merge bottleneck from two different directions:
they address I/O costs, and CPU costs, respectively.

Other things I noticed:

* You should probably point out that typically, access to batch memory
will still be sequential, despite your block-based scheme. The
preloading will now more or less make that the normal case. Any
fragmentation will now be essentially in memory, not on disk, which is
a big win.

* I think that logtape.c header comments are needed for this. Maybe
that's where you should point out that memory access is largely
sequential. But it's surely true that logtape.c needs to take
responsibility for being the place where the vast majority of memory
is allocated during merging.

* I think you should move "bool   *mergeactive; /* active input run
source? */" within Tuplesortstate to be next to the other batch memory
stuff. No point in having separate merge and batch "sections" there
anymore.

* You didn't carry over my 0002-* batch memory patch modifications to
comments, even though you should have in a few cases. There remains
some references in comments to batch memory, as a thing exclusively
usable by final on-the-fly merges. That's not true anymore -- it's
usable by final merges, too. For example, readtup_alloc() still
references the final on-the-fly merge.

* You also fail to take credit in the commit message for making batch
memory usable when returning caller tuples to callers that happen to
request "randomAccess" (So, I guess the aforementioned comments above
routines like readtup_alloc() shouldn't even refer to merging, unless
it's to say that non-final merges are not supported due to their
unusual requirements). My patch didn't go that far (I only addressed
the final merge itself, not the subsequent access to tuples when
reading from that materialized final output tape by TSS_SORTEDONTAPE
case). But, that's actually really useful for randomAccess callers,
above and beyond what I proposed (which in any case was mostly written
with parallel workers in mind, which never do TSS_SORTEDONTAPE
processing).

* Furthermore, readtup_alloc() will not just be called in WRITETUP()
routines -- please update comments.

* There is a very subtle issue here:

> +   /*
> +    * We no longer need a large memtuples array, only one slot per tape. Shrink
> +    * it, to make the memory available for other use. We only need one slot per
> +    * tape.
> +    */
> +   pfree(state->memtuples);
> +   FREEMEM(state, state->memtupsize * sizeof(SortTuple));
> +   state->memtupsize = state->maxTapes;
> +   state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
> +   USEMEM(state, state->memtupsize * sizeof(SortTuple));

The FREEMEM() call needs to count the chunk overhead in both cases. In
short, I think you need to copy the GetMemoryChunkSpace() stuff that
you see within grow_memtuples().
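
In other words, something like this (untested, just to show the shape I
mean, mirroring what grow_memtuples() does):

/* account for palloc chunk header overhead, as grow_memtuples() does */
FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
pfree(state->memtuples);
state->memtupsize = state->maxTapes;
state->memtuples = (SortTuple *) palloc(state->maxTapes * sizeof(SortTuple));
USEMEM(state, GetMemoryChunkSpace(state->memtuples));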

* Whitespace issue here:

> @@ -2334,7 +2349,8 @@ inittapes(Tuplesortstate *state)
>  #endif
>
>     /*
> -    * Decrease availMem to reflect the space needed for tape buffers; but
> +    * Decrease availMem to reflect the space needed for tape buffers, when
> +    * writing the initial runs; but
>      * don't decrease it to the point that we have no room for tuples. (That
>      * case is only likely to occur if sorting pass-by-value Datums; in all
>      * other scenarios the memtuples[] array is unlikely to occupy more than
> @@ -2359,14 +2375,6 @@ inittapes(Tuplesortstate *state)
>     state->tapeset = LogicalTapeSetCreate(maxTapes);

* I think that you need to comment on why state->tuplecontext is not
used for batch memory now. It is still useful, for multiple merge
passes, but the situation has notably changed for it.

* Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:

> @@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
>             Assert(lt->frozen);
>             datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
>         }
> +
> +       /* Allocate a read buffer */
> +       if (lt->buffer)
> +           pfree(lt->buffer);
> +       lt->buffer = palloc(lt->read_buffer_size);
> +       lt->buffer_size = lt->read_buffer_size;

* Typo:

> +
> +   /*
> +    * from this point on, we no longer use the usemem()/lackmem() mechanism to
> +    * track memory usage of indivitual tuples.
> +    */
> +   state->batchused = true;

* Please make this use the ".., + 1023" natural rounding trick that is
used in the similar traces that are removed:

> +#ifdef TRACE_SORT
> +   if (trace_sort)
> +       elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
> +            (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
> +#endif

* It couldn't hurt to make this code paranoid about LACKMEM() becoming
true, which will cause havoc (we saw this recently in 9.6; a patch of
mine to fix that just went in):

> +   /*
> +    * Use all the spare memory we have available for read buffers. Divide it
> +    * memory evenly among all the tapes.
> +    */
> +   avail_blocks = state->availMem / BLCKSZ;
> +   per_tape = avail_blocks / maxTapes;
> +   cutoff = avail_blocks % maxTapes;
> +   if (per_tape == 0)
> +   {
> +       per_tape = 1;
> +       cutoff = 0;
> +   }
> +   for (tapenum = 0; tapenum < maxTapes; tapenum++)
> +   {
> +       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
> +                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
> +   }

In other words, we really don't want availMem to become < 0, since
it's int64, but a derived value is passed to
LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
if LACKMEM() did happen it would be a bug anyway, but I recommend
defensive code also be added. Per grow_memtuples(), "We need to be
sure that we do not cause LACKMEM to become true, else the space
management algorithm will go nuts". Let's be sure that we get that
right, since, as we saw recently, especially since grow_memtuples()
will not actually have the chance to save us here (if there is a bug
along these lines, let's at least make the right "can't happen error"
complaint to user when it pops up).
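
Even something this simple, just before avail_blocks is computed, would do
(the error wording is only an example):

/* shouldn't happen, but complain loudly rather than "go nuts" if it does */
Assert(!LACKMEM(state));
if (state->availMem < 0)
    elog(ERROR, "unexpected negative availMem while sizing tape read buffers");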

* It looks like your patch makes us less eager about freeing per-tape
batch memory, now held as preload buffer space within logtape.c.

ISTM that there should be some way to have the "tape exhausted" code
path within tuplesort_gettuple_common() (as well as the similar spot
within mergeonerun()) instruct logtape.c that we're done with that
tape. In other words, when mergeprereadone() (now renamed to
mergereadnext()) detects the tape is exhausted, it should have
logtape.c free its huge tape buffer immediately. Think about cases
where input is presorted, and earlier tapes can be freed quite early
on. It's worth keeping that around, (you removed the old way that this
happened, through mergebatchfreetape()).

That's all I have right now. I like the direction this is going in,
but I think this needs more polishing.

[1] https://www.postgresql.org/message-id/CAM3SWZQLP6e=1si1NcQjYft7R-VYpprrf_i59tZOZX5m7VFK-w@mail.gmail.com
-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * Please make this use the ".., + 1023" natural rounding trick that is
> used in the similar traces that are removed:
>
>> +#ifdef TRACE_SORT
>> +   if (trace_sort)
>> +       elog(LOG, "using %d kB of memory for read buffers in %d tapes, %d kB per tape",
>> +            (int) (state->availMem / 1024), maxTapes, (int) (per_tape * BLCKSZ) / 1024);
>> +#endif

Also, please remove the int cast, and use INT64_FORMAT. Again, this
should match existing trace_sort traces concerning batch memory.
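
Taken together, that would be something like (exact wording is up to you,
of course):

elog(LOG, "using " INT64_FORMAT " kB of memory for read buffers in %d tapes",
     (state->availMem + 1023) / 1024, maxTapes);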


-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> * Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:
>
>> @@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
>>             Assert(lt->frozen);
>>             datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
>>         }
>> +
>> +       /* Allocate a read buffer */
>> +       if (lt->buffer)
>> +           pfree(lt->buffer);
>> +       lt->buffer = palloc(lt->read_buffer_size);
>> +       lt->buffer_size = lt->read_buffer_size;

Of course, when you do that you're going to have to make the new
"buffer_size" and "read_buffer_size" fields of type "Size" (or,
possibly, "int64", to match tuplesort.c's own buffer sizing variables
ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
and "nbytes" fields next to "buffer_size" within the struct
LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
"long", though, since that reflects an existing hardly-relevant
limitation that you're not making any worse.

More generally, I think you also need to explain in comments why there
is a "read_buffer_size" field in addition to the "buffer_size" field.
-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> +   for (tapenum = 0; tapenum < maxTapes; tapenum++)
>> +   {
>> +       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>> +                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
>> +   }

Spotted another issue with this code just now. Shouldn't it be based
on state->tapeRange? You don't want the destTape to get memory, since
you don't use batch memory for tapes that are written to (and final
on-the-fly merges don't use their destTape at all).

(Looks again...)

Wait, you're using a local variable maxTapes here, which potentially
differs from state->maxTapes:

> +   /*
> +    * If we had fewer runs than tapes, refund buffers for tapes that were never
> +    * allocated.
> +    */
> +   maxTapes = state->maxTapes;
> +   if (state->currentRun < maxTapes)
> +   {
> +       FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
> +       maxTapes = state->currentRun;
> +   }

I find this confusing, and think it's probably buggy. I don't think
you should have a local variable called maxTapes that you modify at
all, since state->maxTapes is supposed to not change once established.
You can't use state->currentRun like that, either, I think, because
it's the high watermark number of runs (quicksorted runs), not runs
after any particular merge phase, where we end up with fewer runs as
they're merged (we must also consider dummy runs to get this) -- we
want something like activeTapes. cf. the code you removed for the
beginmerge() finalMergeBatch case. Of course, activeTapes will vary if
there are multiple merge passes, which suggests all this code really
has no business being in mergeruns() (it should instead be in
beginmerge(), or code that beginmerge() reliably calls).

Immediately afterwards, you do this:

> +   /* Initialize the merge tuple buffer arena.  */
> +   state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
> +   state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
> +   state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
> +   USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);

The fact that you size the buffer based on "maxTapes + 1" also
suggests a problem. I think that the code looks like this because it
must instruct logtape.c that the destTape tape requires some buffer
(iff there is to be a non-final merge). Is that right? I hope that you
don't give the typically unused destTape tape a full share of batch
memory all the time (the same applies to any other
inactive-at-final-merge tapes).

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Gavin Flower
Date:
On 12/09/16 10:13, Peter Geoghegan wrote:
> On Sun, Sep 11, 2016 at 8:47 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
[...]
> I don't know what the difference is between accessing 10 pages
> randomly, and accessing a random set of 10 single pages sequentially,
> in close succession. As Tom would say, that's above my pay grade. I
> suppose it comes down to how close "close" actually is (but in any
> case, it's all very fudged).

If you select ten pages at random and sort them, then consecutive reads 
of the sorted list are more likely to access pages in the same block or 
closely adjacent (is my understanding).

eg

blocks:  XXXX  XXXX  XXXX   XXXX   XXXX
pages:   0 1   2 3   4 5    6 7    8 9

if the ten 'random pages' were selected in the random order:
6 1 7 8 4 2 9 5 3 0
Consecutive reads would always read new blocks, but the sorted list 
would have blocks read sequentially.

In practice, it would rarely be this simple.  However, if any of the
random pages were in the same block, then that block would only need to
be fetched once - similarly if 2 of the random pages were in
consecutive blocks, then the two blocks would be logically adjacent 
(which means they are likely to be physically close together, but not 
guaranteed!).
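
Expressed as code, the principle is simply this (read_page() here is a
stand-in for whatever does the actual I/O):

#include <stdlib.h>

static int
cmp_long(const void *a, const void *b)
{
    long    la = *(const long *) a;
    long    lb = *(const long *) b;

    return (la > lb) - (la < lb);
}

/* Issue the reads in ascending page order, so that pages sharing a block
 * (or sitting in adjacent blocks) end up being fetched together. */
static void
read_pages_sorted(long *pages, int npages, void (*read_page) (long))
{
    int     i;

    qsort(pages, npages, sizeof(long), cmp_long);
    for (i = 0; i < npages; i++)
        read_page(pages[i]);
}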

[...]


Cheers,
Gavin



Re: Tuplesort merge pre-reading

From
Gavin Flower
Date:
On 12/09/16 12:16, Gavin Flower wrote:
[...]
>  two blocks would be logically adjacent (which means they are likely 
> to be physically close together, but not guaranteed!).
>
[...]

Actual disk layouts are quite complicated; the above is an
oversimplification, but the message is still valid.

There are various tricks of disc layout (and low-level handling) that can 
be used to minimise the time taken to read 2 blocks that are logically 
adjacent.  I had to know this stuff for the discs that mainframe computers 
used in the 1980s - modern disk systems are way more complicated, but 
the conclusions are still valid.

I am extremely glad that I no longer have to concern myself with 
understanding the precise low-level details of discs these days - there is 
no longer a one-to-one correspondence between what the O/S thinks is a 
disk block and how the data is physically recorded on the disc.


Cheers,
Gavin





Re: Tuplesort merge pre-reading

From
Claudio Freire
Date:
On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here's a new version of these patches, rebased over current master. I
> squashed the two patches into one, there's not much point to keep them
> separate.


I don't know what was up with the other ones, but this one works fine.

Benchmarking now.



Re: Tuplesort merge pre-reading

From
Claudio Freire
Date:
On Mon, Sep 12, 2016 at 12:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Here's a new version of these patches, rebased over current master. I
>> squashed the two patches into one, there's not much point to keep them
>> separate.
>
>
> I don't know what was up with the other ones, but this one works fine.
>
> Benchmarking now.


I spoke too soon, git AM had failed and I didn't notice.

regression.diffs attached

Built with

./configure --enable-debug --enable-cassert && make clean && make -j7
&& make check

Attachment

Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Mon, Sep 12, 2016 at 8:47 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I spoke too soon, git AM had failed and I didn't notice.

I wrote the regression test that causes Postgres to crash with the
patch applied. It tests, among other things, that CLUSTER tuples are
fixed-up by a routine like the current MOVETUP(), which is removed in
Heikki's patch. (There was a 9.6 bug where CLUSTER was broken due to
that.)

It shouldn't be too difficult for Heikki to fix this.
-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/12/2016 06:47 PM, Claudio Freire wrote:
> On Mon, Sep 12, 2016 at 12:02 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Here's a new version of these patches, rebased over current master. I
>>> squashed the two patches into one, there's not much point to keep them
>>> separate.
>>
>> I don't know what was up with the other ones, but this one works fine.
>>
>> Benchmarking now.
>
> I spoke too soon, git AM had failed and I didn't notice.
>
> regression.diffs attached
>
> Built with
>
> ./configure --enable-debug --enable-cassert && make clean && make -j7
> && make check

Ah, of course! I had been building without assertions, as I was doing
performance testing. With --enable-cassert, it failed for me as well
(and there was even a compiler warning pointing out one of the issues).
Sorry about that.

Here's a fixed version. I'll go through Peter's comments and address
those, but I don't think there was anything there that should affect
performance much, so I think you can proceed with your benchmarking with
this version. (You'll also need to turn off assertions for that!)

- Heikki


Attachment

Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Mon, Sep 12, 2016 at 12:07 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Here's a fixed version. I'll go through Peter's comments and address those,
> but I don't think there was anything there that should affect performance
> much, so I think you can proceed with your benchmarking with this version.
> (You'll also need to turn off assertions for that!)

I agree that it's unlikely that addressing any of my feedback will
result in any major change to performance.


-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
Addressed all your comments one way or another, new patch attached.
Comments on some specific points below:

On 09/12/2016 01:13 AM, Peter Geoghegan wrote:
> Other things I noticed:
>
> * You should probably point out that typically, access to batch memory
> will still be sequential, despite your block-based scheme. The
> preloading will now more or less make that the normal case. Any
> fragmentation will now be essentially in memory, not on disk, which is
> a big win.

That's not true; the "buffers" in batch memory are not accessed
sequentially. When we pull the next tuple from a tape, we store it in
the next free buffer. Usually, that buffer was used to hold the previous
tuple that was returned from gettuple(), and was just added to the free
list.

It's still quite cache-friendly, though, because we only need a small
number of slots (one for each tape).
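
Schematically, the slots just form a tiny intrusive free list; an
illustration (these are not the patch's actual types):

#define SLOT_SIZE 1024              /* roughly MERGETUPLEBUFFER_SIZE */

typedef union MergeSlot
{
    union MergeSlot *nextfree;      /* link while the slot is on the free list */
    char        space[SLOT_SIZE];   /* payload while the slot holds a tuple */
} MergeSlot;

static MergeSlot *freelist;

static MergeSlot *
slot_get(void)
{
    MergeSlot  *slot = freelist;    /* caller ensures the list is non-empty */

    freelist = slot->nextfree;
    return slot;
}

static void
slot_release(MergeSlot *slot)
{
    slot->nextfree = freelist;
    freelist = slot;
}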

> * I think you should move "bool   *mergeactive; /* active input run
> source? */" within Tuplesortstate to be next to the other batch memory
> stuff. No point in having separate merge and batch "sections" there
> anymore.

Hmm. I think I prefer to keep the memory management stuff in a separate
section. While it's true that it's currently only used during merging,
it's not hard to imagine using it when building the initial runs, for
example. Except for replacement selection, the pattern for building the
runs is: add a bunch of tuples, sort, flush them out. It would be
straightforward to use one large chunk of memory to hold all the tuples.
I'm not going to do that now, but I think keeping the memory management
stuff separate from merge-related state makes sense.

> * I think that you need to comment on why state->tuplecontext is not
> used for batch memory now. It is still useful, for multiple merge
> passes, but the situation has notably changed for it.

Hmm. We don't actually use it after the initial runs at all anymore. I
added a call to destroy it in mergeruns().

Now that we use the batch memory buffers for allocations < 1 kB (I
pulled that number out of a hat, BTW), and we only need one allocation
per tape (plus one), there's not much risk of fragmentation.
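
So the allocation decision during merging is roughly this (mergeslot_pop()
is just shorthand for popping a slot off a free list like the one sketched
above):

void   *tup;

if (tuplen <= MERGETUPLEBUFFER_SIZE)
    tup = mergeslot_pop(state);                            /* reuse a fixed slot */
else
    tup = MemoryContextAlloc(state->sortcontext, tuplen);  /* rare large tuple */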

> On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> * Doesn't this code need to call MemoryContextAllocHuge() rather than palloc()?:
>>
>>> @@ -709,18 +765,19 @@ LogicalTapeRewind(LogicalTapeSet *lts, int tapenum, bool forWrite)
>>>             Assert(lt->frozen);
>>>             datablocknum = ltsRewindFrozenIndirectBlock(lts, lt->indirect);
>>>         }
>>> +
>>> +       /* Allocate a read buffer */
>>> +       if (lt->buffer)
>>> +           pfree(lt->buffer);
>>> +       lt->buffer = palloc(lt->read_buffer_size);
>>> +       lt->buffer_size = lt->read_buffer_size;
>
> Of course, when you do that you're going to have to make the new
> "buffer_size" and "read_buffer_size" fields of type "Size" (or,
> possibly, "int64", to match tuplesort.c's own buffer sizing variables
> ever since Noah added MaxAllocSizeHuge). Ditto for the existing "pos"
> and "nbytes" fields next to "buffer_size" within the struct
> LogicalTape, I think. ISTM that logtape.c blocknums can remain of type
> "long", though, since that reflects an existing hardly-relevant
> limitation that you're not making any worse.

True. I fixed that by putting a MaxAllocSize cap on the buffer size
instead. I doubt that doing > 1 GB of read-ahead of a single tape will
do any good.
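
Concretely, the cap is just a clamp where the buffer sizes are handed out
(give or take the exact spot in the patch):

Size    bufsize = (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ;

if (bufsize > MaxAllocSize)
    bufsize = MaxAllocSize;         /* never ask for a > 1 GB read buffer */
LogicalTapeAssignReadBufferSize(state->tapeset, tapenum, bufsize);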

I wonder if we should actually have a smaller cap there. Even 1 GB seems
excessive. Might be better to start the merging sooner, rather than wait
for the read of 1 GB to complete. The point of the OS readahead is that
the OS will do that for us, in the background. And other processes might
have better use for the memory anyway.

> * It couldn't hurt to make this code paranoid about LACKMEM() becoming
> true, which will cause havoc (we saw this recently in 9.6; a patch of
> mine to fix that just went in):
>
>> +   /*
>> +    * Use all the spare memory we have available for read buffers. Divide it
>> +    * memory evenly among all the tapes.
>> +    */
>> +   avail_blocks = state->availMem / BLCKSZ;
>> +   per_tape = avail_blocks / maxTapes;
>> +   cutoff = avail_blocks % maxTapes;
>> +   if (per_tape == 0)
>> +   {
>> +       per_tape = 1;
>> +       cutoff = 0;
>> +   }
>> +   for (tapenum = 0; tapenum < maxTapes; tapenum++)
>> +   {
>> +       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>> +                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
>> +   }
>
> In other words, we really don't want availMem to become < 0, since
> it's int64, but a derived value is passed to
> LogicalTapeAssignReadBufferSize() as an argument of type "Size". Now,
> if LACKMEM() did happen it would be a bug anyway, but I recommend
> defensive code also be added. Per grow_memtuples(), "We need to be
> sure that we do not cause LACKMEM to become true, else the space
> management algorithm will go nuts". Let's be sure that we get that
> right, since, as we saw recently, especially since grow_memtuples()
> will not actually have the chance to save us here (if there is a bug
> along these lines, let's at least make the right "can't happen error"
> complaint to user when it pops up).

Hmm. We don't really need the availMem accounting at all, after we have
started merging. There is nothing we can do to free memory if we run
out, and we use fairly little memory anyway. But yes, better safe than
sorry. I tried to clarify the comments on that.

> * It looks like your patch makes us less eager about freeing per-tape
> batch memory, now held as preload buffer space within logtape.c.
>
> ISTM that there should be some way to have the "tape exhausted" code
> path within tuplesort_gettuple_common() (as well as the similar spot
> within mergeonerun()) instruct logtape.c that we're done with that
> tape. In other words, when mergeprereadone() (now renamed to
> mergereadnext()) detects the tape is exhausted, it should have
> logtape.c free its huge tape buffer immediately. Think about cases
> where input is presorted, and earlier tapes can be freed quite early
> on. It's worth keeping that around, (you removed the old way that this
> happened, through mergebatchfreetape()).

OK. I solved that by calling LogicalTapeRewind(), when we're done
reading a tape. Rewinding a tape has the side-effect of freeing the
buffer. I was going to put that into mergereadnext(), but it turns out
that it's tricky to figure out if there are any more runs on the same
tape, because we have the "actual" tape number there, but the tp_runs is
indexed by "logical" tape number. So I put the rewind calls at the end
of mergeruns(), and in TSS_FINALMERGE processing, instead. It means that
we don't free the buffers quite as early as we could, but I think this
is good enough.

> On Sun, Sep 11, 2016 at 3:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
>>> +   for (tapenum = 0; tapenum < maxTapes; tapenum++)
>>> +   {
>>> +       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>>> +                                       (per_tape + (tapenum < cutoff ? 1 : 0)) * BLCKSZ);
>>> +   }
>
> Spotted another issue with this code just now. Shouldn't it be based
> on state->tapeRange? You don't want the destTape to get memory, since
> you don't use batch memory for tapes that are written to (and final
> on-the-fly merges don't use their destTape at all).

logtape.c will only actually allocate the memory when reading.

> (Looks again...)
>
> Wait, you're using a local variable maxTapes here, which potentially
> differs from state->maxTapes:
>
>> +   /*
>> +    * If we had fewer runs than tapes, refund buffers for tapes that were never
>> +    * allocated.
>> +    */
>> +   maxTapes = state->maxTapes;
>> +   if (state->currentRun < maxTapes)
>> +   {
>> +       FREEMEM(state, (maxTapes - state->currentRun) * TAPE_BUFFER_OVERHEAD);
>> +       maxTapes = state->currentRun;
>> +   }
>
> I find this confusing, and think it's probably buggy. I don't think
> you should have a local variable called maxTapes that you modify at
> all, since state->maxTapes is supposed to not change once established.

I changed that so that it does actually change state->maxTapes. I
considered having a separate numTapes field, that can be smaller than
maxTapes, but we don't need the original maxTapes value after that point
anymore, so it would've been just pro forma to track them separately. I
hope the comment now explains that better.

> You can't use state->currentRun like that, either, I think, because
> it's the high watermark number of runs (quicksorted runs), not runs
> after any particular merge phase, where we end up with fewer runs as
> they're merged (we must also consider dummy runs to get this) -- we
> want something like activeTapes. cf. the code you removed for the
> beginmerge() finalMergeBatch case. Of course, activeTapes will vary if
> there are multiple merge passes, which suggests all this code really
> has no business being in mergeruns() (it should instead be in
> beginmerge(), or code that beginmerge() reliably calls).

Hmm, yes, using currentRun here is wrong. It needs to be "currentRun +
1", because we need one more tape than there are runs, to hold the output.

Note that I'm not re-allocating the read buffers depending on which
tapes are used in the current merge pass. I'm just dividing up the
memory among all tapes. Now that the pre-reading is done in logtape.c,
when we reach the end of a run on a tape, we will already have data from
the next run on the same tape in the read buffer.
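
In rough outline, the division looks like this (simplified from the patch;
the real code also caps each tape's buffer at MaxAllocSize and accounts for
the memory with USEMEM()):

    int64       blocksPerTape = availBlocks / maxTapes;
    int         remainder = availBlocks % maxTapes;
    int         tapenum;

    for (tapenum = 0; tapenum < maxTapes; tapenum++)
    {
        /* spread any leftover blocks over the first 'remainder' tapes */
        int64       numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);

        LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
                                        numBlocks * BLCKSZ);
    }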

> Immediately afterwards, you do this:
>
>> +   /* Initialize the merge tuple buffer arena.  */
>> +   state->batchMemoryBegin = palloc((maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
>> +   state->batchMemoryEnd = state->batchMemoryBegin + (maxTapes + 1) * MERGETUPLEBUFFER_SIZE;
>> +   state->freeBufferHead = (MergeTupleBuffer *) state->batchMemoryBegin;
>> +   USEMEM(state, (maxTapes + 1) * MERGETUPLEBUFFER_SIZE);
>
> The fact that you size the buffer based on "maxTapes + 1" also
> suggests a problem. I think that the code looks like this because it
> must instruct logtape.c that the destTape tape requires some buffer
> (iff there is to be a non-final merge). Is that right? I hope that you
> don't give the typically unused destTape tape a full share of batch
> memory all the time (the same applies to any other
> inactive-at-final-merge tapes).

Ah, no, the "+ 1" comes from the need to hold the tuple that we last
returned to the caller in tuplesort_gettuple, until the next call. See
lastReturnedTuple. I tried to clarify the comments on that.
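
In other words, in tuplesort_gettuple_common() the buffer that held the
previously-returned tuple gets recycled roughly like this (a simplified
sketch; the name of the free-list link field is approximate):

    if (state->lastReturnedTuple != NULL)
    {
        MergeTupleBuffer *buf = (MergeTupleBuffer *) state->lastReturnedTuple;

        /* put the buffer back on the free list before fetching the next tuple */
        buf->nextfree = state->freeBufferHead;
        state->freeBufferHead = buf;
        state->lastReturnedTuple = NULL;
    }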

Thanks for the thorough review! Let me know how this looks now.

- Heikki

Attachment

Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Addressed all your comments one way or another, new patch attached. Comments
> on some specific points below:

Cool. My response here is written under time pressure, which is not
ideal. I think it's still useful, though.

>> * You should probably point out that typically, access to batch memory
>> will still be sequential, despite your block-based scheme.

> That's not true, the "buffers" in batch memory are not accessed
> sequentially. When we pull the next tuple from a tape, we store it in the
> next free buffer. Usually, that buffer was used to hold the previous tuple
> that was returned from gettuple(), and was just added to the free list.
>
> It's still quite cache-friendly, though, because we only need a small number
> of slots (one for each tape).

That's kind of what I meant, I think -- it's more or less sequential.
Especially in the common case where there is only one merge pass.

> True. I fixed that by putting a MaxAllocSize cap on the buffer size instead.
> I doubt that doing > 1 GB of read-ahead of a single tape will do any good.

You may well be right about that, but ideally that could be verified.
I think that the tuplesort is entitled to have whatever memory the
user makes available, unless that's almost certainly useless. It
doesn't seem like our job to judge that it's always wrong to use extra
memory with only a small expected benefit. If it's actually a
microscopic expected benefit, or just as often negative to the sort
operation's performance, then I'd say it's okay to cap it at
MaxAllocSize. But it's not obvious to me that this is true; not yet,
anyway.

> Hmm. We don't really need the availMem accounting at all, after we have
> started merging. There is nothing we can do to free memory if we run out,
> and we use fairly little memory anyway. But yes, better safe than sorry. I
> tried to clarify the comments on that.

It is true that we don't really care about the accounting at all. But,
the same applies to the existing grow_memtuples() case at the
beginning of merging. The point is, we *do* care about availMem, this
one last time. We must at least produce a sane (i.e. >= 0) number in
any calculation. (I think you understand this already -- just saying.)

> OK. I solved that by calling LogicalTapeRewind(), when we're done reading a
> tape. Rewinding a tape has the side-effect of freeing the buffer. I was
> going to put that into mergereadnext(), but it turns out that it's tricky to
> figure out if there are any more runs on the same tape, because we have the
> "actual" tape number there, but the tp_runs is indexed by "logical" tape
> number. So I put the rewind calls at the end of mergeruns(), and in
> TSS_FINALMERGE processing, instead. It means that we don't free the buffers
> quite as early as we could, but I think this is good enough.

That seems adequate.

>> Spotted another issue with this code just now. Shouldn't it be based
>> on state->tapeRange? You don't want the destTape to get memory, since
>> you don't use batch memory for tapes that are written to (and final
>> on-the-fly merges don't use their destTape at all).

>> Wait, you're using a local variable maxTapes here, which potentially
>> differs from state->maxTapes:

> I changed that so that it does actually change state->maxTapes. I considered
> having a separate numTapes field, that can be smaller than maxTapes, but we
> don't need the original maxTapes value after that point anymore, so it
> would've been just pro forma to track them separately. I hope the comment
> now explains that better.

I still don't get why you're doing all of this within mergeruns() (the
beginning of when we start merging -- we merge all quicksorted runs),
rather than within beginmerge() (the beginning of one particular merge
pass, of which there are potentially more than one). As runs are
merged in a non-final merge pass, fewer tapes will remain active for
the next merge pass. It doesn't do to do all that up-front when we
have multiple merge passes, which will happen from time to time.

Correct me if I'm wrong, but I think that you're more skeptical of the
need for polyphase merge than I am. I at least see no reason to not
keep it around. I also think it has some value. It doesn't make this
optimization any harder, really.

> Hmm, yes, using currentRun here is wrong. It needs to be "currentRun + 1",
> because we need one more tape than there are runs, to hold the output.

As I said, I think it should be the number of active tapes, as you see
today within beginmerge() + mergebatch(). Why not do it that way? If
you don't get the distinction, see my remarks below on final merges
always using batch memory, even when there are to be multiple merge
passes (no reason to add that restriction here). More generally, I
really don't want to mess with the long standing definition of
maxTapes and things like that, because I see no advantage.

> Ah, no, the "+ 1" comes from the need to hold the tuple that we last
> returned to the caller in tuplesort_gettuple, until the next call. See
> lastReturnedTuple. I tried to clarify the comments on that.

I see. I don't think that you need to do any of that. Just have the
sizing be based on an even share among activeTapes on final merges
(not necessarily final on-the-fly merges, per my 0002-* patch --
you've done that here, it looks like). However, it looks like you're
doing the wrong thing by only having the check at the top of
beginmerge() -- "if (state->Level == 1)". You're not going to use
batch memory for the final merge just because there happened to be
multiple merges, that way. Sure, you shouldn't be using batch memory
for non-final merges (due to their need for an uneven amount of memory
per active tape), which you're not doing, but the mere fact that you
had a non-final merge should not affect the final merge at all. Which
is the kind of thing I'm concerned about when I talk about
beginmerge() being the right place for all that new code (not
mergeruns()). Even if you can make it work, it fits a lot better to
put it in beginmerge() -- that's the existing flow of things, which
should be preserved. I don't think you need to examine "state->Level"
or "state->currentRun" at all.

Did you do it this way to make the "replacement selection best case"
stuff within mergeruns() catch on, so it does the right thing during
its TSS_SORTEDONTAPE processing?

Thanks
-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/15/2016 10:12 PM, Peter Geoghegan wrote:
> On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Spotted another issue with this code just now. Shouldn't it be based
>>> on state->tapeRange? You don't want the destTape to get memory, since
>>> you don't use batch memory for tapes that are written to (and final
>>> on-the-fly merges don't use their destTape at all).
>
>>> Wait, you're using a local variable maxTapes here, which potentially
>>> differs from state->maxTapes:
>
>> I changed that so that it does actually change state->maxTapes. I considered
>> having a separate numTapes field, that can be smaller than maxTapes, but we
>> don't need the original maxTapes value after that point anymore, so it
>> would've been just pro forma to track them separately. I hope the comment
>> now explains that better.
>
> I still don't get why you're doing all of this within mergeruns() (the
> beginning of when we start merging -- we merge all quicksorted runs),
> rather than within beginmerge() (the beginning of one particular merge
> pass, of which there are potentially more than one). As runs are
> merged in a non-final merge pass, fewer tapes will remain active for
> the next merge pass. It doesn't do to do all that up-front when we
> have multiple merge passes, which will happen from time to time.

Now that the pre-reading is done in logtape.c, it doesn't stop at a run 
boundary. For example, when we read the last 1 MB of the first run on a 
tape, and we're using a 10 MB read buffer, we will merrily also read the 
first 9 MB from the next run. You cannot un-read that data, even if the 
tape is inactive in the next merge pass.

I don't think it makes much difference in practice, because most merge 
passes use all, or almost all, of the available tapes. BTW, I think the 
polyphase algorithm prefers to do all the merges that don't use all 
tapes upfront, so that the last final merge always uses all the tapes. 
I'm not 100% sure about that, but that's my understanding of the 
algorithm, and that's what I've seen in my testing.

> Correct me if I'm wrong, but I think that you're more skeptical of the
> need for polyphase merge than I am. I at least see no reason to not
> keep it around. I also think it has some value. It doesn't make this
> optimization any harder, really.

We certainly still need multi-pass merges.

BTW, does a 1-way merge make any sense? I was surprised to see this in 
the log, even without this patch:

LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;
LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER 
BY i) t;

- Heikki



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Thu, Sep 15, 2016 at 1:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I still don't get why you're doing all of this within mergeruns() (the
>> beginning of when we start merging -- we merge all quicksorted runs),
>> rather than within beginmerge() (the beginning of one particular merge
>> pass, of which there are potentially more than one). As runs are
>> merged in a non-final merge pass, fewer tapes will remain active for
>> the next merge pass. It doesn't do to do all that up-front when we
>> have multiple merge passes, which will happen from time to time.
>
>
> Now that the pre-reading is done in logtape.c, it doesn't stop at a run
> boundary. For example, when we read the last 1 MB of the first run on a
> tape, and we're using a 10 MB read buffer, we will merrily also read the
> first 9 MB from the next run. You cannot un-read that data, even if the tape
> is inactive in the next merge pass.

I'm not sure that I like that approach. At the very least, it seems to
not be a good fit with the existing structure of things. I need to
think about it some more, and study how that plays out in practice.

> BTW, does a 1-way merge make any sense?

Not really, no, but it's something that I've seen plenty of times.

This is seen when runs are distributed such that mergeonerun() only
finds one real run across all active tapes, with all other active tapes
having only dummy runs. In general, dummy runs are "logically merged"
(decremented) in preference to any real runs on the same tape (that's
the reason why they exist), so you end up with what we call a "1-way
merge" when only one active tape has a real run. You never
see something like "0-way merge" within trace_sort output, though,
because that case is optimized to be almost a no-op.

It's almost a no-op because when it happens then mergeruns() knows to
itself directly decrement the number of dummy runs once for each
active tape, making that "logical merge" completed with only that
simple change in metadata (that is, the "merge" completes by just
decrementing dummy run counts -- no actual call to mergeonerun()).
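
To make that concrete, the relevant decision in mergeruns() looks roughly
like this (a simplified sketch of the step D5 logic; field names as in
tuplesort.c):

    bool        allDummy = true;
    int         tapenum;

    for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
    {
        if (state->tp_dummy[tapenum] == 0)
        {
            allDummy = false;
            break;
        }
    }

    if (allDummy)
    {
        /* "0-way merge": just adjust dummy run counts, no tuples are moved */
        state->tp_dummy[state->tapeRange]++;
        for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
            state->tp_dummy[tapenum]--;
    }
    else
        mergeonerun(state);     /* may turn out to be a "1-way merge" */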

We could optimize away "1-way merge" cases, perhaps, so that tuples
don't have to be spilled out one at a time (there could perhaps instead
be just some localized change to metadata, a bit like the all-dummy
case). That doesn't seem worth bothering with, especially with this
new approach of yours. I prefer to avoid special cases like that.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Thu, Sep 15, 2016 at 1:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> BTW, does a 1-way merge make any sense? I was surprised to see this in the
> log, even without this patch:
>
> LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;
> LOG:  finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
> STATEMENT:  SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY
> i) t;

Another thing that I think is worth pointing out here is that the
number of merge passes shown is excessive, practically speaking. I
suggested that we have something like checkpoint_warning for this last
year, which Greg Stark eventually got behind, but Robert Haas didn't
seem to like. Maybe this idea should be revisited. What do you think?

There is no neat explanation for why it's considered excessive to
checkpoint every 10 seconds, but not every 2 minutes. But, we warn
about the former case by default, and not the latter. It's hard to
know exactly where to draw the line, but that isn't a great reason to
not do it (maybe one extra merge pass is a good threshold -- that's
what I suggested once). I think that other database systems similarly
surface multiple merge passes. It's just inefficient to ever do
multiple merge passes, even if you're very frugal with memory.
Certainly, it's almost impossible to defend doing 3+ passes these
days.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Addressed all your comments one way or another, new patch attached.

So, it's clear that this isn't ready today. As I mentioned, I'm going
away for a week now. I ask that you hold off on committing this until
I return, and have a better opportunity to review the performance
characteristics of this latest revision, for one thing.

Thanks
-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/17/2016 07:27 PM, Peter Geoghegan wrote:
> On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Addressed all your comments one way or another, new patch attached.
>
> So, it's clear that this isn't ready today. As I mentioned, I'm going
> away for a week now. I ask that you hold off on committing this until
> I return, and have a better opportunity to review the performance
> characteristics of this latest revision, for one thing.

Ok. I'll probably read through it myself once more some time next week, 
and also have a first look at your actual parallel sorting patch. Have a 
good trip!

- Heikki




Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Sat, Sep 17, 2016 at 9:41 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Ok. I'll probably read through it myself once more some time next week, and
> also have a first look at your actual parallel sorting patch. Have a good
> trip!

Thanks! It will be good to get away for a while.

I'd be delighted to recruit you to work on the parallel CREATE INDEX
patch. I've already explained how I think this preread patch of yours
works well with parallel tuplesort (as proposed) in particular. To
reiterate: while what you've come up with here is technically an
independent improvement to merging, it's much more valuable in the
overall context of parallel sort, where disproportionate wall clock
time is spent merging, and where multiple passes are the norm (one
pass in each worker, plus one big final pass in the leader process
alone -- logtape.c fragmentation becomes a real issue). The situation
is actually similar to the original batch memory preloading patch that
went into 9.6 (which your new patch supersedes), and the subsequently
committed quicksort for external sort patch (which my new patch
extends to work in parallel).

Because I think of your preload patch as a part of the overall
parallel CREATE INDEX effort, if that was the limit of your
involvement then I'd still think it fair to credit you as my
co-author. I hope it isn't the limit of your involvement, though,
because it seems likely that the final result will be better still if
you get involved with the big patch that formally introduces parallel
CREATE INDEX.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Claudio Freire
Date:
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>> the other thread, with these patches, that'd be nice. These patches are
>> effectively a replacement for
>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>> would be much appreciated too, of course.
>>
>> Attached are new versions. Compared to last set, they contain a few comment
>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>> that were completely unused.
>
>
> Will do so

Well, here they are, the results.

ODS format only (unless you've got issues opening the ODS).

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst being "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.

Attachment

Re: Tuplesort merge pre-reading

From
Claudio Freire
Date:
On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>>
>>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>>> the other thread, with these patches, that'd be nice. These patches are
>>> effectively a replacement for
>>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>>> would be much appreciated too, of course.
>>>
>>> Attached are new versions. Compared to last set, they contain a few comment
>>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>>> that were completely unused.
>>
>>
>> Will do so
>
> Well, here they are, the results.
>
> ODS format only (unless you've got issues opening the ODS).
>
> The results seem all over the map. Some regressions seem significant
> (both in the amount of performance lost and their significance, since
> all 4 runs show a similar regression). The worst being "CREATE INDEX
> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
> work_mem, which should be an in-memory sort, which makes it odd.
>
> I will re-run it overnight just in case to confirm the outcome.

A new run for "patched" gives better results, it seems it was some
kind of glitch in the run (maybe some cron decided to do something
while running those queries).

Attached

In essence, it doesn't look like it's harmfully affecting CPU
efficiency. Results seem neutral on the CPU front.

Attachment

Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/22/2016 03:40 AM, Claudio Freire wrote:
> On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> The results seem all over the map. Some regressions seem significant
>> (both in the amount of performance lost and their significance, since
>> all 4 runs show a similar regression). The worst being "CREATE INDEX
>> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
>> work_mem, which should be an in-memory sort, which makes it odd.
>>
>> I will re-run it overnight just in case to confirm the outcome.
>
> A new run for "patched" gives better results, it seems it was some
> kind of glitch in the run (maybe some cron decided to do something
> while running those queries).
>
> Attached
>
> In essence, it doesn't look like it's harmfully affecting CPU
> efficiency. Results seem neutral on the CPU front.

Looking at the spreadsheet, there is a 40% slowdown in the "slow" 
"CREATE INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" 
test with 4GB of work_mem. I can't reproduce that on my laptop, though. 
Got any clue what's going on there?

- Heikki




Re: Tuplesort merge pre-reading

From
Claudio Freire
Date:
On Thu, Sep 22, 2016 at 4:17 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 09/22/2016 03:40 AM, Claudio Freire wrote:
>>
>> On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire <klaussfreire@gmail.com>
>> wrote:
>>>
>>> The results seem all over the map. Some regressions seem significant
>>> (both in the amount of performance lost and their significance, since
>>> all 4 runs show a similar regression). The worst being "CREATE INDEX
>>> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
>>> work_mem, which should be an in-memory sort, which makes it odd.
>>>
>>> I will re-run it overnight just in case to confirm the outcome.
>>
>>
>> A new run for "patched" gives better results, it seems it was some
>> kind of glitch in the run (maybe some cron decided to do something
>> while running those queries).
>>
>> Attached
>>
>> In essence, it doesn't look like it's harmfully affecting CPU
>> efficiency. Results seem neutral on the CPU front.
>
>
> Looking at the spreadsheet, there is a 40% slowdown in the "slow" "CREATE
> INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" test with 4GB
> of work_mem. I can't reproduce that on my laptop, though. Got any clue
> what's going on there?

It's not present in other runs, so I think it's a fluke the
spreadsheet isn't filtering out. Especially considering that one
should be a fully in-memory fast sort and thus unaffected by the
current patch (z and z2 being integers, IIRC, most comparisons should
be about comparing the first columns and thus rarely involve the big
strings).

I'll try to confirm that's the case though.



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Thu, Sep 15, 2016 at 9:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I still don't get why you're doing all of this within mergeruns() (the
>> beginning of when we start merging -- we merge all quicksorted runs),
>> rather than within beginmerge() (the beginning of one particular merge
>> pass, of which there are potentially more than one). As runs are
>> merged in a non-final merge pass, fewer tapes will remain active for
>> the next merge pass. It doesn't do to do all that up-front when we
>> have multiple merge passes, which will happen from time to time.
>
>
> Now that the pre-reading is done in logtape.c, it doesn't stop at a run
> boundary. For example, when we read the last 1 MB of the first run on a
> tape, and we're using a 10 MB read buffer, we will merrily also read the
> first 9 MB from the next run. You cannot un-read that data, even if the tape
> is inactive in the next merge pass.

I've had a chance to think about this some more. I did a flying review
of the same revision that I talk about here, but realized some more
things. First, I will do a recap.

> I don't think it makes much difference in practice, because most merge
> passes use all, or almost all, of the available tapes. BTW, I think the
> polyphase algorithm prefers to do all the merges that don't use all tapes
> upfront, so that the last final merge always uses all the tapes. I'm not
> 100% sure about that, but that's my understanding of the algorithm, and
> that's what I've seen in my testing.

Not sure that I understand. I agree that each merge pass tends to use
roughly the same number of tapes, but the distribution of real runs on
tapes is quite unbalanced in earlier merge passes (due to dummy runs).
It looks like you're always using batch memory, even for non-final
merges. Won't that fail to be in balance much of the time because of
the lopsided distribution of runs? Tapes have an uneven amount of real
data in earlier merge passes.

FWIW, polyphase merge actually doesn't distribute runs based on the
Fibonacci sequence (at least in Postgres). It uses a generalization of
the Fibonacci numbers [1] -- *which* generalization varies according
to merge order/maxTapes. IIRC, with Knuth's P == 6 (i.e. merge order
== 6), it's the "hexanacci" sequence.

The following code is from your latest version, posted on 2016-09-14,
within the high-level mergeruns() function (the function that takes
quicksorted runs, and produces final output following 1 or more merge
passes):

> +   usedBlocks = 0;
> +   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
> +   {
> +       int64       numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
> +
> +       if (numBlocks > MaxAllocSize / BLCKSZ)
> +           numBlocks = MaxAllocSize / BLCKSZ;
> +       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
> +                                       numBlocks * BLCKSZ);
> +       usedBlocks += numBlocks;
> +   }
> +   USEMEM(state, usedBlocks * BLCKSZ);

I'm basically repeating myself here, but: I think it's incorrect that
LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
generally, it is questionable that it is called in such a high level
routine, rather than the start of a specific merge pass -- I said so a
couple of times already).

It's weird that you change the meaning of maxTapes by reassigning in
advance of iterating through maxTapes like this. I should point out
again that the master branch mergebatch() function does roughly the
same thing as you're doing here as follows:

static void
mergebatch(Tuplesortstate *state, int64 spacePerTape)
{
    int         srcTape;

    Assert(state->activeTapes > 0);
    Assert(state->tuples);

    /*
     * For the purposes of tuplesort's memory accounting, the batch allocation
     * is special, and regular memory accounting through USEMEM() calls is
     * abandoned (see mergeprereadone()).
     */
    for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
    {
        char       *mergetuples;

        if (!state->mergeactive[srcTape])
            continue;

        /* Allocate buffer for each active tape */
        mergetuples = MemoryContextAllocHuge(state->tuplecontext,
                                             spacePerTape);

        /* Initialize state for tape */
        state->mergetuples[srcTape] = mergetuples;
        state->mergecurrent[srcTape] = mergetuples;
        state->mergetail[srcTape] = mergetuples;
        state->mergeoverflow[srcTape] = NULL;
    }

    state->batchUsed = true;
    state->spacePerTape = spacePerTape;
}

Notably, this function:

* Works without altering the meaning of maxTapes. state->maxTapes is
Knuth's T, which is established very early and doesn't change with
polyphase merge (same applies to state->tapeRange). What does change
across merge passes is the number of *active* tapes. I don't think
it's necessary to change that in any way. I find it very confusing.
(Also, that you're using state->currentRun here at all seems bad, for
more or less the same reason -- that's the number of quicksorted
runs.)

* Does allocation for the final merge (that's the only point that it's
called), and so is not based on the number of active tapes that happen
to be in play when merging begins at a high level (i.e., when
mergeruns() is first called). Many tapes will be totally inactive by
the final merge, so this seems completely necessary for multiple merge
pass cases.

End of recap. Here is some new information:

I was previously confused on this last point, because I thought that
logtape.c might be able to do something smart to recycle memory that
is bound per-tape by calls to LogicalTapeAssignReadBufferSize() in
your patch. But it doesn't: all the recycling stuff only happens for
the much smaller buffers that are juggled and reused to pass tuples
back to caller within tuplesort_gettuple_common(), etc -- not the
logtape.c managed buffers. So, AFAICT there is no justification that I
can see for not adopting these notable properties of mergebatch() for
some analogous point in this patch. Actually, you should probably not
get rid of mergebatch(), but instead call
LogicalTapeAssignReadBufferSize() there. This change would mean that
the big per-tape memory allocations would happen in the same place as
before -- you'd just be asking logtape.c to do it for you, instead of
allocating directly.

Perhaps the practical consequences of not doing something closer to
mergebatch() are debatable, but I suspect that there is little point
in actually debating it. I think you might as well do it that way. No?

[1] http://mathworld.wolfram.com/Fibonaccin-StepNumber.html
-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/28/2016 06:05 PM, Peter Geoghegan wrote:
> On Thu, Sep 15, 2016 at 9:51 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I don't think it makes much difference in practice, because most merge
>> passes use all, or almost all, of the available tapes. BTW, I think the
>> polyphase algorithm prefers to do all the merges that don't use all tapes
>> upfront, so that the last final merge always uses all the tapes. I'm not
>> 100% sure about that, but that's my understanding of the algorithm, and
>> that's what I've seen in my testing.
>
> Not sure that I understand. I agree that each merge pass tends to use
> roughly the same number of tapes, but the distribution of real runs on
> tapes is quite unbalanced in earlier merge passes (due to dummy runs).
> It looks like you're always using batch memory, even for non-final
> merges. Won't that fail to be in balance much of the time because of
> the lopsided distribution of runs? Tapes have an uneven amount of real
> data in earlier merge passes.

How does the distribution of the runs on the tapes matter?

>> +   usedBlocks = 0;
>> +   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
>> +   {
>> +       int64       numBlocks = blocksPerTape + (tapenum < remainder ? 1 : 0);
>> +
>> +       if (numBlocks > MaxAllocSize / BLCKSZ)
>> +           numBlocks = MaxAllocSize / BLCKSZ;
>> +       LogicalTapeAssignReadBufferSize(state->tapeset, tapenum,
>> +                                       numBlocks * BLCKSZ);
>> +       usedBlocks += numBlocks;
>> +   }
>> +   USEMEM(state, usedBlocks * BLCKSZ);
>
> I'm basically repeating myself here, but: I think it's incorrect that
> LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
> generally, it is questionable that it is called in such a high level
> routine, rather than the start of a specific merge pass -- I said so a
> couple of times already).

You can't release the tape buffer at the end of a pass, because the 
buffer of a tape will already be filled with data from the next run on 
the same tape.

- Heikki




Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Wed, Sep 28, 2016 at 5:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Not sure that I understand. I agree that each merge pass tends to use
>> roughly the same number of tapes, but the distribution of real runs on
>> tapes is quite unbalanced in earlier merge passes (due to dummy runs).
>> It looks like you're always using batch memory, even for non-final
>> merges. Won't that fail to be in balance much of the time because of
>> the lopsided distribution of runs? Tapes have an uneven amount of real
>> data in earlier merge passes.
>
>
> How does the distribution of the runs on the tapes matter?

The exact details are not really relevant to this discussion (I think
it's confusing that we simply say "Target Fibonacci run counts",
FWIW), but the simple fact that it can be quite uneven is.

This is why I never pursued batch memory for non-final merges. Isn't
that what you're doing here? You're pretty much always setting
"state->batchUsed = true".

>> I'm basically repeating myself here, but: I think it's incorrect that
>> LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
>> generally, it is questionable that it is called in such a high level
>> routine, rather than the start of a specific merge pass -- I said so a
>> couple of times already).
>
>
> You can't release the tape buffer at the end of a pass, because the buffer
> of a tape will already be filled with data from the next run on the same
> tape.

Okay, but can't you just not use batch memory for non-final merges,
per my initial approach? That seems far cleaner.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Wed, Sep 28, 2016 at 5:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
> This is why I never pursued batch memory for non-final merges. Isn't
> that what you're doing here? You're pretty much always setting
> "state->batchUsed = true".

Wait. I guess you feel you have to, since it wouldn't be okay to use
almost no memory per tape on non-final merges.

You're able to throw out so much code here in large part because you
give almost all memory over to logtape.c (e.g., you don't manage each
tape's share of "slots" anymore -- better to give everything to
logtape.c). So, with your patch, you would actually only have one
caller tuple in memory at once per tape for a merge that doesn't use
batch memory (if you made it so that a merge *could* avoid the use of
batch memory, as I suggest).

In summary, under your scheme, the "batchUsed" variable contains a
tautological value, since you cannot sensibly not use batch memory.
(This is even true with !state->tuples callers).

Do I have that right? If so, this seems rather awkward. Hmm.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/28/2016 07:11 PM, Peter Geoghegan wrote:
> On Wed, Sep 28, 2016 at 5:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Not sure that I understand. I agree that each merge pass tends to use
>>> roughly the same number of tapes, but the distribution of real runs on
>>> tapes is quite unbalanced in earlier merge passes (due to dummy runs).
>>> It looks like you're always using batch memory, even for non-final
>>> merges. Won't that fail to be in balance much of the time because of
>>> the lopsided distribution of runs? Tapes have an uneven amount of real
>>> data in earlier merge passes.
>>
>>
>> How does the distribution of the runs on the tapes matter?
>
> The exact details are not really relevant to this discussion (I think
> it's confusing that we simply say "Target Fibonacci run counts",
> FWIW), but the simple fact that it can be quite uneven is.

Well, I claim that the fact that the distribution of runs is uneven 
does not matter. Can you explain why you think it does?

> This is why I never pursued batch memory for non-final merges. Isn't
> that what you're doing here? You're pretty much always setting
> "state->batchUsed = true".

Yep. As the patch stands, we wouldn't really need batchUsed, as we know 
that it's always true when merging, and false otherwise. But I kept it, 
as it seems like that might not always be true - we might use batch 
memory when building the initial runs, for example - and because it 
seems nice to have an explicit flag for it, for readability and 
debugging purposes.

>>> I'm basically repeating myself here, but: I think it's incorrect that
>>> LogicalTapeAssignReadBufferSize() is called so indiscriminately (more
>>> generally, it is questionable that it is called in such a high level
>>> routine, rather than the start of a specific merge pass -- I said so a
>>> couple of times already).
>>
>>
>> You can't release the tape buffer at the end of a pass, because the buffer
>> of a tape will already be filled with data from the next run on the same
>> tape.
>
> Okay, but can't you just not use batch memory for non-final merges,
> per my initial approach? That seems far cleaner.

Why? I don't see why the final merge should behave differently from the 
non-final ones.

- Heikki




Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/28/2016 07:20 PM, Peter Geoghegan wrote:
> On Wed, Sep 28, 2016 at 5:11 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> This is why I never pursued batch memory for non-final merges. Isn't
>> that what you're doing here? You're pretty much always setting
>> "state->batchUsed = true".
>
> Wait. I guess you feel you have to, since it wouldn't be okay to use
> almost no memory per tape on non-final merges.
>
> You're able to throw out so much code here in large part because you
> give almost all memory over to logtape.c (e.g., you don't manage each
> tape's share of "slots" anymore -- better to give everything to
> logtape.c). So, with your patch, you would actually only have one
> caller tuple in memory at once per tape for a merge that doesn't use
> batch memory (if you made it so that a merge *could* avoid the use of
> batch memory, as I suggest).

Correct.

> In summary, under your scheme, the "batchUsed" variable contains a
> tautological value, since you cannot sensibly not use batch memory.
> (This is even true with !state->tuples callers).

I suppose.

> Do I have that right? If so, this seems rather awkward. Hmm.

How is it awkward?

- Heikki




Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Thu, Sep 29, 2016 at 10:49 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Do I have that right? If so, this seems rather awkward. Hmm.
>
>
> How is it awkward?

Maybe that was the wrong choice of words. What I mean is that it seems
somewhat unprincipled to give over an equal share of memory to each
active-at-least-once tape, regardless of how much that matters in
practice. One tape could have several runs in memory at once, while
another only has a fraction of a single much larger run. Maybe this is
just the first time that I find myself on the *other* side of a
discussion about an algorithm that seems brute-force compared to what
it might replace, but is actually better overall.   :-)

Now, that could be something that I just need to get over. In any
case, I still think:

* Variables like maxTapes have a meaning that is directly traceable
back to Knuth's description of polyphase merge. I don't think that you
should do anything to them, on general principle.

* Everything or almost everything that you've added to mergeruns()
should probably be in its own dedicated function. This function can
have a comment where you acknowledge that it's not perfectly okay that
you claim memory per-tape, but it's simpler and faster overall.

* I think you should be looking at the number of active tapes, and not
state->Level or state->currentRun in this new function. Actually,
maybe this wouldn't be the exact definition of an active tape that we
establish at the beginning of beginmerge() (this considers tapes with
dummy runs to be inactive for that merge), but it would look similar.
The general concern I have here is that you shouldn't rely on
state->Level or state->currentRun from a distance for the purposes of
determining which tapes need some batch memory. This is also where you
might say something like: "we don't bother with shifting memory around
tapes for each merge step, even though that would be fairer. That's
why we don't use the beginmerge() definition of activeTapes --
instead, we use our own broader definition of the number of active
tapes that doesn't exclude dummy-run-tapes with real runs, making the
allocation reasonably sensible for every merge pass".

* The "batchUsed" terminology really isn't working here, AFAICT. For
one thing, you have two separate areas where caller tuples might
reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
tape), and the logtape.c controlled preread buffers (sized up to
MaxAllocSize per tape). Which of these two things is batch memory? I
think it might just be the first one, but KiBs of memory do not
suggest "batch" to me. Isn't that really more like what you might call
double buffer memory, used not to save overhead from palloc (having
many palloc headers in memory), but rather to recycle memory
efficiently? So, these two things should have two new names of their
own, I think, and neither should be called "batch memory" IMV. I see
several assertions remain here and there that were written with my
original definition of batch memory in mind -- assertions like:
        case TSS_SORTEDONTAPE:
            Assert(forward || state->randomAccess);
            Assert(!state->batchUsed);

(Isn't state->batchUsed always true now?)

And:
        case TSS_FINALMERGE:
            Assert(forward);
            Assert(state->batchUsed || !state->tuples);

(Isn't state->tuples only really of interest to datum-tuple-case
routines, now that you've made everything use large logtape.c preread
buffers?)

* Is it really necessary for !state->tuples cases to do that
MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
there for pass-by-value datum cases to have this scratch/double buffer
memory at all, since the value is returned to caller by-value, not
by-reference? This is related to the problem of it not being entirely
clear what batch memory now is, I think.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Robert Haas
Date:
On Thu, Sep 29, 2016 at 6:52 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> How is it awkward?
>
> Maybe that was the wrong choice of words. What I mean is that it seems
> somewhat unprincipled to give over an equal share of memory to each
> active-at-least-once tape, ...

I don't get it.  If the memory is being used for prereading, then the
point is just to avoid doing many small I/Os instead of one big I/O,
and that goal will be accomplished whether the memory is equally
distributed or not; indeed, it's likely to be accomplished BETTER if
the memory is equally distributed than if it isn't.

I can imagine that there might be a situation in which it makes sense
to give a bigger tape more resources than a smaller one; for example,
if one were going to divide N tapes across K worker processess and
make individual workers or groups of workers responsible for sorting
particular tapes, one would of course want to divide up the data
across the available processes relatively evenly, rather than having
some workers responsible for only a small amount of data and others
for a very large amount of data.  But such considerations are
irrelevant here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/29/2016 01:52 PM, Peter Geoghegan wrote:
> * Variables like maxTapes have a meaning that is directly traceable
> back to Knuth's description of polyphase merge. I don't think that you
> should do anything to them, on general principle.

Ok. I still think that changing maxTapes would make sense, when there
are fewer runs than tapes, but that is actually orthogonal to the rest
of the patch, so let's discuss that separately. I've changed the patch
to not do that.

> * Everything or almost everything that you've added to mergeruns()
> should probably be in its own dedicated function. This function can
> have a comment where you acknowledge that it's not perfectly okay that
> you claim memory per-tape, but it's simpler and faster overall.

Ok.

> * I think you should be looking at the number of active tapes, and not
> state->Level or state->currentRun in this new function. Actually,
> maybe this wouldn't be the exact definition of an active tape that we
> establish at the beginning of beginmerge() (this considers tapes with
> dummy runs to be inactive for that merge), but it would look similar.
> The general concern I have here is that you shouldn't rely on
> state->Level or state->currentRun from a distance for the purposes of
> determining which tapes need some batch memory. This is also where you
> might say something like: "we don't bother with shifting memory around
> tapes for each merge step, even though that would be fairer. That's
> why we don't use the beginmerge() definition of activeTapes --
> instead, we use our own broader definition of the number of active
> tapes that doesn't exclude dummy-run-tapes with real runs, making the
> allocation reasonably sensible for every merge pass".

I'm not sure I understood what your concern was, but please have a look
at this new version, if the comments in initTapeBuffers() explain that
well enough.

> * The "batchUsed" terminology really isn't working here, AFAICT. For
> one thing, you have two separate areas where caller tuples might
> reside: The small per-tape buffers (sized MERGETUPLEBUFFER_SIZE per
> tape), and the logtape.c controlled preread buffers (sized up to
> MaxAllocSize per tape). Which of these two things is batch memory? I
> think it might just be the first one, but KiBs of memory do not
> suggest "batch" to me. Isn't that really more like what you might call
> double buffer memory, used not to save overhead from palloc (having
> many palloc headers in memory), but rather to recycle memory
> efficiently? So, these two things should have two new names of their
> own, I think, and neither should be called "batch memory" IMV. I see
> several assertions remain here and there that were written with my
> original definition of batch memory in mind -- assertions like:

Ok. I replaced "batch" terminology with "slab allocator" and "slab
slots", I hope this is less confusing. This isn't exactly like e.g. the
slab allocator in the Linux kernel, as it has a fixed number of slots,
so perhaps an "object pool" might be more accurate. But I like "slab"
because it's not used elsewhere in the system. I also didn't use the
term "cache" for the "slots", as might be typical for slab allocators,
because "cache" is such an overloaded term.

>         case TSS_SORTEDONTAPE:
>             Assert(forward || state->randomAccess);
>             Assert(!state->batchUsed);
>
> (Isn't state->batchUsed always true now?)

Good catch. It wasn't, because mergeruns() set batchUsed only after 
checking for the TSS_SORTEDONTAPE case, even though it set up the batch 
memory arena before it. So if replacement selection was able to produce 
a single run, batchUsed was false. Fixed; the slab allocator (as it's now 
called) is now always used in the TSS_SORTEDONTAPE case.

> And:
>
>         case TSS_FINALMERGE:
>             Assert(forward);
>             Assert(state->batchUsed || !state->tuples);
>
> (Isn't state->tuples only really of interest to datum-tuple-case
> routines, now that you've made everything use large logtape.c preread
> buffers?)

Yeah, fixed.

> * Is it really necessary for !state->tuples cases to do that
> MERGETUPLEBUFFER_SIZE-based allocation? In other words, what need is
> there for pass-by-value datum cases to have this scratch/double buffer
> memory at all, since the value is returned to caller by-value, not
> by-reference? This is related to the problem of it not being entirely
> clear what batch memory now is, I think.

True, fixed. I still set slabAllocatorUsed (was batchUsed), but it's
initialized as a dummy 0-slot arena when !state->tuples.
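
I.e. the degenerate 0-slot initialization just sets up an empty arena,
roughly (a simplified sketch of init_slab_allocator() for that case):

    if (numSlots == 0)
    {
        /* no arena at all; every "slot" request will fall back to palloc */
        state->slabMemoryBegin = state->slabMemoryEnd = NULL;
        state->slabFreeHead = NULL;
    }
    state->slabAllocatorUsed = true;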

Here's a new patch version, addressing the points you made. Please have
a look!

- Heikki


Attachment

Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/29/2016 05:41 PM, Heikki Linnakangas wrote:
> Here's a new patch version, addressing the points you made. Please have
> a look!

Bah, I fumbled the initSlabAllocator() function, attached is a fixed
version.

- Heikki


Attachment

Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Thu, Sep 29, 2016 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Maybe that was the wrong choice of words. What I mean is that it seems
>> somewhat unprincipled to give over an equal share of memory to each
>> active-at-least-once tape, ...
>
> I don't get it.  If the memory is being used for prereading, then the
> point is just to avoid doing many small I/Os instead of one big I/O,
> and that goal will be accomplished whether the memory is equally
> distributed or not; indeed, it's likely to be accomplished BETTER if
> the memory is equally distributed than if it isn't.

I think it could hurt performance if preloading loads runs on a tape
that won't be needed until some subsequent merge pass, in preference
to using that memory proportionately, giving more to larger input runs
for *each* merge pass (giving memory proportionate to the size of each
run to be merged from each tape).  For tapes with a dummy run, the
appropriate amount of memory for their next merge pass is zero.

I'm not arguing that it would be worth it to do that, but I do think
that that's the sensible way of framing the idea of using a uniform
amount of memory to every maybe-active tape up front. I'm not too
concerned about this because I'm never too concerned about multiple
merge pass cases, which are relatively rare and relatively
unimportant. Let's just get our story straight.

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Robert Haas
Date:
On Thu, Sep 29, 2016 at 11:38 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Thu, Sep 29, 2016 at 2:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Maybe that was the wrong choice of words. What I mean is that it seems
>>> somewhat unprincipled to give over an equal share of memory to each
>>> active-at-least-once tape, ...
>>
>> I don't get it.  If the memory is being used for prereading, then the
>> point is just to avoid doing many small I/Os instead of one big I/O,
>> and that goal will be accomplished whether the memory is equally
>> distributed or not; indeed, it's likely to be accomplished BETTER if
>> the memory is equally distributed than if it isn't.
>
> I think it could hurt performance if preloading loads runs on a tape
> that won't be needed until some subsequent merge pass, in preference
> to using that memory proportionately, giving more to larger input runs
> for *each* merge pass (giving memory proportionate to the size of each
> run to be merged from each tape).  For tapes with a dummy run, the
> appropriate amount of memory for their next merge pass is zero.

OK, true.  But I still suspect that unless the amount of data you need
to read from a tape is zero or very small, the size of the buffer
doesn't matter.  For example, if you have a 1GB tape and a 10GB tape,
I doubt there's any benefit in making the buffer for the 10GB tape 10x
larger.  They can probably be the same.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Thu, Sep 29, 2016 at 4:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Bah, I fumbled the initSlabAllocator() function, attached is a fixed
> version.

This looks much better. It's definitely getting close. Thanks for
being considerate of my more marginal concerns. More feedback:

* Should say "fixed number of...":

> +    * we start merging.  Merging only needs to keep a small, fixed number tuples

* Minor concern about these new macros:

> +#define IS_SLAB_SLOT(state, tuple) \
> +   ((char *) tuple >= state->slabMemoryBegin && \
> +    (char *) tuple < state->slabMemoryEnd)
> +
> +/*
> + * Return the given tuple to the slab memory free list, or free it
> + * if it was palloc'd.
> + */
> +#define RELEASE_SLAB_SLOT(state, tuple) \
> +   do { \
> +       SlabSlot *buf = (SlabSlot *) tuple; \
> +       \
> +       if (IS_SLAB_SLOT(state, tuple)) \
> +       { \
> +           buf->nextfree = state->slabFreeHead; \
> +           state->slabFreeHead = buf; \
> +       } else \
> +           pfree(tuple); \
> +   } while(0)

I suggest duplicating the paranoia seen elsewhere around what the "state"
macro argument could expand to. You know, by surrounding "state" with
parenthesis each time it is used. This is what we see with existing,
similar macros.

* Should cast to int64 here (for the benefit of win64):

> +       elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
> +            (long) (availBlocks * BLCKSZ) / 1024, numInputTapes);

* FWIW, I still don't love this bit:

> +    * numTapes and numInputTapes reflect the actual number of tapes we will
> +    * use.  Note that the output tape's tape number is maxTapes - 1, so the
> +    * tape numbers of the used tapes are not consecutive, so you cannot
> +    * just loop from 0 to numTapes to visit all used tapes!
> +    */
> +   if (state->Level == 1)
> +   {
> +       numInputTapes = state->currentRun;
> +       numTapes = numInputTapes + 1;
> +       FREEMEM(state, (state->maxTapes - numTapes) * TAPE_BUFFER_OVERHEAD);
> +   }

But I can see how the verbosity of almost-duplicating the activeTapes
stuff seems unappealing. That said, I think that you should point out
in comments that you're calculating the number of
maybe-active-in-some-merge tapes. They're maybe-active in that they
have some number of real runs. Not going to insist on that, but
something to think about.

* Shouldn't this use state->tapeRange?:

> +   else
> +   {
> +       numInputTapes = state->maxTapes - 1;
> +       numTapes = state->maxTapes;
> +   }

* Doesn't it also set numTapes without it being used? Maybe that
variable can be declared within the "if (state->Level == 1)" block.

* Minor issues with initSlabAllocator():

You call the new function initSlabAllocator() as follows:

> +   if (state->tuples)
> +       initSlabAllocator(state, numInputTapes + 1);
> +   else
> +       initSlabAllocator(state, 0);

Isn't the number of slots (the second argument to initSlabAllocator())
actually just numInputTapes when we're "state->tuples"? And so,
shouldn't the "+ 1" bit happen within initSlabAllocator() itself? It
can just inspect "state->tuples" itself. In short, maybe push a bit
more into initSlabAllocator(). Making the arguments match those passed
to initTapeBuffers() a bit later would be nice, perhaps.
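
In other words, roughly this (just a sketch of the suggestion, not
tested):

    /* sketch: the caller only says how many input tapes there are */
    initSlabAllocator(state, numInputTapes);

    /*
     * ... and inside initSlabAllocator(): one slot per input tape, plus one
     * for the tuple last returned to the caller, but none at all in the
     * pass-by-value datum case.
     */
    numSlots = state->tuples ? numInputTapes + 1 : 0;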

* This could be simpler, I think:

> +   /* Release the read buffers on all the other tapes, by rewinding them. */
> +   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
> +   {
> +       if (tapenum == state->result_tape)
> +           continue;
> +       LogicalTapeRewind(state->tapeset, tapenum, true);
> +   }

Can't you just use state->tapeRange, and remove the "continue"? I
recommend referring to "now-exhausted input tapes" here, too.
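
i.e. something like this (sketch, untested):

    /* rewind the now-exhausted input tapes, releasing their read buffers */
    for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
        LogicalTapeRewind(state->tapeset, tapenum, true);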

* I'm not completely prepared to give up on using
MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
that it couldn't possibly matter that we impose a MaxAllocSize cap
within logtape.c (per tape), but I have slight reservations that I
need to address. Maybe a better way of putting it would be that I have
some reservations about possible regressions at the very high end,
with very large workMem. Any thoughts on that?

-- 
Peter Geoghegan



Re: Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 09/30/2016 04:08 PM, Peter Geoghegan wrote:
> On Thu, Sep 29, 2016 at 4:10 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Bah, I fumbled the initSlabAllocator() function, attached is a fixed
>> version.
>
> This looks much better. It's definitely getting close. Thanks for
> being considerate of my more marginal concerns. More feedback:

Fixed most of the things you pointed out, thanks.

> * Minor issues with initSlabAllocator():
>
> You call the new function initSlabAllocator() as follows:
>
>> +   if (state->tuples)
>> +       initSlabAllocator(state, numInputTapes + 1);
>> +   else
>> +       initSlabAllocator(state, 0);
>
> Isn't the number of slots (the second argument to initSlabAllocator())
> actually just numInputTapes when "state->tuples" is true? And so,
> shouldn't the "+ 1" bit happen within initSlabAllocator() itself? It
> can just inspect "state->tuples" itself. In short, maybe push a bit
> more into initSlabAllocator(). Making the arguments match those passed
> to initTapeBuffers() a bit later would be nice, perhaps.

The comment above that explains the "+ 1". init_slab_allocator allocates 
the number of slots that was requested, and the caller is responsible 
for deciding how many slots are needed. Yeah, we could remove the 
argument and move the logic altogether into init_slab_allocator(), but I 
think it's clearer this way. Matter of taste, I guess.

> * This could be simpler, I think:
>
>> +   /* Release the read buffers on all the other tapes, by rewinding them. */
>> +   for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
>> +   {
>> +       if (tapenum == state->result_tape)
>> +           continue;
>> +       LogicalTapeRewind(state->tapeset, tapenum, true);
>> +   }
>
> Can't you just use state->tapeRange, and remove the "continue"? I
> recommend referring to "now-exhausted input tapes" here, too.

Don't think so. result_tape == tapeRange only when the merge was done in 
a single pass (or you're otherwise lucky).

> * I'm not completely prepared to give up on using
> MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
> that it couldn't possibly matter that we impose a MaxAllocSize cap
> within logtape.c (per tape), but I have slight reservations that I
> need to address. Maybe a better way of putting it would be that I have
> some reservations about possible regressions at the very high end,
> with very large workMem. Any thoughts on that?

Meh, I can't imagine that using more than 1 GB for a read-ahead buffer 
could make any difference in practice. If you have a very large 
work_mem, you'll surely get away with a single merge pass, and 
fragmentation won't become an issue. And 1GB should be more than enough 
to trigger OS read-ahead.

Committed with some final kibitzing. Thanks for the review!

PS. This patch didn't fix bug #14344, the premature reuse of memory with 
tuplesort_gettupleslot. We'll still need to come up with 1. a 
backportable fix for that, and 2. perhaps a different fix for master.

- Heikki



Re: Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Mon, Oct 3, 2016 at 3:39 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Can't you just use state->tapeRange, and remove the "continue"? I
>> recommend referring to "now-exhausted input tapes" here, too.
>
>
> Don't think so. result_tape == tapeRange only when the merge was done in a
> single pass (or you're otherwise lucky).

Ah, yes. Logical tape assignment/physical tape number confusion on my part here.

>> * I'm not completely prepared to give up on using
>> MemoryContextAllocHuge() within logtape.c just yet. Maybe you're right
>> that it couldn't possibly matter that we impose a MaxAllocSize cap
>> within logtape.c (per tape), but I have slight reservations that I
>> need to address. Maybe a better way of putting it would be that I have
>> some reservations about possible regressions at the very high end,
>> with very large workMem. Any thoughts on that?
>
>
> Meh, I can't imagine that using more than 1 GB for a read-ahead buffer could
> make any difference in practice. If you have a very large work_mem, you'll
> surely get away with a single merge pass, and fragmentation won't become an
> issue. And 1GB should be more than enough to trigger OS read-ahead.

I had a non-specific concern about this, not a concrete suspicion. I
think that I'll figure it out when I rebase the parallel
CREATE INDEX patch on top of this and test that.

> Committed with some final kibitzing. Thanks for the review!

Thanks for working on this!

> PS. This patch didn't fix bug #14344, the premature reuse of memory with
> tuplesort_gettupleslot. We'll still need to come up with 1. a backportable
> fix for that, and 2. perhaps a different fix for master.

Agreed. It seemed like you favor not changing memory ownership
semantics for 9.6. I'm not sure that that's the easiest approach for
9.6, but let's discuss that over on the dedicated thread soon.

-- 
Peter Geoghegan



Re: [HACKERS] Tuplesort merge pre-reading

From
Tom Lane
Date:
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> I'm talking about the code that reads a bunch of tuples from each tape, loading
> them into the memtuples array. That code was added by Tom Lane, back in
> 1999:

> commit cf627ab41ab9f6038a29ddd04dd0ff0ccdca714e
> Author: Tom Lane <tgl@sss.pgh.pa.us>
> Date:   Sat Oct 30 17:27:15 1999 +0000

>      Further performance improvements in sorting: reduce number of comparisons
>      during initial run formation by keeping both current run and next-run
>      tuples in the same heap (yup, Knuth is smarter than I am).  And, during
>      merge passes, make use of available sort memory to load multiple tuples
>      from any one input 'tape' at a time, thereby improving locality of
>      access to the temp file.

> So apparently there was a benefit back then, but is it still worthwhile?

I'm fairly sure that the point was exactly what it said, ie improve
locality of access within the temp file by sequentially reading as many
tuples in a row as we could, rather than grabbing one here and one there.

It may be that the work you and Peter G. have been doing has rendered
that question moot.  But I'm a bit worried that the reason you're not
seeing any effect is that you're only testing situations with zero seek
penalty (ie your laptop's disk is an SSD).  Back then I would certainly
have been testing with temp files on spinning rust, and I fear that this
may still be an issue in that sort of environment.

The relevant mailing list thread seems to be "sort on huge table" in
pgsql-hackers in October/November 1999.  The archives don't seem to have
threaded that too successfully, but here's a message specifically
describing the commit you mention:

https://www.postgresql.org/message-id/2726.941493808%40sss.pgh.pa.us

and you can find the rest by looking through the archive summary pages
for that interval.

The larger picture to be drawn from that thread is that we were seeing
very different performance characteristics on different platforms.
The specific issue that Tatsuo-san reported seemed like it might be
down to weird read-ahead behavior in a 90s-vintage Linux kernel ...
but the point that this stuff can be environment-dependent is still
something to take to heart.
        regards, tom lane



Re: [HACKERS] Tuplesort merge pre-reading

From
Tom Lane
Date:
I wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I'm talking about the code that reads a bunch of tuples from each tape, loading
>> them into the memtuples array. That code was added by Tom Lane, back in 
>> 1999:
>> So apparently there was a benefit back then, but is it still worthwhile? 

> I'm fairly sure that the point was exactly what it said, ie improve
> locality of access within the temp file by sequentially reading as many
> tuples in a row as we could, rather than grabbing one here and one there.

[ blink... ]  Somehow, my mail reader popped up a message from 2016
as current, and I spent some time researching and answering it without
noticing the message date.

Never mind, nothing to see here ...
        regards, tom lane



Re: [HACKERS] Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Thu, Apr 13, 2017 at 9:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm fairly sure that the point was exactly what it said, ie improve
> locality of access within the temp file by sequentially reading as many
> tuples in a row as we could, rather than grabbing one here and one there.
>
> It may be that the work you and Peter G. have been doing has rendered
> that question moot.  But I'm a bit worried that the reason you're not
> seeing any effect is that you're only testing situations with zero seek
> penalty (ie your laptop's disk is an SSD).  Back then I would certainly
> have been testing with temp files on spinning rust, and I fear that this
> may still be an issue in that sort of environment.

I actually think Heikki's work here would particularly help on
spinning rust, especially when less memory is available. He
specifically justified it on the basis of it resulting in a more
sequential read pattern, particularly when multiple passes are
required.

> The larger picture to be drawn from that thread is that we were seeing
> very different performance characteristics on different platforms.
> The specific issue that Tatsuo-san reported seemed like it might be
> down to weird read-ahead behavior in a 90s-vintage Linux kernel ...
> but the point that this stuff can be environment-dependent is still
> something to take to heart.

BTW, I'm skeptical of Heikki's idea of killing polyphase merge itself
at this point. I think that keeping most tapes active per
pass is useful now that our memory accounting involves handing over an
even share to each maybe-active tape for every merge pass, something
established by Heikki's work on external sorting.

Interestingly enough, I think that Knuth was pretty much spot on with
his "sweet spot" of 7 tapes, even if you have modern hardware. Commit
df700e6 (where the sweet spot of merge order 7 was no longer always
used) was effective because it masked certain overheads that we
experience when doing multiple passes, overheads that Heikki and I
mostly removed. This was confirmed by Robert's testing of my merge
order cap work for commit fc19c18, where he found that using 7 tapes
was only slightly worse than using many hundreds of tapes. If we could
somehow be completely effective in making access to logical tapes
perfectly sequential, then 7 tapes would probably be noticeably
*faster*, due to CPU caching effects.

Knuth was completely correct to say that it basically made no
difference once more than 7 tapes are used to merge, because he didn't
have logtape.c fragmentation to worry about.

-- 
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Thu, Apr 13, 2017 at 10:19 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> I actually think Heikki's work here would particularly help on
> spinning rust, especially when less memory is available. He
> specifically justified it on the basis of it resulting in a more
> sequential read pattern, particularly when multiple passes are
> required.

BTW, what you might have missed is that Heikki did end up using a
significant amount of memory in the committed version. It just ended
up being managed by logtape.c, which now does the prereading instead
of tuplesort.c, but at a lower level. There is only one tuple per tape
in the merge heap, but there is still up to 1GB of memory per tape,
containing raw preread tuples mixed with integers that demarcate tape
contents.

-- 
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] Tuplesort merge pre-reading

From
Robert Haas
Date:
On Fri, Apr 14, 2017 at 1:19 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> Interestingly enough, I think that Knuth was pretty much spot on with
> his "sweet spot" of 7 tapes, even if you have modern hardware. Commit
> df700e6 (where the sweet spot of merge order 7 was no longer always
> used) was effective because it masked certain overheads that we
> experience when doing multiple passes, overheads that Heikki and I
> mostly removed. This was confirmed by Robert's testing of my merge
> order cap work for commit fc19c18, where he found that using 7 tapes
> was only slightly worse than using many hundreds of tapes. If we could
> somehow be completely effective in making access to logical tapes
> perfectly sequential, then 7 tapes would probably be noticeably
> *faster*, due to CPU caching effects.

I don't think there's any one fixed answer, because increasing the
number of tapes reduces I/O by adding CPU cost, and vice versa.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Tuplesort merge pre-reading

From
Peter Geoghegan
Date:
On Fri, Apr 14, 2017 at 5:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think there's any one fixed answer, because increasing the
> number of tapes reduces I/O by adding CPU cost, and vice versa.

Sort of, but if you have to merge hundreds of runs (a situation that
should be quite rare), then you should be concerned about being CPU
bound first, as Knuth was. Besides, on modern hardware, read-ahead can
be more effective if you have more merge passes, to a point, which
might also make it worth it -- using hundreds of tapes results in
plenty of *random* I/O. Plus, most of the time you only do a second
pass over a subset of initial quicksorted runs -- not all of them.

Probably the main complicating factor that Knuth doesn't care about is
time to return the first tuple -- startup cost. That was a big
advantage for commit df700e6 that I should have mentioned.

I'm not seriously suggesting that we should prefer multiple passes in
the vast majority of real world cases, nor am I suggesting that we
should go out of our way to help cases that need to do that. I just
find all this interesting.

-- 
Peter Geoghegan

VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] Tuplesort merge pre-reading

From
Heikki Linnakangas
Date:
On 04/14/2017 08:19 AM, Peter Geoghegan wrote:
> BTW, I'm skeptical of Heikki's idea of killing polyphase merge itself
> at this point. I think that keeping most tapes active per
> pass is useful now that our memory accounting involves handing over an
> even share to each maybe-active tape for every merge pass, something
> established by Heikki's work on external sorting.

The pre-read buffers are only needed for input tapes; the total number 
of tapes doesn't matter.

For comparison, imagine that you want to perform a merge, such that you 
always merge 7 runs into one. With polyphase merge, you would need 8 
tapes, so that you always read from 7 of them, and write onto one. With 
balanced merge, you would need 14 tapes: you always read from 7 tapes, 
and you would need up to 7 output tapes, of which one would be active at 
any given time.

Those extra idle output tapes are practically free in our 
implementation. The "pre-read buffers" are only needed for input tapes;
the number of output tapes doesn't matter. Likewise, maintaining the
heap is cheaper if you only merge a small number of tapes at a time, but 
that's also dependent on the number of *input* tapes, not the total 
number of tapes.

- Heikki