Re: Tuplesort merge pre-reading - Mailing list pgsql-hackers
From | Heikki Linnakangas
---|---
Subject | Re: Tuplesort merge pre-reading
Date |
Msg-id | 95f5a75d-b6ad-f175-57f2-6e32f745e908@iki.fi
In response to | Re: Tuplesort merge pre-reading (Peter Geoghegan <pg@heroku.com>)
Responses | Re: Tuplesort merge pre-reading
List | pgsql-hackers
On 09/06/2016 10:26 PM, Peter Geoghegan wrote:
> On Tue, Sep 6, 2016 at 12:08 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> Offhand, I would think that taken together this is very important. I'd
>> certainly want to see cases in the hundreds of megabytes or gigabytes
>> of work_mem alongside your 4MB case, even just to be able to talk
>> informally about this. As you know, the default work_mem value is very
>> conservative.

I spent some more time polishing this up, and also added some code to logtape.c to use larger read buffers, to compensate for the fact that we don't do pre-reading from tuplesort.c anymore. That should trigger OS read-ahead and make the I/O more sequential, which was the purpose of the old pre-reading code, but with a simpler mechanism. I haven't tested that part much yet, but I plan to run some tests on larger data sets that don't fit in RAM, to make the I/O effects visible.
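To illustrate what I mean by larger read buffers, here is a rough sketch. The names (TapeReader, tape_refill, tape_read, READBUF_SIZE) are made up for the example, and it ignores the way logtape.c chains a tape's blocks together; it is not the code in the patch:

/*
 * Sketch only: read a tape through a multi-block buffer instead of one
 * 8 kB block at a time.  Refilling with one large read() makes the I/O
 * look sequential to the OS, so its read-ahead can kick in.
 */
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define READBUF_SIZE (256 * 1024)	/* many blocks, not just one */

typedef struct TapeReader
{
	int		fd;					/* temp file holding the tape's data */
	char	buf[READBUF_SIZE];	/* buffer shared by many tuple reads */
	size_t	nbytes;				/* valid bytes currently in buf */
	size_t	pos;				/* read position within buf */
} TapeReader;

/* Refill the buffer with the next large chunk of the tape. */
static bool
tape_refill(TapeReader *tape)
{
	ssize_t		nread = read(tape->fd, tape->buf, READBUF_SIZE);

	if (nread <= 0)
		return false;			/* EOF (or error) */
	tape->nbytes = (size_t) nread;
	tape->pos = 0;
	return true;
}

/* Copy 'len' bytes of tuple data out of the buffer, refilling as needed. */
static size_t
tape_read(TapeReader *tape, void *dst, size_t len)
{
	size_t		copied = 0;

	while (copied < len)
	{
		size_t		avail;

		if (tape->pos == tape->nbytes && !tape_refill(tape))
			break;
		avail = tape->nbytes - tape->pos;
		if (avail > len - copied)
			avail = len - copied;
		memcpy((char *) dst + copied, tape->buf + tape->pos, avail);
		tape->pos += avail;
		copied += avail;
	}
	return copied;
}

The important part is simply that each refill issues one big read instead of fetching a single block per tuple batch.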
I wrote a little testing toolkit; see the third patch. I'm not proposing to commit that, but it's what I used for testing. It creates four tables, about 1 GB in size each (it also creates smaller and larger tables, but I used the "medium" sized ones for these tests). Two of the tables contain integers, and two contain text strings. Two of the tables are completely ordered, and two are in random order. To measure, it runs ORDER BY queries on the tables with different work_mem settings.

Attached are the full results. In summary, these patches improve performance in some of the tests and are a wash in others. They help in particular in the randomly ordered cases, with up to 512 MB of work_mem. For example, with work_mem=256MB, which is enough to get a single merge pass:

with patches:
ordered_ints: 7078 ms, 6910 ms, 6849 ms
random_ints: 15639 ms, 15575 ms, 15625 ms
ordered_text: 11121 ms, 12318 ms, 10824 ms
random_text: 53462 ms, 53420 ms, 52949 ms

unpatched master:
ordered_ints: 6961 ms, 7040 ms, 7044 ms
random_ints: 19205 ms, 18797 ms, 18955 ms
ordered_text: 11045 ms, 11377 ms, 11203 ms
random_text: 57117 ms, 54489 ms, 54806 ms

(Each query was run three times in a row; that's what the three numbers on each row mean. In other words, the differences between the numbers on the same row are noise.)

> It looks like your benchmark relies on multiple passes, which can be
> misleading. I bet it suffers some amount of problems from palloc()
> fragmentation. When very memory constrained, that can get really bad.
>
> Non-final merge passes (merges with more than one run -- real or dummy
> -- on any given tape) can have uneven numbers of runs on each tape.
> So, tuplesort.c needs to be prepared to dole out memory among tapes
> *unevenly* there (same applies to memtuples "slots"). This is why
> batch memory support is so hard for those cases (the fact that they're
> so rare anyway also puts me off it). As you know, I wrote a patch that
> adds batch memory support to cases that require randomAccess (final
> output on a materialized tape), for their final merge. These final
> merges happen to not be a final on-the-fly merge only due to this
> randomAccess requirement from the caller. It's possible to support these
> cases in the future, with that patch, only because I am safe to assume
> that each run/tape is the same size there (well, the assumption is
> exactly as safe as it was for the 9.6 final on-the-fly merge, at
> least).
>
> My point about non-final merges is that you have to be very careful
> that you're comparing apples to apples, memory accounting wise, when
> looking into something like this. I'm not saying that you didn't, but
> it's worth considering.

I'm not 100% sure I'm accounting for all the memory correctly. But I didn't touch the way the initial quicksort works, nor the way the runs are built. And the merge passes don't actually need or benefit from a lot of memory, so I doubt they're very sensitive to that.

In this patch, the memory available for the read buffers is simply divided evenly across maxTapes. The buffers for the tapes that are not currently active are wasted. It could be made smarter by freeing the buffers of tapes that are not active at the moment; I might do that later, but this is what I'm going to benchmark for now. I don't think adding buffers is helpful beyond a certain point, so this is probably good enough in practice. Still, it would be nice to free the memory we don't need earlier, in case there are other processes that could make use of it.
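To make the division concrete, it is essentially the following (a sketch with a made-up function name, not the actual patch; BLCKSZ is PostgreSQL's block size, 8 kB by default):

/*
 * Sketch only: give each tape an equal share of the memory reserved for
 * read buffers, rounded down to whole blocks, but never less than one
 * block per tape.
 */
#define BLCKSZ 8192				/* PostgreSQL's default block size */

static long
read_buffer_size_per_tape(long availMem, int maxTapes)
{
	long		per_tape = availMem / maxTapes;

	if (per_tape < BLCKSZ)
		per_tape = BLCKSZ;		/* floor: at least one block per tape */

	per_tape -= per_tape % BLCKSZ;	/* whole blocks only */

	return per_tape;
}

The waste mentioned above is just the shares handed out to tapes that never become active in the current merge.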
> FWIW, I did try an increase in the buffer size in LogicalTape at one
> time several months back, and saw no benefit there (at least, with no
> other changes).

Yeah, unless you get rid of the pre-reading in tuplesort.c, you're just double-buffering.

- Heikki