Re: Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: Parallel tuplesort (for parallel B-Tree index creation)
Date:
Msg-id: CAM3SWZS1F4vab6+QeEHLF3A1YHYEChoNULUVCokLixLi5QkzDA@mail.gmail.com
In response to: Re: Parallel tuplesort (for parallel B-Tree index creation) (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Parallel tuplesort (for parallel B-Tree index creation)
           Re: Parallel tuplesort (for parallel B-Tree index creation)
List: pgsql-hackers
On Tue, Sep 6, 2016 at 12:34 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I'm reviewing patches 1-3 in this series, i.e. those patches that are not
> directly related to parallelism, but are independent improvements to
> merging.

That's fantastic! Thanks! I'm really glad you're picking those ones up.

I feel that I'm far too dependent on Robert's review for this stuff. That
shouldn't be taken as a statement against Robert -- it's intended as quite
the opposite -- but it's just personally difficult to rely on exactly one
other person for something that I've put so much work into. Robert has
been involved with 100% of all sorting patches I've written, generally
with far less input from anyone else, and at this point that's really
rather a lot of complex patches.

> Let's begin with patch 1:
>
> On 08/02/2016 01:18 AM, Peter Geoghegan wrote:
>>
>> Cap the number of tapes used by external sorts
>
> Yeah, wasting 8% of the memory budget on this seems like a bad idea. If I
> understand correctly, that makes the runs shorter than necessary, leading
> to more runs.

Right. Quite simply, whatever you could have used the workMem for prior
to the merge step, now you can't. It's not so bad during the merge step
of a final on-the-fly merge (or, with the 0002-* patch, any final merge),
since you can get a "refund" of unused (though logically allocated by
USEMEM()) tapes to grow memtuples with (other overhead forms the majority
of the refund, though). That still isn't much consolation to the user,
because run generation is typically much more expensive (we really just
refund unused tapes because it's easy).

>> A new quasi-arbitrary cap of 501 is applied on the number of tapes that
>> tuplesort will ever use (i.e. merge order is capped at 500 inclusive).
>> This is a conservative estimate of the number of runs at which doing all
>> merging on-the-fly no longer allows greater overlapping of I/O and
>> computation.
>
> Hmm. Surely there are cases where, with more than 501 tapes, you could do
> it with one merge pass, but now you need two? And that would hurt
> performance, no?

In theory, yes, that could be true, and not just for my proposed new cap
of 500 for merge order (501 tapes), but for any such cap. I noticed that
the Greenplum tuplesort.c uses a max of 250, so I guess I just thought to
double that. Way back in 2006, Tom and Simon talked about a cap too on
several occasions, but I think that that was in the thousands then.

Hundreds of runs are typically quite rare. It isn't that painful to do a
second pass, because the merge process may be more CPU cache efficient as
a result, which tends to be the dominant cost these days (over and above
the extra I/O that an extra pass requires).

This seems like a very familiar situation to me: I pick a quasi-arbitrary
limit or cap for something, and it's not clear that it's optimal.
Everyone more or less recognizes the need for such a cap, but is
uncomfortable about the exact figure chosen -- not because it's
objectively bad, but because it's clearly something pulled from the air,
to some degree. It may not make you feel much better about it, but I
should point out that I've read a paper that claims "Modern servers of
the day have hundreds of GB operating memory and tens of TB storage
capacity. Hence, if the sorted data fit the persistent storage, the first
phase will generate hundreds of runs at most." [1]

Feel free to make a counter-proposal for a cap. I'm not attached to 500.
I'm mostly worried about blatant waste with very large workMem sizings.
Tens of thousands of tapes is just crazy. The amount of data that you
need to have as input is very large when workMem is big enough for this
new cap to be enforced.
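To make the shape of the change concrete, here is a minimal sketch of how
a cap like this can be applied when sizing the tape set from the workMem
budget. It is illustrative only, not the patch's actual tuplesort.c code;
MIN_MERGE_ORDER, MERGE_ORDER_CAP, TAPE_OVERHEAD_BYTES and the sizing
arithmetic are simplified stand-ins.

/*
 * Illustrative sketch only -- not the patch's tuplesort.c code.  The
 * constants below are stand-ins chosen for the example.
 */
#include <stddef.h>

#define MIN_MERGE_ORDER     6           /* always allow a small merge */
#define MERGE_ORDER_CAP     500         /* proposed cap: 501 tapes total */
#define TAPE_OVERHEAD_BYTES (32 * 1024) /* assumed per-tape buffer overhead */

/*
 * Pick a merge order from the workMem budget: one buffer per input tape
 * plus one for the output tape, then clamp so that a huge workMem cannot
 * be eaten by tens of thousands of tape buffers.
 */
static int
merge_order_for_budget(size_t work_mem_bytes)
{
    int     merge_order;

    merge_order = (int) (work_mem_bytes / TAPE_OVERHEAD_BYTES) - 1;

    if (merge_order < MIN_MERGE_ORDER)
        merge_order = MIN_MERGE_ORDER;
    if (merge_order > MERGE_ORDER_CAP)
        merge_order = MERGE_ORDER_CAP;

    return merge_order;
}

With those assumed numbers, a 16MB workMem lands just above the cap and
gets clamped to 500, while anything much larger no longer buys more tapes
at all -- which is the "blatant waste" case the cap is aimed at.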
> Why do we reserve the buffer space for all the tapes right at the
> beginning? Instead of the single USEMEM(maxTapes * TAPE_BUFFER_OVERHEAD)
> call in inittapes(), couldn't we call USEMEM(TAPE_BUFFER_OVERHEAD) every
> time we start a new run, until we reach maxTapes?

No, because then you have no way to clamp back memory, which by that
point is almost all in use (we hold off from making LACKMEM() continually
true, if at all possible, which is almost always the case). You can't
really continually shrink memtuples to make space for new tapes, which is
what it would take.

[1] http://ceur-ws.org/Vol-1343/paper8.pdf

--
Peter Geoghegan
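As a footnote to the inittapes() question above, here is a simplified
sketch of the two accounting styles under discussion. The names and
constants (SortBudget, use_mem(), TAPE_OVERHEAD_BYTES) are invented for
the example and are not the real availMem/USEMEM()/TAPE_BUFFER_OVERHEAD
machinery; the point is only that an up-front debit leaves a budget that
memtuples can safely grow into, while per-run debits would require
shrinking memtuples later.

/*
 * Simplified sketch of the accounting, not the real tuplesort.c code.
 */
#include <stdint.h>

#define TAPE_OVERHEAD_BYTES (32 * 1024)     /* assumed per-tape overhead */

typedef struct SortBudget
{
    int64_t availMem;       /* workMem bytes not yet spoken for */
} SortBudget;

static void
use_mem(SortBudget *budget, int64_t bytes)
{
    budget->availMem -= bytes;  /* going negative is LACKMEM() territory */
}

/*
 * Up-front style: debit every tape's buffer overhead once, before any run
 * is built.  Whatever is left in availMem afterwards can be spent on
 * growing memtuples without second thoughts.
 */
static void
init_tapes_upfront(SortBudget *budget, int max_tapes)
{
    use_mem(budget, (int64_t) max_tapes * TAPE_OVERHEAD_BYTES);
}

/*
 * Incremental style: debit one tape at a time as runs are started.  By
 * then memtuples has typically grown to consume nearly all of availMem,
 * so each new debit pushes the sort toward LACKMEM(); the only way out
 * would be to shrink memtuples, which cannot realistically be done while
 * it is full of tuples.
 */
static void
start_new_run(SortBudget *budget)
{
    use_mem(budget, TAPE_OVERHEAD_BYTES);
}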