Re: [PATCH] Incremental sort (was: PoC: Partial sort) - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: [PATCH] Incremental sort (was: PoC: Partial sort) |
Date | |
Msg-id | 20190614173725.vu6xrkx4iu4ghjdz@development Whole thread Raw |
In response to | Re: [PATCH] Incremental sort (was: PoC: Partial sort) (James Coleman <jtc331@gmail.com>) |
List | pgsql-hackers |
On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote: >On Wed, Jun 5, 2019 at 12:14 PM Rafia Sabih <rafia.pghackers@gmail.com> wrote: >> > > 2) Provide some fallback at execution time. For example, we might watch >> > > the size of the group, and if we run into an unexpectedly large one we >> > > might abandon the incremental sort and switch to a "full sort" mode. >> > >> > Are there good examples of our doing this in other types of nodes >> > (whether the fallback is an entirely different algorithm/node type)? I >> > like this idea in theory, but I also think it's likely it would add a >> > very significant amount of complexity. The other problem is knowing >> > where to draw the line: you end up creating these kinds of cliffs >> > where pulling one more tuple through the incremental sort would give >> > you your batch and result in not having to pull many more tuples in a >> > regular sort node, but the fallback logic kicks in anyway. >> > >> What about having some simple mechanism for this, like if we encounter >> the group with more tuples than the one estimated then simply switch >> to normal sort for the remaining tuples, as the estimates does not >> hold true anyway. Atleast this will not give issues of having >> regressions of incremental sort being too bad than the normal sort. >> I mean having something like this, populate the tuplesortstate and >> keep on counting the number of tuples in a group, if they are within >> the budget call tuplesort_performsort, otherwise put all the tuples in >> the tuplesort and then call tuplesort_performsort. We may have an >> additional field in IncrementalSortState to save the estimated size of >> each group. I am assuming that we use MCV lists to approximate better >> the group sizes as suggested above by Tomas. > >I think the first thing to do is get some concrete numbers on performance if we: > >1. Only sort one group at a time. >2. Update the costing to prefer traditional sort unless we have very >high confidence we'll win with incremental sort. > >It'd be nice not to have to add additional complexity if at all possible. > +1 to that >> > Unrelated to all of the above: if I read the patch properly it >> > intentionally excludes backwards scanning. I don't see any particular >> > reason why that ought to be the case, and it seems like an odd >> > limitation for the feature should it be merged. Should that be a >> > blocker to merging? >> >> Regarding this, I came across this, >> /* >> * Incremental sort can't be used with either EXEC_FLAG_REWIND, >> * EXEC_FLAG_BACKWARD or EXEC_FLAG_MARK, because we hold only current >> * bucket in tuplesortstate. >> */ >> I think that is quite understandable. How are you planning to support >> backwards scan for this? In other words, when will incremental sort be >> useful for backward scan. > >For some reason I was thinking we'd need it to support backwards scans >to be able to handle DESC sort on the index, but I've tested and >confirmed that already works. I suppose that's because the index scan >provides that ordering and the sort node doesn't need to reverse the >direction of what's provided to it. That's not particularly obvious to >someone newer to the codebase; I'm not sure if that's documented >anywhere. > Yeah, backward scans are not about ASC/DESC, it's about being able to walk back through the result. And we can't do that with incremental sorts without materialization. >> On a different note, I can't stop imagining this operator on the lines >> similar to parallel-append, wherein multiple workers can sort the >> different groups independently at the same time. > >That is an interesting idea. I suppose it'd be particularly valuable >if somehow there a node that was generating each batch in parallel >already, though I'm not sure at first though what kind of query or >node would result in that. I also wonder if (assuming that weren't the >case) it would be much of an improvement since a single thread would >have to generate each batch anyway; I'm not sure if the overhead of >farming each batch out to a worker would actually be useful or if the >real blocker is the base scan. > >At the very least it's an interesting idea. > I kinda doubt that'd be very valuable. Or more precisely, we kinda already have that capability because we can do things like this: -> Gather Merge -> Sort -> ... scan ... so I imagine we'd just do an Incremental Sort here. Granted, it's not distributed by prefix groups (I assume that's what you mean by batches here), but I don't think that's a big problem. >--- > >I've been writing down notes here, and I realized that my test case >from far upthread is actually a useful setup to see how much overhead >is involved in sorting each batch individually, since it sets up data >with each batch only containing 1 tuple. That particular case is one >we could easily optimize anyway in the code and skip sorting >altogether -- might be a useful enhancement. > >I hope to do some more testing and then report back with the results. > >James Coleman OK. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pgsql-hackers by date: