Thread: Progress on fast path sorting, btree index creation time
I'll try and keep this terse. I've promised to justify the number of specialisations that are in my fast-path sorting patch, and I may yet conclude that a reduction is appropriate. Not today though - there are quite a few ideas in the patch (even more now), and not all have been exhaustively evaluated. Maybe that isn't what some had hoped for right now, but deferring publicly and definitively answering those questions to focus on tweaking the patch has brought additional benefits. I felt that I went too long without checking in with the community though. The purpose of this e-mail is to:

1. Report back a further improvement in the performance benefits seen when sorting with multiple sortkeys. The performance benefits seen for that case were previously relatively modest; they're better now.

2. Report improvements from applying these techniques to btree index tuplesorting (short version: they're also quite good). The data used is exactly the same as it was in my previous benchmark; orderlines is 1538MB, and has lots of duplicates. The environment is also identical.

3. Resolve two anticipated controversies that are, respectively, somewhat orthogonal and completely orthogonal to the binary bloat controversy.

The first (controversy A) is that I have added a new piece of infrastructure, pg_always_inline, which, as the name suggests, is a portable way of insisting that a function should be invariably inlined. Portable libraries like libc have been doing this for a long time now, and I actually use the libc macro (that expands to __attribute__((always_inline))) where possible.

The second possible point of contention (controversy B) is that I have jettisoned various protections against bad qsort implementations that I believe are a legacy of when we used the system qsort pre-2006, that can no longer be justified. For example, quick sort performing badly in the face of lots of duplicates is a well understood problem today (partitioning should stop on equal keys), and ISTM that protecting against that outside the qsort implementation (but only for index sorts) is wrong-headed.

The first possibly controversial adjustment (controversy A) is clearly also the reason for (1) - that has been well isolated. I haven't got around to quantifying the performance improvement seen due to the second possibly controversial adjustment (controversy B), but I believe that it can be justified as a case of effectively removing redundant code anyway. After all, we now assert against comparing a tuple to itself anyway, and the "cheap insurance" that existed in comparetup_index_btree was never present in any form in comparetup_heap, and we heard no complaints, AFAIK.

Attached are:

* A detailing of libc's use of __always_inline / __attribute__((always_inline)). Note that it often appears in code that is built for all platforms.

* The WIP patch itself, rebased to integrate with the new SortSupport infrastructure. I've gone a bit crazy with btree specialisations, but I suspect that's where it matters least and makes most sense.

* A new Python script for benchmarking index creation, similar to the one I previously posted for benchmarking the sorting of heap tuples.

* A spreadsheet that shows the results of re-running my earlier heap tuple sorting benchmark with this new patch. The improvement in the query that orders by 2 columns is all that is pertinent there, when considering the value of (1) and the sense in standing still for controversy A.
* A spreadsheet that shows the difference in index creation times, generated with the help of the new python script.

Thoughts?

I had another idea when writing this patch that I haven't developed at all but I'll share anyway. That is, it might be a good idea to use a balanced quicksort:

http://xlinux.nist.gov/dads//HTML/balancedqsrt.html

It might be possible to get a reasonable approximation of the actual median value of a given column with existing statistics, which could be hinted to qsort_arg. This would do a better job of guessing an appropriate initial pivot value for qsort than the med3 sampling technique (what we do now, advocated by Sedgewick: use the median of the first, middle and last elements of the current partition), though we'd still use that med3 technique to select all other pivots, and perhaps signal to qsort_arg "you're on your own, fall back on med3 for the initial pivot" with a null ptr, according to some heuristic. There's obviously a not inconsiderable impedance mismatch to resolve if we're to do that though, so that Tuplesortstate has a pointer to the median SortTuple.

Can I get a straw poll on how much of a problem worst-case performance of qsort is believed to be?

In a perfect world, if it were somehow possible to know the perfect pivots ahead of time from a histogram or something, we'd have a quick sort variant with worst-case performance of O(n log(n)). That, or the limited approximation that I've sketched would perhaps be worthwhile, even if it was subject to a number of caveats. Wikipedia claims that the worst case for quick sort is O(n log(n)) with the refinements recommended by Sedgewick's 1978 paper, but this seems like nonsense to me - the med3 technique is a good heuristic in practice, but it's perfectly possible in theory for its sampling to always get things completely wrong (this is not an unfamiliar problem). How often it messes up in the real world and how much it matters is something that I wouldn't like to speculate on, though it is the main factor that will determine if the idea is worth pursuing. I can tell you that I have heard one person observe that it had an unpredictable runtime. However, they might not have noticed if this patch was applied, because in general the worse quicksort does the better this patch does. Also, the fact that we use a median of "medians" when n > 40 (Dr. Sedgewick again, if I'm not mistaken) makes me a little skeptical of that claim.

Come to think of it, it might not be a bad idea to add a bunch of comments to qsort_arg while I have this swapped into my head, as it currently has no comments at all.

While I'm thinking out loud, here's another idea: Have a GUC which, when enabled (perhaps potentially at various different granularities), makes Timsort the in-memory sorting algorithm, as it now is by default for Python, Java SE7 (for arrays) and the Android platform; for certain applications, this could be a real win, and I don't think that it would have to be invasive: it could be dropped in.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
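For reference, the med3 sampling described above chooses the pivot as the median of three sampled elements. A minimal sketch in the Bentley & McIlroy style (illustrative only; qsort_arg's actual code may differ in detail):

    /* Median of three: the classic med3 helper, taking a qsort_arg-style
     * three-argument comparator.  Returns whichever of a, b, c compares
     * as the middle value. */
    static char *
    med3(char *a, char *b, char *c,
         int (*cmp) (const void *, const void *, void *), void *arg)
    {
        return cmp(a, b, arg) < 0 ?
            (cmp(b, c, arg) < 0 ? b : (cmp(a, c, arg) < 0 ? c : a)) :
            (cmp(b, c, arg) > 0 ? b : (cmp(a, c, arg) > 0 ? c : a));
    }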
On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > * A spreadsheet that shows the results of re-running my earlier heap > tuple sorting benchmark with this new patch. The improvement in the > query that orders by 2 columns is all that is pertinent there, when > considering the value of (1) and the sense in standing still for > controversy A. > > * A spreadsheet that shows the difference in index creation times, > generated with the help of the new python script. very nice. let me save everyone the effort of opening his spreadsheets (which by the way both show 'HEAD/unoptimized' -- probably not what you meant): he's showing a consistent ~50% reduction in running time of sort driven queries -- that's money. merlin
On 30 December 2011 19:46, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> * A spreadsheet that shows the results of re-running my earlier heap
>> tuple sorting benchmark with this new patch. The improvement in the
>> query that orders by 2 columns is all that is pertinent there, when
>> considering the value of (1) and the sense in standing still for
>> controversy A.
>>
>> * A spreadsheet that shows the difference in index creation times,
>> generated with the help of the new python script.
>
> very nice. let me save everyone the effort of opening his
> spreadsheets (which by the way both show 'HEAD/unoptimized' --
> probably not what you meant): he's showing a consistent ~50% reduction
> in running time of sort driven queries -- that's money.

Sorry, I think you may have misinterpreted the results, which is my fault - I introduced a formatting error. In the case of the "btree" spreadsheet, the first query on each sheet should be "create index test on orderlines (prod_id);", and not "select * from orderlines order by prod_id". The idea is to compare the results from each set of binaries across pages of the spreadsheet (note that there are two tabs). You should not compare anything between the two spreadsheets. Revised btree results attached. The heap results that I posted do not have any formatting errors, so they have not been revised.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Dec 30, 2011 at 2:30 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 30 December 2011 19:46, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Dec 29, 2011 at 8:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>>> * A spreadsheet that shows the results of re-running my earlier heap
>>> tuple sorting benchmark with this new patch. The improvement in the
>>> query that orders by 2 columns is all that is pertinent there, when
>>> considering the value of (1) and the sense in standing still for
>>> controversy A.
>>>
>>> * A spreadsheet that shows the difference in index creation times,
>>> generated with the help of the new python script.
>>
>> very nice. let me save everyone the effort of opening his
>> spreadsheets (which by the way both show 'HEAD/unoptimized' --
>> probably not what you meant): he's showing a consistent ~50% reduction
>> in running time of sort driven queries -- that's money.
>
> Sorry, I think you may have misinterpreted the results, which is my
> fault - I introduced a formatting error. In the case of the "btree"
> spreadsheet, the first query on each sheet should be "create index
> test on orderlines (prod_id);", and not "select * from orderlines
> order by prod_id". The idea is to compare the results from each set of
> binaries across pages of the spreadsheet (note that there are two
> tabs). You should not compare anything between the two spreadsheets.
> Revised btree results attached. The heap results that I posted do not
> have any formatting errors, so they have not been revised.

right-- my bad. still, that's 31-37% -- still pretty nice.

merlin
On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> 3. Resolve two anticipated controversies that are, respectively,
> somewhat orthogonal and completely orthogonal to the binary bloat
> controversy. The first (controversy A) is that I have added a new
> piece of infrastructure, pg_always_inline, which, as the name
> suggests, is a portable way of insisting that a function should be
> invariably inlined. Portable libraries like libc have been doing this
> for a long time now, and I actually use the libc macro (that expands
> to __attribute__((always_inline))) where possible.

I don't have a problem with the idea of a pg_always_inline, but I'm wondering what sort of fallback mechanism you propose. It seems to me that if the performance improvement is conditioned on inlining being done and we're not confident that we can force the compiler to inline, we ought to fall back to not making the specialization available, rather than to making something available that may not work well. Of course if it's only a question of a smaller performance gain, rather than an actual loss, then that's not as much of an issue...

> The second
> possible point of contention (controversy B) is that I have jettisoned
> various protections against bad qsort implementations that I believe
> are a legacy of when we used the system qsort pre-2006, that can no
> longer be justified. For example, quick sort performing badly in the
> face of lots of duplicates is a well understood problem today
> (partitioning should stop on equal keys), and ISTM that protecting
> against that outside the qsort implementation (but only for index
> sorts) is wrong-headed.

Assuming you mean that qsort_arg() will protect against this stuff so that callers don't need to do it separately, +1.

> I had another idea when writing this patch that I haven't developed at
> all but I'll share anyway. That is, it might be a good idea to use a
> balanced quicksort:
>
> http://xlinux.nist.gov/dads//HTML/balancedqsrt.html

I can't find any description of how this actually works... obviously, we'd like to find a close-to-median element, but how do we actually do that cheaply?

> It might be possible to get a reasonable approximation of the actual
> median value of a given column with existing statistics, which could
> be hinted to qsort_arg.

I doubt that this would be worthwhile -- the pivot that we pick at the toplevel doesn't really matter much; even if it's bad, we can recover just fine if the pivot we pick one level down is good, or the level after that. We only really have a problem if we keep systematically picking bad pivots every time. And if we do have that problem, improving the toplevel choice of pivot is only going to help modestly, because we'll still be O(n^2) on each partition.

> Can I get a straw poll on how much of a problem worst-case performance
> of qsort is believed to be?

I'd be reluctant to remove any of the protections we currently have, because I bet they were all put in to fix problems that people hit in the field. But I don't know that we need more. Of course, consolidating them into qsort() itself rather than its callers probably makes sense, as noted above.

> In a perfect world, if it were somehow possible to know the perfect
> pivots ahead of time from a histogram or something, we'd have a quick
> sort variant with worst-case performance of O(n log(n)). That, or the
> limited approximation that I've sketched would perhaps be worthwhile,
> even if it was subject to a number of caveats.
> Wikipedia claims that
> the worst case for quick sort is O(n log(n)) with the refinements
> recommended by Sedgewick's 1978 paper, but this seems like nonsense to
> me - the med3 technique is a good heuristic in practice, but it's
> perfectly possible in theory for its sampling to always get things
> completely wrong (this is not an unfamiliar problem).

I agree. After reading http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined to think that things are still O(n lg n) even if we always partition so that any fixed percentage of the data is on one side of the partition element and the remainder is on the other side. So for example if all of our partition elements are greater than 1% of the elements in their partition and less than the other 99%, we'll still be O(n lg n), though with a considerably higher constant factor, I think. However, if our partition elements are always a fixed NUMBER of elements from the end - whether it's 1 or 100 - then we'll be O(n^2), though again the constant factor will depend on how many elements you knock off each time. In practice I'm not sure whether an algorithmic analysis matters much, because in real life the constant factors are probably pretty important, hence the median-of-medians approach with n > 40.

> While I'm thinking out loud, here's another idea: Have a GUC which,
> when enabled (perhaps potentially at various different granularities),
> makes Timsort the in-memory sorting algorithm, as it now is by default
> for Python, Java SE7 (for arrays) and the Android platform; for
> certain applications, this could be a real win, and I don't think that
> it would have to be invasive: it could be dropped in.

I gather from a quick read that this is supposed to be especially good when the data is already somewhat sorted. Would there be any merit in trying to guess when that's likely to be the case (e.g. based on the correlation statistic)? That seems like a stretch, but OTOH a GUC doesn't feel great either: what is the poor user to do with a query that does 2 sorts, one of which is faster with QS and the other of which is faster with Timsort? Ugh.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
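The fixed-percentage intuition above can be made precise with the standard recurrences (a sketch of the reasoning, not part of the original exchange). If every pivot splits off at least a fixed fraction c of its partition:

    T(n) = T(cn) + T((1 - c)n) + \Theta(n),  0 < c <= 1/2  =>  T(n) = \Theta(n \log n)

since the recursion depth is at most \log_{1/(1-c)} n and each level does \Theta(n) total work; the constant factor grows as c shrinks. If instead each pivot splits off only a fixed number k of elements:

    T(n) = T(n - k) + \Theta(n)  =>  T(n) = \Theta(n^2 / k)

which is the quadratic case described above.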
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Dec 29, 2011 at 9:03 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> The first (controversy A) is that I have added a new
>> piece of infrastructure, pg_always_inline, which, as the name
>> suggests, is a portable way of insisting that a function should be
>> invariably inlined.

> I don't have a problem with the idea of a pg_always_inline, but I'm
> wondering what sort of fallback mechanism you propose.

There is no compiler anywhere that implements "always inline", unless you are talking about a macro. "inline" is a hint and nothing more, and if you think you can force it you are mistaken. So this controversy is easily resolved: we do not need any such construct.

The real question is whether we should accept a patch that is a performance loss when the compiler fails to inline some reasonably simple function. I think that would depend on the actual numbers involved, so we'd need to see data before making a decision.

>> The second
>> possible point of contention (controversy B) is that I have jettisoned
>> various protections against bad qsort implementations that I believe
>> are a legacy of when we used the system qsort pre-2006, that can no
>> longer be justified.

No objection to that one, I think, as long as you've verified that our implementation is in fact okay about these things.

regards, tom lane
On 5 January 2012 22:27, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is no compiler anywhere that implements "always inline", unless
> you are talking about a macro. "inline" is a hint and nothing more,
> and if you think you can force it you are mistaken. So this controversy
> is easily resolved: we do not need any such construct.

I'm slightly puzzled by your remarks here. GCC documentation on this is sparse (although, as I've demonstrated, I can get better numbers using the always_inline attribute on GCC 4.3), but take a look at this:

http://msdn.microsoft.com/en-us/library/z8y1yy88.aspx

While it is not strictly true to say that pg_always_inline could inline every function under every set of circumstances, it's pretty close to the truth. I do accept that this facility could quite easily be abused if its use isn't carefully measured. I also accept that C99/GNU C inline functions are generally just requests to the compiler that may be ignored (and indeed the compiler may inline without being asked to). It's short-sighted to see this as a case of inlining itself making a big difference, so much as it making a big difference as an enabling transformation.

> The real question is whether we should accept a patch that is a
> performance loss when the compiler fails to inline some reasonably
> simple function. I think that would depend on the actual numbers
> involved, so we'd need to see data before making a decision.

Who said anything about a performance loss? Since the raw improvement to qsort_arg is so large, it's pretty difficult to imagine a confluence of circumstances in which this results in a net loss. See the figures at http://archives.postgresql.org/pgsql-hackers/2011-11/msg01316.php , for example.

The pg_always_inline idea is relatively recent. It just serves to provide additional encouragement to the compiler to inline, insofar as that is possible on the platform in question. The compiler's cost/benefit analysis cannot possibly appreciate how hot a code path qsort_arg is, because it has a set of generic heuristics that are quite fallible, and very probably err on the conservative side. We can appreciate such things though.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
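For concreteness, a portable pg_always_inline could be defined along these lines (a sketch under stated assumptions: the feature tests shown are illustrative, not the patch's exact code, and the fallback degrades to a plain inline hint, as Peter notes later in the thread):

    /* Sketch of a portable "insist on inlining" macro.  GCC honours the
     * always_inline attribute on an inline function; MSVC has
     * __forceinline; anywhere else we settle for an ordinary hint. */
    #if defined(__GNUC__)
    #define pg_always_inline inline __attribute__((always_inline))
    #elif defined(_MSC_VER)
    #define pg_always_inline __forceinline
    #else
    #define pg_always_inline inline
    #endif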
On Thu, Jan 5, 2012 at 5:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There is no compiler anywhere that implements "always inline", unless
> you are talking about a macro. "inline" is a hint and nothing more,
> and if you think you can force it you are mistaken. So this controversy
> is easily resolved: we do not need any such construct.

That's basically in direct contradiction to the experimental evidence.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 5 January 2012 20:23, Robert Haas <robertmhaas@gmail.com> wrote:
> I don't have a problem with the idea of a pg_always_inline, but I'm
> wondering what sort of fallback mechanism you propose. It seems to me
> that if the performance improvement is conditioned on inlining being
> done and we're not confident that we can force the compiler to inline,
> we ought to fall back to not making the specialization available,
> rather than to making something available that may not work well. Of
> course if it's only a question of a smaller performance gain, rather
> than an actual loss, then that's not as much of an issue...

pg_always_inline is more or less a neat adjunct to what I've already done. You can take it or leave it, but it would be irrational to dismiss it out of hand, because it is demonstrably effective in this case. Using non-portable things like function attributes is fairly well precedented - we use __attribute__((format(...))) frequently. We don't need a fallback mechanism other than "#else #define pg_always_inline inline #endif". I think these facilities are available to well over 99% of our users, who use GCC, GCC-compatible compilers and MSVC builds. I have basically no sympathy for anyone who uses a compiler that can't inline functions. Do such people actually exist?

>> ISTM that protecting
>> against that outside the qsort implementation (but only for index
>> sorts) is wrong-headed.
>
> Assuming you mean that qsort_arg() will protect against this stuff so
> that callers don't need to do it separately, +1.

That's exactly what I mean. Going quadratic in the face of a lot of duplicates is something that, remarkably, was a problem well into the 1990s, apparently because some guy noticed it one day, though I think that lots of popular implementations happened to be unaffected anyway. As you know, all queries tested have lots and lots of duplicates (a ~1.5GB table that contains the same number of distinct elements as a ~10MB table once did), and removing the "duplicate protection" for btrees, on top of everything else, sees the qsort do an awful lot better than HEAD does with the pseudo-protection. As I've said, we already lack any such "protection" for heap tuples. We are unaffected now anyway, in particular, because we do this, as apparently recommended by both Sedgewick and Knuth:

    while (pb <= pc && (r = cmp(pb, a)) <= 0)
    {
        if (r == 0)
        {
            swap(pa, pb);
            pa += es;
        }
        pb += es;
    }

Note that the partition swap occurs *because* the pivot and element are equal. You might imagine that this is superfluous. It turns out that it protects against the duplicates, resulting in sub-partitions of about the same size (though it occurs to me that it does so at the expense of precluding parallelism). In short, Sedgewick would approve of our qsort implementation. As for the "compare a tuple to itself" thing, that's just silly, and was probably only ever seen in some homespun implementation at some point a relatively long time ago, but I've asserted against it anyway.

> I can't find any description of how this actually works... obviously,
> we'd like to find a close-to-median element, but how do we actually do
> that cheaply?

Uh, it works whatever way you want it to work. The implementation has to figure out how to get the median ahead of time.

> I doubt that this would be worthwhile -- the pivot that we pick at the
> toplevel doesn't really matter much; even if it's bad, we can recover
> just fine if the pivot we pick one level down is good, or the level
> after that.
> We only really have a problem if we keep systematically
> picking bad pivots every time. And if we do have that problem,
> improving the toplevel choice of pivot is only going to help modestly,
> because we'll still be O(n^2) on each partition.

>> Can I get a straw poll on how much of a problem worst-case performance
>> of qsort is believed to be?
>
> I'd be reluctant to remove any of the protections we currently have,
> because I bet they were all put in to fix problems that people hit in
> the field. But I don't know that we need more. Of course,
> consolidating them into qsort() itself rather than its callers
> probably makes sense, as noted above.

I was merely proposing something that would complement the med3 method and our existing protections.

>> In a perfect world, if it were somehow possible to know the perfect
>> pivots ahead of time from a histogram or something, we'd have a quick
>> sort variant with worst-case performance of O(n log(n)). That, or the
>> limited approximation that I've sketched would perhaps be worthwhile,
>> even if it was subject to a number of caveats. Wikipedia claims that
>> the worst case for quick sort is O(n log(n)) with the refinements
>> recommended by Sedgewick's 1978 paper, but this seems like nonsense to
>> me - the med3 technique is a good heuristic in practice, but it's
>> perfectly possible in theory for its sampling to always get things
>> completely wrong (this is not an unfamiliar problem).
>
> I agree. After reading
> http://nanxkurniawan.wordpress.com/2010/05/31/quicksort/ I am inclined
> to think that things are still O(n lg n) even if we always partition
> so that any fixed percentage of the data is on one side of the
> partition element and the remainder is on the other side. So for
> example if all of our partition elements are greater than 1% of the
> elements in their partition and less than the other 99%, we'll still
> be O(n lg n), though with a considerably higher constant factor, I
> think. However, if our partition elements are always a fixed NUMBER
> of elements from the end - whether it's 1 or 100 - then we'll be
> O(n^2), though again the constant factor will depend on how many
> elements you knock off each time. In practice I'm not sure whether an
> algorithmic analysis matters much, because in real life the constant
> factors are probably pretty important, hence the median-of-medians
> approach with n > 40.

Yeah, it's true that you have to be exceptionally unlucky to actually see worst-case performance, but I imagined that it might be enough of a difference to be worth pursuing. Experimentally, I can certainly see the difference (I simulated it), but it is underwhelming next to everything else. Might be worth simplifying the swap logic, given that we only ever have to swap SortTuple structs in qsort_arg, and given that it would make the meta qsort_arg that much less ugly.

>> While I'm thinking out loud, here's another idea: Have a GUC which,
>> when enabled (perhaps potentially at various different granularities),
>> makes Timsort the in-memory sorting algorithm, as it now is by default
>> for Python, Java SE7 (for arrays) and the Android platform; for
>> certain applications, this could be a real win, and I don't think that
>> it would have to be invasive: it could be dropped in.
>
> I gather from a quick read that this is supposed to be especially good
> when the data is already somewhat sorted. Would there be any merit in
> trying to guess when that's likely to be the case (e.g. based on the
> correlation statistic)?
> That seems like a stretch, but OTOH a GUC
> doesn't feel great either: what is the poor user to do with a query
> that does 2 sorts, one of which is faster with QS and the other of
> which is faster with Timsort? Ugh.

I'd imagined that it might work a bit like default_statistics_target. Of course, I was just thinking out loud. Also, we do a very good job on *perfectly* pre-sorted input because we check for pre-sorted input.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
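The pre-sorted input check mentioned above amounts to a single linear scan before any partitioning work. A minimal self-contained sketch, assuming a qsort_arg-style comparator (illustrative only; the real check is integrated into the sort itself):

    #include <stdbool.h>
    #include <stddef.h>

    typedef int (*cmp_func) (const void *a, const void *b, void *arg);

    /* Return true if no adjacent pair is out of order, i.e. the input is
     * already sorted and there is nothing left to do. */
    static bool
    input_presorted(const char *base, size_t nel, size_t elsize,
                    cmp_func cmp, void *arg)
    {
        for (size_t i = 1; i < nel; i++)
        {
            if (cmp(base + (i - 1) * elsize, base + i * elsize, arg) > 0)
                return false;
        }
        return true;
    }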
On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> As you know, all queries tested have lots and lots of duplicates (a
> ~1.5GB table that contains the same number of distinct elements as a
> ~10MB table once did), and removing the "duplicate protection" for
> btrees, on top of everything else, sees the qsort do an awful lot
> better than HEAD does with the pseudo-protection.

Does that also win vs. unpatched master? If so we may as well pull that part out and commit it separately.

>> I can't find any description of how this actually works... obviously,
>> we'd like to find a close-to-median element, but how do we actually do
>> that cheaply?
>
> Uh, it works whatever way you want it to work. The implementation has
> to figure out how to get the median ahead of time.

Oh. I'd be inclined to think that would be pretty inefficient, unless we only did it for very large partitions or when we observed that some less costly strategy was tanking.

>> I gather from a quick read that this is supposed to be especially good
>> when the data is already somewhat sorted. Would there be any merit in
>> trying to guess when that's likely to be the case (e.g. based on the
>> correlation statistic)? That seems like a stretch, but OTOH a GUC
>> doesn't feel great either: what is the poor user to do with a query
>> that does 2 sorts, one of which is faster with QS and the other of
>> which is faster with Timsort? Ugh.
>
> I'd imagined that it might work a bit like default_statistics_target.
> Of course, I was just thinking out loud. Also, we do a very good job
> on *perfectly* pre-sorted input because we check for pre-sorted input.

We occasionally have people who want to do SELECT * FROM foo WHERE a > 100 AND a < 200 ORDER BY a, b where foo has an index on (a) but not (a, b). This tends to suck, because we fail to realize that we can batch the sort. We either seqscan and filter the table then sort the results, or else we scan the index on (a) and then ignore the fact that we have data which is already partially sorted. It's particularly annoying if a is "mostly unique" and needs b only occasionally to break ties. It would be nice to improve this, but it may be more of a planner/executor problem than something we can fix directly inside the sort implementation.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
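To picture what "batching the sort" would mean here: with input already ordered by a, only each run of duplicate a values needs sorting, and only on b. A self-contained sketch of the idea (nothing like this exists in the patch or planner; it is purely illustrative):

    #include <stdlib.h>

    typedef struct { int a; int b; } Tup;

    static int
    cmp_b(const void *x, const void *y)
    {
        const Tup *p = x;
        const Tup *q = y;

        return (p->b > q->b) - (p->b < q->b);
    }

    /* Input already ordered by a (e.g. from an index scan on (a)); finish
     * ORDER BY a, b by sorting each tie group of equal a values on b alone. */
    static void
    batch_sort(Tup *tup, size_t n)
    {
        size_t i = 0;

        while (i < n)
        {
            size_t j = i + 1;

            while (j < n && tup[j].a == tup[i].a)
                j++;
            if (j - i > 1)
                qsort(tup + i, j - i, sizeof(Tup), cmp_b);
            i = j;
        }
    }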
On 6 January 2012 17:29, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 6, 2012 at 12:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> As you know, all queries tested have lots and lots of duplicates (a
>> ~1.5GB table that contains the same number of distinct elements as a
>> ~10MB table once did), and removing the "duplicate protection" for
>> btrees, on top of everything else, sees the qsort do an awful lot
>> better than HEAD does with the pseudo-protection.
>
> Does that also win vs. unpatched master? If so we may as well pull
> that part out and commit it separately.

I didn't bother isolating that, because it doesn't really make sense to (not having it is probably only of particular value when doing what I'm doing anyway, but who knows). Go ahead and commit something to remove that code (get it in both comparetup_index_btree and comparetup_index_hash), as well as the tuple1 != tuple2 test now if you like. It's patently clear that it is unnecessary, and so doesn't have to be justified as a performance enhancement - it's a simple case of refactoring for clarity. As I've said, we don't do this for heap tuples and we've heard no complaints in all that time. It was added in commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when problems with system qsorts came to light.

>> I'd imagined that it might work a bit like default_statistics_target.
>> Of course, I was just thinking out loud. Also, we do a very good job
>> on *perfectly* pre-sorted input because we check for pre-sorted input.
>
> We occasionally have people who want to do SELECT * FROM foo WHERE
> a > 100 AND a < 200 ORDER BY a, b where foo has an index on (a) but
> not (a, b). This tends to suck, because we fail to realize that we
> can batch the sort. We either seqscan and filter the table then sort
> the results, or else we scan the index on (a) and then ignore the fact
> that we have data which is already partially sorted. It's
> particularly annoying if a is "mostly unique" and needs b only
> occasionally to break ties. It would be nice to improve this, but it
> may be more of a planner/executor problem than something we can fix
> directly inside the sort implementation.

That sounds like an interesting problem. Something to look into for 9.3, perhaps.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
> I didn't bother isolating that, because it doesn't really make sense
> to (not having it is probably only of particular value when doing what
> I'm doing anyway, but who knows). Go ahead and commit something to
> remove that code (get it in both comparetup_index_btree and
> comparetup_index_hash), as well as the tuple1 != tuple2 test now if
> you like. It's patently clear that it is unnecessary, and so doesn't
> have to be justified as a performance enhancement - it's a simple case
> of refactoring for clarity. As I've said, we don't do this for heap
> tuples and we've heard no complaints in all that time. It was added in
> commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
> problems with system qsorts came to light.

Actually, I'm going to object to reverting that commit, as I believe that having equal-keyed index entries in physical table order may offer some performance benefits at access time. If you don't like the comment, we can change that.

regards, tom lane
On 6 January 2012 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <peter@2ndquadrant.com> writes:
>> I didn't bother isolating that, because it doesn't really make sense
>> to (not having it is probably only of particular value when doing what
>> I'm doing anyway, but who knows). Go ahead and commit something to
>> remove that code (get it in both comparetup_index_btree and
>> comparetup_index_hash), as well as the tuple1 != tuple2 test now if
>> you like. It's patently clear that it is unnecessary, and so doesn't
>> have to be justified as a performance enhancement - it's a simple case
>> of refactoring for clarity. As I've said, we don't do this for heap
>> tuples and we've heard no complaints in all that time. It was added in
>> commit fbac1272b89b547dbaacd78bbe8da68e5493cbda, presumably when
>> problems with system qsorts came to light.
>
> Actually, I'm going to object to reverting that commit, as I believe
> that having equal-keyed index entries in physical table order may offer
> some performance benefits at access time. If you don't like the
> comment, we can change that.

Please explain your position. When is this supposed to be useful? Even if you can present an argument for keeping it, it has to be weighed against the fact that people don't tend to have much use for indexes with lots of duplicates anyway. Prior to 2004, we did not do this - it was justified as insurance against bad qsort implementations only.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
> On 6 January 2012 18:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Actually, I'm going to object to reverting that commit, as I believe
>> that having equal-keyed index entries in physical table order may offer
>> some performance benefits at access time. If you don't like the
>> comment, we can change that.

> Please explain your position. When is this supposed to be useful?

When there are lots of duplicates of a particular indexed value, the existing code will cause an indexscan to search them in physical order, whereas if we remove the existing logic it'll be random --- in particular, accesses to the same heap page can no longer be expected to be clustered. Admittedly, I don't have any numbers quantifying just how useful that might be, but on the other hand you've not presented any evidence justifying removing the behavior either. If we believe your position that indexes don't generally have lots of duplicates, then the code in question will seldom be reached and therefore there would be no performance benefit in removing it.

regards, tom lane
On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Admittedly, I don't have any numbers quantifying just how useful that
> might be, but on the other hand you've not presented any evidence
> justifying removing the behavior either. If we believe your position
> that indexes don't generally have lots of duplicates, then the code in
> question will seldom be reached and therefore there would be no
> performance benefit in removing it.

Obviously, many indexes are unique and thus won't have duplicates at all. But if someone creates an index and doesn't make it unique, odds are very high that it has some duplicates. Not sure how many we typically expect to see, but more than zero...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6 January 2012 21:14, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> When there are lots of duplicates of a particular indexed value, the
> existing code will cause an indexscan to search them in physical order,
> whereas if we remove the existing logic it'll be random --- in
> particular, accesses to the same heap page can no longer be expected to
> be clustered.

Isn't it possible to get them in physical order anyway, by reading them into memory in that order? Efficient quick sort implementations are not stable, and ours is no exception, but we could perhaps come up with a cheaper tie-breaker value at that stage, if you're determined to maintain this behaviour. We have sufficient incentive to, as I describe below.

> Admittedly, I don't have any numbers quantifying just how useful that
> might be, but on the other hand you've not presented any evidence
> justifying removing the behavior either. If we believe your position
> that indexes don't generally have lots of duplicates, then the code in
> question will seldom be reached and therefore there would be no
> performance benefit in removing it.

I ran the same btree benchmark on master, but without the "cheap insurance". The results were interesting, to say the least. The columns indexed were the same columns and data that I've been using all along. Initially this made sense, as the performance of multiple-sortkey sorts often largely hinged on being able to get away with doing one comparison per pair of tuples - with many duplicates, I could avoid cheating and show something closer to worst case for the patch. I didn't think it mattered that indexing the same columns would produce what happened to be a not so useful index in the real world, due to having so many duplicates - better to have figures that were somewhat comparable for btree tuple sorting and heap tuple sorting.

When I ran the same benchmark on a server that differed from master only in that there was no insurance, it momentarily appeared that *all* of the gains for btree index creation came from being able to elide the "cheap insurance", but only where it would have to be paid for a high number of times. I soon realised that I'd made a blunder: the code (that is, the patch that I posted most recently) wasn't even using my specialisation for qsorting, because the SortSupport pointer was null! I did not have tuplesort_begin_index_btree initialise the SortSupport struct as tuplesort_begin_heap did, so my earlier benchmark was effectively meaningless, except that it indicated the benefits of eliding the cheap insurance alone, if only for that not so compelling case. You should note that the benefits of not paying for the insurance can be very significant indeed.

Attached are figures for an identical run of the same btree python script, but with a version of the patch that actually uses my specialisations. Granted, these numbers are still partially predicated on the index in question having a large number of duplicates, but it's worth noting:

1. The gain from specialisation isn't bad; not as good as the improvements we saw for heap tuples, but not so bad either, especially considering that binary bloat should be much less controversial for btree tuples.

2. The index that results from the tests is still useful; the planner is perfectly willing to use it rather than performing an in-memory sort. It will also use it to satisfy a query like "select * from orderlines where prod_id = 5", albeit via a bitmap index scan.
I took the precaution of increasing default_statistics_target to its maximum value, as well as performing an analyze on orderlines in advance of checking this. A revision of this patch that fixes the bug is to follow - I produced these new numbers from a rough cut of that.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
> Obviously, many indexes are unique and thus won't have duplicates at
> all. But if someone creates an index and doesn't make it unique, odds
> are very high that it has some duplicates. Not sure how many we
> typically expect to see, but more than zero...

Peter may not, but I personally admin lots of databases which have indexes on values like "category" or "city" which have 100's or 1000's of duplicates per value. I don't think this is uncommon at all.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 9 January 2012 19:45, Josh Berkus <josh@agliodbs.com> wrote:
>> Obviously, many indexes are unique and thus won't have duplicates at
>> all. But if someone creates an index and doesn't make it unique, odds
>> are very high that it has some duplicates. Not sure how many we
>> typically expect to see, but more than zero...
>
> Peter may not, but I personally admin lots of databases which have
> indexes on values like "category" or "city" which have 100's or 1000's
> of duplicates per value. I don't think this is uncommon at all.

Uh, then all the more reason to do what I recommend, I imagine. There is most definitely a large overhead to creating such indexes, at least for scalar types. As far as I can tell, Tom's complaint is quite speculative.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On 6 January 2012 21:47, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jan 6, 2012 at 4:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Admittedly, I don't have any numbers quantifying just how useful that
>> might be, but on the other hand you've not presented any evidence
>> justifying removing the behavior either. If we believe your position
>> that indexes don't generally have lots of duplicates, then the code in
>> question will seldom be reached and therefore there would be no
>> performance benefit in removing it.

I have decided on a tactical retreat in relation to this patch. This has been dragging on since September. I cannot risk not having the patch accepted for 9.2, due to trying to handle both heap and btree tuple sorting at once - the additional relatively modest improvements for common cases that it will bring to btree index creation time do not warrant digging my heels in to cover that case in one larger commit. For that reason, I attach for your consideration a revision of the patch without any support for btree index tuples whatsoever (though I have adjusted btree tuplesort comments, and one tiny piece of code, in a non-controversial way). I'll revisit btree sorting towards the end of the commitfest, circumstances permitting.

As to the question of binary bloat, I have devised a pgbench-tools workload that I think will go some way towards reassuring those who are concerned about its distributed costs. The attached spreadsheet has all the relevant details. A custom script has been specified (details of which are evident from the benchmark results themselves). If we experienced some kind of noticeable marginalisation of usefully cached instructions, that's obviously where it would show up. Note that I have taken the preparatory step of updating some tables with random data, rather than the sequential data that they have by default, which, unusually for such large tables, can allow us to avail of our pre-sorted input check optimisation, spoiling things. I did vacuumdb immediately afterwards, immediately before the pgbench run in each case. I've kept the test at 8 clients/8 threads, on a machine with 4 physical cores + hyperthreading for 8 logical ("Intel Core(TM) i7 CPU 870 @ 2.93GHz", with 4 x 32 KB instruction caches, among several other CPU caches) on a newly-initialised, throw-away database, single run at scale 50 for 15 minutes in all cases. This is the same server that I have used throughout.

My analysis is that although there may be a very slight regression in non-affected queries (it's a tough call to make - how queries happen to coincide undoubtedly muddies the waters, as does the fact that we cut lots of time from periods in which backends hold a lot of locally allocated sort memory - it might actually be a win for those other queries sometimes but not others, and it appears that the margin either way is very small), that is more than compensated for by the benefit of the specialisations. The CPU cache is doing its job as expected, and we clearly see a net benefit from its preference for caching more instructions that are specialised - those sort operations are way more expensive (if they weren't, it wouldn't cache them so heavily). I also include results for running the same query again and again with a single client, to put that effect in context (though note that previous benchmarks avoided paying a high client overhead by using explain analyze, and indeed originally compared pre-SortSupport Postgres, so these numbers aren't as good either).
It's unusual for a database workload to be so heavily CPU bound, so I'd suggest that the latter benchmark is more representative than the former. Either way, we win by some margin. If the queries didn't have such a high client overhead, as for example with a sort node that feeds a merge join, we'd do better still. If a sorting specialisation is never used anyway, the overhead, for practical purposes, is zero. I believe that sorting specialisations are just too useful to not be a net win in almost all reasonable cases.

Why haven't I used all specialisations at once, rather than only two (one for a single int4 key, the other for multiple keys)? Well, I might have used all of them, but there were no floats available in the pgbench tables, and besides, the chances of all of them being simultaneously in play during any sufficiently short period for marginalisation of CPU cache contents to be of particular concern is, in general, not all that great.

A major systems programming language, C++, produces multiple specialisations as its standard library's sort function is used, for example, each of which will have separate entries in the procedure linkage table (though various implementation-defined techniques are frequently used to reduce the number of copies across translation units at link time). If the programmer specifies either a different datatype, or a different comparator (through the use of an alternative to the default std::less<T> functor/predicate), a whole new specialisation is generated by the compiler. This does raise concerns in relation to binary bloat, but they're reasonably well understood, and this is the default way of doing things, for general purpose application development. Now, I know that Postgres isn't written in C++, and you might not find "C++ does it" to be a particularly compelling argument, but it does go to show that these ideas are not by any stretch of the imagination radical.

I remind everyone that the improvement seen in raw qsort_arg runtime is fairly large, as it would have to be, in order to bring such large though proportionally smaller improvements to each given affected query as a whole - there are of course many other sources of overhead involved in parsing, planning and executing affected queries.

Here is a complete detailing of the specialisations that this latest revision produces:

* int4, single key (supports date too)
* int8, multiple keys (supports timestamps too, with and without TZ, where we HAVE_INT64_TIMESTAMP)
* float4, single key
* float8, multiple keys
* Type-generic single key specialisation. Expected to deliver some of the same additional benefits for types that do not merit their own specialisations but would still benefit, like name, int2, etc, but is used all the time where applicable, even for types like text for which it is expected that there will be no appreciable benefit.

Thoughts?

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
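The type sharing in that list works because the types grouped together have the same binary representation for comparison purposes; for example, with HAVE_INT64_TIMESTAMP a timestamp is stored as an int64, so an inlined int8 comparator covers it unchanged. A hedged illustration (not the patch's actual code):

    #include "postgres.h"

    /*
     * Sketch: with HAVE_INT64_TIMESTAMP, a Timestamp datum holds an int64,
     * so one inlined int8 comparator serves int8, timestamp and timestamptz
     * alike when stamped into a sort specialisation.
     */
    static inline int
    int8_fastcmp(Datum x, Datum y)
    {
        int64       a = DatumGetInt64(x);
        int64       b = DatumGetInt64(y);

        if (a < b)
            return -1;
        if (a > b)
            return 1;
        return 0;
    }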
I decided to take advantage of my ongoing access to a benchmarking server to see how I could get on with a query with an especially large sort. Now, that box has 16GB of ram, and some people have that much memory in their laptops these days, so I was somewhat limited in how far I could push things. I eventually decided upon much the same benchmark as I'd made in my previous e-mail (the query "SELECT * FROM pgbench_accounts ORDER BY bid;", but primed with random data, while being sure to subsequently vacuum). I stress that I have not made any attempt to artificially remove client overhead. I have, however, used a faster disk (caches were not warmed in a separate run of pgbench or anything like that for either of my two most recent benchmarks, so there may have been and indeed may still be some small bias there), in addition to connecting pgbench with the -h option. Apparently a known problem with Linux kernels using the Completely Fair Scheduler introduced in 2.6.23 is that it does not schedule the pgbench program very well when it's connecting to the database using Unix-domain sockets. I'm not sure that this had all that much potential to spoil my results, but didn't want to take the chance.

Anyway, the results of running this latest benchmark, with 20 successive runs of each large query, were:

Patch:

pghost: localhost pgport: nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.030924 (including connections establishing)
tps = 0.030924 (excluding connections establishing)
statement latencies in milliseconds:
    31163.957400  SELECT * FROM pgbench_accounts ORDER BY bid;

Master:

pghost: localhost pgport: nclients: 1 nxacts: 20 dbName: pgbench
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
number of transactions per client: 20
number of transactions actually processed: 20/20
tps = 0.026769 (including connections establishing)
tps = 0.026769 (excluding connections establishing)
statement latencies in milliseconds:
    36179.508750  SELECT * FROM pgbench_accounts ORDER BY bid;

That's a larger proportional improvement than reported on Thursday, and at significantly greater scale.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 19, 2012 at 1:47 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Thoughts?

I generated some random data using this query:

create table nodups (g) as select (g%10)*1000+g/10 from generate_series(1,10000) g;

And then used pgbench to repeatedly sort it using this query:

select * from nodups order by g offset 10001;

The patch came out about 28% faster than master. Admittedly, that's with no client overhead, but still: not bad.

I don't like the API you've designed, though: as you have it, PrepareSortSupportFromOrderingOp does this after calling the sort support function:

+ ssup->usable_compar = ResolveComparatorProper(sortFunction);

I think it would be better to simply have the sort support functions set usable_compar themselves. That would allow any user-defined functions that happen to have the same binary representation and comparison rules as one of the types for which we supply a custom qsort() to initialize it and still make use of the optimization. There's no real reason to have a separate switch to decide how to initialize that field: the sort support trampoline already does that, and I don't see any reason to introduce a second way of doing the same thing.

I am also a little unhappy that we have to duplicate the fastcmp functions from nbtcompare.c in builtins.h. Wouldn't it make more sense to declare those functions as inline in nbtcompare.c, and then call the qsort-generating macro from that file?

There were a couple of comment updates in tuplesort.c that looked independent from the rest of the patch, so I've committed those separately. I also committed your change to downgrade the belt-and-suspenders check for self-comparison to an assert, with some rewording of your proposed comment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 26 January 2012 19:45, Robert Haas <robertmhaas@gmail.com> wrote:
> The patch came out about 28% faster than master. Admittedly, that's
> with no client overhead, but still: not bad.

Thanks. There was a 28% reduction in the time it took to execute the query, but there would have also been a larger reduction in the time that the backend held all of that locally-allocated memory. That might also be worth instrumenting directly, by turning on "trace_sort" - can you report numbers on that, please?

> I don't like the API you've designed, though: as you have it,
> PrepareSortSupportFromOrderingOp does this after calling the sort
> support function:
>
> + ssup->usable_compar = ResolveComparatorProper(sortFunction);
>
> I think it would be better to simply have the sort support functions
> set usable_compar themselves. That would allow any user-defined
> functions that happen to have the same binary representation and
> comparison rules as one of the types for which we supply a custom
> qsort() to initialize it and still make use of the optimization.
> There's no real reason to have a separate switch to decide how to
> initialize that field: the sort support trampoline already does that,
> and I don't see any reason to introduce a second way of doing the same
> thing.

Hmm. You're right. I can't believe that that didn't occur to me. In practice, types that use the SortSupport API are all going to be façades on scalar types anyway, much like date and timestamp, and of those a good proportion will surely have the same comparator representation as the specialisations introduced by this patch. It might be that virtually all third-party types that end up using the API can avail of some specialisation.

> I am also a little unhappy that we have to duplicate the fastcmp
> functions from nbtcompare.c in builtins.h. Wouldn't it make more
> sense to declare those functions as inline in nbtcompare.c, and then
> call the qsort-generating macro from that file?

Maybe it would, but since the meta-qsort_arg introduced includes partial duplicates of code from tuplesort.c, it kind of felt right to "instantiate" specialisations there. It may be that doing it in nbtcompare.c is the best option available to us. Off the top of my head, I'm pretty sure that that's a good bit less code.

> There were a couple of comment updates in tuplesort.c that looked
> independent from the rest of the patch, so I've committed those
> separately. I also committed your change to downgrade the
> belt-and-suspenders check for self-comparison to an assert, with some
> rewording of your proposed comment.

That seems reasonable.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
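A sketch of the shape being agreed to here, reusing names from the thread: usable_compar is the patch's field, btint4fastcmp is one of the nbtcompare.c fastcmp functions mentioned above, and qsort_tuple_int4 is a hypothetical name standing in for whatever the int4 specialisation is actually called. This is illustrative only, not the committed code:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/sortsupport.h"

    /*
     * Sketch: the type's own sort support routine installs the specialised
     * qsort itself, instead of a central switch resolving it afterwards.
     * btint4fastcmp and qsort_tuple_int4 are assumed declared elsewhere.
     */
    Datum
    btint4sortsupport(PG_FUNCTION_ARGS)
    {
        SortSupport ssup = (SortSupport) PG_GETARG_POINTER(0);

        ssup->comparator = btint4fastcmp;       /* ordinary fast comparator */
        ssup->usable_compar = qsort_tuple_int4; /* opt in to specialisation */

        PG_RETURN_VOID();
    }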
On Thu, Jan 26, 2012 at 4:09 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 26 January 2012 19:45, Robert Haas <robertmhaas@gmail.com> wrote:
>> The patch came out about 28% faster than master. Admittedly, that's
>> with no client overhead, but still: not bad.
>
> Thanks. There was a 28% reduction in the time it took to execute the
> query, but there would have also been a larger reduction in the time
> that the backend held all of that locally-allocated memory. That
> might also be worth instrumenting directly, by turning on "trace_sort"
> - can you report numbers on that, please?

Apparently not. The sort is too short to register in the trace_sort output. I just get:

LOG: begin tuple sort: nkeys = 1, workMem = 1024, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: internal sort ended, 853 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec

>> I don't like the API you've designed, though: as you have it,
>> PrepareSortSupportFromOrderingOp does this after calling the sort
>> support function:
>>
>> + ssup->usable_compar = ResolveComparatorProper(sortFunction);
>>
>> I think it would be better to simply have the sort support functions
>> set usable_compar themselves. That would allow any user-defined
>> functions that happen to have the same binary representation and
>> comparison rules as one of the types for which we supply a custom
>> qsort() to initialize it and still make use of the optimization.
>> There's no real reason to have a separate switch to decide how to
>> initialize that field: the sort support trampoline already does that,
>> and I don't see any reason to introduce a second way of doing the same
>> thing.
>
> Hmm. You're right. I can't believe that that didn't occur to me. In
> practice, types that use the SortSupport API are all going to be
> façades on scalar types anyway, much like date and timestamp, and of
> those a good proportion will surely have the same comparator
> representation as the specialisations introduced by this patch. It
> might be that virtually all third-party types that end up using the
> API can avail of some specialisation.

Possibly. At a minimum it keeps the door open.

>> I am also a little unhappy that we have to duplicate the fastcmp
>> functions from nbtcompare.c in builtins.h. Wouldn't it make more
>> sense to declare those functions as inline in nbtcompare.c, and then
>> call the qsort-generating macro from that file?
>
> Maybe it would, but since the meta-qsort_arg introduced includes
> partial duplicates of code from tuplesort.c, it kind of felt right to
> "instantiate" specialisations there. It may be that doing it in
> nbtcompare.c is the best option available to us. Off the top of my
> head, I'm pretty sure that that's a good bit less code.

I was hoping so...

>> There were a couple of comment updates in tuplesort.c that looked
>> independent from the rest of the patch, so I've committed those
>> separately. I also committed your change to downgrade the
>> belt-and-suspenders check for self-comparison to an assert, with some
>> rewording of your proposed comment.
>
> That seems reasonable.

Cool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Alright, so while I agree with everything you've asked for, the fact is that there is a controversy in relation to binary bloat, and that's the blocker here. How can we satisfactorily resolve that, or is that question adequately addressed by the benchmark that I produced? What if third party modules could include the template_qsort_arg.h header, and instantiate their own specialisation? The sort support function could either set an enum, or set a pointer to a qsort_arg specialisation, if their comparator was a little different, but still just a few instructions, as with float-based timestamps (I don't care enough about them to provide one in core, though). It would also essentially allow for user-defined sort functions, provided they fulfilled a basic interface. They may not even have to be comparison-based. I know that I expressed scepticism about the weird and wonderful ideas that some people have put forward in that area, but that's mostly because I don't think that GPU based sorting in a database is going to be practical. Why not add this capability to the SortSupport API, given that it wouldn't be very hard? The user would set the enum too, so it would appear in explain analyze output as "custom sort" or something like that. While a sorting specialisation of our qsort_arg is actually pretty close to optimal for scalar types and façades thereof, there could be a type that would benefit from using radix sort, for example, which isn't even comparison-based, and a user couldn't very well do that with the existing SortSupport API. I certainly don't care about this capability enough to defend it against any objections that anyone may have, especially at this late stage in the cycle. I just think that we might as well have it. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 26, 2012 at 5:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > Alright, so while I agree with everything you've asked for, the fact > is that there is a controversy in relation to binary bloat, and that's > the blocker here. How can we satisfactorily resolve that, or is that > question adequately addressed by the benchmark that I produced? I could go either way on that, honestly. I took a look today and found out that on my machine, a postgres 9.0.x executable was about 282kB larger than an 8.4 postgres executable. 9.1 was about 386kB larger than that, and 9.2 current is 239kB larger still. So it's not as if your patch is the first one to enlarge the size of the binary - it's obviously been growing steadily from release to release for years. I found that your patch adds 55kB to the executable size, which is clearly a lot more than what a typical patch adds, but still under 1%. So I wouldn't be shocked and appalled if we decided to commit this as-is. But if we want to put it on a diet, the first thing I'd probably be inclined to lose is the float4 specialization. Some members of the audience will recall that I take a dim view of floating point arithmetic generally, but I'll concede that there are valid reasons for using float8. I have a harder time coming up with a good reason to use float4 - ever, for anything you care about. So I would be inclined to think that if we want to trim this back a bit, maybe that's the one to let go. If we want to be even more aggressive, the next thing I'd probably lose is the optimization of multiple sortkey cases, on the theory that single sort keys are probably by far the most common practical case. I'm not surprised that you weren't able to measure a performance regression from the binary bloat. Any such regression is bound to be very small and probably quite difficult to notice most of the time; it's really the cumulative effect of many binary-size-increasing changes we're worried about, not each individual one. Certainly, trying to shrink the binary is micro-optimization at its finest, but then so is inlining comparators. I don't think it can be realistically argued that increasing the size of the binary arbitrarily will never get us in trouble, much like (for a typical American family) spending $30 to have dinner at a cheap restaurant will never break the budget. But if you do it every day, it gets expensive (and fattening). > What if third party modules could include the template_qsort_arg.h > header, and instantiate their own specialisation? Seems reasonable to me. > The sort support > function could either set an enum, or set a pointer to a qsort_arg > specialisation, if their comparator was a little different, but still The latter seems better. > just a few instructions, as with float-based timestamps (I don't care > enough about them to provide one in core, though). It would also > essentially allow for user-defined sort functions, provided they > fulfilled a basic interface. They may not even have to be > comparison-based. I know that I expressed scepticism about the weird > and wonderful ideas that some people have put forward in that area, > but that's mostly because I don't think that GPU based sorting in a > database is going to be practical. A question for another day. > Why not add this capability to the SortSupport API, given that it > wouldn't be very hard? The user would set the enum too, so it would > appear in explain analyze output as "custom sort" or something like > that.
I'm unenthused about going to any trouble to change the EXPLAIN format. > While a sorting specialisation of our qsort_arg is actually pretty > close to optimal for scalar types and façades thereof, there could be > a type that would benefit from using radix sort, for example, which > isn't even comparison-based, and a user couldn't very well do that > with the existing SortSupport API. > > I certainly don't care about this capability enough to defend it > against any objections that anyone may have, especially at this late > stage in the cycle. I just think that we might as well have it. I don't see any reason not to, assuming it's not a lot of code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
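For what it's worth, the third-party usage just agreed to might look roughly like this. It is a sketch only: the instantiation macro's name and signature are assumptions about what template_qsort_arg.h would expose, not the patch's actual interface, while SortSupport and DatumGetFloat8 are the existing backend facilities:

/* Hypothetical extension code; TEMPLATE_QSORT_ARG is an assumed,
 * simplified instantiation macro. */
#include "postgres.h"
#include "utils/sortsupport.h"
#include "template_qsort_arg.h"

/* A float-based timestamp compares like a plain C double. */
static inline int
float_ts_fastcmp(Datum x, Datum y, SortSupport ssup)
{
    double  a = DatumGetFloat8(x);
    double  b = DatumGetFloat8(y);

    if (a < b)
        return -1;
    if (a > b)
        return 1;
    return 0;
}

/* Expands to a complete qsort_arg specialisation with the comparator
 * inlined at every comparison site. */
TEMPLATE_QSORT_ARG(float_ts, float_ts_fastcmp);

The module's sortsupport function would then install the generated sort through whatever pointer the final API settles on.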
On 27 January 2012 03:32, Robert Haas <robertmhaas@gmail.com> wrote: > But if we want to put it on a diet, the first thing I'd probably be > inclined to lose is the float4 specialization. Some members of the > audience will recall that I take a dim view of floating point arithmetic > generally, but I'll concede that there are valid reasons for using > float8. I have a harder time coming up with a good reason to use > float4 - ever, for anything you care about. So I would be inclined to > think that if we want to trim this back a bit, maybe that's the one to > let go. If we want to be even more aggressive, the next thing I'd > probably lose is the optimization of multiple sortkey cases, on the > theory that single sort keys are probably by far the most common > practical case. Obviously I don't think that we should let anything go, as the improvement in performance is so large that we're bound to be ahead - we only really pay for what we use anyway, and when we use a specialisation the difference is really quite big, particularly when you look at sorting in isolation. If a specialisation is never used, it is more or less never paid for, so there's no point in worrying about that. That said, float4 is obviously the weakest link. I'm inclined to think that float8 is the second weakest though, mostly because we get both dates and timestamps "for free" with the integer specialisations. > I'm not surprised that you weren't able to measure a performance > regression from the binary bloat. Any such regression is bound to be > very small and probably quite difficult to notice most of the time; > it's really the cumulative effect of many binary-size-increasing > changes we're worried about, not each individual one. Certainly, > trying to shrink the binary is micro-optimization at its finest, but > then so is inlining comparators. I don't think it can be > realistically argued that increasing the size of the binary > arbitrarily will never get us in trouble, much like (for a typical > American family) spending $30 to have dinner at a cheap restaurant > will never break the budget. But if you do it every day, it gets > expensive (and fattening). Sure. At the risk of stating the obvious, and of repeating myself, I will point out that the true cost of increasing the size of the binary is not necessarily linear - it's a complex equation. I hope that this doesn't sound flippant, but if some naive person were to look at just the increasing binary size of Postgres and its performance in each successive release, they might conclude that there was a positive correlation between the two (since we didn't add flab to the binary, but muscle that pulls its own weight and then some). At the continued risk of stating the obvious, CPUs don't just cache instructions - they cache data too. If we spend less than half the time sorting data, which is the level of improvement I was able to demonstrate against pre-SortSupport Postgres, that will surely very often have the aggregate effect of ameliorating cache contention between cores. >> just a few instructions, as with float-based timestamps (I don't care >> enough about them to provide one in core, though). It would also >> essentially allow for user-defined sort functions, provided they >> fulfilled a basic interface. They may not even have to be >> comparison-based.
I know that I expressed scepticism about the weird >> and wonderful ideas that some people have put forward in that area, >> but that's mostly because I don't think that GPU based sorting in a >> database is going to be practical. > > A question for another day. Fair enough. >> I certainly don't care about this capability enough to defend it >> against any objections that anyone may have, especially at this late >> stage in the cycle. I just think that we might as well have it. > > I don't see any reason not to, assuming it's not a lot of code. Good. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Thu, Jan 26, 2012 at 11:36 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> I'm not surprised that you weren't able to measure a performance >> regression from the binary bloat. Any such regression is bound to be >> very small and probably quite difficult to notice most of the time; >> it's really the cumulative effect of many binary-size-increasing >> changes we're worried about, not each individual one. Certainly, >> trying to shrink the binary is micro-optimization at its finest, but >> then so is inlining comparators. I don't think it can be >> realistically argued that increasing the size of the binary >> arbitrarily will never get us in trouble, much like (for a typical >> American family) spending $30 to have dinner at a cheap restaurant >> will never break the budget. But if you do it every day, it gets >> expensive (and fattening). > > Sure. At the risk of stating the obvious, and of repeating myself, I > will point out that the true cost of increasing the size of the binary > is not necessarily linear - it's a complex equation. I hope that this > doesn't sound flippant, but if some naive person were to look at just > the increasing binary size of Postgres and its performance in each > successive release, they might conclude that there was a positive > correlation between the two (since we didn't add flab to the binary, > but muscle that pulls its own weight and then some). I completely agree. So the point is that, when faced with a patch that adds an atypically large number of CPU instructions, we ought to ask ourselves whether those instructions are pulling their weight. By way of comparison, the latest posted version of Tom's generalized index-only paths patch adds 14kB to the resulting executable (on my Mac). Yours adds 55kB. We might ask ourselves whether the benefits of this patch are four times greater than the benefits of Tom's patch. It's pretty hard to compare them directly since they're doing very different things, and all of this is completely subjective, but I doubt it. On the other hand, there is no rule that every byte of code that gets committed must be made of solid gold, either. So, once again: judgement call. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Uh, obviously I meant causal relationship and not correlation. On 27 January 2012 13:37, Robert Haas <robertmhaas@gmail.com> wrote: > I completely agree. So the point is that, when faced with a patch that > adds an atypically large number of CPU instructions, we ought to ask > ourselves whether those instructions are pulling their weight. By way > of comparison, the latest posted version of Tom's generalized > index-only paths patch adds 14kB to the resulting executable (on my > Mac). Yours adds 55kB. We might ask ourselves whether the benefits > of this patch are four times greater than the benefits of Tom's patch. > It's pretty hard to compare them directly since they're doing very > different things, and all of this is completely subjective, but I > doubt it. On the other hand, there is no rule that every byte of code > that gets committed must be made of solid gold, either. So, once > again: judgement call. Well, I don't think it's all that subjective - it's more the case that it is just difficult, or it gets that way as you consider more specialisations. As for what types/specialisations may not make the cut, I'm increasingly convinced that floats (in the following order: float4, float8) should be the first to go. Aside from the fact that we cannot use their specialisations for anything like dates and timestamps, floats are just way less useful than integers in the context of database applications, or at least those that I've been involved with. As important as floats are in the broad context of computing, it's usually only acceptable to store data in a database as floats within scientific applications, and only then when their limitations are well-understood and acceptable. I think we've all heard anecdotes at one time or another, involving their limitations not being well understood. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > Well, I don't think it's all that subjective - it's more the case that > it is just difficult, or it gets that way as you consider more > specialisations. Sure it's subjective. Two well-meaning people could have different opinions without either of them being "wrong". If you do a lot of small, in-memory sorts, more of this stuff is going to seem worthwhile than if you don't. > As for what types/specialisations may not make the cut, I'm > increasingly convinced that floats (in the following order: float4, > float8) should be the first to go. Aside from the fact that we cannot > use their specialisations for anything like dates and timestamps, > floats are just way less useful than integers in the context of > database applications, or at least those that I've been involved with. > As important as floats are in the broad context of computing, it's > usually only acceptable to store data in a database as floats within > scientific applications, and only then when their limitations are > well-understood and acceptable. I think we've all heard anecdotes at > one time or another, involving their limitations not being well > understood. While we're waiting for anyone else to weigh in with an opinion on the right place to draw the line here, do you want to post an updated patch with the changes previously discussed? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 27 January 2012 14:37, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> Well, I don't think it's all that subjective - it's more the case that >> it is just difficult, or it gets that way as you consider more >> specialisations. > > Sure it's subjective. Two well-meaning people could have different > opinions without either of them being "wrong". If you do a lot of > small, in-memory sorts, more of this stuff is going to seem worthwhile > than if you don't. But if you don't, then you're not going to have your cache compromised, so the cost is limited to having to store a few tens of kilobytes of extra binary executable data on disk, and perhaps main-memory, that you wouldn't otherwise have to - a cost that is virtually indistinguishable from zero. When you do eventually need to do some in-memory sort, you get to have that go significantly faster, and since you don't have much use for the specialisations anyway, you get that with essentially no down-side. The concern is a perfect storm of all specialisations being simultaneously used such that it'd be more efficient to use a generic qsort. I think that's pretty implausible - the general assumption is that database applications are not frequently CPU bound. They're assumed to be frequently memory-bound though, so any effort to reduce memory consumption - which this patch effectively does - is probably going to be more valuable. Even if we suppose that the perfect storm can and does occur, on a chip that is so starved of instruction cache that it turns out to be a net loss, surely even then the perfect storm is a rare occurrence, and the aggregate effect is that such users still benefit. Besides, Postgres performance optimisations for which you can contrive a case that results in a net-loss in performance are well precedented. >> As for what types/specialisations may not make the cut, I'm >> increasingly convinced that floats (in the following order: float4, >> float8) should be the first to go. Aside from the fact that we cannot >> use their specialisations for anything like dates and timestamps, >> floats are just way less useful than integers in the context of >> database applications, or at least those that I've been involved with. >> As important as floats are in the broad context of computing, it's >> usually only acceptable to store data in a database as floats within >> scientific applications, and only then when their limitations are >> well-understood and acceptable. I think we've all heard anecdotes at >> one time or another, involving their limitations not being well >> understood. > > While we're waiting for anyone else to weigh in with an opinion on the > right place to draw the line here, do you want to post an updated > patch with the changes previously discussed? Patch is attached. I have not changed the duplicate functions. This is because I concluded that keeping them was the lesser of two evils; having the template generate both declarations in the header file and definitions in the .c file seemed particularly obscure. We're never going to have to expose/duplicate any more comparators anyway. Do you agree? It's pretty easy to remove a specialisation at any time - just remove less than 10 lines of code. It's also pretty difficult to determine, with everyone's absolute confidence, where the right balance lies.
Perhaps the sensible thing to do is to not be so conservative in what we initially commit, while clearly acknowledging that we may not have the balance right, and that it may have to change. We then have the entire beta part of the cycle in which to decide to roll back from that position, without any plausible downside. If, on the other hand, we conservatively lean towards fewer specialisations in the initial commit, no one will complain about the lack of an improvement in performance that they never had. Tom's objections related to the total number of specialisations, and their distributed costs - the very idea of full specialisations was not objected to. I think it's fair to say that there is no controversy at all remaining about whether or not we should have *some* number of specialisations. Therefore, I'm going to suggest that, assuming you have no further objections to the style of the code and no one else voices any other objections in the next couple of days, you provisionally commit this latest revision with all of its specialisations, while putting people on notice about this. I think that possibly the one remaining blocker to tentatively committing this with all specialisations intact is that I haven't tested this on Windows, as I don't currently have access to a Windows development environment. I have set one up before, but it's a huge pain. Can anyone help me out? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > Patch is attached. I have not changed the duplicate functions. This is > because I concluded that keeping them was the lesser of two evils; > having the template generate both declarations in the header file and > definitions in the .c file seemed particularly obscure. We're > never going to have to expose/duplicate any more comparators anyway. > Do you agree? Not really. You don't really need macros to generate the prototypes; you could just write them out longhand. I think there's a mess of naming confusion in here, though, as perhaps best illustrated by this macro definition: #define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR) \ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap, \ SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline) \ DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap, \ MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc), \ TYPE##regheapcomparetup_inline) The idea here is that when we have only a single sort key, we include SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we have more than one, we instead include MULT_ADDITIONAL_CODE. Right there, I think we have a naming problem, because abbreviating "single" to "sing" and "multiple" to "mult" is less than entirely clear. For a minute or two I was trying to figure out whether our sorting code was musically inclined, and I'm a native English speaker. But then we switch to another set of terminology completely for the generated functions: inlheap for the single-key case, and regheap for the multiple-key case. I find that even more confusing. I think we ought to get rid of this: +typedef enum TypeCompar +{ + TYPE_COMP_OTHER, + TYPE_COMP_INT4, + TYPE_COMP_INT8, + TYPE_COMP_FLOAT4, + TYPE_COMP_FLOAT8, + TYPE_COMP_FULL_SPECIALISATION +} TypeCompar; Instead, just modify SortSupportData to have two function pointers as members, one for the single-key case and another for the multiple-key case, and have the sortsupport functions initialize them to the actual functions that should be called. The layer of indirection, AFAICS, serves no purpose. > It's pretty easy to remove a specialisation at any time - just remove > less than 10 lines of code. It's also pretty difficult to determine, > with everyone's absolute confidence, where the right balance lies. > Perhaps the sensible thing to do is to not be so conservative in what > we initially commit, while clearly acknowledging that we may not have > the balance right, and that it may have to change. We then have the > entire beta part of the cycle in which to decide to roll back from > that position, without any plausible downside. If, on the other hand, > we conservatively lean towards fewer specialisations in the initial > commit, no one will complain about the lack of an improvement in > performance that they never had. Eh, really? Typically when we do something good, the wolves are howling at the door to make it work in more cases. > I think that possibly the one remaining blocker to tentatively > committing this with all specialisations intact is that I haven't > tested this on Windows, as I don't currently have access to a Windows > development environment. I have set one up before, but it's a huge > pain. Can anyone help me out? This doesn't strike me as terribly OS-dependent, unless by that we mean compiler-dependent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
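Concretely, the two-function-pointer arrangement might look like the sketch below; the member and function names are invented for illustration, and the existing SortSupportData members are elided:

#include <stddef.h>

typedef struct SortTuple SortTuple;
typedef struct Tuplesortstate Tuplesortstate;

typedef struct SortSupportData
{
    /* ... existing members elided ... */

    /* Set directly by the sortsupport function; NULL means that no
     * specialisation applies. */
    void (*qsort_single_key) (SortTuple *tuples, size_t n,
                              Tuplesortstate *state);
    void (*qsort_multi_key) (SortTuple *tuples, size_t n,
                             Tuplesortstate *state);
} SortSupportData;

extern void generic_qsort_arg(SortTuple *tuples, size_t n,
                              Tuplesortstate *state);

static void
dispatch_sort(SortSupportData *ssup, SortTuple *tuples, size_t n,
              int nkeys, Tuplesortstate *state)
{
    if (nkeys == 1 && ssup->qsort_single_key != NULL)
        ssup->qsort_single_key(tuples, n, state);   /* comparator inlined */
    else if (nkeys > 1 && ssup->qsort_multi_key != NULL)
        ssup->qsort_multi_key(tuples, n, state);
    else
        generic_qsort_arg(tuples, n, state);        /* unspecialised path */
}

With that, tuplesort.c needs neither the enum nor a switch: it just tests the pointers.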
On 31 January 2012 19:47, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Jan 27, 2012 at 3:33 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: >> Patch is attached. I have not changed the duplicate functions. This is >> because I concluded that keeping them was the lesser of two evils; >> having the template generate both declarations in the header file and >> definitions in the .c file seemed particularly obscure. We're >> never going to have to expose/duplicate any more comparators anyway. >> Do you agree? > > Not really. You don't really need macros to generate the prototypes; > you could just write them out longhand. Well, since that would have the advantage of forcing the client of those macros to explicitly acknowledge what specialisations they were getting in order to get a warning-free build, that might be a better approach (though uncalled static functions are discarded anyway). However, I don't really see how you expect me to make the comparator-proper functions within nbtcompare.c inline without some duplication. How can I have inline functions in that .c file, while still having those functions callable from other TUs, while avoiding duplication/moving the functions? I suppose that I could do some trick with macros, but once again I believe that the cure is worse than the disease - is it really such a big deal to duplicate a handful of trivial functions, given that we're never going to have to expand that number? > I think there's a mess of naming confusion in here, though, as perhaps > best illustrated by this macro definition: > > #define TEMPLATE_QSORT_ARG_HEAP(TYPE, COMPAR) \ > DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inlheap, \ > SING_ADDITIONAL_CODE, TYPE##inlheapcomparetup_inline) \ > DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, regheap, \ > MULT_ADDITIONAL_CODE(TYPE##regheapAppFunc), \ > TYPE##regheapcomparetup_inline) > > The idea here is that when we have only a single sort key, we include > SING_ADDITIONAL_CODE in the tuple comparison function, whereas when we > have more than one, we instead include MULT_ADDITIONAL_CODE. Right > there, I think we have a naming problem, because abbreviating "single" > to "sing" and "multiple" to "mult" is less than entirely clear. For a > minute or two I was trying to figure out whether our sorting code was > musically inclined, and I'm a native English speaker. But then we > switch to another set of terminology completely for the generated > functions: inlheap for the single-key case, and regheap for the > multiple-key case. I find that even more confusing. I'm happy to clean that up, though I think that when you try and browbeat C into the role of supporting generic programming, confusing code is more or less inevitable. > I think we ought to get rid of this: > > +typedef enum TypeCompar > +{ > + TYPE_COMP_OTHER, > + TYPE_COMP_INT4, > + TYPE_COMP_INT8, > + TYPE_COMP_FLOAT4, > + TYPE_COMP_FLOAT8, > + TYPE_COMP_FULL_SPECIALISATION > +} TypeCompar; > > Instead, just modify SortSupportData to have two function pointers as > members, one for the single-key case and another for the multiple-key > case, and have the sortsupport functions initialize them to the actual > functions that should be called. The layer of indirection, AFAICS, > serves no purpose. It anticipates a time when the system will have both btree and heap variants of the specialisations. It also serves to usefully document the intent of the optimisation - the most direct interface isn't necessarily the best.
That said, I really am looking for the path of least resistance towards getting this committed at this stage, so if you still think it's a bad idea, I won't complain if you modify the patch to have multiple function pointers as you described. >> It's pretty easy to remove a specialisation at any time - just remove >> less than 10 lines of code. It's also pretty difficult to determine, >> with everyone's absolute confidence, where the right balance lies. >> Perhaps the sensible thing to do is to not be so conservative in what >> we initially commit, while clearly acknowledging that we may not have >> the balance right, and that it may have to change. We then have the >> entire beta part of the cycle in which to decide to roll back from >> that position, without any plausible downside. If, on the other hand, >> we conservatively lean towards fewer specialisations in the initial >> commit, no one will complain about the lack of an improvement in >> performance that they never had. > > Eh, really? Typically when we do something good, the wolves are > howling at the door to make it work in more cases. Well, if anyone was complaining about a regression, because the optimisation represented a net loss for their reasonable use-case, then clearly we wouldn't be doing well, so we'd have to reconsider our position, and no sensible wolf is going to argue with that. Obviously it's not possible to prove that there couldn't be some kind of regression under some set of circumstances in a water-tight fashion. However, I believe that I have argued well that on balance, it's worth maintaining the full set of specialisations. This is one case where committing code and working out if there are problems makes sense, because: 1) There very probably aren't (you yourself were not surprised that a regression couldn't be demonstrated), and 2) if there are, we can withdraw from having so many specialisations in small increments, rather than having to throw everything out as might have been the case in the past with other proposed performance patches. >> I think that possibly the one remaining blocker to tentatively >> committing this with all specialisations intact is that I haven't >> tested this on Windows, as I don't currently have access to a Windows >> development environment. I have set one up before, but it's a huge >> pain. Can anyone help me out? > > This doesn't strike me as terribly OS-dependent, unless by that we > mean compiler-dependent. What concerns me is how the pg_always_inline macro behaves when building with MSVC. While I have no reason to doubt that the effect should be similar there, I would like to have that verified. Building with MSVC is simple if you're only building release source packages, but when you bring Flex + Bison into the equation, it becomes rather difficult, which is why I asked for help there. You are generally right to say that it is not terribly OS-dependent though, since, as I've already shown, very portable software like the GNU C library uses tricks like pg_always_inline extensively, both in code intended for a particular arch, and code intended for all. Even where pg_always_inline doesn't have an underlying compiler-specific way of compelling inlining (it just falls back on inline, which is probably now a no-op on this obscure platform anyway), that's just improving performance. The usefulness of this optimisation does not hinge on pg_always_inline in any way. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
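For reference, the portability story being described probably reduces to a macro along these lines - a sketch, since the guards the patch actually tests may differ:

/*
 * A plausible portable pg_always_inline. MSVC spells the hint
 * __forceinline; GCC and compatible compilers take the always_inline
 * attribute; anything else degrades to plain inline, which at worst
 * just misses the optimisation, exactly as described above.
 */
#if defined(_MSC_VER)
#define pg_always_inline __forceinline
#elif defined(__GNUC__)
#define pg_always_inline inline __attribute__((always_inline))
#else
#define pg_always_inline inline
#endif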
On Jan 26, 2012, at 9:32 PM, Robert Haas wrote: > But if we want to put it on a diet, the first thing I'd probably be > inclined to lose is the float4 specialization. Some members of the > audience will recall that I take a dim view of floating point arithmetic > generally, but I'll concede that there are valid reasons for using > float8. I have a harder time coming up with a good reason to use > float4 - ever, for anything you care about. So I would be inclined to > think that if we want to trim this back a bit, maybe that's the one to > let go. If we want to be even more aggressive, the next thing I'd > probably lose is the optimization of multiple sortkey cases, on the > theory that single sort keys are probably by far the most common > practical case. I do find float4 to be useful, though it's possible that my understanding is flawed… We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact. In most cases, 7 digits of precision (which AFAIK is what you're guaranteed with float4) is plenty, so we use float4 rather than bloat the database (though, since we're on 64bit hardware I guess that distinction is somewhat moot…). Is there something I'm missing that would make float4 useless as compared to float8? -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Excerpts from Jim Nasby's message of Wed Feb 01 19:12:58 -0300 2012: > On Jan 26, 2012, at 9:32 PM, Robert Haas wrote: > > But if we want to put it on a diet, the first thing I'd probably be > > inclined to lose is the float4 specialization. Some members of the > > audience will recall that I take a dim view of floating point arithmetic > > generally, but I'll concede that there are valid reasons for using > > float8. I have a harder time coming up with a good reason to use > > float4 - ever, for anything you care about. So I would be inclined to > > think that if we want to trim this back a bit, maybe that's the one to > > let go. If we want to be even more aggressive, the next thing I'd > > probably lose is the optimization of multiple sortkey cases, on the > > theory that single sort keys are probably by far the most common > > practical case. > > I do find float4 to be useful, though it's possible that my understanding is flawed… > > We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact. But then, do you sort using those ratios? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Feb 01, 2012 at 04:12:58PM -0600, Jim Nasby wrote: > On Jan 26, 2012, at 9:32 PM, Robert Haas wrote: > > But if we want to put it on a diet, the first thing I'd probably be > > inclined to lose is the float4 specialization. Some members of the > > audience will recall that I take a dim view of floating point arithmetic > > generally, but I'll concede that there are valid reasons for using > > float8. I have a harder time coming up with a good reason to use > > float4 - ever, for anything you care about. So I would be inclined to > > think that if we want to trim this back a bit, maybe that's the one to > > let go. If we want to be even more aggressive, the next thing I'd > > probably lose is the optimization of multiple sortkey cases, on the > > theory that single sort keys are probably by far the most common > > practical case. > > I do find float4 to be useful, though it's possible that my understanding is flawed… > > We end up using float to represent ratios in our database; things that really, honest to God do NOT need to be exact. > > In most cases, 7 digits of precision (which AFAIK is what you're guaranteed with float4) is plenty, so we use float4 rather than bloat the database (though, since we're on 64bit hardware I guess that distinction is somewhat moot…). > > Is there something I'm missing that would make float4 useless as compared to float8? > -- > Jim C. Nasby, Database Architect jim@nasby.net > 512.569.9461 (cell) http://jim.nasby.net > If the values stored are float4, it would be nice to have that fast-path sort available too. In the cases where I have used float4 values in the past, I absolutely did not need any of the float8 baggage, and in my case using the actual float4 comparison operator resulted in a significant time savings over the normal float8 one. This could be processor-specific, but it would be worth testing before throwing it out. Regards, Ken
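As a point of reference for what such a fast path has to do, here is a sketch of a float4 comparator under PostgreSQL's sort semantics, where NaN sorts as greater than every non-NaN value and equal to itself; it is written against plain C floats for clarity, whereas the real code works on Datums:

#include <math.h>

static inline int
float4_fastcmp(float a, float b)
{
    if (isnan(a))
        return isnan(b) ? 0 : 1;    /* NaN sorts last */
    if (isnan(b))
        return -1;
    if (a < b)
        return -1;
    if (a > b)
        return 1;
    return 0;
}

The NaN branches are a reminder that even a "cheap" float comparison is a little heavier than the int4 case, though still only a handful of instructions.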
On Fri, Jan 27, 2012 at 09:37:37AM -0500, Robert Haas wrote: > On Fri, Jan 27, 2012 at 9:27 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > > Well, I don't think it's all that subjective - it's more the case that > > it is just difficult, or it gets that way as you consider more > > specialisations. > > Sure it's subjective. Two well-meaning people could have different > opinions without either of them being "wrong". If you do a lot of > small, in-memory sorts, more of this stuff is going to seem worthwhile > than if you don't. > > > As for what types/specialisations may not make the cut, I'm > > increasingly convinced that floats (in the following order: float4, > > float8) should be the first to go. Aside from the fact that we cannot > > use their specialisations for anything like dates and timestamps, > > floats are just way less useful than integers in the context of > > database applications, or at least those that I've been involved with. > > As important as floats are in the broad context of computing, it's > > usually only acceptable to store data in a database as floats within > > scientific applications, and only then when their limitations are > > well-understood and acceptable. I think we've all heard anecdotes at > > one time or another, involving their limitations not being well > > understood. > > While we're waiting for anyone else to weigh in with an opinion on the > right place to draw the line here, do you want to post an updated > patch with the changes previously discussed? Well, I think we have to ask not only how many people are using float4/8, but how many people are sorting or creating indexes on them. I think it would be few, and perhaps those specialisations should be eliminated. Peter Geoghegan obviously has done some serious work in improving sorting, and worked well with the community process. He has done enough analysis that I am hard-pressed to see how we would get similar improvement using a different method, so I think it comes down to whether we want the 28% speedup by adding 55k (1%) to the binary. I think Peter has shown us how to get that, and what it will cost --- we just need to decide now whether it is worth it. What I am saying is there probably isn't a cheaper way to get that speedup, either now or in the next few years. (COPY might need similar help for speedups.) I believe this is a big win and well worth the increased binary size because the speed up is significant, and because it is of general usefulness for a wide range of queries. Either of these would be enough to justify the additional 1% size, but both make it an easy decision for me. FYI, I believe COPY needs similar optimizations; we have gotten repeated complaints about its performance and this method of optimization might also be our only option. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Feb 06, 2012 at 04:19:07PM -0500, Bruce Momjian wrote: > Peter Geoghegan obviously has done some serious work in improving > sorting, and worked well with the community process. He has done enough > analysis that I am hard-pressed to see how we would get similar > improvement using a different method, so I think it comes down to > whether we want the 28% speedup by adding 55k (1%) to the binary. > > I think Peter has shown us how to get that, and what it will cost --- we > just need to decide now whether it is worth it. What I am saying is > there probably isn't a cheaper way to get that speedup, either now or in > the next few years. (COPY might need similar help for speedups.) > > I believe this is a big win and well worth the increased binary size > because the speed up is significant, and because it is of general > usefulness for a wide range of queries. Either of these would be enough > to justify the additional 1% size, but both make it an easy decision for > me. > > FYI, I believe COPY needs similar optimizations; we have gotten repeated > complaints about its performance and this method of optmization might > also be our only option. Sorry, that was wordy. What I am saying is that years ago we did hot-spot optimization for storage and tuple access using macros. We have looked for cheap optimizations for sort (and COPY) for years, but haven't found it. If we want optimizations in these areas, we are probably going to need to do the same sort of hot-spot optimization we did earlier, and Peter's work is an excellent candidate for that. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote: > Peter Geoghegan obviously has done some serious work in improving > sorting, and worked well with the community process. Thank you for acknowledging that. It's unfortunate that C does not support expressing these kinds of ideas in a more natural way. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On Mon, Feb 06, 2012 at 10:49:10PM +0000, Peter Geoghegan wrote: > On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote: > > Peter Geoghegan obviously has done some serious work in improving > > sorting, and worked well with the community process. > > Thank you for acknowledging that. > > It's unfortunate that C does not support expressing these kinds of > ideas in a more natural way. Yes, it is a problem, and a benefit. We have avoided C++ because these types of trade-offs that we are discussing are often done invisibly, so we can't make the decision ourselves. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Feb 06, 2012 at 06:43:04PM -0500, Bruce Momjian wrote: > On Mon, Feb 06, 2012 at 10:49:10PM +0000, Peter Geoghegan wrote: > > On 6 February 2012 21:19, Bruce Momjian <bruce@momjian.us> wrote: > > > Peter Geoghegan obviously has done some serious work in improving > > > sorting, and worked well with the community process. > > > > Thank you for acknowledging that. > > > > It's unfortunate that C does not support expressing these kinds of > > ideas in a more natural way. > > Yes, it is a problem, and a benefit. We have avoided C++ because these > types of trade-offs that we are discussing are often done invisibly, so > we can't make the decision ourselves. Let me add that while it is fine for languages like C++ to make these decisions for application code automatically, operating system and high-performance database developers prefer to make such decisions explicitly, which is what we are discussing now. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 2/6/12 3:19 PM, Bruce Momjian wrote: >> > While we're waiting for anyone else to weigh in with an opinion on the >> > right place to draw the line here, do you want to post an updated >> > patch with the changes previously discussed? > Well, I think we have to ask not only how many people are using > float4/8, but how many people are sorting or creating indexes on them. > I think it would be few, and perhaps those specialisations should be eliminated. I agree that it's probably pretty unusual to index floats. My objection was on the assumption that float8 is valid but float4 isn't. If we are going to provide a fast-path for one then we should do it for both if for no other reason than least surprise. -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim "Decibel!" Nasby wrote: > I agree that it's probably pretty unusual to index floats. FWIW: Cubes and points are floats, right? So would spatial indexes benefit from this optimization, or is it only raw floats? Jay Levitt
On Tue, Feb 7, 2012 at 2:39 PM, Jay Levitt <jay.levitt@gmail.com> wrote: > Jim "Decibel!" Nasby wrote: >> >> I agree that it's probably pretty unusual to index floats. > > FWIW: Cubes and points are floats, right? So would spatial indexes benefit > from this optimization, or is it only raw floats? Cubes are not floats, nor are points. In general, what's being proposed here is to make a separate copy of quicksort for each of N datatypes, with the comparator function inlined. In order for this to benefit multiple types, they'd have to use the same set of machine instructions to compare values on disk. So in general you can make this apply to as many datatypes as you want: it's just that each one will require another copy of the quicksort code in the binary. The more complex the comparator is, the less you'll save, because the savings presumably come largely from things like saving and restoring registers and pushing/popping stack frames, and that's really only going to be material if the underlying comparator is pretty darn cheap. Actually, to be more precise, the patch is proposing to create TWO copies of the quicksort code for each of N datatypes, one for the case where there is but a single sort key and the other for the case where there are multiple sort keys. Having separate code for the single sort key case saves more because it lets you get rid of a control loop - you don't have to iterate down the list of keys, because there's only one. I've been skeptical of this patch for a number of reasons, the major one of which is that most workloads spend only a very small amount of time doing quicksorts, and most quicksorts are of very small amounts of data and therefore fast anyway. It is easy to construct an artificial test case that does lots and lots of in-memory sorting, but in real life I think that's not the greater part of what people use databases for. The time gets spent on things like groveling through big tables or doing joins, and then maybe we sort the rows at the end, but that's not where most of the time goes. It may be rightly pointed out, of course, that a penny saved is a penny earned: the fact that most people don't spend much time quicksorting doesn't mean that we shouldn't try to make quicksorting as fast as possible. But there are a couple of additional considerations. First, the code is, at present, uglier than I would like. I mean to spend some time trying to clean that up, but there are 102 patches in this CommitFest (the number seems to keep slowly growing, despite the deadline being three weeks gone) and this isn't the only one I care about, so I haven't quite gotten there yet. But hopefully soon. Second, there's a concern about binary bloat: duplicating lots of code with different comparators inlined generates, well, a lot of machine code. Of course, an 0.8% increase in the size of the resulting binary is very unlikely to produce any measurable performance regression on its own, but that's not really the issue. The issue is rather that we could easily go nuts applying this technique in lots of different places - or even just in one place to lots of different types. Let's say that we find that even for types with very complex comparators, we can get a 5% speedup on quick-sorting speed using this technique. Should we, then, apply the technique to every type we have? Perhaps the binary would grow by, I don't know, 10%.
Maybe that's still not enough to start hurting performance (or making packagers complain), but surely it's getting there, and what happens when we discover that similar speedups are possible using the same trick in five other scenarios? I think it's a bit like going out to dinner and spending $50. It's nicer than eating at home, and many people can afford to do it occasionally, but if you do it every night, it gets expensive (and possibly fattening). So we need some principled way of deciding how much inlining is reasonable, because I am 100% certain this is not going to be the last time someone discovers that a massive exercise in inlining can yield a nifty performance benefit in some case or another: index builds and COPY have already been mentioned on this thread, and I expect that people will find other cases as well. I'm not really sure what the "budget" is - i.e. how much binary bloat we can afford to add - or how many cases there are that can benefit, but the first isn't infinite and the second is more than the first. Having said all that, I am inclined to commit at least some portion of this, but I wanted to knock off a few other patches that have been lingering for a while first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
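The technique at issue can be sketched in miniature as follows; insertion sort stands in for the full qsort_arg body to keep the sketch short, and the macro name is invented:

/*
 * Miniature sketch of comparator inlining via a template macro. The
 * point is that CMP expands in place, so there is no function-pointer
 * call per comparison - but each instantiation is a separate copy of
 * the whole sort loop in the binary.
 */
#include <stddef.h>

#define TEMPLATE_SORT(NAME, TYPE, CMP)              \
static void                                         \
NAME(TYPE *a, size_t n)                             \
{                                                   \
    for (size_t i = 1; i < n; i++)                  \
    {                                               \
        TYPE    key = a[i];                         \
        size_t  j = i;                              \
        while (j > 0 && CMP(a[j - 1], key) > 0)     \
        {                                           \
            a[j] = a[j - 1];                        \
            j--;                                    \
        }                                           \
        a[j] = key;                                 \
    }                                               \
}

#define SCALAR_CMP(x, y)    (((x) > (y)) - ((x) < (y)))

TEMPLATE_SORT(sort_int4, int, SCALAR_CMP)
TEMPLATE_SORT(sort_float8, double, SCALAR_CMP)

Two instantiations, two copies of the loop: that is exactly where both the speedup and the binary-size cost come from.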
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote: > So we need some principled way of deciding how much inlining is > reasonable, because I am 100% certain this is not going to be the last > time someone discovers that a massive exercise in inlining can yield a > nifty performance benefit in some case or another: index builds and > COPY have already been mentioned on this thread, and I expect that > people will find other cases as well. I'm not really sure what the > "budget" is - i.e. how much binary bloat we can afford to add - or how > many cases there are that can benefit, but the first isn't infinite > and the second is more than the first. > > Having said all that, I am inclined to commit at least some portion of > this, but I wanted to knock off a few other patches that have been > lingering for a while first. One approach would be to only do a few types now, e.g. integers and strings, and perhaps leave the others for later. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
It doesn't necessarily matter if we increase the size of the postgres binary by 10%, precisely because most of that is not going to be in play from one instant to the next. I'm thinking, in particular, of btree index specialisations, where it could make perfect sense to "go crazy". You cannot have a reasonable discussion about such costs without considering that they will perhaps never be paid, given any plausible workload. That's why the percentage that the postgres binary size has been shown to increase by really isn't pertinent at all. At best, it's a weak proxy for such costs, assuming you don't have a demonscene-like preoccupation with reducing binary size, and I don't believe that we should. It would be difficult for me to measure such things objectively, but I'd speculate that the proprietary databases have much larger binaries than ours, while having far fewer features, precisely because they started applying tricks like this a long time ago. You could counter that their code bases probably look terrible, and you'd have a point, but so do I. On 8 February 2012 02:38, Robert Haas <robertmhaas@gmail.com> wrote: > I've been skeptical of this patch for a number of reasons, the major > one of which is that most workloads spend only a very small amount of > time doing quicksorts, and most quicksorts are of very small amounts > of data and therefore fast anyway. It is easy to construct an > artificial test case that does lots and lots of in-memory sorting, but > in real life I think that's not the greater part of what people use > databases for. Fair enough, but if that's true, then it's also true that the cost due to cache marginalisation - the only cost that I think is worth considering at all - is correspondingly a small fraction of that very small amount of sorting. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes: > It doesn't necessarily matter if we increase the size of the postgres > binary by 10%, precisely because most of that is not going to be in > play from one instant to the next. You've heard of swapping, no? Basically what I'm hearing from you is total denial that binary bloat costs anything, and that just does not square with reality. Even if the costs in performance terms are negligible in many common situations (something that you've asserted but without offering any proof), there are other costs; software distribution for instance is definitely sensitive to size. As a Red Hat person I've had to spend time fighting to keep Postgres included in the minimum "does it fit on a DVD" distribution of RHEL/Fedora. That case gets harder to make every year, and yet it's pretty critical to the success of the project --- if you don't get distributed, you lose users. IMO this patch is already well past the point of diminishing returns in value-per-byte-added. I'd like to see it trimmed back to provide a fast path for just single-column int4/int8/float4/float8 sorts. The other cases aren't going to offer enough of a win to justify the code space. regards, tom lane
On Wed, Feb 8, 2012 at 8:33 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > It doesn't necessarily matter if we increase the size of the postgres > binary by 10%, precisely because most of that is not going to be in > play from one instant to the next. As Tom says, that doesn't jibe with my experience. If you add on enough binary bloat, you will have more page faults. It's true (as I think you pointed out upthread) that packing all the copies of quicksort into the binary one after the other minimizes the effect on other things, but it doesn't eliminate it entirely. If you've got this: <random other stuff> ... <a zillion copies of quicksort> .... <more other stuff> ...then a branch from the "random other stuff" section of the binary to the "more other stuff" section of the binary may cost more. For example, suppose the OS does X amount of readahead. By stuffing all those copies of quicksort into the middle there, you increase the chances that the page you need was beyond the readahead window. Or, if it wasn't going to be in the readahead window either way, then you increase the distance that the disk head needs to move to find the required block. These costs are very small, no question about it. They are almost impossible to measure individually, in much the same way that the cost of pollution or greenhouse gas emissions is difficult to measure. But it's an error to assume that because the costs are individually small, they will never add up to anything material. As long as you keep insisting on that, it's hard to have a rational conversation. We can reasonably debate the magnitude of the costs, but to assert that they don't exist gets us nowhere. Suppose we could get a 10% speedup on sin(numeric) by adding 40GB to the binary size. Would you be in favor of that? Do you think it would hurt performance on any other workloads? Would our packagers complain at all? Surely your same argument would apply to that case in spades: anyone who is not using the gigantic hard-coded lookup table will not pay any portion of the cost of it. > It would be difficult for me to measure such things objectively, but > I'd speculate that the proprietary databases have much larger binaries > than ours, while having far fewer features, precisely because they > started applying tricks like this a long time ago. You could counter > that their code bases probably look terrible, and you'd have a point, > but so do I. That might be true; I have no idea. There are probably lots of reasons why their code bases look terrible, including a long history of backward compatibility with now-defunct versions, a variety of commercial pressures, and the fact that they don't have to take flak in the public square for what their code looks like. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > IMO this patch is already well past the point of diminishing returns in > value-per-byte-added. I'd like to see it trimmed back to provide a fast > path for just single-column int4/int8/float4/float8 sorts. The other > cases aren't going to offer enough of a win to justify the code space. I'm curious about how much we're gaining from the single-column specializations vs. the type-specific specializations. I think I'm going to go try to characterize that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Feb 08, 2012 at 01:33:30PM +0000, Peter Geoghegan wrote: > It doesn't necessarily matter if we increase the size of the postgres > binary by 10%, precisely because most of that is not going to be in > play from one instant to the next. I'm thinking, in particular, of > btree index specialisations, where it could make perfect sense to "go > crazy". You cannot have a reasonable discussion about such costs > without considering that they will perhaps never be paid, given any > plausible workload. That's why the percentage that the postgres binary > size has been shown to increase by really isn't pertinent at all. At > best, it's a weak proxy for such costs, assuming you don't have a > demonscene-like preoccupation with reducing binary size, and I don't > believe that we should. When you start a binary, your executable is mapped to a file system binary, and you page-fault in the pages you need. Now, if your optimization was alone in its own 4k (x86) virtual pages, and you never called the functions, odds are you would not pay a penalty, aside from the distribution penalty, and perhaps a small penalty if useful code was before and after your block. The sort code expansion, however, is done in-line, in the middle of the sort code; you are clearly filling in 64-byte (x86) CPU cache lines with type-specific expansion code for every sort case, whether we use the code or not. Now, I don't think it is a big problem, and I think the speedup is worth it for common data types, but we can't say the cost is zero. Saying it another way, having a binary in your file system that you never call is not overhead except for storage, but in this case, the sort code expansion is inside critical functions we are already calling. Frankly, it is the cost that has kept us away from using such optimizations for a long time. I recently posted that the zero-cost optimizations are mostly completed for sort and COPY, and we have to start considering non-zero-cost optimizations --- sad, but true. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
> On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > IMO this patch is already well past the point of diminishing returns in
> > value-per-byte-added. I'd like to see it trimmed back to provide a fast
> > path for just single-column int4/int8/float4/float8 sorts. The other
> > cases aren't going to offer enough of a win to justify the code space.
>
> I'm curious about how much we're gaining from the single-column
> specializations vs. the type-specific specializations. I think I'm
> going to go try to characterize that.

Yes, please. That would be a big help. Is there no optimization for strings? I assume they are sorted a lot. We can always add more data types later.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes: > Yes, please. That would be a big help. Is there no optimization for > strings? I assume they are sorted a lot. It seems unlikely that it'd be worth including strings, especially if your locale is not C. This whole thing only makes sense for datatypes that are comparable in approximately 1 machine instruction. regards, tom lane
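For concreteness, a comparator in the class Tom describes reduces to a compare and a couple of branches once inlined. A minimal sketch, modeled on the int4 SortSupport comparator (the btint4fastcmp shape from the SortSupport work):

static int
btint4fastcmp(Datum x, Datum y, SortSupport ssup)
{
	int32		a = DatumGetInt32(x);
	int32		b = DatumGetInt32(y);

	if (a > b)
		return 1;
	else if (a == b)
		return 0;
	else
		return -1;
}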
On Wed, Feb 08, 2012 at 11:35:46AM -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Yes, please. That would be a big help. Is there no optimization for > > strings? I assume they are sorted a lot. > > It seems unlikely that it'd be worth including strings, especially if > your locale is not C. This whole thing only makes sense for datatypes > that are comparable in approximately 1 machine instruction. Ah, OK, interesting. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 8 February 2012 15:17, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> IMO this patch is already well past the point of diminishing returns in
>> value-per-byte-added. I'd like to see it trimmed back to provide a fast
>> path for just single-column int4/int8/float4/float8 sorts. The other
>> cases aren't going to offer enough of a win to justify the code space.
>
> I'm curious about how much we're gaining from the single-column
> specializations vs. the type-specific specializations. I think I'm
> going to go try to characterize that.

I think it might make more sense to lose the type-specific specialisations for the multi-key case while adding a generic multi-key specialisation, than to lose all multi-key specialisations, though I have not considered that question at length, and would think that we'd still want to keep an int4 version in that case. Note that I *did not* include a generic multi-key specialisation, though only because I saw little point, having already covered by far the most common cases. While you're at it, I'd like to suggest that you perform a benchmark on a multi-key specialisation, so we can see just what we're throwing away before we do so. Better to have those numbers come from you.

I continue to maintain that the most appropriate course of action is to provisionally commit all specialisations. If it's hard to know what effect this is going to have on real workloads, let's defer to beta testers, who presumably try the new release out with their application. It's a question you could squarely put to them, without gradually rolling back from that initial position being much of a problem.

The mysql-server package is 45 MB on Fedora 16. That 1% of Postgres binary figure is for my earlier patch with btree specialisations, right? I'm not asking you to look at that right now. I also don't think that "where do we eventually draw the line with specialisations like this in Postgres generally?" is a question that you should expect me to answer, though I will say that we should look at each case on its merits.

I have not "totally denied" binary bloat costs. I have attempted to quantify them, while acknowledging that such a task is difficult, as was evident from the fact that Robert "wasn't surprised" that I could not demonstrate any regression. Granted, my standard for saying that there is no regression is that there is very clearly no net loss in performance at some reasonable granularity, which is a very practical definition. You can quite easily contrive a case that HOT handles really badly. Some people did, I believe, but HOT won out because it was clearly very useful in the real world.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Feb 8, 2012 at 10:59 AM, Bruce Momjian <bruce@momjian.us> wrote:
> On Wed, Feb 08, 2012 at 10:17:36AM -0500, Robert Haas wrote:
>> On Wed, Feb 8, 2012 at 9:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > IMO this patch is already well past the point of diminishing returns in
>> > value-per-byte-added. I'd like to see it trimmed back to provide a fast
>> > path for just single-column int4/int8/float4/float8 sorts. The other
>> > cases aren't going to offer enough of a win to justify the code space.
>>
>> I'm curious about how much we're gaining from the single-column
>> specializations vs. the type-specific specializations. I think I'm
>> going to go try to characterize that.
>
> Yes, please. That would be a big help. Is there no optimization for
> strings? I assume they are sorted a lot.

They are, but the value of the optimization is expected to diminish for more complex comparators. I did a quick test of this using the same setup I mentioned upthread:

create table nodups (g) as select (g%10)*1000+g/10 from generate_series(1,10000) g;

This was the actual test query:

select * from nodups order by g offset 10001;

I did a ten-second warmup after starting each postmaster, followed by three one-minute test runs:

master:
tps = 298.824321 (including connections establishing)
tps = 301.741619 (including connections establishing)
tps = 302.238016 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so using just the single-sortkey optimizations:
tps = 357.553690 (including connections establishing)
tps = 358.765875 (including connections establishing)
tps = 358.825051 (including connections establishing)

Peter's patch, intact:
tps = 394.566994 (including connections establishing)
tps = 394.228667 (including connections establishing)
tps = 394.492677 (including connections establishing)

So that's about a 19% speedup for just the single sort-key optimizations, and about a 30% speedup in total. Compared to a baseline that includes the single sort-key optimizations, the additional type-specific optimization is buying about 10% on this test.

I tried changing the test query to return the data, like this:

select * from nodups order by g;

master:
tps = 181.301129 (including connections establishing)
tps = 180.348333 (excluding connections establishing)
tps = 179.600825 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so using just the single-sortkey optimizations:
tps = 201.728269 (including connections establishing)
tps = 198.022527 (including connections establishing)
tps = 199.663082 (including connections establishing)

Peter's patch, intact:
tps = 214.037387 (including connections establishing)
tps = 212.895145 (including connections establishing)
tps = 211.596558 (including connections establishing)

So, with the overhead of actually returning the sorted data, we get 10% from the generic single-sortkey optimizations and 18% from the two combined. Compared to a baseline that includes the single-sortkey optimizations, the incremental improvement from adding the type-specific optimizations is about 6.6%. It would be less on a wider table, presumably.
I also tested a two-column sort:

create table twocol (a, b) as select g%101, g%17 from generate_series(1,10000) g;
select * from twocol order by a, b offset 10000;

master:
tps = 270.218604 (including connections establishing)
tps = 265.634964 (including connections establishing)
tps = 268.520805 (including connections establishing)

Peter's patch, intact:
tps = 285.528626 (including connections establishing)
tps = 285.140345 (including connections establishing)
tps = 282.388457 (including connections establishing)

So here the type-specific optimizations are buying us even less: only about 6%, and that's with no client overhead.

And I tested float8:

create table f8 (g) as select random() from generate_series(1,10000);
select * from f8 order by g offset 10000;

master:
tps = 228.373239 (including connections establishing)
tps = 225.117704 (including connections establishing)
tps = 225.196197 (including connections establishing)

Peter's patch, but with the type-specific optimizations disabled, so using just the single-sortkey optimizations:
tps = 263.060820 (including connections establishing)
tps = 262.788926 (including connections establishing)
tps = 263.041186 (including connections establishing)

Peter's patch, intact:
tps = 283.274797 (including connections establishing)
tps = 280.597965 (including connections establishing)
tps = 280.151349 (including connections establishing)

That's +17% for the single-sortkey stuff, +25% for everything, and the incremental improvement of the type-specific optimizations over the generic single-key optimizations is about 6.7%.

It seems clear that the single sort-key optimizations are a much better value per byte of code than the type-specific optimizations. Ignoring client overhead, we get between half and two-thirds of the benefit, and it costs us just one extra copy of the quicksort logic for all data types. The type-specific optimizations only apply to a subset of the available data types, produce a lot more machine code (one extra copy of quicksort per type), and the absolute performance benefit is less.

My gut feeling is that the few extra percentage points we can get by including lots of extra copies of quicksort are not worth it. There are probably many cases where we can squeeze a few extra points out of inlining (Tom and I have discussed a few of them in the past, in particular one related to index-only scans) and I don't have any confidence that this is the most important one, or even in the top ten. If we were talking about speeding up COPY or index lookups, I might well feel differently, but very few people are going to quicksort 10,000 rows on a sufficiently regular basis to justify doing this. To even measure the benefit of this, you have to set up a totally artificial test case (such as the above), and you have to run it via pgbench, because if you just run it interactively, the run-to-run variation swamps the actual improvement, and without doing a statistical analysis of the results you can't even be sure whether anything's changed. I just can't get excited about that. However, I find the single-key optimizations much more compelling, for the reasons stated above, and feel we ought to include those.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > [ lots of numbers ] > ... I just can't get excited about that. However, I > find the single-key optimizations much more compelling, for the > reasons stated above, and feel we ought to include those. This conclusion seems sound to me, for the reasons you stated and one more: optimizations for a single sort key are going to be applicable to a very wide variety of queries, whereas all the other cases are necessarily less widely applicable. regards, tom lane
On 8 February 2012 17:58, Robert Haas <robertmhaas@gmail.com> wrote:
> It seems clear that the single sort-key optimizations are a much
> better value per byte of code than the type-specific optimizations.
> Ignoring client overhead, we get between half and two-thirds of the
> benefit, and it costs us just one extra copy of the quicksort logic
> for all data types.

That was clear from an early stage, and is something that I acknowledged way back in September - it's pretty obvious that chasing diminishing returns is the nature of this sort of thing, which is fine, provided that they are not chased beyond some point that doesn't make sense. However, I do not see why we should not at least include full integer specialisations as well (for single keys at least), given that we get the additional, not inconsiderable improvements. I do not accept the premise that we need to find the optimal bytes-added to performance-increase ratio, because, as I've already stressed, adding bytes to the application binary does not have a linear cost.

Here's what might be a good compromise: singlekey generic, singlekey int4, singlekey int8, multikey generic. That is a total of 4 contiguous copies of the qsort, because we've obsoleted the original qsort_arg (except for btree index sorting, which likely matters far less). We get the most benefit for 5 very common types - int4, int8, date, timestamp and timestamptz. A 30% improvement has to be better than the 19% speedup for just the single sort-key optimisations, considering that in practice these types are very commonly used.

I think that there may be additional benefits from making the qsort_arg specialisation look less like a C stdlib one, like refining the swap logic to have compile-time knowledge of the type it is sorting. I'm thinking that we could usefully trim quite a bit from this:

#define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
	(es) % sizeof(long) ? 2 : (es) == sizeof(long) ? 0 : 1;

#define thistype_vecswap(a, b, n) \
	if ((n) > 0) inl_swapfunc((a), (b), (size_t)(n), swaptype)

#define thistype_swap(a, b) \
	if (swaptype == 0) { \
		long t = *(long *)(void *)(a); \
		*(long *)(void *)(a) = *(long *)(void *)(b); \
		*(long *)(void *)(b) = t; \
	} else \
		inl_swapfunc(a, b, es, swaptype)

inline static void
inl_swapfunc(char *a, char *b, size_t n, int swaptype)
{
	if (swaptype <= 1)
		swapcode(long, a, b, n);
	else
		swapcode(char, a, b, n);
}

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
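To illustrate the trimming Peter has in mind: once the element type is known at compile time to be SortTuple, all of the swaptype machinery above collapses to an assignment through a temporary. A minimal sketch, assuming a fixed SortTuple element type (the macro name here is hypothetical):

#define SWAP_SORTTUPLE(a, b) \
	do { \
		SortTuple swap_tmp = *(a); \
		*(a) = *(b); \
		*(b) = swap_tmp; \
	} while (0)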
On 8 February 2012 18:48, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> I think that there may be additional benefits from making the
> qsort_arg specialisation look less like a c stdlib one, like refining
> the swap logic to have compile-time knowledge of the type it is
> sorting. I'm thinking that we could usefully trim quite a bit from
> this:

It seems that while we cannot get any better performance (no doubt because the compiler is already applying all manner of optimisations here, perhaps including constant propagation), we can simplify this part of the code quite a bit. That seems fairly incidental though.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote: > That was clear from an early stage, and is something that I > acknowledged way back in September OK, so why didn't/don't we do and commit that part first, and then proceed to argue about the remainder once it's in? > I think that there may be additional benefits from making the > qsort_arg specialisation look less like a c stdlib one, like refining > the swap logic to have compile-time knowledge of the type it is > sorting. I'm thinking that we could usefully trim quite a bit from > this: That's an interesting idea, which seems worth pursuing, though possibly not for 9.2. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 8 February 2012 23:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 1:48 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> That was clear from an early stage, and is something that I
>> acknowledged way back in September
>
> OK, so why didn't/don't we do and commit that part first, and then
> proceed to argue about the remainder once it's in?

I have no objection to that. I'm not sure that it's going to be possible to agree to anything beyond what has already been agreed. We don't seem to be covering new ground.

What would it take to convince you to be more inclusive, and have at least a few full specialisations, in particular int4 and int8 single-key full specialisations? I'm sure that you'll agree that they're the second most compelling case. There doesn't seem to be a standard that I can meet that you can even /describe/ to have those additional specialisations committed, which is a shame, since they appear to be pretty decent wins above and beyond the generic single-key specialisation, all things considered.

I suppose that the standard refrain here is "it's your patch; you must prove the case for committing it". That would be fine by me, but this isn't just about me, and it seems to be a practical impossibility in this case. It may be quite impossible even with far greater resources than I can afford to apply here, since it seems like you're hinting at some kind of rigorous proof that this cannot cause a regression for a single client, even though in order to see that regression multiple other clients would have to be benefiting. I cannot provide such a proof, since it would probably have to consider all commercially available CPUs and all possible workloads - I doubt that anyone can. That being the case, I'm eager not to waste any more time on this. I bear no ill will; that just seems to be the situation that we find ourselves in. That said, if you can describe the standard, I'll try and meet it - you seem to be suggesting that months and months of benchmarks wouldn't really change things, since this is the first step on the road to binary bloat ruin.

The only fundamentally new argument that I can offer is that the standard the community applies here severely limits the number of performance improvements that can be added, and the aggregate effect is that we're quite certainly worse off. It's a pity this isn't a really easy decision for you to make, but that's the nature of these things, and we should expect that to increasingly be the case. Is it really so unacceptable for us to risk doing a little worse under some rare circumstances in order to do better under some common circumstances? Why should it be an easy decision - when are important decisions ever easy?

>> I think that there may be additional benefits from making the
>> qsort_arg specialisation look less like a c stdlib one, like refining
>> the swap logic to have compile-time knowledge of the type it is
>> sorting. I'm thinking that we could usefully trim quite a bit from
>> this:
>
> That's an interesting idea, which seems worth pursuing, though
> possibly not for 9.2.

Well, it's really just a code clean-up. I suspect that qsort_arg is a minimally modified version of the NetBSD one that wasn't necessarily thoroughly understood when it was initially added (not that I'm suggesting that it ought to have been). Then again, you might prefer to keep it as consistent as possible with qsort (the other copy of that sort function that we already have).
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote: > Second, there's a concern about binary bloat: duplicating lots of code > with different comparators inlined generates, well, a lot of machine > code. Of course, an 0.8% increase in the size of the resulting binary > is very unlikely to produce any measurable performance regression on > its own, but that's not really the issue. The issue is rather that we > could easily go nuts applying this technique in lots of different > places - or even just in one place to lots of different types. Let's > say that we find that even for types with very complex comparators, we > can get a 5% speedup on quick-sorting speed using this technique. > Should we, then, apply the technique to every type we have? Perhaps > the binary would grow by, I don't know, 10%. Maybe that's still not > enough to start hurting performance (or making packagers complain), > but surely it's getting there, and what happens when we discover that > similar speedups are possible using the same trick in five other > scenarios? I think it's a bit like going out to dinner and spending > $50. It's nicer than eating at home, and many people can afford to do > it occasionally, but if you do it every night, it gets expensive (and > possibly fattening). > > So we need some principled way of deciding how much inlining is > reasonable, because I am 100% certain this is not going to be the last > time someone discovers that a massive exercise in inlining can yield a > nifty performance benefit in some case or another: index builds and > COPY have already been mentioned on this thread, and I expect that > people will find other cases as well. I'm not really sure what the > "budget" is - i.e. how much binary bloat we can afford to add - or how > many cases there are that can benefit, but the first isn't infinite > and the second is more than the first. Having such a metric would resolve this discussion, but formulating it will be all but impossible. We don't know how much time PostgreSQL installations now spend sorting or how much additional time they would spend on cache misses, TLB misses, page faults, etc. That data won't show up on this thread. You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB lookup table. I would not feel bad about rejecting that, because it can live quite comfortably as an external module. Indeed, I think the best thing we can do to constrain long-term bloat in the "postgres" executable is to improve pluggability. Let a wider range of features live outside the postmaster binary. For example, if we had the infrastructure to let hash, GIN, GiST and SP-GiST index access methods live in their own DSOs (like PL/pgSQL does today), I would support doing that. Extensibility is a hallmark of PostgreSQL. It's a bad day when we reject a minority-interest feature or optimization that has no other way to exist. Better pluggability can't ease the decision process for Peter's patch, because its specializations essentially must live in the "postgres" executable to live at all. Nominally, we could let external modules inject sort specializations for core types, and Peter could publish an all_fast_sorts extension. That would be useless punting: if we can't make a principled decision on whether these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers the module make that decision? 
This patch has gotten more than its fair share of attention for bloat, and I think that's mostly because there's a dial-a-bloat-level knob sticking out and begging to be frobbed. On my system, fastpath_sort_2012_01_19.patch adds 85 KiB to the postgres binary. A recent commit added 57 KiB and bloat never came up in discussions thereof. I appreciate your concern over a slippery slope of inlining proposals, but I also don't wish to see the day when every patch needs to declare and justify its binary byte count. Unless, of course, we discover that elusive metric for evaluating it fairly. If we'd like to start taking interest in binary bloat, how about having the buildfarm log capture an "ls -l" on the bin directory? We'll then have data to mine if we ever wish to really take action. All that being said, I'd want to see a 15-30% (depending on how contrived) improvement to a microbenchmark or a 5% improvement to a generic benchmark (pgbench, DBT-<N>) before adopting an optimization of this complexity. Based on your measurements[2], per-type inlining gives us a 7% microbenchmark improvement. I'd therefore excise the per-type inlining. (For the record, given a 20% improvement to the same benchmark, I'd vote yea on the submission and perhaps even suggest int2 and uuid support.) nm [1] http://archives.postgresql.org/message-id/CA+TgmoZO1xSz+YiqZ2mRoKMcMqtb+JiR0Lz43CNe6de7--QDAA@mail.gmail.com [2] http://archives.postgresql.org/message-id/CA+TgmobFz8SqTGTaaRYEryWUZFODGETprbuYgBhsPLR4+Li6TQ@mail.gmail.com
On Thu, Feb 9, 2012 at 7:24 AM, Noah Misch <noah@leadboat.com> wrote: > On Tue, Feb 07, 2012 at 09:38:39PM -0500, Robert Haas wrote: >> Second, there's a concern about binary bloat: duplicating lots of code >> with different comparators inlined generates, well, a lot of machine >> code. Of course, an 0.8% increase in the size of the resulting binary >> is very unlikely to produce any measurable performance regression on >> its own, but that's not really the issue. The issue is rather that we >> could easily go nuts applying this technique in lots of different >> places - or even just in one place to lots of different types. Let's >> say that we find that even for types with very complex comparators, we >> can get a 5% speedup on quick-sorting speed using this technique. >> Should we, then, apply the technique to every type we have? Perhaps >> the binary would grow by, I don't know, 10%. Maybe that's still not >> enough to start hurting performance (or making packagers complain), >> but surely it's getting there, and what happens when we discover that >> similar speedups are possible using the same trick in five other >> scenarios? I think it's a bit like going out to dinner and spending >> $50. It's nicer than eating at home, and many people can afford to do >> it occasionally, but if you do it every night, it gets expensive (and >> possibly fattening). >> >> So we need some principled way of deciding how much inlining is >> reasonable, because I am 100% certain this is not going to be the last >> time someone discovers that a massive exercise in inlining can yield a >> nifty performance benefit in some case or another: index builds and >> COPY have already been mentioned on this thread, and I expect that >> people will find other cases as well. I'm not really sure what the >> "budget" is - i.e. how much binary bloat we can afford to add - or how >> many cases there are that can benefit, but the first isn't infinite >> and the second is more than the first. > > Having such a metric would resolve this discussion, but formulating it will be > all but impossible. We don't know how much time PostgreSQL installations now > spend sorting or how much additional time they would spend on cache misses, > TLB misses, page faults, etc. That data won't show up on this thread. > > You posed[1] the straw man of a sin(numeric) optimization requiring a 40GB > lookup table. I would not feel bad about rejecting that, because it can live > quite comfortably as an external module. Indeed, I think the best thing we > can do to constrain long-term bloat in the "postgres" executable is to improve > pluggability. Let a wider range of features live outside the postmaster > binary. For example, if we had the infrastructure to let hash, GIN, GiST and > SP-GiST index access methods live in their own DSOs (like PL/pgSQL does > today), I would support doing that. Extensibility is a hallmark of > PostgreSQL. It's a bad day when we reject a minority-interest feature or > optimization that has no other way to exist. > > Better pluggability can't ease the decision process for Peter's patch, because > its specializations essentially must live in the "postgres" executable to live > at all. Nominally, we could let external modules inject sort specializations > for core types, and Peter could publish an all_fast_sorts extension. That > would be useless punting: if we can't make a principled decision on whether > these accelerated sorts justify 85 KiB of binary, how will a DBA who discovers > the module make that decision? 
>
> This patch has gotten more than its fair share of attention for bloat, and I
> think that's mostly because there's a dial-a-bloat-level knob sticking out and
> begging to be frobbed. On my system, fastpath_sort_2012_01_19.patch adds 85
> KiB to the postgres binary. A recent commit added 57 KiB and bloat never came
> up in discussions thereof. I appreciate your concern over a slippery slope of
> inlining proposals, but I also don't wish to see the day when every patch
> needs to declare and justify its binary byte count. Unless, of course, we
> discover that elusive metric for evaluating it fairly.
>
> If we'd like to start taking interest in binary bloat, how about having the
> buildfarm log capture an "ls -l" on the bin directory? We'll then have data
> to mine if we ever wish to really take action.
>
> All that being said, I'd want to see a 15-30% (depending on how contrived)
> improvement to a microbenchmark or a 5% improvement to a generic benchmark
> (pgbench, DBT-<N>) before adopting an optimization of this complexity. Based
> on your measurements[2], per-type inlining gives us a 7% microbenchmark
> improvement. I'd therefore excise the per-type inlining. (For the record,
> given a 20% improvement to the same benchmark, I'd vote yea on the submission
> and perhaps even suggest int2 and uuid support.)

That's all well-said and I mostly agree with all of it, including your suggested percentages in the last paragraph (but does anyone really sort int2s or uuids?).

What I think is different about inlining as compared to writing code is that it's possible to generate an amount of binary bloat that is disproportionate to the amount of code you actually wrote. You can blow up your binary more or less on auto-pilot, just by adding more macro invocations. Of course, the overall impact is the same, but when you have to actually write the code by hand, you're more likely to notice that you're cranking out hundreds of lines of code of dubious real-world value.

I think the other thing to keep in mind is that, inevitably, people decide what considerations to worry about with reference to a particular patch based on the perceived value of the feature. I thought about the fact that Tom's parameterized-path patch adds a lot of new code and to be honest I'm not totally sanguine about it, but on the other hand it has the potential to make certain kinds of queries that were seriously slow in previous releases run hundreds or thousands of times faster, and maybe open the door to further planner optimizations in the future. When you start talking about multiples with lots of zeros on the end, binary bloat fades out of the picture: you COULD assess it, but we already pretty much know what the outcome of that discussion will be. Similarly, when you hear Tom complaining about the cost of adding a new grammar keyword, that's a tip-off that he either thinks that no effort whatsoever was made to find a syntax that will work with the existing keywords, or he doesn't think the feature is worth very much, because if you look at

git log -p --author=Lane src/include/parser/kwlist.h

you'll find he's added them on various occasions with nary a word spoken. What we need to be wary of, in my opinion anyway, is not adding binary bloat or keywords per se, but rather of adding them for trivial gains. And I do consider a 6% gain on a highly artificial microbenchmark to be trivial.
If we were talking about a 10x speedup, I doubt the word "bloat" would be anywhere on this thread, and it certainly wouldn't be getting the prominence that it has here. I'm also more than slightly concerned that we are losing sight of the forest for the trees. I have heard reports that sorting large amounts of data is many TIMES slower for us than for a certain other database product. I frankly wonder if this entire patch is a distraction. When inlining is the tool of choice for further performance improvement, it suggests that we're pretty much out of ideas, and that whatever we get out of inlining - be it 6% or 30% - will be the end. I don't like that idea very much. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 9 February 2012 13:50, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm also more than slightly concerned that we are losing sight of the
> forest for the trees. I have heard reports that sorting large amounts
> of data is many TIMES slower for us than for a certain other database
> product. I frankly wonder if this entire patch is a distraction.
> When inlining is the tool of choice for further performance
> improvement, it suggests that we're pretty much out of ideas, and that
> whatever we get out of inlining - be it 6% or 30% - will be the end.
> I don't like that idea very much.

If you're talking about the product I think you're talking about, well, their contracts forbid any actual benchmarks, so we won't have any hard figures, but I find it intuitively difficult to believe that the difference is that large, mostly because I've demonstrated that the performance of qsort is a pretty good proxy for the performance of a query like "select * from foo order by bar", and qsort can't be too far from optimal. Besides, didn't you yourself say "I've been skeptical of this patch for a number of reasons, the major one of which is that most workloads spend only a very small amount of time in doing quicksorts"?

I have a new plan. I think you'll like it:

* drop the type specific specialisations entirely.

* drop qsort_arg (the original, unspecialised version). We now have exactly the same number of source code copies of qsort as before: 2.

* gut the comparison of the leading sort key only from comparetup_heap, comparetup_index_btree, etc (I collectively refer to these functions as tuple-class encapsulating comparators, distinct from their comparator proper).

* Instantiate two specialisations of qsort_arg: single key and multiple key (exactly the same number as has already been agreed to, since we lost the generic version). However, there is one key difference now: we call one of the tuple-class encapsulating functions through a function pointer if the first comparison gives equality. Early indications are that this is almost as much of a win as the single-key case. So the code effectively does this (but through a function pointer):

#define MULTI_ADDITIONAL_CODE(COMPAR) \
{ \
	return comparetup_heap(aT, bT, state); \
}

This works because most comparisons won't actually need to look at the second key, and because the cut-down tuple-class encapsulating comparator that is used in the last revision of the patch doesn't actually make any assumption about the particular tuple-class encapsulating comparator in use.

It may not even be worthwhile having a separate copy of the qsort function for the multiple-key case, so the question of binary bloat becomes entirely irrelevant, as there would be a net gain of zero copies of qsort_arg.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
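Put differently, the specialised comparator inlines the leading-key comparison and only pays for an indirect call when the leading keys compare equal. A rough sketch of that shape, reusing names from the existing tuplesort/SortSupport code (the wrapper function itself is hypothetical):

static inline int
compare_leading_key(const SortTuple *a, const SortTuple *b,
					Tuplesortstate *state)
{
	/* inlined comparison of the leading sort key */
	int			compare = ApplySortComparator(a->datum1, a->isnull1,
											  b->datum1, b->isnull1,
											  &state->sortKeys[0]);

	if (compare != 0)
		return compare;

	/*
	 * Leading keys compare equal: look at the remaining keys through the
	 * tuple-class encapsulating comparator (comparetup_heap,
	 * comparetup_index_btree, etc), called through a function pointer.
	 */
	return state->comparetup(a, b, state);
}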
On Thu, Feb 9, 2012 at 9:37 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> If you're talking about the product I think you're talking about,
> well, their contracts forbid any actual benchmarks, so we won't have
> any hard figures, but I find it intuitively difficult to believe that
> the difference is that large, mostly because I've demonstrated that
> the performance of qsort is a pretty good proxy for the performance of
> a query like "select * from foo order by bar", and qsort can't be too
> far from optimal. Besides, didn't you yourself say "I've been
> skeptical of this patch for a number of reasons, the major one of
> which is that most workloads spend only a very small amount of time in
> doing quicksorts"?

Emphasis on "large" amounts of data. I've said from the beginning that the small-sort case is less interesting to me: it's nice to make it faster, but nobody's going to quit using PostgreSQL because a quicksort takes 8ms instead of 6ms. It's when we start talking about operations that take minutes or hours that the differences become material.

> I have a new plan. I think you'll like it:
>
> * drop the type specific specialisations entirely.

Check.

> * drop qsort_arg (the original, unspecialised version). We now have
> exactly the same number of source code copies of qsort as before: 2.

This may be useful elsewhere some day, but I suppose we can always pull it back out of git if we need it.

> * gut the comparison of the leading sort key only from
> comparetup_heap, comparetup_index_btree, etc (I collectively refer to
> these functions as tuple-class encapsulating comparators, distinct
> from their comparator proper).
>
> * Instantiate two specialisations of qsort_arg: single key and
> multiple key (exactly the same number as has already been agreed to,
> since we lost the generic version). However, there is one key
> difference now: we call one of the tuple-class encapsulating functions
> through a function pointer if the first comparison gives equality.
> Early indications are that this is almost as much of a win as the
> single-key case. So the code effectively does this (but through a
> function pointer):
>
> #define MULTI_ADDITIONAL_CODE(COMPAR) \
> { \
>     return comparetup_heap(aT, bT, state); \
> }
>
> This works because most comparisons won't actually need to look at the
> second key, and because the cut-down tuple-class encapsulating
> comparator that is used in the last revision of the patch doesn't
> actually make any assumption about the particular tuple-class
> encapsulating comparator in use.
>
> It may not even be worthwhile having a separate copy of the qsort
> function for the multiple-key case, so the question of binary bloat
> becomes entirely irrelevant, as there would be a net gain of zero
> copies of qsort_arg.

I'm not sure I entirely follow all this, but I'll look at the code once you have it.
Are you saying that all the comparetup_yadda functions are redundant to each other in the single-key case?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 09, 2012 at 07:24:49AM -0500, Noah Misch wrote:
> This patch has gotten more than its fair share of attention for bloat, and I
> think that's mostly because there's a dial-a-bloat-level knob sticking out and
> begging to be frobbed.

I already emailed Peter privately stating that he shouldn't feel bad about the many critiques of his patch --- this dial-a-bloat-level knob is new for us, and we have to get used to the idea. We are really discussing the idea of a dial-a-bloat-level knob rather than his patch itself.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not sure I entirely follow all this, but I'll look at the code
> once you have it. Are you saying that all the comparetup_yadda
> functions are redundant to each other in the single-key case?

Yes, I am. The main reason that the loops exist in those functions (which is the only way that they substantially differ) is that they each have to get the other keys through various ways that characterise the tuple class that they encapsulate (index_getattr(), heap_getattr(), etc).

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
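To make that concrete, here is roughly the shape those loops share once the leading key has been handled; only the attribute-fetching call differs by tuple class (a sketch loosely following comparetup_heap; comparetup_index_btree would use index_getattr() instead, and the surrounding variables are assumed from the enclosing function):

	for (nkey = 1; nkey < state->nKeys; nkey++)
	{
		SortSupport ssup = state->sortKeys + nkey;
		Datum		datum1,
					datum2;
		bool		isnull1,
					isnull2;

		/* fetch the next key from each tuple; this call varies by tuple class */
		datum1 = heap_getattr(&ltup, ssup->ssup_attno, tupDesc, &isnull1);
		datum2 = heap_getattr(&rtup, ssup->ssup_attno, tupDesc, &isnull2);

		compare = ApplySortComparator(datum1, isnull1,
									  datum2, isnull2, ssup);
		if (compare != 0)
			return compare;
	}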
On Thu, Feb 09, 2012 at 03:36:23PM +0000, Peter Geoghegan wrote: > On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote: > > I'm not sure I entirely follow all this, but I'll look at the code > > once you have it. Are you saying that all the comparetup_yadda > > functions are redundant to each other in the single-key case? > > Yes, I am. The main reason that the loops exist in those functions > (which is the only way that they substantially differ) is because they > each have to get the other keys through various ways that characterise > the tuple class that they encapsulate (index_getattr(), > heap_getattr(), etc). Does this help all types for sorting, including strings? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 9 February 2012 17:16, Bruce Momjian <bruce@momjian.us> wrote: >> Yes, I am. The main reason that the loops exist in those functions >> (which is the only way that they substantially differ) is because they >> each have to get the other keys through various ways that characterise >> the tuple class that they encapsulate (index_getattr(), >> heap_getattr(), etc). > > Does this help all types for sorting, including strings? Yes, it does, though of course that's not expected to make too much difference with text when the C locale isn't used, because the comparisons are inherently much more expensive, and there isn't a whole lot we can do about that, at least here. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 9 February 2012 14:51, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not sure I entirely follow all this, but I'll look at the code
> once you have it.

I have attached a revision of the patch, with the adjustments already described. Note that I have not made this support btree tuplesorting yet, as there is an impedance mismatch that must be resolved, particularly with the SortSupport stuff, and I wanted to know what you think of the multiple-key specialisation first. Arguably, we could get away with only a single specialisation - I haven't really thought about it much.

You say "Well, how often will we sort 10,000 integers?", and I think that btree index creation is one very common and useful case, so I'd like to look at that in more detail. I certainly don't see any reason to not do it too.

This should give you performance for sorting multiple keys that is almost as good as the single-key optimisation that you found to be more compelling. Obviously the need to actually call comparetup_heap to look at non-leading sortkeys will vary from case to case, and this is based on your test case, where there are no duplicates and thus no need to ever do that. That isn't too far from representative, as I think that in general, a majority of comparisons won't result in equality.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> [ new patch ]

I spent quite a bit of time looking at this today - the patch specifically, and the issue of making quicksort fast more generally. It seemed to me that if we're going to have separate copies of the quicksort code for tuple sorting, we might as well go whole hog and specialize those copies to the particular needs of tuplesort.c as much as possible. Accordingly, I whacked the code around so that it knows that it is sorting SortTuple objects and need not conditionalize at runtime on the size of the objects being swapped. You suggested upthread that this might be worthwhile, and it seems that it is, so I think we should do it.

Your patch removes the CHECK_FOR_INTERRUPTS() call from comparetup_heap, which is no good. However, since I'd already decided to specialize the copies of quicksort intended for sorting as much as possible, it made sense to me to refactor things so that the qsort routine itself, rather than the comparator, is responsible for calling CHECK_FOR_INTERRUPTS(). This slightly reduces the number of times we CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons before doing it.

I find that your pg_always_inline macro is equivalent to just plain inline on my system (MacOS X v10.6.8, gcc 4.2.1). It seems to need something like this:

+#elif __GNUC__
+#define pg_always_inline inline __attribute__((always_inline))

...but I'm not very happy about relying on that, because I don't know that it will work on every gcc version (never mind non-gcc compilers), and I'm not convinced it's actually improving performance even on this one. The documentation seems to indicate that this is intended to force inlining even when not optimizing, which may have something to do with the lack of effect: that's not really the point here anyway.

What I did instead is to replace template_qsort_arg.h with a script called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c that tuplesort.c then #includes. This seems more flexible to me than the macro-based approach. In particular, it allows me to generate versions of qsort with different call signatures. The attached patch generates two:

static void qsort_tuple(SortTuple *a, size_t n,
                        SortTupleComparator cmp_tuple, Tuplesortstate *state);
static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);

The first of these is a drop-in replacement for qsort_arg() - any tuplesort can use it, not just heap sorts. But it is faster than qsort_arg() because of the specializations for the SortTuple data type. The second handles the special case where we are sorting by a single key that has an associated SortSupport object. In this case we don't need to carry the overhead of passing around the Tuplesortstate and dereferencing it, nor do we need the SortTupleComparator: we can just pass the SortSupport itself. Maybe there's a way to get this effect using macros, but I couldn't figure it out. At any rate, at least for the single-key case, this approach effectively forces the comparator to be inlined without requiring pg_always_inline.

With this patch, I get the following results, as compared with your 2012-02-10 version and master, using the same test cases I tested before.
select * from nodups order by g offset 10001;
tps on master: 289.471274, 289.967984, 289.595958
tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
tps on attached version: 388.212793, 386.085083, 386.867478

select * from twocol order by a, b offset 10000;
tps on master: 261.676611, 260.440886, 259.529362
tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
tps on attached version: 283.146463, 278.344827, 280.727798

select * from f8 order by g offset 10000;
tps on master: 228.299924, 222.650355, 227.408506
tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
tps on attached version: 276.985299, 275.341575, 274.428095

There's some variation (which I can't account for) between the results on master now and the results on master before - possibly just code shifting around between cache lines due to unrelated changes, or maybe some inadvertent change in my test setup. But it looks to me like your 2012-02-10 version, without any type-specific optimizations, does pretty much just as well on multi-key sorting as your previous version, which had them - or if there is a difference, it's pretty small.

Overall, I think the numbers for the version I'm attaching here look pretty good: the single-key performance is clearly better than your last version, and the multi-key performance is very slightly worse. I think that slight worsening is a good trade-off, though, because this version can use qsort_tuple() for all kinds of tuplesorts, not just heap tuplesorts. Still, it seems like we ought to be able to do even better: the multi-key specialization that you had in your patch can be coded in this framework, too, and in theory those are independent of the swapcode improvements. I tried coding up a multi-key specialization which I believe to be quite similar to what you did, but it didn't seem to do much. I'm attaching it here; maybe there's some way to improve it (or a different test case where it pays off).

It strikes me that if we wanted to take this further, we could look at squeezing out ApplySortComparator. For example, suppose that, upon discovering that we can do an in-memory quicksort on a single sort key, we make an initial pass over the data where we check whether it's sorted and, as we go, swap all the entries with isnull1 = true to the end of the memtuples array. We then sort the isnull1 = true entries with the standard comparator, and the isnull1 = false entries with an optimized comparator that elides most of ApplySortComparator and instead just calls the comparison function directly. We then decide on whether to emit the isnull1 = true entries first or last based on NULLS FIRST/LAST, and decide whether to emit the remaining entries in forward or reverse order based on ASC/DESC. Or maybe not exactly that thing, but something like that, so that we sort the null and non-null entries separately. The additional complexity in the read-out logic would probably be more than offset by being able to use a simpler comparator.

What do you think of this version?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
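To sketch the initial partitioning pass Robert describes, assuming the memtuples array and SortTuple's isnull1 flag from tuplesort.c (the function itself is hypothetical, and the already-sorted check is omitted):

/*
 * Move all entries with isnull1 = true to the tail of the array, returning
 * the number of non-null entries.  The two regions can then be sorted with
 * different comparators, and read out in an order chosen from NULLS
 * FIRST/LAST and ASC/DESC.
 */
static size_t
partition_nulls(SortTuple *memtuples, size_t n)
{
	size_t		i = 0;
	size_t		j = n;

	while (i < j)
	{
		if (!memtuples[i].isnull1)
			i++;
		else
		{
			SortTuple	swap_tmp = memtuples[i];

			j--;
			memtuples[i] = memtuples[j];
			memtuples[j] = swap_tmp;
		}
	}
	return i;					/* memtuples[0 .. i-1] are non-null */
}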
On 15 February 2012 06:16, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Feb 10, 2012 at 10:30 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
>> [ new patch ]
>
> I spent quite a bit of time looking at this today - the patch
> specifically, and the issue of making quicksort fast more generally.
> It seemed to me that if we're going to have separate copies of the
> quicksort code for tuple sorting, we might as well go whole hog and
> specialize those copies to the particular needs of tuplesort.c as much
> as possible. Accordingly, I whacked the code around so that it knows
> that it is sorting SortTuple objects and need not conditionalize at
> runtime on the size of the objects being swapped. You suggested
> upthread that this might be worthwhile, and it seems that it is, so I
> think we should do it.

Cool. I agree that we should do this. It doesn't need to be justified as a performance optimisation - it makes sense to refactor in this way. If that makes things faster, then so much the better.

> Your patch removes the CHECK_FOR_INTERRUPTS() call from
> comparetup_heap, which is no good. However, since I'd already decided
> to specialize the copies of quicksort intended for sorting as much as
> possible, it made sense to me to refactor things so that the qsort
> routine itself, rather than the comparator, is responsible for calling
> CHECK_FOR_INTERRUPTS(). This slightly reduces the number of times we
> CHECK_FOR_INTERRUPTS(), but never allows more than a few comparisons
> before doing it.

Well, the idea of that was to have one and only one CHECK_FOR_INTERRUPTS() call in the leading comparator. I should perhaps have further stressed that that patch was only intended to establish the idea of having that function pointer call within an inlined "leading-key's comparator calling" outer comparator.

> I find that your pg_always_inline macro is equivalent to just plain
> inline on my system (MacOS X v10.6.8, gcc 4.2.1). It seems to need
> something like this:
>
> +#elif __GNUC__
> +#define pg_always_inline inline __attribute__((always_inline))
>
> ...but I'm not very happy about relying on that, because I don't know
> that it will work on every gcc version (never mind non-gcc compilers),
> and I'm not convinced it's actually improving performance even on this
> one. The documentation seems to indicate that this is intended to
> force inlining even when not optimizing, which may have something to
> do with the lack of effect: that's not really the point here anyway.

There is a scant reference that suggests this, yes, but that's contradicted by other scanty sources. The macro was observed to be helpful in the multi-key case, and the effect was quite apparent and reproducible. At any rate, its appearance in my most recent patch was vestigial - I think that there ought to be no need for it, now that the compiler's cost/benefit analysis probably gets the inlining right.

> What I did instead is to replace template_qsort_arg.h with a script
> called gen_qsort_tuple.pl, which generates a file called qsort_tuple.c
> that tuplesort.c then #includes. This seems more flexible to me than
> the macro-based approach. In particular, it allows me to generate
> versions of qsort with different call signatures.
> The attached patch generates two:
>
> static void qsort_tuple(SortTuple *a, size_t n,
>                         SortTupleComparator cmp_tuple, Tuplesortstate *state);
> static void qsort_ssup(SortTuple *a, size_t n, SortSupport ssup);
>
> The first of these is a drop-in replacement for qsort_arg() - any
> tuplesort can use it, not just heap sorts. But it is faster than
> qsort_arg() because of the specializations for the SortTuple data
> type. The second handles the special case where we are sorting by a
> single key that has an associated SortSupport object. In this case we
> don't need to carry the overhead of passing around the Tuplesortstate
> and dereferencing it, nor do we need the SortTupleComparator: we can
> just pass the SortSupport itself. Maybe there's a way to get this
> effect using macros, but I couldn't figure it out.

I've seen the pattern where generic programming is aped with the preprocessor in the style of my patch in, of all things, wxWidgets, where it continues to be used as part of pgAdmin, or was when I worked on wxWidgets 3 support exactly one year ago. One thing that is particularly bizarre about that code is that clients actually #include a cpp file. This is, I'd guess, a legacy of having to target pre-C++98 compilers without decent template support. MSVC 6, in particular, was completely inadequate in this area.

I am inclined to agree that given that we already use Perl to generate source code like this, it seems natural that we should prefer to do that, if only to avoid paranoia about the inclusion of a dial-a-bloat knob. I am at a disadvantage here, since I've never written a line of Perl.

> With this patch, I get the following results, as compared with your
> 2012-02-10 version and master, using the same test cases I tested
> before.
>
> select * from nodups order by g offset 10001;
> tps on master: 289.471274, 289.967984, 289.595958
> tps on 2012-02-10 version: 359.150280, 356.284723, 356.888900
> tps on attached version: 388.212793, 386.085083, 386.867478
>
> select * from twocol order by a, b offset 10000;
> tps on master: 261.676611, 260.440886, 259.529362
> tps on 2012-02-10 version: 283.941312, 279.981723, 283.140208
> tps on attached version: 283.146463, 278.344827, 280.727798
>
> select * from f8 order by g offset 10000;
> tps on master: 228.299924, 222.650355, 227.408506
> tps on 2012-02-10 version: 260.289273, 257.181035, 256.377456
> tps on attached version: 276.985299, 275.341575, 274.428095
>
> There's some variation (which I can't account for) between the results
> on master now and the results on master before - possibly just code
> shifting around between cache lines due to unrelated changes, or maybe
> some inadvertent change in my test setup. But it looks to me like
> your 2012-02-10 version, without any type-specific optimizations, does
> pretty much just as well on multi-key sorting as your previous
> version, which had them - or if there is a difference, it's pretty
> small.

I'd say that that's a fair assessment. It does just as well in that case, but with far fewer additional instructions.

> Overall, I think the numbers for the version I'm attaching here look
> pretty good: the single-key performance is clearly better than your
> last version, and the multi-key performance is very slightly worse. I
> think that slight worsening is a good trade-off, though, because this
> version can use qsort_tuple() for all kinds of tuplesorts, not just
> heap tuplesorts.
> Still, it seems like we ought to be able to do even better: the
> multi-key specialization that you had in your patch can be coded in
> this framework, too, and in theory those are independent of the
> swapcode improvements. I tried coding up a multi-key specialization
> which I believe to be quite similar to what you did, but it didn't
> seem to do much. I'm attaching it here; maybe there's some way to
> improve it (or a different test case where it pays off).

Hmm. I'll run some tests myself when I get a chance, and attempt to determine what's going on here. I don't have immediate access to a good benchmarking server, which would be preferable, and I'd like to post a revision of pg_stat_statements normalisation today, as it's been too long. Nice work though.

> It strikes me that if we wanted to take this further, we could look at
> squeezing out ApplySortComparator. For example, suppose that, upon
> discovering that we can do an in-memory quicksort on a single sort
> key, we make an initial pass over the data where we check whether it's
> sorted and, as we go, swap all the entries with isnull1 = true to the
> end of the memtuples array. We then sort the isnull1 = true entries
> with the standard comparator, and the isnull1 = false entries with an
> optimized comparator that elides most of ApplySortComparator and
> instead just calls the comparison function directly. We then decide
> on whether to emit the isnull1 = true entries first or last based on
> NULLS FIRST/LAST, and decide whether to emit the remaining entries in
> forward or reverse order based on ASC/DESC. Or maybe not exactly that
> thing, but something like that, so that we sort the null and non-null
> entries separately. The additional complexity in the read-out logic
> would probably be more than offset by being able to use a simpler
> comparator.

While it's really easy to be wrong about these things, I'd venture to guess that branch prediction makes this a less than compelling win in practice. That said, if you want to know how good a win it might be, you need only knock out a quick-and-dirty prototype and see how fast a sort is - however well this does will be pretty close to your best case, but not quite, since presumably such a prototype wouldn't worry about the applicability of the optimisation.

--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

On Wed, Feb 15, 2012 at 8:29 AM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Cool. I agree that we should do this. It doesn't need to be justified
> as a performance optimisation - it makes sense to refactor in this
> way. If that makes things faster, then so much the better.

Well, maybe so, but I think if the performance had been the same I
might not have messed with it.

> I am inclined to agree that, given that we already use Perl to generate
> source code like this, it seems natural that we should prefer to do
> that, if only to avoid paranoia about the inclusion of a dial-a-bloat
> knob. I am at a disadvantage here, since I've never written a line of
> Perl.

I think it's still dial-a-bloat, but I feel pretty comfortable about
how we've got that knob adjusted in this version. It's almost as much
improvement as any previous version, it applies to more cases, and the
code footprint is the least of any version I've measured.

> Hmm. I'll run some tests myself when I get a chance, and attempt to
> determine what's going on here. I don't have immediate access to a
> good benchmarking server, which would be preferable, and I'd like to
> post a revision of pg_stat_statements normalisation today, as it's
> been too long. Nice work, though.

Thanks.

>> It strikes me that if we wanted to take this further, we could look at
>> squeezing out ApplySortComparator. For example, suppose that, upon
>> discovering that we can do an in-memory quicksort on a single sort
>> key, we make an initial pass over the data where we check whether it's
>> sorted and, as we go, swap all the entries with isnull1 = true to the
>> end of the memtuples array. We then sort the isnull1 = true entries
>> with the standard comparator, and the isnull1 = false entries with an
>> optimized comparator that elides most of ApplySortComparator and
>> instead just calls the comparison function directly. We then decide
>> on whether to emit the isnull1 = true entries first or last based on
>> NULLS FIRST/LAST, and decide whether to emit the remaining entries in
>> forward or reverse order based on ASC/DESC. Or maybe not exactly that
>> thing, but something like that, so that we sort the null and non-null
>> entries separately. The additional complexity in the read-out logic
>> would probably be more than offset by being able to use a simpler
>> comparator.
>
> While it's really easy to be wrong about these things, I'd venture to
> guess that branch prediction makes this a less than compelling win in
> practice. That said, if you want to know how good a win it might be,
> you need only knock out a quick-and-dirty prototype and see how fast a
> sort is - however well that does will be pretty close to your best
> case, though not quite, since presumably such a prototype wouldn't
> worry about the applicability of the optimisation.

You might be right. But note that the swapcode improvements were also
vulnerable to branch prediction, though there could be other effects
there. One effect that should perhaps be considered is the cost of
saving and restoring registers. Even if the branch itself doesn't cost
much because we know what the result will be, the operands still have
to be loaded into registers, which isn't free in itself and also
potentially forces those registers to be saved and restored across
function calls. Still, I'm inclined not to poke at this right now:
there are a lot of patches that still need to be looked at, and at the
rate we're currently disposing of patches this CommitFest will still
be ongoing come Easter.
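To spell out the register-pressure point, the contrast I have in mind
is roughly the following (hypothetical code, not anything in the tree).
The generic comparator's branches are nearly always predicted
correctly, but it still has to carry the sort direction and the
NULL-ness of both operands through every call, whereas the specialised
case reduces to a single indirect call:

#include <stdbool.h>

typedef int (*raw_comparator) (const void *a, const void *b);

typedef struct ToyCompareState
{
	raw_comparator cmp;
	bool		reverse;		/* DESC? */
	bool		nulls_first;	/* NULLS FIRST? */
} ToyCompareState;

/* re-tests NULL-ness and direction on every call */
static int
generic_compare(const void *a, bool a_isnull,
				const void *b, bool b_isnull,
				const ToyCompareState *state)
{
	int			result;

	if (a_isnull || b_isnull)
	{
		if (a_isnull && b_isnull)
			return 0;
		/* exactly one side is NULL; order it per NULLS FIRST/LAST */
		return (a_isnull == state->nulls_first) ? -1 : 1;
	}
	result = state->cmp(a, b);
	return state->reverse ? -result : result;
}

/* specialised path: inputs known non-null, ascending order */
static int
fast_compare(const void *a, const void *b, const ToyCompareState *state)
{
	return state->cmp(a, b);
}

Even when every branch in generic_compare() is predicted, the extra
operands have to be live across the call, and that's the cost I'm
speculating about.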
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 15 February 2012 15:27, Robert Haas <robertmhaas@gmail.com> wrote:
>> I am inclined to agree that, given that we already use Perl to generate
>> source code like this, it seems natural that we should prefer to do
>> that, if only to avoid paranoia about the inclusion of a dial-a-bloat
>> knob. I am at a disadvantage here, since I've never written a line of
>> Perl.
>
> I think it's still dial-a-bloat, but I feel pretty comfortable about
> how we've got that knob adjusted in this version. It's almost as much
> improvement as any previous version, it applies to more cases, and the
> code footprint is the least of any version I've measured.

I'm happy that the principle that a dial-a-bloat knob isn't necessarily
a bad thing has been accepted, though the term is somewhat pejorative
in that it implies that the knob necessarily adds bloat to the binary.
I define bloat here as the addition of dead instructions to the binary,
or at least of code that doesn't pull its weight. Clearly, that isn't
the case here, and I suspect that we will find that it isn't the case
in other places too.

--
Peter Geoghegan
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services