Re: B-Tree support function number 3 (strxfrm() optimization) - Mailing list pgsql-hackers
From: Peter Geoghegan
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Date:
Msg-id: CAM3SWZQKwELa58h1R9sxwAOCJpgs=xJbiu7apDyq9pUuSfQX6g@mail.gmail.com
In response to: Re: B-Tree support function number 3 (strxfrm() optimization) (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses:
  Re: B-Tree support function number 3 (strxfrm() optimization) (Peter Geoghegan <pg@heroku.com>)
  Re: B-Tree support function number 3 (strxfrm() optimization) (Peter Geoghegan <pg@heroku.com>)
  Re: B-Tree support function number 3 (strxfrm() optimization) (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Apr 8, 2014 at 2:57 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> Hmm. I would expect the worst case to be where the strxfrm is not
> helping because all the entries have the same prefix, but the actual
> key is as short and cheap-to-compare as possible. So the padding
> should be as short as possible. Also, we have a fast path for
> pre-sorted input, which reduces the number of comparisons performed;
> that will make the strxfrm overhead more significant.
>
> I'm getting about 2x slowdown on this test case:
>
> create table sorttest (t text);
> insert into sorttest select 'foobarfo' || (g) || repeat('a', 75) from
> generate_series(10000, 30000) g;

One thing that isn't all that obvious about this worst case is that it's in general very qsort() friendly, and therefore the startup costs (copying) totally dominate. Actually, you're not even sorting - you're verifying that the tuples are already exactly in order (a questionable optimization we apply at every level). Consider:

postgres=# select correlation from pg_stats where tablename = 'sorttest' and attname = 't';
 correlation
-------------
           1
(1 row)

So you only have to do n - 1 comparisons, never n log n. This is a very skewed test. There is no reason to think that strcoll() is better than strxfrm() + strcmp() if the first difference comes early. That's the main reason why what I called a worst case is much more representative than this, even if we assume that it's very common for people to sort strings that don't differ in the first 8 bytes and yet are not identical.

I took a look at fixing your worst case all the same, despite becoming discouraged during discussion at pgCon. I thought about it for a while, and realized that the problems are probably solvable.

Even so, I would like to emphasize that this is a family of techniques for B-Tree operator classes. Sorting heap tuples is not the only use case, and it's probably not even the most interesting one, yet all the criticisms are largely confined to sorting alone. This is because wasted cycles on a poor man's comparison are not too problematic when the only additional work is a few CPU instructions per comparison, and not the entire transformation, whose cost can be amortized over a huge length of time. I'm thinking of inner pages in B-Trees here. I've looked at heap tuple sorting first because the sortsupport infrastructure happens to be available, making it the best proving ground. That's the only reason.

Anyway, major steps against worst case performance in the attached revision include:

* A configure AC_TRY_RUN test checks the suitability of the system strxfrm() implementation for the purposes of this optimization. There can be no "header bytes" (people at pgCon reported redundant bytes in the Mac OS X implementation, to my great surprise), and with ASCII code points you get the full benefit of being able to store the leading 8 code points in the poor man's key (we also insist that sizeof(Datum) == 8 before applying the poor man's optimization). It might be that this restricts the optimization to 64-bit Linux entirely, where glibc is almost universally used. If that's true, I consider it acceptable, given that the majority of Postgres systems run on this family of platforms in practice. Even so, the fact that not every implementation meets my standard came as a big surprise to me, and I hope that the problem is limited to Mac OS X.

I'm slightly concerned that all BSD systems are affected by this issue, but even if that were true, not covering those cases would be acceptable in my view. *Maybe* we could come up with a scheme for stripping those header bytes, if someone feels very strongly about it and we judge it worth getting our hands dirty in that way (we could even dynamically verify that the header bytes always match, and bail at little cost on those platforms if that assumption is somehow undermined).
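To give a flavor of what such a test might look like, here is a minimal sketch of the kind of probe an AC_TRY_RUN block could compile and run. This is my illustration only, not the actual test from the patch: it insists on a 64-bit platform, and on distinct ASCII inputs diverging in the very first byte of their strxfrm() blobs, which fixed header bytes would prevent.

/*
 * Hypothetical AC_TRY_RUN probe body, in the spirit of the test described
 * above (a reconstruction, not the patch's actual code).  Exits 0 if
 * strxfrm() looks suitable for poor man's keys: a 64-bit platform, and no
 * fixed "header bytes" hiding the leading primary weights.
 */
#include <locale.h>
#include <string.h>

int
main(void)
{
	char		a_buf[64];
	char		b_buf[64];

	/* stand-in for the sizeof(Datum) == 8 requirement */
	if (sizeof(void *) != 8)
		return 1;
	/* respect the build environment's collation */
	if (setlocale(LC_COLLATE, "") == NULL)
		return 1;
	if (strxfrm(a_buf, "a", sizeof(a_buf)) >= sizeof(a_buf))
		return 1;
	if (strxfrm(b_buf, "b", sizeof(b_buf)) >= sizeof(b_buf))
		return 1;

	/*
	 * Header bytes (as reported for Mac OS X) would make the leading
	 * bytes of any two blobs identical, wasting poor man's key space;
	 * insist that distinct inputs diverge in the first byte.
	 */
	return (a_buf[0] == b_buf[0]) ? 1 : 0;
}

A probe along these lines fails safe: a platform that doesn't pass simply never applies the optimization.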
* The dominant additional cost in Heikki's worst case is wasted strxfrm() transformations (or maybe just the memcpy() overhead required to NUL-terminate text for strxfrm() processing). When copying heap tuples over into the memtuples array, we instrument the costs and weigh them against the likely eventual benefits using heuristic rules. We may choose to abort the transformation process early if it doesn't look promising. This process includes maintaining the mean string length so far (which is taken as a proxy for transformation costs), as well as maintaining the approximate cardinality of the set of already-generated poor man's keys using the HyperLogLog algorithm [1].

This almost completely fixes the worst case performance Heikki complained of. With smaller strings you might get my implementation to show worse regressions than what you'll see for your worst case by placing careful traps for the heuristics that now appear within bttext_abort(), but I suspect not by terribly much. These are still fairly implausible cases.

The cost of adding all of these ameliorating measures appears to be very low. We're so bottlenecked on memory bandwidth that the fixed-size overhead of maintaining poor man's key cardinality, and the extra cycles spent hashing n keys for the benefit of HyperLogLog, just don't seem to matter next to the big savings on n log n comparisons. The best case performance is seemingly just as good as before (about a 200% improvement in transaction throughput for one client, but closer to a 250% improvement with many clients and/or expensive collations), while the worst case is much, much better. As the HyperLogLog paper [2] says:

"In practice, the HYPERLOGLOG program is quite efficient: like the standard LOGLOG of [10] or MINCOUNT of [16], its running time is dominated by the computation of the hash function, so that it is only three to four times slower than a plain scan of the data (e.g., by the Unix command “wc -l”, which merely counts end-of-lines)."

(Since we're doing all that copying and everything else anyway, and only hashing the first 8 bytes, the ratio is more compelling still for us, before we even begin the sort.)

I'm not entirely sure that passing down the outer plan's estimated row count in the ExecSort() case, to give the bttext_abort() heuristics a "sense of proportion" about costs, is the right thing. It's a difficult problem to judge, so I've left that in despite some misgivings. In any case, I think that this revision isn't too far from the best of both worlds. I do have some concerns that things aren't well balanced within bttext_abort(), but it's a start. It's hard to characterize the break-even point exactly, particularly given that we only know about cardinality and string length, and not distribution. Let's not lose sight of the fact that in the real world the large majority of sorts will be on the right side of that break-even point, though. This is all about ameliorating infrequent worst cases.
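To make the cardinality tracking more concrete, here is a self-contained sketch of a small HyperLogLog estimator driving an abort decision of the general shape described above. The register count, thresholds, and all names here are illustrative assumptions of mine, not what bttext_abort() actually does:

/*
 * Sketch of HyperLogLog-based abort logic (illustrative only).
 * Compile with -lm.
 */
#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define HLL_BITS	10				/* 2^10 registers: low, fixed overhead */
#define HLL_REGS	(1 << HLL_BITS)

static uint8_t	hll_reg[HLL_REGS];
static double	keys_seen;

/* splitmix64 finalizer: a cheap stand-in for a real 64-bit hash */
static uint64_t
hash64(uint64_t x)
{
	x += UINT64_C(0x9e3779b97f4a7c15);
	x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
	x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
	return x ^ (x >> 31);
}

/* Feed in one poor man's key (the first 8 bytes of a strxfrm() blob) */
static void
observe_key(uint64_t poorman_key)
{
	uint64_t	h = hash64(poorman_key);
	int			idx = (int) (h >> (64 - HLL_BITS));
	int			rank = 1;

	/* rank = position of the leftmost 1-bit in the remaining bits */
	h <<= HLL_BITS;
	while (rank < 64 - HLL_BITS && !(h & (UINT64_C(1) << 63)))
	{
		h <<= 1;
		rank++;
	}
	if (hll_reg[idx] < rank)
		hll_reg[idx] = (uint8_t) rank;
	keys_seen++;
}

static double
estimate_cardinality(void)
{
	double		sum = 0.0;
	double		e;
	int			zeros = 0;

	for (int i = 0; i < HLL_REGS; i++)
	{
		sum += 1.0 / (double) (UINT64_C(1) << hll_reg[i]);
		if (hll_reg[i] == 0)
			zeros++;
	}
	/* alpha_m = 0.7213 / (1 + 1.079 / m), per the HyperLogLog paper */
	e = 0.7213 / (1.0 + 1.079 / HLL_REGS) * HLL_REGS * (double) HLL_REGS / sum;
	/* small-range correction: fall back to linear counting */
	if (e <= 2.5 * HLL_REGS && zeros > 0)
		e = HLL_REGS * log((double) HLL_REGS / zeros);
	return e;
}

/*
 * Abort the transformation process if too few poor man's keys are
 * distinct: with mostly-duplicate abbreviated keys, nearly every
 * comparison falls through to the authoritative comparator anyway.
 */
static bool
should_abort(void)
{
	if (keys_seen < 10000)		/* the estimate is noisy early on */
		return false;
	return estimate_cardinality() < keys_seen / 20;	/* illustrative ratio */
}

int
main(void)
{
	/* Simulate pathological input: 37 distinct keys among 100,000 tuples */
	for (uint64_t i = 0; i < 100000; i++)
		observe_key(i % 37);
	printf("estimated cardinality: %.0f, abort: %s\n",
		   estimate_cardinality(), should_abort() ? "yes" : "no");
	return 0;
}

With only 37 distinct leading-8-byte keys among 100,000 inputs, roughly 1KB of estimator state is enough to detect that nearly every poor man's comparison would end in a tie, so the transformations aren't worth performing.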
Because of the new strxfrm()-quality AC_TRY_RUN configure test, you'll need to run "autoreconf" and so on to get the patch working; I didn't include the resulting changes to the generated "configure" file in the patch itself. I use "en_US.UTF8" for the purposes of all benchmark figures here.

I attach some interesting numbers for a variety of cases (poorman-performance.txt). Worst case performance is considered. I've been using a table of 317,102 cities of the world to look at average and best case performance. Overview:

postgres=# select count(*), country from cities group by country order by 1 desc limit 15;
 count |         country
-------+--------------------------
 66408 | India
 35845 | France
 29926 | United States of America
 13154 | Turkey
 11210 | Mexico
 11028 | Germany
  9352 | Tanzania
  8118 | Spain
  8109 | Italy
  7680 | Burkina Faso
  6066 | Czech Republic
  6043 | Iran
  5777 | Slovenia
  5584 | Brazil
  5115 | Philippines
(15 rows)

These city names are all Latin characters with accents and other diacritics. For example:

postgres=# select * from cities where country = 'Sweden' limit 5;
 country | province |     city      |  one   |  two   | three  |  four
---------+----------+---------------+--------+--------+--------+--------
 Sweden  | Blekinge | Backaryd      | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Bräkne-Hoby   | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Brömsebro     | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Drottningskär | [null] | [null] | [null] | [null]
 Sweden  | Blekinge | Eringsboda    | [null] | [null] | [null] | [null]
(5 rows)

You can download a plain format dump of this database here:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump

I also attach some DEBUG1 output that shows the accuracy of HyperLogLog estimation (of poor man's normalized key cardinality) for the cases tested (poorman-debug-hll-cardinality.txt). It's very accurate given the low, fixed memory overhead chosen.

[1] https://en.wikipedia.org/wiki/HyperLogLog
[2] http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

--
Peter Geoghegan