Re: B-Tree support function number 3 (strxfrm() optimization) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: B-Tree support function number 3 (strxfrm() optimization) |
Date | |
Msg-id | CAM3SWZS7iUfourNKVk4YBPWDhYe7pVK-BgPv7rdJD7xpkdgxmw@mail.gmail.com Whole thread Raw |
In response to | Re: B-Tree support function number 3 (strxfrm() optimization) (Peter Geoghegan <pg@heroku.com>) |
Responses |
Re: B-Tree support function number 3 (strxfrm()
optimization)
|
List | pgsql-hackers |
On Thu, Jul 31, 2014 at 1:12 PM, Peter Geoghegan <pg@heroku.com> wrote: > Abbreviated it is. Attached revision uses new terminology. I have abandoned configure tests entirely, preferring to explicitly discriminate against Mac OS X (Darwin) as a platform with a known odd locale implementation, just like pg_get_encoding_from_locale() does. Since Robert did not give a reason for discussing the basic fmgr-elision patch despite having already discussed it a couple of years ago, I have not split the patch into two (if I did, the first patch would be virtually identical to Robert's 2012 patch). I hope that's okay. I am willing to revisit this question if Robert feels that something is uncertain about the costs and benefits of a basic fmgr-eliding sort support for text. There are still some open items: * I have left the licensing question unresolved. There is plenty we can do about this, and if necessary we can ask the original author to changed his license from the MIT license to the PostgreSQL license. AFAICT, the BSD license is *more* restrictive than the MIT license, and the PostgreSQL license is identical. The wording is almost identical. I don't see the concern. If the PostgreSQL license had the “non-endorsement clause” of the New BSD license, or alternatively if the MIT license had a similar clause that the PostgreSQL licensed lacked, that would be a substantive and appreciable difference. That isn't the case. * I have not made aggregates use the optimization where they currently accidentally don't due to using datum tuplesort. I can artificially force them to use heap tuplesort where that's likely to help [1]. Let's defer this question until we have an abort algorithm that seems reasonable. There is a FIXME item. Improvements in full: * New terminology ("Abbreviated keys"). * Better explanations for why we don't use the additional optimization of abbreviated keys where we're using sort support, in analyze.c and nodeMergeAppend.c. * Better handling of NULLs. * Never use the optimization with bounded heap sorts. * Better abort strategy, that centers on the idea of measuring full key cardinality, and abbreviated key cardinality, and weighing how good a proxy the former is for the latter. This is heavily weighed when deciding if we should abort normalization as tuples are copied. Exact details are explained within bttext_abort(). As already mentioned, this can allow us to use the optimization when we're sorting a million tuples with only five distinct values. This will still win decisively, but it's obviously impossible to make work while only considering abbreviated key cardinality. Determining cardinality of both abbreviated keys and full keys appears to have very low overhead, and is exactly what we ought to care about, so that's what we do. While there is still some magic to the algorithm's inputs, my sense is that I'm much closer to characterizing the break-even point than before. I also attach a second patch, which adds additional debug instrumentation, and is intended to be applied on top of the real patch to help with review. Run Postgres with DEBUG1 output when it's applied. With the patch applied, the setting backslash_quote also controls whether or not the optimization is used. So with the debug instrumentation patch applied: "backslash_quote = on" - use optimization, but never abort "backslash_quote = off" - Never use optimization - set up shim (just like the win32 UTF-8 case). Equivalent to master's behavior. "backslash_quote = safe_encoding" - Use optimization, but actually abort if it doesn't work out, the behavior that is always seen without instrumentation. This is useful for testing the overhead of considering the optimization in cases where it didn't work out (i.e. it's useful to compare this with "backslash_quote = off"). I've found it useful to experiment with real-world data with the optimization dynamically enabled and disabled. Thoughts? [1] http://www.postgresql.org/message-id/CAM3SWZSf0Ftxy8QHGAKJh=S80vD2SBx83zkEzuJyZ6R=pTy5xA@mail.gmail.com -- Peter Geoghegan
Attachment
pgsql-hackers by date: