Re: Abbreviated keys for Numeric - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Abbreviated keys for Numeric
Date
Msg-id CAM3SWZRsvR758mWWdpNbohnN-qQhqSpSJAis8nhm2zmhDUgNpA@mail.gmail.com
Whole thread Raw
In response to Re: Abbreviated keys for Numeric  (Andrew Gierth <andrew@tao11.riddles.org.uk>)
Responses Re: Abbreviated keys for Numeric
List pgsql-hackers
On Fri, Mar 20, 2015 at 11:58 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Comparisons between nulls and nulls, or between nulls and non-nulls, are
> cheap; only comparisons between non-nulls and non-nulls can be
> expensive.
>
> The purpose of abbreviation is to replace expensive comparisons by cheap
> ones where possible, and therefore the cost model used for abbreviation
> should ignore nulls entirely; all that matters is the number of non-null
> values and the probability of saving time by abbreviating them.

I don't think it's that simple. The actual point of abbreviation is to
amortize the total cost of performing O(n log n) comparisons (the
average case, which is usually assumed by infrastructure like
sortsupport), by performing an encoding process n times. That encoding
process may be more expensive than each of the comparisons. Sometimes
significantly more expensive.

Forget about abbreviation entirely for a minute - consider plain old
sorting of strxfrm() blobs, which for example the glibc documentation
recommends (with some caveats) as the preferred approach to sorting
strings while respecting collation. Suppose that your applications
happens to frequently encounter fully sorted arrays of strings when
passed an array of strings to sort. If you were using our qsort()
implementation, which, rather questionably, has a pre-check for fully
sorted input (that throws away work when the pre-check fails), then
you'll only have n comparisons. Even if an encoding is only as
expensive as an actual comparison, the potential to lose with encoding
clearly exists. Similarly, we don't abbreviate for top-n heap sorts,
even though all of those abbreviated keys may be used for at least one
comparison.

Now, I hear what you're saying about weighing the costs and the
benefits -- you point out that abbreviation of NULL values is not a
cost, which is certainly true. But if the total number of NULL values
is so high that they dominate, then abbreviation might well be the
wrong thing for that reason alone. In general, string transformation
is not recommended for sorting very small lists of strings.

Where I think your argument is stronger is around the cost of actually
aborting in particular (even though not much work is thrown away).
Scanning through the memtuples array once more certainly isn't free.
And you could take the view that it's always worth the risk, since
it's at most a marginal cost saved if the first ~10k tuples are
representative, but a large cost incurred when they are not. So on
reflection, I am inclined to put the check for non-null values back
in. I also concede that the cost of abbreviation is lower with your
numeric encoding scheme than it is for that of the text opclass, which
favors this approach.

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Mark Kirkwood
Date:
Subject: Re: Remove fsync ON/OFF as a visible option?
Next
From: Kouhei Kaigai
Date:
Subject: Re: Custom/Foreign-Join-APIs (Re: [v9.5] Custom Plan API)