On Sun, Mar 22, 2015 at 6:48 PM, Andrew Gierth
<andrew@tao11.riddles.org.uk> wrote:
> Your version would have aborted abbreviation on that second query, thus
> losing a 3 second speedup. How on earth is that supposed to be
> justified? It's not like there's any realistically possible case where
> your version performs faster than mine by more than a tiny margin.

As I said, that's really why you won the argument on this particular
point. Why are you still going on about it?

> Peter> Where I think your argument is stronger is around the cost of
> Peter> actually aborting in particular (even though not much work is
> Peter> thrown away). Scanning through the memtuples array once more
> Peter> certainly isn't free.
>
> Yes, the cost of actually aborting abbreviation goes up with the number
> of nulls. But your version of the code makes that WORSE, by making it
> more likely that we will abort unnecessarily.
>
> If we use the raw value of memtuplecount for anything, it should be to
> make it LESS likely that we abort abbreviations (since we'd be paying a
> higher cost), not more. But benchmarking doesn't suggest that this is
> necessary, at least not for numerics.
>
> Peter> And you could take the view that it's always worth the risk,
> Peter> since it's at most a marginal cost saved if the first ~10k
> Peter> tuples are representative, but a large cost incurred when they
> Peter> are not. So on reflection, I am inclined to put the check for
> Peter> non-null values back in.
>
> Right result but the wrong reasoning.

I think there is an argument to be made against abbreviation when we
simply have a small list of strings to begin with (e.g. 50), as per
the glibc strxfrm() docs (which, as I said, may not apply to numeric
to the same extent). The text opclass didn't end up working that way
for various reasons, mostly because the cost model was already
complicated enough. Ideally, the number of comparisons per key is at
least 10 when we abbreviate with text, which works out at about 1,000
tuples (as costed by cost_sort()). For that reason, having few actual
values to sort is in theory a reason not to encode/abbreviate. That
leaves aside the risk of aborting when the sampled tuples are not
representative, which is a big issue and clearly swings things in
favor of always disregarding NULLs.
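
To make the "10 comparisons per key is about 1,000 tuples" arithmetic concrete, here is a rough sketch. It assumes the cost model charges about N * log2(N) comparisons for an in-memory sort of N tuples, so each tuple takes part in roughly log2(N) comparisons. The function names are hypothetical and are not taken from the PostgreSQL source.

```python
import math


def comparisons_per_tuple(ntuples: int) -> float:
    """Approximate comparisons per tuple, assuming N * log2(N) total."""
    if ntuples < 2:
        return 0.0  # nothing to compare against
    return math.log2(ntuples)


def min_tuples_for(comparisons: float) -> int:
    """Smallest tuple count giving at least `comparisons` per tuple."""
    return math.ceil(2 ** comparisons)


if __name__ == "__main__":
    for n in (50, 1000, 10000):
        print(f"{n:>6} tuples: ~{comparisons_per_tuple(n):.2f} comparisons each")
    print("tuples needed for 10 comparisons each:", min_tuples_for(10))
```

Under that assumption, a 50-string list gets well under 10 comparisons per key, while 1,000 tuples is just about at the threshold (the exact figure is 1,024).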
--
Peter Geoghegan