
From: Peter Geoghegan <pg@heroku.com>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Msg-id: CAM3SWZR47h9-D0RSJbS_vCNzea3guCBNenzNejWV8aF+Eet4Kw@mail.gmail.com
In response to: Re: B-Tree support function number 3 (strxfrm() optimization) (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Dec 2, 2014 at 1:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Incidentally, I think that an under-appreciated possible source of
> regressions here is abbreviated attributes that have a strong
> physical/logical correlation. I could see a small regression for one
> such case even though my cost model indicated that it should be very
> profitable.

This was the column in question:

postgres=# select * from pg_stats where tablename = 'ohio_voters' and
attname = 'mailing_address1' ;
-[ RECORD 1 ]-
schemaname             | public
tablename              | ohio_voters
attname                | mailing_address1
inherited              | f
null_frac              | 0
avg_width              | 5
n_distinct             | 789
most_common_vals       | {"    "}
most_common_freqs      | {0.969267}
histogram_bounds       | ****SNIP ***
correlation            | 0.944785

This n_distinct estimate is wrong, though. In fact, the number of
distinct values is 25,946, while the number of distinct abbreviated keys is
13,691. So correlation was not the dominant factor here (although it
was probably still a factor) - rather, the dominant factor was that
the vast majority of comparisons would get away with an opportunistic
"memcmp() == 0" anyway (although not with Postgres 9.4), and so my
baseline is very fast for this case.
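
For the archives, here's a minimal sketch of that opportunistic check
(the names and the slow path are illustrative, not the actual
varstr_cmp() code):

#include <stdio.h>
#include <string.h>

/*
 * Illustrative only: before paying for a locale-aware comparison,
 * test for raw binary equality, which must imply equality under any
 * sane collation.  On a column that is 96.9% "    ", almost every
 * comparison is resolved on the fast path, without strcoll().
 */
static int
varstr_cmp_sketch(const char *a, size_t lena, const char *b, size_t lenb)
{
    char    buf1[1024];
    char    buf2[1024];

    if (lena == lenb && memcmp(a, b, lena) == 0)
        return 0;               /* fast path: no strcoll() needed */

    /* slow path: NUL-terminate copies and defer to the collation */
    memcpy(buf1, a, lena);      /* sketch only: no overflow checks */
    buf1[lena] = '\0';
    memcpy(buf2, b, lenb);
    buf2[lenb] = '\0';
    return strcoll(buf1, buf2);
}

int
main(void)
{
    /* resolved by memcmp() alone */
    printf("%d\n", varstr_cmp_sketch("    ", 4, "    ", 4));
    return 0;
}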

This would not have come up had the value "    " been represented as
NULL (as it clearly should have been), since NULLs do not undergo
strxfrm() transformation/abbreviation in the first place. Still,
highly skewed attributes exist in the wild, and deserve our
consideration - we do not model the distribution of values within the
set.
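
To make that concrete, here's a rough sketch of how strxfrm()-based
abbreviation works (hypothetical names, not the committed code).
NULLs never reach a routine like this, because the abbreviation
machinery only sees non-NULL datums:

#include <locale.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: transform the string with strxfrm(), then keep
 * only the first sizeof(uint64_t) bytes as the abbreviated key.  Most
 * comparisons then become a single integer comparison; only ties fall
 * back to a full, authoritative comparison.
 */
static uint64_t
abbrev_key_sketch(const char *s)
{
    char        blob[BUFSIZ];
    uint64_t    key = 0;        /* zero-pads short transformed blobs */
    size_t      n;

    /* sketch only: assumes the transformed blob fits in BUFSIZ bytes */
    n = strxfrm(blob, s, sizeof(blob));
    memcpy(&key, blob, n < sizeof(key) ? n : sizeof(key));

    /*
     * A real implementation byte-swaps on little-endian machines, so
     * that plain unsigned integer comparisons agree with memcmp()
     * ordering of the transformed blob.
     */
    return key;
}

int
main(void)
{
    setlocale(LC_COLLATE, "");
    printf("%016llx\n", (unsigned long long) abbrev_key_sketch("    "));
    return 0;
}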

I believe that these cases are rare enough, and (thanks to the already
committed parts of this work) fast enough, that they're probably not
worth worrying about; maybe a more complex cost model could do better,
but I'm inclined to think that it's not worth it. We'd end up slightly
improving this case at a bigger cost to other, much more common cases.
Besides, equality-resolved comparisons are not necessarily much
cheaper for datatypes other than Postgres 9.5 text (in a world where
there is a variety of datatypes accelerated by abbreviation), which
discourages a general solution.

A custom-format dump of this data (publicly available State of Ohio
voter records) is available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/ohio_voters.custom.dump
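
If anyone wants to reproduce this, restoring it should just be a
matter of something like:

createdb ohio_voters
pg_restore -d ohio_voters ohio_voters.custom.dump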

-- 
Peter Geoghegan


