
From: Peter Geoghegan <pg@heroku.com>
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)
Msg-id: CAM3SWZR47h9-D0RSJbS_vCNzea3guCBNenzNejWV8aF+Eet4Kw@mail.gmail.com
In response to: Re: B-Tree support function number 3 (strxfrm() optimization) (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Dec 2, 2014 at 1:21 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Incidentally, I think that an under-appreciated possible source of
> regressions here is abbreviated attributes that have a strong
> physical/logical correlation. I could see a small regression for one
> such case even though my cost model indicated that it should be very
> profitable.

This was the column in question:

postgres=# select * from pg_stats where tablename = 'ohio_voters' and
attname = 'mailing_address1' ;
-[ RECORD 1 ]-
schemaname             | public
tablename              | ohio_voters
attname                | mailing_address1
inherited              | f
null_frac              | 0
avg_width              | 5
n_distinct             | 789
most_common_vals       | {"    "}
most_common_freqs      | {0.969267}
histogram_bounds       | ****SNIP ***
correlation            | 0.944785

This n_distinct estimate is wrong, though. In fact, the number of
distinct values is 25,946, while the number of distinct abbreviated keys is
13,691. So correlation was not the dominant factor here (although it
was probably still a factor) - rather, the dominant factor was that
the vast majority of comparisons would get away with an opportunistic
"memcmp() == 0" anyway (although not with Postgres 9.4), and so my
baseline is very fast for this case.
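
For the archives, here's a minimal sketch of that opportunistic check
(the names and the slow path are illustrative, not the actual
varstr_cmp() code):

#include <stdio.h>
#include <string.h>

/*
 * Illustrative only: before paying for a locale-aware comparison,
 * test for raw binary equality, which must imply equality under any
 * sane collation.  On a column that is 96.9% "    ", almost every
 * comparison is resolved on the fast path, without strcoll().
 */
static int
varstr_cmp_sketch(const char *a, size_t lena, const char *b, size_t lenb)
{
    char    buf1[1024];
    char    buf2[1024];

    if (lena == lenb && memcmp(a, b, lena) == 0)
        return 0;               /* fast path: no strcoll() needed */

    /* slow path: NUL-terminate copies and defer to the collation */
    memcpy(buf1, a, lena);      /* sketch only: no overflow checks */
    buf1[lena] = '\0';
    memcpy(buf2, b, lenb);
    buf2[lenb] = '\0';
    return strcoll(buf1, buf2);
}

int
main(void)
{
    /* resolved by memcmp() alone */
    printf("%d\n", varstr_cmp_sketch("    ", 4, "    ", 4));
    return 0;
}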

This would not have come up had the value "    " been represented as
NULL (as it clearly should have been), since NULLs do not undergo
strxfrm() transformation/abbreviation in the first place. Still,
highly skewed attributes exist in the wild, and deserve our
consideration - we do not model the distribution of values within the
set.
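
To make that concrete, here's a rough sketch of how strxfrm()-based
abbreviation works (hypothetical names, not the committed code).
NULLs never reach a routine like this, because the abbreviation
machinery only sees non-NULL datums:

#include <locale.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: transform the string with strxfrm(), then keep
 * only the first sizeof(uint64_t) bytes as the abbreviated key.  Most
 * comparisons then become a single integer comparison; only ties fall
 * back to a full, authoritative comparison.
 */
static uint64_t
abbrev_key_sketch(const char *s)
{
    char        blob[BUFSIZ];
    uint64_t    key = 0;        /* zero-pads short transformed blobs */
    size_t      n;

    /* sketch only: assumes the transformed blob fits in BUFSIZ bytes */
    n = strxfrm(blob, s, sizeof(blob));
    memcpy(&key, blob, n < sizeof(key) ? n : sizeof(key));

    /*
     * A real implementation byte-swaps on little-endian machines, so
     * that plain unsigned integer comparisons agree with memcmp()
     * ordering of the transformed blob.
     */
    return key;
}

int
main(void)
{
    setlocale(LC_COLLATE, "");
    printf("%016llx\n", (unsigned long long) abbrev_key_sketch("    "));
    return 0;
}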

I believe that these cases are rare enough, and (thanks to the already
committed parts of this work) fast enough, that they're probably not
worth worrying about; maybe a more complex cost model could do better,
but I'm inclined to think that it's not worth it. We'd end up slightly
improving this case at a bigger cost to other, much more common cases.
Besides, equality-resolved comparisons are not necessarily much
cheaper for datatypes other than Postgres 9.5 text (in a world where
there is a variety of datatypes accelerated by abbreviation), which
discourages a general solution.

A custom-format dump of this data (publicly available State of Ohio
voter records) is available from:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/ohio_voters.custom.dump
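
If anyone wants to reproduce this, restoring it should just be a
matter of something like:

createdb ohio_voters
pg_restore -d ohio_voters ohio_voters.custom.dump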

-- 
Peter Geoghegan


