Thread: Making the C collation less inclined to abort abbreviation

Making the C collation less inclined to abort abbreviation

From
Peter Geoghegan
Date:
The C collation is treated exactly the same as other collations when
considering whether the generation of abbreviated keys for text should
continue. This doesn't make much sense. With text, the big cost that
we are concerned about going to waste should abbreviated keys not
capture sufficient entropy is the cost of n strxfrm() calls. However,
the C collation doesn't use strxfrm() -- it uses memcmp(), which is
far cheaper.

With other types, like numeric and now UUID, the cost of generating an
abbreviated key is significantly lower than text when using collations
other than the C collation. Their cost models reflect this, and abort
abbreviation far less aggressively than text's, even though the
trade-off is very similar when text uses the C collation.

Attached patch fixes this inconsistency by making it significantly
less likely that abbreviation will be aborted when the C collation is
in use. The behavior with other collations is unchanged. This should
be backpatched to 9.5 as a bugfix, IMV.

--
Peter Geoghegan

Attachment

Re: Making the C collation less inclined to abort abbreviation

From
Robert Haas
Date:
On Sun, Nov 29, 2015 at 4:02 PM, Peter Geoghegan <pg@heroku.com> wrote:
> The C collation is treated exactly the same as other collations when
> considering whether the generation of abbreviated keys for text should
> continue. This doesn't make much sense. With text, the big cost that
> we are concerned about going to waste should abbreviated keys not
> capture sufficient entropy is the cost of n strxfrm() calls. However,
> the C collation doesn't use strxfrm() -- it uses memcmp(), which is
> far cheaper.
>
> With other types, like numeric and now UUID, the cost of generating an
> abbreviated key is significantly lower than text when using collations
> other than the C collation. Their cost models reflect this, and abort
> abbreviation far less aggressively than text's, even though the
> trade-off is very similar when text uses the C collation.
>
> Attached patch fixes this inconsistency by making it significantly
> less likely that abbreviation will be aborted when the C collation is
> in use. The behavior with other collations is unchanged. This should
> be backpatched to 9.5 as a bugfix, IMV.

Could you provide a test case where this change is a winner for the C
locale but a loser for some other locale?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company