Re: Optimizer on sort aggregate - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Optimizer on sort aggregate
Date
Msg-id CAM3SWZTyWe5J69TaPvZf2CM7mhSKKE3UhHnK9gLuQckkWqoL5w@mail.gmail.com
Whole thread Raw
In response to Re: Optimizer on sort aggregate  (Feng Tian <ftian@vitessedata.com>)
Responses Re: Optimizer on sort aggregate
List pgsql-hackers
On Fri, Oct 17, 2014 at 6:25 PM, Feng Tian <ftian@vitessedata.com> wrote:
> I feel sorting string as if it is bytea is particularly interesting.  I am
> aware Peter G's patch and I think it is great, but for this sort agg case,
> first, I believe it is still slower than sorting bytea, and second, Peter
> G's patch depends on data.   A common long prefix will make the patch less
> effective, which is probably not so uncommon (for example, URL with a domain
> prefix).  I don't see any downside of sort bytea, other than lost the
> interest ordering.

FWIW, that's probably less true than you'd think. Using Robert's test program:

pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.something"
"http://www.something" ->
131f1f1b2222221e1a18101f131419120109090909090909090909090909090909010909090909090909090909090909090901053d014201420444
(59 bytes)
pg@hamster:~/code$ ./strxfrm-binary en_US.UTF-8 "http://www.another"
"http://www.another" ->
131f1f1b2222220c191a1f13101d01090909090909090909090909090901090909090909090909090909090901053d014201420444
(53 bytes)

So the first eight bytes of the first string is 0x131F1F1B2222221E,
and the second 0x131F1F1B2222220C. The last byte is different. That's
because the way the Unicode algorithm [1] works, there is often a
significantly greater concentration of entropy in the first 8 bytes as
compared to raw C strings compared with memcmp() - punctuation
characters and so on are not actually described at the primary weight
level. If we can get even a single byte to somewhat differentiate each
string, we can still win by a very significant amount - just not an
enormous amount. The break even point is hard to characterize exactly,
but I'm quite optimistic that a large majority of real-world text
sorts will see at least some benefit, while a smaller majority will be
much, much faster.

[1] http://www.unicode.org/reports/tr10/#Notation
-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Directory/File Access Permissions for COPY and Generic File Access Functions
Next
From: Stephen Frost
Date:
Subject: Re: Directory/File Access Permissions for COPY and Generic File Access Functions