
From Peter Geoghegan
Subject More work on SortSupport for text - strcoll() and strxfrm() caching
Msg-id CAM3SWZTa-8vLxLNfUMXOPH7U8ahqYr9KyPdtAmhTp3Aw7O1KQw@mail.gmail.com
List pgsql-hackers
Since apparently we're back to development work, I thought it was time
to share a patch implementing a few additional simple tricks to make
sorting text under a non-C locale even faster than in 9.5. These
techniques are mostly effective when values are physically clustered
together. That clustering might come from a physical/logical
correlation, but any kind of clustering of values helps significantly.

Caching
======

The basic idea is that we cache strxfrm() blobs. Separately, we
exploit temporal locality and clustering of values by caching the
result of the most recent strcoll()-resolved comparison performed. The
strxfrm() technique helps a lot with low cardinality single-attribute
sorts, where we can avoid most strxfrm() work. strcoll() comparison
caching, on the other hand, particularly helps with multi-attribute
sorts where there is a low to moderate cardinality secondary attribute
and a low cardinality leading attribute. The master branch will still
opportunistically take the equality memcmp() fastpath plenty of times
for that second attribute, but there are no abbreviated keys to help
when that doesn't work out (because it isn't the leading attribute).
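
To make the shape of the idea concrete, here is a rough standalone
sketch of the two caches (the names and fixed-size buffers are made up
for this e-mail; the patch itself manages memory properly and works at
the datum level):

#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative sketch only -- not the patch's actual code or naming */
typedef struct
{
    /* strcoll() cache: the most recently resolved comparison */
    char        coll_a[256];
    char        coll_b[256];
    int         coll_result;
    bool        coll_valid;

    /* strxfrm() cache: the most recently transformed value */
    char        xfrm_input[256];
    char        xfrm_blob[1024];
    bool        xfrm_valid;
} SortCache;

static int
cached_strcoll(SortCache *cache, const char *a, const char *b)
{
    /* Clustered input tends to repeat the same comparison many times */
    if (cache->coll_valid &&
        strcmp(cache->coll_a, a) == 0 && strcmp(cache->coll_b, b) == 0)
        return cache->coll_result;

    cache->coll_result = strcoll(a, b);
    cache->coll_valid = false;
    if (strlen(a) < sizeof(cache->coll_a) && strlen(b) < sizeof(cache->coll_b))
    {
        strcpy(cache->coll_a, a);
        strcpy(cache->coll_b, b);
        cache->coll_valid = true;
    }
    return cache->coll_result;
}

static const char *
cached_strxfrm(SortCache *cache, const char *s)
{
    /* Low cardinality input transforms each distinct value once per run */
    if (cache->xfrm_valid && strcmp(cache->xfrm_input, s) == 0)
        return cache->xfrm_blob;

    cache->xfrm_valid = false;
    if (strlen(s) < sizeof(cache->xfrm_input) &&
        strxfrm(cache->xfrm_blob, s, sizeof(cache->xfrm_blob)) <
        sizeof(cache->xfrm_blob))
    {
        strcpy(cache->xfrm_input, s);
        cache->xfrm_valid = true;
        return cache->xfrm_blob;
    }
    return NULL;    /* too large to cache; caller calls strxfrm() itself */
}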

Regressions
==========

The patch still helps with strcoll() avoidance when the ordering of a
moderate cardinality attribute is totally random, but it helps much
less there. I have not seen a regression for any case. I'm expecting
someone to ask me to do something with the program I wrote last year,
to prove the opportunistic memcmp() equality fastpath for text is
"free" [1]. This patch has exactly the same tension as last year's
memcmp() equality one [2]: I add something opportunistic that, in some
cases, might consistently fail to pay off, and that on the face of it
implies extra costs for those cases -- costs which must be paid every
single time. So as with the opportunistic memcmp() equality thing, the
*actual* overhead for cases that do not benefit must be virtually zero
for the patch to be worthwhile. That is the standard I expect this
patch to be held to, too.
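
For context, the fastpath in question boils down to something like
this (a simplified sketch in the spirit of [2], not the actual
varstr_cmp() code, which deals with non-NUL-terminated text datums):

#include <locale.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified sketch of the opportunistic memcmp() equality fastpath:
 * bytewise-identical strings must compare equal under any collation, so
 * check that cheaply before paying for strcoll().
 */
static int
collated_compare(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen == blen && memcmp(a, b, alen) == 0)
        return 0;           /* "free" whenever it works out */

    /* assume NUL-terminated inputs here, for the sake of the sketch */
    return strcoll(a, b);
}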

Benchmark
=========

The query that I've been trying this out with is a typical rollup
query, using my "cities" sample data [3] (which is somewhat, although
not perfectly, correlated on (country, province) before sorting):

postgres=# select country, province, count(*) from cities group by
rollup (country, province);
   country   |      province       | count
-------------+---------------------+--------
 Afghanistan | Badaẖšan            |      5
 Afghanistan | Bādgīs              |      2
 Afghanistan | Baġlān              |      5
 Afghanistan | Balẖ                |      6
 Afghanistan | Bāmiyān             |      3
 Afghanistan | Farāh               |      3
 Afghanistan | Fāryāb              |      4
 Afghanistan | Ġawr                |      3

*** SNIP ***
...

 Zimbabwe    | Manicaland          |     22
 Zimbabwe    | Mashonaland Central |     13
 Zimbabwe    | Mashonaland East    |      9
 Zimbabwe    | Mashonaland West    |     21
 Zimbabwe    | Masvingo            |     11
 Zimbabwe    | Matabeleland North  |      8
 Zimbabwe    | Matabeleland South  |     14
 Zimbabwe    | Midlands            |     14
 Zimbabwe    | [null]              |    116
 [null]      | [null]              | 317102
(3529 rows)

With master, this takes about 525ms when it stabilizes after a few
runs on my laptop. With the patch, it takes about 405ms. That's almost
a 25% reduction in total run time. If I perform a more direct test of
sort performance against this data with minimal non-sorting overhead,
I see a reduction of as much as 30% in total query runtime (I chose
this rollup query because it is obviously representative of the real
world).

If this data is *perfectly* correlated (e.g. because someone ran
CLUSTER) and the sort can use the dubious "bubble sort best case"
path [4] that we added to qsort back in 2006, the improvement still
holds up at ~20%, I've found.
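
For anyone who doesn't remember [4], that path amounts to a check
along these lines (a simplified sketch, not the real port/qsort.c
code):

#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified sketch of the presorted-input check: if no element compares
 * greater than its successor, the input is already sorted, and the sort
 * can return after just n - 1 comparisons.
 */
static bool
presorted(const void *a, size_t n, size_t es,
          int (*cmp) (const void *, const void *))
{
    const char *p;

    for (p = (const char *) a + es; p < (const char *) a + n * es; p += es)
    {
        if (cmp(p - es, p) > 0)
            return false;   /* out of order somewhere; do the real sort */
    }
    return true;
}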

Performance of the "C" locale
---------------------------------------

For this particular rollup query, my 25% improvement leaves the
collated text sort perhaps marginally faster than an equivalent query
that uses the "C" locale (with or without the patch applied). It's
hard to be sure that that effect is real -- many trials are needed --
but it's reasonable to speculate that it's possible to sometimes beat
the "C" locale because of factors like final abbreviated key
cardinality.

It's easy to *contrive* a case where the "C" locale is beaten even
with 9.5 -- just sort a bunch of strings (that are abbreviated) that
look something like this:

"``..,,``..AAA"
"``..,,``..CCC"
"``..,,``..ZZZ"
"``..,,``..BBB"

Anyway, the avoidance of strxfrm() work that I've introduced makes it
possible for abbreviated keys to let a strxfrm()-based collated sort
beat the "C" locale fair and square, given a realistic dataset and a
specific, realistic query. That would be pretty nice, because that
can't be too far from optimal, and these cases are not uncommon.

A further idea -- unsigned integer comparisons
===================================

I've also changed text abbreviated keys to compare as unsigned
integers. On my Thinkpad laptop (which, of course, has an Intel CPU),
this makes a noticeable difference. memcmp() may be fast, but an
unsigned integer comparison is even faster (I'm not sure whether a
big-endian machine can have the existing memcmp() call optimized away,
so that effectively the same thing happens automatically).

Maybe other platforms benefit less, but it's very hard to imagine it
ever costing us anything.
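
In case the mechanics aren't obvious, the gist is packing the leading
bytes of the blob most-significant-byte first, so that an ordinary
unsigned comparison agrees with memcmp(). The sketch below uses
invented names; the real abbreviation code packs into a Datum:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Pack the first 8 bytes of a strxfrm() blob into a uint64 with the most
 * significant byte first, so that comparing two packed keys as unsigned
 * integers gives the same answer as memcmp() on the underlying bytes.
 * On a big-endian machine a straight memcpy() into the integer already
 * produces this layout, which is what the parenthetical above alludes to.
 */
static uint64_t
pack_abbrev_key(const char *blob, size_t len)
{
    unsigned char buf[8] = {0};     /* zero-pad blobs shorter than 8 bytes */
    uint64_t    key = 0;
    int         i;

    memcpy(buf, blob, len < sizeof(buf) ? len : sizeof(buf));
    for (i = 0; i < 8; i++)
        key = (key << 8) | buf[i];
    return key;
}

static int
compare_abbrev_keys(uint64_t a, uint64_t b)
{
    if (a > b)
        return 1;
    if (a < b)
        return -1;
    return 0;
}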

[1] http://www.postgresql.org/message-id/5415A843.3070602@vmware.com
[2] Commit e246b3d6
[3] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump
[4] Commit a3f0b3d6
--
Peter Geoghegan
