Re: More work on SortSupport for text - strcoll() and strxfrm() caching - Mailing list pgsql-hackers

From Robert Haas
Subject Re: More work on SortSupport for text - strcoll() and strxfrm() caching
Date
Msg-id CA+Tgmoa=PmK=myKYvF_UgWq3hiWgm94jRKAxTGVy1chjBqBi1w@mail.gmail.com
Whole thread Raw
In response to More work on SortSupport for text - strcoll() and strxfrm() caching  (Peter Geoghegan <pg@heroku.com>)
Responses Re: More work on SortSupport for text - strcoll() and strxfrm() caching
Re: More work on SortSupport for text - strcoll() and strxfrm() caching
List pgsql-hackers
On Fri, Jul 3, 2015 at 8:33 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Since apparently we're back to development work, I thought it was time
> to share a patch implementing a few additional simple tricks to make
> sorting text under a non-C locale even faster than in 9.5. These
> techniques are mostly effective when values are physically clustered
> together. This might be because there is a physical/logical
> correlation, but cases involving any kind of clustering of values are
> helped significantly.

Interesting work.

Some comments:

1. My biggest gripe with this patch is that the comments are not easy
to understand.  For example:

+     * Attempt to re-use buffers across calls.  Also, avoid work in the event
+     * of clustered together identical items by exploiting temporal locality.
+     * This works well with divide-and-conquer, comparison-based sorts like
+     * quicksort and mergesort.
+     *
+     * With quicksort, there is, in general, a pretty strong chance that the
+     * same buffer contents can be used repeatedly for pivot items -- early
+     * pivot items will account for large number of total comparisons, since
+     * they must be compared against many (possibly all other) items.

Well, what I would have written is something like: "We're likely to be
asked to compare the same strings repeatedly, and memcmp() is so much
cheaper than memcpy() that it pays to attempt a memcmp() in the hopes
of avoiding a memcpy().  This doesn't seem to slow things down
measurably even if it doesn't work out very often."

+     * While it is worth going to the trouble of trying to re-use buffer
+     * contents across calls, ideally that will lead to entirely avoiding a
+     * strcoll() call by using a cached return value.
+     *
+     * This optimization can work well again and again for the same set of
+     * clustered together identical attributes;  as they're relocated to new
+     * subpartitions, only one strcoll() is required for each pivot (in respect
+     * of that clump of identical values, at least).  Similarly, the final
+     * N-way merge of a mergesort can be effectively accelerated if each run
+     * has its own locally clustered values.

And here I would have written something like: "If we're comparing the
same two strings that we compared last time, we can return the same
answer without calling strcoll() again.  This is more likely than it
seems, because quicksort compares the same pivot against many values,
and some of those values might be duplicates."

Of course everybody may prefer something different here; I'm just
telling you what I think.

2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately.  That part  seems like
a slam dunk.

3. What is the worst case for the strxfrm()-reuse stuff?  I suppose
it's the case where we have many strings that are long, all
equal-length, and all different, but only in the last few characters.
Then the memcmp() is as expensive as possible but never works out.
How does the patch hold up in that case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Jeff Janes
Date:
Subject: Re: FSM versus GIN pending list bloat
Next
From: Tom Lane
Date:
Subject: Re: Raising our compiler requirements for 9.6