Re: sortsupport for text - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: sortsupport for text
Msg-id: CA+TgmoYkRLDVORC5OP8xdtsPY4rE380cLw+EHfuxmJdcV4TrcQ@mail.gmail.com
In response to: Re: sortsupport for text (Peter Geoghegan <peter@2ndquadrant.com>)
List: pgsql-hackers
On Thu, Jun 14, 2012 at 1:56 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> On 14 June 2012 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
>> The problem with pre-detoasting to save comparison cycles is that you
>> can now fit many, many fewer tuples in work_mem.  There might be cases
>> where it wins (for example, because the entire data set fits even
>> after decompressing everything) but in most cases it seems like a
>> loser.
>
> If I had to guess, I'd say you're probably right about that -
> optimising sorting toasted text doesn't seem like a terribly sensible
> use of your time.
>
> What about the strxfrm suggestion of Greg's? You might find that the
> added benefit of being able to avail of a highly optimised strcmp()
> tipped the balance in favour of that idea, beyond the simple fact that
> there's only a linear number of what you might loosely call "strcoll_l
> units of work" rather than as many as O(n ^ 2). Furthermore, I'd
> speculate that if you were to interlace the strxfrm() calls with
> copying each text string, somewhere like within a specialised
> datumCopy(), that would make the approach more efficient still, as you
> specify a location for the blob in the just-palloc()'d leading-key
> private memory directly, rather than just using memcpy.

Well, it's still got the problem of blowing up memory usage.  I just
can't get excited about optimizing for the case where we can consume
10x the memory and still fit in work_mem.  If we've got that case, the
sort is gonna be pretty fast anyway.  The case where preprocessing
wins is when there are going to be a large number of comparisons
against each tuple - i.e. lg(N) is large.  But the cases where we
could pre-transform with strxfrm are those where the data fits in a
small percentage of work_mem - i.e. lg(N) is small.  I'm open to
somebody showing up with a test result that demonstrates that it's
worthwhile, but to me it seems like it's chasing diminishing returns.
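
For concreteness, here is a rough sketch of the strxfrm() pre-transform
being discussed (illustrative only, not code from the patch; the helper
names are made up).  Each key goes through the locale machinery exactly
once, and after that the blobs compare correctly with a plain strcmp().
The catch is that the transformed blob is typically several times larger
than the source string, which is where the memory blow-up comes from.

#include <stdlib.h>
#include <string.h>

/*
 * Transform a key once up front.  strxfrm() with a zero-length buffer
 * reports how much space the transformed form needs; the result is
 * usually several times longer than the input.
 */
static char *
transform_key(const char *src)
{
    size_t      needed = strxfrm(NULL, src, 0) + 1;
    char       *blob = malloc(needed);

    if (blob != NULL)
        strxfrm(blob, src, needed);
    return blob;
}

/* Transformed blobs compare with a plain byte-wise comparison. */
static int
compare_transformed(const char *a, const char *b)
{
    return strcmp(a, b);
}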

The point of this patch isn't really to improve things for the
collation-aware case, although it's nice that it does.  The point is
rather to shave a double-digit percentage off the time it takes to
do the sort required to build a C-collation index, which is what
people should be using when they don't care about < and >, which most
don't.  Despite Tom's concerns, I don't think there's anything in this
patch that can't be fairly easily revised at a later date if we decide
we want a different API.  I think it's worth picking the low-hanging
fruit in the meantime.
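
For context on the C-collation case: comparison there reduces to a
byte-wise memcmp() with string length as the tiebreaker, with no locale
machinery involved, which is why a leaner sortsupport path pays off.  A
rough sketch of that comparison rule (illustrative only, not the patch
code):

#include <string.h>

/* C-collation ordering: byte-wise comparison, shorter string first on ties. */
static int
c_collation_cmp(const char *a, size_t alen, const char *b, size_t blen)
{
    int         cmp = memcmp(a, b, alen < blen ? alen : blen);

    if (cmp != 0)
        return cmp;
    if (alen != blen)
        return (alen < blen) ? -1 : 1;
    return 0;
}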

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

