Re: sortsupport for text - Mailing list pgsql-hackers

From Robert Haas
Subject Re: sortsupport for text
Date
Msg-id CA+TgmoYJjjngPON1b4W8PZ+C6F0HEZFTe4mNeeiknw6BTuHuQA@mail.gmail.com
Whole thread Raw
In response to Re: sortsupport for text  (Peter Geoghegan <peter@2ndquadrant.com>)
Responses Re: sortsupport for text
List pgsql-hackers
On Thu, Jun 14, 2012 at 2:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Why have you made the reusable buffer managed by sortsupport
> TEXTBUFLEN-aligned? The existing rationale for that constant (whose
> value is 1024) does not seem to carry forward here:
>
>  * This should be large enough that most strings will be fit, but small
>  * enough that we feel comfortable putting it on the stack.

Well, as the comments explain:

+       /*
+        * We avoid repeated palloc/pfree on long strings by keeping the buffers
+        * we allocate around for the duration of the sort.  When we
expand them,
+        * we round off the to the next multiple of TEXTBUFLEN in order to avoid
+        * repeatedly expanding them by very small amounts.
+        */

> ISTM it would be on average worth the hit of having to repalloc a few
> more times for larger strings by making that buffer much smaller
> initially, and doubling its size each time that proved insufficient,
> rather than increasing its size to the smallest possible
> TEXTBUFLEN-aligned size that you can get away with for the immediately
> subsequent memcpy. Realistically, any database I've ever worked with
> would probably be able to fit a large majority of its text strings
> into 16 chars of memory - you yourself said that sorting toasted text
> isn't at all common.

I thought that doubling repeatedly would be overly aggressive in terms
of memory usage.  Blowing the buffers out to 8kB because we hit a
string that's a bit over 4kB isn't so bad, but blowing them out to
256MB because we hit a string that's a bit over 128MB seems a bit
excessive.

Also, I don't think it really saves anything from a performance point
of view.  The worst case for this is if we're asked to repeatedly
compare strings that expand the buffer by a kilobyte each time.
First, how likely is that to happen in a real world test case?  In
many cases, most of the input strings will be of approximately equal
length; also in many cases, that length will be short; even if the
lengths take the worst possible distribution, we have to hit them in
the worst possible order for this to be a problem.  Second, even if it
does happen, does it really matter?  Suppose we expand the buffer a
kilobyte at a time from an initial size of 1kB all the way out to
256MB.  That's 256,000 palloc calls, so we must be sorting at least
256,000 datums, at least 128,000 of which are longer than 128MB.  I
think the cost of calling memcpy() and strcoll() repeatedly on all
those long datums - not to mention repeatedly detoasting them - is
going to bludgeon the palloc overhead into complete insignificance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: sortsupport for text
Next
From: Robert Haas
Date:
Subject: Re: Patch pg_is_in_backup()