Re: sortsupport for text - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: sortsupport for text
Msg-id: CA+TgmoYJjjngPON1b4W8PZ+C6F0HEZFTe4mNeeiknw6BTuHuQA@mail.gmail.com
In response to: Re: sortsupport for text (Peter Geoghegan <peter@2ndquadrant.com>)
Responses: Re: sortsupport for text
List: pgsql-hackers
On Thu, Jun 14, 2012 at 2:10 PM, Peter Geoghegan <peter@2ndquadrant.com> wrote:
> Why have you made the reusable buffer managed by sortsupport
> TEXTBUFLEN-aligned? The existing rationale for that constant (whose
> value is 1024) does not seem to carry forward here:
>
> * This should be large enough that most strings will be fit, but small
> * enough that we feel comfortable putting it on the stack.

Well, as the comments explain:

+	/*
+	 * We avoid repeated palloc/pfree on long strings by keeping the buffers
+	 * we allocate around for the duration of the sort.  When we expand them,
+	 * we round off to the next multiple of TEXTBUFLEN in order to avoid
+	 * repeatedly expanding them by very small amounts.
+	 */

> ISTM it would be on average worth the hit of having to repalloc a few
> more times for larger strings by making that buffer much smaller
> initially, and doubling its size each time that proved insufficient,
> rather than increasing its size to the smallest possible
> TEXTBUFLEN-aligned size that you can get away with for the immediately
> subsequent memcpy. Realistically, any database I've ever worked with
> would probably be able to fit a large majority of its text strings
> into 16 chars of memory - you yourself said that sorting toasted text
> isn't at all common.

I thought that doubling repeatedly would be overly aggressive in terms
of memory usage. Blowing the buffers out to 8kB because we hit a string
that's a bit over 4kB isn't so bad, but blowing them out to 256MB
because we hit a string that's a bit over 128MB seems a bit excessive.
Also, I don't think it really saves anything from a performance point
of view. The worst case for this is if we're asked to repeatedly
compare strings that expand the buffer by a kilobyte each time. First,
how likely is that to happen in a real-world test case? In many cases,
most of the input strings will be of approximately equal length; also,
in many cases, that length will be short; and even if the lengths take
the worst possible distribution, we have to hit them in the worst
possible order for this to be a problem. Second, even if it does
happen, does it really matter? Suppose we expand the buffer a kilobyte
at a time from an initial size of 1kB all the way out to 256MB. That's
256,000 palloc calls, so we must be sorting at least 256,000 datums, at
least 128,000 of which are longer than 128MB. I think the cost of
calling memcpy() and strcoll() repeatedly on all those long datums -
not to mention repeatedly detoasting them - is going to bludgeon the
palloc overhead into complete insignificance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company