On Sun, Mar 18, 2012 at 11:08 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> However, it occurred to me that we could pretty easily jury-rig
> something that would give us an idea about the actual benefit available
> here. To wit: make a C function that wraps strxfrm, basically
> strxfrm(text) returns bytea. Then compare the performance of
> ORDER BY text_col to ORDER BY strxfrm(text_col).
>
> (You would need to have either both or neither of text and bytea
> using the sortsupport code paths for this to be a fair comparison.)
Since the index will be ~9x bigger at least on this machine, I think I
know the answer, but I suppose it doesn't hurt to test it. It's not
that much work.
> One other thing I've always wondered about in this connection is the
> general performance of sorting toasted datums. Is it better to detoast
> them in every comparison, or pre-detoast to save comparison cycles at
> the cost of having to push much more data around? I didn't see any
> discussion of this point in Robert's benchmarks, but I don't think we
> should go very far towards enabling sortsupport for text until we
> understand the issue and know whether we need to add more infrastructure
> for it. If you cross your eyes a little bit, this is very much like
> the strxfrm question...
It would be surprising to me if there is a one-size-fits-all answer to
this question. For example, if the tuple got toasted because it's got
lots of columns and we had to take desperate measures to get it to fit
into an 8kB block at all, chances are that detoasting will work out
well. We'll use a bit more memory, but hopefully that'll be repaid by
much faster comparisons. OTOH, if you have a data set with many
relatively short strings and a few very long ones, detoasting up-front
could turn a quicksort into a heapsort. Since only a small fraction
of the comparisons would have involved one of the problematic long
strings anyway, it's unlikely to be worth the expense of keeping those
strings around in detoasted form for the entire sort (unless maybe
reconstructing them even a few times is problematic because we're
under heavy cache pressure and we get lots of disk seeks as a result).
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company