Re: Re: Abbreviated keys for Datum tuplesort - Mailing list pgsql-hackers

From Andrew Gierth
Subject Re: Re: Abbreviated keys for Datum tuplesort
Date
Msg-id 87ppa3gx3p.fsf@news-spur.riddles.org.uk
Whole thread Raw
In response to Re: Abbreviated keys for Datum tuplesort (was: Re: B-Tree support function number 3 (strxfrm() optimization))  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Re: Abbreviated keys for Datum tuplesort
Re: Re: Abbreviated keys for Datum tuplesort
List pgsql-hackers
>>>>> "Robert" == Robert Haas <robertmhaas@gmail.com> writes:
>> Here's the cleaned-up version of the patch to allow abbreviated keys>> when sorting a single Datum. This also
removescomments that suggest>> that the caller of tuplesort_begin_datum should ever have to care>> about the
abbreviatedkey optimization.>> >> I'll add this to the CF.
 
Robert> I think this is a good idea.  Do you have a test case thatRobert> shows the benefit?

The best test case for datum sort performance is to use percentile_disc,
since that has almost no overhead beyond performing the actual sort.
(Unlike, say, count(distinct) or mode(), both of which have to do an
additional comparison pass over the sorted data; but count(distinct) is
probably the most common use of the datum sort in the wild, so it's
useful to try that too.)

So given some suitable test data, such as

create table stuff as select random()::text as randtext from generate_series(1,1000000);  -- or however many rows

you can do

select percentile_disc(0) within group (order by randtext) from stuff;

or

select count(distinct randtext) from stuff;

The performance improvements I saw were pretty much exactly as expected
from the improvement in the ORDER BY and CREATE INDEX cases.

The best test case for checking the correct order of results is to use
array_agg(x order by x), for example as follows:

select u, u <= lag(u) over () from (select unnest(a) as u         from (select array_agg(randtext order by randtext)
            from stuff) s1) s2;
 

(note that array_agg(x order by x) uses the datum sort, but
array_agg(x order by y) uses the ordinary heap tuple sort)

-- 
Andrew (irc:RhodiumToad)



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: proposal: row_to_array function
Next
From: Andrew Gierth
Date:
Subject: Re: New CF app deployment