Re: Sort time - Mailing list pgsql-performance
From | Hannu Krosing |
---|---|
Subject | Re: Sort time |
Date | |
Msg-id | 1037567411.2422.33.camel@rh72.home.ee Whole thread Raw |
In response to | Re: Sort time (Stephan Szabo <sszabo@megazone23.bigpanda.com>) |
Responses |
Re: Sort time
|
List | pgsql-performance |
Stephan Szabo kirjutas P, 17.11.2002 kell 22:29: > On Sun, 17 Nov 2002, pginfo wrote: > > > > On my not terribly powerful or memory filled box, I got a time of about > > > 16s after going through a couple iterations of raising sort_mem and > > > watching if it made temp files (which is probably a good idea to check as > > > well). The data size ended up being in the vicinity of 100 meg in my > > > case. > > > > The time is very good! > > It is very good idea to watch the temp files. > > I started the sort_mem to 32 mb (it is 256 on the production system) > > and I see 3 temp files. The first is ~ 1.8 mb. The second is ~55 mb and the last is ~150 > > mb. > > As a note, the same data loaded into a non-"C" locale database took about > 42 seconds on the same machine, approximately 2.5x as long. I have investigated IBM's ICU (International Code for Unicode or smth like that) in order to use it for implementing native UNICODE text types. The sorting portion seems to work in two stages - 1. convert UTF_16 to "sorting string" and 2. compare said "sorting strings" - with the stages being also available separately. if the same is true for "native" locale support, then there is a good explanation why the text sort is orders of magnitude slower than int sort: as the full conversion to "sorting string" has to be done at each comparison (plus probably malloc/free) for locale-aware compare, but on most cases in C locale one does not need these, plus the comparison can usually stop at first or second char. Getting good performance on locale-aware text sorts seems to require storing these "sorting strings", either additionally or only these and find a way for reverse conversion ("sorting string" --> original) Some speed could be gained by doing the original --> "sorting string" conversion only once for each line, but that will probably require a major rewrite of sorting code - in essence select loctxt,a,b,c,d,e,f,g from mytab sort by localestring; should become select loctxt,a,b,c,d,e,f,g from ( select localestring,a,b,c,d,e,f,g from mytab sort by sorting_string(loctxt) ) t; or even select loctxt,a,b,c,d,e,f,g from ( select localestring,a,b,c,d,e,f,g, ss from ( select localestring,a,b,c,d,e,f,g, sorting_string(loctxt) as ss from from mytab ) sort by ss ) t; depending on how the second form is implemented (i.e. if sorting_string(loctxt) is evaluated once per row or one per compare) ------------- Hannu
pgsql-performance by date: