Re: Sort time - Mailing list pgsql-performance

From Hannu Krosing
Subject Re: Sort time
Date
Msg-id 1037567411.2422.33.camel@rh72.home.ee
In response to Re: Sort time  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
Responses Re: Sort time
List pgsql-performance
Stephan Szabo wrote on Sun, 17.11.2002 at 22:29:
> On Sun, 17 Nov 2002, pginfo wrote:
>
> > > On my not terribly powerful or memory filled box, I got a time of about
> > > 16s after going through a couple iterations of raising sort_mem and
> > > watching if it made temp files (which is probably a good idea to check as
> > > well).  The data size ended up being in the vicinity of 100 meg in my
> > > case.
> >
> > The time is very good!
> > It is very good idea to watch the temp files.
> > I started the sort_mem to 32 mb (it is 256 on the production system)
> > and I see 3 temp files. The first is ~ 1.8 mb. The second is ~55 mb and the last is ~150
> > mb.
>
> As a note, the same data loaded into a non-"C" locale database took about
> 42 seconds on the same machine, approximately 2.5x as long.

I have investigated IBM's ICU (International Components for Unicode) in
order to use it for implementing native Unicode text types.

The sorting portion seems to work in two stages: 1. convert the UTF-16
string to a "sorting string" and 2. compare said "sorting strings". The
two stages are also available separately.
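
The C library has a comparable split, which Python wraps as
locale.strxfrm (build the sorting string) and locale.strcoll (compare the
original strings directly). The following is only a sketch of that
standard-library analog, not of ICU itself; the word list is made up, and
the "C" locale is an assumption chosen so the snippet runs on any system.

```python
import functools
import locale

# Assumption: "C" is used only because it exists everywhere; substitute an
# installed locale name to see real locale-aware ordering.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["pear", "Apple", "banana", "apple"]

# Stage 1: convert each string to its "sorting string" exactly once.
sort_keys = {w: locale.strxfrm(w) for w in words}

# Stage 2: order the rows by plain comparison of the precomputed strings.
by_key = sorted(words, key=sort_keys.__getitem__)

# One-step alternative: locale-aware compare on the original strings.
by_collate = sorted(words, key=functools.cmp_to_key(locale.strcoll))
```

Both approaches should give the same order; only the amount of
conversion work per comparison differs.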

If the same is true for "native" locale support, then there is a good
explanation for why the text sort is orders of magnitude slower than the
int sort: a locale-aware compare has to do the full conversion to a
"sorting string" at each comparison (plus probably a malloc/free). In
the C locale one does not need these in most cases, and the comparison
can usually stop at the first or second char.
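
To make the per-comparison cost concrete, here is a small sketch that
counts conversions when the compare function converts both operands every
time it is called. The sorting_string function is a hypothetical stand-in
(it just case-folds) and the rows are invented; nothing here reflects how
PostgreSQL actually implements its comparisons.

```python
import functools

conversions = 0

def sorting_string(text):
    """Hypothetical stand-in for a locale conversion; counts its own calls."""
    global conversions
    conversions += 1
    return text.casefold()

def locale_compare(a, b):
    # Converts both operands on every call, as a naive locale-aware
    # compare would have to.
    ka, kb = sorting_string(a), sorting_string(b)
    return (ka > kb) - (ka < kb)

rows = ["delta", "Alpha", "charlie", "Bravo", "echo", "Foxtrot", "golf", "Hotel"]
ordered = sorted(rows, key=functools.cmp_to_key(locale_compare))

# Two conversions per comparison, so this ratio is well above 1.
conversions_per_row = conversions / len(rows)
```

Since a comparison sort needs on the order of n log n comparisons, the
number of conversions grows faster than the number of rows.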

Getting good performance on locale-aware text sorts seems to require
storing these "sorting strings", either in addition to the original text
or instead of it. The latter would require finding a way to do the
reverse conversion ("sorting string" --> original).

Some speed could be gained by doing the original --> "sorting string"
conversion only once for each row, but that will probably require a
major rewrite of the sorting code - in essence

select loctxt,a,b,c,d,e,f,g from mytab order by loctxt;

should become

select loctxt,a,b,c,d,e,f,g from (
  select loctxt,a,b,c,d,e,f,g
    from mytab
   order by sorting_string(loctxt)
) t;

or even

select loctxt,a,b,c,d,e,f,g from (
  select loctxt,a,b,c,d,e,f,g, ss from (
    select loctxt,a,b,c,d,e,f,g, sorting_string(loctxt) as ss
      from mytab
  ) s
  order by ss
) t;

depending on how the second form is implemented (i.e. whether
sorting_string(loctxt) is evaluated once per row or once per compare).
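
The once-per-row variant can be sketched outside the database as a
decorate / sort / undecorate pass that mirrors the nested select. Again
sorting_string is a hypothetical function (a case-fold that counts its
calls), and the contents of mytab are made up for illustration.

```python
calls = 0

def sorting_string(text):
    """Hypothetical conversion function; counts how often it runs."""
    global calls
    calls += 1
    return text.casefold()

# Rows of (loctxt, a); the values are invented for this sketch.
mytab = [("delta", 4), ("Alpha", 1), ("charlie", 3), ("Bravo", 2)]

# Inner select: compute ss once per row and carry it alongside the row.
decorated = [(sorting_string(loctxt), loctxt, a) for loctxt, a in mytab]

# order by ss: comparisons touch only the precomputed value.
decorated.sort(key=lambda row: row[0])

# Outer select: drop ss and return the original columns.
result = [(loctxt, a) for _ss, loctxt, a in decorated]
```

Here the conversion runs exactly once per row regardless of how many
comparisons the sort performs, at the price of carrying the extra ss
column through the sort.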

-------------
Hannu


