Re: Sort time - Mailing list pgsql-performance

From pginfo
Subject Re: Sort time
Date
Msg-id 3DD88445.F617D9B@t1.unisoftbg.com
Whole thread Raw
In response to Re: Sort time  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
Responses Re: Sort time
List pgsql-performance
Hi Tom,

The idea is very good.
I recreated the tables and for all IDS_xxx I used name (not varchar(20)).
The the is also in unicode.
I ran the query and got huge improvement!
The work time is 166 sek. ( before it was ~320 - 340 sek.).

I will continue to make new tests and play around the setups.
I think all this can be more quicker. I expect to get ~ 45-60 sek. ( this is the time in oracle), but laso 166 sek is
good.

I think that we need to work around the non us sorting and compare.
It is not possible to be so slow (all the functions are executed in memory
and in java and by oracle and by ms all this is working very fast).

regards,
ivan.

Tom Lane wrote:

> Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> > As a note, the same data loaded into a non-"C" locale database took about
> > 42 seconds on the same machine, approximately 2.5x as long.
>
> The non-C locale is undoubtedly the problem.  I made a test dataset of
> 700000 all-alphabetic 20-character random strings:
>
> $ head rand.data
> duofoesrlycdnilvlcrg
> crealjdrjpyczfbnlouo
> lxaiyicslwjnxgpehtzp
> ykizuovkvpkvvqsaocys
> rkkvrqfiiybczwqdvvfu
> stonxhbbvgwtjszodguv
> prqxhwcfibiopjpiddud
> ubgexbfdodhnauytebcf
> urfoqifgbrladpssrwzw
> ydcrsnxjpxospfqqoilw
>
> I performed the following experiment in 7.3 using a database in
> en_US locale, SQL_ASCII encoding:
>
> enus=# create table vc20 (f1 varchar(20));
> CREATE TABLE
> enus=# \copy vc20 from rand.data
> \.
> enus=# vacuum analyze vc20;
> VACUUM
> enus=# set sort_mem to 50000;
> SET
> enus=# explain analyze select count(*) from
> enus-# (select * from vc20 order by f1) ss;
>                                                            QUERY PLAN
>
>
---------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=83607.48..83607.48 rows=1 width=24) (actual time=1058167.66..1058167.67 rows=1 loops=1)
>    ->  Subquery Scan ss  (cost=80107.48..81857.48 rows=700000 width=24) (actual time=1022972.86..1049559.50
rows=700000loops=1) 
>          ->  Sort  (cost=80107.48..81857.48 rows=700000 width=24) (actual time=1022972.80..1034036.58 rows=700000
loops=1)
>                Sort Key: f1
>                ->  Seq Scan on vc20  (cost=0.00..12148.00 rows=700000 width=24) (actual time=0.20..24651.65
rows=700000loops=1) 
>  Total runtime: 1058220.10 msec
> (6 rows)
>
> (The point of the select count(*) was to avoid shipping the result rows
> to the client, but in hindsight "explain analyze" would suppress that
> anyway.  But the main datapoint here is the time for the Sort step.)
>
> I tried the test using datatype NAME as well, since it sorts using
> plain strcmp() instead of strcoll():
>
> enus=# create table nm (f1 name);
> CREATE TABLE
> enus=# insert into nm select f1 from vc20;
> INSERT 0 700000
> enus=# vacuum analyze nm;
> VACUUM
> enus=# set sort_mem to 50000;
> SET
> enus=# explain analyze select count(*) from
> enus-# (select * from nm order by f1) ss;
>                                                            QUERY PLAN
>
>
---------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=122701.48..122701.48 rows=1 width=64) (actual time=157877.84..157877.85 rows=1 loops=1)
>    ->  Subquery Scan ss  (cost=119201.48..120951.48 rows=700000 width=64) (actual time=121286.65..149376.93
rows=700000loops=1) 
>          ->  Sort  (cost=119201.48..120951.48 rows=700000 width=64) (actual time=121286.60..134075.61 rows=700000
loops=1)
>                Sort Key: f1
>                ->  Seq Scan on nm  (cost=0.00..15642.00 rows=700000 width=64) (actual time=0.21..24150.57 rows=700000
loops=1)
>  Total runtime: 157962.79 msec
> (6 rows)
>
> In C locale, the identical test sequence gives
>
>                                                            QUERY PLAN
>
>
---------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=83607.48..83607.48 rows=1 width=24) (actual time=187480.70..187480.71 rows=1 loops=1)
>    ->  Subquery Scan ss  (cost=80107.48..81857.48 rows=700000 width=24) (actual time=141100.03..178625.97 rows=700000
loops=1)
>          ->  Sort  (cost=80107.48..81857.48 rows=700000 width=24) (actual time=141099.98..162288.95 rows=700000
loops=1)
>                Sort Key: f1
>                ->  Seq Scan on vc20  (cost=0.00..12148.00 rows=700000 width=24) (actual time=0.20..23954.71
rows=700000loops=1) 
>  Total runtime: 187565.79 msec
> (6 rows)
>
> and of course about the same runtime as before for datatype NAME.  So on
> this platform (HPUX 10.20), en_US locale incurs about a 6x penalty over
> C locale for sorting varchars.
>
> Note that NAME beats VARCHAR by a noticeable margin even in C locale,
> despite the handicap of requiring much more I/O (being 64 bytes per row
> not 24).  This surprises me; it looks like varstr_cmp() is reasonably
> well optimized in the C-locale case.  But the real loser is VARCHAR in
> non-C locales.  I suspect the primary time sink is strcoll() not the
> palloc/copy overhead in varstr_cmp(), but don't have time right now to
> do profiling to prove it.
>
> Anyway, use of NAME instead of VARCHAR might be a workable workaround
> if you cannot change your database locale to C.
>
>                         regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org




pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Sort time
Next
From: pginfo
Date:
Subject: Re: Sort time