Thread: Increase performance of a UNION query that takes 655.07 msec to run?


From: "Bruno BAGUETTE"
Hello,

Do you see a way to get better performance from this query, which
currently takes 655.07 msec to run?

levure=> explain analyze SELECT distinct lower(substr(l_name, 1, 1)) AS
initiale FROM people
levure-> UNION
levure-> SELECT distinct lower(substr(org_name, 1, 1)) AS initiale FROM
organizations
levure-> ORDER BY initiale;


                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=158.73..158.78 rows=20 width=43) (actual time=650.82..650.89 rows=39 loops=1)
   Sort Key: initiale
   ->  Unique  (cost=157.30..158.30 rows=20 width=43) (actual time=649.55..650.17 rows=39 loops=1)
         ->  Sort  (cost=157.30..157.80 rows=200 width=43) (actual time=649.55..649.67 rows=69 loops=1)
               Sort Key: initiale
               ->  Append  (cost=69.83..149.66 rows=200 width=43) (actual time=198.48..648.51 rows=69 loops=1)
                     ->  Subquery Scan "*SELECT* 1"  (cost=69.83..74.83 rows=100 width=38) (actual time=198.48..230.62 rows=37 loops=1)
                           ->  Unique  (cost=69.83..74.83 rows=100 width=38) (actual time=198.46..230.31 rows=37 loops=1)
                                 ->  Sort  (cost=69.83..72.33 rows=1000 width=38) (actual time=198.45..205.99 rows=4093 loops=1)
                                       Sort Key: lower(substr((l_name)::text, 1, 1))
                                       ->  Seq Scan on people  (cost=0.00..20.00 rows=1000 width=38) (actual time=0.19..52.33 rows=4093 loops=1)
                     ->  Subquery Scan "*SELECT* 2"  (cost=69.83..74.83 rows=100 width=43) (actual time=361.82..417.62 rows=32 loops=1)
                           ->  Unique  (cost=69.83..74.83 rows=100 width=43) (actual time=361.79..417.33 rows=32 loops=1)
                                 ->  Sort  (cost=69.83..72.33 rows=1000 width=43) (actual time=361.79..374.81 rows=7074 loops=1)
                                       Sort Key: lower(substr((org_name)::text, 1, 1))
                                       ->  Seq Scan on organizations  (cost=0.00..20.00 rows=1000 width=43) (actual time=0.23..95.47 rows=7074 loops=1)
 Total runtime: 655.07 msec
(17 rows)


I was thinking that an index on lower(substr(l_name, 1, 1)) and another
index on lower(substr(org_name, 1, 1)) should give better performance.
But when I try to create these two indexes, it seems this is not
allowed:

levure=> CREATE INDEX firstchar_lastname_idx ON
people(lower(substr(l_name, 1, 1)));
ERROR:  parser: parse error at or near "(" at character 59

Do you have another idea for getting better performance?

Thanks in advance :-)


PS : Note that this database is VACUUMed twice per day (and sometimes
more).

-------------------------------------
Bruno BAGUETTE - pgsql-ml@baguette.net


"Bruno BAGUETTE" <pgsql-ml@baguette.net> writes:
> Do you see a way to get better performance from this query, which
> currently takes 655.07 msec to run?

> levure=> explain analyze SELECT distinct lower(substr(l_name, 1, 1)) AS
> initiale FROM people
> levure-> UNION
> levure-> SELECT distinct lower(substr(org_name, 1, 1)) AS initiale FROM
> organizations
> levure-> ORDER BY initiale;

This is inherently a bit inefficient since the UNION implies a DISTINCT
step, thus partially repeating the DISTINCT work done inside each SELECT.
It would likely be a tad faster to drop the DISTINCTs from the
subselects and rely on UNION to do the filtering.  However, you're still
gonna have a big SORT/UNIQUE step.

As of PG 7.4 you could probably get a performance win by converting the
thing to use GROUP BY instead of DISTINCT or UNION:

select initiale from (
  select lower(substr(l_name,1,1)) as initiale from people
  union all
  select lower(substr(org_name,1,1)) as initiale from organizations
) ss
group by initiale order by initiale;

This should use a HashAggregate to do the unique-ification.  I think
that will be faster than Sort/Unique.

            regards, tom lane

Re: Increase performance of a UNION query that takes 655.07 msec to run?

From: Stephan Szabo
On Fri, 6 Feb 2004, Bruno BAGUETTE wrote:

> I was thinking that an index on lower(substr(l_name, 1, 1)) and another
> index on lower(substr(org_name, 1, 1)) should give better performance.
> But when I try to create these two indexes, it seems this is not
> allowed:
>
> levure=> CREATE INDEX firstchar_lastname_idx ON
> people(lower(substr(l_name, 1, 1)));
> ERROR:  parser: parse error at or near "(" at character 59

In 7.4, I believe you would say
 on people((lower(substr(l_name,1,1))))
but I'm not sure that index would really help in practice.
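For reference, a sketch of both syntaxes. The 7.4 form is the doubled-parentheses expression index Stephan describes; the 7.3 workaround wraps the expression in an IMMUTABLE function, since 7.3 can only index a plain function call over columns (first_initial is a hypothetical name, not something from this thread):

```sql
-- PostgreSQL 7.4+: an expression index needs an extra set of parentheses
CREATE INDEX firstchar_lastname_idx
    ON people ((lower(substr(l_name, 1, 1))));

-- PostgreSQL 7.3: wrap the expression in an IMMUTABLE function first,
-- then index the function call (hypothetical helper name):
CREATE FUNCTION first_initial(text) RETURNS text
    AS 'SELECT lower(substr($1, 1, 1))'
    LANGUAGE sql IMMUTABLE;

CREATE INDEX firstchar_lastname_idx ON people (first_initial(l_name));
```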

> Do you have another idea for getting better performance?

In addition to what Tom said, the row estimates look suspiciously default.
You mention vacuuming, but do you ever analyze the tables?

Also, what do you have sort_mem set to?

RE : Increase performance of a UNION query that takes 655.07 msec to run?

From: "Bruno BAGUETTE"
re-Hello,

As suggested by Tom, I've removed the DISTINCTs and tried his query:

levure=> explain analyze select initiale from (
levure(>   select lower(substr(l_name,1,1)) as initiale from people
levure(>   union all
levure(>   select lower(substr(org_name,1,1)) as initiale from
organizations
levure(> ) ss
levure-> group by initiale order by initiale;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=1018.48..1074.32 rows=1117 width=17) (actual time=783.47..867.61 rows=39 loops=1)
   ->  Sort  (cost=1018.48..1046.40 rows=11167 width=17) (actual time=782.18..801.68 rows=11167 loops=1)
         Sort Key: initiale
         ->  Subquery Scan ss  (cost=0.00..267.67 rows=11167 width=17) (actual time=0.23..330.31 rows=11167 loops=1)
               ->  Append  (cost=0.00..267.67 rows=11167 width=17) (actual time=0.22..263.69 rows=11167 loops=1)
                     ->  Subquery Scan "*SELECT* 1"  (cost=0.00..87.93 rows=4093 width=15) (actual time=0.22..79.51 rows=4093 loops=1)
                           ->  Seq Scan on people  (cost=0.00..87.93 rows=4093 width=15) (actual time=0.20..53.82 rows=4093 loops=1)
                     ->  Subquery Scan "*SELECT* 2"  (cost=0.00..179.74 rows=7074 width=17) (actual time=0.24..146.12 rows=7074 loops=1)
                           ->  Seq Scan on organizations  (cost=0.00..179.74 rows=7074 width=17) (actual time=0.23..100.70 rows=7074 loops=1)
 Total runtime: 874.79 msec
(10 rows)


That seems to be 200 msec longer than my first query... Indeed, I've
noticed something strange: if I rerun my initial query now, I get a
worse runtime than this morning:


levure=> EXPLAIN ANALYZE SELECT lower(substr(l_name, 1, 1)) AS initiale
FROM people
levure-> UNION
levure-> SELECT lower(substr(org_name, 1, 1)) AS initiale FROM
organizations
levure-> ORDER BY initiale;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1130.85..1133.64 rows=1117 width=17) (actual time=802.52..802.58 rows=39 loops=1)
   Sort Key: initiale
   ->  Unique  (cost=1018.48..1074.32 rows=1117 width=17) (actual time=712.04..801.83 rows=39 loops=1)
         ->  Sort  (cost=1018.48..1046.40 rows=11167 width=17) (actual time=712.03..732.63 rows=11167 loops=1)
               Sort Key: initiale
               ->  Append  (cost=0.00..267.67 rows=11167 width=17) (actual time=0.21..263.54 rows=11167 loops=1)
                     ->  Subquery Scan "*SELECT* 1"  (cost=0.00..87.93 rows=4093 width=15) (actual time=0.20..80.47 rows=4093 loops=1)
                           ->  Seq Scan on people  (cost=0.00..87.93 rows=4093 width=15) (actual time=0.19..54.14 rows=4093 loops=1)
                     ->  Subquery Scan "*SELECT* 2"  (cost=0.00..179.74 rows=7074 width=17) (actual time=0.28..144.82 rows=7074 loops=1)
                           ->  Seq Scan on organizations  (cost=0.00..179.74 rows=7074 width=17) (actual time=0.27..99.06 rows=7074 loops=1)
 Total runtime: 806.47 msec
(11 rows)


I don't understand why this runtime has changed, because no data has
been added/updated/deleted for several weeks (I'm working on a copy of
the production database, and this copy is not accessible to users).

My PostgreSQL version is 7.3.2; I'll have to ask the administrator
whether the production server can be upgraded to 7.4.

Thanks in advance for your help.

---------------------------------------
Bruno BAGUETTE - pgsql-ml@baguette.net


RE : Increase performance of a UNION query that takes 655.07 msec to run?

From: "Bruno BAGUETTE"
> In addition to what Tom said, the row estimates look
> suspiciously default. You mention vacuuming, but do you ever
> analyze the tables?

I run VACUUM FULL ANALYZE as the postgres user on all the PostgreSQL
databases on the server, twice a day, sometimes more.

> Also, what do you have sort_mem set to?

[root@levure data]# cat postgresql.conf | grep sort_mem
sort_mem = 6144                 # min 64, size in KB

Do you think I should increase that value ?

It's not so easy to get that postgresql.conf file set up well; is
there any tool that suggests values for it?

Thanks in advance for your tips :-)

---------------------------------------
Bruno BAGUETTE - pgsql-ml@baguette.net


Re: RE : Increase performance of a UNION query that takes 655.07 msec to run?

From: Stephan Szabo
On Fri, 6 Feb 2004, Bruno BAGUETTE wrote:

> > In addition to what Tom said, the row estimates look
> > suspiciously default. You mention vacuuming, but do you ever
> > analyze the tables?
>
> I run VACUUM FULL ANALYZE with the postgres user on all the PostgreSQL
> databases on the server, twice a day, sometimes more.

Weird, because you're getting 1000 estimated on both people and
organizations.  What does pg_class have to say about those two tables?

> > Also, what do you have sort_mem set to?
>
> [root@levure data]# cat postgresql.conf | grep sort_mem
> sort_mem = 6144                 # min 64, size in KB
>
> Do you think I should increase that value ?

Hmm, I'd expect that the sort would fit in that space in general.  If you
want to try different values, you can set sort_mem from psql rather than
changing the configuration file.
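A quick way to experiment with this from a psql session, as suggested; the value 16384 below is only an illustrative guess, not a recommendation from this thread:

```sql
SET sort_mem = 16384;   -- per-session override, in KB
EXPLAIN ANALYZE
SELECT lower(substr(l_name, 1, 1)) AS initiale FROM people
UNION
SELECT lower(substr(org_name, 1, 1)) AS initiale FROM organizations
ORDER BY initiale;
RESET sort_mem;         -- back to the postgresql.conf value
```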

----

On my machine the index does actually help, although I needed to lower
random_page_cost a little from its default of 4 to get it used
preferentially. But I'm also getting times about 1/3 of yours (and my
machine is pretty poor), so I think my data may not match yours very
well.

RE : RE : Increase performance of a UNION query that takes 655.07 msec to run?

From: "Bruno BAGUETTE"
> On Fri, 6 Feb 2004, Bruno BAGUETTE wrote:
>
> > > In addition to what Tom said, the row estimates look suspiciously
> > > default. You mention vacuuming, but do you ever analyze
> > > the tables?
> >
> > I run VACUUM FULL ANALYZE with the postgres user on all the
> > PostgreSQL
> > databases on the server, twice a day, sometimes more.
>
> Weird, because you're getting 1000 estimated on both people
> and organizations.  What does pg_class have to say about
> those two tables?

I'm sorry, but I think I've misunderstood you. Are you telling me that
running VACUUM FULL ANALYZE is weird, or do you mean something else?


Finally, I've found another way: I've built a materialized view that
stores the initial (a CHAR(1)) of both people and organizations, with
an index on that column. I get excellent results:

 Unique  (cost=0.00..290.34 rows=1117 width=5) (actual time=0.52..267.38 rows=39 loops=1)
   ->  Index Scan using idx_mview_initials on mview_contacts  (cost=0.00..262.42 rows=11167 width=5) (actual time=0.51..172.15 rows=11167 loops=1)
 Total runtime: 267.81 msec
(3 rows)

So, that's a better runtime :-)
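Since PostgreSQL 7.3 has no CREATE MATERIALIZED VIEW, such a summary table presumably has to be built and refreshed by hand. A minimal sketch of one way to do it; the table and index names are taken from the plan above, but the build and refresh steps are assumptions, not something Bruno posted:

```sql
-- Build the summary table of initials once:
CREATE TABLE mview_contacts AS
    SELECT lower(substr(l_name, 1, 1)) AS initiale FROM people
    UNION ALL
    SELECT lower(substr(org_name, 1, 1)) AS initiale FROM organizations;

CREATE INDEX idx_mview_initials ON mview_contacts (initiale);
ANALYZE mview_contacts;

-- The original query then reduces to:
SELECT DISTINCT initiale FROM mview_contacts ORDER BY initiale;
```

The table then needs to be re-populated (e.g. TRUNCATE plus INSERT ... SELECT from a cron job, or triggers on the base tables) whenever people or organizations change.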

Thanks for your help :-)

-------------------------------------
Bruno BAGUETTE - pgsql-ml@baguette.net


Re: RE : RE : Increase performance of a UNION query that takes 655.07 msec to run?

From: Stephan Szabo
On Sun, 8 Feb 2004, Bruno BAGUETTE wrote:

> > On Fri, 6 Feb 2004, Bruno BAGUETTE wrote:
> >
> > > > In addition to what Tom said, the row estimates look suspiciously
> > > > default. You mention vacuuming, but do you ever analyze
> > > > the tables?
> > >
> > > I run VACUUM FULL ANALYZE with the postgres user on all the
> > > PostgreSQL
> > > databases on the server, twice a day, sometimes more.
> >
> > Weird, because you're getting 1000 estimated on both people
> > and organizations.  What does pg_class have to say about
> > those two tables?
>
> I'm sorry, but I think I've misunderstood you. Are you telling me that
> running VACUUM FULL ANALYZE is weird, or do you mean something else?

No, I was saying it's weird that it'd be misestimating to the default
values after a VACUUM FULL ANALYZE.