Workaround for cross column stats dependency - Mailing list pgsql-performance
From:      Guillaume Smet
Subject:   Workaround for cross column stats dependency
Date:
Msg-id:    1d4e0c10801221657s6b345fcdqd1c0b3f20e519316@mail.gmail.com
Responses: Re: Workaround for cross column stats dependency
List:      pgsql-performance
Hi -performance,

While testing 8.3, I found this query, which is equally slow on 8.1 and 8.3 and seems really slow for a not-so-complex query. The stats are as good as possible, and the behaviour of PostgreSQL seems logical considering the stats, but I'm looking for a workaround to speed up this query.

So here is the original query:

cityvox_prod=# EXPLAIN ANALYZE SELECT vq.codequar, vq.liblong, vq.libcourt
FROM lieu l, vilquartier vq, rubtylieu rtl, genrelieu gl, lieugelieu lgl
WHERE l.codequar = vq.codequar
  AND l.dfinvalidlieu IS NULL
  AND vq.codevil = 'MUC'
  AND lgl.numlieu = l.numlieu
  AND lgl.codegelieu = gl.codegelieu
  AND gl.codetylieu = rtl.codetylieu
  AND rtl.codeth = 'RES' -- the interesting part is here
GROUP BY vq.codequar, vq.liblong, vq.libcourt, vq.flagintramuros
ORDER BY vq.flagintramuros, vq.liblong;

                                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2773.18..2773.25 rows=26 width=43) (actual time=602.822..602.829 rows=13 loops=1)
   Sort Key: vq.flagintramuros, vq.liblong
   Sort Method:  quicksort  Memory: 26kB
   ->  HashAggregate  (cost=2772.31..2772.57 rows=26 width=43) (actual time=602.769..602.778 rows=13 loops=1)
         ->  Hash Join  (cost=2737.48..2769.83 rows=248 width=43) (actual time=601.999..602.580 rows=167 loops=1)
               Hash Cond: ((vq.codequar)::text = (l.codequar)::text)
               ->  Bitmap Heap Scan on vilquartier vq  (cost=1.91..27.50 rows=47 width=43) (actual time=0.032..0.067 rows=47 loops=1)
                     Recheck Cond: ((codevil)::text = 'MUC'::text)
                     ->  Bitmap Index Scan on idx_vilquartier_codevil  (cost=0.00..1.90 rows=47 width=0) (actual time=0.023..0.023 rows=47 loops=1)
                           Index Cond: ((codevil)::text = 'MUC'::text)
               ->  Hash  (cost=2646.25..2646.25 rows=7145 width=5) (actual time=601.955..601.955 rows=62526 loops=1)
                     ->  Nested Loop  (cost=0.00..2646.25 rows=*7145* width=5) (actual time=0.058..548.618 rows=*62526* loops=1)
                           ->  Nested Loop  (cost=0.00..349.71 rows=7232 width=4) (actual time=0.049..147.221 rows=66292 loops=1)
                                 ->  Nested Loop  (cost=0.00..8.59 rows=13 width=4) (actual time=0.027..0.254 rows=88 loops=1)
                                       ->  Seq Scan on rubtylieu rtl  (cost=0.00..2.74 rows=1 width=4) (actual time=0.013..0.026 rows=1 loops=1)
                                             Filter: ((codeth)::text = 'RES'::text)
                                       ->  Index Scan using ind_genrelieu2 on genrelieu gl  (cost=0.00..5.68 rows=*14* width=8) (actual time=0.013..0.119 rows=*88* loops=1)
                                             Index Cond: ((gl.codetylieu)::text = (rtl.codetylieu)::text)
                                 ->  Index Scan using idx_lieugelieu_codegelieu on lieugelieu lgl  (cost=0.00..18.33 rows=633 width=8) (actual time=0.014..0.802 rows=753 loops=88)
                                       Index Cond: ((lgl.codegelieu)::text = (gl.codegelieu)::text)
                           ->  Index Scan using pk_lieu on lieu l  (cost=0.00..0.31 rows=1 width=9) (actual time=0.003..0.004 rows=1 loops=66292)
                                 Index Cond: (l.numlieu = lgl.numlieu)
                                 Filter: (l.dfinvalidlieu IS NULL)
 Total runtime: 602.930 ms

The query is looking for the parts of the city where we can find restaurants. The problem with this query is that we have several categories (here genrelieu) for a given type (here rubtylieu), and several types for a given theme (here the theme is codeth = 'RES'). When the value of the theme is RES (restaurants), there is only one type (it's also RES). The fact is that there are a lot of rows for the value RES in genrelieu (all the types of food and so on) compared to the other values.
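For reference, the 14-row estimate in the plan above comes from the per-column statistics on genrelieu.codetylieu, not from the actual value 'RES', which the planner only sees through the join at run time. Something along these lines against pg_stats (just an illustrative sketch, not part of the original problem) shows the distribution the planner works from:

cityvox_prod=# SELECT n_distinct, most_common_vals, most_common_freqs
               FROM pg_stats
               WHERE tablename = 'genrelieu' AND attname = 'codetylieu';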
Considering that PostgreSQL doesn't know the value RES when it estimates the scan on genrelieu (it only has the column statistics, not the value coming from the join), 14 rows is the expected estimate given the distribution of values in genrelieu, so there's nothing really wrong here. But it's really, really slow.

If I remove the join on rubtylieu and inject the obtained value directly into the query, the estimates are exact and I get a far better plan:

cityvox_prod=# EXPLAIN ANALYZE SELECT vq.codequar, vq.liblong, vq.libcourt
FROM lieu l, vilquartier vq, genrelieu gl, lieugelieu lgl
WHERE l.codequar = vq.codequar
  AND l.dfinvalidlieu IS NULL
  AND vq.codevil = 'MUC'
  AND lgl.numlieu = l.numlieu
  AND lgl.codegelieu = gl.codegelieu
  AND gl.codetylieu = 'RES'
GROUP BY vq.codequar, vq.liblong, vq.libcourt, vq.flagintramuros
ORDER BY vq.flagintramuros, vq.liblong;

                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7070.53..7070.59 rows=26 width=43) (actual time=8.502..8.511 rows=13 loops=1)
   Sort Key: vq.flagintramuros, vq.liblong
   Sort Method:  quicksort  Memory: 26kB
   ->  HashAggregate  (cost=7069.65..7069.91 rows=26 width=43) (actual time=8.451..8.458 rows=13 loops=1)
         ->  Hash Join  (cost=10.06..7053.22 rows=1643 width=43) (actual time=0.318..8.267 rows=167 loops=1)
               Hash Cond: ((lgl.codegelieu)::text = (gl.codegelieu)::text)
               ->  Nested Loop  (cost=2.63..7001.77 rows=7358 width=47) (actual time=0.098..7.361 rows=973 loops=1)
                     ->  Nested Loop  (cost=2.63..5527.36 rows=4319 width=47) (actual time=0.084..2.574 rows=630 loops=1)
                           ->  Index Scan using idx_vilquartier_codevil on vilquartier vq  (cost=0.00..34.06 rows=47 width=43) (actual time=0.023..0.062 rows=47 loops=1)
                                 Index Cond: ((codevil)::text = 'MUC'::text)
                           ->  Bitmap Heap Scan on lieu l  (cost=2.63..115.05 rows=146 width=9) (actual time=0.014..0.035 rows=13 loops=47)
                                 Recheck Cond: ((l.codequar)::text = (vq.codequar)::text)
                                 Filter: (l.dfinvalidlieu IS NULL)
                                 ->  Bitmap Index Scan on lieu_i_codequar  (cost=0.00..2.59 rows=146 width=0) (actual time=0.010..0.010 rows=13 loops=47)
                                       Index Cond: ((l.codequar)::text = (vq.codequar)::text)
                     ->  Index Scan using idx_lieugelieu_numlieu_principal on lieugelieu lgl  (cost=0.00..0.32 rows=2 width=8) (actual time=0.003..0.005 rows=2 loops=630)
                           Index Cond: (lgl.numlieu = l.numlieu)
               ->  Hash  (cost=6.33..6.33 rows=88 width=4) (actual time=0.153..0.153 rows=88 loops=1)
                     ->  Bitmap Heap Scan on genrelieu gl  (cost=2.23..6.33 rows=*88* width=4) (actual time=0.028..0.088 rows=*88* loops=1)
                           Recheck Cond: ((codetylieu)::text = 'RES'::text)
                           ->  Bitmap Index Scan on ind_genrelieu2  (cost=0.00..2.21 rows=88 width=0) (actual time=0.021..0.021 rows=88 loops=1)
                                 Index Cond: ((codetylieu)::text = 'RES'::text)
 Total runtime: 8.589 ms

So the question is: is there any way to improve the results of the original query, other than doing a first query in the application to get the list of types and injecting them into a second query (the one just above; a sketch of that two-step approach is in the PS below)?

Thanks.

--
Guillaume
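PS: to make the two-step workaround explicit, here is a rough sketch. The IN list would be built by the application from the result of the first query; it is shown here with the single value that query returns for the RES theme:

cityvox_prod=# SELECT codetylieu FROM rubtylieu WHERE codeth = 'RES';

cityvox_prod=# SELECT vq.codequar, vq.liblong, vq.libcourt
FROM lieu l, vilquartier vq, genrelieu gl, lieugelieu lgl
WHERE l.codequar = vq.codequar
  AND l.dfinvalidlieu IS NULL
  AND vq.codevil = 'MUC'
  AND lgl.numlieu = l.numlieu
  AND lgl.codegelieu = gl.codegelieu
  AND gl.codetylieu IN ('RES') -- values injected from the first query
GROUP BY vq.codequar, vq.liblong, vq.libcourt, vq.flagintramuros
ORDER BY vq.flagintramuros, vq.liblong;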