Re: PG planning randomly ? - Mailing list pgsql-performance

From: Laurent Raufaste
Subject: Re: PG planning randomly ?
Date:
Msg-id: 669dc9710802280946l6b6b1bb4lf5a8950213552b6d@mail.gmail.com
In response to: Re: PG planning randomly ?  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
2008/2/27, Tom Lane <tgl@sss.pgh.pa.us>:
>
> Are you sure the table had been analyzed recently at all on that server?
>
>  If it had, then what you must be dealing with is a different result from
>  a different random sample.  The laws of statistics say that sometimes a
>  random sample won't be very representative ... but if the sample is
>  reasonably large they also say that won't happen very often.  You could
>  try ANALYZEing over and over and seeing what rowcount estimate you get
>  after each one.  If you frequently get a bad estimate, maybe it would be
>  worth looking at the pg_stats row for _comment.path to see if there's
>  anything obviously bogus about the bad samples.
>

Thanks for your help Tom, it's greatly appreciated here =)

Yes, I ANALYZE the table before each batch of requests.
I isolated the problem: it happens when a lot (thousands) of rows in
the _comment table match the query. I can now reproduce the problem at
will, on any server, even on our development server.

- I took 3 paths matching thousands of rows: 0.1.4108047, 0.1.15021804
and 0.1.4749259
- I wrote the following script:
ANALYZE _comment; -- default_statistics_target is 100
EXPLAIN ANALYZE SELECT 1 FROM _comment WHERE path <@ '0.1.4108047';
EXPLAIN ANALYZE SELECT 1 FROM _comment WHERE path <@ '0.1.15021804';
EXPLAIN ANALYZE SELECT 1 FROM _comment WHERE path <@ '0.1.4749259';
SELECT * FROM pg_stats WHERE tablename = '_comment' AND attname = 'path';

If I execute it many times, the row estimates (and costs) are OK for
approx. 2/3 of the executions:

Bitmap Heap Scan on _comment  (cost=114.24..4634.75 rows=1540 width=0)
(actual time=6.715..13.836 rows=12589 loops=1)
 Recheck Cond: (path <@ '0.1.15021804'::ltree)
 ->  Bitmap Index Scan on gist_idx_comment_path  (cost=0.00..113.85
rows=1540 width=0) (actual time=6.515..6.515 rows=12589 loops=1)
       Index Cond: (path <@ '0.1.15021804'::ltree)

And for the remaining 1/3, the estimates are huge:

Bitmap Heap Scan on _comment  (cost=10366.65..342840.71 rows=156174
width=0) (actual time=6.513..12.984 rows=12589 loops=1)
 Recheck Cond: (path <@ '0.1.15021804'::ltree)
 ->  Bitmap Index Scan on gist_idx_comment_path  (cost=0.00..10327.61
rows=156174 width=0) (actual time=6.313..6.313 rows=12589 loops=1)
       Index Cond: (path <@ '0.1.15021804'::ltree)

The pg_stats row shows no strange values, and the only fields that are
not constant across runs are null_frac and correlation. avg_width goes
from 56 to 57 and n_distinct stays at -1 (which is OK, all paths are
distinct) after each ANALYZE:
   [schemaname] => ob2
   [tablename] => _comment
   [attname] => path
   [null_frac] => 6.66667e-05
   [avg_width] => 56
   [n_distinct] => -1
   [most_common_vals] =>
   [most_common_freqs] =>
   [correlation] => -0.256958
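
To check the sampled values themselves, as you suggested, the pg_stats
lookup can be extended to include histogram_bounds. This is just a
sketch of the check; with no most_common_vals for this column, an
unrepresentative sample would have to show up there:

-- Sketch: inspect the sampled path values after each ANALYZE.
SELECT null_frac, avg_width, n_distinct, correlation, histogram_bounds
FROM pg_stats
WHERE schemaname = 'ob2'
  AND tablename = '_comment'
  AND attname = 'path';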

If I do the same test with default_statistics_target=1000, I get the
same behavior (huge row counts) but a bit closer to reality. Instead of
getting only 2 different estimates across all the requests (rows=1540
and rows=156174), I get 3 different ones (rows=1543, rows=15446 and
rows=61784).
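
The same larger sample could probably be obtained for this column alone
instead of raising default_statistics_target globally, something like:

-- Sketch: raise the statistics target for _comment.path only, then re-sample.
ALTER TABLE _comment ALTER COLUMN path SET STATISTICS 1000;
ANALYZE _comment;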

The problem is that the estimate is still huge compared to reality,
and the query we use in our production environment then switches to a
different plan.

Fast version:

Limit  (cost=15557.29..15557.30 rows=5 width=570) (actual
time=1305.824..1305.829 rows=5 loops=1)
 ->  Sort  (cost=15557.29..15561.14 rows=1540 width=570) (actual
time=1305.822..1305.825 rows=5 loops=1)
       Sort Key: _comment.date_publishing
       ->  Nested Loop  (cost=0.00..15475.75 rows=1540 width=570)
(actual time=0.185..847.502 rows=61537 loops=1)
             ->  Index Scan using gist_idx_comment_path on _comment
(cost=0.00..4746.26 rows=1540 width=537) (actual time=0.118..307.553
rows=64825 loops=1)
                   Index Cond: (path <@ '0.1.4108047'::ltree)
             ->  Index Scan using _article_pkey on _article
(cost=0.00..6.95 rows=1 width=41) (actual time=0.006..0.006 rows=1
loops=64825)
                   Index Cond: (_article.id = _comment.parent_id)


Slow version:

Limit  (cost=0.00..1047.60 rows=5 width=566) (actual time=0.352..1.625
rows=5 loops=1)
 ->  Nested Loop  (cost=0.00..32663447.76 rows=155897 width=566)
(actual time=0.351..1.620 rows=5 loops=1)
       ->  Index Scan Backward using idx_comment_date_publishing on
_comment  (cost=0.00..31719108.69 rows=155897 width=533) (actual
time=0.286..1.412 rows=5 loops=1)
             Filter: (path <@ '0.1.4108047'::ltree)
       ->  Index Scan using _article_pkey on _article
(cost=0.00..6.04 rows=1 width=41) (actual time=0.038..0.039 rows=1
loops=5)
             Index Cond: (_article.id = _comment.parent_id)
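
For context, the production query has roughly this shape (reconstructed
from the plans above; the actual select list and sort direction are
assumptions on my part):

-- Rough shape of the query behind the two plans above (details assumed):
SELECT _comment.*, _article.*
FROM _comment
JOIN _article ON _article.id = _comment.parent_id
WHERE _comment.path <@ '0.1.4108047'
ORDER BY _comment.date_publishing DESC
LIMIT 5;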

Don't you think increasing some memory parameter would help the server
with this kind of query? We have 20+ GB of RAM on those servers.
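
The parameters I have in mind are along these lines (values are only a
sketch for a 20 GB machine, not our actual configuration):

-- Sketch only: memory/planner settings that seem relevant on a 20+ GB server.
-- effective_cache_size and work_mem can be tried per session with SET;
-- shared_buffers has to be changed in postgresql.conf and needs a restart.
SET effective_cache_size = '16GB';  -- how much data the planner may assume is cached
SET work_mem = '64MB';              -- per-sort memory, used by the Sort node in the fast plan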

Thanks.

--
Laurent Raufaste
<http://www.glop.org/>
