Re: Strange statistics - Mailing list pgsql-general

From Henrik
Subject Re: Strange statistics
Date
Msg-id 7A5A80E8-8F03-40C2-B2B8-9E70AE89DE04@mac.se
Whole thread Raw
In response to Re: Strange statistics  (Joris Dobbelsteen <joris@familiedobbelsteen.nl>)
List pgsql-general
3 jun 2008 kl. 23.31 skrev Joris Dobbelsteen:

> Henrik wrote:
>> Hi list,
>> I'm having a table with a lots of file names in it. (Aprox 3
>> million) in a 8.3.1 db.
>> Doing this simple query shows that the statistics is way of but I
>> can get them right even when I raise the statistics to 1000.
>> db=# alter table tbl_file alter file_name set statistics 1000;
>> ALTER TABLE
>> db=# analyze tbl_file;
>> ANALYZE
>> db=# explain analyze select * from tbl_file where lower(file_name)
>> like lower('to%');
>>                                                         QUERY PLAN
>>
----------------------------------------------------------------------------------------------------------------------------
Bitmap  
>>  Heap Scan on tbl_file  (cost=23.18..2325.13 rows=625 width=134)
>> (actual time=7.938..82.386 rows=17553 loops=1)
>>   Filter: (lower((file_name)::text) ~~ 'to%'::text)
>>   ->  Bitmap Index Scan on tbl_file_idx  (cost=0.00..23.02 rows=625
>> width=0) (actual time=6.408..6.408 rows=17553 loops=1)
>>         Index Cond: ((lower((file_name)::text) ~>=~ 'to'::text) AND
>> (lower((file_name)::text) ~<~ 'tp'::text))
>> Total runtime: 86.230 ms
>> (5 rows)
>> How can it be off by a magnitude of 28??
>
> These are statistics and represent an only estimate! In this case,
> the planner seems to be doing the right thing(tm) anyway.
>
> Statistics is a frequently misunderstood subject and usually
> provides excellent material to draw plain wrong conclusions. There
> is a good chance that due to the physical layout of your data, the
> algorithms in the statistics collector, the existence of uncertainty
> and some more unknown factors your statistics will be biased. This
> is a situations where you noticed it.
>
> Running "SELECT * FROM pg_stats;" will give you the statistics the
> planner uses and can provide some hints to why the planner has
> chosen these estimates.
> Probably statistics will vary between ANALYZE runs. Its also
> possible to try "CLUSTER" and friends. Try different queries and
> look at the deviations.
Thanks Joris for your input. You are the second person that suggests
CLUSTER for me. Maybe I should take a look. The problem is that our
select queries are kinda random. Would CLUSTER help then also? Should
I just CLUSTER on the moste used index or?

Thanks
/henke


>
>
> All in all, you should really start worrying when the planner starts
> planning inefficient queries. Since its a filename, it might be
> highly irregular (random) and a low statistics target might be good
> enough anyways.
>
> Unfortunately I'm not a statistics expert...
>
> - Joris


pgsql-general by date:

Previous
From: "Richard Broersma"
Date:
Subject: Re: Tripping up on my first attempt at building PG from source
Next
From: Henrik
Date:
Subject: Re: Database growing. Need autovacuum help.