Postgresql 8.3 statistic stadistinct value dead wrong even with default_statistics_target=1000 - Mailing list pgsql-general

From Maxim Boguk
Subject Postgresql 8.3 statistic stadistinct value dead wrong even with default_statistics_target=1000
Date
Msg-id 48AE99BC.2090008@masterhost.ru
Whole thread Raw
Responses Re: Postgresql 8.3 statistic stadistinct value dead wrong even with default_statistics_target=1000
List pgsql-general
Some time ago i found one simple sql over large table eat whole ram+swap and almost killed server (postgresql 8.3.3 on
4gbfreebsd server): 
After some exploring i found what happens:

Query was over simple table:

profiles=# \d+ counter_vis
       Table "counter_vis"
     Column    |  Type   | Modifiers | Description
--------------+---------+-----------+-------------
  counter_id   | bigint  | not null  |
  visitor_id   | bigint  | not null  |
  ts           | bigint  | not null  |
  sessionstart | bigint  | not null  |
  sessionend   | bigint  | not null  |
  numpageviews | integer | not null  |
Indexes:
     "counter_vis_counter" btree (counter_id)
     "counter_vis_vis" btree (visitor_id)
Has OIDs: no

Which contain around 648M entries. (according fresh analyzed stats from pg_stat_user_tables).

Query was: select count(*) from (select visitor_id, sum(numpageviews) as s from counter_vis group by visitor_id having
sum(numpageviews)>1)as foo; 

With plan:
                                             QUERY PLAN
---------------------------------------------------------------------------------------------------
  Aggregate  (cost=17429989.40..17429989.41 rows=1 width=0)
    ->  HashAggregate  (cost=17422096.40..17426700.65 rows=263100 width=12)
          Filter: (sum(counter_vis.numpageviews) > 1)
          ->  Seq Scan on counter_vis (cost=0.00..12554826.80 rows=648969280 width=12)


Plan look ok... but how query eat over 4gb ram?

After lookin i found one strange point: " rows=263100" because i know in table must have around 70M unique
visitor_id's.

Manual analyze on table do not changed plan.
Here is pg_statistic entry after analyze (with default_statistics_target=10):

profiles=# SELECT * from pg_statistic where starelid=25488 and staattnum=2 order by 2 asc;
-[ RECORD 1
]-------------------------------------------------------------------------------------------------------------------------------------------
starelid    | 25488
staattnum   | 2
stanullfrac | 0
stawidth    | 8
stadistinct | 263100              (!!!! here is 70M distinct values in reality!!!!)
stakind1    | 1
stakind2    | 2
stakind3    | 3
stakind4    | 0
staop1      | 410
staop2      | 412
staop3      | 412
staop4      | 0
stanumbers1 |
{0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667,0.000666667}
stanumbers2 |
stanumbers3 | {-0.0443004}
stanumbers4 |
stavalues1  |
{413866965,489514660,624858316,753063164,790095243,1279713644,1628857812,2104294292,2726728837,2771123172}
stavalues2  |
{-9035671468843485583,184524075,555699387,921684844,1329929495,1897558820,2602632340,3083805588,3511488708,3963719633,9173061477162286552}
stavalues3  |
stavalues4  |


so stadistinct is 300 times wrong from reality.

Already strange... and i need run that query anyway... so i changed default_statistics_target to 1000 and analyzed
tableagain: 

profiles=# ANALYZE verbose counter_vis;
INFO:  analyzing "counter_vis"
INFO:  "counter_vis": scanned 300000 of 6065134 pages, containing 32100000 live rows and 0 dead rows; 300000 rows in
sample,648969338 estimated total rows 
ANALYZE

After statistic was better:

starelid    | 25488
staattnum   | 2
stanullfrac | 0
stawidth    | 8
stadistinct | 7.12958e+06
stakind1    | 1
stakind2    | 2
stakind3    | 3
stakind4    | 0
staop1      | 410
staop2      | 412
staop3      | 412
staop4      | 0
... long stats skipped...

!!!! but stadistinct still 10 times wrong from reality:

profiles=# SELECT count(distinct visitor_id) from counter_vis;
   count
----------
  69573318
(1 row)

Any way deal with such situations? Because 10х difference if postgres choose hashed plan will easy kill server because
OOM(because query will use 10x more ram then postgres awaited). 

Probably some strange effects in statdistinc count algoritm?
Or just any way remove limits on default_statistics_target?

Thanks for help.

PS: sorry for bad english.

--
Maxim Boguk

pgsql-general by date:

Previous
From: Peter Billen
Date:
Subject: problem with foreign keys + data-only backup
Next
From: "c k"
Date:
Subject: LIKE and SIMILAR TO