Re: Performance problem with low correlation data - Mailing list pgsql-general

From m_lists@yahoo.it
Subject Re: Performance problem with low correlation data
Date
Msg-id 486059.1675.qm@web24607.mail.ird.yahoo.com
In response to Performance problem with low correlation data  (Scara Maccai <m_lists@yahoo.it>)
Responses Re: Performance problem with low correlation data
List pgsql-general
Since no one replied to http://www.mail-archive.com/pgsql-general@postgresql.org/msg133360.html, I tried another approach:

I can't cluster the whole table every day; it would take too long (as I said, the table has 60M rows, and I have hundreds of tables like it).
Plus, it wouldn't really make much sense: the only portion of the table that needs clustering is the part written after the last CLUSTER command (since no row is ever deleted or updated, rows are only inserted every 15 minutes).
So I thought: I'll "cluster" only the part that has been written each day:

begin;
lock table testinsert in ACCESS EXCLUSIVE MODE;
insert into testinsert
    select ne_id + 100000, t, v
    from testinsert
    where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00'
    order by ne_id, t;
delete from testinsert
    where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00'
      and ne_id < 100000;
update testinsert
    set ne_id = ne_id - 100000
    where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00';
commit;


This runs just after midnight on 2009-08-02; the next day's run uses the next day's time range.
What I'm trying to do here is cluster on (ne_id, t) the portion of the table written each day.
Well, I guess the table is laid out as expected, but in pg_stats the correlation for the ne_id column is still VERY low:
select attname,n_distinct,correlation from pg_stats where tablename='testinsert3';
attname | n_distinct | correlation
---------+------------+-------------
ne_id   |      20000 |    0.111041  <---- low value
t       |        864 |    0.987778
v       |          1 |           1


this leads the planner to a sequential scan of the table as soon as about 10% of the table has to be read:

explain select * from idtable as g
    inner join testinsert on id = ne_id
    where groupid between 1 and 4
      and t between '2009-08-01 00:00:00' and '2009-08-09 00:00:00';
Hash Join  (cost=134.45..2127071.28 rows=614331 width=244)
   Hash Cond: (testinsert3.ne_id = g.id)
   ->  Seq Scan on testinsert  (cost=0.00..2063200.00 rows=15358272 width=236)
               Filter: ((t >= '2009-08-01 00:00:00'::timestamp without time zone) AND (t <= '2009-08-09 00:00:00'::timestamp without time zone))
   ->  Hash  (cost=124.45..124.45 rows=800 width=8)
               ->  Bitmap Heap Scan on idtable g  (cost=24.45..124.45 rows=800 width=8)
                          Recheck Cond: ((groupid >= 1) AND (groupid <= 4))
                               ->  Bitmap Index Scan on idtable_pk  (cost=0.00..24.25 rows=800 width=0)
                                             Index Cond: ((groupid >= 1) AND (groupid <= 4))

Which is a terrible plan!
testinsert contains t values between '2009-08-01' and '2009-08-09', and ne_id values from 1 to 20000. But only 800 of the 20000 ne_id values have to be read; there's no need for a table scan!
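A quick way to check whether an index-based plan would actually win here (a diagnostic sketch only, not something to leave enabled in production) is to push the planner away from sequential scans for the session and compare:

```sql
-- Diagnostic only: discourage seq scans for this session, then compare plans.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT *
FROM idtable AS g
JOIN testinsert ON id = ne_id
WHERE groupid BETWEEN 1 AND 4
  AND t BETWEEN '2009-08-01 00:00:00' AND '2009-08-09 00:00:00';

RESET enable_seqscan;
```

If the forced plan runs much faster, that would confirm the low correlation estimate is what steers the planner toward the seq scan.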
I guess this is a reflection of the poor "correlation" on ne_id; but, as I said, I don't think ne_id is really that badly correlated.
In fact, doing a "select ne_id, t from testinsert limit 100000" I can see that the data is laid out pretty much by (ne_id, t), grouped by day (that is, the same ne_id for one day, then the next ne_id, and so on until the next day).
How is the "correlation" calculated? Can someone explain why, after the procedure above, the correlation is still so low?
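For what it's worth, ANALYZE computes this statistic as the correlation between the physical order of rows on disk and the logical order of the column's values, over a random sample of rows. A rough full-table approximation can be computed by hand (a sketch, assuming PostgreSQL 8.4+ for window functions, and assuming a plain seq scan returns rows in heap order, which is typical but not guaranteed):

```sql
-- Approximate pg_stats.correlation for ne_id by correlating
-- physical position with position in value order.
SELECT corr(phys, rnk) AS ne_id_correlation
FROM (
    SELECT row_number() OVER ()               AS phys,  -- physical position
           rank()       OVER (ORDER BY ne_id) AS rnk    -- position in value order
    FROM testinsert
) AS s;
```

With a day-by-day layout (ne_id 1..20000 for day 1, then 1..20000 again for day 2, and so on), this value really is low: each ne_id value recurs in one run per day scattered across the whole file, so globally the column is far from monotonic even though every daily run is perfectly sorted.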




