Re: Performance problem with low correlation data - Mailing list pgsql-general
From: m_lists@yahoo.it
Subject: Re: Performance problem with low correlation data
Msg-id: 486059.1675.qm@web24607.mail.ird.yahoo.com
In response to: Performance problem with low correlation data (Scara Maccai <m_lists@yahoo.it>)
Responses: Re: Performance problem with low correlation data
List: pgsql-general
Since no one replied to http://www.mail-archive.com/pgsql-general@postgresql.org/msg133360.html, I tried another approach.

I can't cluster the whole table every day; it would take too long (as I said, the table has 60M rows, and I have hundreds of such tables). Plus, it wouldn't really make much sense: the only portion of the table that needs clustering is the part written after the last "cluster" command (since no row is ever deleted or updated, rows are only inserted every 15 minutes). So I thought: I'll "cluster" only the part that was written during the day:

    begin;
    lock table testinsert in ACCESS EXCLUSIVE MODE;
    insert into testinsert
      select ne_id+100000, t, v from testinsert
      where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00'
      order by ne_id, t;
    delete from testinsert
      where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00'
      and ne_id < 100000;
    update testinsert set ne_id = ne_id - 100000
      where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00';
    commit;

This would run after midnight on 2009-08-02; the next day's run would use different time values. What I'm trying to do here is cluster on (ne_id, t) the portion of the table written each day.
Well, I guess the table is laid out as expected, but in pg_stats the correlation for the ne_id column is still VERY low:

    select attname, n_distinct, correlation from pg_stats where tablename='testinsert3';

     attname | n_distinct | correlation
    ---------+------------+-------------
     ne_id   |      20000 |    0.111041   <---- low value
     t       |        864 |    0.987778
     v       |          1 |           1

This leads the planner to a sequential scan of the table as soon as 10% of the table has to be read:

    explain select * from idtable as g inner join testinsert on id=ne_id
      where groupid between 1 and 4
      and t between '2009-08-01 00:00:00' and '2009-08-09 00:00:00';

     Hash Join  (cost=134.45..2127071.28 rows=614331 width=244)
       Hash Cond: (testinsert3.ne_id = g.id)
       ->  Seq Scan on testinsert  (cost=0.00..2063200.00 rows=15358272 width=236)
             Filter: ((t >= '2009-08-01 00:00:00'::timestamp without time zone) AND (t <= '2009-08-09 00:00:00'::timestamp without time zone))
       ->  Hash  (cost=124.45..124.45 rows=800 width=8)
             ->  Bitmap Heap Scan on idtable g  (cost=24.45..124.45 rows=800 width=8)
                   Recheck Cond: ((groupid >= 1) AND (groupid <= 4))
                   ->  Bitmap Index Scan on idtable_pk  (cost=0.00..24.25 rows=800 width=0)
                         Index Cond: ((groupid >= 1) AND (groupid <= 4))

Which is a terrible plan! testinsert contains t values between '2009-08-01' and '2009-08-09', and ne_id values from 1 to 20000; but only 800 of the 20000 ne_id values have to be read, so there's no need for a table scan.

I guess this is a reflection of the poor "correlation" on ne_id; but, as I said, I don't really think ne_id is that badly correlated. In fact, doing a "select ne_id, t from testinsert limit 100000" I can see that the data is laid out pretty much by (ne_id, t), grouped by day (that is, one ne_id for a whole day, then the next ne_id, and so on until the next day).

How is the "correlation" calculated? Can someone explain to me why, after the procedure above, the correlation is so low?
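(Trying to illustrate my own guess here: if I understand correctly, the pg_stats correlation is essentially a Pearson correlation between a row's physical position and its value order. If so, then the layout my procedure produces — ne_id restarting from 1 every day, i.e. a repeated ramp — would naturally score low. A scaled-down sketch with made-up sizes (9 days, 1000 ids instead of 20000) reproduces a value very close to what I see:

```python
# Hypothetical, scaled-down model of my table layout after the per-day
# cluster procedure: ne_id runs 1..1000 within each "day", then restarts.
from statistics import mean

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

days, n_ids = 9, 1000
ne_id = [i for _ in range(days) for i in range(1, n_ids + 1)]  # repeated ramp
pos = list(range(len(ne_id)))                                  # physical row position
r = pearson(pos, ne_id)
print(round(r, 4))  # ~0.1111
```

If that really is how it's computed, 0.111041 ≈ 1/9 would just mean "one ramp per day over nine days", even though within each day the data is perfectly ordered.)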