Performance problem with low correlation data

From
Scara Maccai
Date:
I have a problem with the method that PG uses to access my data.

Data is inserted into testinsert every 15 minutes.

ne_id varies from 1 to 20000.


CREATE TABLE testinsert
(
  ne_id integer NOT NULL,
  t timestamp without time zone NOT NULL,
  v integer[],
  CONSTRAINT testinsert_pk PRIMARY KEY (ne_id, t)
);

CREATE UNIQUE INDEX testinsert_time_key
  ON testinsert
  USING btree
  (t, ne_id);

The table therefore has a correlation of 1 on t, and a correlation close to 0 on ne_id.
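
To make that concrete: a loader along these lines, run every 15 minutes, would produce exactly this layout (just an illustration, not the real insert job, which isn't shown here):

insert into testinsert (ne_id, t, v)
select g, date_trunc('minute', now())::timestamp, array[0]   -- one row per ne_id, all with the same t
from generate_series(1, 20000) as g;

Each batch appends one contiguous chunk of 20000 rows, so t follows the physical order of the file, while any single ne_id reappears once per chunk, scattered over the whole table.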

I query this table using another table:

CREATE TABLE idtable
(
  id integer NOT NULL,
  groupname varchar(50),
  CONSTRAINT idtable_pk PRIMARY KEY (id, groupname)
);

CREATE INDEX idtable_group_idx
  ON idtable
  USING btree
  (groupname);


where each id is associated with a group:

select * from idtable left outer join testinsert on id=ne_id
where groupname='a group name' and t between $a_date and $another_date


PG usually chooses a nested loop join over all the ne_ids found for groupname='a group name'.
BUT, given the correlation in the table, this is a horrible access pattern: the table (15GB) gets read randomly, since data for one ne_id is scattered all over the table.

The "best" way to read the table would still be a nested loop, but a loop over the "t" values, not over the ne_id values, since data for the same timestamp is "close".

Or, even better, something like this would be very nice:

Bitmap Heap Scan
    for each id found in idtable where groupname='a group name'
        BitmapOr
            BitmapIndexScan using ne_id and time between $a_date and $another_date


That is: I understand why PG is using that access method to read the indexes, but I would like it to fetch the heap only after ALL the index entries have been read, so that it could reorder them...

So, given all that, how can I tell PG to use an algorithm such as:

fetch the heap
    for each quarter
        for each id found where groupname='a group name'
            fetch all the indexes

instead of:

for each id found where groupname='a group name'
    fetch the heap
        fetch all the indexes where ne_id=id and t between $a_date and $another_date



?


(

Some other info:

1) results clustering the table are x10-x20 faster, but I can't cluster the table (it gets written every 15 minutes and read pretty often); see the sketch after this aside for the kind of CLUSTER command meant
2) I know all the "t" values that I'm going to query, since there won't be more than one t per ne_id per 15 minutes; so I could use a generate_series($a_date, $another_date, '15 minutes'::interval) if that could help somehow:

select * from
idtable
cross join generate_series($a_date, $another_date, '15 minutes'::interval) as myt
left outer join testinsert on id=ne_id and myt=t
where groupname='a group name'

but it doesn't help...
)
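
(Regarding point 1: "clustering the table" here presumably means something along these lines; just a sketch, clustering on the primary key so that rows for one ne_id end up physically together:

CLUSTER testinsert USING testinsert_pk;   -- 8.4 syntax; on 8.3 it would be: CLUSTER testinsert_pk ON testinsert;
ANALYZE testinsert;

which is exactly what can't be afforded every day on a 15GB table that is written every 15 minutes.)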





Re: Performance problem with low correlation data

From
Greg Stark
Date:
On Mon, Jul 6, 2009 at 6:32 PM, Scara Maccai <m_lists@yahoo.it> wrote:


> The "best" way to read the table would still be a nested loop, but a loop on the
> "t" values, not on the ne_id values, since data for the same timestamp is "close".

But that would be a different query -- there are no restrictions on the
t values in this one.

> How can I tell to PG to use an algorithm such as:
>
> fetch the heap
>        for each quarter
>                for each id found where groupname='a group name'
>                        fetch all the indexes

Have you tried something using IN or EXISTS instead of a join? The
algorithm you describe doesn't work for the join because it has to
produce a record which includes the matching group columns. A bitmap
scan would return the various groups (I know in your case there's only
one but the optimizer can't be sure) mixed together.

That might work better in 8.4 than 8.3 as the IN and EXISTS handling
is improved.
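
For example (just a sketch, untested, using the names from your earlier message; note it only returns the testinsert columns, not the group name):

select ti.*
from testinsert ti
where ti.ne_id in (select id from idtable where groupname = 'a group name')
  and ti.t between $a_date and $another_date;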

Actually I wonder if doing a sequential scan with a hash join against
the group list wouldn't be a better option. That would scan more of
the heap but if they're really that spread out it might not make much
difference, and it would avoid having to do all the index scans. You
could experiment with turning enable_nestloop off and see how fast the
hash join plan is.
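
Something like (again untested, and remember to reset the setting afterwards):

set enable_nestloop = off;
explain analyze
select * from idtable left outer join testinsert on id=ne_id
where groupname='a group name' and t between $a_date and $another_date;
reset enable_nestloop;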

--
greg
http://mit.edu/~gsstark/resume.pdf

Re: Performance problem with low correlation data

From
Scara Maccai
Date:
> But that would be a different query -- there are no restrictions on the
> t values in this one.

There is a restriction on the t values:

select * from idtable left outer join testinsert on id=ne_id
where groupname='a group name' and t between $a_date and $another_date


> Have you tried something using IN or EXISTS instead of a join?

I still get a nested loop join on the ne_id column...

> The algorithm you describe doesn't work for the join because it has to
> produce a record which includes the matching group columns.

Yeah, I thought about that.
Basically I guess the "perfect" algorithm would be something like:

Hash Join   <---- this is needed to join values from both relations
  -> Bitmap Heap Scan
       for each id found in idtable where groupname='a group name'
         BitmapOr
           BitmapIndexScan using ne_id and t between $a_date and $another_date
  -> select id from idtable where groupname='a group name'

> Actually I wonder if doing a sequential scan with a hash
> join against
> the group list wouldn't be a better option.

The table is pretty big (60M rows), and sequential scans are the reason my queries are so slow: since the correlation on the ne_id column is so bad, the planner chooses seq scans when dealing with most of the "t" values, even if the number of "ne_id" values is low.

For the moment I've found this solution:

whenever too many "t" values are selected, which would lead the planner towards a seq scan (or a very poor bitmap index scan in case I disable seq scans), I create a temporary table:

create temporary table alldata as
select t, id as ne_id, groupname   -- id is aliased to ne_id so the USING join below matches
from generate_series(mydatestart, mydateend, '15 minutes'::interval) as t
cross join idtable
where groupname='a group name'
order by t, ne_id;

analyze alldata;

select * from alldata left outer join testinsert using (ne_id, t);

basically I'm doing what I'd like PG to do:

since the correlation on the "t" col is good, and the correlation on the "id" col is bad, query the index in the right order: "t" first, then "id" (that's what the "order by t, ne_id" on the creation of the temp table is for).

I would like PG to do that for me. Since it knows an index scan looping on ne_id would be wrong, I'd like it to create a "materialized" table where data is ordered by "t" first, instead of going for the seq scan.

This would lead to a x10 - x100 improvement in query time.








Re: Performance problem with low correlation data

From
m_lists@yahoo.it
Date:
Since no one replied to http://www.mail-archive.com/pgsql-general@postgresql.org/msg133360.html, I tried another approach.

I can't cluster the whole table every day; it would take too long (as I said, the table has 60M rows, and I have hundreds of such tables).
Plus, it wouldn't really make much sense: the only portion of the table that needs to be clustered is the one written after the last "cluster" command (since no row is ever deleted or updated, only inserted every 15 minutes).
So I thought: I'll "cluster" only the part that has been written each day:

begin;
lock table testinsert in ACCESS EXCLUSIVE MODE;
insert into testinsert
  select ne_id+100000, t, v from testinsert
  where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00'
  order by ne_id, t;
delete from testinsert
  where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00' and ne_id < 100000;
update testinsert set ne_id = ne_id - 100000
  where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00';
commit;


this would run after midnight on 2009-08-02; the next day would use different time values.
What I'm trying to do here is cluster on (ne_id, t) the portion of the table written each day.
Well, I guess the table is laid out as expected, but in pg_stats the correlation for the ne_id column is still VERY low:
select attname, n_distinct, correlation from pg_stats where tablename='testinsert';

 attname | n_distinct | correlation
---------+------------+-------------
 ne_id   |      20000 |    0.111041   <---- low value
 t       |        864 |    0.987778
 v       |          1 |           1


this leads the planner to sequential scans of the table as soon as about 10% of the table has to be read:

explain select * FROM idtable as g inner join testinsert on id=ne_id
where groupid between 1 and 4 and t between '2009-08-01 00:00:00' and '2009-08-09 00:00:00';
Hash Join  (cost=134.45..2127071.28 rows=614331 width=244)
  Hash Cond: (testinsert.ne_id = g.id)
  ->  Seq Scan on testinsert  (cost=0.00..2063200.00 rows=15358272 width=236)
        Filter: ((t >= '2009-08-01 00:00:00'::timestamp without time zone) AND (t <= '2009-08-09 00:00:00'::timestamp without time zone))
  ->  Hash  (cost=124.45..124.45 rows=800 width=8)
        ->  Bitmap Heap Scan on idtable g  (cost=24.45..124.45 rows=800 width=8)
              Recheck Cond: ((groupid >= 1) AND (groupid <= 4))
              ->  Bitmap Index Scan on idtable_pk  (cost=0.00..24.25 rows=800 width=0)
                    Index Cond: ((groupid >= 1) AND (groupid <= 4))

Which is a terrible plan!
testinsert contains t values between '2009-08-01' and '2009-08-09', and ne_id from 1 to 20000. But only 800 out of 20000 ne_id values have to be read; there's no need for a table scan!
I guess this is a reflection of the poor "correlation" on ne_id; but, as I said, I don't really think ne_id is that badly correlated.
In fact, doing a "select ne_id, t from testinsert limit 100000" I can see that data is laid out pretty much by "ne_id, t", grouped by day (that is, the same ne_id for one day, then the next ne_id, and so on until the next day).
How is the "correlation" calculated? Can someone explain to me why, after the procedure above, the correlation is so low?
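
(A rough way I can think of to eyeball it, only a sketch and surely not exactly what ANALYZE computes, is to correlate the block number of each row, pulled out of ctid with the text/point cast trick, against the ne_id value; it does scan the whole table, though:

select corr(blkno, ne_id) as rough_correlation
from (
  select (ctid::text::point)[0] as blkno, ne_id
  from testinsert
) s;
)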





Re: Performance problem with low correlation data

From
Alvaro Herrera
Date:
m_lists@yahoo.it wrote:

> testinsert contains t values between '2009-08-01' and '2009-08-09', and ne_id
> from 1 to 20000. But only 800 out of 20000 ne_id values have to be read;
> there's no need for a table scan!
> I guess this is a reflection of the poor "correlation" on ne_id; but, as I said,
> I don't really think ne_id is that badly correlated.
> In fact, doing a "select ne_id, t from testinsert limit 100000" I can see that
> data is laid out pretty much by "ne_id, t", grouped by day (that is, the same
> ne_id for one day, then the next ne_id, and so on until the next day).
> How is the "correlation" calculated? Can someone explain to me why, after the
> procedure above, the correlation is so low?

Did you run ANALYZE after the procedure above?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Performance problem with low correlation data

From
m_lists@yahoo.it
Date:

> > testinsert contains t values between '2009-08-01' and '2009-08-09', and ne_id
> > from 1 to 20000. But only 800 out of 20000 ne_id values have to be read;
> > there's no need for a table scan!
> > I guess this is a reflection of the poor "correlation" on ne_id; but, as I said,
> > I don't really think ne_id is that badly correlated.
> > In fact, doing a "select ne_id, t from testinsert limit 100000" I can see that
> > data is laid out pretty much by "ne_id, t", grouped by day (that is, the same
> > ne_id for one day, then the next ne_id, and so on until the next day).
> > How is the "correlation" calculated? Can someone explain to me why, after the
> > procedure above, the correlation is so low?
>
> Did you run ANALYZE after the procedure above?

Yes I did; the correlation on that column stays low.
Of course, I didn't expect a correlation of 1, since data is laid out (pretty much) like this:
(ne_id1) (t1 day1)
(ne_id1) (t2 day1)
...
(ne_id1) (tn day1)
(ne_id2) (t1 day1)
(ne_id2) (t2 day1)
...
(ne_id2) (tn day1)
...
(pretty much all the ne_ids)
(ne_id1) (t1 day2)
(ne_id1) (t2 day2)
...
(ne_id1) (tn day2)
(ne_id2) (t1 day2)
(ne_id2) (t2 day2)
...
(ne_id2) (tn day2)
... and so on
so ne_id is not strictly increasing, but it is pretty much the same (sequentially) for a whole day...