Thread: Comparing two slices within one table efficiently

Comparing two slices within one table efficiently

From
"Ken Simpson"
Date:
I have a table with the following simplified form:

create table t (run_id integer,domain_id integer,mta_id integer,attribute1 integer,attribute2 integer,unique(run_id,
domain_id,mta_id)
 
);

The table has about 1 million rows with run_id=1, another 1 million rows with run_id=2, and so on.

I need to efficiently query the differences between "runs" - i.e. For each (domain_id, mta_id) tuple in run 1, is there
acoresponding tuple in run 2 where either attribute1 or attribute2 have changed?
 

The only way I have been able to think of doing this so far is an o(n^2) search, which even with indexes takes a long
time.e.g.
 
select * from t t1 where exists (select 1 from t t2 where t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and
(t2.attribute1!= t1.attribute1 or t2.attribute2 != t1.attribute2)
 

This query takes millenia...

Any help would be greatly appreciated. I hope I am naively missing some obvious alternative strategy, since this sort
ofoperation must be common in databases.
 

Thanks,
Ken


--
Ken Simpson, CEO
MailChannels Corporation
Reliable Email Delivery (tm)
http://www.mailchannels.com

Re: Comparing two slices within one table efficiently

From
chester c young
Date:
> I have a table with the following simplified form:
> 
> create table t (
>  run_id integer,
>  domain_id integer,
>  mta_id integer,
>  attribute1 integer,
>  attribute2 integer,
>  unique(run_id, domain_id, mta_id)
> );
> 
> The table has about 1 million rows with run_id=1, another 1 million
> rows with run_id=2, and so on.
> 
> I need to efficiently query the differences between "runs" - i.e. For
> each (domain_id, mta_id) tuple in run 1, is there a coresponding
> tuple in run 2 where either attribute1 or attribute2 have changed?
> 
> The only way I have been able to think of doing this so far is an
> o(n^2) search, which even with indexes takes a long time. e.g.
> 
>  select * from t t1 where exists (select 1 from t t2 where
> t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and (t2.attribute1
> != t1.attribute1 or t2.attribute2 != t1.attribute2)
> 
> This query takes millenia...
> 

first, add a change flag change_tf that is set through a trigger
whether this record different from record in the previous run.  second,
create an index on domain and mta where change_tf, so you're only
indexing changed records.

this would allow you to find your changes very efficiently at the
relatively small cost of adding one lookup and one extra index per
insert.


      
____________________________________________________________________________________
Pinpoint customers who are looking for what you sell. 
http://searchmarketing.yahoo.com/


Re: Comparing two slices within one table efficiently

From
Andrew Kroeger
Date:
Ken Simpson wrote:
> I have a table with the following simplified form:
> 
> create table t (
>  run_id integer,
>  domain_id integer,
>  mta_id integer,
>  attribute1 integer,
>  attribute2 integer,
>  unique(run_id, domain_id, mta_id)
> );
> 
> The table has about 1 million rows with run_id=1, another 1 million rows with run_id=2, and so on.
> 
> I need to efficiently query the differences between "runs" - i.e. For each (domain_id, mta_id) tuple in run 1, is
therea coresponding tuple in run 2 where either attribute1 or attribute2 have changed?
 
> 
> The only way I have been able to think of doing this so far is an o(n^2) search, which even with indexes takes a long
time.e.g.
 
> 
>  select * from t t1 where exists (select 1 from t t2 where t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and
(t2.attribute1!= t1.attribute1 or t2.attribute2 != t1.attribute2)
 
> 
> This query takes millenia...
> 
> Any help would be greatly appreciated. I hope I am naively missing some obvious alternative strategy, since this sort
ofoperation must be common in databases.
 

A self-join will probably perform better in your case.  Consider the
following query:

select   t1.domain_id as domain_id,   t1.mta_id as mta_id,   t1.run_id as run_id_1,   t1.attribute1 as attribute1_1,
t1.attribute2as attribute2_1,   t2.run_id as run_id_2,   t2.attribute1 as attribute1_2,   t2.attribute2 as
attribute2_2
from   t t1
join t t2 on   t1.mta_id = t2.mta_id and   t1.domain_id = t2.domain_id and   (t1.attribute1 != t2.attribute1 or
t1.attribute2!= t2.attribute2)
 

You are basically joining together 2 copies of the same table (one
aliased to "t1" and the other aliased to "t2").  The logic in the join
conditions are the same you had in the subselect.

You will probably want to constrain the query some more in the join
conditions.  As my example above is currently written, you will get 2
records for each true difference (e.g. for a given mta_id/domain_id,
you'll get 1 record when run_id X comes from t1 and run_id Y comes from
t2 & you get another record describing the same difference when run_id X
comes from t2 and run_id Y comes from t1).

Some example constraints that might help are:
t1.run_id < t2.run_id

Assuming run_id increases chronologically, this will return 1 record for
each case there is a difference in attributes from any prior run.
t1.run_id + 1 = t2.run_id

Again assuming run_id increases chronologically, this will only return a
record when there are attribute differences between 2 successive runs.

Indexes should also be considered for the performance of this query.  At
a minimum, you will probably want a compound index on (domain_id,
mta_id).  [... thinks for a bit ...]  Actually, you could re-order your
unique constraint to take care of this, as a unique constraint is
basically an index with a unique constraint on it.  Your unique
constraint is currently on (run_id, domain_id, mta_id).  If you re-order
that unique constraint to (domain_id, mta_id, run_id), you will probably
see a couple of benefits:

- Due to how PG can use the first parts of a compound index, it will be
able to use the (domain_id, mta_id) portion as an index to help your joins.

- I assume domain_id is much more selective than run_id, so the unique
checks should speed up as well.  The way you have the unique constraint
written, the index is first scanned based on run_id which isn't very
selective - once it finds the correct run_id in the index, there will
still be a lot of index entries to scan to match on domain_id and
mta_id.  Re-ordering the way I propose should narrow down the number of
index entries quite quickly, as it will first be narrowed on domain_id,
then mta_id, and finally run_id.

Hope this helps.

Andrew


Re: Comparing two slices within one table efficiently

From
"Christian Kindler"
Date:
Yes and you could make it even more speedy with the use table partitioning.
http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html

> select
>     t1.domain_id as domain_id,
>     t1.mta_id as mta_id,
>     t1.run_id as run_id_1,
>     t1.attribute1 as attribute1_1,
>     t1.attribute2 as attribute2_1,
>     t2.run_id as run_id_2,
>     t2.attribute1 as attribute1_2,
>     t2.attribute2 as attribute2_2
> from
>     t t1
> join t t2 on
>     t1.mta_id = t2.mta_id and
>     t1.domain_id = t2.domain_id and
>     (t1.attribute1 != t2.attribute1 or t1.attribute2 != t2.attribute2)

-- 
cu
Chris

Ist Ihr Browser Vista-kompatibel? Jetzt die neuesten 
Browser-Versionen downloaden: http://www.gmx.net/de/go/browser


Re: Comparing two slices within one table efficiently

From
Tom Lane
Date:
"Ken Simpson" <ksimpson@mailchannels.com> writes:
> select * from t t1 where exists (select 1 from t t2 where
> t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and (t2.attribute1
> != t1.attribute1 or t2.attribute2 != t1.attribute2)

> This query takes millenia...

Yeah, because you're effectively forcing the least efficient style of
join --- a nestloop is generally going to suck for a full-table join,
even if you've got indexes.  Try something like this:

select * from t t1 join t t2 on (t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id)
where (t2.attribute1 != t1.attribute1 or t2.attribute2 != t1.attribute2)

Make sure you've got work_mem cranked up to something appropriate.
        regards, tom lane


Re: Comparing two slices within one table efficiently

From
Ken Simpson
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Christian Kindler [13/08/07 21:34 +0200]:
> Yes and you could make it even more speedy with the use table partitioning.
> http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html

Thanks for all your speedy help, everyone. I tried doing a "self join"
and that sped things up enormously (query took on the order of 30
seconds to compare two million-row table slices, resulting in a 20K
row result). I will also try re-ordering the unique constraint to get
speedier indexing out of it and will look at table partitioning.

Regards,
Ken

- -- 
Ken Simpson
CEO, MailChannels

Fax: +1 604 677 6320
Web: http://mailchannels.com
MailChannels - Reliable Email Delivery (tm)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)

iD8DBQFGwLQq2YHPr/ypq5QRApP8AKDfRGqDFkcONh0YaojX7362nXP12gCg3WZ6
k5ZBwcMplXyVkEguQtbgdFU=
=bsyu
-----END PGP SIGNATURE-----