Thread: Comparing two slices within one table efficiently
I have a table with the following simplified form: create table t (run_id integer,domain_id integer,mta_id integer,attribute1 integer,attribute2 integer,unique(run_id, domain_id,mta_id) ); The table has about 1 million rows with run_id=1, another 1 million rows with run_id=2, and so on. I need to efficiently query the differences between "runs" - i.e. For each (domain_id, mta_id) tuple in run 1, is there acoresponding tuple in run 2 where either attribute1 or attribute2 have changed? The only way I have been able to think of doing this so far is an o(n^2) search, which even with indexes takes a long time.e.g. select * from t t1 where exists (select 1 from t t2 where t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and (t2.attribute1!= t1.attribute1 or t2.attribute2 != t1.attribute2) This query takes millenia... Any help would be greatly appreciated. I hope I am naively missing some obvious alternative strategy, since this sort ofoperation must be common in databases. Thanks, Ken -- Ken Simpson, CEO MailChannels Corporation Reliable Email Delivery (tm) http://www.mailchannels.com
> I have a table with the following simplified form: > > create table t ( > run_id integer, > domain_id integer, > mta_id integer, > attribute1 integer, > attribute2 integer, > unique(run_id, domain_id, mta_id) > ); > > The table has about 1 million rows with run_id=1, another 1 million > rows with run_id=2, and so on. > > I need to efficiently query the differences between "runs" - i.e. For > each (domain_id, mta_id) tuple in run 1, is there a coresponding > tuple in run 2 where either attribute1 or attribute2 have changed? > > The only way I have been able to think of doing this so far is an > o(n^2) search, which even with indexes takes a long time. e.g. > > select * from t t1 where exists (select 1 from t t2 where > t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and (t2.attribute1 > != t1.attribute1 or t2.attribute2 != t1.attribute2) > > This query takes millenia... > first, add a change flag change_tf that is set through a trigger whether this record different from record in the previous run. second, create an index on domain and mta where change_tf, so you're only indexing changed records. this would allow you to find your changes very efficiently at the relatively small cost of adding one lookup and one extra index per insert. ____________________________________________________________________________________ Pinpoint customers who are looking for what you sell. http://searchmarketing.yahoo.com/
Ken Simpson wrote: > I have a table with the following simplified form: > > create table t ( > run_id integer, > domain_id integer, > mta_id integer, > attribute1 integer, > attribute2 integer, > unique(run_id, domain_id, mta_id) > ); > > The table has about 1 million rows with run_id=1, another 1 million rows with run_id=2, and so on. > > I need to efficiently query the differences between "runs" - i.e. For each (domain_id, mta_id) tuple in run 1, is therea coresponding tuple in run 2 where either attribute1 or attribute2 have changed? > > The only way I have been able to think of doing this so far is an o(n^2) search, which even with indexes takes a long time.e.g. > > select * from t t1 where exists (select 1 from t t2 where t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and (t2.attribute1!= t1.attribute1 or t2.attribute2 != t1.attribute2) > > This query takes millenia... > > Any help would be greatly appreciated. I hope I am naively missing some obvious alternative strategy, since this sort ofoperation must be common in databases. A self-join will probably perform better in your case. Consider the following query: select t1.domain_id as domain_id, t1.mta_id as mta_id, t1.run_id as run_id_1, t1.attribute1 as attribute1_1, t1.attribute2as attribute2_1, t2.run_id as run_id_2, t2.attribute1 as attribute1_2, t2.attribute2 as attribute2_2 from t t1 join t t2 on t1.mta_id = t2.mta_id and t1.domain_id = t2.domain_id and (t1.attribute1 != t2.attribute1 or t1.attribute2!= t2.attribute2) You are basically joining together 2 copies of the same table (one aliased to "t1" and the other aliased to "t2"). The logic in the join conditions are the same you had in the subselect. You will probably want to constrain the query some more in the join conditions. As my example above is currently written, you will get 2 records for each true difference (e.g. for a given mta_id/domain_id, you'll get 1 record when run_id X comes from t1 and run_id Y comes from t2 & you get another record describing the same difference when run_id X comes from t2 and run_id Y comes from t1). Some example constraints that might help are: t1.run_id < t2.run_id Assuming run_id increases chronologically, this will return 1 record for each case there is a difference in attributes from any prior run. t1.run_id + 1 = t2.run_id Again assuming run_id increases chronologically, this will only return a record when there are attribute differences between 2 successive runs. Indexes should also be considered for the performance of this query. At a minimum, you will probably want a compound index on (domain_id, mta_id). [... thinks for a bit ...] Actually, you could re-order your unique constraint to take care of this, as a unique constraint is basically an index with a unique constraint on it. Your unique constraint is currently on (run_id, domain_id, mta_id). If you re-order that unique constraint to (domain_id, mta_id, run_id), you will probably see a couple of benefits: - Due to how PG can use the first parts of a compound index, it will be able to use the (domain_id, mta_id) portion as an index to help your joins. - I assume domain_id is much more selective than run_id, so the unique checks should speed up as well. The way you have the unique constraint written, the index is first scanned based on run_id which isn't very selective - once it finds the correct run_id in the index, there will still be a lot of index entries to scan to match on domain_id and mta_id. Re-ordering the way I propose should narrow down the number of index entries quite quickly, as it will first be narrowed on domain_id, then mta_id, and finally run_id. Hope this helps. Andrew
Yes and you could make it even more speedy with the use table partitioning. http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html > select > t1.domain_id as domain_id, > t1.mta_id as mta_id, > t1.run_id as run_id_1, > t1.attribute1 as attribute1_1, > t1.attribute2 as attribute2_1, > t2.run_id as run_id_2, > t2.attribute1 as attribute1_2, > t2.attribute2 as attribute2_2 > from > t t1 > join t t2 on > t1.mta_id = t2.mta_id and > t1.domain_id = t2.domain_id and > (t1.attribute1 != t2.attribute1 or t1.attribute2 != t2.attribute2) -- cu Chris Ist Ihr Browser Vista-kompatibel? Jetzt die neuesten Browser-Versionen downloaden: http://www.gmx.net/de/go/browser
"Ken Simpson" <ksimpson@mailchannels.com> writes: > select * from t t1 where exists (select 1 from t t2 where > t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id and (t2.attribute1 > != t1.attribute1 or t2.attribute2 != t1.attribute2) > This query takes millenia... Yeah, because you're effectively forcing the least efficient style of join --- a nestloop is generally going to suck for a full-table join, even if you've got indexes. Try something like this: select * from t t1 join t t2 on (t2.mta_id=t1.mta_id and t2.domain_id=t1.domain_id) where (t2.attribute1 != t1.attribute1 or t2.attribute2 != t1.attribute2) Make sure you've got work_mem cranked up to something appropriate. regards, tom lane
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Christian Kindler [13/08/07 21:34 +0200]: > Yes and you could make it even more speedy with the use table partitioning. > http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html Thanks for all your speedy help, everyone. I tried doing a "self join" and that sped things up enormously (query took on the order of 30 seconds to compare two million-row table slices, resulting in a 20K row result). I will also try re-ordering the unique constraint to get speedier indexing out of it and will look at table partitioning. Regards, Ken - -- Ken Simpson CEO, MailChannels Fax: +1 604 677 6320 Web: http://mailchannels.com MailChannels - Reliable Email Delivery (tm) -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.6 (GNU/Linux) iD8DBQFGwLQq2YHPr/ypq5QRApP8AKDfRGqDFkcONh0YaojX7362nXP12gCg3WZ6 k5ZBwcMplXyVkEguQtbgdFU= =bsyu -----END PGP SIGNATURE-----