Thread: best way to query
Hello List, this is a newbie question: I have had to take over an existing
system. It was supposed to have constraints that prevented dangling
references, but...

Anyway, I have two tables, A and B. Each table has a key field, and if a
row is in B it should have a corresponding row in A. There's the problem:
it doesn't, for all the rows in B.

So I want to do something like:

delete from B where key not in (select key from A order by key);

The problem is there are about 1,000,000 rows in A and 300,000 rows in B.
I let the above run all night and it was still running the next morning.
Does anyone have an idea of a better way?

B = t_event_ack_log
A = t_unit_event_log

explain shows:

 Aggregate  (cost=4712921585.30..4712921585.31 rows=1 width=0)
   ->  Seq Scan on t_event_ack_log a  (cost=103170.29..4712920878.60 rows=282677 width=0)
         Filter: (NOT (subplan))
         SubPlan
           ->  Materialize  (cost=103170.29..117301.92 rows=1016163 width=4)
                 ->  Index Scan using pk_tuel_eln on t_unit_event_log  (cost=0.00..98184.12 rows=1016163 width=4)

BTW: how do I interpret the cost? The manual says: "planner's guess at how
long it will take to run the statement (measured in units of disk page
fetches)". Not sure I understand "measured in units of disk page fetches".

Thanks,
Steve
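On the cost units: the numbers EXPLAIN prints are not predictions of
wall-clock time. They are relative estimates on a scale where 1.0 is the
cost of fetching one page sequentially from disk; all other operations
(random page fetches, per-row CPU work) are expressed as multiples of that
baseline. The first number is the estimated cost to return the first row,
the second the cost to return all rows. A sketch of how to read the plan
above:

```sql
-- The two cost figures are (startup cost)..(total cost), both in
-- "sequential page fetch" units, so a total cost of ~4.7 billion means
-- the planner expects work equivalent to ~4.7 billion sequential page
-- reads - which is why the query ran all night.
EXPLAIN SELECT count(*) FROM t_event_ack_log;
```

Comparing total costs between candidate plans is meaningful; converting
them to seconds is not.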
Steve Clark <sclark@netwolves.com> writes:
> explain shows:
>  Aggregate  (cost=4712921585.30..4712921585.31 rows=1 width=0)
>    ->  Seq Scan on t_event_ack_log a  (cost=103170.29..4712920878.60 rows=282677 width=0)
>          Filter: (NOT (subplan))
>          SubPlan
>            ->  Materialize  (cost=103170.29..117301.92 rows=1016163 width=4)
>                  ->  Index Scan using pk_tuel_eln on t_unit_event_log  (cost=0.00..98184.12 rows=1016163 width=4)

Yeah, that's going to suck.  A brute force solution is to see if you
can get it to switch to a "hashed subplan" by increasing work_mem.

Also, whatever is the ORDER BY for?

			regards, tom lane
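Tom's suggestion works because the planner will hash the subquery's result
and probe it once per outer row, instead of rescanning it, but only when it
estimates the hash table fits in work_mem. A sketch of how to try it (the
256MB figure is an illustration, not a recommendation; size it to hold a
hash of roughly a million integer keys):

```sql
-- Session-local change only; does not affect other connections.
SET work_mem = '256MB';

EXPLAIN SELECT count(*) FROM t_event_ack_log
 WHERE event_log_no NOT IN (SELECT event_log_no FROM t_unit_event_log);
-- If the change took effect, the Filter line should mention a
-- "hashed subplan" rather than a plain subplan.
```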
Tom Lane wrote:
> Yeah, that's going to suck.  A brute force solution is to see if you
> can get it to switch to a "hashed subplan" by increasing work_mem.
>
> Also, whatever is the ORDER BY for?

Without the order by it wants to do a seq scan of t_unit_event_log.
See below:

explain select count(*) from t_event_ack_log where event_log_no not in
(select event_log_no from t_unit_event_log);
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 Aggregate  (cost=12144872193.82..12144872193.82 rows=1 width=0)
   ->  Seq Scan on t_event_ack_log  (cost=0.00..12144871485.07 rows=283497 width=0)
         Filter: (NOT (subplan))
         SubPlan
           ->  Seq Scan on t_unit_event_log  (cost=0.00..40286.56 rows=1021156 width=4)
(5 rows)

Will try increasing work_mem.
Steve Clark wrote:
> Anyway, I have two tables, A and B. Each table has a key field, and if
> a row is in B it should have a corresponding row in A. There's the
> problem: it doesn't, for all the rows in B.
>
> So I want to do something like:
> delete from B where key not in (select key from A order by key);
>
> The problem is there are about 1,000,000 rows in A and 300,000 rows in
> B. I let the above run all night and it was still running the next
> morning. Does anyone have an idea of a better way?

An outer join is sometimes spectacularly more efficient for this
particular kind of query.

I'd suggest you try:

delete from B where key in
  (select B.key from B left outer join A on A.key=B.key
   where A.key is null);

-- 
 Daniel
 PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
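The left-join-where-null pattern Daniel suggests (an "anti join") finds
exactly the B rows with no partner in A. A minimal, self-contained sketch
of the technique using SQLite in place of Postgres; the toy tables `a` and
`b` and their contents are invented for the example:

```python
import sqlite3

# "a" stands in for t_unit_event_log (parent), "b" for t_event_ack_log
# (child); b rows 4 and 5 are dangling references with no parent in a.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key INTEGER PRIMARY KEY);
    CREATE TABLE b (key INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (2), (4), (5);
""")

# The anti join: left-join b to a and keep only the rows where the join
# found nothing (a.key IS NULL marks the unmatched b rows).
orphans = conn.execute("""
    SELECT b.key FROM b LEFT OUTER JOIN a ON a.key = b.key
    WHERE a.key IS NULL ORDER BY b.key
""").fetchall()
print(orphans)  # -> [(4,), (5,)]

# Deleting those keys leaves only b rows with a matching parent in a.
conn.execute("""
    DELETE FROM b WHERE key IN (
        SELECT b.key FROM b LEFT OUTER JOIN a ON a.key = b.key
        WHERE a.key IS NULL)
""")
print(conn.execute("SELECT key FROM b ORDER BY key").fetchall())  # -> [(1,), (2,)]
```

The join form lets the planner build one hash table over A and probe it
once per B row, instead of re-evaluating a subplan for every row as the
NOT IN version did in the thread.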
Steve Clark <sclark@netwolves.com> writes:
> Tom Lane wrote:
>> Also, whatever is the ORDER BY for?

> Without the order by it wants to do a seq scan of t_unit_event_log.
> See below:
>
> explain select count(*) from t_event_ack_log where event_log_no not in
> (select event_log_no from t_unit_event_log);
>  Aggregate  (cost=12144872193.82..12144872193.82 rows=1 width=0)
>    ->  Seq Scan on t_event_ack_log  (cost=0.00..12144871485.07 rows=283497 width=0)
>          Filter: (NOT (subplan))
>          SubPlan
>            ->  Seq Scan on t_unit_event_log  (cost=0.00..40286.56 rows=1021156 width=4)
> (5 rows)

Hmm, the big problem with that isn't the seqscan but the lack of a
Materialize step to buffer it; which says to me that you're running a
pretty old Postgres version (8.0 or older).  You should think about an
update if you're running into performance issues.

			regards, tom lane
Daniel Verite wrote:
> An outer join is sometimes spectacularly more efficient for this
> particular kind of query.
>
> I'd suggest you try:
>
> delete from B where key in
>   (select B.key from B left outer join A on A.key=B.key
>    where A.key is null);

WOW! This runs in about 10 seconds - thanks, Daniel.

explain select count(*) from t_event_ack_log where event_log_no in
(select t_event_ack_log.event_log_no from t_event_ack_log
 left outer join t_unit_event_log a
 on a.event_log_no=t_event_ack_log.event_log_no
 where a.event_log_no is null);
                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=128349.56..128349.57 rows=1 width=0)
   ->  Hash Join  (cost=94512.91..126935.36 rows=565681 width=0)
         Hash Cond: (public.t_event_ack_log.event_log_no = public.t_event_ack_log.event_log_no)
         ->  Seq Scan on t_event_ack_log  (cost=0.00..14759.85 rows=565685 width=4)
         ->  Hash  (cost=92609.85..92609.85 rows=152245 width=4)
               ->  HashAggregate  (cost=91087.40..92609.85 rows=152245 width=4)
                     ->  Hash Left Join  (cost=57337.95..90380.29 rows=282842 width=4)
                           Hash Cond: (public.t_event_ack_log.event_log_no = a.event_log_no)
                           Filter: (a.event_log_no IS NULL)
                           ->  Seq Scan on t_event_ack_log  (cost=0.00..14759.85 rows=565685 width=4)
                           ->  Hash  (cost=40696.09..40696.09 rows=1014309 width=4)
                                 ->  Seq Scan on t_unit_event_log a  (cost=0.00..40696.09 rows=1014309 width=4)
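A related formulation not raised in the thread is NOT EXISTS, which
expresses the same "no matching parent" condition and, unlike NOT IN, is
not silently emptied out by a NULL key in the subquery. A sketch against
the thread's tables:

```sql
-- Equivalent intent to Daniel's query; NULL-safe, unlike NOT IN.
DELETE FROM t_event_ack_log b
 WHERE NOT EXISTS (SELECT 1 FROM t_unit_event_log a
                    WHERE a.event_log_no = b.event_log_no);
```

One caveat: on an 8.2-era server like the one in this thread, NOT EXISTS
is likely to run as a correlated subplan (one index probe per row of
t_event_ack_log), whereas later releases plan it as a hash anti join
comparable to the outer-join rewrite.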
Tom Lane wrote:
> Hmm, the big problem with that isn't the seqscan but the lack of a
> Materialize step to buffer it; which says to me that you're running a
> pretty old Postgres version (8.0 or older).  You should think about an
> update if you're running into performance issues.

Hi Tom,

Actually, "show server_version" says:

 server_version
----------------
 8.2.5

on FreeBSD 6.2. And Daniel gave me a query that does the job in just a
few seconds.

Thanks,
Steve