Thread: Slow - grindingly slow - query
Hi I have a single table with two views. The table effectively contains both master and detail info (legacy stuff I'm afraid). The query in question is used to see if any records exist in the detail that do not exist in the master. The table and index definition is as follows create table accounts ( domain text, registrationtype char /* Plus a couple of other irrelevant fields */ ); create index domain_idx on accounts (domain); create index domain_type_idx on accounts (domain, registrationtype); The views are create view accountmaster as SELECT * from accounts where registrationtype = 'N'; create view accountdetail as SELECT * from accounts where registrationtype <> 'N'; The query is select accountdetail.domain from accountdetail where accountdetail.domain not in (select accountmaster.domain fromaccountmaster); I started the query about 5 hours ago and it is still running. I did the same on Informix Online 7 and it took less than two minutes... My system details are postgres: 6.5.3 O/S: RH6.0 Kernel 2.2.5-15smp Explain shows the following explain select accountdetail.domain from accountdetail where accountdetail.domain not in (select accountmaster.domainfrom accountmaster) limit 10; NOTICE: QUERY PLAN: Seq Scan on accounts (cost=3667.89 rows=34958 width=12) SubPlan -> Seq Scan on accounts (cost=3667.89 rows=33373width=12) EXPLAIN The number of records in the two views are psql -c "select count(*) from accountmaster" coza; count ----- 45527 (1 row) psql -c "select count(*) from accountdetail" coza; count ----- 22803 I know of exactly one record (I put it there myself) that satisfies the selection criteria. Any ideas would be appreciated -------- Regards Theo PS We have it running live at http://co.za (commercial domains in South Africa).
What does: explain select domain from accountdetail where domain not in ( select domain from accountmaster); show? Also, did you do a 'vacuum analyze' on the tables? Also, how about if you get rid of the views SELECT domain FROM account WHERE registrationtype <> 'N'; *shakes head* am I missing something here? I'm reading your SELECT and 'CREATE VIEW's and don't they negate each other? *scratch head* If I'm reading your select properly, and with the amount of sleep I've had recently, its possible I'm not... The subselect is saying give me all domains whose registration type = 'N'. The select itself is saying give me all domains whoe registration type <> 'N' (select accountdetail.domain from accountdetail), and narrow that listing down further to only include those domains whose registration type <> 'N'? Either I'm reading this *totally* wrong, or you satisfy that condition ujust by doing a 'SELECT domain FROM accountdetail;' ... No? On Thu, 11 Nov 1999, Theo Kramer wrote: > Hi > > I have a single table with two views. The table effectively contains both > master and detail info (legacy stuff I'm afraid). The query in question is > used to see if any records exist in the detail that do not exist in the > master. The table and index definition is as follows > > create table accounts ( > domain text, > registrationtype char > /* Plus a couple of other irrelevant fields */ > ); > > create index domain_idx on accounts (domain); > create index domain_type_idx on accounts (domain, registrationtype); > > The views are > > create view accountmaster as SELECT * from accounts where registrationtype = > 'N'; > create view accountdetail as SELECT * from accounts where registrationtype <> > 'N'; > > The query is > > select accountdetail.domain from accountdetail where > accountdetail.domain not in > (select accountmaster.domain from accountmaster); > > I started the query about 5 hours ago and it is still running. I did the same > on Informix Online 7 and it took less than two minutes... > > My system details are > postgres: 6.5.3 > O/S: RH6.0 Kernel 2.2.5-15smp > > Explain shows the following > > explain select accountdetail.domain from accountdetail where > accountdetail.domain not in > (select accountmaster.domain from accountmaster) limit 10; > NOTICE: QUERY PLAN: > > Seq Scan on accounts (cost=3667.89 rows=34958 width=12) > SubPlan > -> Seq Scan on accounts (cost=3667.89 rows=33373 width=12) > > EXPLAIN > > The number of records in the two views are > > psql -c "select count(*) from accountmaster" coza; > count > ----- > 45527 > (1 row) > > psql -c "select count(*) from accountdetail" coza; > count > ----- > 22803 > > I know of exactly one record (I put it there myself) that satisfies the > selection criteria. > > Any ideas would be appreciated > > -------- > Regards > Theo > > PS We have it running live at http://co.za (commercial domains in South Africa). > > ************ > Marc G. Fournier ICQ#7615664 IRC Nick: Scrappy Systems Administrator @ hub.org primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Theo Kramer wrote: > > Hi > > I have a single table with two views. The table effectively contains both > master and detail info (legacy stuff I'm afraid). The query in question is > used to see if any records exist in the detail that do not exist in the > master. The table and index definition is as follows > > create table accounts ( > domain text, > registrationtype char > /* Plus a couple of other irrelevant fields */ > ); > > create index domain_idx on accounts (domain); > create index domain_type_idx on accounts (domain, registrationtype); try using create index registrationtype_index on accounts (registrationtype); ------ Hannu
The Hermit Hacker wrote: > > What does: > > explain select domain from accountdetail > where domain not in ( > select domain from accountmaster); > > show? NOTICE: QUERY PLAN: Seq Scan on accounts (cost=3667.89 rows=34958 width=12) SubPlan -> Seq Scan on accounts (cost=3667.89 rows=33373 width=12) EXPLAIN > Also, did you do a 'vacuum analyze' on the tables? Yes - should have mentioned that. > Also, how about if you get rid of the views > > SELECT domain FROM account > WHERE registrationtype <> 'N'; > > *shakes head* am I missing something here? I'm reading your SELECT and > 'CREATE VIEW's and don't they negate each other? *scratch head* No - a domain can both be new (registrationtype 'N') and updated (registrationtype 'U') ie. one or more rows with the same domain with one row containing a domain with registrationtype 'N' and zero or more rows containing the same domain with registrationtype not 'N'. The reason for the <> 'N' and not just = 'U' is that we have a couple of rows with registrationtype set to something else. > The subselect is saying give me all domains whose registration type = 'N'. > The select itself is saying give me all domains whoe registration type <> > 'N' (select accountdetail.domain from accountdetail), and narrow that > listing down further to only include those domains whose registration type > <> 'N'? > > Either I'm reading this *totally* wrong, or you satisfy that condition > ujust by doing a 'SELECT domain FROM accountdetail;' ... > > No? No :). See above -------- Regards Theo
Hannu Krosing wrote: > try using > create index registrationtype_index on accounts (registrationtype); OK did that, and am rerunning the query. The explain now shows explain select accountdetail.domain from accountdetail where accountdetail.domain not in (selectaccountmaster.domain from accountmaster); NOTICE: QUERY PLAN: Seq Scan on accounts (cost=3667.89 rows=34958 width=12) SubPlan -> Index Scan using registrationtype_idx on accounts (cost=2444.62 rows=33373 width=12) EXPLAIN Will let you all know when it completes. -------- Regards Theo
Theo Kramer <theo@flame.co.za> writes: > The query is > select accountdetail.domain from accountdetail where > accountdetail.domain not in > (select accountmaster.domain from accountmaster); Try something like select accountdetail.domain from accountdetail where not exists (select accountmaster.domain from accountmaster where accountmaster.domain = accountdetail.domain); I believe this is in the FAQ... regards, tom lane
Tom Lane wrote: > > Theo Kramer <theo@flame.co.za> writes: > > The query is > > > select accountdetail.domain from accountdetail where > > accountdetail.domain not in > > (select accountmaster.domain from accountmaster); This takes more than 5 hours and 30 minutes. > Try something like > > select accountdetail.domain from accountdetail where > not exists (select accountmaster.domain from accountmaster where > accountmaster.domain = accountdetail.domain); This takes 5 seconds - wow! > I believe this is in the FAQ... Will check out the FAQs. Many thanks. -------- Regards Theo
Theo Kramer wrote: > > > Try something like > > > > select accountdetail.domain from accountdetail where > > not exists (select accountmaster.domain from accountmaster where > > accountmaster.domain = accountdetail.domain); > > This takes 5 seconds - wow! > I did the same on Informix Online 7 and it took less than two minutes... ^^^^^^^^^^^ Could you run the query above in Informix? How long would it take to complete? Vadim
> > > select accountdetail.domain from accountdetail where > > > accountdetail.domain not in > > > (select accountmaster.domain from accountmaster); > > This takes more than 5 hours and 30 minutes. > > > select accountdetail.domain from accountdetail where > > not exists (select accountmaster.domain from accountmaster where > > accountmaster.domain = accountdetail.domain); > > This takes 5 seconds - wow! > I have a general comment/question here. Why do in/not in clauses seem to perform so slowly? I've noticed this type of behavior with with my system also. I think the above queries will always return the exact same results regardless of the data. From looking at the query plan with explain, it's clear the second query makes better use of the indexes. Can't the rewrite engine recognize a simple case like the one above and rewrite it to use exists and not exists with the proper joins? Or possibly the optimizer can generate a better plan? Sometimes it's not so easy to just change a query in the code. Sometimes you can't change the code because you only have executables and sometimes you are using a tool that automatically generates SQL using in clauses. Additionally, since intersect and union get rewritten as in clauses they suffer the same performance problems. -brian -- The world's most ambitious and comprehensive PC game database project. http://www.mobygames.com
Brian Hirt <bhirt@mobygames.com> writes: > Can't the rewrite engine recognize a simple case like the > one above and rewrite it to use exists and not exists with the proper > joins? Or possibly the optimizer can generate a better plan? This is on the TODO list, and will get done someday. IMHO it's not as urgent as a lot of the planner/optimizer's other shortcomings, because it can usually be worked around by revising the query. If it's bugging you enough to go fix it now, contributions are always welcome ;-) regards, tom lane
On Fri, Nov 12, 1999 at 09:58:14AM -0500, Tom Lane wrote: > Brian Hirt <bhirt@mobygames.com> writes: > > Can't the rewrite engine recognize a simple case like the > > one above and rewrite it to use exists and not exists with the proper > > joins? Or possibly the optimizer can generate a better plan? > > This is on the TODO list, and will get done someday. IMHO it's not as > urgent as a lot of the planner/optimizer's other shortcomings, because > it can usually be worked around by revising the query. > > If it's bugging you enough to go fix it now, contributions are always > welcome ;-) > Okay, what would be the correct approach to solving the problem, and where would be a good place to start? I'v only been on this list for a few weeks, so I'm missed discussion on the approach to solving this problem. Should this change be localized to just the planner? Should the rewrite system be creating a different query tree? Will both need to be changed? If a lot of work is being done to this part of the system, is now a bad time to try this work? I'm willing to jump in to this, but I may take a while to figure it out and ask a lot of questions that are obvious to the hardened postgres programmer. I'm not famaliar with the postgres code, yet. -brian -- The world's most ambitious and comprehensive PC game database project. http://www.mobygames.com
Brian Hirt <bhirt@mobygames.com> writes: > On Fri, Nov 12, 1999 at 09:58:14AM -0500, Tom Lane wrote: >> If it's bugging you enough to go fix it now, contributions are always >> welcome ;-) > Okay, what would be the correct approach to solving the problem, > and where would be a good place to start? I'v only been on this list > for a few weeks, so I'm missed discussion on the approach to solving > this problem. Should this change be localized to just the planner? > Should the rewrite system be creating a different query tree? Will both > need to be changed? If a lot of work is being done to this part of > the system, is now a bad time to try this work? Well, actually, figuring out how & where to do it is the trickiest part of the work. Might not be the best project for a newbie backend-hacker to start with :-(. After a few moments' thought, it seems to me that this issue might be closely intertwined with the OUTER JOIN stuff that Thomas is working on and the querytree representation redesign that Jan and I have been muttering about (but not yet actually doing anything about). We want to handle SELECT ... WHERE expr IN (SELECT ...) like a join, but the semantics aren't exactly the same as a conventional join, so it might be that the thing needs to be rewritten as a special join type. In that case it'd fit right in with OUTER JOIN, I suspect. The Informix EXPLAIN results that Theo Kramer posted (a few messages back in this thread) are pretty interesting too. If I'm reading that printout right, Informix is not any smarter than we are about choosing the scan types for the outer and inner queries; and yet they have a much faster runtime for the WHERE IN query. I speculate that they are doing the physical matching of outer and inner tuples in a smarter way than we are --- perhaps they are doing one scan of the inner query and entering all the values into a hashtable that's then probed for each outer tuple. (As opposed to rescanning the inner query for each outer tuple, as we currently do.) If that's the answer, then it could probably be implemented as a localized change: rewrite the SubPlan node executor to look more like the HashJoin node executor. This isn't perfect --- it wouldn't pick up the possibility of a merge-style join --- but it would be better than what we have for a lot less work than the "full" solution. This is all shooting from the hip; I haven't spent time looking into it. Has anyone else got insights to offer? regards, tom lane
Tom Lane wrote: > The Informix EXPLAIN results that Theo Kramer posted (a few messages > back in this thread) are pretty interesting too. If I'm reading that > printout right, Informix is not any smarter than we are about choosing > the scan types for the outer and inner queries; and yet they have a much > faster runtime for the WHERE IN query. The informix EXPLAIN for the 'not in' query was when I did not have an index on registrationtype (the explain appends to file sqexplain.out so I missed it :(). Anyway here is the Informix EXPLAIN with the index on registrationtype. QUERY: ------ select accountdetail.domain from accountdetail where accountdetail.domain not in (select accountmaster.domain from accountmaster) Estimated Cost: 4510 Estimated # of Rows Returned: 58810 1) informix.accounts: SEQUENTIAL SCAN Filters: (informix.accounts.domain != ALL <subquery> AND informix.accounts.registrationtype != 'N' ) Subquery: --------- Estimated Cost: 12 Estimated # of Rows Returned: 10 1) informix.accounts: INDEX PATH (1) Index Keys: registrationtype Lower Index Filter: informix.accounts.registrationtype = 'N' The speed difference with or without the subquery index is neglible for Informix. -------- Regards Theo
hi... is anyone working on replication services in pgsql? -- Aaron J. Seigo Sys Admin