Thread: Search optimisation
I'm new to SQL databases, although I've worked a lifetime on PICK-type
systems, and I'm having trouble getting selects to work well.  I have a
database with a number of tables, and one of the problems I'm having is
this.

This query runs very quickly and returns one row:

    select * from biblioitems where isbn='031051911X';

explanation:

    Index Scan using isbnidx on biblioitems  (cost=203.35 rows=2988 width=102)

This is fast:

    select * from biblioitems,items,biblio
    where biblioitems.biblionumber = '109'
      and biblioitems.biblionumber = items.biblionumber
      and biblio.biblionumber = biblioitems.biblionumber;

explanation:

    Nested Loop  (cost=54.91 rows=657 width=340)
      ->  Nested Loop  (cost=4.08 rows=2 width=167)
            ->  Index Scan using bibitbnoidx on biblioitems  (cost=2.03 rows=1 width=102)
            ->  Index Scan using bibnumidx on biblio  (cost=2.05 rows=59945 width=65)
      ->  Index Scan using bibnumitem on items  (cost=25.42 rows=73185 width=173)

This one is slow:

    select * from biblioitems,items,biblio
    where isbn='031051911X'
      and biblioitems.biblionumber = items.biblionumber
      and biblio.biblionumber = biblioitems.biblionumber;

explanation:

    Hash Join  (cost=11830.17 rows=981357 width=340)
      ->  Seq Scan on items  (cost=4158.10 rows=73185 width=173)
      ->  Hash  (cost=5091.33 rows=2989 width=167)
            ->  Hash Join  (cost=5091.33 rows=2989 width=167)
                  ->  Seq Scan on biblio  (cost=2767.19 rows=59945 width=65)
                  ->  Hash  (cost=203.35 rows=2988 width=102)
                        ->  Index Scan using isbnidx on biblioitems  (cost=203.35 rows=2988 width=102)

How can I make this query use the indexes?
Olwen Williams <olwen@ihug.co.nz> writes:
> [ why is query A so much faster than query B? ]

Hmm.  The system is correctly estimating that query A is much faster
than B:

> Nested Loop  (cost=54.91 rows=657 width=340)
                ^^^^^^^^^^
vs

> Hash Join  (cost=11830.17 rows=981357 width=340)
              ^^^^^^^^^^^^^

so the question is why it thought this hashjoin plan was the best
available option for query B.

It looks to me like the problem has to do with the estimated
selectivities of the two restrictions on the biblioitems table.
Query A recognizes that biblionumber='109' selects just one item from
biblioitems:

> ->  Index Scan using bibitbnoidx on biblioitems  (cost=2.03 rows=1 width=102)
                                                               ^^^^^^

so it produces a good plan, even though a doubly nested loop would be an
extremely horrible plan if there were a lot of rows involved.
(Actually, this is the first complaint about bad plans I've ever seen in
which doubly-nested-loop was the *right* choice ;-) ...)

Meanwhile, query B thinks that isbn='031051911X' is going to accept
several thousand rows from biblioitems:

> ->  Index Scan using isbnidx on biblioitems  (cost=203.35 rows=2988 width=102)
                                                             ^^^^^^^^^

and because of that, it's choosing a plan that would be well adapted for
a lot of matching rows, yet is overkill (read: inefficient) for a few rows.

Now, I'm going to read between the lines and guess that the restriction
on isbn should yield only one biblioitems row; if this guess is wrong,
you can ignore all that follows.  If my guess is correct, then the blame
lies with misestimation of the selectivity of the isbn='XXX' clause.
You can double-check this by doing

    explain select * from biblioitems where isbn='031051911X';

and seeing whether the estimated row count is close to what you actually
get from doing this query.

The first question has to be "have you done a VACUUM ANALYZE lately?".
If not, the planner is working in the dark about the distribution of
isbn values, and you can't fairly blame it for guessing conservatively
that it's going to have to deal with a lot of rows.  However, if you
have done a VACUUM ANALYZE and still get a ridiculous estimate, then
that's a bug that I'd like to try to fix.  Please contact me off-list
and we can pursue the details of why this is happening.

			regards, tom lane
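As a concrete sketch of the advice above (the table, column, and index
names are taken from the posts in this thread; treat it as illustrative
rather than output from a real run):

    -- Refresh planner statistics so the isbn selectivity estimate is realistic
    VACUUM ANALYZE biblioitems;

    -- Then re-check the estimated row count for the restriction in question
    EXPLAIN SELECT * FROM biblioitems WHERE isbn = '031051911X';

If the estimate drops from roughly 2988 rows to something close to 1,
the planner is much more likely to choose the index-driven nested-loop
plan for the three-way join as well.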
Hey guys, I have a few queries here that I would greatly appreciate any
pointers on.  I need to get them as fast as possible, but I would settle
for ANY speed improvement over what they are now.

    select * from applicants as a
    where a.status = 'A'
      and a.app_id in (select b.app_id from resume_search as b
                        where a.app_id = b.app_id and b.user_id = 291)

resume_search is a table populated by a query before this one.  The
schema is as follows:

    | user_id  | int4 | 4 |
    | app_id   | int4 | 4 |
    Indices: resume_app_id
             resume_search_id

So, in its current form, with only a few (say 5) rows in resume_search,
it's still taking quite some time.

In addition to this, I have another query like this one, only it inserts
into another table also:

    insert into users_download_app (user_id, app_id)
    select app_id from applicants
    where status = 'A'
      and app_id in (select b.app_id from resume_search as b
                      where app_id = b.app_id and b.user_id = 291)
    limit 200

This one is so slow that it's unusable.  (We're talking minutes here on
my Celeron 333 devel server.)  I wish there was another way to get what
I need done, but the sub-selects are the only way I can see doing it
now.  I have a feeling, though, that there are some extra things I might
be able to do to get the same results from the same tables, faster.

Thanks to one and all.

-Mitch

"Mitch Vincent" <mitch@venux.net> writes:
> select * from applicants as a where a.status = 'A' and a.app_id in(select
> b.app_id from resume_search as b where a.app_id=b.app_id and b.user_id=291)

WHERE ... IN (sub-select) is a very inefficient substitute for a plain
join query.  Try something like

    select a.* from applicants as a, resume_search as b
    where a.status = 'A' and a.app_id = b.app_id and b.user_id = 291;

Providing indexes on one or both app_id fields might help.  Also, don't
forget to do a VACUUM every so often to make sure the planner has useful
statistics about the sizes of the tables.

There has been some talk of automatically rewriting queries to eliminate
unnecessary sub-selects, but I don't foresee it getting done for a while
yet.

> insert into users_download_app (user_id,app_id) select app_id from
> applicants where status = 'A' and app_id in(select b.app_id from
> resume_search as b where app_id=b.app_id and b.user_id=291) limit 200

BTW, there's a bug in current sources: LIMIT is ignored by INSERT ...
SELECT.  I think it works OK to do SELECT INTO ... LIMIT, however, and
then you could insert into the final destination table from the INTO
temp table.

			regards, tom lane
Does avg() work on numeric?  It should, and it should be the more
accurate choice.  This is on PG 6.5.1 -- please see below (the errors
are on the last lines).

##########################################

create table salesreps (
    empl_num   integer not null,
    name       varchar(15) not null,
    age        integer,
    rep_office integer,
    title      varchar(10),
    hiredate   date not null,
    manager    integer,
    quota      numeric(12,2),   -- money
    sales      numeric(12,2)    -- money
);

test=> select * from salesreps;
empl_num|name         |age|rep_office|title    |  hiredate|manager|    quota|    sales
--------+-------------+---+----------+---------+----------+-------+---------+---------
     105|Bill Adams   | 37|        13|Sales Rep|02-12-1988|    104|350000.00|367911.00
     109|Mary Jones   | 31|        11|Sales Rep|10-12-1989|    106|300000.00|392725.00
     102|Sue Smith    | 48|        21|Sales Rep|12-10-1986|    108|350000.00|474050.00
     106|Sam Clark    | 52|        11|VP Sales |06-14-1988|       |275000.00|299912.00
     104|Bob Smith    | 33|        12|Sales Mgr|05-19-1987|    106|200000.00|142594.00
     101|Dan Roberts  | 45|        12|Sales Rep|10-20-1986|    104|300000.00|305673.00
     110|Tom Snyder   | 41|          |Sales Rep|01-13-1990|    101|         | 75985.00
     108|Larry Fitch  | 62|        21|Sales Mgr|10-12-1989|    106|350000.00|361865.00
     103|Paul Cruz    | 29|        12|Sales Rep|03-01-1987|    104|275000.00|286775.00
     107|Nancy Angelli| 49|        22|Sales Rep|11-14-1988|    108|300000.00|186042.00
(10 rows)

test=> select sum(quota) from salesreps;
       sum
----------
2700000.00
(1 row)

test=> select count(quota) from salesreps;
count
-----
    9
(1 row)

test=> select sum(quota)/count(quota) from salesreps;
         ?column?
-----------------
300000.0000000000
(1 row)

test=> select avg(quota) from salesreps;
ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723

test=> select avg(sales) from salesreps;
ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723
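Until avg() on numeric is fixed, a possible interim workaround is to
compute the average by hand from sum() and count(), as shown above, or
to cast to float8 and use the float aggregate (a sketch only; the cast
trades the exactness of numeric for a working avg(), and assumes the
numeric-to-float8 cast is available in your 6.5.x build):

    -- Exact average, avoiding the broken numeric avg()
    select sum(quota) / count(quota) as avg_quota from salesreps;

    -- Approximate average via float8
    select avg(quota::float8) as avg_quota from salesreps;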
> "Mitch Vincent" <mitch@venux.net> writes:
> > select * from applicants as a where a.status = 'A' and a.app_id in(select
> > b.app_id from resume_search as b where a.app_id=b.app_id and b.user_id=291)
>
> WHERE ... IN (sub-select) is a very inefficient substitute for a plain join
> query.  Try something like
>
> select a.* from applicants as a, resume_search as b
> where a.status = 'A' and a.app_id = b.app_id and b.user_id = 291;
>
> Providing indexes on one or both app_id fields might help.  Also, don't
> forget to do a VACUUM every so often to make sure the planner has useful
> statistics about the sizes of the tables.
>
> There has been some talk of automatically rewriting queries to eliminate
> unnecessary sub-selects, but I don't foresee it getting done for a
> while yet.

Tom, you mentioned that subselects use nested join, but they could be
hardcoded to use hash join.  My opinion is that this should be done if
it is easy.  I know the concern was that a hash could overflow if the
subquery is too large, but if the subquery is too large, nested join
will take forever, so it really doesn't matter if it completes or not.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

<kaiq@realtyideas.com> writes:
> test=> select avg(quota) from salesreps;
> ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723

> test=> select avg(sales) from salesreps;
> ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723

Hmm ... there were some similar bugs reported a couple of months ago,
and I thought Jan would've fixed them by now.  Jan, if you don't have
time to look at this, let me know and I'll see if I can do anything
with it.

			regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> There has been some talk of automatically rewriting queries to eliminate
>> unnecessary sub-selects, but I don't foresee it getting done for a
>> while yet.

> Tom, you mentioned that subselects use nested join, but they could be
> hardcoded to use hash join.  My opinion is that this should be done if
> it is easy.

That would probably help for uncorrelated subselects (where no variable
from the outer query is referenced in the inner one).  For correlated
subselects, such as this one is, hashing the subselect result isn't
going to help much --- the real problem is that the subselect is
repeated afresh for each outer tuple.

			regards, tom lane
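To make the correlated/uncorrelated distinction concrete, here is a
sketch using the tables from Mitch's query (names assumed from the
earlier posts).  The first form references the outer table's a.app_id
inside the sub-select and so must be re-run for every outer row; the
second form does not, so its result could in principle be computed once
and hashed:

    -- Correlated: inner query depends on a.app_id from the outer query
    SELECT * FROM applicants a
     WHERE a.app_id IN (SELECT b.app_id FROM resume_search b
                         WHERE b.app_id = a.app_id AND b.user_id = 291);

    -- Uncorrelated: inner query is independent of the outer row
    SELECT * FROM applicants a
     WHERE a.app_id IN (SELECT b.app_id FROM resume_search b
                         WHERE b.user_id = 291);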
> <kaiq@realtyideas.com> writes:
> > test=> select avg(quota) from salesreps;
> > ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723
>
> > test=> select avg(sales) from salesreps;
> > ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723
>
> Hmm ... there were some similar bugs reported a couple of months ago,
> and I thought Jan would've fixed them by now.  Jan, if you don't have
> time to look at this, let me know and I'll see if I can do anything
> with it.

Looks to me like there's something wrong with the adjustment of the
actual computation and/or display precision during longer-running
aggregates.  I know where to look -- tnx anyway.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> There has been some talk of automatically rewriting queries to eliminate
> >> unnecessary sub-selects, but I don't foresee it getting done for a
> >> while yet.
>
> > Tom, you mentioned that subselects use nested join, but they could be
> > hardcoded to use hash join.  My opinion is that this should be done if
> > it is easy.
>
> That would probably help for uncorrelated subselects (where no variable
> from the outer query is referenced in the inner one).  For correlated
> subselects, such as this one is, hashing the subselect result isn't
> going to help much --- the real problem is that the subselect is
> repeated afresh for each outer tuple.

Oh, I was thinking in general of our EXISTS() workaround and if we could
help that by forcing hash joins.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
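For reference, the EXISTS() form of the earlier IN query would look
something like this (a sketch; the table and column names are assumed
from Mitch's posts):

    SELECT a.* FROM applicants a
     WHERE a.status = 'A'
       AND EXISTS (SELECT 1 FROM resume_search b
                    WHERE b.app_id = a.app_id AND b.user_id = 291);

Note that this is still a correlated sub-select, which is why hashing
the sub-select result would not buy much here.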
wieck@debis.com (Jan Wieck) writes:
> Looks to me like there's something wrong with the adjustment of the
> actual computation and/or display precision during longer-running
> aggregates.  I know where to look -- tnx anyway.

It doesn't necessarily take long; Zot O'Connor posted this example in
late October:

    create table example(other decimal(4,4));
    CREATE
    insert into example (other) values(3.9);
    ERROR:  overflow on numeric ABS(value) >= 10^0 for field with precision 4 scale 4
    insert into example (other) values(0.0);
    ERROR:  overflow on numeric ABS(value) >= 10^0 for field with precision 4 scale 4

which still fails in current sources.

			regards, tom lane
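For what it's worth, the declared precision and scale do matter in this
example: numeric(4,4) means four digits in total with all four after the
decimal point, so the largest representable value is 0.9999.  Rejecting
3.9 is therefore arguably correct; rejecting 0.0 is the actual bug.  A
column with one more digit of precision ought to accept both values once
the numeric input code behaves (a sketch of expected, not observed,
behavior):

    create table example2 (other decimal(5,4));   -- room for one digit before the decimal point
    insert into example2 (other) values (3.9);    -- should be stored as 3.9000
    insert into example2 (other) values (0.0);    -- should be stored as 0.0000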