Thread: Search optimisation

Search optimisation

From
Olwen Williams
Date:
I'm new to SQL databases, although I've worked a lifetime on PICK-type
systems.  I'm having trouble getting selects to work well.

I have a database with a number of tables, and I'm running into several
problems. One of them is this:

This query runs very quickly and returns one row:
select * from biblioitems where isbn='031051911X';
explanation:
Index Scan using isbnidx on biblioitems  (cost=203.35 rows=2988 width=102)

This is fast:
select * from biblioitems,items,biblio where biblioitems.biblionumber
='109' and biblioitems.biblionumber = items.biblionumber and
biblio.biblionumber = biblioitems.biblionumber
explanation:
Nested Loop  (cost=54.91 rows=657 width=340)
  ->  Nested Loop  (cost=4.08 rows=2 width=167)
        ->  Index Scan using bibitbnoidx on biblioitems  (cost=2.03 rows=1 width=102)
        ->  Index Scan using bibnumidx on biblio  (cost=2.05 rows=59945 width=65)
  ->  Index Scan using bibnumitem on items  (cost=25.42 rows=73185 width=173)

This one is slow:
select * from biblioitems,items,biblio where isbn='031051911X' and
biblioitems.biblionumber = items.biblionumber and biblio.biblionumber =
biblioitems.biblionumber;
explanation:
Hash Join  (cost=11830.17 rows=981357 width=340)
  ->  Seq Scan on items  (cost=4158.10 rows=73185 width=173)
  ->  Hash  (cost=5091.33 rows=2989 width=167)
        ->  Hash Join  (cost=5091.33 rows=2989 width=167)
              ->  Seq Scan on biblio  (cost=2767.19 rows=59945 width=65)
              ->  Hash  (cost=203.35 rows=2988 width=102)
                    ->  Index Scan using isbnidx on biblioitems  (cost=203.35 rows=2988 width=102)

How can I make this query use the indexes?




Re: [SQL] Search optimisation

From
Tom Lane
Date:
Olwen Williams <olwen@ihug.co.nz> writes:
> [ why is query A so much faster than query B? ]

Hmm.  The system is correctly estimating that query A is much
faster than B:

> Nested Loop  (cost=54.91 rows=657 width=340)
                ^^^^^^^^^^
vs
>  Hash Join  (cost=11830.17 rows=981357 width=340)
               ^^^^^^^^^^^^^

so the question is why it thought this hashjoin plan was the best
available option for query B.  It looks to me like the problem has
to do with the estimated selectivities of the two restrictions on the
biblioitems table.  Query A recognizes that biblionumber='109' selects
just one item from biblioitems:

-> Index Scan using bibitbnoidx on biblioitems  (cost=2.03 rows=1 width=102)
                                                           ^^^^^^

so it produces a good plan, even though a doubly nested loop would
be an extremely horrible plan if there were a lot of rows involved.
(Actually, this is the first complaint about bad plans I've ever seen
in which doubly-nested-loop was the *right* choice ;-) ...)

Meanwhile, query B thinks that isbn='031051911X' is going to accept
several thousand rows from biblioitems:

-> Index Scan using isbnidx on biblioitems (cost=203.35 rows=2988 width=102)
                                                        ^^^^^^^^^

and because of that, it's choosing a plan that would be well adapted for
a lot of matching rows, yet is overkill (read inefficient) for a few
rows.

Now, I'm going to read between the lines and guess that the restriction
on isbn should yield only one biblioitems row; if this guess is wrong,
you can ignore all that follows.

If my guess is correct, then the blame lies with misestimation of the
selectivity of the isbn='XXX' clause.  You can double-check this by
doing 
explain select * from biblioitems where isbn='031051911X';

and seeing whether the estimated row count is close to what you
actually get from doing this query.

The first question has to be "have you done a VACUUM ANALYZE lately?".
If not, the planner is working in the dark about the distribution of
isbn values, and you can't fairly blame it for guessing conservatively
that it's going to have to deal with a lot of rows.  However, if you
have done a VACUUM ANALYZE and still get a ridiculous estimate, then
that's a bug that I'd like to try to fix.  Please contact me off-list
and we can pursue the details of why this is happening.
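
For instance (a minimal sketch; the table name and isbn value come from
your examples, and VACUUM ANALYZE rebuilds the statistics the planner
relies on):

vacuum analyze biblioitems;
explain select * from biblioitems where isbn='031051911X';

If the estimated row count then comes back close to 1, the three-way
join ought to choose the nested-loop plan on its own.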
        regards, tom lane


Sub-select speed.

From
"Mitch Vincent"
Date:
Hey guys, I have a few queries here that I would greatly appreciate any
pointers on. I need to get them as fast as possible, but would settle
for ANY speed improvement over what they are now.

select * from applicants as a where a.status = 'A' and a.app_id in(select
b.app_id from resume_search as b where a.app_id=b.app_id and b.user_id=291)

resume_search is a table populated by a query run before this one.  The
schema is as follows:
| user_id                          | int4                             |     4 |
| app_id                           | int4                             |     4 |

Indices:  resume_app_id             resume_search_id

So, in its current form, with only a few (say 5) rows in resume_search,
it's still taking quite some time.

In addition to this, I have another query like this one, only it also
inserts into another table.

insert into users_download_app (user_id,app_id) select 291, app_id from
applicants where status = 'A' and app_id in(select b.app_id from
resume_search as b where app_id=b.app_id and b.user_id=291) limit 200

This one is so slow that it's unusable. (We're talking minutes here on my
Celeron 333 devel server).

I wish there was another way to get done what I need to, but the
sub-selects are the only way I can see right now. I have a feeling,
though, that there are some extra things I might be able to do to get
the same results from the same tables, faster.

Thanks to one and all.

-Mitch







Re: [SQL] Sub-select speed.

From
Tom Lane
Date:
"Mitch Vincent" <mitch@venux.net> writes:
> select * from applicants as a where a.status = 'A' and a.app_id in(select
> b.app_id from resume_search as b where a.app_id=b.app_id and b.user_id=291)

WHERE ... IN (sub-select) is a very inefficient substitute for a plain join
query.  Try something like

select a.* from applicants as a, resume_search as b
where a.status = 'A' and a.app_id = b.app_id and b.user_id = 291;

Providing indexes on one or both app_id fields might help.  Also, don't
forget to do a VACUUM every so often to make sure the planner has useful
statistics about the sizes of the tables.
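
For instance (a sketch only; the index name here is made up, and the
schema above suggests resume_search already has its indexes):

create index applicants_app_id_idx on applicants (app_id);
vacuum analyze applicants;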

There has been some talk of automatically rewriting queries to eliminate
unnecessary sub-selects, but I don't foresee it getting done for a
while yet.

> insert into users_download_app (user_id,app_id) select 291, app_id from
> applicants where status = 'A' and app_id in(select b.app_id from
> resume_search as b where app_id=b.app_id and b.user_id=291) limit 200

BTW, there's a bug in current sources: LIMIT is ignored by INSERT
... SELECT.  I think it works OK to do SELECT INTO ... LIMIT, however,
and then you could insert into the final destination table from the
INTO temp table.
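
A sketch of that workaround, combined with the join rewrite above (the
table name tmp_app is made up, and inserting 291 as the user_id is
inferred from the original query):

select a.app_id into table tmp_app
from applicants as a, resume_search as b
where a.status = 'A' and a.app_id = b.app_id and b.user_id = 291
limit 200;

insert into users_download_app (user_id, app_id)
select 291, app_id from tmp_app;

drop table tmp_app;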
        regards, tom lane


avg() on numeric ?

From
kaiq@realtyideas.com
Date:
Does avg() work on numeric? It should, since numeric is supposed to be
the more accurate type. This is on PG 6.5.1.

Please see below (the errors are on the last lines):
##########################################
create table salesreps (
empl_num        integer not null,
name            varchar (15) not null,
age             integer,
rep_office      integer,
title           varchar (10),
hiredate        date not null,
manager         integer,
quota           numeric(12,2), -- money,
sales           numeric(12,2) --money
);

test=> select * from salesreps;
empl_num|name         |age|rep_office|title    |  hiredate|manager|    quota|    sales
--------+-------------+---+----------+---------+----------+-------+---------+---------
     105|Bill Adams   | 37|        13|Sales Rep|02-12-1988|    104|350000.00|367911.00
     109|Mary Jones   | 31|        11|Sales Rep|10-12-1989|    106|300000.00|392725.00
     102|Sue Smith    | 48|        21|Sales Rep|12-10-1986|    108|350000.00|474050.00
     106|Sam Clark    | 52|        11|VP Sales |06-14-1988|       |275000.00|299912.00
     104|Bob Smith    | 33|        12|Sales Mgr|05-19-1987|    106|200000.00|142594.00
     101|Dan Roberts  | 45|        12|Sales Rep|10-20-1986|    104|300000.00|305673.00
     110|Tom Snyder   | 41|          |Sales Rep|01-13-1990|    101|         | 75985.00
     108|Larry Fitch  | 62|        21|Sales Mgr|10-12-1989|    106|350000.00|361865.00
     103|Paul Cruz    | 29|        12|Sales Rep|03-01-1987|    104|275000.00|286775.00
     107|Nancy Angelli| 49|        22|Sales Rep|11-14-1988|    108|300000.00|186042.00
(10 rows)


test=> select sum(quota) from salesreps;
sum
----------
2700000.00
(1 row)

test=> select count(quota) from salesreps;
count
-----
9
(1 row)

test=> select sum(quota)/count(quota) from salesreps;
?column?
-----------------
300000.0000000000
(1 row)

test=> select avg(quota) from salesreps;
ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723

test=> select avg(sales) from salesreps;
ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723
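
A possible workaround, untested here and at the cost of float8
precision, is to cast before aggregating so the numeric aggregate code
is sidestepped:

select avg(quota::float8) from salesreps;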



Re: [SQL] Sub-select speed.

From
Bruce Momjian
Date:
> "Mitch Vincent" <mitch@venux.net> writes:
> > select * from applicants as a where a.status = 'A' and a.app_id in(select
> > b.app_id from resume_search as b where a.app_id=b.app_id and b.user_id=291)
> 
> WHERE ... IN (sub-select) is a very inefficient substitute for a plain join
> query.  Try something like
> 
> select a.* from applicants as a, resume_search as b
> where a.status = 'A' and a.app_id = b.app_id and b.user_id = 291;
> 
> Providing indexes on one or both app_id fields might help.  Also, don't
> forget to do a VACUUM every so often to make sure the planner has useful
> statistics about the sizes of the tables.
> 
> There has been some talk of automatically rewriting queries to eliminate
> unnecessary sub-selects, but I don't foresee it getting done for a
> while yet.


Tom, you mentioned that subselects use nested join, but they could be
hardcoded to use hash join.  My opinion is that this should be done if
it is easy.  I know the concern was that a hash could overflow if the
subquery is too large, but if the subquery is too large, nested join
will take forever, so it really doesn't matter if it completes or not.


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [SQL] avg() on numeric ?

From
Tom Lane
Date:
<kaiq@realtyideas.com> writes:
> test=> select avg(quota) from salesreps;
> ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723

> test=> select avg(sales) from salesreps;
> ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723

Hmm ... there were some similar bugs reported a couple of months ago,
and I thought Jan would've fixed them by now.  Jan, if you don't have
time to look at this, let me know and I'll see if I can do anything
with it.
        regards, tom lane


Re: [SQL] Sub-select speed.

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> There has been some talk of automatically rewriting queries to eliminate
>> unnecessary sub-selects, but I don't foresee it getting done for a
>> while yet.

> Tom, you mentioned that subselects use nested join, but they could be
> hardcoded to use hash join.  My opinion is that this should be done if
> it is easy.

That would probably help for uncorrelated subselects (where no variable
from the outer query is referenced in the inner one).  For correlated
subselects, such as this one is, hashing the subselect result isn't
going to help much --- the real problem is that the subselect is
repeated afresh for each outer tuple.
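
To make the distinction concrete, a sketch based on Mitch's query
("correlated" just means the inner select mentions the outer table's
alias):

-- correlated: the inner select references a.app_id, so it must be
-- re-evaluated for every row of applicants
select * from applicants as a
where a.app_id in (select b.app_id from resume_search as b
                   where a.app_id = b.app_id and b.user_id = 291);

-- uncorrelated: the inner select stands alone, so its result could
-- in principle be computed once and hashed
select * from applicants as a
where a.app_id in (select b.app_id from resume_search as b
                   where b.user_id = 291);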
        regards, tom lane


Re: [SQL] avg() on numeric ?

From
wieck@debis.com (Jan Wieck)
Date:
>
> <kaiq@realtyideas.com> writes:
> > test=> select avg(quota) from salesreps;
> > ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723
>
> > test=> select avg(sales) from salesreps;
> > ERROR:  overflow on numeric ABS(value) >= 10^-1 for field with precision 0 scale 1723
>
> Hmm ... there were some similar bugs reported a couple of months ago,
> and I thought Jan would've fixed them by now.  Jan, if you don't have
> time to look at this, let me know and I'll see if I can do anything
> with it.

    Looks to me like there's something wrong with the adjustment of the
    actual computation and/or display precision during longer-running
    aggregates.  I know where to look - tnx anyway.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [SQL] Sub-select speed.

From
Bruce Momjian
Date:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> There has been some talk of automatically rewriting queries to eliminate
> >> unnecessary sub-selects, but I don't foresee it getting done for a
> >> while yet.
> 
> > Tom, you mentioned that subselects use nested join, but they could be
> > hardcoded to use hash join.  My opinion is that this should be done if
> > it is easy.
> 
> That would probably help for uncorrelated subselects (where no variable
> from the outer query is referenced in the inner one).  For correlated
> subselects, such as this one is, hashing the subselect result isn't
> going to help much --- the real problem is that the subselect is
> repeated afresh for each outer tuple.

Oh, I was thinking in general of our EXISTS() workaround and whether we
could help that by forcing hash joins.
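
For reference, the EXISTS() rewrite in question presumably looks like
this on Mitch's query (a sketch; note the EXISTS form is itself a
correlated subselect):

select * from applicants as a
where a.status = 'A'
  and exists (select 1 from resume_search as b
              where b.app_id = a.app_id and b.user_id = 291);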

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [SQL] avg() on numeric ?

From
Tom Lane
Date:
wieck@debis.com (Jan Wieck) writes:
>     Looks to me like there's something wrong with the adjustment of the
>     actual computation and/or display precision during longer-running
>     aggregates.  I know where to look - tnx anyway.

It doesn't necessarily take long; Zot O'Connor posted this example
in late October:

create table example (other decimal(4,4));
CREATE

insert into example (other) values (3.9);
ERROR:  overflow on numeric ABS(value) >= 10^0 for field with precision 4 scale 4
insert into example (other) values (0.0);
ERROR:  overflow on numeric ABS(value) >= 10^0 for field with precision 4 scale 4

which still fails in current sources.  (Note that decimal(4,4) can only
hold values with absolute value below 1, so rejecting 3.9 is arguably
correct; rejecting 0.0 is clearly a bug.)
        regards, tom lane