Thread: which is better: using OR clauses or UNION?

which is better: using OR clauses or UNION?

From

adam_pgsql

Date:

16 August 2011, 11:39:40

Hi,

I have a query hitting a table of 25 million rows. The table has a text field ('identifier') which i need to query for
matchingrows. The question is if i have multiple strings to match against this field I can use multiple OR
sub-statementsor multiple statements in a UNION. The UNION seems to run quicker.... is this to be expected? or is there
anythingelse I can do improve the speed of this query? Some query details: 


table "dba_data_base", index:
"in_dba_data_base_identifier" btree (lower(identifier) varchar_pattern_ops)


Query 1
-------
datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM    dba_data_base aWHERE    (
lower(identifier)LIKE lower('BUGS0000001884677') OR      lower(identifier) LIKE lower('BUGS0000001884678') OR
lower(identifier)LIKE lower('BUGS0000001884679') OR      lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )ORDER BY
a.identifier;
QUERY PLAN                                                                                


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------
Sort  (cost=14930.13..14939.77 rows=3857 width=62) (actual time=3208.466..3208.652 rows=318 loops=1) Sort Key:
identifier->  Bitmap Heap Scan on dba_data_base a  (cost=134.43..14700.38 rows=3857 width=62) (actual
time=81.106..3207.721rows=318 loops=1)       Recheck Cond: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR
(lower(identifier)~~ 'bugs0000001884678'::text) OR (lower(identifier) ~~ 'bugs0000001884679'::text) OR
(lower(identifier)~ 
~ 'sptigr4-2210 (6f24)'::text))       Filter: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier)
~~'bugs0000001884678'::text) OR (lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier) ~~ 'spt 
igr4-2210 (6f24)'::text))       ->  BitmapOr  (cost=134.43..134.43 rows=3857 width=0) (actual time=71.397..71.397
rows=0loops=1)             ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
(actualtime=0.029..0.029 rows=0 loops=1)                   Index Cond: (lower(identifier) ~=~
'bugs0000001884677'::charactervarying)             ->  Bitmap Index Scan on in_dba_data_base_identifier
(cost=0.00..32.64rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)                   Index Cond:
(lower(identifier)~=~ 'bugs0000001884678'::character varying)             ->  Bitmap Index Scan on
in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)
   Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying)             ->  Bitmap Index Scan on
in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 loops=1)
       Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) 
Total runtime: 3208.904 ms


Query 2
-------
datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM    dba_data_base aWHERE
lower(identifier)LIKE lower('BUGS0000001884677') 
UNION
SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM    dba_data_base aWHERE      lower(identifier) LIKE
lower('BUGS0000001884678')
UNION
SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM    dba_data_base aWHERE      lower(identifier) LIKE
lower('BUGS0000001884679')
UNION
SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM    dba_data_base aWHERE      lower(identifier) LIKE
lower('SpTIGR4-2210(6F24)')ORDER BY identifier; 

Sort  (cost=15702.26..15711.90 rows=3856 width=62) (actual time=3.688..3.886 rows=317 loops=1) Sort Key: identifier ->
Unique (cost=15414.74..15472.58 rows=3856 width=62) (actual time=2.663..3.387 rows=317 loops=1)       ->  Sort
(cost=15414.74..15424.38rows=3856 width=62) (actual time=2.660..2.834 rows=318 loops=1)             Sort Key:
bioassay_id,identifier, ratio, log_ratio, p_value             ->  Append  (cost=32.88..15185.06 rows=3856 width=62)
(actualtime=0.320..2.131 rows=318 loops=1)                   ->  Bitmap Heap Scan on dba_data_base a
(cost=32.88..3786.62rows=964 width=62) (actual time=0.041..0.041 rows=0 loops=1)                         Filter:
(lower(identifier)~~ 'bugs0000001884677'::text)                         ->  Bitmap Index Scan on
in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.036..0.036 rows=0 loops=1)
               Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying)                   ->  Bitmap
HeapScan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010 rows=0 loops=1)
              Filter: (lower(identifier) ~~ 'bugs0000001884678'::text)                         ->  Bitmap Index Scan on
in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)
               Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::character varying)                   ->  Bitmap
HeapScan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010 rows=0 loops=1)
              Filter: (lower(identifier) ~~ 'bugs0000001884679'::text)                         ->  Bitmap Index Scan on
in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)
               Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying)                   ->  Bitmap
HeapScan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual time=0.255..1.676 rows=318 loops=1)
                Filter: (lower(identifier) ~~ 'sptigr4-2210 (6f24)'::text)                         ->  Bitmap Index
Scanon in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual time=0.178..0.178 rows=318 loops=1)
                         Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) 
Total runtime: 4.174 ms

Also which should scale better if I add more strings to match? would there be any better design patterns for this
problem?

Thanks for any help

Adam

select version();                          version
----------------------------------------------------------------
PostgreSQL 8.2.12 on i686-pc-linux-gnu, compiled by GCC 2.95.4

Re: which is better: using OR clauses or UNION?

From

Viktor Bojović

Date:

16 August 2011, 11:56:48

hi Adam,<br />im not sure which is faster/slower but, possibly you can speed it up by using "in" operator<br /><br
/>...wherelower(identifier) <b>in </b>(lower('BUGS0000001884677')<b> ,</b> lower('BUGS0000001884678')<b>,</b>....);<br
/><br/>if you create function based index:<br /><pre>CREATE INDEX idx_table_lower_text ON
table(<b>lower</b>(text_field));</pre>(takenfrom: <a
href="http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php">http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php</a>)<br
/><br/><br /><br /><br /><div class="gmail_quote">On Tue, Aug 16, 2011 at 1:39 PM, adam_pgsql <span dir="ltr"><<a
href="mailto:adam_pgsql@witneyweb.org">adam_pgsql@witneyweb.org</a>></span>wrote:<br /><blockquote
class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br /> Hi,<br /><br /> I have
aquery hitting a table of 25 million rows. The table has a text field ('identifier') which i need to query for matching
rows.The question is if i have multiple strings to match against this field I can use multiple OR sub-statements or
multiplestatements in a UNION. The UNION seems to run quicker.... is this to be expected? or is there anything else I
cando improve the speed of this query? Some query details:<br /><br /><br /> table "dba_data_base", index:<br />
"in_dba_data_base_identifier"btree (lower(identifier) varchar_pattern_ops)<br /><br /><br /> Query 1<br /> -------<br
/>datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value<br />  FROM<br />    
dba_data_basea<br />  WHERE<br />     ( lower(identifier) LIKE lower('BUGS0000001884677') OR<br />      
lower(identifier)LIKE lower('BUGS0000001884678') OR<br />       lower(identifier) LIKE lower('BUGS0000001884679') OR<br
/>      lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )<br />  ORDER BY a.identifier;<br />                      
                                                                                      QUERY PLAN<br /><br />
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br
/>--------------------------------<br /> Sort  (cost=14930.13..14939.77 rows=3857 width=62) (actual
time=3208.466..3208.652rows=318 loops=1)<br />  Sort Key: identifier<br />  ->  Bitmap Heap Scan on dba_data_base a
 (cost=134.43..14700.38rows=3857 width=62) (actual time=81.106..3207.721 rows=318 loops=1)<br />        Recheck Cond:
((lower(identifier)~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~ 'bugs0000001884678'::text) OR
(lower(identifier)~~ 'bugs0000001884679'::text) OR (lower(identifier) ~<br /> ~ 'sptigr4-2210 (6f24)'::text))<br />    
  Filter: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~ 'bugs0000001884678'::text) OR
(lower(identifier)~~ 'bugs0000001884679'::text) OR (lower(identifier) ~~ 'spt<br /> igr4-2210 (6f24)'::text))<br />    
  ->  BitmapOr  (cost=134.43..134.43 rows=3857 width=0) (actual time=<a href="tel:71.397..71.397"
value="+17139771397">71.397..71.397</a>rows=0 loops=1)<br />              ->  Bitmap Index Scan on
in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029 rows=0 loops=1)<br />        
          Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying)<br />              ->  Bitmap
IndexScan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0
loops=1)<br/>                    Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::character varying)<br />      
      ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=0.008..0.008rows=0 loops=1)<br />                    Index Cond: (lower(identifier) ~=~
'bugs0000001884679'::charactervarying)<br />              ->  Bitmap Index Scan on in_dba_data_base_identifier
 (cost=0.00..32.64rows=964 width=0) (actual time=<a href="tel:71.347..71.347" value="+17134771347">71.347..71.347</a>
rows=318loops=1)<br />                    Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character
varying)<br/> Total runtime: 3208.904 ms<br /><br /><br /> Query 2<br /> -------<br /> datadb=#explain analyse SELECT
a.bioassay_id,a.identifier, ratio, log_ratio, p_value<br />  FROM<br />     dba_data_base a<br />  WHERE<br />    
lower(identifier)LIKE lower('BUGS0000001884677')<br /> UNION<br /> SELECT a.bioassay_id, a.identifier, ratio,
log_ratio,p_value<br />  FROM<br />     dba_data_base a<br />  WHERE<br />       lower(identifier) LIKE
lower('BUGS0000001884678')<br/> UNION<br /> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value<br />
 FROM<br/>     dba_data_base a<br />  WHERE<br />       lower(identifier) LIKE lower('BUGS0000001884679')<br />
UNION<br/> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value<br />  FROM<br />     dba_data_base a<br />
 WHERE<br/>       lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)')<br />  ORDER BY identifier;<br /><br /> Sort
 (cost=15702.26..15711.90rows=3856 width=62) (actual time=3.688..3.886 rows=317 loops=1)<br />  Sort Key: identifier<br
/> ->  Unique  (cost=15414.74..15472.58 rows=3856 width=62) (actual time=2.663..3.387 rows=317 loops=1)<br />      
 -> Sort  (cost=15414.74..15424.38 rows=3856 width=62) (actual time=2.660..2.834 rows=318 loops=1)<br />            
 SortKey: bioassay_id, identifier, ratio, log_ratio, p_value<br />              ->  Append  (cost=32.88..15185.06
rows=3856width=62) (actual time=0.320..2.131 rows=318 loops=1)<br />                    ->  Bitmap Heap Scan on
dba_data_basea  (cost=32.88..3786.62 rows=964 width=62) (actual time=0.041..0.041 rows=0 loops=1)<br />                
        Filter: (lower(identifier) ~~ 'bugs0000001884677'::text)<br />                          ->  Bitmap Index
Scanon in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual time=0.036..0.036 rows=0 loops=1)<br />
                              Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying)<br />        
          ->  Bitmap Heap Scan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010
rows=0loops=1)<br />                          Filter: (lower(identifier) ~~ 'bugs0000001884678'::text)<br />          
              ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=0.008..0.008rows=0 loops=1)<br />                                Index Cond: (lower(identifier) ~=~
'bugs0000001884678'::charactervarying)<br />                    ->  Bitmap Heap Scan on dba_data_base a
 (cost=32.88..3786.62rows=964 width=62) (actual time=0.010..0.010 rows=0 loops=1)<br />                        
 Filter:(lower(identifier) ~~ 'bugs0000001884679'::text)<br />                          ->  Bitmap Index Scan on
in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)<br />        
                      Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying)<br />                
  ->  Bitmap Heap Scan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual time=0.255..1.676
rows=318loops=1)<br />                          Filter: (lower(identifier) ~~ 'sptigr4-2210 (6f24)'::text)<br />      
                  ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=0.178..0.178rows=318 loops=1)<br />                                Index Cond: (lower(identifier) ~=~
'sptigr4-2210(6f24)'::character varying)<br /> Total runtime: 4.174 ms<br /><br /> Also which should scale better if I
addmore strings to match? would there be any better design patterns for this problem?<br /><br /> Thanks for any
help<br/><br /> Adam<br /><br /> select version();<br />                           version<br />
----------------------------------------------------------------<br/> PostgreSQL 8.2.12 on i686-pc-linux-gnu, compiled
byGCC 2.95.4<br /><font color="#888888"><br /><br /><br /> --<br /> Sent via pgsql-sql mailing list (<a
href="mailto:pgsql-sql@postgresql.org">pgsql-sql@postgresql.org</a>)<br/> To make changes to your subscription:<br /><a
href="http://www.postgresql.org/mailpref/pgsql-sql"target="_blank">http://www.postgresql.org/mailpref/pgsql-sql</a><br
/></font></blockquote></div><br/><br clear="all" /><br />-- <br />---------------------------------------<br />Viktor
Bojović<br/>---------------------------------------<br />Wherever I go, Murphy goes with me<br />

Re: which is better: using OR clauses or UNION?

From

"David Johnston"

Date:

16 August 2011, 14:07:26

-----Original Message-----
From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org]
On Behalf Of adam_pgsql
Sent: Tuesday, August 16, 2011 7:39 AM
To: pgsql-sql
Subject: [SQL] which is better: using OR clauses or UNION?

Hi,

I have a query hitting a table of 25 million rows. The table has a text
field ('identifier') which i need to query for matching rows. The question
is if i have multiple strings to match against this field I can use multiple
OR sub-statements or multiple statements in a UNION. The UNION seems to run
quicker.... is this to be expected? or is there anything else I can do
improve the speed of this query? Some query details:

--------------------------------------------

You can also try the following form:

... WHERE column = ANY(  string_to_array ( lower( 'Bug1,Bug2,Bug3' ) , ',' )
);

The main advantage of this is that you can parameterize the input string and
so it will not logically matter how many values you are checking for.

Also, you can always just place the values you want to search for into a
table (temp or otherwise) and perform an Inner Join.

No idea which one is "faster" but the "string_to_array" above requires no
Dynamic SQL which all of the other forms (OR, UNION, IN) need.  The table
form also does not require dynamic SQL but you need additional INSERTS for
each search value.  I doubt it would be any better than the (OR/UNION/JOIN)
form but you could also create a VALUES virtual table - which also requires
Dynamic SQL.

SELECT *
FROM target_table
NATURAL JOIN ( VALUES ('Bug1'),('Bug2'),('Bug3') ) searchtarget (
join_column_name );

David J.

Re: which is better: using OR clauses or UNION?

From

Tom Lane

Date:

16 August 2011, 14:09:16

adam_pgsql <adam_pgsql@witneyweb.org> writes:
> I have a query hitting a table of 25 million rows. The table has a
> text field ('identifier') which i need to query for matching rows. The
> question is if i have multiple strings to match against this field I
> can use multiple OR sub-statements or multiple statements in a
> UNION. The UNION seems to run quicker.... is this to be expected?

Your test cases don't seem exactly comparable; in particular I think the
second one is benefiting from the first one having already read and
cached the relevant disk blocks.  Notice how you've got, eg,

>               ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=71.347..71.347rows=318 loops=1)
 
>                     Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying)

versus

>                           ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
(actualtime=0.178..0.178 rows=318 loops=1)
 
>                                 Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying)

Those are the exact same subplan, so any honest comparison should be
finding them to take the same amount of time.  When the actual readings
are different by a factor of several hundred, there's something wrong
with your measurement process.

In the end this comes down to whether duplicates will be eliminated more
efficiently by a BitmapOr step or by sort/uniq on the resulting rows.
I'd have to bet on the BitmapOr myself, but it's likely that this is
down in the noise compared to the actual disk accesses in any
not-fully-cached scenario.  Also, if you don't expect the sub-statements
to yield any duplicates, or don't care about seeing the same row twice
in the output, you should consider UNION ALL instead of UNION.
        regards, tom lane

Re: which is better: using OR clauses or UNION?

From

Jasen Betts

Date:

16 August 2011, 14:32:26

On 2011-08-16, adam_pgsql <adam_pgsql@witneyweb.org> wrote:
>
> Hi,
>
> I have a query hitting a table of 25 million rows. The table has a
>text field ('identifier') which i need to query for matching rows. The
>question is if i have multiple strings to match against this field I
>can use multiple OR sub-statements or multiple statements in a UNION.
>The UNION seems to run quicker.... is this to be expected? or is there
>anything else I can do improve the speed of this query? Some query
>details:        

>  WHERE
>      ( lower(identifier) LIKE lower('BUGS0000001884677') OR
>        lower(identifier) LIKE lower('BUGS0000001884678') OR
>        lower(identifier) LIKE lower('BUGS0000001884679') OR
>        lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )
>  ORDER BY a.identifier;

don't use like use regex.

...

WHERE identifier ~* E'^(BUGS0000001884677|BUGS0000001884678|BUGS0000001884679|SpTIGR4-2210
\\(6F24\\))$'

or 

where lower(identifier) ~* lower(E'^(BUGS0000001884677|BUGS0000001884678|BUGS0000001884679|SpTIGR4-2210
\\(6F24\\))$')

on the other hand you aren't doing any pattern stuff. - you you could
just use 'in':

WHERE lower(identifier) in (lower('BUGS0000001884677'),  lower('BUGS0000001884678'),  lower('BUGS0000001884679'),
lower('SpTIGR4-2210(6F24)') )
 

or if you need like, use like any:

WHERE lower(identifier) like ANY (lower('BUGS0000001884677'),  lower('BUGS0000001884678'),  lower('BUGS0000001884679'),
lower('SpTIGR4-2210(6F24)')  )
 


> Also which should scale better if I add more strings to match? would there be any better design patterns for this
problem?

use one of the above: preferably in, else regex, or failing that like any.
"= any" will also work but I don't thing it will ever be better than "in"

-- 
⚂⚃ 100% natural

Re: which is better: using OR clauses or UNION?

From

adam_pgsql

Date:

16 August 2011, 14:52:45

Hi Viktor,

thanks for your email, gave that a try, but the

lower(identifier) LIKE lower('BUGS0000001884677') OR

still comes in quicker than the IN approach

thanks

adam



On 16 Aug 2011, at 12:56, Viktor Bojović wrote:

> hi Adam,
> im not sure which is faster/slower but, possibly you can speed it up by using "in" operator
>
> ...where lower(identifier) in (lower('BUGS0000001884677') , lower('BUGS0000001884678'),....);
>
> if you create function based index:
> CREATE INDEX idx_table_lower_text ON table(lower(text_field));
> (taken from: http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php)
>
>
>
>
> On Tue, Aug 16, 2011 at 1:39 PM, adam_pgsql <adam_pgsql@witneyweb.org> wrote:
>
> Hi,
>
> I have a query hitting a table of 25 million rows. The table has a text field ('identifier') which i need to query
formatching rows. The question is if i have multiple strings to match against this field I can use multiple OR
sub-statementsor multiple statements in a UNION. The UNION seems to run quicker.... is this to be expected? or is there
anythingelse I can do improve the speed of this query? Some query details: 
>
>
> table "dba_data_base", index:
> "in_dba_data_base_identifier" btree (lower(identifier) varchar_pattern_ops)
>
>
> Query 1
> -------
> datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
>     dba_data_base a
>  WHERE
>     ( lower(identifier) LIKE lower('BUGS0000001884677') OR
>       lower(identifier) LIKE lower('BUGS0000001884678') OR
>       lower(identifier) LIKE lower('BUGS0000001884679') OR
>       lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )
>  ORDER BY a.identifier;
>                                                                                                              QUERY
PLAN
>
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> --------------------------------
> Sort  (cost=14930.13..14939.77 rows=3857 width=62) (actual time=3208.466..3208.652 rows=318 loops=1)
>  Sort Key: identifier
>  ->  Bitmap Heap Scan on dba_data_base a  (cost=134.43..14700.38 rows=3857 width=62) (actual time=81.106..3207.721
rows=318loops=1) 
>        Recheck Cond: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~
'bugs0000001884678'::text)OR (lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier) ~ 
> ~ 'sptigr4-2210 (6f24)'::text))
>        Filter: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~ 'bugs0000001884678'::text)
OR(lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier) ~~ 'spt 
> igr4-2210 (6f24)'::text))
>        ->  BitmapOr  (cost=134.43..134.43 rows=3857 width=0) (actual time=71.397..71.397 rows=0 loops=1)
>              ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=0.029..0.029rows=0 loops=1) 
>                    Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying)
>              ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=0.008..0.008rows=0 loops=1) 
>                    Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::character varying)
>              ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=0.008..0.008rows=0 loops=1) 
>                    Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying)
>              ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=71.347..71.347rows=318 loops=1) 
>                    Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying)
> Total runtime: 3208.904 ms
>
>
> Query 2
> -------
> datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
>     dba_data_base a
>  WHERE
>     lower(identifier) LIKE lower('BUGS0000001884677')
> UNION
> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
>     dba_data_base a
>  WHERE
>       lower(identifier) LIKE lower('BUGS0000001884678')
> UNION
> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
>     dba_data_base a
>  WHERE
>       lower(identifier) LIKE lower('BUGS0000001884679')
> UNION
> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
>     dba_data_base a
>  WHERE
>       lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)')
>  ORDER BY identifier;
>
> Sort  (cost=15702.26..15711.90 rows=3856 width=62) (actual time=3.688..3.886 rows=317 loops=1)
>  Sort Key: identifier
>  ->  Unique  (cost=15414.74..15472.58 rows=3856 width=62) (actual time=2.663..3.387 rows=317 loops=1)
>        ->  Sort  (cost=15414.74..15424.38 rows=3856 width=62) (actual time=2.660..2.834 rows=318 loops=1)
>              Sort Key: bioassay_id, identifier, ratio, log_ratio, p_value
>              ->  Append  (cost=32.88..15185.06 rows=3856 width=62) (actual time=0.320..2.131 rows=318 loops=1)
>                    ->  Bitmap Heap Scan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual
time=0.041..0.041rows=0 loops=1) 
>                          Filter: (lower(identifier) ~~ 'bugs0000001884677'::text)
>                          ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
(actualtime=0.036..0.036 rows=0 loops=1) 
>                                Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying)
>                    ->  Bitmap Heap Scan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual
time=0.010..0.010rows=0 loops=1) 
>                          Filter: (lower(identifier) ~~ 'bugs0000001884678'::text)
>                          ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
(actualtime=0.008..0.008 rows=0 loops=1) 
>                                Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::character varying)
>                    ->  Bitmap Heap Scan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual
time=0.010..0.010rows=0 loops=1) 
>                          Filter: (lower(identifier) ~~ 'bugs0000001884679'::text)
>                          ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
(actualtime=0.008..0.008 rows=0 loops=1) 
>                                Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying)
>                    ->  Bitmap Heap Scan on dba_data_base a  (cost=32.88..3786.62 rows=964 width=62) (actual
time=0.255..1.676rows=318 loops=1) 
>                          Filter: (lower(identifier) ~~ 'sptigr4-2210 (6f24)'::text)
>                          ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
(actualtime=0.178..0.178 rows=318 loops=1) 
>                                Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying)
> Total runtime: 4.174 ms
>
> Also which should scale better if I add more strings to match? would there be any better design patterns for this
problem?
>
> Thanks for any help
>
> Adam
>
> select version();
>                           version
> ----------------------------------------------------------------
> PostgreSQL 8.2.12 on i686-pc-linux-gnu, compiled by GCC 2.95.4
>
>
>
> --
> Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-sql
>
>
>
> --
> ---------------------------------------
> Viktor Bojović
> ---------------------------------------
> Wherever I go, Murphy goes with me

Re: which is better: using OR clauses or UNION?

From

adam_pgsql

Date:

16 August 2011, 15:55:45

On 16 Aug 2011, at 15:09, Tom Lane wrote:

> adam_pgsql <adam_pgsql@witneyweb.org> writes:
>> I have a query hitting a table of 25 million rows. The table has a
>> text field ('identifier') which i need to query for matching rows. The
>> question is if i have multiple strings to match against this field I
>> can use multiple OR sub-statements or multiple statements in a
>> UNION. The UNION seems to run quicker.... is this to be expected?
>
> Your test cases don't seem exactly comparable; in particular I think the
> second one is benefiting from the first one having already read and
> cached the relevant disk blocks.  Notice how you've got, eg,
>
>>              ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0) (actual
time=71.347..71.347rows=318 loops=1) 
>>                    Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying)
>
> versus
>
>>                          ->  Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
(actualtime=0.178..0.178 rows=318 loops=1) 
>>                                Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying)
>
> Those are the exact same subplan, so any honest comparison should be
> finding them to take the same amount of time.  When the actual readings
> are different by a factor of several hundred, there's something wrong
> with your measurement process.
>
> In the end this comes down to whether duplicates will be eliminated more
> efficiently by a BitmapOr step or by sort/uniq on the resulting rows.
> I'd have to bet on the BitmapOr myself, but it's likely that this is
> down in the noise compared to the actual disk accesses in any
> not-fully-cached scenario.  Also, if you don't expect the sub-statements
> to yield any duplicates, or don't care about seeing the same row twice
> in the output, you should consider UNION ALL instead of UNION.


Thanks guys, I'll give some of those options a try and see which ones improve performance

(Tom, yes i ran those queries after each other so there was caching going on. However, I had noticed a difference in
performancewhen spacing the queries before and after a few other big queries to help clear the cache). 

adam

Re: which is better: using OR clauses or UNION?

From

"Igor Neyman"

Date:

16 August 2011, 17:43:19

> -----Original Message-----
> From: adam_pgsql [mailto:adam_pgsql@witneyweb.org]
> Sent: Tuesday, August 16, 2011 7:39 AM
> To: pgsql-sql
> Subject: which is better: using OR clauses or UNION?
>
>
> Hi,
>
> I have a query hitting a table of 25 million rows. The table has a
text
> field ('identifier') which i need to query for matching rows. The
> question is if i have multiple strings to match against this field I
> can use multiple OR sub-statements or multiple statements in a UNION.
> The UNION seems to run quicker.... is this to be expected? or is there
> anything else I can do improve the speed of this query? Some query
> details:
>
>
> table "dba_data_base", index:
> "in_dba_data_base_identifier" btree (lower(identifier)
> varchar_pattern_ops)
>
>
> Query 1
> -------
> datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio,
> log_ratio, p_value
>  FROM
>      dba_data_base a
>  WHERE
>      ( lower(identifier) LIKE lower('BUGS0000001884677') OR
>        lower(identifier) LIKE lower('BUGS0000001884678') OR
>        lower(identifier) LIKE lower('BUGS0000001884679') OR
>        lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )
>  ORDER BY a.identifier;
>
> QUERY PLAN
>
>
-----------------------------------------------------------------------
>
-----------------------------------------------------------------------
> ----------------------------------------------------------
> --------------------------------
> Sort  (cost=14930.13..14939.77 rows=3857 width=62) (actual
> time=3208.466..3208.652 rows=318 loops=1)
>   Sort Key: identifier
>   ->  Bitmap Heap Scan on dba_data_base a  (cost=134.43..14700.38
> rows=3857 width=62) (actual time=81.106..3207.721 rows=318 loops=1)
>         Recheck Cond: ((lower(identifier) ~~
'bugs0000001884677'::text)
> OR (lower(identifier) ~~ 'bugs0000001884678'::text) OR
> (lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier)
> ~
> ~ 'sptigr4-2210 (6f24)'::text))
>         Filter: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR
> (lower(identifier) ~~ 'bugs0000001884678'::text) OR (lower(identifier)
> ~~ 'bugs0000001884679'::text) OR (lower(identifier) ~~ 'spt
> igr4-2210 (6f24)'::text))
>         ->  BitmapOr  (cost=134.43..134.43 rows=3857 width=0) (actual
> time=71.397..71.397 rows=0 loops=1)
>               ->  Bitmap Index Scan on in_dba_data_base_identifier
> (cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029 rows=0
> loops=1)
>                     Index Cond: (lower(identifier) ~=~
> 'bugs0000001884677'::character varying)
>               ->  Bitmap Index Scan on in_dba_data_base_identifier
> (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0
> loops=1)
>                     Index Cond: (lower(identifier) ~=~
> 'bugs0000001884678'::character varying)
>               ->  Bitmap Index Scan on in_dba_data_base_identifier
> (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0
> loops=1)
>                     Index Cond: (lower(identifier) ~=~
> 'bugs0000001884679'::character varying)
>               ->  Bitmap Index Scan on in_dba_data_base_identifier
> (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347
> rows=318 loops=1)
>                     Index Cond: (lower(identifier) ~=~ 'sptigr4-2210
> (6f24)'::character varying)
> Total runtime: 3208.904 ms
>
>
> Query 2
> -------
> datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio,
> log_ratio, p_value
>  FROM
>      dba_data_base a
>  WHERE
>      lower(identifier) LIKE lower('BUGS0000001884677')
> UNION
> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
>      dba_data_base a
>  WHERE
>        lower(identifier) LIKE lower('BUGS0000001884678')
> UNION
> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
>      dba_data_base a
>  WHERE
>        lower(identifier) LIKE lower('BUGS0000001884679')
> UNION
> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value
>  FROM
>      dba_data_base a
>  WHERE
>        lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)')
>  ORDER BY identifier;
>
> Sort  (cost=15702.26..15711.90 rows=3856 width=62) (actual
> time=3.688..3.886 rows=317 loops=1)
>   Sort Key: identifier
>   ->  Unique  (cost=15414.74..15472.58 rows=3856 width=62) (actual
> time=2.663..3.387 rows=317 loops=1)
>         ->  Sort  (cost=15414.74..15424.38 rows=3856 width=62) (actual
> time=2.660..2.834 rows=318 loops=1)
>               Sort Key: bioassay_id, identifier, ratio, log_ratio,
> p_value
>               ->  Append  (cost=32.88..15185.06 rows=3856 width=62)
> (actual time=0.320..2.131 rows=318 loops=1)
>                     ->  Bitmap Heap Scan on dba_data_base a
> (cost=32.88..3786.62 rows=964 width=62) (actual time=0.041..0.041
> rows=0 loops=1)
>                           Filter: (lower(identifier) ~~
> 'bugs0000001884677'::text)
>                           ->  Bitmap Index Scan on
> in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
> (actual time=0.036..0.036 rows=0 loops=1)
>                                 Index Cond: (lower(identifier) ~=~
> 'bugs0000001884677'::character varying)
>                     ->  Bitmap Heap Scan on dba_data_base a
> (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010
> rows=0 loops=1)
>                           Filter: (lower(identifier) ~~
> 'bugs0000001884678'::text)
>                           ->  Bitmap Index Scan on
> in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
> (actual time=0.008..0.008 rows=0 loops=1)
>                                 Index Cond: (lower(identifier) ~=~
> 'bugs0000001884678'::character varying)
>                     ->  Bitmap Heap Scan on dba_data_base a
> (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010
> rows=0 loops=1)
>                           Filter: (lower(identifier) ~~
> 'bugs0000001884679'::text)
>                           ->  Bitmap Index Scan on
> in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
> (actual time=0.008..0.008 rows=0 loops=1)
>                                 Index Cond: (lower(identifier) ~=~
> 'bugs0000001884679'::character varying)
>                     ->  Bitmap Heap Scan on dba_data_base a
> (cost=32.88..3786.62 rows=964 width=62) (actual time=0.255..1.676
> rows=318 loops=1)
>                           Filter: (lower(identifier) ~~ 'sptigr4-2210
> (6f24)'::text)
>                           ->  Bitmap Index Scan on
> in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
> (actual time=0.178..0.178 rows=318 loops=1)
>                                 Index Cond: (lower(identifier) ~=~
> 'sptigr4-2210 (6f24)'::character varying)
> Total runtime: 4.174 ms
>
> Also which should scale better if I add more strings to match? would
> there be any better design patterns for this problem?
>
> Thanks for any help
>
> Adam
>
> select version();
>                            version
> ----------------------------------------------------------------
> PostgreSQL 8.2.12 on i686-pc-linux-gnu, compiled by GCC 2.95.4
>

Adam,

There is something strange in your 2 execution plans.
Exactly the same operation:

-- first plan Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64
rows=964 width=0) (actual time=71.347..71.347 rows=318 loops=1)                   Index Cond: (lower(identifier) ~=~
'sptigr4-2210
(6f24)'::character varying)
-- second plan
Bitmap Index Scan on in_dba_data_base_identifier  (cost=0.00..32.64
rows=964 width=0) (actual time=0.178..0.178 rows=318 loops=1)                   Index Cond: (lower(identifier) ~=~
'sptigr4-2210
(6f24)'::character varying)

takes quite different time: 71.347 versus 0.178 and basically makes all
the difference between duration of your first and second statement.

I think, what you are seeing here is data being cached in memory (when
you executed "union" statement after "or" statement).

Other than that, looking at 2 execution plans, I'd say that in general
"or" should run faster than "union", at least because it does "Bitmap
Heap Scan on dba_data_base" only once, while "union" statement does this
heap scan 4 times (once per "unionized" select).

HTH,
Igor Neyman

Re: which is better: using OR clauses or UNION?

From

"Igor Neyman"

Date:

17 August 2011, 20:27:11

> -----Original Message-----
> From: adam_pgsql [mailto:adam_pgsql@witneyweb.org]
> Sent: Tuesday, August 16, 2011 11:56 AM
> To: Tom Lane
> Cc: pgsql-sql
> Subject: Re: which is better: using OR clauses or UNION?
>
>
> On 16 Aug 2011, at 15:09, Tom Lane wrote:
>
> > adam_pgsql <adam_pgsql@witneyweb.org> writes:
> >> I have a query hitting a table of 25 million rows. The table has a
> >> text field ('identifier') which i need to query for matching rows.
> The
> >> question is if i have multiple strings to match against this field
I
> >> can use multiple OR sub-statements or multiple statements in a
> >> UNION. The UNION seems to run quicker.... is this to be expected?
> >
> > Your test cases don't seem exactly comparable; in particular I think
> the
> > second one is benefiting from the first one having already read and
> > cached the relevant disk blocks.  Notice how you've got, eg,
> >
> >>              ->  Bitmap Index Scan on in_dba_data_base_identifier
> (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347
> rows=318 loops=1)
> >>                    Index Cond: (lower(identifier) ~=~ 'sptigr4-2210
> (6f24)'::character varying)
> >
> > versus
> >
> >>                          ->  Bitmap Index Scan on
> in_dba_data_base_identifier  (cost=0.00..32.64 rows=964 width=0)
> (actual time=0.178..0.178 rows=318 loops=1)
> >>                                Index Cond: (lower(identifier) ~=~
> 'sptigr4-2210 (6f24)'::character varying)
> >
> > Those are the exact same subplan, so any honest comparison should be
> > finding them to take the same amount of time.  When the actual
> readings
> > are different by a factor of several hundred, there's something
wrong
> > with your measurement process.
> >
> > In the end this comes down to whether duplicates will be eliminated
> more
> > efficiently by a BitmapOr step or by sort/uniq on the resulting
rows.
> > I'd have to bet on the BitmapOr myself, but it's likely that this is
> > down in the noise compared to the actual disk accesses in any
> > not-fully-cached scenario.  Also, if you don't expect the sub-
> statements
> > to yield any duplicates, or don't care about seeing the same row
> twice
> > in the output, you should consider UNION ALL instead of UNION.
>
>
> Thanks guys, I'll give some of those options a try and see which ones
> improve performance
>
> (Tom, yes i ran those queries after each other so there was caching
> going on. However, I had noticed a difference in performance when
> spacing the queries before and after a few other big queries to help
> clear the cache).
>
> adam

Adam,

Did you verify that your cache is "cleared"? Like using pg_buffercache
contrib. module?
Besides, there is also OS cache...

Regards,
Igor Neyman