Thread: which is better: using OR clauses or UNION?
Hi, I have a query hitting a table of 25 million rows. The table has a text field ('identifier') which i need to query for matchingrows. The question is if i have multiple strings to match against this field I can use multiple OR sub-statementsor multiple statements in a UNION. The UNION seems to run quicker.... is this to be expected? or is there anythingelse I can do improve the speed of this query? Some query details: table "dba_data_base", index: "in_dba_data_base_identifier" btree (lower(identifier) varchar_pattern_ops) Query 1 ------- datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM dba_data_base aWHERE ( lower(identifier)LIKE lower('BUGS0000001884677') OR lower(identifier) LIKE lower('BUGS0000001884678') OR lower(identifier)LIKE lower('BUGS0000001884679') OR lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )ORDER BY a.identifier; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- -------------------------------- Sort (cost=14930.13..14939.77 rows=3857 width=62) (actual time=3208.466..3208.652 rows=318 loops=1) Sort Key: identifier-> Bitmap Heap Scan on dba_data_base a (cost=134.43..14700.38 rows=3857 width=62) (actual time=81.106..3207.721rows=318 loops=1) Recheck Cond: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier)~~ 'bugs0000001884678'::text) OR (lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier)~ ~ 'sptigr4-2210 (6f24)'::text)) Filter: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~'bugs0000001884678'::text) OR (lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier) ~~ 'spt igr4-2210 (6f24)'::text)) -> BitmapOr (cost=134.43..134.43 rows=3857 width=0) (actual time=71.397..71.397 rows=0loops=1) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actualtime=0.029..0.029 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::charactervarying) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1) Index Cond: (lower(identifier)~=~ 'bugs0000001884678'::character varying) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 loops=1) Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) Total runtime: 3208.904 ms Query 2 ------- datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM dba_data_base aWHERE lower(identifier)LIKE lower('BUGS0000001884677') UNION SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM dba_data_base aWHERE lower(identifier) LIKE lower('BUGS0000001884678') UNION SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM dba_data_base aWHERE lower(identifier) LIKE lower('BUGS0000001884679') UNION SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_valueFROM dba_data_base aWHERE lower(identifier) LIKE lower('SpTIGR4-2210(6F24)')ORDER BY identifier; Sort (cost=15702.26..15711.90 rows=3856 width=62) (actual time=3.688..3.886 rows=317 loops=1) Sort Key: identifier -> Unique (cost=15414.74..15472.58 rows=3856 width=62) (actual time=2.663..3.387 rows=317 loops=1) -> Sort (cost=15414.74..15424.38rows=3856 width=62) (actual time=2.660..2.834 rows=318 loops=1) Sort Key: bioassay_id,identifier, ratio, log_ratio, p_value -> Append (cost=32.88..15185.06 rows=3856 width=62) (actualtime=0.320..2.131 rows=318 loops=1) -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62rows=964 width=62) (actual time=0.041..0.041 rows=0 loops=1) Filter: (lower(identifier)~~ 'bugs0000001884677'::text) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.036..0.036 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying) -> Bitmap HeapScan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010 rows=0 loops=1) Filter: (lower(identifier) ~~ 'bugs0000001884678'::text) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::character varying) -> Bitmap HeapScan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010 rows=0 loops=1) Filter: (lower(identifier) ~~ 'bugs0000001884679'::text) -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1) Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying) -> Bitmap HeapScan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.255..1.676 rows=318 loops=1) Filter: (lower(identifier) ~~ 'sptigr4-2210 (6f24)'::text) -> Bitmap Index Scanon in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.178..0.178 rows=318 loops=1) Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) Total runtime: 4.174 ms Also which should scale better if I add more strings to match? would there be any better design patterns for this problem? Thanks for any help Adam select version(); version ---------------------------------------------------------------- PostgreSQL 8.2.12 on i686-pc-linux-gnu, compiled by GCC 2.95.4
hi Adam,<br />im not sure which is faster/slower but, possibly you can speed it up by using "in" operator<br /><br />...wherelower(identifier) <b>in </b>(lower('BUGS0000001884677')<b> ,</b> lower('BUGS0000001884678')<b>,</b>....);<br /><br/>if you create function based index:<br /><pre>CREATE INDEX idx_table_lower_text ON table(<b>lower</b>(text_field));</pre>(takenfrom: <a href="http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php">http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php</a>)<br /><br/><br /><br /><br /><div class="gmail_quote">On Tue, Aug 16, 2011 at 1:39 PM, adam_pgsql <span dir="ltr"><<a href="mailto:adam_pgsql@witneyweb.org">adam_pgsql@witneyweb.org</a>></span>wrote:<br /><blockquote class="gmail_quote"style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;"><br /> Hi,<br /><br /> I have aquery hitting a table of 25 million rows. The table has a text field ('identifier') which i need to query for matching rows.The question is if i have multiple strings to match against this field I can use multiple OR sub-statements or multiplestatements in a UNION. The UNION seems to run quicker.... is this to be expected? or is there anything else I cando improve the speed of this query? Some query details:<br /><br /><br /> table "dba_data_base", index:<br /> "in_dba_data_base_identifier"btree (lower(identifier) varchar_pattern_ops)<br /><br /><br /> Query 1<br /> -------<br />datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value<br /> FROM<br /> dba_data_basea<br /> WHERE<br /> ( lower(identifier) LIKE lower('BUGS0000001884677') OR<br /> lower(identifier)LIKE lower('BUGS0000001884678') OR<br /> lower(identifier) LIKE lower('BUGS0000001884679') OR<br /> lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') )<br /> ORDER BY a.identifier;<br /> QUERY PLAN<br /><br /> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br />--------------------------------<br /> Sort (cost=14930.13..14939.77 rows=3857 width=62) (actual time=3208.466..3208.652rows=318 loops=1)<br /> Sort Key: identifier<br /> -> Bitmap Heap Scan on dba_data_base a (cost=134.43..14700.38rows=3857 width=62) (actual time=81.106..3207.721 rows=318 loops=1)<br /> Recheck Cond: ((lower(identifier)~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~ 'bugs0000001884678'::text) OR (lower(identifier)~~ 'bugs0000001884679'::text) OR (lower(identifier) ~<br /> ~ 'sptigr4-2210 (6f24)'::text))<br /> Filter: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~ 'bugs0000001884678'::text) OR (lower(identifier)~~ 'bugs0000001884679'::text) OR (lower(identifier) ~~ 'spt<br /> igr4-2210 (6f24)'::text))<br /> -> BitmapOr (cost=134.43..134.43 rows=3857 width=0) (actual time=<a href="tel:71.397..71.397" value="+17139771397">71.397..71.397</a>rows=0 loops=1)<br /> -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029 rows=0 loops=1)<br /> Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying)<br /> -> Bitmap IndexScan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)<br/> Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::character varying)<br /> -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008rows=0 loops=1)<br /> Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::charactervarying)<br /> -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64rows=964 width=0) (actual time=<a href="tel:71.347..71.347" value="+17134771347">71.347..71.347</a> rows=318loops=1)<br /> Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying)<br/> Total runtime: 3208.904 ms<br /><br /><br /> Query 2<br /> -------<br /> datadb=#explain analyse SELECT a.bioassay_id,a.identifier, ratio, log_ratio, p_value<br /> FROM<br /> dba_data_base a<br /> WHERE<br /> lower(identifier)LIKE lower('BUGS0000001884677')<br /> UNION<br /> SELECT a.bioassay_id, a.identifier, ratio, log_ratio,p_value<br /> FROM<br /> dba_data_base a<br /> WHERE<br /> lower(identifier) LIKE lower('BUGS0000001884678')<br/> UNION<br /> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value<br /> FROM<br/> dba_data_base a<br /> WHERE<br /> lower(identifier) LIKE lower('BUGS0000001884679')<br /> UNION<br/> SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value<br /> FROM<br /> dba_data_base a<br /> WHERE<br/> lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)')<br /> ORDER BY identifier;<br /><br /> Sort (cost=15702.26..15711.90rows=3856 width=62) (actual time=3.688..3.886 rows=317 loops=1)<br /> Sort Key: identifier<br /> -> Unique (cost=15414.74..15472.58 rows=3856 width=62) (actual time=2.663..3.387 rows=317 loops=1)<br /> -> Sort (cost=15414.74..15424.38 rows=3856 width=62) (actual time=2.660..2.834 rows=318 loops=1)<br /> SortKey: bioassay_id, identifier, ratio, log_ratio, p_value<br /> -> Append (cost=32.88..15185.06 rows=3856width=62) (actual time=0.320..2.131 rows=318 loops=1)<br /> -> Bitmap Heap Scan on dba_data_basea (cost=32.88..3786.62 rows=964 width=62) (actual time=0.041..0.041 rows=0 loops=1)<br /> Filter: (lower(identifier) ~~ 'bugs0000001884677'::text)<br /> -> Bitmap Index Scanon in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.036..0.036 rows=0 loops=1)<br /> Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying)<br /> -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010 rows=0loops=1)<br /> Filter: (lower(identifier) ~~ 'bugs0000001884678'::text)<br /> -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008rows=0 loops=1)<br /> Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::charactervarying)<br /> -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62rows=964 width=62) (actual time=0.010..0.010 rows=0 loops=1)<br /> Filter:(lower(identifier) ~~ 'bugs0000001884679'::text)<br /> -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 loops=1)<br /> Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying)<br /> -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.255..1.676 rows=318loops=1)<br /> Filter: (lower(identifier) ~~ 'sptigr4-2210 (6f24)'::text)<br /> -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.178..0.178rows=318 loops=1)<br /> Index Cond: (lower(identifier) ~=~ 'sptigr4-2210(6f24)'::character varying)<br /> Total runtime: 4.174 ms<br /><br /> Also which should scale better if I addmore strings to match? would there be any better design patterns for this problem?<br /><br /> Thanks for any help<br/><br /> Adam<br /><br /> select version();<br /> version<br /> ----------------------------------------------------------------<br/> PostgreSQL 8.2.12 on i686-pc-linux-gnu, compiled byGCC 2.95.4<br /><font color="#888888"><br /><br /><br /> --<br /> Sent via pgsql-sql mailing list (<a href="mailto:pgsql-sql@postgresql.org">pgsql-sql@postgresql.org</a>)<br/> To make changes to your subscription:<br /><a href="http://www.postgresql.org/mailpref/pgsql-sql"target="_blank">http://www.postgresql.org/mailpref/pgsql-sql</a><br /></font></blockquote></div><br/><br clear="all" /><br />-- <br />---------------------------------------<br />Viktor Bojović<br/>---------------------------------------<br />Wherever I go, Murphy goes with me<br />
-----Original Message----- From: pgsql-sql-owner@postgresql.org [mailto:pgsql-sql-owner@postgresql.org] On Behalf Of adam_pgsql Sent: Tuesday, August 16, 2011 7:39 AM To: pgsql-sql Subject: [SQL] which is better: using OR clauses or UNION? Hi, I have a query hitting a table of 25 million rows. The table has a text field ('identifier') which i need to query for matching rows. The question is if i have multiple strings to match against this field I can use multiple OR sub-statements or multiple statements in a UNION. The UNION seems to run quicker.... is this to be expected? or is there anything else I can do improve the speed of this query? Some query details: -------------------------------------------- You can also try the following form: ... WHERE column = ANY( string_to_array ( lower( 'Bug1,Bug2,Bug3' ) , ',' ) ); The main advantage of this is that you can parameterize the input string and so it will not logically matter how many values you are checking for. Also, you can always just place the values you want to search for into a table (temp or otherwise) and perform an Inner Join. No idea which one is "faster" but the "string_to_array" above requires no Dynamic SQL which all of the other forms (OR, UNION, IN) need. The table form also does not require dynamic SQL but you need additional INSERTS for each search value. I doubt it would be any better than the (OR/UNION/JOIN) form but you could also create a VALUES virtual table - which also requires Dynamic SQL. SELECT * FROM target_table NATURAL JOIN ( VALUES ('Bug1'),('Bug2'),('Bug3') ) searchtarget ( join_column_name ); David J.
adam_pgsql <adam_pgsql@witneyweb.org> writes: > I have a query hitting a table of 25 million rows. The table has a > text field ('identifier') which i need to query for matching rows. The > question is if i have multiple strings to match against this field I > can use multiple OR sub-statements or multiple statements in a > UNION. The UNION seems to run quicker.... is this to be expected? Your test cases don't seem exactly comparable; in particular I think the second one is benefiting from the first one having already read and cached the relevant disk blocks. Notice how you've got, eg, > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347rows=318 loops=1) > Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) versus > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actualtime=0.178..0.178 rows=318 loops=1) > Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) Those are the exact same subplan, so any honest comparison should be finding them to take the same amount of time. When the actual readings are different by a factor of several hundred, there's something wrong with your measurement process. In the end this comes down to whether duplicates will be eliminated more efficiently by a BitmapOr step or by sort/uniq on the resulting rows. I'd have to bet on the BitmapOr myself, but it's likely that this is down in the noise compared to the actual disk accesses in any not-fully-cached scenario. Also, if you don't expect the sub-statements to yield any duplicates, or don't care about seeing the same row twice in the output, you should consider UNION ALL instead of UNION. regards, tom lane
On 2011-08-16, adam_pgsql <adam_pgsql@witneyweb.org> wrote: > > Hi, > > I have a query hitting a table of 25 million rows. The table has a >text field ('identifier') which i need to query for matching rows. The >question is if i have multiple strings to match against this field I >can use multiple OR sub-statements or multiple statements in a UNION. >The UNION seems to run quicker.... is this to be expected? or is there >anything else I can do improve the speed of this query? Some query >details: > WHERE > ( lower(identifier) LIKE lower('BUGS0000001884677') OR > lower(identifier) LIKE lower('BUGS0000001884678') OR > lower(identifier) LIKE lower('BUGS0000001884679') OR > lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') ) > ORDER BY a.identifier; don't use like use regex. ... WHERE identifier ~* E'^(BUGS0000001884677|BUGS0000001884678|BUGS0000001884679|SpTIGR4-2210 \\(6F24\\))$' or where lower(identifier) ~* lower(E'^(BUGS0000001884677|BUGS0000001884678|BUGS0000001884679|SpTIGR4-2210 \\(6F24\\))$') on the other hand you aren't doing any pattern stuff. - you you could just use 'in': WHERE lower(identifier) in (lower('BUGS0000001884677'), lower('BUGS0000001884678'), lower('BUGS0000001884679'), lower('SpTIGR4-2210(6F24)') ) or if you need like, use like any: WHERE lower(identifier) like ANY (lower('BUGS0000001884677'), lower('BUGS0000001884678'), lower('BUGS0000001884679'), lower('SpTIGR4-2210(6F24)') ) > Also which should scale better if I add more strings to match? would there be any better design patterns for this problem? use one of the above: preferably in, else regex, or failing that like any. "= any" will also work but I don't thing it will ever be better than "in" -- ⚂⚃ 100% natural
Hi Viktor, thanks for your email, gave that a try, but the lower(identifier) LIKE lower('BUGS0000001884677') OR still comes in quicker than the IN approach thanks adam On 16 Aug 2011, at 12:56, Viktor Bojović wrote: > hi Adam, > im not sure which is faster/slower but, possibly you can speed it up by using "in" operator > > ...where lower(identifier) in (lower('BUGS0000001884677') , lower('BUGS0000001884678'),....); > > if you create function based index: > CREATE INDEX idx_table_lower_text ON table(lower(text_field)); > (taken from: http://archives.postgresql.org/pgsql-sql/2003-09/msg00395.php) > > > > > On Tue, Aug 16, 2011 at 1:39 PM, adam_pgsql <adam_pgsql@witneyweb.org> wrote: > > Hi, > > I have a query hitting a table of 25 million rows. The table has a text field ('identifier') which i need to query formatching rows. The question is if i have multiple strings to match against this field I can use multiple OR sub-statementsor multiple statements in a UNION. The UNION seems to run quicker.... is this to be expected? or is there anythingelse I can do improve the speed of this query? Some query details: > > > table "dba_data_base", index: > "in_dba_data_base_identifier" btree (lower(identifier) varchar_pattern_ops) > > > Query 1 > ------- > datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > ( lower(identifier) LIKE lower('BUGS0000001884677') OR > lower(identifier) LIKE lower('BUGS0000001884678') OR > lower(identifier) LIKE lower('BUGS0000001884679') OR > lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') ) > ORDER BY a.identifier; > QUERY PLAN > > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > -------------------------------- > Sort (cost=14930.13..14939.77 rows=3857 width=62) (actual time=3208.466..3208.652 rows=318 loops=1) > Sort Key: identifier > -> Bitmap Heap Scan on dba_data_base a (cost=134.43..14700.38 rows=3857 width=62) (actual time=81.106..3207.721 rows=318loops=1) > Recheck Cond: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~ 'bugs0000001884678'::text)OR (lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier) ~ > ~ 'sptigr4-2210 (6f24)'::text)) > Filter: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR (lower(identifier) ~~ 'bugs0000001884678'::text) OR(lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier) ~~ 'spt > igr4-2210 (6f24)'::text)) > -> BitmapOr (cost=134.43..134.43 rows=3857 width=0) (actual time=71.397..71.397 rows=0 loops=1) > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347rows=318 loops=1) > Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) > Total runtime: 3208.904 ms > > > Query 2 > ------- > datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('BUGS0000001884677') > UNION > SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('BUGS0000001884678') > UNION > SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('BUGS0000001884679') > UNION > SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') > ORDER BY identifier; > > Sort (cost=15702.26..15711.90 rows=3856 width=62) (actual time=3.688..3.886 rows=317 loops=1) > Sort Key: identifier > -> Unique (cost=15414.74..15472.58 rows=3856 width=62) (actual time=2.663..3.387 rows=317 loops=1) > -> Sort (cost=15414.74..15424.38 rows=3856 width=62) (actual time=2.660..2.834 rows=318 loops=1) > Sort Key: bioassay_id, identifier, ratio, log_ratio, p_value > -> Append (cost=32.88..15185.06 rows=3856 width=62) (actual time=0.320..2.131 rows=318 loops=1) > -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.041..0.041rows=0 loops=1) > Filter: (lower(identifier) ~~ 'bugs0000001884677'::text) > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actualtime=0.036..0.036 rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ 'bugs0000001884677'::character varying) > -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010rows=0 loops=1) > Filter: (lower(identifier) ~~ 'bugs0000001884678'::text) > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actualtime=0.008..0.008 rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ 'bugs0000001884678'::character varying) > -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010rows=0 loops=1) > Filter: (lower(identifier) ~~ 'bugs0000001884679'::text) > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actualtime=0.008..0.008 rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ 'bugs0000001884679'::character varying) > -> Bitmap Heap Scan on dba_data_base a (cost=32.88..3786.62 rows=964 width=62) (actual time=0.255..1.676rows=318 loops=1) > Filter: (lower(identifier) ~~ 'sptigr4-2210 (6f24)'::text) > -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actualtime=0.178..0.178 rows=318 loops=1) > Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) > Total runtime: 4.174 ms > > Also which should scale better if I add more strings to match? would there be any better design patterns for this problem? > > Thanks for any help > > Adam > > select version(); > version > ---------------------------------------------------------------- > PostgreSQL 8.2.12 on i686-pc-linux-gnu, compiled by GCC 2.95.4 > > > > -- > Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-sql > > > > -- > --------------------------------------- > Viktor Bojović > --------------------------------------- > Wherever I go, Murphy goes with me
On 16 Aug 2011, at 15:09, Tom Lane wrote: > adam_pgsql <adam_pgsql@witneyweb.org> writes: >> I have a query hitting a table of 25 million rows. The table has a >> text field ('identifier') which i need to query for matching rows. The >> question is if i have multiple strings to match against this field I >> can use multiple OR sub-statements or multiple statements in a >> UNION. The UNION seems to run quicker.... is this to be expected? > > Your test cases don't seem exactly comparable; in particular I think the > second one is benefiting from the first one having already read and > cached the relevant disk blocks. Notice how you've got, eg, > >> -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347rows=318 loops=1) >> Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) > > versus > >> -> Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actualtime=0.178..0.178 rows=318 loops=1) >> Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) > > Those are the exact same subplan, so any honest comparison should be > finding them to take the same amount of time. When the actual readings > are different by a factor of several hundred, there's something wrong > with your measurement process. > > In the end this comes down to whether duplicates will be eliminated more > efficiently by a BitmapOr step or by sort/uniq on the resulting rows. > I'd have to bet on the BitmapOr myself, but it's likely that this is > down in the noise compared to the actual disk accesses in any > not-fully-cached scenario. Also, if you don't expect the sub-statements > to yield any duplicates, or don't care about seeing the same row twice > in the output, you should consider UNION ALL instead of UNION. Thanks guys, I'll give some of those options a try and see which ones improve performance (Tom, yes i ran those queries after each other so there was caching going on. However, I had noticed a difference in performancewhen spacing the queries before and after a few other big queries to help clear the cache). adam
> -----Original Message----- > From: adam_pgsql [mailto:adam_pgsql@witneyweb.org] > Sent: Tuesday, August 16, 2011 7:39 AM > To: pgsql-sql > Subject: which is better: using OR clauses or UNION? > > > Hi, > > I have a query hitting a table of 25 million rows. The table has a text > field ('identifier') which i need to query for matching rows. The > question is if i have multiple strings to match against this field I > can use multiple OR sub-statements or multiple statements in a UNION. > The UNION seems to run quicker.... is this to be expected? or is there > anything else I can do improve the speed of this query? Some query > details: > > > table "dba_data_base", index: > "in_dba_data_base_identifier" btree (lower(identifier) > varchar_pattern_ops) > > > Query 1 > ------- > datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, > log_ratio, p_value > FROM > dba_data_base a > WHERE > ( lower(identifier) LIKE lower('BUGS0000001884677') OR > lower(identifier) LIKE lower('BUGS0000001884678') OR > lower(identifier) LIKE lower('BUGS0000001884679') OR > lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') ) > ORDER BY a.identifier; > > QUERY PLAN > > ----------------------------------------------------------------------- > ----------------------------------------------------------------------- > ---------------------------------------------------------- > -------------------------------- > Sort (cost=14930.13..14939.77 rows=3857 width=62) (actual > time=3208.466..3208.652 rows=318 loops=1) > Sort Key: identifier > -> Bitmap Heap Scan on dba_data_base a (cost=134.43..14700.38 > rows=3857 width=62) (actual time=81.106..3207.721 rows=318 loops=1) > Recheck Cond: ((lower(identifier) ~~ 'bugs0000001884677'::text) > OR (lower(identifier) ~~ 'bugs0000001884678'::text) OR > (lower(identifier) ~~ 'bugs0000001884679'::text) OR (lower(identifier) > ~ > ~ 'sptigr4-2210 (6f24)'::text)) > Filter: ((lower(identifier) ~~ 'bugs0000001884677'::text) OR > (lower(identifier) ~~ 'bugs0000001884678'::text) OR (lower(identifier) > ~~ 'bugs0000001884679'::text) OR (lower(identifier) ~~ 'spt > igr4-2210 (6f24)'::text)) > -> BitmapOr (cost=134.43..134.43 rows=3857 width=0) (actual > time=71.397..71.397 rows=0 loops=1) > -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=0.029..0.029 rows=0 > loops=1) > Index Cond: (lower(identifier) ~=~ > 'bugs0000001884677'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 > loops=1) > Index Cond: (lower(identifier) ~=~ > 'bugs0000001884678'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=0.008..0.008 rows=0 > loops=1) > Index Cond: (lower(identifier) ~=~ > 'bugs0000001884679'::character varying) > -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 > rows=318 loops=1) > Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 > (6f24)'::character varying) > Total runtime: 3208.904 ms > > > Query 2 > ------- > datadb=#explain analyse SELECT a.bioassay_id, a.identifier, ratio, > log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('BUGS0000001884677') > UNION > SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('BUGS0000001884678') > UNION > SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('BUGS0000001884679') > UNION > SELECT a.bioassay_id, a.identifier, ratio, log_ratio, p_value > FROM > dba_data_base a > WHERE > lower(identifier) LIKE lower('SpTIGR4-2210 (6F24)') > ORDER BY identifier; > > Sort (cost=15702.26..15711.90 rows=3856 width=62) (actual > time=3.688..3.886 rows=317 loops=1) > Sort Key: identifier > -> Unique (cost=15414.74..15472.58 rows=3856 width=62) (actual > time=2.663..3.387 rows=317 loops=1) > -> Sort (cost=15414.74..15424.38 rows=3856 width=62) (actual > time=2.660..2.834 rows=318 loops=1) > Sort Key: bioassay_id, identifier, ratio, log_ratio, > p_value > -> Append (cost=32.88..15185.06 rows=3856 width=62) > (actual time=0.320..2.131 rows=318 loops=1) > -> Bitmap Heap Scan on dba_data_base a > (cost=32.88..3786.62 rows=964 width=62) (actual time=0.041..0.041 > rows=0 loops=1) > Filter: (lower(identifier) ~~ > 'bugs0000001884677'::text) > -> Bitmap Index Scan on > in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) > (actual time=0.036..0.036 rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ > 'bugs0000001884677'::character varying) > -> Bitmap Heap Scan on dba_data_base a > (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010 > rows=0 loops=1) > Filter: (lower(identifier) ~~ > 'bugs0000001884678'::text) > -> Bitmap Index Scan on > in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) > (actual time=0.008..0.008 rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ > 'bugs0000001884678'::character varying) > -> Bitmap Heap Scan on dba_data_base a > (cost=32.88..3786.62 rows=964 width=62) (actual time=0.010..0.010 > rows=0 loops=1) > Filter: (lower(identifier) ~~ > 'bugs0000001884679'::text) > -> Bitmap Index Scan on > in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) > (actual time=0.008..0.008 rows=0 loops=1) > Index Cond: (lower(identifier) ~=~ > 'bugs0000001884679'::character varying) > -> Bitmap Heap Scan on dba_data_base a > (cost=32.88..3786.62 rows=964 width=62) (actual time=0.255..1.676 > rows=318 loops=1) > Filter: (lower(identifier) ~~ 'sptigr4-2210 > (6f24)'::text) > -> Bitmap Index Scan on > in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) > (actual time=0.178..0.178 rows=318 loops=1) > Index Cond: (lower(identifier) ~=~ > 'sptigr4-2210 (6f24)'::character varying) > Total runtime: 4.174 ms > > Also which should scale better if I add more strings to match? would > there be any better design patterns for this problem? > > Thanks for any help > > Adam > > select version(); > version > ---------------------------------------------------------------- > PostgreSQL 8.2.12 on i686-pc-linux-gnu, compiled by GCC 2.95.4 > Adam, There is something strange in your 2 execution plans. Exactly the same operation: -- first plan Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 rows=318 loops=1) Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) -- second plan Bitmap Index Scan on in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) (actual time=0.178..0.178 rows=318 loops=1) Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 (6f24)'::character varying) takes quite different time: 71.347 versus 0.178 and basically makes all the difference between duration of your first and second statement. I think, what you are seeing here is data being cached in memory (when you executed "union" statement after "or" statement). Other than that, looking at 2 execution plans, I'd say that in general "or" should run faster than "union", at least because it does "Bitmap Heap Scan on dba_data_base" only once, while "union" statement does this heap scan 4 times (once per "unionized" select). HTH, Igor Neyman
> -----Original Message----- > From: adam_pgsql [mailto:adam_pgsql@witneyweb.org] > Sent: Tuesday, August 16, 2011 11:56 AM > To: Tom Lane > Cc: pgsql-sql > Subject: Re: which is better: using OR clauses or UNION? > > > On 16 Aug 2011, at 15:09, Tom Lane wrote: > > > adam_pgsql <adam_pgsql@witneyweb.org> writes: > >> I have a query hitting a table of 25 million rows. The table has a > >> text field ('identifier') which i need to query for matching rows. > The > >> question is if i have multiple strings to match against this field I > >> can use multiple OR sub-statements or multiple statements in a > >> UNION. The UNION seems to run quicker.... is this to be expected? > > > > Your test cases don't seem exactly comparable; in particular I think > the > > second one is benefiting from the first one having already read and > > cached the relevant disk blocks. Notice how you've got, eg, > > > >> -> Bitmap Index Scan on in_dba_data_base_identifier > (cost=0.00..32.64 rows=964 width=0) (actual time=71.347..71.347 > rows=318 loops=1) > >> Index Cond: (lower(identifier) ~=~ 'sptigr4-2210 > (6f24)'::character varying) > > > > versus > > > >> -> Bitmap Index Scan on > in_dba_data_base_identifier (cost=0.00..32.64 rows=964 width=0) > (actual time=0.178..0.178 rows=318 loops=1) > >> Index Cond: (lower(identifier) ~=~ > 'sptigr4-2210 (6f24)'::character varying) > > > > Those are the exact same subplan, so any honest comparison should be > > finding them to take the same amount of time. When the actual > readings > > are different by a factor of several hundred, there's something wrong > > with your measurement process. > > > > In the end this comes down to whether duplicates will be eliminated > more > > efficiently by a BitmapOr step or by sort/uniq on the resulting rows. > > I'd have to bet on the BitmapOr myself, but it's likely that this is > > down in the noise compared to the actual disk accesses in any > > not-fully-cached scenario. Also, if you don't expect the sub- > statements > > to yield any duplicates, or don't care about seeing the same row > twice > > in the output, you should consider UNION ALL instead of UNION. > > > Thanks guys, I'll give some of those options a try and see which ones > improve performance > > (Tom, yes i ran those queries after each other so there was caching > going on. However, I had noticed a difference in performance when > spacing the queries before and after a few other big queries to help > clear the cache). > > adam Adam, Did you verify that your cache is "cleared"? Like using pg_buffercache contrib. module? Besides, there is also OS cache... Regards, Igor Neyman