Problem search on text arrays, using the overlaps (&&) operator - Mailing list pgsql-general

From John Cheng
Subject Problem search on text arrays, using the overlaps (&&) operator
Date
Msg-id 534564.57504.qm@web43406.mail.sp1.yahoo.com
Whole thread Raw
Responses Re: Problem search on text arrays, using the overlaps (&&) operator
Re: Problem search on text arrays, using the overlaps (&&) operator
List pgsql-general
We use text[] on one of our tables. This text[] column allows us to
search for records that matches a keyword in a set of keywords. For
example, if we want to find records that has a keyword of "foo" or
"bar", we can use the condition:

  keywords && '{foo, bar}'::text[]

Another wau is to do this:

  (keywords && '{foo}::text[]' OR keywords && '{bar}::text[]')

I am noticing a big difference between the two ways. I'm trying to
find out if we need to re-write our queries to speed them up, or
perhaps I am just missing something about how to use text[].

To set up a simple test case, use:

CREATE TEMP TABLE foo (
    keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);
INSERT INTO foo VALUES ('{ford}'::text[]);
INSERT INTO foo VALUES ('{toyota}'::text[]);
INSERT INTO foo VALUES ('{volkswagen}'::text[]);
INSERT INTO foo VALUES ('{dodge}'::text[]);
INSERT INTO foo VALUES ('{saturn}'::text[]);
INSERT INTO foo VALUES ('{honda}'::text[]);
INSERT INTO foo VALUES ('{porsche}'::text[]);
INSERT INTO foo VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo VALUES ('{honda, hummer, ferrari}'::text[]);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
-- Query of form 'arr && {foo, bar}'
EXPLAIN ANALYZE SELECT
    count(*)
FROM foo
WHERE keywords && '{ford, toyota}'::text[];

------ result ------
QUERY PLAN:
 Aggregate  (cost=9870.50..9870.51 rows=1 width=0) (actual time=449.937..449.938 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=104.88..9853.56 rows=6778 width=0) (actual time=61.197..308.724 rows=262144
loops=1)
         Recheck Cond: (keywords && '{ford,toyota}'::text[])
         ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=58.816..58.816
rows=262144loops=1) 
               Index Cond: (keywords && '{ford,toyota}'::text[])
 Total runtime: 450.121 ms
(6 rows)


-- Query of form 'arr && {foo} OR arr && bar'
EXPLAIN ANALYZE SELECT count(*) FROM foo WHERE
    (
        keywords && '{ford}'::text[]
        OR keywords && '{toyota}'::text[]
    )

------ result ------
QUERY PLAN:
 Aggregate  (cost=11351.85..11351.86 rows=1 width=0) (actual time=424.389..424.389 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=213.13..11318.04 rows=13522 width=0) (actual time=43.728..273.913 rows=262144
loops=1)
         Recheck Cond: ((keywords && '{ford}'::text[]) OR (keywords && '{toyota}'::text[]))
         ->  BitmapOr  (cost=213.13..213.13 rows=13556 width=0) (actual time=41.386..41.386 rows=0 loops=1)
               ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=21.216..21.216
rows=131072loops=1) 
                     Index Cond: (keywords && '{ford}'::text[])
               ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=20.167..20.167
rows=131072loops=1) 
                     Index Cond: (keywords && '{toyota}'::text[])
 Total runtime: 424.431 ms
(9 rows)


The difference is very little here. However, in our application I am
seeing a much bigger difference. The affected query is a lot more
complicated:

First, a query of the form "keywords && '{foo, bar}'::text[]"

EXPLAIN ANALYZE SELECT
    count(*)
FROM mb_lead ml
INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id
WHERE lrd.typeflags && '{autobytel.volume, automotive}'::text[];


------ result ------
QUERY PLAN:
 Aggregate  (cost=71602.10..71602.11 rows=1 width=0) (actual time=13196.895..13196.896 rows=1 loops=1)
   ->  Hash Join  (cost=60278.37..71598.14 rows=1582 width=0) (actual time=1924.076..13170.602 rows=29567 loops=1)
         Hash Cond: (ml.lead_id = lrd.lead_id)
         ->  Seq Scan on mb_lead ml  (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.014..293.214 rows=316398
loops=1)
         ->  Hash  (cost=60022.57..60022.57 rows=20464 width=8) (actual time=1922.050..1922.050 rows=473743 loops=1)
               ->  Bitmap Heap Scan on lead_reporting_data lrd  (cost=808.14..60022.57 rows=20464 width=8) (actual
time=424.841..1276.990rows=473743 loops=1) 
                     Recheck Cond: (typeflags && '{autobytel.volume,automotive}'::text[])
                     ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464 width=0)
(actualtime=308.941..308.941 rows=483587 loops=1) 
                           Index Cond: (typeflags && '{autobytel.volume,automotive}'::text[])
 Total runtime: 13197.015 ms

Second, a query of the form:
 "keywords && '{foo}'::text[] OR keywords && '{bar}'::text[]"


EXPLAIN ANALYZE SELECT
    count(*)
FROM mb_lead ml
INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id
WHERE (lrd.typeflags && '{autobytel.volume}'::text[]
    OR lrd.typeflags && '{automotive}'::text[])


------ result ------
QUERY PLAN:
 Aggregate  (cost=112761.86..112761.87 rows=1 width=0) (actual time=7768.672..7768.673 rows=1 loops=1)
   ->  Hash Join  (cost=101418.46..112753.97 rows=3156 width=0) (actual time=1850.560..7743.651 rows=29567 loops=1)
         Hash Cond: (ml.lead_id = lrd.lead_id)
         ->  Seq Scan on mb_lead ml  (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.013..274.131 rows=316398
loops=1)
         ->  Hash  (cost=100908.13..100908.13 rows=40826 width=8) (actual time=1849.519..1849.519 rows=473743 loops=1)
               ->  Bitmap Heap Scan on lead_reporting_data lrd  (cost=1626.46..100908.13 rows=40826 width=8) (actual
time=357.613..1211.535rows=473743 loops=1) 
                     Recheck Cond: ((typeflags && '{autobytel.volume}'::text[]) OR (typeflags &&
'{automotive}'::text[]))
                     ->  BitmapOr  (cost=1626.46..1626.46 rows=40928 width=0) (actual time=240.921..240.921 rows=0
loops=1)
                           ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464
width=0)(actual time=87.368..87.368 rows=161264 loops=1) 
                                 Index Cond: (typeflags && '{autobytel.volume}'::text[])
                           ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464
width=0)(actual time=153.546..153.546 rows=322317 loops=1) 
                                 Index Cond: (typeflags && '{automotive}'::text[])
 Total runtime: 7768.788 ms


-----------
For some reason, I am seeing a big difference in our real database. I
don't want to just rewrite all of our queries yet. I'm guessing the
data makes a big difference.  What would be a good way to examine the
data to figure out what's the best way to write our queries? Is there
any features in PostgreSQL that can help me improve the performance?

Any advice would be greatly appreciated!






pgsql-general by date:

Previous
From: Merrick
Date:
Subject: 1 Sequence per Row i.e. each customer's first order starts at 1
Next
From: Greg Stark
Date:
Subject: Re: 1 Sequence per Row i.e. each customer's first order starts at 1