Problem search on text arrays, using the overlaps (&&) operator - Mailing list pgsql-general
From | John Cheng |
---|---|
Subject | Problem search on text arrays, using the overlaps (&&) operator |
Date | |
Msg-id | 534564.57504.qm@web43406.mail.sp1.yahoo.com Whole thread Raw |
Responses |
Re: Problem search on text arrays, using the overlaps (&&)
operator
Re: Problem search on text arrays, using the overlaps (&&) operator |
List | pgsql-general |
We use text[] on one of our tables. This text[] column allows us to search for records that matches a keyword in a set of keywords. For example, if we want to find records that has a keyword of "foo" or "bar", we can use the condition: keywords && '{foo, bar}'::text[] Another wau is to do this: (keywords && '{foo}::text[]' OR keywords && '{bar}::text[]') I am noticing a big difference between the two ways. I'm trying to find out if we need to re-write our queries to speed them up, or perhaps I am just missing something about how to use text[]. To set up a simple test case, use: CREATE TEMP TABLE foo ( keywords text[] ); CREATE INDEX foo_idx ON foo USING gin (keywords); INSERT INTO foo VALUES ('{ford}'::text[]); INSERT INTO foo VALUES ('{toyota}'::text[]); INSERT INTO foo VALUES ('{volkswagen}'::text[]); INSERT INTO foo VALUES ('{dodge}'::text[]); INSERT INTO foo VALUES ('{saturn}'::text[]); INSERT INTO foo VALUES ('{honda}'::text[]); INSERT INTO foo VALUES ('{porsche}'::text[]); INSERT INTO foo VALUES ('{porsche, audi, chrysler}'::text[]); INSERT INTO foo VALUES ('{honda, hummer, ferrari}'::text[]); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); INSERT INTO foo (SELECT keywords FROM foo); -- Query of form 'arr && {foo, bar}' EXPLAIN ANALYZE SELECT count(*) FROM foo WHERE keywords && '{ford, toyota}'::text[]; ------ result ------ QUERY PLAN: Aggregate (cost=9870.50..9870.51 rows=1 width=0) (actual time=449.937..449.938 rows=1 loops=1) -> Bitmap Heap Scan on foo (cost=104.88..9853.56 rows=6778 width=0) (actual time=61.197..308.724 rows=262144 loops=1) Recheck Cond: (keywords && '{ford,toyota}'::text[]) -> Bitmap Index Scan on foo_idx (cost=0.00..103.19 rows=6778 width=0) (actual time=58.816..58.816 rows=262144loops=1) Index Cond: (keywords && '{ford,toyota}'::text[]) Total runtime: 450.121 ms (6 rows) -- Query of form 'arr && {foo} OR arr && bar' EXPLAIN ANALYZE SELECT count(*) FROM foo WHERE ( keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[] ) ------ result ------ QUERY PLAN: Aggregate (cost=11351.85..11351.86 rows=1 width=0) (actual time=424.389..424.389 rows=1 loops=1) -> Bitmap Heap Scan on foo (cost=213.13..11318.04 rows=13522 width=0) (actual time=43.728..273.913 rows=262144 loops=1) Recheck Cond: ((keywords && '{ford}'::text[]) OR (keywords && '{toyota}'::text[])) -> BitmapOr (cost=213.13..213.13 rows=13556 width=0) (actual time=41.386..41.386 rows=0 loops=1) -> Bitmap Index Scan on foo_idx (cost=0.00..103.19 rows=6778 width=0) (actual time=21.216..21.216 rows=131072loops=1) Index Cond: (keywords && '{ford}'::text[]) -> Bitmap Index Scan on foo_idx (cost=0.00..103.19 rows=6778 width=0) (actual time=20.167..20.167 rows=131072loops=1) Index Cond: (keywords && '{toyota}'::text[]) Total runtime: 424.431 ms (9 rows) The difference is very little here. However, in our application I am seeing a much bigger difference. The affected query is a lot more complicated: First, a query of the form "keywords && '{foo, bar}'::text[]" EXPLAIN ANALYZE SELECT count(*) FROM mb_lead ml INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id WHERE lrd.typeflags && '{autobytel.volume, automotive}'::text[]; ------ result ------ QUERY PLAN: Aggregate (cost=71602.10..71602.11 rows=1 width=0) (actual time=13196.895..13196.896 rows=1 loops=1) -> Hash Join (cost=60278.37..71598.14 rows=1582 width=0) (actual time=1924.076..13170.602 rows=29567 loops=1) Hash Cond: (ml.lead_id = lrd.lead_id) -> Seq Scan on mb_lead ml (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.014..293.214 rows=316398 loops=1) -> Hash (cost=60022.57..60022.57 rows=20464 width=8) (actual time=1922.050..1922.050 rows=473743 loops=1) -> Bitmap Heap Scan on lead_reporting_data lrd (cost=808.14..60022.57 rows=20464 width=8) (actual time=424.841..1276.990rows=473743 loops=1) Recheck Cond: (typeflags && '{autobytel.volume,automotive}'::text[]) -> Bitmap Index Scan on lead_reporting_data_typeflags_idx (cost=0.00..803.02 rows=20464 width=0) (actualtime=308.941..308.941 rows=483587 loops=1) Index Cond: (typeflags && '{autobytel.volume,automotive}'::text[]) Total runtime: 13197.015 ms Second, a query of the form: "keywords && '{foo}'::text[] OR keywords && '{bar}'::text[]" EXPLAIN ANALYZE SELECT count(*) FROM mb_lead ml INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id WHERE (lrd.typeflags && '{autobytel.volume}'::text[] OR lrd.typeflags && '{automotive}'::text[]) ------ result ------ QUERY PLAN: Aggregate (cost=112761.86..112761.87 rows=1 width=0) (actual time=7768.672..7768.673 rows=1 loops=1) -> Hash Join (cost=101418.46..112753.97 rows=3156 width=0) (actual time=1850.560..7743.651 rows=29567 loops=1) Hash Cond: (ml.lead_id = lrd.lead_id) -> Seq Scan on mb_lead ml (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.013..274.131 rows=316398 loops=1) -> Hash (cost=100908.13..100908.13 rows=40826 width=8) (actual time=1849.519..1849.519 rows=473743 loops=1) -> Bitmap Heap Scan on lead_reporting_data lrd (cost=1626.46..100908.13 rows=40826 width=8) (actual time=357.613..1211.535rows=473743 loops=1) Recheck Cond: ((typeflags && '{autobytel.volume}'::text[]) OR (typeflags && '{automotive}'::text[])) -> BitmapOr (cost=1626.46..1626.46 rows=40928 width=0) (actual time=240.921..240.921 rows=0 loops=1) -> Bitmap Index Scan on lead_reporting_data_typeflags_idx (cost=0.00..803.02 rows=20464 width=0)(actual time=87.368..87.368 rows=161264 loops=1) Index Cond: (typeflags && '{autobytel.volume}'::text[]) -> Bitmap Index Scan on lead_reporting_data_typeflags_idx (cost=0.00..803.02 rows=20464 width=0)(actual time=153.546..153.546 rows=322317 loops=1) Index Cond: (typeflags && '{automotive}'::text[]) Total runtime: 7768.788 ms ----------- For some reason, I am seeing a big difference in our real database. I don't want to just rewrite all of our queries yet. I'm guessing the data makes a big difference. What would be a good way to examine the data to figure out what's the best way to write our queries? Is there any features in PostgreSQL that can help me improve the performance? Any advice would be greatly appreciated!
pgsql-general by date: