Thread: Problem search on text arrays, using the overlaps (&&) operator
We use text[] on one of our tables. This text[] column allows us to search for records that match a keyword in a set of keywords. For example, if we want to find records that have a keyword of "foo" or "bar", we can use the condition:

    keywords && '{foo, bar}'::text[]

Another way is to do this:

    (keywords && '{foo}'::text[] OR keywords && '{bar}'::text[])

I am noticing a big difference between the two ways. I'm trying to find out if we need to rewrite our queries to speed them up, or perhaps I am just missing something about how to use text[].

To set up a simple test case, use:

CREATE TEMP TABLE foo (
    keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);

INSERT INTO foo VALUES ('{ford}'::text[]);
INSERT INTO foo VALUES ('{toyota}'::text[]);
INSERT INTO foo VALUES ('{volkswagen}'::text[]);
INSERT INTO foo VALUES ('{dodge}'::text[]);
INSERT INTO foo VALUES ('{saturn}'::text[]);
INSERT INTO foo VALUES ('{honda}'::text[]);
INSERT INTO foo VALUES ('{porsche}'::text[]);
INSERT INTO foo VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo VALUES ('{honda, hummer, ferrari}'::text[]);

-- Double the row count each time
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);

-- Query of form 'arr && {foo, bar}'
EXPLAIN ANALYZE SELECT count(*) FROM foo
WHERE keywords && '{ford, toyota}'::text[];

------ result ------

 Aggregate  (cost=9870.50..9870.51 rows=1 width=0) (actual time=449.937..449.938 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=104.88..9853.56 rows=6778 width=0) (actual time=61.197..308.724 rows=262144 loops=1)
         Recheck Cond: (keywords && '{ford,toyota}'::text[])
         ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=58.816..58.816 rows=262144 loops=1)
               Index Cond: (keywords && '{ford,toyota}'::text[])
 Total runtime: 450.121 ms
(6 rows)

-- Query of form 'arr && {foo} OR arr && {bar}'
EXPLAIN ANALYZE SELECT count(*) FROM foo
WHERE (keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[]);

------ result ------

 Aggregate  (cost=11351.85..11351.86 rows=1 width=0) (actual time=424.389..424.389 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=213.13..11318.04 rows=13522 width=0) (actual time=43.728..273.913 rows=262144 loops=1)
         Recheck Cond: ((keywords && '{ford}'::text[]) OR (keywords && '{toyota}'::text[]))
         ->  BitmapOr  (cost=213.13..213.13 rows=13556 width=0) (actual time=41.386..41.386 rows=0 loops=1)
               ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=21.216..21.216 rows=131072 loops=1)
                     Index Cond: (keywords && '{ford}'::text[])
               ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=20.167..20.167 rows=131072 loops=1)
                     Index Cond: (keywords && '{toyota}'::text[])
 Total runtime: 424.431 ms
(9 rows)

The difference is very little here. However, in our application I am seeing a much bigger difference.
The affected query is a lot more complicated. First, a query of the form "keywords && '{foo, bar}'::text[]":

EXPLAIN ANALYZE SELECT count(*)
FROM mb_lead ml
INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id
WHERE lrd.typeflags && '{autobytel.volume, automotive}'::text[];

------ result ------

 Aggregate  (cost=71602.10..71602.11 rows=1 width=0) (actual time=13196.895..13196.896 rows=1 loops=1)
   ->  Hash Join  (cost=60278.37..71598.14 rows=1582 width=0) (actual time=1924.076..13170.602 rows=29567 loops=1)
         Hash Cond: (ml.lead_id = lrd.lead_id)
         ->  Seq Scan on mb_lead ml  (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.014..293.214 rows=316398 loops=1)
         ->  Hash  (cost=60022.57..60022.57 rows=20464 width=8) (actual time=1922.050..1922.050 rows=473743 loops=1)
               ->  Bitmap Heap Scan on lead_reporting_data lrd  (cost=808.14..60022.57 rows=20464 width=8) (actual time=424.841..1276.990 rows=473743 loops=1)
                     Recheck Cond: (typeflags && '{autobytel.volume,automotive}'::text[])
                     ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464 width=0) (actual time=308.941..308.941 rows=483587 loops=1)
                           Index Cond: (typeflags && '{autobytel.volume,automotive}'::text[])
 Total runtime: 13197.015 ms

Second, a query of the form "keywords && '{foo}'::text[] OR keywords && '{bar}'::text[]":

EXPLAIN ANALYZE SELECT count(*)
FROM mb_lead ml
INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id
WHERE (lrd.typeflags && '{autobytel.volume}'::text[]
    OR lrd.typeflags && '{automotive}'::text[]);

------ result ------

 Aggregate  (cost=112761.86..112761.87 rows=1 width=0) (actual time=7768.672..7768.673 rows=1 loops=1)
   ->  Hash Join  (cost=101418.46..112753.97 rows=3156 width=0) (actual time=1850.560..7743.651 rows=29567 loops=1)
         Hash Cond: (ml.lead_id = lrd.lead_id)
         ->  Seq Scan on mb_lead ml  (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.013..274.131 rows=316398 loops=1)
         ->  Hash  (cost=100908.13..100908.13 rows=40826 width=8) (actual time=1849.519..1849.519 rows=473743 loops=1)
               ->  Bitmap Heap Scan on lead_reporting_data lrd  (cost=1626.46..100908.13 rows=40826 width=8) (actual time=357.613..1211.535 rows=473743 loops=1)
                     Recheck Cond: ((typeflags && '{autobytel.volume}'::text[]) OR (typeflags && '{automotive}'::text[]))
                     ->  BitmapOr  (cost=1626.46..1626.46 rows=40928 width=0) (actual time=240.921..240.921 rows=0 loops=1)
                           ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464 width=0) (actual time=87.368..87.368 rows=161264 loops=1)
                                 Index Cond: (typeflags && '{autobytel.volume}'::text[])
                           ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464 width=0) (actual time=153.546..153.546 rows=322317 loops=1)
                                 Index Cond: (typeflags && '{automotive}'::text[])
 Total runtime: 7768.788 ms

-----------
For some reason, I am seeing a big difference in our real database. I don't want to just rewrite all of our queries yet. I'm guessing the data makes a big difference. What would be a good way to examine the data to figure out the best way to write our queries? Are there any features in PostgreSQL that can help me improve the performance?

Any advice would be greatly appreciated!
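[Editor's note: one way to examine the keyword distribution in a column like this is to count how often each flag occurs. This is a sketch, not from the thread; the table and column names are taken from the plans above, and unnest() is only built in from 8.4 onwards, so on 8.3 an equivalent helper function would be needed:]

```sql
-- How many rows carry each flag? Very common flags make the bitmap
-- scans above touch a large fraction of the table, which is what the
-- actual row counts (473743 of ~500k rows) suggest is happening here.
SELECT kw, count(*) AS n
FROM (SELECT unnest(typeflags) AS kw
      FROM lead_reporting_data) AS s
GROUP BY kw
ORDER BY n DESC;
```

The planner's row estimates for && in 8.3 (rows=20464 estimated vs. 473743 actual above) are largely default-driven, so a frequency count like this is about the only way to see how selective each keyword really is.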
John Cheng wrote:
> -----------
> For some reason, I am seeing a big difference in our real database. I
> don't want to just rewrite all of our queries yet. I'm guessing the
> data makes a big difference. What would be a good way to examine the
> data to figure out what's the best way to write our queries? Is there
> any features in PostgreSQL that can help me improve the performance?
>
> Any advice would be greatly appreciated!

Hi,

did you think about using the full text search integrated from version 8.3 onwards? I have never used your approach and don't know if full text search is suitable for your case ... just a hint.

http://www.postgresql.org/docs/8.4/interactive/textsearch.html

Cheers

Andy
Hi Andreas,

I'm afraid full text search won't fit our app here. Our application tags each record with "source flags", a text[] of strings describing where the record came from. These flags are already passed in pre-parsed when we store the records, so we can simply store them as text[]. In contrast, the full text search approach would mean storing these flags as one single string and then using to_tsvector() to have PostgreSQL parse it out again. That round trip doesn't seem to make sense for us.

I'm also suspicious that the same type of problem would affect queries on tsvector columns, but I have not tested that myself.

----- Original Message -----
From: "Andreas Wenk" <a.wenk@netzmeister-st-pauli.de>
To: "John Cheng" <jlcheng@ymail.com>, "PG-General Mailing List" <pgsql-general@postgresql.org>
Sent: Friday, July 3, 2009 2:12:46 AM GMT -08:00 US/Canada Pacific
Subject: Re: [GENERAL] Problem search on text arrays, using the overlaps (&&) operator
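[Editor's note: to make the comparison concrete, this is roughly what the full text alternative would look like next to the current approach. A sketch only; "flags_tsv" is a hypothetical column, and the 'simple' configuration is used to avoid stemming:]

```sql
-- Current approach: flags arrive pre-parsed and are stored as-is.
--   ... WHERE typeflags && '{automotive, hummer}'::text[]

-- Full text alternative: glue the flags back into one string so that
-- PostgreSQL can parse them apart again into a tsvector ...
ALTER TABLE lead_reporting_data ADD COLUMN flags_tsv tsvector;
UPDATE lead_reporting_data
   SET flags_tsv = to_tsvector('simple', array_to_string(typeflags, ' '));

-- ... and query with a tsquery instead of &&:
--   ... WHERE flags_tsv @@ to_tsquery('simple', 'automotive | hummer')
```

Note that flags like 'autobytel.volume' contain punctuation that the text search parser treats specially, which is another reason the round trip is awkward for this data.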
Hello,

On 2/07/09 2:07, John Cheng wrote:
> We use text[] on one of our tables. This text[] column allows us to
> search for records that match a keyword in a set of keywords. For
> example, if we want to find records that have a keyword of "foo" or
> "bar", we can use the condition:
>
> keywords && '{foo, bar}'::text[]
>
> Another way is to do this:
>
> (keywords && '{foo}'::text[] OR keywords && '{bar}'::text[])
>
> I am noticing a big difference between the two ways. I'm trying to
> find out if we need to re-write our queries to speed them up, or
> perhaps I am just missing something about how to use text[].
>
> [...]
> For some reason, I am seeing a big difference in our real database. I
> don't want to just rewrite all of our queries yet. I'm guessing the
> data makes a big difference. What would be a good way to examine the
> data to figure out what's the best way to write our queries? Is there
> any features in PostgreSQL that can help me improve the performance?
>
> Any advice would be greatly appreciated!

With your exhaustive example statements based on table foo and cars, I performed some measurements on my side (PostgreSQL 8.3.1 server).
Here are some statistical results:

seq    rtm   delta   ratio  deco
----  ----  ------  -----  --------
s1cc   873  -1.74%  2//9   1+1
s1or   889  +1.71%  2//9   1+1
s2cc  1322  +8.53%  3//9   1+2
s2or  1209  -9.32%  3//9   1+2
s3cc   892  -2.61%  2//9   1+(.5*2)
s3or   915  +2.54%  2//9   1+(.5*2)
s4cc   511  -3.00%  1//9   (.5*2)
s4or   526  +2.91%  1//9   (.5*2)
s5cc  1635  +2.13%  4//9   1+1+2
s5or  1600  -2.17%  4//9   1+1+2
----  ----  ------  -----  --------

seq   where clause
----  ------------
s1cc  keywords && '{ford, toyota}'::text[]
s1or  keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[]
s2cc  keywords && '{ford, honda}'::text[]
s2or  keywords && '{ford}'::text[] OR keywords && '{honda}'::text[]
s3cc  keywords && '{honda, ferrari}'::text[]
s3or  keywords && '{honda}'::text[] OR keywords && '{ferrari}'::text[]
s4cc  keywords && '{ferrari, hummer}'::text[]
s4or  keywords && '{ferrari}'::text[] OR keywords && '{hummer}'::text[]
s5cc  keywords && '{ford, toyota, porsche}'::text[]
s5or  keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[] OR keywords && '{porsche}'::text[]

legend
------
seq    sequence of 10 subsequent EXPLAIN ANALYZE runs per row
rtm    runtime mean (in milliseconds) of the 10 subsequent measures
delta  difference percentage between the cc and or sequences
cc     single where clause with a >1-element array (e.g. {foo,bar})
or     multiple where clauses with 1-element arrays (e.g. {foo})
ratio  ratio between # of result rows and # of table rows
deco   breakdown of result rows between the constant array values in the where clause

In the following, I refer to your condition forms as:
- arr&&{f,b}
- arr&&{f} or arr&&{b}

I noticed first that, contrary to your observation, for the "ford or toyota" case (sequence s1, developed into the two subcases s1cc and s1or for the two condition forms), the runtime mean is shorter for s1cc (arr&&{f,t}) than for s1or (arr&&{f} or arr&&{t}). But the difference percentage is only about 1.7% (i.e. not really significant).
This empirical advantage of form arr&&{f,t} over form (arr&&{f} or arr&&{t}) is also observed for 2 cases (s3 and s4) out of 4 (s2 up to s5). There the difference percentage looks more significant (about 3%).

The cases s3 and s4 differ from the others (s1, s2, and s5) in that the sets of rows matching each compared array value intersect: all the rows matched by ferrari also match honda (strict inclusion, not equality); all the rows matched by ferrari also match hummer and conversely (double inclusion here, i.e. equality). In the other 3 cases, each compared array value matches a set of rows that does not intersect the matching row sets of the other value(s). We may then assume that form arr&&{f,t} fits better when many rows are matched by several terms; however, this cannot be generalized at this stage.

The reported data also suggest a linear relationship between runtime and # of result rows. Here this relationship seems equally applicable to both forms arr&&{f,t} and (arr&&{f} or arr&&{t}).

Beyond these measures, I notice that, in the query plans of your queries over real data, the difference between the actual runtimes reported for the two condition forms is not that significant with respect to the overall runtime of the queries. At the bitmap heap scan on the table over which the conditions are performed, when the last row is retrieved, the actual runtime is respectively:

- for arr&&{f,b}: 1276.990 ms;
- for (arr&&{f} or arr&&{b}): 1211.535 ms.

While quite close (a difference of about 5%), the difference is not really harmful with respect to the overall runtimes (resp. 13197 ms and 7768 ms), i.e. as a share of overall runtime, resp. (1276/13197=) 10% and (1211/7768=) 16%. Even if the overlap operator runtimes were interchanged between the 2 cases, the ratios would be resp. (1211/13197=) 9% and (1276/7768=) 16%, i.e. not so different from the current plan.
On the other hand, the hash join runtime part is very large:

- for the arr&&{f,b} case: (13170-1924=) 11246 ms;
- for the (arr&&{f} or arr&&{b}) case: (7743-1850=) 5893 ms.

These values represent resp. (11246/13197=) 85% and (5893/7768=) 76% of the overall runtimes, i.e. they clearly dominate the overall query plan.

In my opinion, analysis and optimization may be deepened over the table indexes used for join planning. As your reported query plans show, the WHERE clauses are performed independently from the table mb_lead; the reason is that all the attributes of the clauses belong to the table lead_reporting_data. Time may be reduced on the join condition itself.

Hoping this observation will contribute a little to your opinion. Without any claim, I attached a document to this email with details on the measures I took with the overlap operator -- OpenDocument Spreadsheet (ODS) v2 formatted file, 24 kiB. The 3rd sheet "various" presents the detailed measures related to the data reported in this email.

Regards.

--
nha / Lyon / France.
It would be nice if during create role we could have a parameter to set the number of days that a password is valid instead of just a timestamp. Best Regards -- Michael Gould, Managing Partner Intermodal Software Solutions, LLC 904.226.0978 904.592.5250 fax
On 06/07/2009 19:32, Michael Gould wrote:
> It would be nice if during create role we could have a parameter to set the
> number of days that a password is valid instead of just a timestamp.

Would (current_timestamp + interval '365 days') work? Dunno myself - just thinking out loud... :-)

Ray.

------------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
rod@iol.ie
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
------------------------------------------------------------------
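[Editor's note: one wrinkle with that suggestion, as an assumption from the CREATE ROLE grammar rather than anything tested in this thread: VALID UNTIL expects a string literal, not an expression, so the interval arithmetic has to be evaluated first and spliced into the DDL string. On servers new enough to have DO blocks and format() (9.1+), that could look roughly like this; "app_user" is a hypothetical role name:]

```sql
-- Sketch only: compute the expiry timestamp, then build the DDL
-- string with EXECUTE, because VALID UNTIL will not accept an
-- expression directly.
DO $$
BEGIN
  EXECUTE format('ALTER ROLE app_user VALID UNTIL %L',
                 (current_timestamp + interval '365 days')::text);
END
$$;
```

On older versions the same computation would have to happen client-side before issuing the ALTER ROLE.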
----- "nha" <lyondif02@free.fr> wrote:

> From: "nha" <lyondif02@free.fr>
> To: "John Cheng" <jlcheng@ymail.com>
> Cc: pgsql-general@postgresql.org
> Sent: Monday, July 6, 2009 9:12:22 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [GENERAL] Problem search on text arrays, using the overlaps (&&) operator
>
> Hello,
>
> With your exhaustive example statements based on table foo and cars, I
> performed some measures on my side (PostgreSQL 8.3.1 server). Here are
> some statistical results:
> [ ... snipped ... ]
>
> In my opinion, analysis and optimization may be deepened over the table
> indexes used for join planning. As your reported query plans show, the
> Where clauses are performed independently from the table mb_lead; the
> reason is that all the attributes of the clauses belong to the table
> lead_reporting_data. Time may be reduced on join condition
> achievements.
>
> Hoping this observation will contribute a little to your opinion.
>
> Regards.
>
> --
> nha / Lyon / France.

Hi nha,

I had not expected anyone to go to such lengths to evaluate my situation, thank you so much! After looking at your analysis, I realized that the test case I created isn't close enough to the queries running in our prod environment. For one, table 'foo' does not join to another table; the other thing is that the amount of data isn't the same; finally, these tables have been ANALYZEd. So I took some time to update the test case.

On our server, running 8.3.6, I was able to reproduce the difference between the two styles: "arr&&{f,b}" and "arr&&{f} or arr&&{b}".
First, the setup:

-- Begin test case
-- Sets up 'bar'
SELECT id INTO TEMP TABLE bar
FROM (SELECT generate_series(1,300000) AS id) AS bar;
CREATE INDEX bar_idx ON bar (id);
ANALYZE bar;

-- Sets up 'foo'
CREATE TEMP SEQUENCE foo_bar_id_seq;
CREATE TEMP TABLE foo (
    bar_id numeric DEFAULT NEXTVAL('foo_bar_id_seq'),
    keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);

INSERT INTO foo (keywords) VALUES ('{ford}'::text[]);
INSERT INTO foo (keywords) VALUES ('{toyota}'::text[]);
INSERT INTO foo (keywords) VALUES ('{volkswagen}'::text[]);
INSERT INTO foo (keywords) VALUES ('{saturn}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda, hummer, ferrari}'::text[]);

-- Double the row count each time
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);

ANALYZE foo;
-- End test case

Query for the form "arr&&{f,b}":

SELECT count(*) FROM foo
INNER JOIN bar ON foo.bar_id = bar.id
WHERE foo.keywords && '{ford, toyota, volkswagen, saturn, honda, porsche, hummer, ferrari}'::text[];

Query for the form "arr&&{f} or arr&&{b}":

SELECT count(*) FROM foo, bar
WHERE foo.bar_id = bar.id
AND ( keywords && '{ford}'::text[]
   OR keywords && '{toyota}'::text[]
   OR keywords && '{volkswagen}'::text[]
   OR keywords && '{saturn}'::text[]
   OR keywords && '{honda}'::text[]
   OR keywords && '{porsche}'::text[]
   OR keywords && '{hummer}'::text[]
   OR keywords && '{ferrari}'::text[] );

For the first form, "arr&&{f,b}", the query takes about 15 seconds. For the second form, "arr&&{f} or arr&&{b}", we get about 8 seconds. The difference is around 90-100%, which is what I am seeing in our real-life queries, and the query plans also become similar to the real-life query plans. But I am having a hard time learning from them. The only interesting thing I see is that the estimated cost differs from the actual runtime: the second form has a higher estimated cost than the first form, but a lower runtime.

In this test case, the query filters by any of 8 keywords. Note that with just 2 keywords, the difference is only about 5 seconds. The specific report that our users complained about involved 16 "keywords", where the difference is about 100%.

When you said "analysis and optimization may be deepen over table indexes used for join planning", I'm not sure what you mean. Can you clarify?
I don't mean to be pesky. I was just wondering if there is anything else I should try?

Should I simply rewrite all queries, changing the form

WHERE textarr && '{foo, bar}'::text[]

to

WHERE (textarr && '{foo}'::text[]
    OR textarr && '{bar}'::text[])

?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hello,

On 8/07/09 0:52, John Cheng wrote:
> I don't mean to be pesky. I was just wondering if there is anything
> else I should try?
>
> Should I simply rewrite all queries, change the form
>
> WHERE textarr && '{foo, bar}'::text[]
>
> To
>
> WHERE (textarr && '{foo}'::text[]
>     OR textarr && '{bar}'::text[])
>
> ?

While still puzzled by the big runtime difference you report between the 2 condition forms, I went on assessing these runtimes on my side with the new case statements, which are assumed to reflect the real world more closely. Here are some measurement results I got (sorry for this long table):

seq  style  runtime (ms)
---  -----  ------------
(db=slf)
N01  OR-EA   6 237
N02  CC-EA   5 250
N03  OR+EA  12 628
N04  CC+EA  12 700
N05  OR+EA  15 679
N06  CC+EA  11 510
N07  CC-EA   7 712
N08  OR-EA   8 741
N09  CC-EA   4 963
N10  OR-EA   6 499
(db=stg)
N11  CC+EA  15 978
N12  OR+EA  15 350
N13  CC-EA   8 102
N14  OR-EA   9 428
N15  OR-EA   5 267
N16  CC-EA   5 017
N17  OR-EA   6 119
N18  CC-EA   4 955
N19  OR+EA  11 722
N20  CC+EA  11 532
N21  OR-EA   7 303
N22  CC-EA   5 438
N23  CC-EA   5 519
N24  OR-EA   5 373
N25  OR-EA   5 422
N26  CC-EA   5 064
(db=stg)
N27  CC-EA   8 314
(db=slf)
N28  OR-EA   6 656
(db=stg)
N29  OR-EA   6 760
(db=slf)
N30  CC-EA   6 753
(db=stg)
N31  CC-EA   5 500
(db=slf)
N32  OR-EA   5 907
(db=stg)
N33  OR-EA   5 391
(db=slf)
N34  CC-EA   5 517
---  -----  ------------

Legend
------
seq: sequence order.
style: condition style of query.
CC: style "arr&&{f,b}" (one clause with a multi-value text array).
OR: style "arr&&{f} or arr&&{b}" (many clauses with 1-value text arrays).
OR2: same style as OR, with explicit JOIN in the query expression.
+EA: performed with EXPLAIN ANALYZE on the query. Slower.
-EA: performed without EXPLAIN ANALYZE on the query. Faster.
runtime: run time in milliseconds.
(db=?): indicates that the following sequences were performed after a drop-and-create process for all the tables and indexes.
------ Results from 2 selected EXPLAIN ANALYZE sequences:

-- seq 03 (OR+EA)
 Aggregate  (cost=37630.52..37630.53 rows=1 width=0) (actual time=12628.182..12628.184 rows=1 loops=1)
   ->  Hash Join  (cost=25989.12..37601.04 rows=11792 width=0) (actual time=8796.002..12231.422 rows=300000 loops=1)
         Hash Cond: ((bar.id)::numeric = foo.bar_id)
         ->  Seq Scan on bar  (cost=0.00..4328.00 rows=300000 width=4) (actual time=0.025..402.458 rows=300000 loops=1)
         ->  Hash  (cost=24636.81..24636.81 rows=82425 width=8) (actual time=8795.027..8795.027 rows=2097152 loops=1)
               ->  Bitmap Heap Scan on foo  (cost=1565.44..24636.81 rows=82425 width=8) (actual time=670.248..5098.109 rows=2097152 loops=1)
                     Recheck Cond: ((keywords && '{ford}'::text[]) OR (keywords && '{toyota}'::text[]) OR (keywords && '{volkswagen}'::text[]) OR (keywords && '{saturn}'::text[]) OR (keywords && '{honda}'::text[]) OR (keywords && '{porsche}'::text[]) OR (keywords && '{hummer}'::text[]) OR (keywords && '{ferrari}'::text[]))
                     ->  BitmapOr  (cost=1565.44..1565.44 rows=83879 width=0) (actual time=665.516..665.516 rows=0 loops=1)
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=114.013..114.013 rows=262144 loops=1)
                                 Index Cond: (keywords && '{ford}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=72.398..72.398 rows=262144 loops=1)
                                 Index Cond: (keywords && '{toyota}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=74.118..74.118 rows=262144 loops=1)
                                 Index Cond: (keywords && '{volkswagen}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=58.486..58.486 rows=262144 loops=1)
                                 Index Cond: (keywords && '{saturn}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=114.671..114.671 rows=524288 loops=1)
                                 Index Cond: (keywords && '{honda}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=115.290..115.290 rows=524288 loops=1)
                                 Index Cond: (keywords && '{porsche}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=58.184..58.184 rows=262144 loops=1)
                                 Index Cond: (keywords && '{hummer}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=58.336..58.336 rows=262144 loops=1)
                                 Index Cond: (keywords && '{ferrari}'::text[])
 Total runtime: 12628.401 ms
-- seq 03 (OR+EA)

-- seq 04 (CC+EA)
 Aggregate  (cost=26726.37..26726.38 rows=1 width=0) (actual time=12700.620..12700.621 rows=1 loops=1)
   ->  Hash Join  (cost=17879.62..26722.62 rows=1500 width=0) (actual time=8482.572..12272.449 rows=300000 loops=1)
         Hash Cond: ((bar.id)::numeric = foo.bar_id)
         ->  Seq Scan on bar  (cost=0.00..4328.00 rows=300000 width=4) (actual time=0.029..412.984 rows=300000 loops=1)
         ->  Hash  (cost=17748.56..17748.56 rows=10485 width=8) (actual time=8481.524..8481.524 rows=2097152 loops=1)
               ->  Bitmap Heap Scan on foo  (cost=177.69..17748.56 rows=10485 width=8) (actual time=978.464..4679.954 rows=2097152 loops=1)
                     Recheck Cond: (keywords && '{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
                     ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=973.569..973.569 rows=2097152 loops=1)
                           Index Cond: (keywords && '{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
 Total runtime: 12700.838 ms
-- seq 04 (CC+EA)

As far as I tried, the runtime difference between the 2 forms varies up to 10% and stays under 1 second -- i.e. far away from the 100% and (15-8=) 7 seconds reported. Moreover, the form "arr&&{f,b}" (let us call it form 1) often shows better than the other (let us call it form 2) -- a fact already noticed in the previous measures without joins. An exception occurs in one special case: the first query submitted immediately after populating the tables. It seems as if the indexes were not yet optimized at that time.
But all the following queries, whichever form is used, run in lesser and similar time. When observing the runtime of the intermediate steps (bitmap heap scan on table foo [step 1], sequential scan on table bar [step 2], hash join [step 3]), it can be noted that all 3 steps run in comparable time whichever form is used:

- no relevant difference for steps 2 and 3, only an expectedly long time for retrieving all the resulting rows because of their high number;
- for step 1: form 1 (arr&&{f,b}) needs more time to extract the 1st row of the result set but is faster at collecting the last rows, so no significant difference can be seen here either.

As I mentioned, the runtime is longer for the 1st query performed after populating the tables, and from the 2nd query onwards runtimes clearly decrease towards a mean (about 6 seconds on my side). Maybe the runtimes you reported (15 and 8 seconds respectively) occurred in such a case (just after populating the tables), with a run order starting with form 1. As far as I saw, the sequence order acts upon the runtime reduction; moreover, successive queries tend towards mean and equal runtimes with the 2 forms.

About the user complaint, I would not be so astonished, simply because of the duration of any single query (more than 5 seconds in both cases). There are so many rows to select that it does not seem really possible to get a lower runtime.

What I may conclude: the query plans show a difference in how the queries are performed: each WHERE clause is assessed before a common heap scan of table foo. The WHERE clause of form 1 (multi-value array) takes longer to run because of the multiple values to test for each key; each WHERE clause of form 2 (1-value array) is faster, but the sum of the corresponding 1-value clauses runs in a comparable time. This observation resulted from measures on my side and may not correctly reflect the configuration of the real production environment.
Opting for form 2 (arr&&{f} or arr&&{b}) will surely not make runtime worse, so it would not be a bad choice to implement it if convinced.

Another way concerns the hash join. It has been shown that this step costs a lot with respect to the overall runtime. Depending on available storage space and DBMS load, a kind of materialized view may be maintained in order to cut off the overloading join. Here are some suggested statements to create this helper table:

-- Creates a table jfoobar wrt. tables foo and bar
-- (max 10 seconds elapsed on my platform)
CREATE TABLE jfoobar AS
SELECT foo.bar_id, foo.keywords
FROM foo INNER JOIN bar ON foo.bar_id = bar.id;

-- Creates an index for jfoobar
-- (max 3 seconds elapsed on my platform)
-- this way...
CREATE INDEX jfoobar_idx ON jfoobar(bar_id, keywords);
-- or this one...
DROP INDEX jfoobar_idx;
CREATE INDEX jfoobar_idx ON jfoobar USING gin(keywords);

This table may be updated or replaced on a regular time base, or by a triggered event on updates to the foo or bar tables. Finally, any query of form 1 or 2 will then run in less than 300 ms:

-- form 1: single multi-value array
SELECT count(*) FROM jfoobar
WHERE keywords && '{ford, toyota, volkswagen, saturn, honda, porsche, hummer, ferrari}'::text[];

-- form 2: developed into 1-value arrays
SELECT count(*) FROM jfoobar
WHERE ( keywords && '{ford}'::text[]
     OR keywords && '{toyota}'::text[]
     OR keywords && '{volkswagen}'::text[]
     OR keywords && '{saturn}'::text[]
     OR keywords && '{honda}'::text[]
     OR keywords && '{porsche}'::text[]
     OR keywords && '{hummer}'::text[]
     OR keywords && '{ferrari}'::text[] );

The runtime would then only depend on the update strategy for the join table.

Regards.

--
nha / Lyon / France.
Accidentally sent to nha only.

--- On Wed, 7/8/09, John Cheng <jlcheng@ymail.com> wrote:

> From: John Cheng <jlcheng@ymail.com>
> Subject: Re: [GENERAL] Problem search on text arrays, using the overlaps (&&) operator
> To: "nha" <lyondif02@free.fr>
> Date: Wednesday, July 8, 2009, 4:24 PM
>
> Hi nha,
>
> I will try out your suggestion about a materialized view. I had never
> even thought about trying it.
>
> As luck would have it, I had to try out these tests on a different
> database today, which resulted in a different query plan that executed
> both forms in the same time. This different plan used a merge join
> instead of a hash join. I will research this area more to see if I
> learn anything new.
>
> You also pointed out that my queries (reports) are simply reporting
> off of a lot of data. Perhaps I need to see if the windowing functions
> in 8.4 can help improve things, or perhaps try to partition the data.
> Unfortunately, those kinds of changes will be bigger than I had
> expected.
----- "nha" <lyondif02@free.fr> wrote:

> Another way could concern the hash join. It has been shown that this
> step costs a lot with respect to the overall runtime. Depending on
> available storage space and DBMS load, a kind of materialized view may
> be handled in order to cut off the overloading join. Here are some
> suggested statements to create this helper table:
> [snip]

Hi nha,

Sorry about the long lag after your last post. I didn't want to post back until I had something solid to report on.

Using a materialized view turned out to be the best way to solve my problem. My coworker designed a new table that consists of the key columns of the 3 large tables that were being joined. A trigger is used to make sure the "materialized view" is kept up to date. Since new data is added infrequently (once a month), the cost of keeping the materialized view up to date is cheap. The resulting query runs exceedingly fast! :)

Thank you so much for your guidance. I have learned a lot from this incident!
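[Editor's note: the trigger-maintained table described above can be sketched roughly as follows. All names are illustrative -- the poster's actual schema spans 3 tables -- and 8.3-era syntax (EXECUTE PROCEDURE) is assumed:]

```sql
-- A hand-rolled "materialized view": key columns copied into a
-- narrow table, kept in sync by an AFTER row trigger on the source.
CREATE TABLE lead_search (
    lead_id   numeric PRIMARY KEY,
    typeflags text[]
);
CREATE INDEX lead_search_typeflags_idx
    ON lead_search USING gin (typeflags);

CREATE OR REPLACE FUNCTION lead_search_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        DELETE FROM lead_search WHERE lead_id = OLD.lead_id;
    ELSIF TG_OP = 'UPDATE' THEN
        UPDATE lead_search SET typeflags = NEW.typeflags
         WHERE lead_id = NEW.lead_id;
    ELSE
        INSERT INTO lead_search (lead_id, typeflags)
        VALUES (NEW.lead_id, NEW.typeflags);
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER lead_search_sync_trg
AFTER INSERT OR UPDATE OR DELETE ON lead_reporting_data
FOR EACH ROW EXECUTE PROCEDURE lead_search_sync();
```

Queries then hit lead_search directly and skip the expensive hash join entirely, which matches the speedup reported above.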