Re: Problem search on text arrays, using the overlaps (&&) operator - Mailing list pgsql-general
From:           nha
Subject:        Re: Problem search on text arrays, using the overlaps (&&) operator
Msg-id:         4A522266.9070602@free.fr
In response to: Problem search on text arrays, using the overlaps (&&) operator (John Cheng <jlcheng@ymail.com>)
List:           pgsql-general
Hello,

On 2/07/09 2:07, John Cheng wrote:
> We use text[] on one of our tables. This text[] column allows us to
> search for records that match a keyword in a set of keywords. For
> example, if we want to find records that have a keyword of "foo" or
> "bar", we can use the condition:
>
> keywords && '{foo, bar}'::text[]
>
> Another way is to do this:
>
> (keywords && '{foo}'::text[] OR keywords && '{bar}'::text[])
>
> I am noticing a big difference between the two ways. I'm trying to
> find out if we need to rewrite our queries to speed them up, or
> perhaps I am just missing something about how to use text[].
>
> [...]
> For some reason, I am seeing a big difference in our real database. I
> don't want to just rewrite all of our queries yet. I'm guessing the
> data makes a big difference. What would be a good way to examine the
> data to figure out the best way to write our queries? Are there
> any features in PostgreSQL that can help me improve the performance?
>
> Any advice would be greatly appreciated!

With your exhaustive example statements based on the tables foo and cars, I performed some measures on my side (PostgreSQL 8.3.1 server).
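For reference, the measurement setup can be sketched roughly as follows. The table definition and sample rows here are hypothetical placeholders, not your actual schema; the two WHERE-clause forms are the ones from your message:

```sql
-- Hypothetical toy table mirroring the scenario (not the original schema).
CREATE TABLE cars (
    id       serial PRIMARY KEY,
    keywords text[]
);

INSERT INTO cars (keywords) VALUES
    ('{ford, mustang}'),
    ('{toyota, corolla}'),
    ('{honda, civic}');

-- Form 1: a single overlap test against a multi-element array.
EXPLAIN ANALYZE
SELECT * FROM cars WHERE keywords && '{ford, toyota}'::text[];

-- Form 2: an OR of single-element overlap tests.
EXPLAIN ANALYZE
SELECT * FROM cars
WHERE keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[];
```

Each runtime mean below was taken over 10 subsequent EXPLAIN ANALYZE runs of statements of this shape.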
Here are some statistical results:

seq    rtm   delta    ratio  deco
----  ----  -------  -----  --------
s1cc   873   -1.74%   2/9   1+1
s1or   889   +1.71%   2/9   1+1
s2cc  1322   +8.53%   3/9   1+2
s2or  1209   -9.32%   3/9   1+2
s3cc   892   -2.61%   2/9   1+(.5*2)
s3or   915   +2.54%   2/9   1+(.5*2)
s4cc   511   -3.00%   1/9   (.5*2)
s4or   526   +2.91%   1/9   (.5*2)
s5cc  1635   +2.13%   4/9   1+1+2
s5or  1600   -2.17%   4/9   1+1+2

WHERE clauses per sequence:

s1cc  keywords && '{ford, toyota}'::text[]
s1or  keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[]
s2cc  keywords && '{ford, honda}'::text[]
s2or  keywords && '{ford}'::text[] OR keywords && '{honda}'::text[]
s3cc  keywords && '{honda, ferrari}'::text[]
s3or  keywords && '{honda}'::text[] OR keywords && '{ferrari}'::text[]
s4cc  keywords && '{ferrari, hummer}'::text[]
s4or  keywords && '{ferrari}'::text[] OR keywords && '{hummer}'::text[]
s5cc  keywords && '{ford, toyota, porsche}'::text[]
s5or  keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[] OR keywords && '{porsche}'::text[]

Legend:

seq    sequence of 10 subsequent EXPLAIN ANALYZE runs
rtm    runtime mean (in milliseconds) of the 10 subsequent measures
delta  difference percentage between the cc and or sequences
       cc: single WHERE clause with a >1-element array (e.g. {foo,bar})
       or: multiple WHERE clauses with 1-element arrays (e.g. {foo})
ratio  ratio between # of result rows and # of table rows
deco   partition of result rows between the constant array values in
       the WHERE clause

In the following, I refer to your two condition forms as:
- arr&&{f,b}
- arr&&{f} or arr&&{b}

I noticed first that, contrary to your observation, for the "ford or toyota" case (sequence s1, developed into the 2 subcases s1cc and s1or for the two forms of condition), the runtime mean is shorter for s1cc (arr&&{f,t}) than for s1or (arr&&{f} or arr&&{t}). But the difference is only about 1.7%, i.e. not significant.
This empirical advantage of the form arr&&{f,t} over the form (arr&&{f} or arr&&{t}) is also observed in 2 cases (s3 and s4) out of 4 (s2 through s5), where the difference percentage looks more significant (about 3%).

Cases s3 and s4 differ from the others (s1, s2, and s5) in that the sets of rows matching each compared array value intersect: all the rows matched by "ferrari" also match "honda" (strict inclusion, not equality); all the rows matched by "ferrari" also match "hummer" and conversely (double inclusion, i.e. equality). In the other 3 cases, each compared array value matches a set of rows that does not intersect the matching row sets of the other value(s). We may then assume that the form arr&&{f,t} fits better when many rows are matched by several terms; however, this cannot be generalized at this stage.

The reported data also suggest a roughly linear relationship between runtime and the number of result rows. This relationship seems equally applicable to both forms, arr&&{f,t} and (arr&&{f} or arr&&{t}).

Beyond these measures, I notice that, in the query plans of your queries over real data, the difference between the actual runtimes reported for the two condition forms is not significant with respect to the overall runtime of the queries. At the bitmap heap scan on the table over which the conditions are evaluated, when the last row is retrieved, the actual runtime is respectively:
- for arr&&{f,b}: 1276.990 ms;
- for (arr&&{f} or arr&&{b}): 1211.535 ms.

While quite close (a difference of about 5%), this gap is not really harmful with respect to the overall runtimes (resp. 13197 ms and 7768 ms): as a share of overall runtime it amounts to resp. (1276/13197 =) 10% and (1211/7768 =) 16%. Even if the overlap operator runtimes were interchanged between the 2 cases, the ratios would be resp. (1211/13197 =) 9% and (1276/7768 =) 16%, i.e. not so different from the current plan.
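One way to check whether this intersection effect applies to a given data set is to count how far the matching row sets overlap. A minimal sketch, assuming the toy table and column names used above (not your real schema):

```sql
-- How many rows match each term, and how many match both?
-- If both_rows is large relative to the individual counts, the
-- single-array form may fit better, per the observation above.
SELECT
    sum(CASE WHEN keywords && '{honda}'::text[]
             THEN 1 ELSE 0 END) AS honda_rows,
    sum(CASE WHEN keywords && '{ferrari}'::text[]
             THEN 1 ELSE 0 END) AS ferrari_rows,
    sum(CASE WHEN keywords && '{honda}'::text[]
              AND keywords && '{ferrari}'::text[]
             THEN 1 ELSE 0 END) AS both_rows
FROM cars;
```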
On the other hand, the hash join part of the runtime is very large:
- for the arr&&{f,b} case: (13170 - 1924 =) 11246 ms;
- for the (arr&&{f} or arr&&{b}) case: (7743 - 1850 =) 5893 ms.

These values represent resp. (11246/13197 =) 85% and (5893/7768 =) 76% of the overall runtimes, i.e. they clearly dominate the overall query plans. In my opinion, analysis and optimization could be deepened around the table indexes used for join planning. As your reported query plans show, the WHERE clauses are evaluated independently of the table ml_lead; the reason is that all the attributes of the clauses belong to the table lead_reporting_data. Time could likely be saved on the evaluation of the join condition.

I hope these observations contribute a little to your analysis. For details on the measures I took with the overlap operator, I have attached a document to this email (an OpenDocument Spreadsheet (ODS) v2 formatted file, 24 KiB). The 3rd sheet, "various", presents the detailed measures related to the data reported in this email.

Regards.

--
nha / Lyon / France.
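As a side note on indexing (an assumption on my part, not something I tested against your data): a GIN index on the array column can service the && operator for either condition form, and may change the bitmap heap scan part of the plan. The index name below is made up for illustration:

```sql
-- Hypothetical: a GIN index on the text[] column of lead_reporting_data.
-- The planner can use it for the && (overlaps) operator.
CREATE INDEX lead_reporting_data_keywords_gin
    ON lead_reporting_data USING gin (keywords);

-- Refresh planner statistics after creating the index.
ANALYZE lead_reporting_data;
```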