Re: Problem search on text arrays, using the overlaps (&&) operator - Mailing list pgsql-general

From nha
Subject Re: Problem search on text arrays, using the overlaps (&&) operator
Date
Msg-id 4A522266.9070602@free.fr
Whole thread Raw
In response to Problem search on text arrays, using the overlaps (&&) operator  (John Cheng <jlcheng@ymail.com>)
Responses Added parameter for CREATE ROLE
List pgsql-general
Hello,

Le 2/07/09 2:07, John Cheng a écrit :
> We use text[] on one of our tables. This text[] column allows us to
> search for records that matches a keyword in a set of keywords. For
> example, if we want to find records that has a keyword of "foo" or
> "bar", we can use the condition:
>
>    keywords&&  '{foo, bar}'::text[]
>
> Another wau is to do this:
>
>    (keywords&&  '{foo}::text[]' OR keywords&&  '{bar}::text[]')
>
> I am noticing a big difference between the two ways. I'm trying to
> find out if we need to re-write our queries to speed them up, or
> perhaps I am just missing something about how to use text[].
>
> [...]
> For some reason, I am seeing a big difference in our real database. I
> don't want to just rewrite all of our queries yet. I'm guessing the
> data makes a big difference.  What would be a good way to examine the
> data to figure out what's the best way to write our queries? Is there
> any features in PostgreSQL that can help me improve the performance?
>
> Any advice would be greatly appreciated!
>

With your exhaustive example statements based on table foo and cars, I
performed some measures on my side (PostgreSQL 8.3.1 server). Here are
some statistical results:

seq    rtm    delta    ratio    deco
---    ---    -----    -----    ----
s1cc    873    -1,74%    2//9    1+1
s1or    889    1,71%    2//9    1+1
s2cc    1322    8,53%    3//9    1+2
s2or    1209    -9,32%    3//9    1+2
s3cc    892    -2,61%    2//9    1+(.5*2)
s3or    915    2,54%    2//9    1+(.5*2)
s4cc    511    -3,00%    1//9    (.5*2)
s4or    526    2,91%    1//9    (.5*2)
s5cc    1635    2,13%    4//9    1+1+2
s5or    1600    -2,17%    4//9    1+1+2
---    ---    -----    -----    ----

seq    where clauses
---    -------------
s1cc    keywords && '{ford, toyota}'::text[]
s1or    keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[]
s2cc    keywords && '{ford, honda}'::text[]
s2or    keywords && '{ford}'::text[] OR keywords && '{honda}'::text[]
s3cc    keywords && '{honda, ferrari}'::text[]
s3or    keywords && '{honda}'::text[] OR keywords && '{ferrari}'::text[]
s4cc    keywords && '{ferrari, hummer}'::text[]
s4or    keywords && '{ferrari}'::text[] OR keywords && '{hummer}'::text[]
s5cc    keywords && '{ford, toyota, porsche}'::text[]
s5or    keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[] OR
keywords && '{porsche}'::text[]

legend
------
seq    sequence of 10 subsequent explain analyze per row
rtm    runtime mean (in milliseconds) of 10 subsequent measures
delta    difference percentage between cc and or sequences
cc    unique where clause with >1-size table (eg. {foo,bar})
or    multiple where clauses with 1-size text table (eg. {foo})
ratio    ratio between # of result rows and # of table rows
deco    result row partition between constant text table values in where
clause.

In the following, I refer to your condition forms as:
- arr&&{f,b}
- arr&&{f} or arr&&{b}

I noticed first that, contrarily to your observation, for the "ford or
toyata" case (sequence s1 developped into 2 subcases s1cc and s1or for
both forms of condition), runtime mean is shorter for s1cc (arr&&{f,t})
than for s1or (arr&&{f} or arr&&{t}). But the difference percentage is
only about 1,7% (ie. not enough relevant).

This empirical advantage of form arr&&{f,t} over form (arr&&{f} or
arr&&{t}) is also observed for 2 cases (s3 and s4) out of 4 (s2 up to
s5). The difference percentage looks more relevant (about 3%). The cases
s3 and s4 differ from the others (s1, s2, and s5) by the fact that the
sets of matching rows for each compared text table value intersect: all
the rows matched by ferrari also match honda (strict inclusion not
equality); all the rows matched by ferrari also match hummer and
conversely (double inclusion here, ie. equality). In the other 3 cases,
each compared text table value matches set of rows without intersecting
the matching row set of the other(s) value(s). We may then assume that
form arr&&{f,t} would fit better when there are lots of rows matched by
several terms--however this cannot be generalized at this stage.

The reported data let us also guess some linear relationship between
runtime and # of result rows. Here this relationship seems equally
applicable with both forms arr&&{f,t} and (arr&&{f} or arr&&{t}).

Out of these measures and report, I notice that, regarding the query
plan explanation of your queries over real data, the difference between
actual runtimes reported for each case of condition forms is not so
relevant with respect to the overall runtime of the queries. At bitmap
heap scan on the table over which conditions are performed, when the
last row is retrieved, actual runtime is respectively of:
- for arr&&{f,b}: 1276.990 ms;
- for (arr&&{f) or arr&&{b}): 1211.535 ms.
While quite close (difference percentage of about 5%), the difference is
not really harmful with respect to the overall runtimes (resp. 13197 ms
and 7768 ms), ie. in terms of part of overall runtimes resp.
(1276/13197=) 10% and (1211/7768=) 16 %.

Even if overlap operator runtimes are interchanged between the 2 cases,
ratios would be resp. (1211/13197=) 9% and (1276/7768=) 16%, ie. not so
different from the current plan.

On the other hand, the hash join runtime part is very huge:
- for arr&&{f,b} case: (13170-1924=) 11246 ms;
- for (arr&&{f) or arr&&{b}) case: (7743-1850=) 5893 ms.
These values represent resp. (11246/13197=) 85% and (5893/7768=) 76% of
overall runtimes, ie. clearly dominating the overall query plan.

In my opinion, analysis and optimization may be deepen over table
indexes used for join planning. As your reported query plans show, the
Where clauses are performed independantly from the table ml_lead; the
reason is that all the attributes of the clauses belong to the table
lead_reporting_data. Time may be reduced on join condition achievements.

Hoping this observation will contribute a little to your opinion.

Without any claim, I attached a document to this email for details on
the measures I took with the overlap operator -- OpenDocument
Spreadsheet (ODS) v2 formatted file, 24 kiB. The 3rd sheet "various"
presents the detailed measures related to the data reported in this email.

Regards.

--
nha / Lyon / France.

Attachment

pgsql-general by date:

Previous
From: Tguru
Date:
Subject: Re: Upgrading 8.3 to 8.4 on Windows.
Next
From: Ivan Sergio Borgonovo
Date:
Subject: Feistel cipher, shorter string and hex to int