Thread: Problem search on text arrays, using the overlaps (&&) operator
We use text[] on one of our tables. This text[] column allows us to search for records that match a keyword in a set of keywords. For example, if we want to find records that have a keyword of "foo" or "bar", we can use the condition:

    keywords && '{foo, bar}'::text[]

Another way is to do this:

    (keywords && '{foo}'::text[] OR keywords && '{bar}'::text[])

I am noticing a big difference between the two ways. I'm trying to find out if we need to rewrite our queries to speed them up, or perhaps I am just missing something about how to use text[].

To set up a simple test case, use:

CREATE TEMP TABLE foo (
    keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);

INSERT INTO foo VALUES ('{ford}'::text[]);
INSERT INTO foo VALUES ('{toyota}'::text[]);
INSERT INTO foo VALUES ('{volkswagen}'::text[]);
INSERT INTO foo VALUES ('{dodge}'::text[]);
INSERT INTO foo VALUES ('{saturn}'::text[]);
INSERT INTO foo VALUES ('{honda}'::text[]);
INSERT INTO foo VALUES ('{porsche}'::text[]);
INSERT INTO foo VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo VALUES ('{honda, hummer, ferrari}'::text[]);

-- Double the row count each time
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);
INSERT INTO foo (SELECT keywords FROM foo);

-- Query of form 'arr && {foo, bar}'
EXPLAIN ANALYZE SELECT count(*) FROM foo
WHERE keywords && '{ford, toyota}'::text[];

------ result ------

 Aggregate  (cost=9870.50..9870.51 rows=1 width=0) (actual time=449.937..449.938 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=104.88..9853.56 rows=6778 width=0) (actual time=61.197..308.724 rows=262144 loops=1)
         Recheck Cond: (keywords && '{ford,toyota}'::text[])
         ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=58.816..58.816 rows=262144 loops=1)
               Index Cond: (keywords && '{ford,toyota}'::text[])
 Total runtime: 450.121 ms
(6 rows)

-- Query of form 'arr && {foo} OR arr && {bar}'
EXPLAIN ANALYZE SELECT count(*) FROM foo
WHERE (keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[]);

------ result ------

 Aggregate  (cost=11351.85..11351.86 rows=1 width=0) (actual time=424.389..424.389 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=213.13..11318.04 rows=13522 width=0) (actual time=43.728..273.913 rows=262144 loops=1)
         Recheck Cond: ((keywords && '{ford}'::text[]) OR (keywords && '{toyota}'::text[]))
         ->  BitmapOr  (cost=213.13..213.13 rows=13556 width=0) (actual time=41.386..41.386 rows=0 loops=1)
               ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=21.216..21.216 rows=131072 loops=1)
                     Index Cond: (keywords && '{ford}'::text[])
               ->  Bitmap Index Scan on foo_idx  (cost=0.00..103.19 rows=6778 width=0) (actual time=20.167..20.167 rows=131072 loops=1)
                     Index Cond: (keywords && '{toyota}'::text[])
 Total runtime: 424.431 ms
(9 rows)

The difference is very little here. However, in our application I am seeing a much bigger difference.
The affected query is a lot more complicated. First, a query of the form "keywords && '{foo, bar}'::text[]":

EXPLAIN ANALYZE SELECT count(*)
FROM mb_lead ml
INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id
WHERE lrd.typeflags && '{autobytel.volume, automotive}'::text[];

------ result ------

 Aggregate  (cost=71602.10..71602.11 rows=1 width=0) (actual time=13196.895..13196.896 rows=1 loops=1)
   ->  Hash Join  (cost=60278.37..71598.14 rows=1582 width=0) (actual time=1924.076..13170.602 rows=29567 loops=1)
         Hash Cond: (ml.lead_id = lrd.lead_id)
         ->  Seq Scan on mb_lead ml  (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.014..293.214 rows=316398 loops=1)
         ->  Hash  (cost=60022.57..60022.57 rows=20464 width=8) (actual time=1922.050..1922.050 rows=473743 loops=1)
               ->  Bitmap Heap Scan on lead_reporting_data lrd  (cost=808.14..60022.57 rows=20464 width=8) (actual time=424.841..1276.990 rows=473743 loops=1)
                     Recheck Cond: (typeflags && '{autobytel.volume,automotive}'::text[])
                     ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464 width=0) (actual time=308.941..308.941 rows=483587 loops=1)
                           Index Cond: (typeflags && '{autobytel.volume,automotive}'::text[])
 Total runtime: 13197.015 ms

Second, a query of the form "keywords && '{foo}'::text[] OR keywords && '{bar}'::text[]":

EXPLAIN ANALYZE SELECT count(*)
FROM mb_lead ml
INNER JOIN lead_reporting_data lrd ON lrd.lead_id = ml.lead_id
WHERE (lrd.typeflags && '{autobytel.volume}'::text[]
    OR lrd.typeflags && '{automotive}'::text[]);

------ result ------

 Aggregate  (cost=112761.86..112761.87 rows=1 width=0) (actual time=7768.672..7768.673 rows=1 loops=1)
   ->  Hash Join  (cost=101418.46..112753.97 rows=3156 width=0) (actual time=1850.560..7743.651 rows=29567 loops=1)
         Hash Cond: (ml.lead_id = lrd.lead_id)
         ->  Seq Scan on mb_lead ml  (cost=0.00..6557.98 rows=316398 width=8) (actual time=0.013..274.131 rows=316398 loops=1)
         ->  Hash  (cost=100908.13..100908.13 rows=40826 width=8) (actual time=1849.519..1849.519 rows=473743 loops=1)
               ->  Bitmap Heap Scan on lead_reporting_data lrd  (cost=1626.46..100908.13 rows=40826 width=8) (actual time=357.613..1211.535 rows=473743 loops=1)
                     Recheck Cond: ((typeflags && '{autobytel.volume}'::text[]) OR (typeflags && '{automotive}'::text[]))
                     ->  BitmapOr  (cost=1626.46..1626.46 rows=40928 width=0) (actual time=240.921..240.921 rows=0 loops=1)
                           ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464 width=0) (actual time=87.368..87.368 rows=161264 loops=1)
                                 Index Cond: (typeflags && '{autobytel.volume}'::text[])
                           ->  Bitmap Index Scan on lead_reporting_data_typeflags_idx  (cost=0.00..803.02 rows=20464 width=0) (actual time=153.546..153.546 rows=322317 loops=1)
                                 Index Cond: (typeflags && '{automotive}'::text[])
 Total runtime: 7768.788 ms

-----------
For some reason, I am seeing a big difference in our real database. I don't want to just rewrite all of our queries yet. I'm guessing the data makes a big difference. What would be a good way to examine the data to figure out the best way to write our queries? Are there any features in PostgreSQL that can help me improve the performance?

Any advice would be greatly appreciated!
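[Editor's note: one way to examine the keyword distribution in a column like this is to count how often each flag occurs. This is a sketch, not from the thread; the table and column names are taken from the plans above, and unnest() is only built in from 8.4 onwards, so on 8.3 an equivalent helper function would be needed:]

```sql
-- How many rows carry each flag? Very common flags make the bitmap
-- scans above touch a large fraction of the table, which is what the
-- actual row counts (473743 of ~500k rows) suggest is happening here.
SELECT kw, count(*) AS n
FROM (SELECT unnest(typeflags) AS kw
      FROM lead_reporting_data) AS s
GROUP BY kw
ORDER BY n DESC;
```

The planner's row estimates for && in 8.3 (rows=20464 estimated vs. 473743 actual above) are largely default-driven, so a frequency count like this is about the only way to see how selective each keyword really is.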
John Cheng wrote:
> -----------
> For some reason, I am seeing a big difference in our real database. I
> don't want to just rewrite all of our queries yet. I'm guessing the
> data makes a big difference. What would be a good way to examine the
> data to figure out what's the best way to write our queries? Is there
> any features in PostgreSQL that can help me improve the performance?
>
> Any advice would be greatly appreciated!

Hi,

did you think about using the full text search integrated from version 8.3 onwards? I have never used your approach and don't know if full text search is suitable for your case ... just a hint.

http://www.postgresql.org/docs/8.4/interactive/textsearch.html

Cheers

Andy
Hi Andreas,

I'm afraid full text search won't fit our app here. Our application tags each record with "source flags", a text[] of strings describing where the record came from. These flags are already passed in pre-parsed when we store the records, so we can simply store them as text[]. In contrast, the full text search approach would mean storing these flags as one single string and then using to_tsvector() to have PostgreSQL parse it out again. That round trip doesn't seem to make sense for us.

I'm also suspicious that the same type of problem would affect queries on tsvector columns, but I have not tested that myself.

----- Original Message -----
From: "Andreas Wenk" <a.wenk@netzmeister-st-pauli.de>
To: "John Cheng" <jlcheng@ymail.com>, "PG-General Mailing List" <pgsql-general@postgresql.org>
Sent: Friday, July 3, 2009 2:12:46 AM GMT -08:00 US/Canada Pacific
Subject: Re: [GENERAL] Problem search on text arrays, using the overlaps (&&) operator
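[Editor's note: to make the comparison concrete, this is roughly what the full text alternative would look like next to the current approach. A sketch only; "flags_tsv" is a hypothetical column, and the 'simple' configuration is used to avoid stemming:]

```sql
-- Current approach: flags arrive pre-parsed and are stored as-is.
--   ... WHERE typeflags && '{automotive, hummer}'::text[]

-- Full text alternative: glue the flags back into one string so that
-- PostgreSQL can parse them apart again into a tsvector ...
ALTER TABLE lead_reporting_data ADD COLUMN flags_tsv tsvector;
UPDATE lead_reporting_data
   SET flags_tsv = to_tsvector('simple', array_to_string(typeflags, ' '));

-- ... and query with a tsquery instead of &&:
--   ... WHERE flags_tsv @@ to_tsquery('simple', 'automotive | hummer')
```

Note that flags like 'autobytel.volume' contain punctuation that the text search parser treats specially, which is another reason the round trip is awkward for this data.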
Hello,

On 2/07/09 2:07, John Cheng wrote:
> We use text[] on one of our tables. This text[] column allows us to
> search for records that match a keyword in a set of keywords. For
> example, if we want to find records that have a keyword of "foo" or
> "bar", we can use the condition:
>
> keywords && '{foo, bar}'::text[]
>
> Another way is to do this:
>
> (keywords && '{foo}'::text[] OR keywords && '{bar}'::text[])
>
> I am noticing a big difference between the two ways. I'm trying to
> find out if we need to re-write our queries to speed them up, or
> perhaps I am just missing something about how to use text[].
>
> [...]
> For some reason, I am seeing a big difference in our real database. I
> don't want to just rewrite all of our queries yet. I'm guessing the
> data makes a big difference. What would be a good way to examine the
> data to figure out what's the best way to write our queries? Is there
> any features in PostgreSQL that can help me improve the performance?
>
> Any advice would be greatly appreciated!

With your exhaustive example statements based on table foo and cars, I performed some measurements on my side (PostgreSQL 8.3.1 server).
Here are some statistical results:

seq    rtm   delta   ratio  deco
----  ----  ------  -----  --------
s1cc   873  -1.74%  2//9   1+1
s1or   889  +1.71%  2//9   1+1
s2cc  1322  +8.53%  3//9   1+2
s2or  1209  -9.32%  3//9   1+2
s3cc   892  -2.61%  2//9   1+(.5*2)
s3or   915  +2.54%  2//9   1+(.5*2)
s4cc   511  -3.00%  1//9   (.5*2)
s4or   526  +2.91%  1//9   (.5*2)
s5cc  1635  +2.13%  4//9   1+1+2
s5or  1600  -2.17%  4//9   1+1+2
----  ----  ------  -----  --------

seq   where clause
----  ------------
s1cc  keywords && '{ford, toyota}'::text[]
s1or  keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[]
s2cc  keywords && '{ford, honda}'::text[]
s2or  keywords && '{ford}'::text[] OR keywords && '{honda}'::text[]
s3cc  keywords && '{honda, ferrari}'::text[]
s3or  keywords && '{honda}'::text[] OR keywords && '{ferrari}'::text[]
s4cc  keywords && '{ferrari, hummer}'::text[]
s4or  keywords && '{ferrari}'::text[] OR keywords && '{hummer}'::text[]
s5cc  keywords && '{ford, toyota, porsche}'::text[]
s5or  keywords && '{ford}'::text[] OR keywords && '{toyota}'::text[] OR keywords && '{porsche}'::text[]

legend
------
seq    sequence of 10 subsequent EXPLAIN ANALYZE runs per row
rtm    runtime mean (in milliseconds) of the 10 subsequent measures
delta  difference percentage between the cc and or sequences
cc     single where clause with a >1-element array (e.g. {foo,bar})
or     multiple where clauses with 1-element arrays (e.g. {foo})
ratio  ratio between # of result rows and # of table rows
deco   breakdown of result rows between the constant array values in the where clause

In the following, I refer to your condition forms as:
- arr&&{f,b}
- arr&&{f} or arr&&{b}

I noticed first that, contrary to your observation, for the "ford or toyota" case (sequence s1, developed into the two subcases s1cc and s1or for the two condition forms), the runtime mean is shorter for s1cc (arr&&{f,t}) than for s1or (arr&&{f} or arr&&{t}). But the difference percentage is only about 1.7% (i.e. not really significant).
This empirical advantage of form arr&&{f,t} over form (arr&&{f} or arr&&{t}) is also observed for 2 cases (s3 and s4) out of 4 (s2 up to s5). There the difference percentage looks more significant (about 3%).

The cases s3 and s4 differ from the others (s1, s2, and s5) in that the sets of rows matching each compared array value intersect: all the rows matched by ferrari also match honda (strict inclusion, not equality); all the rows matched by ferrari also match hummer and conversely (double inclusion here, i.e. equality). In the other 3 cases, each compared array value matches a set of rows that does not intersect the matching row sets of the other value(s). We may then assume that form arr&&{f,t} fits better when many rows are matched by several terms; however, this cannot be generalized at this stage.

The reported data also suggest a linear relationship between runtime and # of result rows. Here this relationship seems equally applicable to both forms arr&&{f,t} and (arr&&{f} or arr&&{t}).

Beyond these measures, I notice that, in the query plans of your queries over real data, the difference between the actual runtimes reported for the two condition forms is not that significant with respect to the overall runtime of the queries. At the bitmap heap scan on the table over which the conditions are performed, when the last row is retrieved, the actual runtime is respectively:

- for arr&&{f,b}: 1276.990 ms;
- for (arr&&{f} or arr&&{b}): 1211.535 ms.

While quite close (a difference of about 5%), the difference is not really harmful with respect to the overall runtimes (resp. 13197 ms and 7768 ms), i.e. as a share of overall runtime, resp. (1276/13197=) 10% and (1211/7768=) 16%. Even if the overlap operator runtimes were interchanged between the 2 cases, the ratios would be resp. (1211/13197=) 9% and (1276/7768=) 16%, i.e. not so different from the current plan.
On the other hand, the hash join runtime part is very large:

- for the arr&&{f,b} case: (13170-1924=) 11246 ms;
- for the (arr&&{f} or arr&&{b}) case: (7743-1850=) 5893 ms.

These values represent resp. (11246/13197=) 85% and (5893/7768=) 76% of the overall runtimes, i.e. they clearly dominate the overall query plan.

In my opinion, analysis and optimization may be deepened over the table indexes used for join planning. As your reported query plans show, the WHERE clauses are performed independently from the table mb_lead; the reason is that all the attributes of the clauses belong to the table lead_reporting_data. Time may be reduced on the join condition itself.

Hoping this observation will contribute a little to your opinion. Without any claim, I attached a document to this email with details on the measures I took with the overlap operator -- OpenDocument Spreadsheet (ODS) v2 formatted file, 24 kiB. The 3rd sheet "various" presents the detailed measures related to the data reported in this email.

Regards.

--
nha / Lyon / France.
It would be nice if during create role we could have a parameter to set the number of days that a password is valid instead of just a timestamp. Best Regards -- Michael Gould, Managing Partner Intermodal Software Solutions, LLC 904.226.0978 904.592.5250 fax
On 06/07/2009 19:32, Michael Gould wrote:
> It would be nice if during create role we could have a parameter to set the
> number of days that a password is valid instead of just a timestamp.

Would (current_timestamp + interval '365 days') work? Dunno myself - just thinking out loud... :-)

Ray.

------------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
rod@iol.ie
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
------------------------------------------------------------------
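[Editor's note: one wrinkle with that suggestion, as an assumption from the CREATE ROLE grammar rather than anything tested in this thread: VALID UNTIL expects a string literal, not an expression, so the interval arithmetic has to be evaluated first and spliced into the DDL string. On servers new enough to have DO blocks and format() (9.1+), that could look roughly like this; "app_user" is a hypothetical role name:]

```sql
-- Sketch only: compute the expiry timestamp, then build the DDL
-- string with EXECUTE, because VALID UNTIL will not accept an
-- expression directly.
DO $$
BEGIN
  EXECUTE format('ALTER ROLE app_user VALID UNTIL %L',
                 (current_timestamp + interval '365 days')::text);
END
$$;
```

On older versions the same computation would have to happen client-side before issuing the ALTER ROLE.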
----- "nha" <lyondif02@free.fr> wrote:

> From: "nha" <lyondif02@free.fr>
> To: "John Cheng" <jlcheng@ymail.com>
> Cc: pgsql-general@postgresql.org
> Sent: Monday, July 6, 2009 9:12:22 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [GENERAL] Problem search on text arrays, using the overlaps (&&) operator
>
> Hello,
>
> With your exhaustive example statements based on table foo and cars, I
> performed some measures on my side (PostgreSQL 8.3.1 server). Here are
> some statistical results:
> [ ... snipped ... ]
>
> In my opinion, analysis and optimization may be deepened over the table
> indexes used for join planning. As your reported query plans show, the
> Where clauses are performed independently from the table mb_lead; the
> reason is that all the attributes of the clauses belong to the table
> lead_reporting_data. Time may be reduced on join condition
> achievements.
>
> Hoping this observation will contribute a little to your opinion.
>
> Regards.
>
> --
> nha / Lyon / France.

Hi nha,

I had not expected anyone to go to such lengths to evaluate my situation, thank you so much! After looking at your analysis, I realized that the test case I created isn't close enough to the queries running in our prod environment. For one, table 'foo' does not join to another table; the other thing is that the amount of data isn't the same; finally, these tables have been ANALYZEd. So I took some time to update the test case.

On our server, running 8.3.6, I was able to reproduce the difference between the two styles: "arr&&{f,b}" and "arr&&{f} or arr&&{b}".
First, the setup:

-- Begin test case
-- Sets up 'bar'
SELECT id INTO TEMP TABLE bar
FROM (SELECT generate_series(1,300000) AS id) AS bar;
CREATE INDEX bar_idx ON bar (id);
ANALYZE bar;

-- Sets up 'foo'
CREATE TEMP SEQUENCE foo_bar_id_seq;
CREATE TEMP TABLE foo (
    bar_id numeric DEFAULT NEXTVAL('foo_bar_id_seq'),
    keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);

INSERT INTO foo (keywords) VALUES ('{ford}'::text[]);
INSERT INTO foo (keywords) VALUES ('{toyota}'::text[]);
INSERT INTO foo (keywords) VALUES ('{volkswagen}'::text[]);
INSERT INTO foo (keywords) VALUES ('{saturn}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda, hummer, ferrari}'::text[]);

-- Double the row count each time
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);

ANALYZE foo;
-- End test case

Query for the form "arr&&{f,b}":

SELECT count(*) FROM foo
INNER JOIN bar ON foo.bar_id = bar.id
WHERE foo.keywords && '{ford, toyota, volkswagen, saturn, honda, porsche, hummer, ferrari}'::text[];

Query for the form "arr&&{f} or arr&&{b}":

SELECT count(*) FROM foo, bar
WHERE foo.bar_id = bar.id
AND ( keywords && '{ford}'::text[]
   OR keywords && '{toyota}'::text[]
   OR keywords && '{volkswagen}'::text[]
   OR keywords && '{saturn}'::text[]
   OR keywords && '{honda}'::text[]
   OR keywords && '{porsche}'::text[]
   OR keywords && '{hummer}'::text[]
   OR keywords && '{ferrari}'::text[] );

For the first form, "arr&&{f,b}", the query takes about 15 seconds. For the second form, "arr&&{f} or arr&&{b}", we get about 8 seconds. The difference is around 90-100%, which is what I am seeing in our real-life queries, and the query plans also become similar to the real-life query plans. But I am having a hard time learning from them. The only interesting thing I see is that the estimated cost differs from the actual runtime: the second form has a higher estimated cost than the first form, but a lower runtime.

In this test case, the query filters by any of 8 keywords. Note that with just 2 keywords, the difference is only about 5 seconds. The specific report that our users complained about involved 16 "keywords", where the difference is about 100%.

When you said "analysis and optimization may be deepen over table indexes used for join planning", I'm not sure what you mean. Can you clarify?
I don't mean to be pesky. I was just wondering if there is anything else I should try?

Should I simply rewrite all queries, changing the form

WHERE textarr && '{foo, bar}'::text[]

to

WHERE (textarr && '{foo}'::text[]
    OR textarr && '{bar}'::text[])

?

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hello,

On 8/07/09 0:52, John Cheng wrote:
> I don't mean to be pesky. I was just wondering if there is anything
> else I should try?
>
> Should I simply rewrite all queries, change the form
>
> WHERE textarr && '{foo, bar}'::text[]
>
> To
>
> WHERE (textarr && '{foo}'::text[]
>     OR textarr && '{bar}'::text[])
>
> ?

While still puzzled by the big runtime difference you report between the 2 condition forms, I went on assessing these runtimes on my side with the new case statements, which are assumed to reflect the real world more closely. Here are some measurement results I got (sorry for this long table):

seq  style  runtime (ms)
---  -----  ------------
(db=slf)
N01  OR-EA   6 237
N02  CC-EA   5 250
N03  OR+EA  12 628
N04  CC+EA  12 700
N05  OR+EA  15 679
N06  CC+EA  11 510
N07  CC-EA   7 712
N08  OR-EA   8 741
N09  CC-EA   4 963
N10  OR-EA   6 499
(db=stg)
N11  CC+EA  15 978
N12  OR+EA  15 350
N13  CC-EA   8 102
N14  OR-EA   9 428
N15  OR-EA   5 267
N16  CC-EA   5 017
N17  OR-EA   6 119
N18  CC-EA   4 955
N19  OR+EA  11 722
N20  CC+EA  11 532
N21  OR-EA   7 303
N22  CC-EA   5 438
N23  CC-EA   5 519
N24  OR-EA   5 373
N25  OR-EA   5 422
N26  CC-EA   5 064
(db=stg)
N27  CC-EA   8 314
(db=slf)
N28  OR-EA   6 656
(db=stg)
N29  OR-EA   6 760
(db=slf)
N30  CC-EA   6 753
(db=stg)
N31  CC-EA   5 500
(db=slf)
N32  OR-EA   5 907
(db=stg)
N33  OR-EA   5 391
(db=slf)
N34  CC-EA   5 517
---  -----  ------------

Legend
------
seq: sequence order.
style: condition style of query.
CC: style "arr&&{f,b}" (one clause with a multi-value text array).
OR: style "arr&&{f} or arr&&{b}" (many clauses with 1-value text arrays).
OR2: same style as OR, with explicit JOIN in the query expression.
+EA: performed with EXPLAIN ANALYZE on the query. Slower.
-EA: performed without EXPLAIN ANALYZE on the query. Faster.
runtime: run time in milliseconds.
(db=?): indicates that the following sequences were performed after a drop-and-create process for all the tables and indexes.
------ Results from 2 selected EXPLAIN ANALYZE sequences:

-- seq 03 (OR+EA)
 Aggregate  (cost=37630.52..37630.53 rows=1 width=0) (actual time=12628.182..12628.184 rows=1 loops=1)
   ->  Hash Join  (cost=25989.12..37601.04 rows=11792 width=0) (actual time=8796.002..12231.422 rows=300000 loops=1)
         Hash Cond: ((bar.id)::numeric = foo.bar_id)
         ->  Seq Scan on bar  (cost=0.00..4328.00 rows=300000 width=4) (actual time=0.025..402.458 rows=300000 loops=1)
         ->  Hash  (cost=24636.81..24636.81 rows=82425 width=8) (actual time=8795.027..8795.027 rows=2097152 loops=1)
               ->  Bitmap Heap Scan on foo  (cost=1565.44..24636.81 rows=82425 width=8) (actual time=670.248..5098.109 rows=2097152 loops=1)
                     Recheck Cond: ((keywords && '{ford}'::text[]) OR (keywords && '{toyota}'::text[]) OR (keywords && '{volkswagen}'::text[]) OR (keywords && '{saturn}'::text[]) OR (keywords && '{honda}'::text[]) OR (keywords && '{porsche}'::text[]) OR (keywords && '{hummer}'::text[]) OR (keywords && '{ferrari}'::text[]))
                     ->  BitmapOr  (cost=1565.44..1565.44 rows=83879 width=0) (actual time=665.516..665.516 rows=0 loops=1)
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=114.013..114.013 rows=262144 loops=1)
                                 Index Cond: (keywords && '{ford}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=72.398..72.398 rows=262144 loops=1)
                                 Index Cond: (keywords && '{toyota}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=74.118..74.118 rows=262144 loops=1)
                                 Index Cond: (keywords && '{volkswagen}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=58.486..58.486 rows=262144 loops=1)
                                 Index Cond: (keywords && '{saturn}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=114.671..114.671 rows=524288 loops=1)
                                 Index Cond: (keywords && '{honda}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=115.290..115.290 rows=524288 loops=1)
                                 Index Cond: (keywords && '{porsche}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=58.184..58.184 rows=262144 loops=1)
                                 Index Cond: (keywords && '{hummer}'::text[])
                           ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=58.336..58.336 rows=262144 loops=1)
                                 Index Cond: (keywords && '{ferrari}'::text[])
 Total runtime: 12628.401 ms
-- seq 03 (OR+EA)

-- seq 04 (CC+EA)
 Aggregate  (cost=26726.37..26726.38 rows=1 width=0) (actual time=12700.620..12700.621 rows=1 loops=1)
   ->  Hash Join  (cost=17879.62..26722.62 rows=1500 width=0) (actual time=8482.572..12272.449 rows=300000 loops=1)
         Hash Cond: ((bar.id)::numeric = foo.bar_id)
         ->  Seq Scan on bar  (cost=0.00..4328.00 rows=300000 width=4) (actual time=0.029..412.984 rows=300000 loops=1)
         ->  Hash  (cost=17748.56..17748.56 rows=10485 width=8) (actual time=8481.524..8481.524 rows=2097152 loops=1)
               ->  Bitmap Heap Scan on foo  (cost=177.69..17748.56 rows=10485 width=8) (actual time=978.464..4679.954 rows=2097152 loops=1)
                     Recheck Cond: (keywords && '{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
                     ->  Bitmap Index Scan on foo_idx  (cost=0.00..175.07 rows=10485 width=0) (actual time=973.569..973.569 rows=2097152 loops=1)
                           Index Cond: (keywords && '{ford,toyota,volkswagen,saturn,honda,porsche,hummer,ferrari}'::text[])
 Total runtime: 12700.838 ms
-- seq 04 (CC+EA)

As far as I tried, the runtime difference between the 2 forms varies up to 10% and stays under 1 second -- i.e. far away from the 100% and (15-8=) 7 seconds reported. Moreover, the form "arr&&{f,b}" (let us call it form 1) often shows better than the other (let us call it form 2) -- a fact already noticed in the previous measures without joins. An exception occurs in one special case: the first query submitted immediately after populating the tables. It seems as if the indexes were not yet optimized at that time.
But all the following queries, whichever form is used, run in lesser and similar time. When observing the runtime of the intermediate steps (bitmap heap scan on table foo [step 1], sequential scan on table bar [step 2], hash join [step 3]), it can be noted that all 3 steps run in comparable time whichever form is used:

- no relevant difference for steps 2 and 3, only an expectedly long time for retrieving all the resulting rows because of their high number;
- for step 1: form 1 (arr&&{f,b}) needs more time to extract the 1st row of the result set but is faster at collecting the last rows, so no significant difference can be seen here either.

As I mentioned, the runtime is longer for the 1st query performed after populating the tables, and from the 2nd query onwards runtimes clearly decrease towards a mean (about 6 seconds on my side). Maybe the runtimes you reported (15 and 8 seconds respectively) occurred in such a case (just after populating the tables), with a run order starting with form 1. As far as I saw, the sequence order acts upon the runtime reduction; moreover, successive queries tend towards mean and equal runtimes with the 2 forms.

About the user complaint, I would not be so astonished, simply because of the duration of any single query (more than 5 seconds in both cases). There are so many rows to select that it does not seem really possible to get a lower runtime.

What I may conclude: the query plans show a difference in how the queries are performed: each WHERE clause is assessed before a common heap scan of table foo. The WHERE clause of form 1 (multi-value array) takes longer to run because of the multiple values to test for each key; each WHERE clause of form 2 (1-value array) is faster, but the sum of the corresponding 1-value clauses runs in a comparable time. This observation resulted from measures on my side and may not correctly reflect the configuration of the real production environment.
Opting for form 2 (arr&&{f} or arr&&{b}) will surely not make runtime worse, so it would not be a bad choice to implement it if convinced.

Another way concerns the hash join. It has been shown that this step costs a lot with respect to the overall runtime. Depending on available storage space and DBMS load, a kind of materialized view may be maintained in order to cut off the overloading join. Here are some suggested statements to create this helper table:

-- Creates a table jfoobar wrt. tables foo and bar
-- (max 10 seconds elapsed on my platform)
CREATE TABLE jfoobar AS
SELECT foo.bar_id, foo.keywords
FROM foo INNER JOIN bar ON foo.bar_id = bar.id;

-- Creates an index for jfoobar
-- (max 3 seconds elapsed on my platform)
-- this way...
CREATE INDEX jfoobar_idx ON jfoobar(bar_id, keywords);
-- or this one...
DROP INDEX jfoobar_idx;
CREATE INDEX jfoobar_idx ON jfoobar USING gin(keywords);

This table may be updated or replaced on a regular time base, or by a triggered event on updates to the foo or bar tables. Finally, any query of form 1 or 2 will then run in less than 300 ms:

-- form 1: single multi-value array
SELECT count(*) FROM jfoobar
WHERE keywords && '{ford, toyota, volkswagen, saturn, honda, porsche, hummer, ferrari}'::text[];

-- form 2: developed into 1-value arrays
SELECT count(*) FROM jfoobar
WHERE ( keywords && '{ford}'::text[]
     OR keywords && '{toyota}'::text[]
     OR keywords && '{volkswagen}'::text[]
     OR keywords && '{saturn}'::text[]
     OR keywords && '{honda}'::text[]
     OR keywords && '{porsche}'::text[]
     OR keywords && '{hummer}'::text[]
     OR keywords && '{ferrari}'::text[] );

The runtime would then only depend on the update strategy for the join table.

Regards.

--
nha / Lyon / France.
Accidentally sent to nha only.

--- On Wed, 7/8/09, John Cheng <jlcheng@ymail.com> wrote:

> From: John Cheng <jlcheng@ymail.com>
> Subject: Re: [GENERAL] Problem search on text arrays, using the overlaps (&&) operator
> To: "nha" <lyondif02@free.fr>
> Date: Wednesday, July 8, 2009, 4:24 PM
>
> Hi nha,
>
> I will try out your suggestion about a materialized view. I had never
> even thought about trying it.
>
> As luck would have it, I had to try out these tests on a different
> database today, which resulted in a different query plan that executed
> both forms in the same time. This different plan used a merge join
> instead of a hash join. I will research this area more to see if I
> learn anything new.
>
> You also pointed out that my queries (reports) are simply reporting
> off of a lot of data. Perhaps I need to see if the windowing functions
> in 8.4 can help improve things, or perhaps try to partition the data.
> Unfortunately, those kinds of changes will be bigger than I had
> expected.
----- "nha" <lyondif02@free.fr> wrote:

> Another way could concern the hash join. It has been shown that this
> step costs a lot with respect to the overall runtime. Depending on
> available storage space and DBMS load, a kind of materialized view may
> be handled in order to cut off the overloading join. Here are some
> suggested statements to create this helper table:
> [snip]

Hi nha,

Sorry about the long lag after your last post. I didn't want to post back until I had something solid to report on.

Using a materialized view turned out to be the best way to solve my problem. My coworker designed a new table that consists of the key columns of the 3 large tables that were being joined. A trigger is used to make sure the "materialized view" is kept up to date. Since new data is added infrequently (once a month), the cost of keeping the materialized view up to date is cheap. The resulting query runs exceedingly fast! :)

Thank you so much for your guidance. I have learned a lot from this incident!
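[Editor's note: the trigger-maintained table described above can be sketched roughly as follows. All names are illustrative -- the poster's actual schema spans 3 tables -- and 8.3-era syntax (EXECUTE PROCEDURE) is assumed:]

```sql
-- A hand-rolled "materialized view": key columns copied into a
-- narrow table, kept in sync by an AFTER row trigger on the source.
CREATE TABLE lead_search (
    lead_id   numeric PRIMARY KEY,
    typeflags text[]
);
CREATE INDEX lead_search_typeflags_idx
    ON lead_search USING gin (typeflags);

CREATE OR REPLACE FUNCTION lead_search_sync() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        DELETE FROM lead_search WHERE lead_id = OLD.lead_id;
    ELSIF TG_OP = 'UPDATE' THEN
        UPDATE lead_search SET typeflags = NEW.typeflags
         WHERE lead_id = NEW.lead_id;
    ELSE
        INSERT INTO lead_search (lead_id, typeflags)
        VALUES (NEW.lead_id, NEW.typeflags);
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER lead_search_sync_trg
AFTER INSERT OR UPDATE OR DELETE ON lead_reporting_data
FOR EACH ROW EXECUTE PROCEDURE lead_search_sync();
```

Queries then hit lead_search directly and skip the expensive hash join entirely, which matches the speedup reported above.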