Re: Problem search on text arrays, using the overlaps (&&) operator - Mailing list pgsql-general

From John Cheng
Subject Re: Problem search on text arrays, using the overlaps (&&) operator
Date
Msg-id 220686.33449.qm@web43404.mail.sp1.yahoo.com
Whole thread Raw
In response to Problem search on text arrays, using the overlaps (&&) operator  (John Cheng <jlcheng@ymail.com>)
List pgsql-general

----- "nha" <lyondif02@free.fr> wrote:

> From: "nha" <lyondif02@free.fr>
> To: "John Cheng" <jlcheng@ymail.com>
> Cc: pgsql-general@postgresql.org
> Sent: Monday, July 6, 2009 9:12:22 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [GENERAL] Problem search on text arrays, using the overlaps (&&) operator
>
> Hello,
>
> With your exhaustive example statements based on table foo and cars, I
>
> performed some measures on my side (PostgreSQL 8.3.1 server). Here are
>
> some statistical results:
>
[ ... snipped ... ]
>
> In my opinion, analysis and optimization may be deepen over table
> indexes used for join planning. As your reported query plans show, the
>
> Where clauses are performed independantly from the table ml_lead; the
>
> reason is that all the attributes of the clauses belong to the table
> lead_reporting_data. Time may be reduced on join condition
> achievements.
>
> Hoping this observation will contribute a little to your opinion.
>
> Without any claim, I attached a document to this email for details on
>
> the measures I took with the overlap operator -- OpenDocument
> Spreadsheet (ODS) v2 formatted file, 24 kiB. The 3rd sheet "various"
> presents the detailed measures related to the data reported in this
> email.
>
> Regards.
>
> --
> nha / Lyon / France.

Hi nha,

I had not expected anyone to go to such lengths to evaluate my
situation, thank you so much!

After looking at your analysis, I realized that the test case I
created isn't close enough to the queries running in our prod
environment. For one, table 'foo' does not join to another table; The
other thing is that the amount of data isn't the same; Finally, these
tables have been ANALYZED.

So I took some time to update the test case. On our server, running
8.3.6, I was able to reproduce the difference between the two styles:
"arr&&{f,b}" and "arr&&{f} or arr&&{b}".

First, the setup:

-- Begin test case
-- Sets up 'bar'
SELECT id INTO TEMP TABLE bar FROM (SELECT generate_series(1,300000) as id) AS bar;
CREATE INDEX bar_idx ON bar (id);
ANALYZE bar;
-- Sets up 'foo'
CREATE TEMP SEQUENCE foo_bar_id_seq;
CREATE TEMP TABLE foo (
    bar_id numeric DEFAULT NEXTVAL('foo_bar_id_seq'),
    keywords text[]
);
CREATE INDEX foo_idx ON foo USING gin (keywords);
INSERT INTO foo (keywords) VALUES ('{ford}'::text[]);
INSERT INTO foo (keywords) VALUES ('{toyota}'::text[]);
INSERT INTO foo (keywords) VALUES ('{volkswagen}'::text[]);
INSERT INTO foo (keywords) VALUES ('{saturn}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche}'::text[]);
INSERT INTO foo (keywords) VALUES ('{porsche, audi, chrysler}'::text[]);
INSERT INTO foo (keywords) VALUES ('{honda, hummer, ferrari}'::text[]);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
INSERT INTO foo (keywords) (SELECT keywords FROM foo);
ANALYZE foo;
-- End test case

Query for the form "arr&&{f,b}"
SELECT
    count(*)
FROM foo
INNER JOIN bar ON foo.bar_id = bar.id
WHERE
    foo.keywords && '{ford, toyota, volkswagen, saturn, honda, porsche, hummer, ferrari}'::text[];

Query for the form "arr&&{f} or arr&&{b}":
SELECT
    count(*)
FROM foo, bar
WHERE
    foo.bar_id = bar.id
    AND
    (
        keywords && '{ford}'::text[]
        OR keywords && '{toyota}'::text[]
        OR keywords && '{volkswagen}'::text[]
        OR keywords && '{saturn}'::text[]
        OR keywords && '{honda}'::text[]
        OR keywords && '{porsche}'::text[]
        OR keywords && '{hummer}'::text[]
        OR keywords && '{ferrari}'::text[]
    );

For the first form, "arr&&{f,b}", the query takes about 15
seconds. For the second form "arr&&{f} or arr&&{b}", we get about 8
seconds. The difference is around 90-100%, which is what I am seeing on
our real life queries.

The query plans also become similar to the real life query plans. But
I am having a hard time learning from them. The only interesting I see
is that the estimated cost seems to be different than the actual run
time. The second form has a higher estimated cost than the first form,
but has a lower run time.

In this test case, the query filters by any of 8 keywords. Note that
with just 2 keywords, the difference is only about 5 seconds. The
specific report that our users complained about involved 16
"keywords", where the difference is about 100%.

When you said "analysis and optimization may be deepen over table
indexes used for join planning", I'm not sure what you mean. Can you
clarify?






pgsql-general by date:

Previous
From: Lee Harr
Date:
Subject: combine multiple row values in to one row
Next
From: Ivan Sergio Borgonovo
Date:
Subject: Re: combine multiple row values in to one row