Thread: DIfferent plans for explicit versus implicit join using link table

DIfferent plans for explicit versus implicit join using link table

From

"John D. Burger"

Date:

26 April 2007, 14:20:36

Hi -

I have a table of words and a table linking words in various ways:

create table allWords (
   wordID    serial    PRIMARY KEY,
   word        text
);
create unique index ix_allwords_word ON allwords (word);

create table allWordRelations (
   word1ID    integer references allWords,
   word2ID    integer references allWords,
   pos1        integer references posTypes,
   pos2        integer references posTypes,
   relID        integer references allRelationTypes,
   confidence    float,
   primary key (word1ID, word2ID, pos1, pos2, relID)
);
create index ix_allWordRelations_word1_pos1 on allWordRelations
(word1ID, pos1);
create index ix_allWordRelations_word2_pos2 on allWordRelations
(word2ID, pos2);

I have two queries for looking up related words which I think should
be equivalent, but 7.4.8 comes up with very different plans.  The
first query joins the word table to itself explicitly via the
relations table - this is very fast.  The second query uses an IN
against the link table in the where clause, and is very slow.  I'm
sure I can affect this by adding indexes, but I'm mainly trying to
understand what difference the planner is seeing.  EXPLAIN ANALYZE
output is below - can anyone explain?  Are my two queries subtly
different in terms of NULLs, or something like that?  Thanks.

- John Burger
   MITRE


explain analyze select w2.word from allwords w1 join allwordrelations
as r on (w1.wordid = r.word1id) join allwords w2 on (w2.wordid =
r.word2id) where w1.word = 'dogging';

       QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
--------------------
Nested Loop  (cost=0.00..579.05 rows=81 width=15) (actual
time=0.607..30.509 rows=59 loops=1)
    ->  Nested Loop  (cost=0.00..333.94 rows=81 width=4) (actual
time=0.564..29.032 rows=59 loops=1)
          ->  Index Scan using ix_allwords_word on allwords w1
(cost=0.00..3.49 rows=1 width=4) (actual time=0.326..0.329 rows=1
loops=1)
                Index Cond: (word = 'dogging'::text)
          ->  Index Scan using ix_allwordrelations_word1_pos1 on
allwordrelations r  (cost=0.00..329.36 rows=87 width=8) (actual
time=0.220..28.564 rows=59 loops=1)
                Index Cond: ("outer".wordid = r.word1id)
    ->  Index Scan using allwords_pkey on allwords w2
(cost=0.00..3.01 rows=1 width=19) (actual time=0.018..0.020 rows=1
loops=59)
          Index Cond: (w2.wordid = "outer".word2id)
Total runtime: 30.713 ms



explain analyze select w2.word from allwords w1, allwords w2 where
(w1.wordid, w2.wordid) in (select word1id, word2id from
allwordrelations ) and w1.word = 'dogging';

QUERY PLAN
------------------------------------------------------------------------
------------------------------------------------------------------------
-----
Nested Loop  (cost=760422.86..817628.29 rows=1 width=15) (actual
time=99277.403..111291.862 rows=59 loops=1)
    ->  Hash Join  (cost=760422.86..817625.27 rows=1 width=4) (actual
time=99277.110..111270.093 rows=59 loops=1)
          Hash Cond: ("outer".word1id = "inner".wordid)
          ->  Unique  (cost=760419.36..794740.32 rows=4576128
width=8) (actual time=96713.791..107843.446 rows=4302242 loops=1)
                ->  Sort  (cost=760419.36..771859.68 rows=4576128
width=8) (actual time=96713.785..102973.088 rows=4576035 loops=1)
                      Sort Key: allwordrelations.word1id,
allwordrelations.word2id
                      ->  Seq Scan on allwordrelations
(cost=0.00..79409.28 rows=4576128 width=8) (actual
time=0.008..8668.255 rows=4576035 loops=1)
          ->  Hash  (cost=3.49..3.49 rows=1 width=4) (actual
time=0.078..0.078 rows=0 loops=1)
                ->  Index Scan using ix_allwords_word on allwords w1
(cost=0.00..3.49 rows=1 width=4) (actual time=0.067..0.070 rows=1
loops=1)
                      Index Cond: (word = 'dogging'::text)
    ->  Index Scan using allwords_pkey on allwords w2
(cost=0.00..3.01 rows=1 width=19) (actual time=0.360..0.363 rows=1
loops=59)
          Index Cond: (w2.wordid = "outer".word2id)
Total runtime: 111292.449 ms

Re: DIfferent plans for explicit versus implicit join using link table

From

Tom Lane

Date:

26 April 2007, 17:44:30

"John D. Burger" <john@mitre.org> writes:
> I have two queries for looking up related words which I think should
> be equivalent, but 7.4.8 comes up with very different plans.

They're not at all equivalent:

> explain analyze select w2.word from allwords w1 join allwordrelations
> as r on (w1.wordid = r.word1id) join allwords w2 on (w2.wordid =
> r.word2id) where w1.word = 'dogging';

> explain analyze select w2.word from allwords w1, allwords w2 where
> (w1.wordid, w2.wordid) in (select word1id, word2id from
> allwordrelations ) and w1.word = 'dogging';

If there are duplicate word1id,word2id entries in allwordrelations, the
first query will produce duplicate outputs; the second will not.

If there were a unique constraint on (word1id, word2id), in theory
the planner could prove that the IN form could be simplified to a plain
join, but there is no such logic in HEAD let alone 7.4, and in any case
you've not got such a constraint.

The plan that gets chosen is to forcibly unique-ify the (word1id,
word2id) data (via a "sort | uniq"-like pipeline) and then do a normal
join with that.  Which is expensive because allwordrelations is big.
But the alternative is probably even worse: without that
allwordrelations has to be joined to w1 and w2 simultaneously, meaning
that the unconstrained cartesian product of w1 and w2 has to be formed
first.

            regards, tom lane

Re: DIfferent plans for explicit versus implicit join using link table

From

"John D. Burger"

Date:

26 April 2007, 23:25:44

Tom Lane replied:

>> I have two queries for looking up related words which I think should
>> be equivalent, but 7.4.8 comes up with very different plans.
>
> They're not at all equivalent:

> If there are duplicate word1id,word2id entries in allwordrelations,
> the
> first query will produce duplicate outputs; the second will not.

Ah, that should have been my second guess - whenever I fail to get
stuff like this, it's usually to do with either duplicates or NULLs.

> If there were a unique constraint on (word1id, word2id), in theory
> the planner could prove that the IN form could be simplified to a
> plain
> join, but there is no such logic in HEAD let alone 7.4, and in any
> case
> you've not got such a constraint.

But such would reflect the reality of my data, so it should be there.

> The plan that gets chosen is to forcibly unique-ify the (word1id,
> word2id) data (via a "sort | uniq"-like pipeline) and then do a normal
> join with that.  Which is expensive because allwordrelations is big.
> But the alternative is probably even worse: without that
> allwordrelations has to be joined to w1 and w2 simultaneously, meaning
> that the unconstrained cartesian product of w1 and w2 has to be formed
> first.

Hmm, but wouldn't it at least filter one side per my where clause:
w1.word = 'dogging'?  Anyway, thanks, the incremental enlightenment
continues.

- John Burger
   MITRE

Re: DIfferent plans for explicit versus implicit join using link table

From

Tom Lane

Date:

26 April 2007, 23:54:26

"John D. Burger" <john@mitre.org> writes:
> Tom Lane replied:
>> But the alternative is probably even worse: without that
>> allwordrelations has to be joined to w1 and w2 simultaneously, meaning
>> that the unconstrained cartesian product of w1 and w2 has to be formed
>> first.

> Hmm, but wouldn't it at least filter one side per my where clause:
> w1.word = 'dogging'?

Ah, right, it would do that --- but you still then have to join each of
those rows to every row of w2 before you can do the IN check, and each
of those IN checks would be an index probe into allwordrelations, which
is not that cheap.  (Or at least 7.4 doesn't think so --- it does not
have any understanding about multiple index probes on the inside of a
nestloop being cheaper than single probes due to caching of the upper
index levels.  You really ought to think about getting onto a newer
version; 8.2 is quite a lot smarter than 7.4.)

            regards, tom lane