Thread: Need help - optimizer trouble

Need help - optimizer trouble

From
Helge Bahmann
Date:
Hi!

I am desperate with 7.1beta4; I have a serious performance problem that
does not manifest in 7.0.2. This is currently just a feasability test, so
I am willing to try every hack.

The problematic part of my schema looks as follows:

document       written_by       author           author_keyword
-------- 1   n ----------       ----------       --------------
*doc_id  <---- *doc_id    n   1 *name      1   n *keyword
doc            *author_id ----> *author_id <---- *author_id

columns marked with an asterik have an index

document contains ~100 000 tuples; each is "written_by" 1-3 out of
~10 000 authors, and every author is associated with ~4 keywords

I am trying to retrieve documents written by authors associated with
a given keyword. I can get the doc_ids using:

SELECT doc_id FROM written_by AS wb
    JOIN author_keyword AS kw ON kw.author_id=wb.author_id
    WHERE kw.keyword='foo'

7.1beta4 and 7.0.2 both use the following query plan:

Nested Loop  (cost=0.00..18.37 rows=8 width=20)
  ->  Index Scan using author_keyword_pkey on author_keyword kw  (cost=0.00..2.02 rows=1 width=4)
  ->  Index Scan using written_by_idx on written_by  (cost=0.00..16.29 rows=4 width=16)

I can then take the values returned and retrieve the documents in a
separate query; the two queries take at most a second. However if
I add a join to get the documents directly:

SELECT doc FROM document
    JOIN written_by AS wb ON document.doc_id=wb.doc_id
    JOIN author_keyword AS kw ON kw.author_id=wb.author_id
    WHERE kw.keyword='foo'

7.0.2 will use the following query plan, query takes ~1 second:

Nested Loop  (cost=0.00..27.53 rows=2 width=32)
  ->  Nested Loop  (cost=0.00..10.00 rows=4 width=20)
        ->  Index Scan using author_keyword_pkey on author_keyword kw (cost=0.00..2.02 rows=1 width=4)
        ->  Index Scan using written_by_idx on ist_autor_von (cost=0.00..7.95 rows=2 width=16)
  ->  Index Scan using document_pkey on document  (cost=0.00..4.16 rows=1 width=12)

7.1beta4 uses the following query plan, query takes ~150 seconds:

Nested Loop  (cost=8562.09..39846.62 rows=4 width=32)
  ->  Index Scan using author_keyword_pkey on author_keyword kw (cost=0.00..2.02 rows=1 width=4)
  ->  Materialize  (cost=37969.52..37969.52 rows=100000 width=28)
        ->  Hash Join  (cost=8562.09..37969.52 rows=100000 width=28)
              ->  Seq Scan on document  (cost=0.00..11932.00 rows=100000 width=12)
              ->  Hash  (cost=5129.55..5129.55 rows=203555 width=16)
                    ->  Seq Scan on written_by  (cost=0.00..5129.55 rows=303555 width=16)

The two databases are not identical, but they were created using the
same script which fills the database with random data of the above
structure. I recreated the databases several times, so it is completely
repeatable.

I tried playing with the optimizer parameters, but it only got worse
because I do not really understand what they are all doing.

I fell uncomfortable with the idea of having to split up my query into
two separate ones. Can anyone give me any hints how to influence the 7.1
optimizer to behave like 7.0.2 in this case? Is there something wrong with
my query? Is there an alternative form which the optimizer can handle
better?

Thanks for your patience

Helge


Re: Need help - optimizer trouble

From
Neil Conway
Date:
On Thu, Apr 05, 2001 at 12:01:39AM +0200, Helge Bahmann wrote:
> 7.0.2 will use the following query plan, query takes ~1 second:
>
> Nested Loop  (cost=0.00..27.53 rows=2 width=32)
>   ->  Nested Loop  (cost=0.00..10.00 rows=4 width=20)
>         ->  Index Scan using author_keyword_pkey on author_keyword kw (cost=0.00..2.02 rows=1 width=4)
>         ->  Index Scan using written_by_idx on ist_autor_von (cost=0.00..7.95 rows=2 width=16)
>   ->  Index Scan using document_pkey on document  (cost=0.00..4.16 rows=1 width=12)
>
> 7.1beta4 uses the following query plan, query takes ~150 seconds:
>
> Nested Loop  (cost=8562.09..39846.62 rows=4 width=32)
>   ->  Index Scan using author_keyword_pkey on author_keyword kw (cost=0.00..2.02 rows=1 width=4)
>   ->  Materialize  (cost=37969.52..37969.52 rows=100000 width=28)
>         ->  Hash Join  (cost=8562.09..37969.52 rows=100000 width=28)
>               ->  Seq Scan on document  (cost=0.00..11932.00 rows=100000 width=12)
>               ->  Hash  (cost=5129.55..5129.55 rows=203555 width=16)
>                     ->  Seq Scan on written_by  (cost=0.00..5129.55 rows=303555 width=16)
>
> The two databases are not identical, but they were created using the
> same script which fills the database with random data of the above
> structure. I recreated the databases several times, so it is completely
> repeatable.

If the amount of data in the databases is similar, it looks like the
planner's guesses are totally off. Have you VACUUM ANALYZE'd both databases
recently?

(That's the obvious question. I'll leave it to the PgSQL gurus to help
you if that's not the problem.)

Cheers,

Neil

--
Neil Conway <neilconway@home.com>
Get my GnuPG key from: http://klamath.dyndns.org/mykey.asc
Encrypted mail welcomed

Vegetarians do not love animals... they hate plants.

Re: Need help - optimizer trouble

From
Tom Lane
Date:
Helge Bahmann <bahmann@math.tu-freiberg.de> writes:
> SELECT doc FROM document
>     JOIN written_by AS wb ON document.doc_id=wb.doc_id
>     JOIN author_keyword AS kw ON kw.author_id=wb.author_id
>     WHERE kw.keyword='foo'

> 7.0.2 will use the following query plan, query takes ~1 second:
> ...
> 7.1beta4 uses the following query plan, query takes ~150 seconds:
> ...

7.1 treats the JOIN clauses as determining join order, where 7.0 did
not.  Looks like you should re-order the query to be, say,

SELECT doc FROM written_by AS wb
    JOIN author_keyword AS kw ON kw.author_id=wb.author_id
    JOIN document ON document.doc_id=wb.doc_id
    WHERE kw.keyword='foo'

This behavior is a bit of a hack, which we'll likely reconsider in
future releases, but it does have some redeeming social value too.
See
http://www.postgresql.org/devel-corner/docs/postgres/explicit-joins.html

            regards, tom lane

Re: Need help - optimizer trouble

From
Helge Bahmann
Date:
On Wed, 4 Apr 2001, Tom Lane wrote:
> 7.1 treats the JOIN clauses as determining join order, where 7.0 did
> not.  Looks like you should re-order the query to be, say,
>
> SELECT doc FROM written_by AS wb
>     JOIN author_keyword AS kw ON kw.author_id=wb.author_id
>     JOIN document ON document.doc_id=wb.doc_id
>     WHERE kw.keyword='foo'

It works! I restructured my query

SELECT doc FROM document
    JOIN (SELECT keyword, doc_id FROM written_by AS _w
        JOIN author_keyword AS _kw ON _w.author_id=_kw.author_id
    ) AS wb ON document.doc_id=wb.doc_id
    WHERE wb.keyword='foo'

which is easier to handle for my application logic and it is pretty fast
now.

Many thanks for your help!

regards,
Helge


Re: Need help - optimizer trouble

From
"ADBAAMD"
Date:
Tom Lane wrote:

>
> 7.1 treats the JOIN clauses as determining join order, where 7.0 did
> not.  Looks like you should re-order the query to be, say,
>
> SELECT doc FROM written_by AS wb
>     JOIN author_keyword AS kw ON kw.author_id=wb.author_id
>     JOIN document ON document.doc_id=wb.doc_id
>     WHERE kw.keyword='foo'
>
> This behavior is a bit of a hack, which we'll likely reconsider in
> future releases, but it does have some redeeming social value too.
> See
> http://www.postgresql.org/devel-corner/docs/postgres/explicit-joins.html

    Excuse my audacity in posing such a preposterous question... but
shouldn't the SQL syntax be as free of effects on the physical level as
possible?  I've already posted a similar question at pgsql-general that
got no answer, but I will try it again here.

    As I understand from the relational model there should be three levels:
user, logical and physical.  The SELECT command should pertain to the
user level, and the access paths to the physical.  Any alteration in the
behaviour of the optimizer should be or (1) set by the administrator at
the optimizer itself, or (2) at the logical-physical levels' mapping, or
(3) thru hints delivered parallel with, but separate to, the SQL syntax
of the query itself.

    So this change, besides really being a hack, would kind of pollute the
environment by adding complexity to the use of SQL.  Perhaps it would be
cleaner to have somehing like Oracle's hints?

    Thanks for your attention.



--
  _
/ \   Leandro Guimarães Faria Corsetti Dutra         +55 (11) 3040 8913
\ /   Amdocs at Bell Canada                          +1 (514) 786 87 47
  X    Support Center, São Paulo, Brazil          mailto:adbaamd@bell.ca
/ \   http://terravista.pt./Enseada/1989/    mailto:leandrod@amdocs.com



Re: Need help - optimizer trouble

From
Tom Lane
Date:
"ADBAAMD" <adba.amdocs@bell.ca> writes:
> Tom Lane wrote:
>> This behavior is a bit of a hack, which we'll likely reconsider in
>> future releases, but it does have some redeeming social value too.

> Perhaps it would be cleaner to have somehing like Oracle's hints?

Possibly.  Want to enlighten those of us who've never used Oracle about
how that works?

            regards, tom lane

Re: Need help - optimizer trouble

From
"ADBAAMD"
Date:
Tom Lane wrote:

> "ADBAAMD" <adba.amdocs@bell.ca> writes:
>
>> Tom Lane wrote:
>>
>>> This behavior is a bit of a hack, which we'll likely reconsider in
>>> future releases, but it does have some redeeming social value too.
>
>> Perhaps it would be cleaner to have somehing like Oracle's hints?
>
> Possibly.  Want to enlighten those of us who've never used Oracle about
> how that works?

    You can read (almost) all about it at
http://technet.oracle.com/doc/server.804/a58246/toc.htm, more to the
point http://technet.oracle.com/doc/server.804/a58246/optimiz.htm#2250.

    For example:

  SELECT    /*+ STAR */
    doc_id
    FROM
    written_by    AS wb
    JOIN
    author_keyword    AS kw
      ON
    kw.author_id    =  wb.author_id
   WHERE
    kw.keyword    =  'foo'
;


    There is also a /*+ ORDERED */ hint, but I dislike it as it also requires
query rewriting.  Keep in mind I have an opinion against having to
rewrite queries for performance reasons, thinking this should be done at
the optimizer or preferably at the physical level or the
logical-to-physical mapping.


    Also it does not seems to me very nice to have queries changing behaviour
from version to version and then back again.



--
  _
/ \   Leandro Guimarães Faria Corsetti Dutra         +55 (11) 3040 8913
\ /   Amdocs at Bell Canada                          +1 (514) 786 87 47
  X    Support Center, São Paulo, Brazil          mailto:adbaamd@bell.ca
/ \   http://terravista.pt./Enseada/1989/    mailto:leandrod@amdocs.com