Thread: Performance

Performance

From
Cees van de Griend
Date:
Dear list,

I'm at a loss and need help.

I have a server (SuSE 7.1) running PostgreSQL 7.0.3 with two databases.
The databases are roughly the same size. Basically, the setup is as
follows:

tblNumber:    project_id    INTEGER
        project_name     VARCHAR(30)
        ddi_number    INTEGER
        phone_number    VARCHAR(30)

tblData:    start_call    DATE
        project_id    INTEGER
        ddi_number    INTEGER
        ...more fields...

The main index is on tblData on fields (start_call, project_id,
ddi_number). This index is indeed used as EXPLAIN shows.
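
In DDL terms the index is something like this (the index name here is
made up):

    CREATE INDEX idx_tbldata_start_proj_ddi
        ON tblData (start_call, project_id, ddi_number);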

The query is something like:
SELECT    n.project_name,
    n.phone_number,
    SUM(d.data),
    SUM(d.data...)
FROM    tblNumber n,
    tblData d
WHERE    d.start_call BETWEEN i AND j
AND    d.project_id BETWEEN k AND l
AND    d.project_id = n.project_id
AND    d.ddi_number = n.ddi_number
GROUP BY n.project_name, n.phone_number
ORDER BY n.project_name, n.phone_number;

VACUUM ANALYZE is run on both databases daily.

The strange part is that a query on the first database takes 2.5 seconds
and on the second one 3 minutes and 7.1 seconds!
EXPLAIN looks the same on both databases.

I've attached the output.
The first query is the quick one, the second the slow one.

What can be the cause of this huge difference in time?

Regards,
Cees.

Attachment

Re: Performance

From
Frank Bax
Date:
The EXPLAINs don't look the same to me.  The first explain indicates use of
a HASH JOIN, whereas the second is using a NESTED LOOP.

I'm guessing some of the other indexes on the two tables are not the same
between the two databases.  One of these other indexes is being used in the
first explain, but not the second.
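
An easy way to compare is to list the index definitions in both databases
from psql, something like this (table names as in your sketch, which
PostgreSQL folds to lower case):

    SELECT indexname, indexdef
    FROM   pg_indexes
    WHERE  tablename IN ('tblnumber', 'tbldata');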

Frank

At 05:56 PM 2/2/02 +0100, Cees van de Griend wrote:
>I have a server (SuSE 7.1) running PostgreSQL 7.0.3 with two databases.
>The databases are roughly the same size. Basically, the setup is as
>follows:
>
>tblNumber:    project_id    INTEGER
>        project_name     VARCHAR(30)
>        ddi_number    INTEGER
>        phone_number    VARCHAR(30)
>
>tblData:    start_call    DATE
>        project_id    INTEGER
>        ddi_number    INTEGER
>        ...more fields...
>
>The main index is on tblData on fields (start_call, project_id,
>ddi_number). This index is indeed used as EXPLAIN shows.
>
>The query is something like:
>SELECT    n.project_name,
>    n.phone_number,
>    SUM(d.data),
>    SUM(d.data...)
>FROM    tblNumber n,
>    tblData d
>WHERE    d.start_call BETWEEN i AND j
>AND    d.project_id BETWEEN k AND l
>AND    d.project_id = n.project_id
>AND    d.ddi_number = n.ddi_number
>GROUP BY n.project_name, n.phone_number
>ORDER BY n.project_name, n.phone_number;
>
>VACUUM ANALYZE is run on both databases daily.
>
>The strange part is that a query on the first database takes 2.5 seconds
>and on the second one 3 minutes and 7.1 seconds!
>EXPLAIN looks the same on both databases.
>
>I've attached the output.
>The first query is the quick one, the second the slow one.
>
>What can be the cause of this huge difference in time?
>
>Regards,
>Cees.
>

Re: Performance

From
Tom Lane
Date:
Cees van de Griend <cees-list@griend.xs4all.nl> writes:
> The strange part is that a query on the first database takes 2.5 seconds
> and on the second one 3 minutes and 7.1 seconds!
> EXPLAIN looks the same on both databases.

No, it doesn't look the same at all: you're getting hash joins in one
case and nestloop joins in the other.  I'd also observe that you haven't
given us anything close to an accurate version of the query, as the
EXPLAINs show four input tables not two.  Moreover, the two EXPLAINs
are clearly not for the same query (the table names aren't the same).

>                       ->  Hash Join  (cost=20.38..1145.68 rows=1161 width=40)
>                             ->  Seq Scan on tblcitsdialinday dd  (cost=0.00..1006.03 rows=2485 width=32)
>                             ->  Hash  (cost=18.10..18.10 rows=910 width=8)
>                                   ->  Seq Scan on tblcitsddinumber dn  (cost=0.00..18.10 rows=910 width=8)

vs

>                       ->  Nested Loop  (cost=0.00..1119.65 rows=4 width=40)
>                             ->  Seq Scan on tblatnnumber dn  (cost=0.00..10.03 rows=503 width=8)
>                             ->  Index Scan using idxatndialinday04 on tblatndialinday dd  (cost=0.00..2.19 rows=1 width=32)

The problem is presumably that the planner is drastically
underestimating the number of joinable rows in "dn" in the
second case, and so choosing a plan that works well if that
number is small but not well when it's large.  There's not
a lot you can do about this in 7.0.3, but you are overdue
for an update anyway.  I'd suggest trying it in 7.2, due
out Monday.
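
One way to see the misestimate is to count the actual join result and
compare it with the "rows=4" the nestloop plan shows above (column names
as in your sketch, placeholder bounds as in your query):

    SELECT count(*)
    FROM   tblatnnumber dn, tblatndialinday dd
    WHERE  dd.start_call BETWEEN i AND j
    AND    dd.project_id BETWEEN k AND l
    AND    dd.project_id = dn.project_id
    AND    dd.ddi_number = dn.ddi_number;

If the count comes back far above 4, the plan was chosen on bad
information.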

            regards, tom lane

Re: Performance

From
Tom Lane
Date:
Cees van de Griend <cees@griend.xs4all.nl> writes:
>> The problem is presumably that the planner is drastically
>> underestimating the number of joinable rows in "dn" in the
>> second case, and so choosing a plan that works well if that
>> number is small but not well when it's large.

> This is the strange part. The sizes of the 2 databases are not that
> great and the 'fast' one has a bigger Number table:

Nothing strange about that.  The larger table is probably enough
larger to persuade the planner to take the hash-join plan; remember
the problem in the other case is that the planner thinks there are few
enough rows to make nestloop appropriate, when there really are too
many for that plan to be a good choice.

> Someone has altered the table and added a column which is never used:
> there is no data in it and it never appears in any query.
> Can this be the reason for the huge performance loss?

No.

            regards, tom lane

Re: Performance

From
Cees van de Griend
Date:
Tom, thanks for taking the trouble to answer my questions.
I'm a long-time reader of this list and you are always very helpful.

On Sat, Feb 02, 2002 at 03:57:19PM -0500, Tom Lane wrote:
> Cees van de Griend <cees@griend.xs4all.nl> writes:
> >> The problem is presumably that the planner is drastically
> >> underestimating the number of joinable rows in "dn" in the
> >> second case, and so choosing a plan that works well if that
> >> number is small but not well when it's large.
>
> > This is the strange part. The sizes of the 2 databases are not that
> > great and the 'fast' one has a bigger Number table:
>
> Nothing strange about that.  The larger table is probably enough
> larger to persuade the planner to take the hash-join plan; remember
> the problem in the other case is that the planner thinks there are few
> enough rows to make nestloop appropriate, when there really are too
> many for that plan to be a good choice.

OK, I can understand this, but...

> > Someone has altered the table and added a column which is never used:
> > there is no data in it and it never appears in any query.
> > Can this be the reason for the huge performance loss?
>
> No.

Sigh. Then what can it be?

Until a few days ago, the 'same' query took approximately the same time
on the 'slow' database as on the 'fast' one: seconds. I had never taken
the time to study the results of EXPLAIN, because I was happy with the
default behaviour.

Now, on the 'fast' database it still takes seconds, but on the 'slow'
one it takes minutes, and the only change I can think of is the added
column.

The databases are on the same server; they share the same processor, the
same memory and the same disk.

What can possibly be the cause of the difference in performance?

Is it as simple as a wrong choice by the planner, and can a wrong choice
have such a huge effect?

>             regards, tom lane

Regards,
Cees.


Re: Performance

From
Tom Lane
Date:
Cees van de Griend <cees-list@griend.xs4all.nl> writes:
> What can possibly be the cause of the difference in performance?

Probably the VACUUM ANALYZE statistics changed just enough to push the
planner into making the wrong choice.  You could experiment with doing
"set enable_nestloop to off" and then EXPLAIN to see what the plan and
cost are; I'll bet that the estimated cost of the hash plan is now
just fractionally more than that of the nestloop.
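
For example, in psql (substitute your real query for the SELECT):

    SET enable_nestloop TO off;
    EXPLAIN SELECT ... ;
    SET enable_nestloop TO on;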

Of course, the *true* costs are very different, which is why I consider
this a planner estimation failure.

> Is is as simple as a wrong choice of the planner and can a wrong choice
> have such huge effect?

Yes, and yes.

            regards, tom lane

Re: Performance

From
Cees van de Griend
Date:
On Sat, Feb 02, 2002 at 05:09:58PM -0500, Tom Lane wrote:
> Cees van de Griend <cees-list@griend.xs4all.nl> writes:
> > What can possibly be the cause of the difference in performance?
>
> Probably the VACUUM ANALYZE statistics changed just enough to push the
> planner into making the wrong choice.  You could experiment with doing
> "set enable_nestloop to off" and then EXPLAIN to see what the plan and
> cost are; I'll bet that the estimated cost of the hash plan is now
> just fractionally more than that of the nestloop.
>
> Of course, the *true* costs are very different, which is why I consider
> this a planner estimation failure.

I have dumped the table, dropped it, and loaded the dump back into the
database. Now the speed is as it should be (for the time being).
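
For reference, that was roughly the following (table and database names
are placeholders here):

    pg_dump -t tblnumber slowdb > tblnumber.sql
    psql slowdb -c 'DROP TABLE tblnumber;'
    psql slowdb < tblnumber.sql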

If I understand your explanation correctly, a possible hack to circumvent
this problem would be to trick the planner into choosing the hash join
method by inserting dummy data into tbl[XXX]Number.

Another possible fix is to execute 'SET enable_nestloop = OFF;' before every
query.

Is this correct?

>             regards, tom lane

Regards,
Cees.


Re: Performance

From
Tom Lane
Date:
Cees van de Griend <cees-list@griend.xs4all.nl> writes:
> ... a possible hack to circumvent this problem ...

Before contorting your data or application to hack around this problem,
I'd suggest updating to 7.2 and seeing if the problem is fixed for real
by the new statistics code.

7.2 should be out tomorrow, if Marc wraps it on schedule.

            regards, tom lane