Thread: Performance Killer 'IN' ?

Performance Killer 'IN' ?

From
Kai Hessing
Date:
Hossa,

I just made a little test on our test-database. I have an excel sheet
with about 2000 entries that should be updated with the same value. In a
first try I generated an SQL-Syntax for every entry like:

UPDATE xyz SET status=-6 WHERE id=xyz1 AND status>-1;
UPDATE xyz SET status=-6 WHERE id=xyz2 AND status>-1;
UPDATE xyz SET status=-6 WHERE id=... AND status>-1;

The execution of the ~2000 SQL-commands took about 5-10 seconds.

Then I tried the same with generating only one request using IN with the
twothousand entries like:

UPDATE xyz WHERE id IN (xyz1, xyz2, ....) AND status>-1;

and it took about 10 Minutes to execute. So it is nearly a hundred times
slower. Can this be verified? Is there anything that can be done about
that? Else I would need to have a few words with our programmers...

By the way the testsystem is a basic Suse 9.3 with a default postgres
installation 8.0.x

Thanks and
*greets*
Kai

Re: Performance Killer 'IN' ?

From
Kai Hessing
Date:
> and it took about 10 Minutes to execute. So it is nearly a hundred times
> slower. Can this be verified? Is there anything that can be done about
> that? Else I would need to have a few words with our programmers...

OK... I overstated a little bit. The real numbers are:

1.) 21.5 seconds
2.) 363.7 seconds

But it is still a significant difference.

Re: Performance Killer 'IN' ?

From
"Joshua D. Drake"
Date:
Kai Hessing wrote:
>> and it took about 10 Minutes to execute. So it is nearly a hundred times
>> slower. Can this be verified? Is there anything that can be done about
>> that? Else I would need to have a few words with our programmers...
>
> OK... I overstated a little bit. The real numbers are:
>
> 1.) 21.5 seconds
> 2.) 363.7 seconds
>
> But it is still a significant difference.

Can you provide an explain analyze of each query?

>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: if posting/reading through Usenet, please send an appropriate
>        subscribe-nomail command to majordomo@postgresql.org so that your
>        message can get through to the mailing list cleanly
>


--

             === The PostgreSQL Company: Command Prompt, Inc. ===
       Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
       Providing the most comprehensive  PostgreSQL solutions since 1997
                      http://www.commandprompt.com/



Re: Performance Killer 'IN' ?

From
Kai Hessing
Date:
Joshua D. Drake wrote:
> Kai Hessing wrote:
>> 1.) 21.5 seconds
>> 2.) 363.7 seconds
>>
>> But it is still a significant difference.
>
> Can you provide an explain analyze of each query?

There isn't an explain analyze of the first variant, because there are
just 2000 SQL-Updates. The explain analyze for just one of the commands
(UPDATE xyz SET status=-6 WHERE phon='xyz1' AND status>-1;) is:
------------------
Index Scan using phon_phon_idx on phon  (cost=0.00..5193.83 rows=530
width=148) (actual time=0.146..0.146 rows=0 loops=1)

  Index Cond: ((phon)::text = 'xyz'::text)

  Filter: (status > -1)

Total runtime: 0.387 ms

1 Datensätze (means data sets)

Laufzeit gesamt: 16.682 ms (means running time)
------------------

The second one (UPDATE xyz WHERE id IN (xyz1, xyz2, ....) AND
status>-1;) returns:
------------------
Seq Scan on phon  (cost=0.00..1573304.58 rows=105931 width=148) (actual
time=369563.565..369563.565 rows=0 loops=1)

  Filter: ((((phon)::text = 'xyz1) OR ((phon)::text = 'xyz2'::text) OR
((phon)::text = 'xyz3'::text) OR ((phon)::text = 'xyz4'::text) OR ...
[all the 2000 entries from the IN clause]

Total runtime: 369566.954 ms

667 Datensätze (means data sets)

Laufzeit gesamt: 370,179.246 ms (means running time)
------------------


Hope, that helps. Have a nice weekend. Being back to work on monday ;)

*greets*
Kai

Re: Performance Killer 'IN' ?

From
"Marko Kreen"
Date:
On 3/31/06, Kai Hessing <kai.hessing@hobsons.de> wrote:
> The second one (UPDATE xyz WHERE id IN (xyz1, xyz2, ....) AND
> status>-1;) returns:
> ------------------
> Seq Scan on phon  (cost=0.00..1573304.58 rows=105931 width=148) (actual
> time=369563.565..369563.565 rows=0 loops=1)

Just a shot in the dark: does the plan stay the same,
when you remove the ' AND status > -1' ?

--
marko

Re: Performance Killer 'IN' ?

From
Tom Lane
Date:
Kai Hessing <kai.hessing@hobsons.de> writes:
> Index Scan using phon_phon_idx on phon  (cost=0.00..5193.83 rows=530
> width=148) (actual time=0.146..0.146 rows=0 loops=1)
> ...
> Seq Scan on phon  (cost=0.00..1573304.58 rows=105931 width=148) (actual
> time=369563.565..369563.565 rows=0 loops=1)

You need to look into the discrepancy between estimated and actual row
counts.  (I suppose the reason you're showing 0 rows here is that you
already did these UPDATEs and so none of the rows in question pass the
status filter --- but how many rows are there matching the phon index
conditions?)  Perhaps a larger statistics target for the phon column
would be a good idea.

            regards, tom lane

Re: Performance Killer 'IN' ?

From
Kai Hessing
Date:
Marko Kreen wrote:
> On 3/31/06, Kai Hessing <kai.hessing@hobsons.de> wrote:
>> The second one (UPDATE xyz WHERE id IN (xyz1, xyz2, ....) AND
>> status>-1;) returns:
>> ------------------
>> Seq Scan on phon  (cost=0.00..1573304.58 rows=105931 width=148) (actual
>> time=369563.565..369563.565 rows=0 loops=1)
>
> Just a shot in the dark: does the plan stay the same,
> when you remove the ' AND status > -1' ?

No difference: I skipped the 'AND status > -1' and have the following
results...


Using 2000x 'UPDATE phon SET status=-6 WHERE ' returns each time:
----------------------------
Index Scan using phon_phon_idx on phon  (cost=0.00..5179.80 rows=1587
width=148) (actual time=31.452..31.470 rows=1 loops=1)

  Index Cond: ((phon)::text = 'wink4103@uni-trier.de'::text)

Total runtime:3.414 ms
----------------------------
(Total runtime for all 2000 Updates: 23335.393 ms


Using the 'UPDATE xyz WHERE id IN ('xyz1', 'xyz2', other 2000
values.....)' returns:
----------------------------
Seq Scan on phon  (cost=0.00..1564960.67 rows=317227 width=148) (actual
time=68.315..365621.761 rows=2522 loops=1)

  Filter: (((phon)::text = '.....

Total runtime: 393182.745 ms
----------------------------

Re: Performance Killer 'IN' ?

From
Kai Hessing
Date:
Tom Lane wrote:
> Kai Hessing <kai.hessing@hobsons.de> writes:
>> Index Scan using phon_phon_idx on phon  (cost=0.00..5193.83 rows=530
>> width=148) (actual time=0.146..0.146 rows=0 loops=1)
>> ...
>> Seq Scan on phon  (cost=0.00..1573304.58 rows=105931 width=148) (actual
>> time=369563.565..369563.565 rows=0 loops=1)
>
> You need to look into the discrepancy between estimated and actual row
> counts.  (I suppose the reason you're showing 0 rows here is that you
> already did these UPDATEs and so none of the rows in question pass the
> status filter --- but how many rows are there matching the phon index
> conditions?)  Perhaps a larger statistics target for the phon column
> would be a good idea.

Yes... The 0 rows are there because I did the command before. Now I
resetted the test database to a previous state and dopped the 'AND
status>-1' in the SQL-syntax:

Using the 'UPDATE xyz WHERE id IN ('xyz1', 'xyz2', other 2000
values.....)' returns:
----------------------------
Seq Scan on phon  (cost=0.00..1564960.67 rows=317227 width=148) (actual
time=68.315..365621.761 rows=2522 loops=1)

  Filter: (((phon)::text = '.....

Total runtime: 393182.745 ms
----------------------------

(please see also the other post <49ck9sFo32mbU1@individual.net> )

What do you mean with larger statistics target?

Re: Performance Killer 'IN' ?

From
"Marko Kreen"
Date:
On 4/3/06, Kai Hessing <kai.hessing@hobsons.de> wrote:
> Marko Kreen wrote:
> > Just a shot in the dark: does the plan stay the same,
> > when you remove the ' AND status > -1' ?
>
> No difference: I skipped the 'AND status > -1' and have the following
> results...

Ok.  Thanks.  I once had similar query and it seemed that postgres
got confused if there was big IN and alse something else in WHERE
clause.  Dropping the rest got postgres to use index for IN.

But I did not have time to research it then and with your result
seems it's not a problem in recent versions.

--
marko

Re: Performance Killer 'IN' ?

From
Tom Lane
Date:
Kai Hessing <kai.hessing@hobsons.de> writes:
> Yes... The 0 rows are there because I did the command before. Now I
> resetted the test database to a previous state and dopped the 'AND
> status>-1' in the SQL-syntax:

> Using the 'UPDATE xyz WHERE id IN ('xyz1', 'xyz2', other 2000
> values.....)' returns:
> ----------------------------
> Seq Scan on phon  (cost=0.00..1564960.67 rows=317227 width=148) (actual
> time=68.315..365621.761 rows=2522 loops=1)

>   Filter: (((phon)::text = '.....

Well, here's our problem it would seem: the planner is estimating the IN
clause to match 317227 rows, rather than the actual 2522.  That's
naturally going to bias it against choosing an indexscan.  You need to
get that estimate closer before there's going to be much chance of
choosing the right plan.

> What do you mean with larger statistics target?

See ALTER TABLE SET STATISTICS, or just change default_statistics_target
and re-ANALYZE.

            regards, tom lane

Re: Performance Killer 'IN' ?

From
Kai Hessing
Date:
Tom Lane wrote:
> Well, here's our problem it would seem: the planner is estimating the IN
> clause to match 317227 rows, rather than the actual 2522.  That's
> naturally going to bias it against choosing an indexscan.  You need to
> get that estimate closer before there's going to be much chance of
> choosing the right plan.
>
>> What do you mean with larger statistics target?
>
> See ALTER TABLE SET STATISTICS, or just change default_statistics_target
> and re-ANALYZE.

Thanks, that definitly looks like a starting point. I will test it and
post my results. Btw. what happens if the estimation would be to low?

Re: Performance Killer 'IN' ?

From
Kai Hessing
Date:
Kai Hessing wrote:
>>> What do you mean with larger statistics target?
>>
>> See ALTER TABLE SET STATISTICS, or just change default_statistics_target
>> and re-ANALYZE.
>
> Thanks, that definitly looks like a starting point. I will test it and
> post my results. Btw. what happens if the estimation would be to low?

unbelievable... I found the problem: Doing a simple ANALYZE after
resetting the test-db to a previous state is my friend. The results now are:

---------
Index Scan using phon_phon_idx, phon_phon_idx, phon_phon_idx, .....
(cost=0.00..56793.09 rows=4068 width=53) (actual time=0.155..363.810
rows=2534 loops=1)

  Index Cond: (((phon)::text = 'xyz'::text) OR .....

  Filter: (status > -1)

Total runtime: 11140.480 ms
---------

Thank you very much, this was the right hint!