Thread: UPDATE with JOIN not using index

UPDATE with JOIN not using index

From
Arnaud Lesauvage
Date:
Hi all !

PostgreSQL 8.4 here.
I have a simple update query that looks like this :

UPDATE t1
SET col = t2.col
FROM t2
WHERE t1.key1 = t2.key1 AND t1.key2 = t2.key2;

There is an index on (key1,key2) on the joined table (t2).
This query does not use the index.


If I rewrite it with a subselect, the index is used :
UPDATE t1
SET col = (SELECT t2.col FROM t2
    WHERE t1.key1 = t2.key1
    AND t1.key2 = t2.key2);

I know both queries are not exactly equivalent (I guess the second one
would throw an error if two rows were returned for the same key1,key2
pair, right?), but even though they are not equivalent, why does the
first one not use the index ?

I could provide the output from explain analyze, but the query takes
some time to run so I'll only do it if requested. Now, what is the
theorical differences between the two queries that prevent the use of
the index in the first case ?

Thanks a lot for your clarifications on this matter !

Regards
Arnaud

Re: UPDATE with JOIN not using index

From
Richard Huxton
Date:
On 16/03/10 13:05, Arnaud Lesauvage wrote:
> Hi all !
>
> PostgreSQL 8.4 here.
> I have a simple update query that looks like this :
>
> UPDATE t1
> SET col = t2.col
> FROM t2
> WHERE t1.key1 = t2.key1 AND t1.key2 = t2.key2;
>
> There is an index on (key1,key2) on the joined table (t2).
> This query does not use the index.

What does it do, then? The output of EXPLAIN would be a start if EXPLAIN
ANALYSE is too expensive.

Oh - and how many rows will this actually update?

--
   Richard Huxton
   Archonet Ltd

Re: UPDATE with JOIN not using index

From
Arnaud Lesauvage
Date:
Le 16/03/2010 14:50, Richard Huxton a écrit :
> On 16/03/10 13:05, Arnaud Lesauvage wrote:
>>  PostgreSQL 8.4 here.
>>  I have a simple update query that looks like this :
>>
>>  UPDATE t1
>>  SET col = t2.col
>>  FROM t2
>>  WHERE t1.key1 = t2.key1 AND t1.key2 = t2.key2;
>>
>>  There is an index on (key1,key2) on the joined table (t2).
>>  This query does not use the index.
>
> What does it do, then? The output of EXPLAIN would be a start if EXPLAIN
> ANALYSE is too expensive.
 > Oh - and how many rows will this actually update?

I launched the EXPLAIN ANALYZE as soon as I posted the message, but it
is not over yet. You are right that I should have posted the explain.
Approximatively 500.000 rows will be updated.


First query :

"Merge Join  (cost=699826.38..704333.80 rows=13548 width=836)"
"  Merge Cond: (((c.rue)::text = (r.rue)::text) AND ((c.codesite)::text
= (r.codesite)::text))"
"  ->  Sort  (cost=696320.21..697701.07 rows=552343 width=823)"
"        Sort Key: c.rue, c.codesite"
"        ->  Seq Scan on cellules c  (cost=0.00..443520.43 rows=552343
width=823)"
"  ->  Sort  (cost=3504.88..3596.96 rows=36833 width=43)"
"        Sort Key: r.rue, r.codesite"
"        ->  Seq Scan on rues r  (cost=0.00..711.33 rows=36833 width=43)"


Second query :

"Seq Scan on cellules c  (cost=0.00..5018080.39 rows=552343 width=823)"
"  SubPlan 1"
"    ->  Index Scan using idx_rues_ruecodesite on rues r
(cost=0.00..8.28 rows=1 width=13)"
"          Index Cond: (((rue)::text = ($1)::text) AND ((codesite)::text
= ($0)::text))"








Re: UPDATE with JOIN not using index

From
Richard Huxton
Date:
On 16/03/10 13:57, Arnaud Lesauvage wrote:
> First query :
>
> "Merge Join (cost=699826.38..704333.80 rows=13548 width=836)"
> " Merge Cond: (((c.rue)::text = (r.rue)::text) AND ((c.codesite)::text =
> (r.codesite)::text))"
> " -> Sort (cost=696320.21..697701.07 rows=552343 width=823)"
> " Sort Key: c.rue, c.codesite"
> " -> Seq Scan on cellules c (cost=0.00..443520.43 rows=552343 width=823)"
> " -> Sort (cost=3504.88..3596.96 rows=36833 width=43)"
> " Sort Key: r.rue, r.codesite"
> " -> Seq Scan on rues r (cost=0.00..711.33 rows=36833 width=43)"
>
>
> Second query :
>
> "Seq Scan on cellules c (cost=0.00..5018080.39 rows=552343 width=823)"
> " SubPlan 1"
> " -> Index Scan using idx_rues_ruecodesite on rues r (cost=0.00..8.28
> rows=1 width=13)"
> " Index Cond: (((rue)::text = ($1)::text) AND ((codesite)::text =
> ($0)::text))"

OK - we have a merge join in the first case where it joins the
pre-sorted output of both tables.

In the second case it queries the index once for each row in "cellules".

Now look at the costs. The first one is around 704,000 and the second
one is 5,000,000 - about 6 times as much. That's why it's not using the
index, because it thinks it will be more expensive.

If it's not really more expensive that suggests your configuration
values aren't very close to reality.

The first query should run faster if it has more work_mem available too.
At the moment, it's probably going back and fore doing an on-disk sort.

--
   Richard Huxton
   Archonet Ltd

Re: UPDATE with JOIN not using index

From
Tom Lane
Date:
Arnaud Lesauvage <arnaud.listes@codata.eu> writes:
> First query :

> "Merge Join  (cost=699826.38..704333.80 rows=13548 width=836)"
> "  Merge Cond: (((c.rue)::text = (r.rue)::text) AND ((c.codesite)::text
> = (r.codesite)::text))"
> "  ->  Sort  (cost=696320.21..697701.07 rows=552343 width=823)"
> "        Sort Key: c.rue, c.codesite"
> "        ->  Seq Scan on cellules c  (cost=0.00..443520.43 rows=552343
> width=823)"
> "  ->  Sort  (cost=3504.88..3596.96 rows=36833 width=43)"
> "        Sort Key: r.rue, r.codesite"
> "        ->  Seq Scan on rues r  (cost=0.00..711.33 rows=36833 width=43)"

> Second query :

> "Seq Scan on cellules c  (cost=0.00..5018080.39 rows=552343 width=823)"
> "  SubPlan 1"
> "    ->  Index Scan using idx_rues_ruecodesite on rues r
> (cost=0.00..8.28 rows=1 width=13)"
> "          Index Cond: (((rue)::text = ($1)::text) AND ((codesite)::text
> = ($0)::text))"

Please notice that the planner thinks the second plan is much more
expensive than the first.  I think you will find that it's right.
Using an index is not always the best way to do a query.

            regards, tom lane

Re: UPDATE with JOIN not using index

From
Arnaud Lesauvage
Date:
Le 16/03/2010 15:25, Richard Huxton a écrit :
> OK - we have a merge join in the first case where it joins the
> pre-sorted output of both tables.
>
> In the second case it queries the index once for each row in "cellules".
>
> Now look at the costs. The first one is around 704,000 and the second
> one is 5,000,000 - about 6 times as much. That's why it's not using the
> index, because it thinks it will be more expensive.
>
> If it's not really more expensive that suggests your configuration
> values aren't very close to reality.
>
> The first query should run faster if it has more work_mem available too.
> At the moment, it's probably going back and fore doing an on-disk sort.

Indeed !
I admit that I had not tested the second query, I just thought that the
first one took way too long to execute.

I will try increasing work_mem, but it is already set at 16MB which I
found is quite high.

Thanks a lot for clarifying that !

Regards
Arnaud

Re: UPDATE with JOIN not using index

From
tv@fuzzy.cz
Date:

> Le 16/03/2010 15:25, Richard Huxton a écrit :
>> OK - we have a merge join in the first case where it joins the
>> pre-sorted output of both tables.
>>
>> In the second case it queries the index once for each row in "cellules".
>>
>> Now look at the costs. The first one is around 704,000 and the second
>> one is 5,000,000 - about 6 times as much. That's why it's not using the
>> index, because it thinks it will be more expensive.
>>
>> If it's not really more expensive that suggests your configuration
>> values aren't very close to reality.
>>
>> The first query should run faster if it has more work_mem available too.
>> At the moment, it's probably going back and fore doing an on-disk sort.
>
> Indeed !
> I admit that I had not tested the second query, I just thought that the
> first one took way too long to execute.
>
> I will try increasing work_mem, but it is already set at 16MB which I
> found is quite high.

What do you mean by "high"? I believe the proper value of work_mem is such
that results in highest performance of the query while not causing
problems to the other sessions.

Don't forget you can set this for each session separately (SET work_mem =
something) so you can use some conservative value in the postgresql.conf
and use a different value (e.g. 64MB) for batch processing. I guess you
don't perform such UPDATE in usual sessions, right?

regards
Tomas



Re: UPDATE with JOIN not using index

From
Arnaud Lesauvage
Date:
Le 16/03/2010 15:37, Tom Lane a écrit :
> Arnaud Lesauvage<arnaud.listes@codata.eu>  writes:
>>  First query :
>>  "Merge Join  (cost=699826.38..704333.80 rows=13548 width=836)"
>>  "  Merge Cond: (((c.rue)::text = (r.rue)::text) AND ((c.codesite)::text
>>  = (r.codesite)::text))"
>
>>  Second query :
>>  "Seq Scan on cellules c  (cost=0.00..5018080.39 rows=552343 width=823)"
>
> Please notice that the planner thinks the second plan is much more
> expensive than the first.  I think you will find that it's right.
> Using an index is not always the best way to do a query.

Yes indeed, I should have read the output of explain more thoroughly I
guess !
I am still a bit intrigued by the different query plans. Both query look
very similar to me, so why does the planner make so different choices ?

Regards,
Arnaud

Re: UPDATE with JOIN not using index

From
Arnaud Lesauvage
Date:
Le 16/03/2010 16:02, tv@fuzzy.cz a écrit :
>>  I will try increasing work_mem, but it is already set at 16MB which I
>>  found is quite high.
>
> What do you mean by "high"? I believe the proper value of work_mem is such
> that results in highest performance of the query while not causing
> problems to the other sessions.

Yes, that's true, see my answer below.

> Don't forget you can set this for each session separately (SET work_mem =
> something) so you can use some conservative value in the postgresql.conf
> and use a different value (e.g. 64MB) for batch processing. I guess you
> don't perform such UPDATE in usual sessions, right?

Indeed, I did NOT know that I could set work_mem for each session !
That makes a big difference actually !

I'll try this ASAP !
Thanks a lot for the hint !

Regards,
Arnaud

Re: UPDATE with JOIN not using index

From
Tom Lane
Date:
Arnaud Lesauvage <arnaud.listes@codata.eu> writes:
> I am still a bit intrigued by the different query plans. Both query look
> very similar to me, so why does the planner make so different choices ?

Sometime in the future they might generate the same plan.  Right now the
planner's ability to optimize sub-SELECTs is pretty limited, and so you
typically get a nestloop-like plan even if some other join style would
be faster.  (It's not all the planner's fault either --- we'd need to
tweak the executor so that it could throw the appropriate error for more
than one join partner row, etc.)

            regards, tom lane

Re: UPDATE with JOIN not using index

From
Arnaud Lesauvage
Date:
Le 16/03/2010 16:52, Tom Lane a écrit :
> Sometime in the future they might generate the same plan.  Right now the
> planner's ability to optimize sub-SELECTs is pretty limited, and so you
> typically get a nestloop-like plan even if some other join style would
> be faster.

OK, thanks for the clarification.
Should joins always be considered better than subselects then ?

I like the readability of subselects and I think I use them more often
than joins when I have the choice, so maybe I should check and change
all my queries ?