Re: Performance of UPDATE SET = FROM vs UPDATE SET = (SELECT ...) - Mailing list pgsql-general

From pbj@cmicdo.com
Subject Re: Performance of UPDATE SET = FROM vs UPDATE SET = (SELECT ...)
Date
Msg-id 1415361808.73120.YahooMailBasic@web161704.mail.bf1.yahoo.com
Whole thread Raw
List pgsql-general
On Thu, Nov 06, 2014 at 02:55:20PM +0000, Shaun Thomas wrote:
>
> These updates aren't equivalent. It's very important you know this, because you're also inflating your table with a
lotof extra updated rows. 
>
> Take the first UPDATE:
>
> > UPDATE second SET time1 = orig.time1
> > FROM orig
> > WHERE second.key1 = orig.key1;
>
> If you wrote this as a SELECT, it would look like this:
>
> SELECT second.time1, orig.time1
>   FROM second
>   JOIN ORIG ON (second.key1 = orig.key1)
>
> Since second is a many to one subset of orig, you now have several simultaneous updates. Your second UPDATE:
>
> > UPDATE second SET time1 = (SELECT orig.time1 FROM orig,second
> > WHERE orig.key1 = second.key1 LIMIT 1);
>
> Is equivalent to this SELECT:
>
> SELECT second.time1,
>        (SELECT orig.time1 FROM orig,second
>          WHERE orig.key1 = second.key1 LIMIT 1)
>   FROM second;
>
> Meaning you'd only get as many updates as there are rows in second. The difference is your LIMIT 1. However, since
you'renot using an ORDER BY clause, the actual value you get for time1 will be indeterminate. Something like this would
removethe row inflation and fix the random time1 behavior, but I'm not sure it was your intent: 
>
> UPDATE second
>    SET time1 = orig.time1
>   FROM (SELECT DISTINCT ON (key1) key1, time1
>           FROM orig
>          ORDER BY key1, time1 DESC) sub
>  WHERE second.key1 = sub.key1;

I see now that I made more than one mistake.

1) I forgot to INCLUDE INDEXES when creating second.  I would have
   seen dup keys when filling it.

    CREATE TABLE second (LIKE orig INCLUDING INDEXES);

2) I should have used something like this to fill second:

    INSERT INTO second (key1)
        SELECT key1 FROM orig
        ORDER BY random()
        LIMIT 400000;

3) I then incorrectly remembered the query I had written at work.  It
   should have been:

    EXPLAIN ANALYZE
    UPDATE second se SET time1 = (SELECT time1 FROM orig
                WHERE orig.key1 = se.key1);


Once the second table is filled with unique keys, then both UPDATES
should have produced the same results, but the UPDATE FROM is faster
than the UPDATE = SELECT, which is documented.

My original intent was to find out what the performance differences
between the two are.

Thanks for pointing these things out!
PJ


pgsql-general by date:

Previous
From: David G Johnston
Date:
Subject: Re: does pg_restore respect CLUSTER ON in the schema?
Next
From: Dan H
Date:
Subject: reindex table deadlock