Thread: Is there something wrong with my test case?

From
Thiemo Kellner
Date:
Hi all and merry Christmas

I was under the impression that updating a table with values from one or  
more other tables would be implemented most efficiently with a correlated  
subquery along the lines of the following pattern; let's name it A.

update TO_BE_UPDATED U
    set ({column list}) = ({correlated subquery 1})
  where exists ({correlated subquery 1})
;

I set up a test case to figure out whether this is the case. I compared  
the execution plan of a query with the above pattern against the  
execution plans of the following patterns.

B
update TO_BE_UPDATED U
    set COL_1 = ({correlated subquery 1}),
        COL_2 = ({correlated subquery 2})
  where exists ({correlated subquery 3})
;

C
update TO_BE_UPDATED U
    set COL_1 = ({correlated subquery 1}),
        COL_2 = ({correlated subquery 2})
  where U.KEY_U in ({correlated subquery 3})
;
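Just to make the three patterns concrete: the following sketch runs all of them against hypothetical tables (`to_be_updated` and `src` with made-up keys and values, not the poster's actual schema) and checks that they leave the table in the same state. It uses SQLite through Python's sqlite3 module as a stand-in for PostgreSQL, which requires SQLite >= 3.15 for the row-value SET in pattern A.

```python
import sqlite3

def fresh_db():
    """Hypothetical tables standing in for TO_BE_UPDATED and its source."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE to_be_updated (key_u INTEGER PRIMARY KEY, col_1 TEXT, col_2 TEXT);
        CREATE TABLE src           (key_u INTEGER PRIMARY KEY, col_1 TEXT, col_2 TEXT);
        INSERT INTO to_be_updated VALUES (1,'old','old'),(2,'old','old'),(3,'old','old');
        INSERT INTO src           VALUES (1,'a','b'),(3,'c','d');
    """)
    return con

# A: one correlated subquery feeding both columns (row-value SET)
sql_a = """
UPDATE to_be_updated AS u
   SET (col_1, col_2) = (SELECT s.col_1, s.col_2 FROM src s WHERE s.key_u = u.key_u)
 WHERE EXISTS (SELECT 1 FROM src s WHERE s.key_u = u.key_u)"""

# B: one correlated subquery per column
sql_b = """
UPDATE to_be_updated AS u
   SET col_1 = (SELECT s.col_1 FROM src s WHERE s.key_u = u.key_u),
       col_2 = (SELECT s.col_2 FROM src s WHERE s.key_u = u.key_u)
 WHERE EXISTS (SELECT 1 FROM src s WHERE s.key_u = u.key_u)"""

# C: key test via IN instead of EXISTS
sql_c = """
UPDATE to_be_updated AS u
   SET col_1 = (SELECT s.col_1 FROM src s WHERE s.key_u = u.key_u),
       col_2 = (SELECT s.col_2 FROM src s WHERE s.key_u = u.key_u)
 WHERE u.key_u IN (SELECT key_u FROM src)"""

results = []
for sql in (sql_a, sql_b, sql_c):
    con = fresh_db()
    con.execute(sql)
    results.append(con.execute(
        "SELECT * FROM to_be_updated ORDER BY key_u").fetchall())
    con.close()

# All three patterns leave the table in the same state
print(results[0] == results[1] == results[2])
print(results[0])
```

Only rows 1 and 3 have a match in `src`, so row 2 keeps its old values under every pattern; the three result lists compare equal.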

Explain analyze verbose showed for:
A (cost=264.72..626.97 rows=31 width=90) (actual time=1.117..1.117  
rows=0 loops=1)
B (cost=264.72..974.25 rows=31 width=58) (actual time=1.508..1.508  
rows=0 loops=1)
C (cost=0.42..611.19 rows=31 width=52) (actual time=2.217..2.217  
rows=0 loops=1)

I am very surprised that the cost of A is (much) higher than that of C,  
which I suspected to be the most inefficient. I was so fixated on the  
costs that I initially ignored the actual times, where my assumptions  
about efficiency are reflected. Funny, though: my subjective impression  
while waiting for the update queries to complete was that C was fastest  
by far, followed by B, with update A finishing last.

Now I wonder whether I do not know how to read the explain plan summary,  
whether I should not trust my subjective impression of the time  
measurement, or whether my test case is faulty.

I carried out the test on an idle Windows 10 laptop with portable  
PostgreSQL 10.4 provided by PortableApps.
You can find test case script and log at https://pastebin.com/W2HsTBwi

I would appreciate your two cents.

Kind regards

Thiemo
-- 
Signal: +49 1578 7723737
Handys: +41 78 947 36 21 | +49 1578 772 37 37
Tox-Id:  
B7282A9E31D944DDAD1ECC5B33C36AAE80B32D119AB7391D602C937A65E8CA0625C495C5322B



Re: Is there something wrong with my test case?

From
"Peter J. Holzer"
Date:
On 2018-12-25 11:54:11 +0000, Thiemo Kellner wrote:
[three different but functionally equivalent queries]

> Explain analyze verbose showed for:
> A (cost=264.72..626.97 rows=31 width=90) (actual time=1.117..1.117 rows=0
> loops=1)
> B (cost=264.72..974.25 rows=31 width=58) (actual time=1.508..1.508 rows=0
> loops=1)
> C (cost=0.42..611.19 rows=31 width=52) (actual time=2.217..2.217 rows=0
> loops=1)
>
> I am very surprised that the cost of A is (much) higher than that of C which

626.97 doesn't seem "much higher" to me than 611.19. I would call that
"about the same".

> I suspected to be the most inefficient. I was that much fixed on the costs
> that I initially ignored the actual time where my assumptions on efficiency
> are reflected. Funny though is that the subjective impression when waiting
> for the update queries to complete was that C was fastest by far, followed
> by B and only at the end was update A.

This is weird. C takes almost exactly twice as long as A, and while
humans aren't very good at estimating times, one second should feel
faster than two, not slower, and certainly not slower by far. Is it
possible that your subjective impression wasn't based on the executions
you posted but on others? Caching and load spikes can cause quite large
variations in run time, so running the same query again may not take the
same time (usually the second time is faster - sometimes much faster).
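The point about run-to-run variation can be made concrete: time the same query several times and compare the spread rather than trusting a single run. The sketch below uses a throwaway SQLite table (an assumption for the sake of a self-contained example, not the poster's actual test script).

```python
import sqlite3
import statistics
import time

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", ((i,) for i in range(100_000)))

# Time the same query several times instead of trusting a single run
timings = []
for _ in range(7):
    t0 = time.perf_counter()
    con.execute("SELECT count(*) FROM t WHERE x % 7 = 0").fetchone()
    timings.append(time.perf_counter() - t0)

# The spread between fastest and slowest run shows why one stopwatch
# impression is not a reliable basis for comparing queries
print(f"min={min(timings):.6f}s "
      f"median={statistics.median(timings):.6f}s "
      f"max={max(timings):.6f}s")
```

Comparing the median (or minimum) of several runs filters out one-off effects such as cold caches or a background load spike.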

        hp

--
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>


Re: Is there something wrong with my test case?

From
David Rowley
Date:
On Wed, 26 Dec 2018 at 00:54, Thiemo Kellner
<thiemo@gelassene-pferde.biz> wrote:
> Explain analyze verbose showed for:
> A (cost=264.72..626.97 rows=31 width=90) (actual time=1.117..1.117
> rows=0 loops=1)
> B (cost=264.72..974.25 rows=31 width=58) (actual time=1.508..1.508
> rows=0 loops=1)
> C (cost=0.42..611.19 rows=31 width=52) (actual time=2.217..2.217
> rows=0 loops=1)
>
> I am very surprised that the cost of A is (much) higher than that of C
> which I suspected to be the most inefficient. I was that much fixed on
> the costs that I initially ignored the actual time where my
> assumptions on efficiency are reflected. Funny though is that the
> subjective impression when waiting for the update queries to complete
> was that C was fastest by far, followed by B and only at the end was
> update A.

While the times mentioned in "actual time" are for execution only and
don't account for the time taken to plan the query, the results you
put in [1] disagree entirely with your claim that 'C' was faster. 'A'
comes out fastest with the explain analyzes you've listed.

A:
Planning TIME: 0.423 ms
Execution TIME: 1.170 ms

C:
Planning TIME: 0.631 ms
Execution TIME: 2.281 ms

Have you confused each of the results, perhaps because they're in a
different order from your cases above?
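Putting the two components together makes the comparison explicit: the latency a client sees is roughly planning time plus execution time. A trivial sketch using the figures quoted above:

```python
# Planning and execution times (ms) as quoted above for queries A and C
timings = {"A": (0.423, 1.170), "C": (0.631, 2.281)}

for label, (planning, execution) in sorted(timings.items()):
    total = planning + execution
    print(f"{label}: {planning} + {execution} = {total:.3f} ms")
# A stays well ahead of C even with planning time included
```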

I'd certainly expect 'A' to be the fastest of the bunch since it's
both less effort for the planner and also the executor.  I didn't look
at why the cost is estimated to be slightly higher, but the planner
wouldn't consider rewriting the queries to one of the other cases
anyway, so it's likely not that critical that the costings are
slightly out from reality.

> where U.KEY_U in ({correlated subquery 3})

This is not correlated in [1].

[1]  https://pastebin.com/W2HsTBwi

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Is there something wrong with my test case?

From
Thiemo Kellner
Date:
Hi HP

Thanks for your reply.

Quoting "Peter J. Holzer" <hjp-pgsql@hjp.at>:

> On 2018-12-25 11:54:11 +0000, Thiemo Kellner wrote:
> [three different but functionally equivalent queries]
>
>> Explain analyze verbose showed for:
>> A (cost=264.72..626.97 rows=31 width=90) (actual time=1.117..1.117 rows=0
>> loops=1)
>> C (cost=0.42..611.19 rows=31 width=52) (actual time=2.217..2.217 rows=0
>> loops=1)
>
> 626.97 doesn't seem "much higher" to me than 611.19. I would call that
> "about the same".
>

So would I, but the cost is given as a range. Taking the averages, it is  
somewhere around 450 compared to somewhere around 300. I do not know  
whether averaging is appropriate here.

> This is weird. C takes almost exactly twice as long as A, and while
> humans aren't very good at estimating times, One second should feel
> faster than two, not slower, and certainly not slower by far. Is it
> possible that your subjective impression wasn't based on the executions
> you posted but on others? Caching and load spikes can cause quite large
> variations in run time, so running the same query again may not take the
> same time (usually the second time is faster - sometimes much faster).

I am pretty sure I did not confuse which felt time measure belongs to  
which query. I did run the script several times, but as the script  
creates everything anew, that might not have any impact caching-wise.  
However, if caching had an impact, it would only add to the discrepancy:  
the feeling was that the first statement was much faster than the  
supposedly optimal statement. Be that as it may, there is still the fact  
that the cost for A was estimated as about the same as C, or much  
higher, yet A executed in half the time of C.




Re: Is there something wrong with my test case?

From
Thiemo Kellner
Date:
Hi David

Thanks for your revision.

Quoting David Rowley <david.rowley@2ndquadrant.com>:

> On Wed, 26 Dec 2018 at 00:54, Thiemo Kellner
> <thiemo@gelassene-pferde.biz> wrote:
>> Explain analyze verbose showed for:
>> A (cost=264.72..626.97 rows=31 width=90) (actual time=1.117..1.117
>> rows=0 loops=1)
>> B (cost=264.72..974.25 rows=31 width=58) (actual time=1.508..1.508
>> rows=0 loops=1)
>> C (cost=0.42..611.19 rows=31 width=52) (actual time=2.217..2.217
>> rows=0 loops=1)
>>
>> I am very surprised that the cost of A is (much) higher than that of C
>> which I suspected to be the most inefficient. I was that much fixed on
>> the costs that I initially ignored the actual time where my
>> assumptions on efficiency are reflected. Funny though is that the
>> subjective impression when waiting for the update queries to complete
>> was that C was fastest by far, followed by B and only at the end was
>> update A.
>
> While the times mentioned in "actual time" are for execution only and
> don't account for the time taken to plan the query, the results you
> put in [1] disagree entirely with your claim that 'C' was faster. 'A'
> comes out fastest with the explain analyzes you've listed.
>
> A:
> Planning TIME: 0.423 ms
> Execution TIME: 1.170 ms
>
> C:
> Planning TIME: 0.631 ms
> Execution TIME: 2.281 ms
>
> Have you confused each of the results, perhaps because they're in a
> different order as to your cases above?

I am pretty sure I did not confuse them. I am not worried about planning  
times, as I assume PostgreSQL has a time limit restricting the time  
spent finding the best execution path, on the order of seconds, so that  
for a heavy-load query it becomes negligible.

> I'd certainly expect 'A' to be the fastest of the bunch since it's
> both less effort for the planner and also the executor.  I didn't look
> at why the cost is estimated to be slightly higher, but the planner
> wouldn't consider rewriting the queries to one of the other cases
> anyway, so it's likely not that critical that the costings are
> slightly out from reality.

I am glad that my feeling about which query should be best matches  
reality. However, I am left a bit concerned about the usefulness of the  
costs in the execution plan. The costs rather contradict the actual  
execution times in my test case, which to me would render them useless  
for comparing queries.

>> where U.KEY_U in ({correlated subquery 3})
>
> This is not correlated in [1].
>
> [1]  https://pastebin.com/W2HsTBwi

Right you are, my fault. Thanks for your attention there as well. :-)




Re: Is there something wrong with my test case?

From
Fabio Pardi
Date:
Hi Thiemo,


On 07/01/2019 11:30, Thiemo Kellner wrote:

> Hi HP
>
> Thanks for your reply.
>
> Quoting "Peter J. Holzer" <hjp-pgsql@hjp.at>:
>
>> On 2018-12-25 11:54:11 +0000, Thiemo Kellner wrote:
>> [three different but functionally equivalent queries]
>>
>>> Explain analyze verbose showed for:
>>> A (cost=264.72..626.97 rows=31 width=90) (actual time=1.117..1.117 rows=0
>>> loops=1)
>>> C (cost=0.42..611.19 rows=31 width=52) (actual time=2.217..2.217 rows=0
>>> loops=1)
>>
>> 626.97 doesn't seem "much higher" to me than 611.19. I would call that
>> "about the same".
>
> So would I but the cost is given as a range. Taking the the average somewhat 400 compare to somewhat 300. I do not know whether averaging is appropriate here.


The cost is not a range. The 2 numbers you see are:

  • Estimated start-up cost. This is the time expended before the output phase can begin, e.g., time to do the sorting in a sort node.

  • Estimated total cost. This is stated on the assumption that the plan node is run to completion, i.e., all available rows are retrieved. In practice a node's parent node might stop short of reading all available rows (see the LIMIT example below).

As you can read here:

https://www.postgresql.org/docs/current/using-explain.html
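To make the distinction concrete, here is a small helper (a hypothetical illustration, not part of any PostgreSQL tooling) that pulls the two figures out of an EXPLAIN line and labels them as what they are: two separate estimates, not the endpoints of a range to average.

```python
import re

def parse_plan_line(line):
    """Pull the two cost figures out of an EXPLAIN plan line.

    They are the startup cost (expended before the first row can be
    emitted) and the total cost (to run the node to completion) --
    two distinct estimates, not the endpoints of a range.
    """
    m = re.search(r"cost=(\d+\.\d+)\.\.(\d+\.\d+)", line)
    if m is None:
        raise ValueError("no cost figures found")
    return {"startup_cost": float(m.group(1)),
            "total_cost": float(m.group(2))}

line_a = "Update on to_be_updated  (cost=264.72..626.97 rows=31 width=90)"
print(parse_plan_line(line_a))
# {'startup_cost': 264.72, 'total_cost': 626.97}
```

For comparing whole queries, the total cost (the second number) is the figure to look at.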


regards,

fabio pardi


Re: Is there something wrong with my test case?

From
Thiemo Kellner
Date:
Hi Fabio

Quoting Fabio Pardi <f.pardi@portavita.eu>:

> The cost is not a range. The 2 numbers you see are:
>
>   • Estimated start-up cost. This is the time expended before the  
> output phase can begin, e.g., time to do the sorting in a sort node.
>
>   • Estimated total cost. This is stated on the assumption that  
> the plan node is run to completion, i.e., all available rows are  
> retrieved. In practice a node's parent node might stop short of  
> reading all available rows (see the LIMIT example below).
>
> As you can read here:
>
> https://www.postgresql.org/docs/current/using-explain.html

Thanks for pointing this out. I was misled, then, by the form in which  
those two different things are displayed.