Thread: Plan for update ... where a is not distinct from b

Plan for update ... where a is not distinct from b

From

"Peter J. Holzer"

Date:

28 November 2017, 11:50:16

[PostgreSQL 9.5.10 on x86_64-pc-linux-gnu, compiled by gcc (Debian
4.9.2-10) 4.9.2, 64-bit]

I noticed that an update was taking a long time and found this:

UPDATE public.facttable_imf_ifs p
SET [...lots of columns...]
FROM cleansing.cls_imf_ifs_facttable_imf_ifs c, cleansing.cleansing_change_type ct
WHERE   (p.macrobondtimeseries is not distinct from c.macrobondtimeseries) AND (p.date is not distinct from c.date)
ANDc.cleansing_change_type_id = ct.cleansing_change_type_id   AND ct.cleansing_change_type_desc_short IN
('UPDATED_NEW')
;


╔════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                                                       QUERY PLAN
                                ║ 

╟────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╢
║ Update on facttable_imf_ifs p  (cost=1.09..1978353972070.75 rows=7969398 width=1498)
                                ║ 
║   ->  Nested Loop  (cost=1.09..1978353972070.75 rows=7969398 width=1498)
                                ║ 
║         Join Filter: ((NOT ((p.macrobondtimeseries)::text IS DISTINCT FROM (c.macrobondtimeseries)::text)) AND (NOT
(p.dateIS DISTINCT FROM c.date))) ║ 
║         ->  Seq Scan on facttable_imf_ifs p  (cost=0.00..1071317.45 rows=20727045 width=51)
                                ║ 
║         ->  Materialize  (cost=1.09..2028790.72 rows=5454160 width=1472)
                                ║ 
║               ->  Hash Join  (cost=1.09..2001519.92 rows=5454160 width=1472)
                                ║ 
║                     Hash Cond: (c.cleansing_change_type_id = ct.cleansing_change_type_id)
                                ║ 
║                     ->  Seq Scan on cls_imf_ifs_facttable_imf_ifs c  (cost=0.00..1824258.62 rows=32724962 width=1470)
                                ║ 
║                     ->  Hash  (cost=1.07..1.07 rows=1 width=8)
                                ║ 
║                           ->  Seq Scan on cleansing_change_type ct  (cost=0.00..1.07 rows=1 width=8)
                                ║ 
║                                 Filter: ((cleansing_change_type_desc_short)::text = 'UPDATED_NEW'::text)
                                ║ 

╚════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

Am I correct to assume that the result of Materialize is a flat table
(in memory) without any indexes and that the nested loop has to scan
that for each of the 20 million rows of the target table?

That's going to take a long time ...

Replacing the "is not distinct from" with "=" (which is possible in this
case because both columns are not null (and indeed the primary key), gives me this plan:


╔════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                                         QUERY PLAN
    ║ 

╟────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╢
║ Update on facttable_imf_ifs p  (cost=2611816.89..4435860.27 rows=7969425 width=1498)
    ║ 
║   ->  Merge Join  (cost=2611816.89..4435860.27 rows=7969425 width=1498)
    ║ 
║         Merge Cond: (((p.macrobondtimeseries)::text = (c.macrobondtimeseries)::text) AND (p.date = c.date))
    ║ 
║         ->  Index Scan using facttable_imf_ifs_pkey on facttable_imf_ifs p  (cost=0.56..1541107.94 rows=20727117
width=51)║ 
║         ->  Sort  (cost=2611811.12..2625446.52 rows=5454160 width=1472)
    ║ 
║               Sort Key: c.macrobondtimeseries, c.date
    ║ 
║               ->  Hash Join  (cost=1.09..2001519.92 rows=5454160 width=1472)
    ║ 
║                     Hash Cond: (c.cleansing_change_type_id = ct.cleansing_change_type_id)
    ║ 
║                     ->  Seq Scan on cls_imf_ifs_facttable_imf_ifs c  (cost=0.00..1824258.62 rows=32724962 width=1470)
    ║ 
║                     ->  Hash  (cost=1.07..1.07 rows=1 width=8)
    ║ 
║                           ->  Seq Scan on cleansing_change_type ct  (cost=0.00..1.07 rows=1 width=8)
    ║ 
║                                 Filter: ((cleansing_change_type_desc_short)::text = 'UPDATED_NEW'::text)
    ║ 

╚════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝

which obviously looks a lot nicer, but even replacing
(A is not distinct from B)
with the equivalent
(A = B or A is null and B is null)
gives a different plan:


╔═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                                                                                          QUERY PLAN

╟───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╢
║ Update on facttable_imf_ifs p  (cost=5192.25..28394567337.17 rows=7969813 width=1498)
║   ->  Nested Loop  (cost=5192.25..28394567337.17 rows=7969813 width=1498)
║         ->  Hash Join  (cost=1.09..2001519.92 rows=5454160 width=1472)
║               Hash Cond: (c.cleansing_change_type_id = ct.cleansing_change_type_id)
║               ->  Seq Scan on cls_imf_ifs_facttable_imf_ifs c  (cost=0.00..1824258.62 rows=32724962 width=1470)
║               ->  Hash  (cost=1.07..1.07 rows=1 width=8)
║                     ->  Seq Scan on cleansing_change_type ct  (cost=0.00..1.07 rows=1 width=8)
║                           Filter: ((cleansing_change_type_desc_short)::text = 'UPDATED_NEW'::text)
║         ->  Bitmap Heap Scan on facttable_imf_ifs p  (cost=5191.16..5205.65 rows=1 width=51)
║               Recheck Cond: (((macrobondtimeseries)::text = (c.macrobondtimeseries)::text) OR (macrobondtimeseries IS
NULL))
║               Filter: ((((macrobondtimeseries)::text = (c.macrobondtimeseries)::text) OR ((macrobondtimeseries IS
NULL)AND (c.macrobondtimeseries IS NULL))) AND ((date = c.date) OR ((date IS NULL) AND (c.date IS NULL)))) 
║               ->  BitmapOr  (cost=5191.16..5191.16 rows=866 width=0)
║                     ->  Bitmap Index Scan on facttable_imf_ifs_pkey  (cost=0.00..7.06 rows=866 width=0)
║                           Index Cond: ((macrobondtimeseries)::text = (c.macrobondtimeseries)::text)
║                     ->  Bitmap Index Scan on facttable_imf_ifs_pkey  (cost=0.00..2.07 rows=1 width=0)
║                           Index Cond: (macrobondtimeseries IS NULL)

╚═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╝
       hp

--   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Re: Plan for update ... where a is not distinct from b

From

Tom Lane

Date:

28 November 2017, 17:35:33

"Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> I noticed that an update was taking a long time and found this:
> [ crappy plan for join on IS NOT DISTINCT ]

Yeah, there's no optimization smarts at all for IS [NOT] DISTINCT.
It can't be converted into a merge qual, nor a hash qual, nor an
indexscan qual.

In principle this could be improved, but given how much work it'd be
and how seldom anyone complains, it's not likely to happen anytime soon.
        regards, tom lane

Re: Plan for update ... where a is not distinct from b

From

Laurenz Albe

Date:

28 November 2017, 22:48:24

Peter J. Holzer wrote:
> I noticed that an update was taking a long time and found this:
> 
> UPDATE public.facttable_imf_ifs p
> SET [...lots of columns...]
> FROM cleansing.cls_imf_ifs_facttable_imf_ifs c, cleansing.cleansing_change_type ct
> WHERE
>     (p.macrobondtimeseries is not distinct from c.macrobondtimeseries) AND (p.date is not distinct from c.date)
>     AND c.cleansing_change_type_id = ct.cleansing_change_type_id
>     AND ct.cleansing_change_type_desc_short IN ('UPDATED_NEW')

It is kind of ugly, and I didn't test it, but here is an idea:

Suppose we know a value that cannot occur in both p.date and c.date.

Then you could write
  WHERE coalesce(p.date, '0044-03-15 BC') = coalesce(c.date, '0044-03-15 BC')

and create an index on the coalesce expressions to facilitate
a merge join.

Yours,
Laurenz Albe

Re: Plan for update ... where a is not distinct from b

From

"Peter J. Holzer"

Date:

28 November 2017, 23:56:47

On 2017-11-28 09:35:33 -0500, Tom Lane wrote:
> "Peter J. Holzer" <hjp-pgsql@hjp.at> writes:
> > I noticed that an update was taking a long time and found this:
> > [ crappy plan for join on IS NOT DISTINCT ]
>
> Yeah, there's no optimization smarts at all for IS [NOT] DISTINCT.
> It can't be converted into a merge qual, nor a hash qual, nor an
> indexscan qual.

Pity. I expected IS NOT DISTINCT to be treated pretty much like =, given
that it is just a more naive equality test. In particular, since
PostgreSQL stores NULL values in indexes (unlike Oracle) I expected it
to be able to use an index scan.

> In principle this could be improved, but given how much work it'd be
> and how seldom anyone complains, it's not likely to happen anytime soon.

Yeah, IS [NOT] DISTINCT is pretty obscure. I guess not many people use
it.
       hp

--   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>

Re: Plan for update ... where a is not distinct from b

From

"Peter J. Holzer"

Date:

29 November 2017, 00:14:53

On 2017-11-28 20:48:24 +0100, Laurenz Albe wrote:
> Peter J. Holzer wrote:
> > I noticed that an update was taking a long time and found this:
> >
> > UPDATE public.facttable_imf_ifs p
> > SET [...lots of columns...]
> > FROM cleansing.cls_imf_ifs_facttable_imf_ifs c, cleansing.cleansing_change_type ct
> > WHERE
> >     (p.macrobondtimeseries is not distinct from c.macrobondtimeseries) AND (p.date is not distinct from c.date)
> >     AND c.cleansing_change_type_id = ct.cleansing_change_type_id
> >     AND ct.cleansing_change_type_desc_short IN ('UPDATED_NEW')
>
> It is kind of ugly, and I didn't test it, but here is an idea:
>
> Suppose we know a value that cannot occur in both p.date and c.date.
>
> Then you could write
>
>    WHERE coalesce(p.date, '0044-03-15 BC') = coalesce(c.date, '0044-03-15 BC')
>
> and create an index on the coalesce expressions to facilitate
> a merge join.

Yes, that would work. However in most cases these queries are
automatically generated from the tables so determining "impossible"
values is not generally possible. I don't think it is necessary, though.
Replacing (A is not distinct from B) with (A = B or A is null and B is
null) usually produces an acceptable plan (indeed, I had that before, I
replaced it with is not distinct to clean it up) and when a column has a
not null constraint I can simply use (A = B). (I should probably do this
automatically - currently I need to pass a flag to the query generator,
and that's a bit error prone)
       hp

--   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp@hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>