Re: Long Running Update - My Solution - Mailing list pgsql-performance

From Harry Mantheakis
Subject Re: Long Running Update - My Solution
Date
Msg-id 4E09A389.80604@riskcontrollimited.com
Whole thread Raw
In response to Re: Long Running Update - My Solution  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List pgsql-performance
Hello Kevin

 > If you use EXPLAIN with both statements...

Yes, the plans are indeed very different.

Here is the statement, set to update up to 100,000 records, which took
about 5 seconds to complete:


UPDATE
   table_A
SET
   field_1 = table_B.field_1
, field_2 = table_B.field_2
FROM
   table_B
WHERE
   table_B.tb_id >= 0
AND
   table_B.tb_id <= 100000
AND
   table_B.tb_id = table_A.ta_id
;


The query plan for the above is:


Nested Loop  (cost=0.00..2127044.47 rows=73620 width=63)
   ->  Index Scan using table_B_pkey on table_B  (cost=0.00..151830.75
rows=73620 width=20)
         Index Cond: ((tb_id >= 0) AND (tb_id <= 100000))
   ->  Index Scan using table_A_pkey on table_A  (cost=0.00..26.82
rows=1 width=47)
         Index Cond: (table_A.ta_id = table_B.tb_id)


Now, if I change the first AND clause to update 1M records, as follows:


table_B.id <= 1000000


I get the following - quite different - query plan:


Hash Join  (cost=537057.49..8041177.88 rows=852150 width=63)
   Hash Cond: (table_A.ta_id = table_B.tb_id)
   ->  Seq Scan on table_A  (cost=0.00..3294347.71 rows=145561171 width=47)
   ->  Hash  (cost=521411.62..521411.62 rows=852150 width=20)
         ->  Bitmap Heap Scan on table_B  (cost=22454.78..521411.62
rows=852150 width=20)
               Recheck Cond: ((tb_id >= 0) AND (tb_id <= 1000000))
               ->  Bitmap Index Scan on table_B_pkey
(cost=0.00..22241.74 rows=852150 width=0)
                     Index Cond: ((tb_id >= 0) AND (tb_id <= 1000000))


Note: When I tried updating 1M records, the command was still running
after 25 minutes before I killed it.

The sequential scan in the later plan looks expensive, and (I think)
supports what others have since mentioned, namely that when the
optimizer moves to using sequential scans (working off the disk) things
get a lot slower.

For me, the penny has finally dropped on why I should use EXPLAIN for
bulk operations.

Thanks too, to Greg Smith, Robert Klemme and Thomas for all the feedback.

Kind regards

Harry Mantheakis
London, UK




pgsql-performance by date:

Previous
From: Greg Smith
Date:
Subject: Re: Long Running Update - My Solution
Next
From: Craig McIlwee
Date:
Subject: Slow performance when querying millions of rows