Home > mailing lists

Re: Moving delta data faster - Mailing list pgsql-general

From	yudhi s
Subject	Re: Moving delta data faster
Date	April 9, 2024 16:36:30
Msg-id	CAEzWdqfN_wfpr4enRtJhdQ6RJWBAZJK2Seg4_KpwqbMhbbhi7g@mail.gmail.com Whole thread
In response to	Re: Moving delta data faster (Adrian Klaver <adrian.klaver@aklaver.com>)
List	pgsql-general

Tree view

On Sun, Apr 7, 2024 at 2:25 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

I have no idea how this works in the code, but my suspicion is it is due
to the following:

https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

"The optional ON CONFLICT clause specifies an alternative action to
raising a unique violation or exclusion constraint violation error. For
each individual row proposed for insertion, either the insertion
proceeds, or, if an arbiter constraint or index specified by
conflict_target is violated, the alternative conflict_action is taken.
ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative
action. ON CONFLICT DO UPDATE updates the existing row that conflicts
with the row proposed for insertion as its alternative action."

vs this:

"First, the MERGE command performs a join from data_source to
target_table_name producing zero or more candidate change rows. For each
candidate change row, the status of MATCHED or NOT MATCHED is set just
once, after which WHEN clauses are evaluated in the order specified. For
each candidate change row, the first clause to evaluate as true is
executed. No more than one WHEN clause is executed for any candidate
change row."

Where ON CONFLICT attempts the INSERT then on failure does the UPDATE
for the ON CONFLICT DO UPDATE case. MERGE on the hand evaluates based on
the join condition(ON tbl1.fld =tbl2.fld) and then based on MATCH/NOT
MATCHED takes the appropriate action for the first WHEN match. In other
words it goes directly to the appropriate action.

Thank you Adrian. I think you are spoton on the cause of upsert becoming slower than Merge. Below is the explain plan I captured for both the operations and it looks like even the planning time is small for the UPSERT, as because it operates on the constraint i.e the Update will wait for all the failure records from the INSERT and thus it takes longer. The Merge seems to be evaluated on the Joins i.e it is directly able to get the set of rows which has to be Updated rather waiting for the INSERT to make it fail based on the PK constraint.

************** Explain plan for UPSERT *****************

Insert on public.target_tab (cost=0.00..17353.00 rows=0 width=0) (actual time=19957.569..19957.570 rows=0 loops=1)
Conflict Resolution: UPDATE
Conflict Arbiter Indexes: target_tab_pkey
Tuples Inserted: 500000
Conflicting Tuples: 500000
Buffers: shared hit=8545735 written=10094
-> Seq Scan on public.source_tab (cost=0.00..17353.00 rows=1000000 width=29) (actual time=0.006..208.306 rows=1000000 loops=1)
Output: source_tab.id, source_tab.column1, source_tab.column2
Buffers: shared hit=7353
Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency = '1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4', search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem = '8MB'
Query Identifier: -1356019529835809419
Planning:
Buffers: shared hit=41
Planning Time: 0.199 ms
Execution Time: 19959.261 ms

************** Explain plan for Merge *****************

Merge on public.target_tab t (cost=17368.00..53460.01 rows=0 width=0) (actual time=14209.966..14209.968 rows=0 loops=1)
Tuples: inserted=500000 updated=500000
Buffers: shared hit=5040097 written=10460, temp read=4143 written=4143
I/O Timings: temp read=26.746 write=68.596
-> Hash Left Join (cost=17368.00..53460.01 rows=1000000 width=35) (actual time=179.233..1332.264 rows=1000000 loops=1)
Output: t.ctid, s.column1, s.column2, s.id
Inner Unique: true
Hash Cond: (s.id = t.id)
Buffers: shared hit=11029, temp read=4143 written=4143
I/O Timings: temp read=26.746 write=68.596
-> Seq Scan on public.source_tab s (cost=0.00..17353.00 rows=1000000 width=29) (actual time=0.008..268.506 rows=1000000 loops=1)
Output: s.column1, s.column2, s.id
Buffers: shared hit=7353
-> Hash (cost=8676.00..8676.00 rows=500000 width=10) (actual time=178.101..178.102 rows=500000 loops=1)
Output: t.ctid, t.id
Buckets: 524288 Batches: 2 Memory Usage: 14824kB
Buffers: shared hit=3676, temp written=977
I/O Timings: temp write=5.904
-> Seq Scan on public.target_tab t (cost=0.00..8676.00 rows=500000 width=10) (actual time=0.007..66.441 rows=500000 loops=1)
Output: t.ctid, t.id
Buffers: shared hit=3676
Settings: effective_cache_size = '10475192kB', maintenance_io_concurrency = '1', max_parallel_workers = '32', max_parallel_workers_per_gather = '4', search_path = 'public, public, "$user"', temp_buffers = '16MB', work_mem = '8MB'
Query Identifier: -2297080081674771467
Planning:
Buffers: shared hit=85
Planning Time: 0.466 ms

Execution Time: 14212.061 ms

pgsql-general by date:

From: Ron Johnson
Date: 09 April 2024, 16:33:27
Subject: [MASSMAIL]PL/pgSQL techniques better than bash for dynamic DO?

From: Tom Lane
Date: 09 April 2024, 16:36:50
Subject: Re: prepared statement "cu1" already exists (but it does not)

Re: Moving delta data faster - Mailing list pgsql-general

Previous

Next