Re: Moving delta data faster - Mailing list pgsql-general

From: Adrian Klaver
Subject: Re: Moving delta data faster
Date:
Msg-id: ceed353b-b586-4b81-85e2-9f7fffcfbe71@aklaver.com
In response to: Re: Moving delta data faster (yudhi s <learnerdatabase99@gmail.com>)
Responses: Re: Moving delta data faster
List: pgsql-general

On 4/6/24 13:04, yudhi s wrote:
> 
> On Sat, Apr 6, 2024 at 10:25 PM Adrian Klaver
> <adrian.klaver@aklaver.com> wrote:
> 
> 
>     Your original problem description was:
> 
>     "Then subsequently these rows will be inserted/updated based on the
>     delta number of rows that got inserted/updated in the source database.
>     In some cases these changed data can flow multiple times per day to the
>     downstream i.e. postgres database and in other cases once daily."
> 
>     If the above is not a hard rule, then yes up to some point just
>     replacing the data in mass would be the simplest/fastest method. You
>     could cut a step out by doing something like TRUNCATE target_tab and
>     then COPY target_tab FROM 'source.csv' bypassing the INSERT INTO
>     source_tab.
> 
> Yes, actually I didn't realize that TRUNCATE is transactional/online
> here in Postgres. In other databases like Oracle it means downtime for
> read queries on the target table, as the data vanishes from the target
> table after the truncate (until the data load happens), and those
> truncates auto-commit. Thanks Veem for sharing that option.
> 
> I also think that truncate will be faster if the changes/delta is
> large, but if it's a handful of rows, like <5% of the rows in the
> table, then upsert/merge will perform better. The downside of the
> truncate option is that it requires exporting all the data from the
> source to the S3 file, which may take longer than bringing just the
> delta records. Correct me if I'm wrong.

Since you still have not specified how the data is stored in S3 and how
you propose to move it into Postgres, I can't really answer.
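
If it does land as a plain CSV, though, a minimal sketch of the
truncate-and-load path would be (target_tab and source.csv are
placeholders from the earlier example, not your actual setup):

  BEGIN;
  TRUNCATE target_tab;
  -- server-side load; from a client, use psql's \copy instead
  COPY target_tab FROM 'source.csv' WITH (FORMAT csv);
  COMMIT;

Running the TRUNCATE and the COPY in one transaction means readers
never see an empty table; queries issued mid-load simply wait for the
COMMIT.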

> 
> However I am still not able to understand why the upsert is less
> performant than merge; could you throw some light on this, please?
> 

I have no idea how this works in the code, but my suspicion is that it
is due to the following:

https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT

"The optional ON CONFLICT clause specifies an alternative action to 
raising a unique violation or exclusion constraint violation error. For 
each individual row proposed for insertion, either the insertion 
proceeds, or, if an arbiter constraint or index specified by 
conflict_target is violated, the alternative conflict_action is taken. 
ON CONFLICT DO NOTHING simply avoids inserting a row as its alternative 
action. ON CONFLICT DO UPDATE updates the existing row that conflicts 
with the row proposed for insertion as its alternative action."
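
As a concrete sketch (hypothetical two-column tables, delta_tab holding
the incoming rows and target_tab keyed on id):

  INSERT INTO target_tab (id, val)
  SELECT id, val
  FROM delta_tab
  ON CONFLICT (id)
  DO UPDATE SET val = EXCLUDED.val;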

vs this:

"First, the MERGE command performs a join from data_source to 
target_table_name producing zero or more candidate change rows. For each 
candidate change row, the status of MATCHED or NOT MATCHED is set just 
once, after which WHEN clauses are evaluated in the order specified. For 
each candidate change row, the first clause to evaluate as true is 
executed. No more than one WHEN clause is executed for any candidate 
change row."

Where ON CONFLICT attempts the INSERT and then, on failure, does the
UPDATE for the ON CONFLICT DO UPDATE case, MERGE on the other hand
evaluates the join condition (ON tbl1.fld = tbl2.fld) and then, based
on MATCHED/NOT MATCHED, takes the action of the first WHEN clause that
matches. In other words, it goes directly to the appropriate action.
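
The MERGE equivalent of the sketch above, using the same hypothetical
tables, would be roughly:

  MERGE INTO target_tab t
  USING delta_tab d
  ON t.id = d.id
  WHEN MATCHED THEN
      UPDATE SET val = d.val
  WHEN NOT MATCHED THEN
      INSERT (id, val) VALUES (d.id, d.val);

(MERGE is available from Postgres 15 on.)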

-- 
Adrian Klaver
adrian.klaver@aklaver.com



