Home > mailing lists

Thread: [GENERAL] insert on conflict

[GENERAL] insert on conflict

From

armand pirvu

Date:

28 June 2017, 01:40:06

Got question

birstdb=# \d csischema.dim_company

Table "csischema.dim_company"

Column | Type | Modifiers

-----------------+-----------------------------+-----------

company_id | integer | not null

company_name | character varying(100) |

city | character varying(100) |

state | character varying(100) |

postal_code | character varying(100) |

country | character varying(100) |

latitude | double precision |

longitude | double precision |

update_datetime | timestamp without time zone |

company_source | character varying(1) |

Indexes:

"dim_company_pkey" PRIMARY KEY, btree (company_id)

birstdb=# \d IVEE.dim_company

Table "ivee.dim_company"

Column | Type | Modifiers

-----------------+-----------------------------+-----------

company_id | integer |

company_name | character varying(100) |

city | character varying(100) |

state | character varying(100) |

postal_code | character varying(100) |

country | character varying(100) |

latitude | double precision |

longitude | double precision |

update_datetime | timestamp without time zone |

company_source | character varying(1) |

insert into csischema.dim_company select * from IVEE.dim_company on conflict (company_id) do update

SET

company_name = EXCLUDED.company_name ,

city = EXCLUDED.city ,

state = EXCLUDED.state ,

postal_code = EXCLUDED.postal_code ,

country = EXCLUDED.country ,

latitude = EXCLUDED.latitude ,

longitude = EXCLUDED.longitude ,

update_datetime = EXCLUDED.update_datetime ,

company_source = EXCLUDED.company_source;

QUERY PLAN

---------------------------------------------------------------------------------------

Insert on dim_company (cost=0.00..188.32 rows=1232 width=1126)

Conflict Resolution: UPDATE

Conflict Arbiter Indexes: dim_company_pkey

-> Seq Scan on dim_company dim_company_1 (cost=0.00..188.32 rows=1232 width=1126)

(4 rows)

so how is it working in fact ? Isn't it working like looping in the IVEE.dim_company and for each company_id if the record does have a correspondent in csischema.dim_company then update csischema.dim_company set company_name = EXCLUDED.company_name where company_id=... ? If so isn't it supposed to use the PK for each company_id ? Or is it more like building a whole list from IVEE.dim_company and treat like a join ? Just trying to understand

Thanks

Armand

Re: [GENERAL] insert on conflict

From

Peter Geoghegan

Date:

28 June 2017, 02:30:16

On Tue, Jun 27, 2017 at 12:40 PM, armand pirvu <armand.pirvu@gmail.com> wrote:
> so how is it working in fact ? Isn't it working like looping in the
> IVEE.dim_company  and for each company_id if the record does have a
> correspondent in csischema.dim_company then update csischema.dim_company set
> company_name = EXCLUDED.company_name  where company_id=...  ? If so isn't it
> supposed to use the PK for each company_id ? Or is it more like building a
> whole list from IVEE.dim_company and treat like a join ? Just trying to
> understand

The processing here has to happen a tuple at a time. That's because
the index structure itself is the only authoritative source of truth
about whether or not there is a possible conflict. An MVCC snapshot
isn't good enough, because it sees a consistent view of things, not
the true physical reality of what exists or does not exist in the
index.

What you end up with here is a weird nested loop join, you might say.
The implementation couldn't do it any other way (this could never
behave more like a merge join), because we must eagerly check for
conflicts right as we insert (our insert would be registered by
*other* inserters/upserters as a conflict). If the implementation did
ever do it that way, it would break the important UPSERT guarantees
around concurrency.

MERGE does this in other systems, which is okay for those other
systems because MERGE makes no special promises about concurrency
(e.g., you can get a unique violation in the joined-on column with
MERGE). But, MERGE would be faster for bulk loading, which is what
MERGE is good for.

--
Peter Geoghegan

Re: [GENERAL] insert on conflict

From

armand pirvu

Date:

28 June 2017, 02:43:35

On Jun 27, 2017, at 3:30 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Tue, Jun 27, 2017 at 12:40 PM, armand pirvu <armand.pirvu@gmail.com> wrote:
so how is it working in fact ? Isn't it working like looping in the
IVEE.dim_company and for each company_id if the record does have a
correspondent in csischema.dim_company then update csischema.dim_company set
company_name = EXCLUDED.company_name where company_id=... ? If so isn't it
supposed to use the PK for each company_id ? Or is it more like building a
whole list from IVEE.dim_company and treat like a join ? Just trying to
understand

The processing here has to happen a tuple at a time. That's because
the index structure itself is the only authoritative source of truth
about whether or not there is a possible conflict. An MVCC snapshot
isn't good enough, because it sees a consistent view of things, not
the true physical reality of what exists or does not exist in the
index.

What you end up with here is a weird nested loop join, you might say.
The implementation couldn't do it any other way (this could never
behave more like a merge join), because we must eagerly check for
conflicts right as we insert (our insert would be registered by
*other* inserters/upserters as a conflict). If the implementation did
ever do it that way, it would break the important UPSERT guarantees
around concurrency.

MERGE does this in other systems, which is okay for those other
systems because MERGE makes no special promises about concurrency
(e.g., you can get a unique violation in the joined-on column with
MERGE). But, MERGE would be faster for bulk loading, which is what
MERGE is good for.

--
Peter Geoghegan

Hi Peter

So for example if IVEE.dim_company has 10k rows and csischema.dim_company has 40 rows, what will happen for each row in IVEE.dim_company we check csischema.dim_company and if the check tells row is in it switches to update and this would mean yes I scan IVEE.dim_company, however should an update be needed in csischema.dim_company it will use the csischema.dim_company PK since we pass one value gotten from IVEE.dim_company

The question I guess is what happens IF I IVEE.dim_company accounts for far more than 5% of csischema.dim_company ? Will that translate into a scan on csischema.dim_company ?

What I am looking at now looks like a potential racing contention which so I am wondering if there are better ways to do it

Thanks

Armand