Thread: Check for duplicates before inserting new rows

Check for duplicates before inserting new rows

From

Rich Shepard

Date:

04 September 2020, 13:21:10

This is a new issue for me: I've received data from a different source and
need to add non-duplicates to two tables in the database. Each row in the
new data has a station ID and associated measurements.

The existing database includes a table for station information and another
for measurements made there.

I want to learn which stations and which measurements in the new data set
are not included in the existing tables. And, I don't want to try inserting
the new data and have postgres tell me when it's found duplicates,
especially since there are two tables involved.

My research into how to do this has not found a solution so I ask for
pointers to resources that will teach me how to add these new data to both
tables.

Regards,

Rich

Re: Check for duplicates before inserting new rows

From

George Woodring

Date:

04 September 2020, 13:27:14

I would suggest creating a temp table based on the original table and loading the data into it first. You can then purge the duplicates.

George Woodring

iGLASS Networks
www.iglass.net

On Fri, Sep 4, 2020 at 9:21 AM Rich Shepard <rshepard@appl-ecosys.com> wrote:

This is a new issue for me: I've received data from a different source and
need to add non-duplicates to two tables in the database. Each row in the
new data has a station ID and associated measurements.

The existing database includes a table for station information and another
for measurements made there.

I want to learn which stations and which measurements in the new data set
are not included in the existing tables. And, I don't want to try inserting
the new data and have postgres tell me when it's found duplicates,
especially since there are two tables involved.

My research into how to do this has not found a solution so I ask for
pointers to resources that will teach me how to add these new data to both
tables.

Regards,

Rich

Re: Check for duplicates before inserting new rows

From

Rich Shepard

Date:

04 September 2020, 14:40:00

On Fri, 4 Sep 2020, George Woodring wrote:

> I would suggest creating a temp table based on the original table and
> loading the data into it first. You can then purge the duplicates.

George,

I hadn't thought of this. Using a duplicate table without a PK would work
well if there's only one attribute that needs checking.

In my case there are two tables involved: locations and measurements. So, I
can create a temporary table from the new data which holds only the
locations. I can then isolate the new location rows that do not already
exist in that table, then insert the new data to eliminate the duplicated
measurements.

Thanks,

Rich

Re: Check for duplicates before inserting new rows

From

Rich Shepard

Date:

04 September 2020, 14:46:03

On Fri, 4 Sep 2020, Olivier Gautherot wrote:

> First of all, what version of PostgreSQL are you using?

Olivier,

12.2.

> One way would be to add a UNIQUE constraint and perform for each row of the
> source table a INSERT ... ON CONFLICT DO NOTHING (see
> https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT)
> 
> If it is a 2-way merge, I would encapsulate the process in a function
> (which will create a transaction to protect your process) and add a column
> to trace the rows that have been merged. For this purpose, you can use the
> ON CONFLICT DO UPDATE ...

Lots for me to ponder.

The new data has attributes stored in two database tables: location and
measurements. I'll need to think about how both could be handled in a single
function. For example, the new data contains a row with a location not
already in the existing location table. That location needs to be added to
the location table and its associated measurement attributes then added to
the existing measurements table.

Thanks for the suggestion,

Rich

Re: Check for duplicates before inserting new rows

From

Chris Sterritt

Date:

04 September 2020, 15:51:04

On 04/09/2020 15:46, Rich Shepard wrote:

On Fri, 4 Sep 2020, Olivier Gautherot wrote:

First of all, what version of PostgreSQL are you using?

Olivier,

12.2.

One way would be to add a UNIQUE constraint and perform for each row of the
source table a INSERT ... ON CONFLICT DO NOTHING (see
https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT)

If it is a 2-way merge, I would encapsulate the process in a function
(which will create a transaction to protect your process) and add a column
to trace the rows that have been merged. For this purpose, you can use the
ON CONFLICT DO UPDATE ...

Lots for me to ponder.

The new data has attributes stored in two database tables: location and
measurements. I'll need to think about how both could be handled in a single
function. For example, the new data contains a row with a location not
already in the existing location table. That location needs to be added to
the location table and its associated measurement attributes then added to
the existing measurements table.

Thanks for the suggestion,

Rich

Assuming some simple table structures (I've not included PK or FK definitions for simplicity):

create table station (station_id integer, station_data text);

create table measurement (measurement_id bigserial, station_id integer, ameasurement text);

-- Add some test stations
insert into station(station_id, station_data)
values (1, 'station1'), (2, 'station2'), (3, 'station3') ;


*Query to identify new measurements:
select station_id, ameasurement
from (VALUES (1,'1meas1'), -- this represents your new test data set             (1,'1meas2'),             (2,'2meas1'),             (3,'3meas1')) as m(station_id, ameasurement)
except
select station_id, ameasurement
from measurement;

The query above will give you a list of the new values which are not currently stored in table measurement.
Run it and we get all 4 rows returned.

Add a measurement row:
insert into measurement(station_id, ameasurement)
values (2,'2meas1');

Now if you repeat the check for new measurements with the same query as at *, you only get 3 rows.

Cheers, Chris Sterritt

Re: Check for duplicates before inserting new rows

From

Chris Sterritt

Date:

04 September 2020, 16:13:34

On 04/09/2020 14:21, Rich Shepard wrote:

This is a new issue for me: I've received data from a different source and
need to add non-duplicates to two tables in the database. Each row in the
new data has a station ID and associated measurements.

The existing database includes a table for station information and another
for measurements made there.

I want to learn which stations and which measurements in the new data set
are not included in the existing tables. And, I don't want to try inserting
the new data and have postgres tell me when it's found duplicates,
especially since there are two tables involved.

My research into how to do this has not found a solution so I ask for
pointers to resources that will teach me how to add these new data to both
tables.

Regards,

Rich

To insert data into both tables:

drop table if exists station;
drop table if exists measurement;

create table station (station_id serial, station_data text);

create table measurement (measurement_id bigserial, station_id integer, ameasurement text);

insert into station(station_data)
values ('station1'), ('station2'), ('station3') ;

with src_data as (select station_data, ameasurement
                  from (VALUES ('station1','meas1'),                               ('station2','meas2'),                               ('station3','meas3'),                               ('station4','meas4')) as m(station_data, ameasurement)),     ins_station as (insert into station  (station_data)                     select station_data from src_data
                     except
                     select station_data from station
                     returning station_id, station_data
                    )
insert into measurement (station_id, ameasurement)    select s.station_id, sd.ameasurement
    from src_data sd
    join (SELECT station_id, station_data FROM ins_station UNION SELECT station_id, station_data FROM station) s using (station_data)    except
    select station_id, ameasurement
    from measurement;

select * from station;
select * from measurement;


Regards,
Chris Sterritt

Re: Check for duplicates before inserting new rows

From

Rich Shepard

Date:

04 September 2020, 16:49:38

On Fri, 4 Sep 2020, Chris Sterritt wrote:

> Assuming some simple table structures (I've not included PK or FK definitions 
> for simplicity):

Chris,

Thanks very much.

Stay well,

Rich