Deduplication and transaction isolation level - Mailing list pgsql-general

From François Beausoleil
Subject Deduplication and transaction isolation level
Date
Msg-id 3D9F49CC-AC80-4AAC-8B89-6BB53A3DE9C3@teksol.info
Whole thread Raw
Responses Re: Deduplication and transaction isolation level  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-general
Hi all!

I import many, many rows of data into a table, from three or more computers, 4 times per hour. I have a primary key,
andthe query I use to import the data is supposed to dedup before inserting, but I still get primary key violations. 

The import process is:

* Load CSV data into temp table
* INSERT INTO dest SELECT DISTINCT (pkey) FROM temp WHERE NOT EXISTS(temp.pkey = dest.pkey)

I assumed (erroneously) that this would guarantee no duplicate data could make it into the database. The primary key
violationsare proving me wrong. 

My assumption is that two or more computers are inserting the same pkey, and since the transaction isolation level is
unchangedfrom the default (read committed), the insert on machine A does not see that machine B's already inserted the
rowA's trying to insert. Is that likely? 

Do I have other solutions? Should I run all files serially through PostgreSQL?

Thanks!
François Beausoleil

                   Table "public.show_interaction_bindings_2013_09_23"
               Column                |            Type             |      Modifiers
-------------------------------------+-----------------------------+----------------------
 interaction_id                      | uuid                        | not null
 show_id                             | uuid                        | not null
 service_name                        | character varying(8)        | not null
 interaction_service_id              | text                        | not null
 interaction_created_at              | timestamp without time zone | not null
 -- ...
Indexes:
    "show_interaction_bindings_2013_09_23_pkey" PRIMARY KEY, btree (market_id, show_id, interaction_service_id,
service_name)CLUSTER 
Check constraints:
    "int_created_at_within_partition" CHECK ('2013-09-23 07:00:00'::timestamp without time zone <=
interaction_created_atAND interaction_created_at < '2013-09-30 07:00:00'::timestamp without time zone) 
    "persona_followers_count_greater_than_zero" CHECK (persona_followers_count > 0)
Inherits: show_interaction_bindings

-- Import script
CREATE TEMPORARY TABLE IF NOT EXISTS show_interaction_bindings_2013_09_23_import( LIKE show_interaction_bindings );
COPY show_interaction_bindings_2013_09_23_import(market_id, show_id, interaction_service_id, service_name,
interaction_created_at/* , ... */) FROM stdin WITH (FORMAT csv, HEADER false, DELIMITER ',', QUOTE '\"'); 
-- data, thousands of rows
\.

ANALYZE show_interaction_bindings_2013_09_23_import;

CREATE TABLE IF NOT EXISTS show_interaction_bindings_2013_09_23(
    CONSTRAINT int_created_at_within_partition CHECK(timestamp without time zone '2013-09-23 07:00' <=
interaction_created_atAND interaction_created_at < timestamp without time zone '2013-09-30 07:00' 
  , PRIMARY KEY(market_id, show_id, interaction_service_id, service_name)
) INHERITS ( show_interaction_bindings );


-- The failing query

EXPLAIN ANALYZE
INSERT INTO show_interaction_bindings_2013_09_23(market_id, show_id, interaction_service_id, service_name, ...)
  SELECT DISTINCT ON (market_id, show_id, interaction_service_id, service_name) market_id, show_id,
interaction_service_id,service_name, interaction_created_at -- , ... 
  FROM show_interaction_bindings_2013_09_23_import
  WHERE NOT EXISTS( SELECT *
                    FROM show_interaction_bindings_2013_09_23 dest
                    WHERE dest.show_id = show_interaction_bindings_2013_09_23_import.show_id
                      AND dest.market_id = show_interaction_bindings_2013_09_23_import.market_id
                      AND dest.service_name = show_interaction_bindings_2013_09_23_import.service_name
                      AND dest.interaction_service_id =
show_interaction_bindings_2013_09_23_import.interaction_service_id); 


-- The failure

psql:/var/tmp/csv-assembly-21534/2013-09-24-12.sql:32391: ERROR:  duplicate key value violates unique constraint
"show_interaction_bindings_2013_09_23_pkey"
DETAIL:  Key (market_id, show_id, interaction_service_id, service_name)=(23813bc0-f08d-012f-70c3-4040b2a1b35b,
17104480-d6b6-0130-b1d1-7a163e02a1d3,382389681297833984, Twitter) already exists. 


Attachment

pgsql-general by date:

Previous
From: Jayadevan M
Date:
Subject: logging statements in PostgreSQL
Next
From: Amit Langote
Date:
Subject: Re: logging statements in PostgreSQL