Re: Import large data set into a table and resolve duplicates? - Mailing list pgsql-general

From John McKown
Subject Re: Import large data set into a table and resolve duplicates?
Date
Msg-id CAAJSdjgksStGtVY5YjaHh=TLhC3=MaSpox+Ptxaewi3hb5eE+g@mail.gmail.com
In response to Re: Import large data set into a table and resolve duplicates?  (John McKown <john.archie.mckown@gmail.com>)
List pgsql-general
OOPS, I forgot to mention that in the SELECT generating the output file, the e'\t' generates a "tab" character. You likely already know this, but I like to make my posts as self contained as possible.
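
For anyone who has not seen the syntax: e'...' is PostgreSQL's escape string form, in which backslash sequences such as \t and \n are interpreted rather than taken literally. A one-line illustration:

SELECT e'id\tseries';  -- the two words come back separated by a real tab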

On Sun, Feb 15, 2015 at 10:00 AM, John McKown <john.archie.mckown@gmail.com> wrote:
On Sat, Feb 14, 2015 at 3:54 PM, Eugene Dzhurinsky <jdevelop@gmail.com> wrote:
Hello!

I have a huge dictionary table with series data generated by a third-party
service. The table consists of two columns:

- id : serial, primary key
- series : varchar, not null, indexed

From time to time I need to apply a "patch" to the dictionary; the patch file
consists of "series" values, one per line.

Now I need to import the patch into the database and produce another file where,
for each row in the source file:
- if the "series" value already exists in the database, return ID:series
- otherwise, insert a new row into the table, generate a new ID, and return ID:series

So the new file will contain both ID and series data, separated by tab or
something.

While reading and writing the data is not the question (I simply described the
whole task just in case), I wonder what the most efficient way is to import
such data into the table, keeping in mind that

- the dictionary table already contains ~200K records
- the patch could be ~1-50K records long
- records are never removed from the dictionary, only added if they do not already exist

Thanks!

--
Eugene Dzhurinsky

I was hoping that you'd get a good reply from someone else. But it is the weekend. So I will _try_ to explain what I would do. You can see if it is of any use to you. Sometimes my ideas are really good. And other times they are, well, let's be nice and say "not as good as other times" [grin/]. I can't test the below because I don't have any data. But hopefully it will be close and even of some help to you.

The first thing that I would do is put the input patch data into its own table. Perhaps a temporary table, or even a permanent one.

DROP TABLE IF EXISTS patch_data;
CREATE TABLE patch_data (
   input_record_number SERIAL,
   already_exists BOOLEAN DEFAULT FALSE,
   id INTEGER,
   series TEXT NOT NULL );

I don't know if the order of the records in the output file needs to match the order of the records in the input file. If not, then you don't need the "input_record_number" column. Otherwise, it is used to maintain the order of the input. At this point, you can load the input file with an SQL command similar to:

COPY patch_data (series) FROM 'input-file.txt';
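
One caveat: COPY ... FROM reads the file on the database server's filesystem and requires superuser privileges. If you are running this from psql on a client machine, the client-side \copy meta-command is the usual alternative; a sketch, assuming the patch file is named patch.txt (a hypothetical name):

\copy patch_data (series) FROM 'patch.txt'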

Now update patch_data from the existing dictionary table to see which "series" values already exist.

UPDATE patch_data SET already_exists = EXISTS (SELECT 1 FROM dictionary WHERE dictionary.series = patch_data.series);

At this point, the table patch_data has been updated so that "already_exists" is TRUE for every row whose "series" value already exists in "dictionary", and FALSE for the rest. (Note the EXISTS: a plain scalar subquery would leave NULL, not FALSE, in the non-matching rows, and the FALSE test below would then miss them.) This means we need to insert into "dictionary" all the series data in "patch_data" which does not exist there yet, i.e. the rows where "already_exists" is FALSE.
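
As a quick sanity check before inserting anything, you can count how many rows matched (just a suggestion, not a required step):

SELECT already_exists, count(*) FROM patch_data GROUP BY already_exists;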

INSERT INTO dictionary(series) SELECT series FROM patch_data WHERE already_exists = FALSE;

The above should insert into "dictionary" the "series" records which don't exist there yet, generating an "id" for each from your SERIAL definition. Now we need to find out the "id" value for each "series" value and return them.
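
As an aside: the "already_exists" flag is only consulted by that INSERT, so if you don't need the flag for anything else (the audit idea at the end, say), the flag UPDATE and the INSERT can be collapsed into one anti-join statement. A sketch:

INSERT INTO dictionary(series)
SELECT p.series FROM patch_data p
WHERE NOT EXISTS (SELECT 1 FROM dictionary d WHERE d.series = p.series);

Either way, the next step is the same.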

UPDATE patch_data SET id = (SELECT id FROM dictionary WHERE dictionary.series = patch_data.series);
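
With a patch in the 1-50K row range, the join form of UPDATE may perform better than a correlated subquery, since it lets the planner choose a single join strategy; an equivalent sketch:

UPDATE patch_data SET id = d.id FROM dictionary d WHERE d.series = patch_data.series;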

At this point, if I have not messed up, every "series" value in "patch_data" should have the proper "id" value from "dictionary", so the table "patch_data" now has all that we need in it. We just use it to make our output file. I don't know your output requirements, but I'd do something like:

SELECT id || e'\t' || series FROM patch_data ORDER BY input_record_number;
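
To write that straight to a file, COPY (and psql's client-side \copy) can take a query instead of a table name; a sketch, assuming a hypothetical output file name of output-file.txt:

\copy (SELECT id || e'\t' || series FROM patch_data ORDER BY input_record_number) TO 'output-file.txt'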

Again, if the output order does not need to be the same as the input order, then all the stuff with the "input_record_number" column can be eliminated. Just to be a good citizen:

DROP TABLE patch_data;

at this point, unless you think you might need it for some reason. Who knows, you might even want to archive it as some sort of audit information.
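
Putting it all together, here is the whole flow as one psql script; a minimal sketch, assuming hypothetical file names patch.txt and output-file.txt, and assuming only one patch is applied at a time (nothing here guards against a concurrent session inserting the same series between the steps):

BEGIN;
DROP TABLE IF EXISTS patch_data;
CREATE TABLE patch_data (
   input_record_number SERIAL,
   already_exists BOOLEAN DEFAULT FALSE,
   id INTEGER,
   series TEXT NOT NULL );
-- load the raw patch lines (client-side file)
\copy patch_data (series) FROM 'patch.txt'
-- flag the series values that are already in the dictionary
UPDATE patch_data SET already_exists = EXISTS (SELECT 1 FROM dictionary WHERE dictionary.series = patch_data.series);
-- add the missing ones, letting SERIAL assign their ids
INSERT INTO dictionary(series) SELECT series FROM patch_data WHERE already_exists = FALSE;
-- pick up the id for every row, old and new
UPDATE patch_data SET id = (SELECT id FROM dictionary WHERE dictionary.series = patch_data.series);
-- write the ID<tab>series file in input order (client-side file)
\copy (SELECT id || e'\t' || series FROM patch_data ORDER BY input_record_number) TO 'output-file.txt'
COMMIT;
DROP TABLE patch_data;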


--
He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown
