data import - duplicates - Mailing list pgsql-general

From brian
Subject data import - duplicates
Date
Msg-id 47D378CA.3090007@zijn-digital.com
Whole thread Raw
Responses Re: data import - duplicates  ("Webb Sprague" <webb.sprague@gmail.com>)
List pgsql-general
I have ~350K rows of sample data that has come to me in 64 text files
(CSV) that I'd like to import into a new database. Each file can be
considered its own category and is so named. That is, each file name
will be inserted into a separate categories table. I'd like to relate
each row to its category. However, while the rows are unique in each
file, some rows are found in more than one file.

I also must parse some of the fields into separate columns, but that's
no big deal. But it means that I must do some pre-processing on these
files, in any case.

After some thought, I thought I might brute-force the problem with Perl
by reading each line of each file into an assoc. array unless it's
already in it, in which case I'd append the "key" based on the list of
categories that line is found in (I'd still need to parse outthe keys
later but I'd get rid of the dupes). Each array row would be like so:

'key' => '1,4,etc.', 'text' => 'a line'

Aside from the fact that the array search would become ridiculously
large after just a couple of files, I realised that this is a
non-starter for the simple fact that this data comprises less than 25%
of the eventual total. So refactoring it in this way would be a waste of
time (there will probably be dupes in the files to come).

So, I'd like to instead parse out my columns properly and write each
line (along with its category key) to a new, single file to be copied
into a working table. ONce I've done so, is there an efficient method I
can use to select all duplicates (save for the category key) into a set
from which I could then select into the final table (and insert the keys
into the category join table)?

For example (pk is the PK from the working table and ck is the category
key), my dupes query on the working table would give the following set:

pk  ck
1   1  a  a  a  a
2   3  a  a  a  a
3   3  b  b  b  b
4   7  b  b  b  b
5   4  a  a  a  a

I would then want to insert just the unique rows into the final table
yet add all of the the PKs and CKs to the category join table. After
that was done, I'd delete all of these from the working table and then
move the unique rows that are left to the final table (and insert the
keys into the join table).

I hope that makes sense. I'm not looking for anyone to do my homework
for me; I'm sure I could fix up a tasty function for this (the data is
destined for MySQL, alas, but I'll be damned if I won't use PG for the
heavy lifting). What I'm really looking for is a handy way to grab all
of those dupes.

pgsql-general by date:

Previous
From: Greg Smith
Date:
Subject: Re: Watch your PlanetPostgreSQL.org blogs
Next
From: "Webb Sprague"
Date:
Subject: Re: data import - duplicates