Thread: How to skip duplicate records while copying from CSV to table in PostgreSQL using "COPY"

Hi,

I am copying data from a CSV file to a table using the "COPY" command. The one thing I am stuck on is how to skip
duplicate records while copying from the CSV into the table. Looking at the documentation, it seems PostgreSQL doesn't
have any built-in tool to handle this with "COPY". Searching around, I found the idea below of using a temp table.

http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql

I am also thinking about letting the records get inserted and then deleting the duplicate records from the table, as
this post suggested - http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.

Both of these solutions look like doing double work, and I am not sure which is the best one here. Can anybody suggest
which approach I should adopt? Or if you have any better ideas for this task, please share.

Thanks in advance!

--
================
Regards,
Arup Rakshit
================
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as
possible, you are, by definition, not smart enough to debug it.

--Brian Kernighan


On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
> Hi,
>
> I am copying data from a CSV file to a table using the "COPY" command.
> The one thing I am stuck on is how to skip duplicate records while
> copying from the CSV into the table. Looking at the documentation, it
> seems PostgreSQL doesn't have any built-in tool to handle this with
> "COPY". Searching around, I found the idea below of using a temp table.
>
> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
>
> I am also thinking about letting the records get inserted and then
> deleting the duplicate records from the table, as this post suggested -
> http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
>
> Both of these solutions look like doing double work, and I am not sure
> which is the best one here. Can anybody suggest which approach I should
> adopt? Or if you have any better ideas for this task, please share.

Assuming you are using Unix, or can install Unix tools, run the input
files through

  sort -u

before passing them to COPY.

Oliver Elphick



On Sunday, May 24, 2015 02:52:47 PM you wrote:
> On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
> > Hi,
> >
> > I am copying data from a CSV file to a table using the "COPY" command.
> > The one thing I am stuck on is how to skip duplicate records while
> > copying from the CSV into the table. Looking at the documentation, it
> > seems PostgreSQL doesn't have any built-in tool to handle this with
> > "COPY". Searching around, I found the idea below of using a temp table.
> >
> > http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
> >
> > I am also thinking about letting the records get inserted and then
> > deleting the duplicate records from the table, as this post suggested -
> > http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
> >
> > Both of these solutions look like doing double work, and I am not sure
> > which is the best one here. Can anybody suggest which approach I should
> > adopt? Or if you have any better ideas for this task, please share.
>
> Assuming you are using Unix, or can install Unix tools, run the input
> files through
>
>   sort -u
>
> before passing them to COPY.
>
> Oliver Elphick
>

I think I need to ask in a more specific way. I have a table, say `table1`, where I feed data from different CSV
files. Now suppose I have inserted N records into `table1` from CSV file `c1`. That is fine; but the next time, when I
import from a different CSV file, say `c2`, into `table1`, I don't want to re-insert any record from this new CSV file
if the table already contains it.

How to do this?

I see now that my SO link is not a solution to my problem.

--
================
Regards,
Arup Rakshit
================
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as
possible, you are, by definition, not smart enough to debug it.

--Brian Kernighan


On Sun, 2015-05-24 at 18:25 +0630, Arup Rakshit wrote:
> >
> > Assuming you are using Unix, or can install Unix tools, run the input
> > files through
> >
> >   sort -u
> >
> > before passing them to COPY.
> >
> > Oliver Elphick
> >
>
> I think I need to ask in a more specific way. I have a table, say
> `table1`, where I feed data from different CSV files. Now suppose I
> have inserted N records into `table1` from CSV file `c1`. That is fine;
> but the next time, when I import from a different CSV file, say `c2`,
> into `table1`, I don't want to re-insert any record from this new CSV
> file if the table already contains it.
>
> How to do this?

Unix tools are still the easiest way to deal with it, I think.

Ensure the total input has been made unique as above and is stored in file1.

Use COPY to output the existing table to another text file (file2) in
the same format as file1.  Then

  cat file1 file2 | sort | uniq -d >file3

This will only output lines that exist in both file1 and file2.  Then

  cat file1 file3 | sort | uniq -u >newinputfile

This will eliminate from file1 lines that are already in file2.

It will only eliminate lines that are entirely identical; it won't stop
duplicate primary keys.
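
Put together, a minimal sketch of that workflow from a psql session might look like this (assuming the table is
called table1, the files carry no header line, and the paths are just placeholders):

  -- export the current table contents in the same format as file1
  \copy table1 to '/tmp/file2' csv

  -- in the shell, strip from file1 everything already present in file2:
  --   cat file1 file2 | sort | uniq -d > file3
  --   cat file1 file3 | sort | uniq -u > newinputfile

  -- load only the remaining, genuinely new rows
  \copy table1 from '/tmp/newinputfile' csv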


Oliver Elphick



> I think I need to ask in a more specific way. I have a table, say `table1`, where I feed data from different CSV
> files. Now suppose I have inserted N records into `table1` from CSV file `c1`. That is fine; but the next time, when I
> import from a different CSV file, say `c2`, into `table1`, I don't want to re-insert any record from this new CSV file
> if the table already contains it.
>
> How to do this?
>
> I see now that my SO link is not a solution to my problem.
>
> --
> ================
> Regards,
> Arup Rakshit
> ================
> Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as
> possible, you are, by definition, not smart enough to debug it.
>
> --Brian Kernighan
>
>

Assuming that these CSV files are coming from an external source (e.g.
bank statement transactions for feeding into a bank reconciliation
module), you need a program to read the file and handle it accordingly.
If end users are running this, they would probably appreciate a little
report about what was loaded and what was discarded.

On the other hand, if DBAs are running this, you could use ExecuteQuery
(written in Java), which has a facility to load CSV files and will
report the duplicates. However, you can ignore the duplicates and still
commit the non-duplicated transactions to the table, if you so desire.
The default for EQ is NOT to run in auto-commit mode, so you have to
actually issue a "commit" to save your work. However, this option can be
changed in your preferences.

HTH,
Robert



On 05/24/2015 04:55 AM, Arup Rakshit wrote:
> On Sunday, May 24, 2015 02:52:47 PM you wrote:
>> On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
>>> Hi,
>>>
>>> I am copying data from a CSV file to a table using the "COPY" command.
>>> The one thing I am stuck on is how to skip duplicate records while
>>> copying from the CSV into the table. Looking at the documentation, it
>>> seems PostgreSQL doesn't have any built-in tool to handle this with
>>> "COPY". Searching around, I found the idea below of using a temp table.
>>>
>>> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
>>>
>>> I am also thinking about letting the records get inserted and then
>>> deleting the duplicate records from the table, as this post suggested -
>>> http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
>>>
>>> Both of these solutions look like doing double work, and I am not sure
>>> which is the best one here. Can anybody suggest which approach I should
>>> adopt? Or if you have any better ideas for this task, please share.
>>
>> Assuming you are using Unix, or can install Unix tools, run the input
>> files through
>>
>>    sort -u
>>
>> before passing them to COPY.
>>
>> Oliver Elphick
>>
>
> I think I need to ask in a more specific way. I have a table, say `table1`, where I feed data from different CSV
> files. Now suppose I have inserted N records into `table1` from CSV file `c1`. That is fine; but the next time, when I
> import from a different CSV file, say `c2`, into `table1`, I don't want to re-insert any record from this new CSV file
> if the table already contains it.
>
> How to do this?

As others have pointed out, this depends on what you are considering a
duplicate.

Is it if the entire row is duplicated?

Or if some portion of the row (a 'primary key') is duplicated?

>
> I see now that my SO link is not a solution to my problem.
>


--
Adrian Klaver
adrian.klaver@aklaver.com


On Sunday, May 24, 2015 07:24:41 AM you wrote:
> On 05/24/2015 04:55 AM, Arup Rakshit wrote:
> > On Sunday, May 24, 2015 02:52:47 PM you wrote:
> >> On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
> >>> Hi,
> >>>
> >>> I am copying data from a CSV file to a table using the "COPY" command.
> >>> The one thing I am stuck on is how to skip duplicate records while
> >>> copying from the CSV into the table. Looking at the documentation, it
> >>> seems PostgreSQL doesn't have any built-in tool to handle this with
> >>> "COPY". Searching around, I found the idea below of using a temp table.
> >>>
> >>> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
> >>>
> >>> I am also thinking about letting the records get inserted and then
> >>> deleting the duplicate records from the table, as this post suggested -
> >>> http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
> >>>
> >>> Both of these solutions look like doing double work, and I am not sure
> >>> which is the best one here. Can anybody suggest which approach I should
> >>> adopt? Or if you have any better ideas for this task, please share.
> >>
> >> Assuming you are using Unix, or can install Unix tools, run the input
> >> files through
> >>
> >>    sort -u
> >>
> >> before passing them to COPY.
> >>
> >> Oliver Elphick
> >>
> >
> > I think I need to ask in a more specific way. I have a table, say `table1`, where I feed data from different CSV
> > files. Now suppose I have inserted N records into `table1` from CSV file `c1`. That is fine; but the next time, when I
> > import from a different CSV file, say `c2`, into `table1`, I don't want to re-insert any record from this new CSV file
> > if the table already contains it.
> >
> > How to do this?
>
> As others have pointed out, this depends on what you are considering a
> duplicate.
>
> Is it if the entire row is duplicated?

It is the entire row.

> Or if some portion of the row (a 'primary key') is duplicated?
>
> >
> > I see now that my SO link is not a solution to my problem.
> >
>
>
>

--
================
Regards,
Arup Rakshit
================
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as
possible, you are, by definition, not smart enough to debug it.

--Brian Kernighan


On 05/24/2015 06:24 AM, Arup Rakshit wrote:
> On Sunday, May 24, 2015 07:24:41 AM you wrote:
>> On 05/24/2015 04:55 AM, Arup Rakshit wrote:
>>> On Sunday, May 24, 2015 02:52:47 PM you wrote:
>>>> On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
>>>>> Hi,
>>>>>
>>>>> I am copying data from a CSV file to a table using the "COPY" command.
>>>>> The one thing I am stuck on is how to skip duplicate records while
>>>>> copying from the CSV into the table. Looking at the documentation, it
>>>>> seems PostgreSQL doesn't have any built-in tool to handle this with
>>>>> "COPY". Searching around, I found the idea below of using a temp table.
>>>>>
>>>>> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
>>>>>
>>>>> I am also thinking about letting the records get inserted and then
>>>>> deleting the duplicate records from the table, as this post suggested -
>>>>> http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
>>>>>
>>>>> Both of these solutions look like doing double work, and I am not sure
>>>>> which is the best one here. Can anybody suggest which approach I should
>>>>> adopt? Or if you have any better ideas for this task, please share.
>>>>
>>>> Assuming you are using Unix, or can install Unix tools, run the input
>>>> files through
>>>>
>>>>     sort -u
>>>>
>>>> before passing them to COPY.
>>>>
>>>> Oliver Elphick
>>>>
>>>
>>> I think I need to ask in a more specific way. I have a table, say `table1`, where I feed data from different CSV
>>> files. Now suppose I have inserted N records into `table1` from CSV file `c1`. That is fine; but the next time, when I
>>> import from a different CSV file, say `c2`, into `table1`, I don't want to re-insert any record from this new CSV file
>>> if the table already contains it.
>>>
>>> How to do this?
>>
>> As others have pointed out, this depends on what you are considering a
>> duplicate.
>>
>> Is it if the entire row is duplicated?
>
> It is the entire row.

So, Oliver's second solution.

>
>> Or if some portion of the row (a 'primary key') is duplicated?
>>
>>>
>>> I see now that my SO link is not a solution to my problem.
>>>
>>
>>
>>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


On Sunday, May 24, 2015 07:52:43 AM you wrote:
> >>
> >> Is it if the entire row is duplicated?
> >
> > It is the entire row.
>
> So, Oliver's second solution.
>

I have done this:

columns_t1 = self.singleton_class.fields.map { |f| "t1.#{f}" }.join(",")
columns_t2 = self.singleton_class.fields.map { |f| "t2.#{f}" }.join(",")

ActiveRecord::Base.transaction do
  conn = ActiveRecord::Base.connection
  # Staging table with the same columns as the target table (this also copies the
  # existing rows, which the NOT EXISTS filter below excludes again).
  conn.execute "CREATE TEMP TABLE tmp_table AS SELECT * FROM #{table.strip};"
  # Bulk-load the CSV file into the staging table.
  conn.execute("COPY tmp_table ( #{self.singleton_class.fields.join(',')} ) FROM '#{source_file}' CSV HEADER DELIMITER '\t' QUOTE '|';")
  # Insert only those distinct rows that the target table does not already contain.
  conn.execute "INSERT INTO #{table.strip} ( #{self.singleton_class.fields.join(',')} ) SELECT DISTINCT #{columns_t1} FROM tmp_table t1 WHERE NOT EXISTS ( SELECT 1 FROM #{table.strip} t2 WHERE (#{columns_t2}) IS NOT DISTINCT FROM (#{columns_t1}) );"
  conn.execute "DROP TABLE IF EXISTS tmp_table;"
end

The SQL is wrapped inside the ActiveRecord ORM as you can see above, but I hope you get the idea. I am not sure,
though, whether this is the correct way to do it or how it will affect performance.

The application can run on different operating systems, so I cannot rely on Unix commands.

--
================
Regards,
Arup Rakshit
================
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as
possible, you are, by definition, not smart enough to debug it.

--Brian Kernighan


Hi Arup

On Sun, May 24, 2015 at 12:26 PM, Arup Rakshit
<aruprakshit@rocketmail.com> wrote:
> I am copying data from a CSV file to a table using the "COPY" command. The one thing I am stuck on is how to skip
> duplicate records while copying from the CSV into the table. Looking at the documentation, it seems PostgreSQL doesn't
> have any built-in tool to handle this with "COPY". Searching around, I found the idea below of using a temp table.
> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
> I am also thinking about letting the records get inserted and then deleting the duplicate records from the table, as
> this post suggested - http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
> Both of these solutions look like doing double work, and I am not sure which is the best one here. Can anybody suggest
> which approach I should adopt? Or if you have any better ideas for this task, please share.

Without looking at your SO answers (I suppose both consist of copying
to a temp table, then either doing an 'insert into main table select *
from temp where not exists...' or a delete from temp / insert into
main, which is what I would suggest), I would point out that COPY is
not the tool for this kind of job. COPY is normally used for bulk
loading of correct data, and it is great at that. If you need to
pre-clean data, you are not doing double work by using one of those
solutions. Many ways of doing it have been pointed out; the exact one
depends on your data. You've pointed out that there are duplicate rows;
if that is the case you can easily do it using text tools and dumps, or
the temporary table. There is also the option of turning your CSV, with
your favourite text handling tool, into a set of INSERT statements
conditioned on the non-existence of an appropriately indexed set of
fields in the table (which I've done but cannot quote exactly; it was
an experiment and got quite hairy). From my experience, go for the
temporary table plus insert-select/delete+select route: it's easier, it
is normally faster (supposing you have appropriate indexes), and it
really is no double work. You have dirty data (duplicates), so you have
to clean it before inserting.
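
For the whole-row case, a bare-bones sketch of that temp table plus insert-select route in plain SQL could look like
the following (the table name, column names and file path are placeholders, not your real ones):

  -- empty staging table with the same columns as the target
  CREATE TEMP TABLE tmp_table (LIKE table1);

  -- bulk-load the raw CSV into the staging table
  COPY tmp_table (col1, col2, col3) FROM '/path/to/c2.csv' CSV HEADER;

  -- insert only the rows the target does not already contain;
  -- IS NOT DISTINCT FROM also treats NULLs in matching positions as equal
  INSERT INTO table1 (col1, col2, col3)
  SELECT DISTINCT t1.col1, t1.col2, t1.col3
  FROM tmp_table t1
  WHERE NOT EXISTS (
      SELECT 1 FROM table1 t2
      WHERE (t2.col1, t2.col2, t2.col3) IS NOT DISTINCT FROM (t1.col1, t1.col2, t1.col3)
  );

  DROP TABLE tmp_table;

Note that server-side COPY reads a file on the database server and needs superuser rights; from a client you would
use psql's \copy instead.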

Or, you could turn the file into a series of plain INSERTs and feed it
to psql in autocommit mode; if you have some unique constraint on the
table, it will error out on every duplicate. It's not pretty but should
work. I still recommend the temp table approach.
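
A rough illustration of that variant, assuming a unique constraint already covers the relevant columns (again, the
names are placeholders) and the generated INSERTs are fed through psql without ON_ERROR_STOP:

  -- the unique constraint is what makes each duplicate INSERT fail on its own
  ALTER TABLE table1 ADD CONSTRAINT table1_row_uniq UNIQUE (col1, col2, col3);

  -- one INSERT per CSV line; in autocommit mode each statement commits or
  -- fails independently, so a unique_violation only skips that one row
  INSERT INTO table1 (col1, col2, col3) VALUES ('a', 'b', 'c');
  INSERT INTO table1 (col1, col2, col3) VALUES ('a', 'b', 'c');  -- duplicate: errors out, the rest keep going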

Francisco Olarte.


On Sun, May 24, 2015 at 4:26 AM, Arup Rakshit
<aruprakshit@rocketmail.com> wrote:
> Hi,
>
> I am copying data from a CSV file to a table using the "COPY" command. The one thing I am stuck on is how to skip
> duplicate records while copying from the CSV into the table. Looking at the documentation, it seems PostgreSQL doesn't
> have any built-in tool to handle this with "COPY". Searching around, I found the idea below of using a temp table.
>
> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
>
> I am also thinking about letting the records get inserted and then deleting the duplicate records from the table, as
> this post suggested - http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
>
> Both of these solutions look like doing double work, and I am not sure which is the best one here. Can anybody suggest
> which approach I should adopt? Or if you have any better ideas for this task, please share.

Have you looked at pgloader?
http://pgloader.io/index.html