Thread: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Arup Rakshit
Date:
Hi,

I am copying the data from a CSV file to a table using the "COPY" command. But one thing I got stuck on is how to skip duplicate records while copying from CSV to tables. By looking at the documentation, it seems PostgreSQL doesn't have any inbuilt tool to handle this with the "COPY" command. By Googling I found the idea below of using a temp table.

http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql

I am also thinking: what if I let the records get inserted, and then delete the duplicate records from the table, as this post suggested - http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.

Both of the solutions look like doing double work. But I am not sure which is the best solution here. Can anybody suggest which approach I should adopt? Or if you have any better ideas on this task, please share.

Thanks in advance!

--
================
Regards,
Arup Rakshit
================
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

--Brian Kernighan
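For reference, the second idea (load everything first, then delete the duplicates afterwards) usually boils down to a self-join on ctid, roughly like the sketch below. table1, col1 and col2 are placeholder names, not anything taken from the linked posts; the column list would have to cover every column of the real table.

  -- Keep one physical copy of each fully duplicated row and delete the rest.
  -- IS NOT DISTINCT FROM is used instead of "=" so rows containing NULLs
  -- still compare as equal.
  DELETE FROM table1 a
  USING table1 b
  WHERE a.ctid > b.ctid
    AND (a.col1, a.col2) IS NOT DISTINCT FROM (b.col1, b.col2);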
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Oliver Elphick
Date:
On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
> Hi,
>
> I am copying the data from a CSV file to a table using the "COPY" command.
> But one thing I got stuck on is how to skip duplicate records while
> copying from CSV to tables. By looking at the documentation, it seems
> PostgreSQL doesn't have any inbuilt tool to handle this with the "COPY"
> command. By Googling I found the idea below of using a temp table.
>
> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
>
> I am also thinking: what if I let the records get inserted, and then
> delete the duplicate records from the table, as this post suggested -
> http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
>
> Both of the solutions look like doing double work. But I am not sure
> which is the best solution here. Can anybody suggest which approach
> I should adopt? Or if you have any better ideas on this task, please share.

Assuming you are using Unix, or can install Unix tools, run the input files through

  sort -u

before passing them to COPY.

Oliver Elphick
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Arup Rakshit
Date:
On Sunday, May 24, 2015 02:52:47 PM you wrote:
> On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
> > Hi,
> >
> > I am copying the data from a CSV file to a table using the "COPY" command.
> > But one thing I got stuck on is how to skip duplicate records while
> > copying from CSV to tables. By looking at the documentation, it seems
> > PostgreSQL doesn't have any inbuilt tool to handle this with the "COPY"
> > command. By Googling I found the idea below of using a temp table.
> >
> > http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
> >
> > I am also thinking: what if I let the records get inserted, and then
> > delete the duplicate records from the table, as this post suggested -
> > http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
> >
> > Both of the solutions look like doing double work. But I am not sure
> > which is the best solution here. Can anybody suggest which approach
> > I should adopt? Or if you have any better ideas on this task, please share.
>
> Assuming you are using Unix, or can install Unix tools, run the input files through
>
>   sort -u
>
> before passing them to COPY.
>
> Oliver Elphick
>

I think I need to ask in a more specific way. I have a table, say `table1`, which I feed with data from different CSV files. Now suppose I have inserted N records into my table `table1` from CSV file `c1`. That is fine; next time, when I am importing from a different CSV file, say `c2`, into `table1`, I just don't want to reinsert any record from this new CSV file into table `table1` if the table already has that data.

How to do this?

My SO link is not a solution to my problem, I see now.

--
================
Regards,
Arup Rakshit
================
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

--Brian Kernighan
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Oliver Elphick
Date:
On Sun, 2015-05-24 at 18:25 +0630, Arup Rakshit wrote:
> >
> > Assuming you are using Unix, or can install Unix tools, run the input files through
> >
> >   sort -u
> >
> > before passing them to COPY.
> >
> > Oliver Elphick
> >
>
> I think I need to ask in a more specific way. I have a table, say `table1`, which I feed with data from different CSV files. Now suppose I have inserted N records into my table `table1` from CSV file `c1`. That is fine; next time, when I am importing from a different CSV file, say `c2`, into `table1`, I just don't want to reinsert any record from this new CSV file into table `table1` if the table already has that data.
>
> How to do this?

Unix tools are still the easiest way to deal with it, I think.

Ensure the total input is unique as above and stored in file1.

Use COPY to output the existing table to another text file (file2) with similar format to file1. Then

  cat file1 file2 | sort | uniq -d >file3

This will only output lines that exist in both file1 and file2. Then

  cat file1 file3 | sort | uniq -u >newinputfile

This will eliminate from file1 lines that are already in file2. It will only eliminate lines that are entirely identical; it won't stop duplicate primary keys.

Oliver Elphick
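The COPY-to-file step above can be done from psql along the lines of the sketch below; table1 and the file names are placeholders. For the line-level comparison to work, the exported rows must render exactly like the lines in file1 (same column order, quoting and value formatting).

  -- Export the existing table in a text format comparable to file1
  -- (client-side \copy, so the file ends up on the psql client machine).
  \copy table1 TO 'file2' WITH (FORMAT csv)

  -- Once the sort/uniq steps above have produced newinputfile, load it the same way:
  \copy table1 FROM 'newinputfile' WITH (FORMAT csv)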
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
rob stone
Date:
> I think I need to ask in a more specific way. I have a table, say `table1`, which I feed with data from different CSV files. Now suppose I have inserted N records into my table `table1` from CSV file `c1`. That is fine; next time, when I am importing from a different CSV file, say `c2`, into `table1`, I just don't want to reinsert any record from this new CSV file into table `table1` if the table already has that data.
>
> How to do this?
>
> My SO link is not a solution to my problem, I see now.
>
> --
> ================
> Regards,
> Arup Rakshit
> ================
> Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.
>
> --Brian Kernighan
>

Assuming that these CSV files are coming from an external source (e.g. bank statement transactions for feeding into a Bank Rec module), then you need a program to read the file and handle it accordingly. If end users are running this, they would probably appreciate a little report about what was loaded and what was discarded.

On the other hand, if DBAs are running this, you could use ExecuteQuery (written in Java), which has a facility to load CSV files and will report the duplicates. However, you can ignore the duplicates and still commit the non-duplicated transactions to the table, if you so desire.

The default for EQ is NOT to run in auto-commit mode, so you have to actually issue a "commit" to save your work. However, this option can be changed in your preferences.

HTH,
Robert
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Adrian Klaver
Date:
On 05/24/2015 04:55 AM, Arup Rakshit wrote:
> On Sunday, May 24, 2015 02:52:47 PM you wrote:
>> On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
>>> Hi,
>>>
>>> I am copying the data from a CSV file to a table using the "COPY" command.
>>> But one thing I got stuck on is how to skip duplicate records while copying from CSV to tables.
>>> By looking at the documentation, it seems PostgreSQL doesn't have any inbuilt tool to handle this with the "COPY" command.
>>> By Googling I found the idea below of using a temp table.
>>>
>>> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
>>>
>>> I am also thinking: what if I let the records get inserted, and then delete the duplicate records from the table, as this post suggested -
>>> http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
>>>
>>> Both of the solutions look like doing double work. But I am not sure which is the best solution here.
>>> Can anybody suggest which approach I should adopt? Or if you have any better ideas on this task, please share.
>>
>> Assuming you are using Unix, or can install Unix tools, run the input files through
>>
>>   sort -u
>>
>> before passing them to COPY.
>>
>> Oliver Elphick
>>
>
> I think I need to ask in a more specific way. I have a table, say `table1`, which I feed with data from different CSV files. Now suppose I have inserted N records into my table `table1` from CSV file `c1`. That is fine; next time, when I am importing from a different CSV file, say `c2`, into `table1`, I just don't want to reinsert any record from this new CSV file into table `table1` if the table already has that data.
>
> How to do this?

As others have pointed out, this depends on what you are considering a duplicate.

Is it if the entire row is duplicated? Or if some portion of the row (a 'primary key') is duplicated?

>
> My SO link is not a solution to my problem, I see now.
>

--
Adrian Klaver
adrian.klaver@aklaver.com
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Arup Rakshit
Date:
On Sunday, May 24, 2015 07:24:41 AM you wrote:
> On 05/24/2015 04:55 AM, Arup Rakshit wrote:
> > On Sunday, May 24, 2015 02:52:47 PM you wrote:
> >> On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
> >>> Hi,
> >>>
> >>> I am copying the data from a CSV file to a table using the "COPY" command.
> >>> But one thing I got stuck on is how to skip duplicate records while copying from CSV to tables.
> >>> By looking at the documentation, it seems PostgreSQL doesn't have any inbuilt tool to handle this with the "COPY" command.
> >>> By Googling I found the idea below of using a temp table.
> >>>
> >>> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
> >>>
> >>> I am also thinking: what if I let the records get inserted, and then delete the duplicate records from the table, as this post suggested -
> >>> http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
> >>>
> >>> Both of the solutions look like doing double work. But I am not sure which is the best solution here.
> >>> Can anybody suggest which approach I should adopt? Or if you have any better ideas on this task, please share.
> >>
> >> Assuming you are using Unix, or can install Unix tools, run the input files through
> >>
> >>   sort -u
> >>
> >> before passing them to COPY.
> >>
> >> Oliver Elphick
> >>
> >
> > I think I need to ask in a more specific way. I have a table, say `table1`, which I feed with data from different CSV files. Now suppose I have inserted N records into my table `table1` from CSV file `c1`. That is fine; next time, when I am importing from a different CSV file, say `c2`, into `table1`, I just don't want to reinsert any record from this new CSV file into table `table1` if the table already has that data.
> >
> > How to do this?
>
> As others have pointed out, this depends on what you are considering a duplicate.
>
> Is it if the entire row is duplicated?

It is the entire row.

> Or if some portion of the row (a 'primary key') is duplicated?
>
> >
> > My SO link is not a solution to my problem, I see now.
> >

--
================
Regards,
Arup Rakshit
================
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

--Brian Kernighan
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Adrian Klaver
Date:
On 05/24/2015 06:24 AM, Arup Rakshit wrote:
> On Sunday, May 24, 2015 07:24:41 AM you wrote:
>> On 05/24/2015 04:55 AM, Arup Rakshit wrote:
>>> On Sunday, May 24, 2015 02:52:47 PM you wrote:
>>>> On Sun, 2015-05-24 at 16:56 +0630, Arup Rakshit wrote:
>>>>> Hi,
>>>>>
>>>>> I am copying the data from a CSV file to a table using the "COPY" command.
>>>>> But one thing I got stuck on is how to skip duplicate records while copying from CSV to tables.
>>>>> By looking at the documentation, it seems PostgreSQL doesn't have any inbuilt tool to handle this with the "COPY" command.
>>>>> By Googling I found the idea below of using a temp table.
>>>>>
>>>>> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
>>>>>
>>>>> I am also thinking: what if I let the records get inserted, and then delete the duplicate records from the table, as this post suggested -
>>>>> http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
>>>>>
>>>>> Both of the solutions look like doing double work. But I am not sure which is the best solution here.
>>>>> Can anybody suggest which approach I should adopt? Or if you have any better ideas on this task, please share.
>>>>
>>>> Assuming you are using Unix, or can install Unix tools, run the input files through
>>>>
>>>>   sort -u
>>>>
>>>> before passing them to COPY.
>>>>
>>>> Oliver Elphick
>>>>
>>>
>>> I think I need to ask in a more specific way. I have a table, say `table1`, which I feed with data from different CSV files. Now suppose I have inserted N records into my table `table1` from CSV file `c1`. That is fine; next time, when I am importing from a different CSV file, say `c2`, into `table1`, I just don't want to reinsert any record from this new CSV file into table `table1` if the table already has that data.
>>>
>>> How to do this?
>>
>> As others have pointed out, this depends on what you are considering a duplicate.
>>
>> Is it if the entire row is duplicated?
>
> It is the entire row.

So, Oliver's second solution.

>
>> Or if some portion of the row (a 'primary key') is duplicated?
>>
>>>
>>> My SO link is not a solution to my problem, I see now.
>>>
>

--
Adrian Klaver
adrian.klaver@aklaver.com
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Arup Rakshit
Date:
On Sunday, May 24, 2015 07:52:43 AM you wrote:
> >>
> >> Is it if the entire row is duplicated?
> >
> > It is the entire row.
>
> So, Oliver's second solution.
>

I have done this:

  columns_t1 = self.singleton_class.fields.map { |f| "t1.#{f}" }.join(",")
  columns_t2 = self.singleton_class.fields.map { |f| "t2.#{f}" }.join(",")

  ActiveRecord::Base.transaction do
    conn = ActiveRecord::Base.connection
    conn.execute "CREATE TEMP TABLE tmp_table AS SELECT * FROM #{table.strip};"
    conn.execute("COPY tmp_table ( #{self.singleton_class.fields.join(',')} ) FROM '#{source_file}' CSV HEADER DELIMITER '\t' QUOTE '|';")
    conn.execute "INSERT INTO #{table.strip} ( #{self.singleton_class.fields.join(',')} ) SELECT DISTINCT #{columns_t1} FROM tmp_table t1 WHERE NOT EXISTS ( SELECT 1 FROM #{table.strip} t2 WHERE (#{columns_t2}) IS NOT DISTINCT FROM (#{columns_t1}) );"
    conn.execute "DROP TABLE IF EXISTS tmp_table;"
  end

The SQL is wrapped inside the ActiveRecord ORM as you can see above, but I hope you get the idea. I am not sure, though, if it is the correct way to do it, or how it will affect performance.

The application can run on different OSes, so I cannot rely on Unix commands.

--
================
Regards,
Arup Rakshit
================
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.

--Brian Kernighan
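Stripped of the ActiveRecord wrapper, the SQL being executed above is roughly the following sketch. table1, col1/col2 and the file path are placeholders for the real names; \copy from psql could be used instead of COPY FROM 'file' if the CSV lives on the client rather than the server.

  BEGIN;

  -- Staging table with the same structure as the target. As written (and as in
  -- the Ruby code above) it also copies the existing rows; adding "WHERE false"
  -- would create it empty, which is enough for this purpose.
  CREATE TEMP TABLE tmp_table AS SELECT * FROM table1;

  -- Load the new CSV file into the staging table.
  COPY tmp_table (col1, col2) FROM '/path/to/c2.csv'
      WITH (FORMAT csv, HEADER, DELIMITER E'\t', QUOTE '|');

  -- Insert only rows the target table does not already contain.
  -- IS NOT DISTINCT FROM treats NULLs as equal, unlike plain "=".
  INSERT INTO table1 (col1, col2)
  SELECT DISTINCT t1.col1, t1.col2
  FROM tmp_table t1
  WHERE NOT EXISTS (
      SELECT 1
      FROM table1 t2
      WHERE (t2.col1, t2.col2) IS NOT DISTINCT FROM (t1.col1, t1.col2)
  );

  DROP TABLE IF EXISTS tmp_table;

  COMMIT;

How well the NOT EXISTS anti-join performs will mostly depend on having an index on table1 covering the compared columns.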
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Francisco Olarte
Date:
Hi Arup

On Sun, May 24, 2015 at 12:26 PM, Arup Rakshit <aruprakshit@rocketmail.com> wrote:
> I am copying the data from a CSV file to a table using the "COPY" command. But one thing I got stuck on is how to skip duplicate records while copying from CSV to tables. By looking at the documentation, it seems PostgreSQL doesn't have any inbuilt tool to handle this with the "COPY" command. By Googling I found the idea below of using a temp table.
> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
> I am also thinking: what if I let the records get inserted, and then delete the duplicate records from the table, as this post suggested - http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
> Both of the solutions look like doing double work. But I am not sure which is the best solution here. Can anybody suggest which approach I should adopt? Or if you have any better ideas on this task, please share.

Without looking at your SO answers (I suppose both consist of copying to a temp table, then either doing an 'insert into main table select * from temp where not exists...' or a delete from temp / insert into main, which is what I would suggest), I would point out that COPY is not the tool for this kind of job. COPY is normally used for bulk loading of correct data, and is great at this. If you need to pre-clean data, you are not doing double work by using one of those solutions.

Many ways of doing it have been pointed out; the exact one depends on your data. You've pointed out that there are duplicate rows; if this is true you can easily do that using text tools and dumps, or the temporary table. There is also the solution of just turning your CSV, with your favorite text-handling tool, into a set of INSERT lines conditioned on the non-existence of an appropriately indexed set of fields in the table (which I've done but cannot quote exactly; it was an experiment and got quite hairy).

From my experience, go for the temporary table plus insert-select / delete+select route; it's easier, is normally faster (supposing you have appropriate indexes), and it really is no double work. You have dirty data (duplicates); you have to clean it before inserting.

Or, you could turn the file into a series of plain inserts and feed it to psql in autocommit mode, if you have some unique condition on the table, so it errors out on every duplicate. It's not pretty but should work. I still recommend the temp table approach.

Francisco Olarte.
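The delete-then-insert variant mentioned above (delete from temp, then insert into main) looks roughly like the sketch below, again with placeholder table and column names, and assuming the CSV has already been loaded into tmp_table as in the earlier example.

  -- Remove from the staging table any row the main table already has
  -- (extend the column lists to cover the whole row).
  DELETE FROM tmp_table t
  USING table1 m
  WHERE (t.col1, t.col2) IS NOT DISTINCT FROM (m.col1, m.col2);

  -- Whatever is left in the staging table is new; move it across.
  INSERT INTO table1 (col1, col2)
  SELECT DISTINCT col1, col2
  FROM tmp_table;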
Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"
From
Scott Marlowe
Date:
On Sun, May 24, 2015 at 4:26 AM, Arup Rakshit <aruprakshit@rocketmail.com> wrote:
> Hi,
>
> I am copying the data from a CSV file to a table using the "COPY" command. But one thing I got stuck on is how to skip duplicate records while copying from CSV to tables. By looking at the documentation, it seems PostgreSQL doesn't have any inbuilt tool to handle this with the "COPY" command. By Googling I found the idea below of using a temp table.
>
> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
>
> I am also thinking: what if I let the records get inserted, and then delete the duplicate records from the table, as this post suggested - http://www.postgresql.org/message-id/37013500.DFF0A64A@manhattanproject.com.
>
> Both of the solutions look like doing double work. But I am not sure which is the best solution here. Can anybody suggest which approach I should adopt? Or if you have any better ideas on this task, please share.

Have you looked at pgloader?

http://pgloader.io/index.html