Thread: [GENERAL] import CSV file to a table
Hi all,
I want to import the contents of a CSV file into a table via a bash script, without creating a temporary table, and I also want to skip some columns in the CSV file (for instance, the CSV file has 12 columns and the main table has only 2 columns; if possible I would use only 2 of the CSV file's columns). Is there any way to do it?
Regards,
--
Gunce Kaya
Since you're using bash, I will assume you are not averse to using a slightly complicated pipeline. First, install this: https://github.com/wireservice/csvkit
Then use that to cut out the columns you want and pipe the result into psql with an appropriate \copy command.
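For instance, a minimal sketch along those lines (the database, table, and column names here are placeholders, and the file is assumed to have a header row; csvcut is part of csvkit):

csvcut -c 5,7 input.csv | psql -d mydb -c "\copy mytable (col_a, col_b) FROM STDIN WITH (FORMAT csv, HEADER)"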
On Wed, Mar 8, 2017 at 4:13 AM, Günce Kaya <guncekaya14@gmail.com> wrote:
> I want to import the contents of a CSV file into a table via a bash script, without creating a temporary table, and I also want to skip some columns in the CSV file (for instance, the CSV file has 12 columns and the main table has only 2 columns). Is there any way to do it?
Not too difficult, but "unusual". However, there is a restriction that the data cannot have an embedded comma. That is, you can't have something like: "a,b",c and want two columns with a,b and c as the values.
[tsh009@it-johnmckown-linux junk]$ cat ptest.csv
a,b,c,d,e,f,g,h,i
1,2,3,4,5,6,7,8,9
z,y,x,w,v,u,t,s,r,q
[tsh009@it-johnmckown-linux junk]$ cat ptest.sh
#!/bin/bash
printf "%s\n" 'COPY schema1.table1(column1from5, column2from7) FROM stdin;'
export IFS=','
while read i; do                  # read in the records until EOF
    test -n "${i}" && {           # ignore blank lines!
        set ${i}                  # set shell variables $1, $2, ...
        printf "%s\t%s\n" $5 $7   # write out columns 5 & 7
    }
done
printf "%s\n" '\.' # write EOF delimiter for COPY
[tsh009@it-johnmckown-linux junk]$ ./ptest.sh <ptest.csv
COPY schema1.table1(column1from5, column2from7) FROM stdin;
e g
5 7
v t
\.
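To actually load the data, that generated COPY stream can be piped straight into psql (the database name is just a placeholder):

./ptest.sh <ptest.csv | psql -d mydb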
"Irrigation of the land with seawater desalinated by fusion power is ancient. It's called 'rain'." -- Michael McClary, in alt.fusion
Maranatha! <><
John McKown
Hi,
If this is a one-time thing, you can use the Import Data tool in Database Workbench (see http://www.upscene.com).
Hope this helps.
With regards,
Martijn Tonies
Upscene Productions
http://www.upscene.com
Database Workbench - developer tool for Oracle, MS SQL Server, PostgreSQL,
SQL Anywhere, MySQL, InterBase, NexusDB and Firebird.
On 2017-03-08 10:13, Günce Kaya wrote:
> I want to import the contents of a CSV file into a table via a bash script, without creating a temporary table, and I also want to skip some columns in the CSV file. Is there any way to do it?

This is more a programming question than a database question, and there are many possible solutions.
Do *not*, whatever you do, try to write your own piece of code to read the CSV. There are lots of unexpected
ways that the CSV file can be slightly different from what you expect, and figuring all those out is a waste of time.
The example of embedded commas is just one way; there could also be newlines, line breaks, UTF-8 escape characters, etc.
Personally I'd go the Python route because it's simple and straightforward, but anything you are comfortable with will do.
If you are going to install additional software to do this then remember that you'll need that same software again if
you need to do this again, or when you need to move this code to a different server.
I agree. I went with a "pure BASH" approach because it is what the user asked for & I wasn't sure what language she might be comfortable with. I use PERL a lot. Or maybe I should say that I abuse PERL a lot. Such as a PERL script which writes out another PERL script, based on some input files & parameters, then runs the just-written PERL script, which does the load into a PostgreSQL database (multiple tables). Ya, a bit perverted.
"Irrigation of the land with seawater desalinated by fusion power is ancient. It's called 'rain'." -- Michael McClary, in alt.fusion
Maranatha! <><
John McKown
On Mar 08, John McKown modulated:
> I agree. I went with a "pure BASH" approach because it is what the user asked for & I wasn't sure what language she might be comfortable with. [...]

Well, you could follow a similar meta-programming/code-generating pattern to have the BASH script output a single SQL file to run with psql. You could even generate PL/pgSQL code to defer more data processing to the database itself.

I think the only robust "pure BASH" approach is to use a temporary table, so you aren't trying to parse CSV content in BASH. Using csvkit sounds good if you can introduce these third-party dependencies.

With the temporary table, you can use SQL for most validation or data interrogation, but you need to know at least enough schema information in advance to form the COPY statement. Parsing the CSV header row to plan your work puts you right back to requiring a robust CSV parser unless you can constrain your input scenarios to only handle very trivial headers.

If you play games with a defaulting serial column and fixed column names like "id, c1, c2, ..., cN" for the temporary table, you might use the id column as a DB-assigned "row number" during COPY and validation. In this case, you could even tell Postgres there is no header, and then let it parse the header as another data record so you can use SQL statements to determine the actual header names and ordering in the input. But this still requires knowing the column count in advance of the COPY.

I also think using something like Python with structured data processing would be wiser, unless you know enough about the schema in advance to avoid any CSV parsing on the client side.

Karl
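To make that staging-table idea concrete, here is a rough sketch (not something posted in the thread): the database name, the 12-column layout, the target table schema1.table1(column1from5, column2from7) from the earlier example, and the temp-table/column names are all assumptions.

psql -d mydb <<'SQL'
BEGIN;
CREATE TEMP TABLE stage (
    id  serial,   -- DB-assigned row number
    c1 text, c2 text, c3 text, c4  text, c5  text, c6  text,
    c7 text, c8 text, c9 text, c10 text, c11 text, c12 text
);
-- let the server's CSV parser handle quoting; with HEADER false a header
-- line is loaded as an ordinary row that can be inspected afterwards
\copy stage (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12) FROM 'input.csv' WITH (FORMAT csv, HEADER false)
INSERT INTO schema1.table1 (column1from5, column2from7)
SELECT c5, c7 FROM stage WHERE id > 1;   -- id > 1 skips the first (header) line; drop it if there is no header
COMMIT;
SQL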
> With the temporary table, you can use SQL for most validation or data
> interrogation, but you need to know at least enough schema information
> in advance to form the COPY statement. Parsing the CSV header row to
> plan your work puts you right back to requiring a robust CSV parser
> unless you can constrain your input scenarios to only handle very
> trivial headers.
You can write the entire contents of the CSV into a psql variable and process the text blob from there using intermediate arrays.
David J.
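A rough illustration of that approach (assuming a simple, quote-free file, the placeholder names used above, and PostgreSQL 9.4+ for WITH ORDINALITY; it will not survive quoted fields or embedded newlines):

psql -d mydb <<'SQL'
\set csv `cat input.csv`
INSERT INTO schema1.table1 (column1from5, column2from7)
SELECT cols[5], cols[7]
FROM (
    SELECT string_to_array(t.line, ',') AS cols
    FROM unnest(string_to_array(:'csv', E'\n')) WITH ORDINALITY AS t(line, n)
    WHERE t.n > 1 AND t.line <> ''   -- skip the header row and blank lines
) AS parsed;
SQL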
Since bash has been bandied about in this thread, I presume awk is available. Here's how I would check just how 'csv'ish the incoming file is:

awk -F',' '{a[NF]++} END {for (i in a) printf "%d lines have %d fields (columns)\n", a[i], i}' csvfilename

If this doesn't produce just one line of output, you have to suspect quoted values containing commas (or whatever delimiter character you choose); then you need a real CSV parser.
If it's just one line, I'd use cut to get rid of unwanted columns, then let COPY do its thing.
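Something along those lines, with placeholder database/table/column names and assuming no header row:

cut -d, -f5,7 input.csv | psql -d mydb -c "\copy mytable (col_a, col_b) FROM STDIN WITH (FORMAT csv)"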
I believe that in its full glory, you cannot reliably locate CSV record boundaries except by parsing each field in order, including quote processing. Individual records may have arbitrary numbers of field and record separator characters within the values.

Karl

On Mar 08, Rob Sargent modulated:
> Since bash has been bandied about in this thread, I presume awk is available. Here's how I would check just how 'csv'ish the incoming file is.
> ...
On 03/08/2017 09:36 AM, Karl Czajkowski wrote:
> I believe that in its full glory, you cannot reliably locate CSV record boundaries except by parsing each field in order, including quote processing. [...]

Yes Karl, I agree. I admitted as much. But if it's clean, as in free of quoted commas, life is much simpler. I've lost sight of whether or not the OP knows his situation w.r.t. this. The awk line will tell him, and for a one-off load this can make a world of difference in complexity: two bash lines and a COPY.
On Mar 08, Rob Sargent modulated:
> Yes Karl, I agree. I admitted as much. But if it's clean, as in free of quoted commas, life is much simpler. [...]

Maybe I didn't understand your awk? I thought it was counting commas in lines. This isn't the same as counting commas in records.

this,is,record,one
"this,,","is
,,record","two
,,,"

This has three commas on each line and definitely is not suitable for naive CSV handling.

Karl
I'll throw in.
If tab-delimited is available, perhaps that option will work better... or
use Access to find the violations of the quoted/comma-delimited assumptions, then
export from Access and import.
Bret
On Wed, 2017-03-08 at 08:36 -0800, Karl Czajkowski wrote:
> I believe that in its full glory, you cannot reliably locate CSV record boundaries except by parsing each field in order, including quote processing. [...]
On 03/08/2017 09:52 AM, Karl Czajkowski wrote:
> Maybe I didn't understand your awk? I thought it was counting commas in lines. This isn't the same as counting commas in records. [...]

In essence it does count commas, but plus one :). NF is the number of fields defined by commas, so one more field than the number of commas. If you think/hope the file is simple and well formatted, this is a pretty quick check. But if you're looking for a general solution, you need a real CSV parser. I recall being quite surprised and amused to learn there is an actual standard for the CSV format. (Naturally if you have one to hand, you don't need the awk line.)
Hi,
When I opened this thread, I didn't know exactly which terms to search for. I've read all of your posts and I think CSV parsing is the key point for me. I've created my script and I'm sharing it for the record.
#cat cargo2.sh
#!/bin/bash
while IFS=, read -r uor_desc crime_type zip_code ucr_hierarchy date_reported date_occured
do
    echo "select * from insertorders('$uor_desc', '$crime_type', '$zip_code', '$ucr_hierarchy', '$date_reported', '$date_occured');"
done < test.txt
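The generated statements can then be piped straight into psql, e.g. ./cargo2.sh | psql -d mydb (database name is a placeholder; this simple read-based split assumes the values contain no quoted commas or embedded single quotes).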
Thank you for your advice and help.
Regards,
Gunce