Thread: Finding Errors in .csv Input Data
I'm sure many of you have solved this problem in the past and can offer solutions that will work for me. The context is a 73-column postgres table of data that was originally in an Access .mdb file. A colleague loaded the file into Access and wrote a .csv file for me to use since we have nothing Microsoft here. There are 110,752 rows in the file/table. After a lot of cleaning with emacs and sed, the copy command accepted all but 80 rows of data. Now I need to figure out why postgres reports them as having too many columns.

Starting to work with a single row, I first compared my cleaned row to the raw .csv from the Access output. They match column-by-column. Then I copied the schema to a text file and started comparing the .csv data column-by-column. While this looks OK to me, postgres doesn't like it. For example, I get this error message:

nevada=# \copy water_well from 'one.csv' with delimiter '|' null '' CSV;
ERROR: value too long for type character(1)
CONTEXT: COPY water_well, line 1, column gravel_packed: "106"

Yet the column comparison for gravel_packed and the surrounding attributes does not show this:

lot_no TEXT, |
block_no TEXT, |
well_finish_date DATE, 11/15/1948|
date_cmplt_acc CHAR(1), D|
gravel_packed CHAR(1), |
depth_seal INTEGER, |
depth_drilled INTEGER, 106|
depth_bedrock INTEGER, |
aquifer_desc TEXT, |

Notice the NULL value for gravel_packed, while the "106" is for depth_drilled, two columns later.

I'm stymied and wonder if there's a tool I can use to fix these 80 rows so the copy command will accept them.

Rich
On 02/22/2011 07:10 PM, Rich Shepard wrote:
> nevada=# \copy water_well from 'one.csv' with delimiter '|' null '' CSV;
> ERROR: value too long for type character(1)
> CONTEXT: COPY water_well, line 1, column gravel_packed: "106"

Can we see a few lines of one.csv? If we ignore the CONTEXT line and just look at the error "too long for char(1)", it looks like there are only two columns to think about: date_cmplt_acc and gravel_packed. Are you sure there are no extra spaces, tabs, or other weirdness in the file for those two columns?

You might also consider dumping out insert statements. Might be a little slower, but simpler to debug.

Did Access dump out comma-separated with quoted fields? I've done this on several occasions and never had to do any cleaning. But... I also use perl to split the csv and fire off inserts/copys/updates/etc.

-Andy
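A minimal sketch of the kind of splitting script Andy describes, in Python rather than Perl; the filenames are placeholders, and the 73-field count comes from the table definition posted later in the thread. Because the csv module honors the double quotes, an embedded comma such as the one in "KAIPER, R L" survives the change of delimiter.

#!/usr/bin/env python3
# Sketch: convert Access's comma-and-quote export to pipe-delimited output
# without breaking fields that contain embedded commas or quotes.
# The filenames are made up for illustration; Python stands in for the Perl
# script Andy mentions.
import csv

EXPECTED_COLS = 73  # water_well has 73 columns

with open('water_well_raw.csv', newline='') as src, \
     open('water_well_pipe.csv', 'w', newline='') as dst:
    reader = csv.reader(src)  # respects the quoting, so "KAIPER, R L" stays one field
    writer = csv.writer(dst, delimiter='|', quoting=csv.QUOTE_MINIMAL)
    for lineno, row in enumerate(reader, start=1):
        if len(row) != EXPECTED_COLS:
            print('line %d: %d fields instead of %d' % (lineno, len(row), EXPECTED_COLS))
        writer.writerow(row)

Keeping the pipe as the delimiter matches the \copy ... with delimiter '|' command already in use; the alternative, raised later in the thread, is to skip the conversion entirely and let COPY ... CSV read the comma-and-quote file directly.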
On 02/22/2011 07:25 PM, Andy Colson wrote:
> You might also consider dumping out insert statements. Might be a little
> slower, but simpler to debug.

Hum... and another also: you could change your fields to all be 'text', then do the import. Then clean up the data with a few:

select * from water_well where length(date_cmplt_acc) > 1

And I see that you have 73 columns, not just the few you posted. Well, I suppose... one of these days I'll start paying more attention :-)

-Andy
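To act on that suggestion without hand-writing 73 length checks, here is a sketch that assumes an all-text copy of the table (water_well_staging, a made-up name) has already been loaded, and that the real water_well definition is also in place so the declared widths can be read from information_schema; the connection string is a placeholder too.

#!/usr/bin/env python3
# Sketch: after loading the file into an all-text staging table, report the
# values that are wider than the columns declared on the real water_well
# table. The DSN and the staging table name are assumptions.
import psycopg2

conn = psycopg2.connect('dbname=nevada')
cur = conn.cursor()

# Declared widths of the character(n)/varchar(n) columns on the real table.
cur.execute("""
    SELECT column_name, character_maximum_length
      FROM information_schema.columns
     WHERE table_name = 'water_well'
       AND character_maximum_length IS NOT NULL
     ORDER BY ordinal_position
""")

for col, maxlen in cur.fetchall():
    # Column names come straight from the catalog, so interpolating them
    # into the query text is safe here.
    cur.execute(
        "SELECT sequence_no, %s FROM water_well_staging WHERE length(%s) > %%s"
        % (col, col),
        (maxlen,))
    for seq, value in cur.fetchall():
        print('row %s: %s = %r exceeds %d character(s)' % (seq, col, value, maxlen))

cur.close()
conn.close()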
On Tuesday, February 22, 2011 5:10:34 pm Rich Shepard wrote:
> Notice the NULL value for gravel_packed, while the "106" is for
> depth_drilled, two columns later.

We are going to need to see at least a sample of the actual data in one.csv that is causing the problem. You have an off-by-two error, as you suggest, but it could actually have happened earlier in the row. For instance, the well_finish_date would insert into lot_no because lot_no is TEXT and the date value at this point is just text. Same with date_cmplt_acc and block_no.

--
Adrian Klaver
adrian.klaver@gmail.com
On Tue, 22 Feb 2011, Adrian Klaver wrote:
> We are going to need to see at least a sample of the actual data in
> one.csv that is causing the problem. You have an off-by-two error, as you
> suggest, but it could actually have happened earlier in the row.

Adrian/Andy,

The data came out of Access as comma-and-quote csv. I massaged it in emacs and sed to change the delimiter to a pipe rather than a comma and removed the double quotes. I cannot assume that each of the 80 problem rows suffers from the defect in the same place, so if there's a generic process I can apply it row-by-row. After all, 80 problem rows out of 110,752 is not bad.

Here are the schema and the first row, in one.csv. Because I don't know if attachments are stripped off before the message is allowed to be distributed to the list subscribers, I'll just include both here.

DDL: The column names are original (except for 'ref', which seems to be a reserved word), but I modified the data types.

CREATE TABLE water_well (
  sequence_no TEXT PRIMARY KEY,
  well_log TEXT,
  app VARCHAR(20),
  notice_of_intent VARCHAR(6),
  waiver_no VARCHAR(30),
  date_log_rcvd DATE,
  date_log_rcvd_acc CHAR(1),
  site_type CHAR(1),
  work_type CHAR(1),
  work_type_rmks TEXT,
  proposed_use CHAR(1),
  drilling_method CHAR(1),
  sc TEXT,
  ha TEXT,
  twn VARCHAR(3),
  legal_twn VARCHAR(3),
  rng VARCHAR(3),
  legal_rng VARCHAR(3),
  sec TEXT,
  sec_quarters TEXT,
  legal_quarters TEXT,
  quarters_seq TEXT,
  ref TEXT,
  latitude NUMERIC(9,6),
  longitude NUMERIC(9,6),
  lat_long_src VARCHAR(5),
  lat_long_acc CHAR(1),
  owner_current TEXT,
  owner_address TEXT,
  owner_no TEXT,
  parcel_no TEXT,
  subdivision_name TEXT,
  lot_no TEXT,
  block_no TEXT,
  well_finish_date DATE,
  date_cmplt_acc CHAR(1),
  gravel_packed CHAR(1),
  depth_seal INTEGER,
  depth_drilled INTEGER,
  depth_bedrock INTEGER,
  aquifer_desc TEXT,
  depth_cased INTEGER,
  csng_diameter FLOAT,
  csng_reductions INTEGER,
  top_perf INTEGER,
  bottom_perf INTEGER,
  perf_intervals INTEGER,
  static_wl FLOAT,
  temperature FLOAT,
  yield FLOAT,
  drawdown FLOAT,
  hours_pumped FLOAT,
  test_method CHAR(1),
  qual_const_data CHAR(1),
  qual_lith_data CHAR(1),
  remarks TEXT,
  remarks_additional TEXT,
  contractor_lic_no VARCHAR(8),
  contractor_name TEXT,
  contractor_address TEXT,
  contractor_drlr_no VARCHAR(6),
  driller_lic_no VARCHAR(6),
  source_agency TEXT,
  user_id TEXT,
  date_entry DATE,
  update_user_id VARCHAR(16),
  date_update DATE,
  edit_status VARCHAR(16),
  well_start_date DATE,
  gravel_pack_top INTEGER,
  gravel_pack_bot INTEGER,
  utm_x NUMERIC(13,6),
  utm_y NUMERIC(13,6)
);

Here's one.csv:

68670|724||0||11/27/1948|D|N|N||H|C|32031|087|N18|18N|E20|20E|07||||MD|39.44|119.77|NV003|M|KAIPER, R L|||SIERRA MANOR||11/15/1948|D|||106|||106|6.62|0|60|102|1|12.00||30.00|||B|G|G|AIR COMPRESSOR TESTED 30 GPM ALSO||3|MEL MEYER|RT 1 BOX 10 RENO|||3|NV003|JSWINGHOLM|1/16/2003|||F|11/11/1948|||261013.36|4369139.23

I hope you're able to see what I keep missing as the source of the problem.

Rich
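The column-by-column comparison Rich did by hand can also be generated; a minimal sketch, assuming the CREATE TABLE statement above was saved as water_well.sql (a made-up name) and the single row as one.csv, so each column name is printed next to the field that would land in it.

#!/usr/bin/env python3
# Sketch: line up each column name from the CREATE TABLE statement with the
# corresponding field of the pipe-delimited row, so a value that has drifted
# into the wrong column is visible at a glance.
cols = []
with open('water_well.sql') as f:
    for line in f:
        line = line.strip()
        if not line or line.upper().startswith('CREATE') or line.startswith(')'):
            continue
        cols.append(line.split()[0])          # first token is the column name

with open('one.csv') as f:
    fields = f.readline().rstrip('\n').split('|')

print('%d columns, %d fields' % (len(cols), len(fields)))
for i, (name, value) in enumerate(zip(cols, fields), start=1):
    print('%2d  %-20s %s' % (i, name, value))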
Rich,

The data and table structure provided do not seem to correlate. Regardless, if you changed the delimiter to "|" from "," it is possible that you converted an embedded "," in one of the textual fields into a "|" when you should not have. For instance:

Value1,value2,"value, with comma",value4

thus became

Value1|value2|value|with comma|value4

giving you additional fields that should not be present.

I suggest opening up a testing file (one with the 80 malformed records and 10 to 20 good/control records) in an Excel or equivalent spreadsheet and import/text-to-columns using the "|" delimiter. You will be able to quickly see rows with extra columns and specifically where those extras are originating. Then you can decide how to fix the problem. I would suggest manually changing each incorrect "|" into a "," as a form of self-punishment for improper data conversion - but whatever works for you.

In the odd chance you do not see extra columns in the spreadsheet, you can also scan down the columns and see if you recognize any pattern of differences between the 80 failing records and the 10 successful records that might point you further in the correct direction.

If THAT fails, you might want to see if anyone will receive the testing file and "CREATE TABLE" and try to independently diagnose the cause of the problem. That, or do the Access export and skip your conversion routine and directly import the CSV into Postgres.

David J.

-----Original Message-----
From: Rich Shepard
Sent: Tuesday, February 22, 2011 9:59 PM
Subject: Re: [GENERAL] Finding Errors in .csv Input Data

The data came out of Access as comma-and-quote csv. I massaged it in emacs and sed to change the delimiter to a pipe rather than a comma and removed the double quotes.
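The spreadsheet triage David describes can also be scripted; a minimal sketch, assuming the pipe-delimited file is named water_well_pipe.csv (a placeholder), that no legitimate field contains a "|", and that a good row has exactly 73 fields.

#!/usr/bin/env python3
# Sketch: flag rows whose field count differs from the 73 columns of
# the water_well table.
EXPECTED_COLS = 73

with open('water_well_pipe.csv') as f:
    for lineno, line in enumerate(f, start=1):
        nfields = len(line.rstrip('\n').split('|'))
        if nfields != EXPECTED_COLS:
            print('line %d has %d fields' % (lineno, nfields))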
On Tue, 22 Feb 2011, David Johnston wrote:
> The data and table structure provided do not seem to correlate.

David,

That's the problem. However, they should match since they came from the same .mdb file.

> Regardless, if you changed the delimiter to "|" from "," it is possible
> that you converted an embedded "," in one of the textual fields into a "|"
> when you should not have.

This is why I worked out a processing flow that avoided this problem in _almost_ every case. The 80 remaining rows do not appear to have this problem. At least, it's not immediately obvious that they do, which is why I want to learn how to identify where (and why) the data get out of sync with the schema.

> I suggest opening up a testing file (one with the 80 malformed records and
> 10 to 20 good/control records) in an Excel or equivalent spreadsheet and
> import/text-to-columns using the "|" delimiter.

Each of the 80 rows has extra columns, at least one. Here are the same two rows from the original .csv export and the transformation:

68670,724,"",0,,11/27/1948,"D","N","N","","H","C","32031","087","N18","18N","E20","20E","07","","","","MD",39.44,119.77,"NV003","M","KAIPER, R L","","","","SIERRA MANOR","","",11/15/1948,"D","",,106,,"",106,6.62,0,60,102,1,12.00,,30.00,,,"B","G","G","AIR COMPRESSOR TESTED 30 GPM ALSO","","3","MEL MEYER","RT 1 BOX 10 RENO,",,3,"NV003","JSWINGHOLM",1/16/2003,"",,"F",11/11/1948,,,261013.36,4369139.23

68670|724||0||11/27/1948|D|N|N||H|C|32031|087|N18|18N|E20|20E|07||||MD|39.44|119.77|NV003|M|KAIPER, R L|||SIERRA MANOR||11/15/1948|D|||106|||106|6.62|0|60|102|1|12.00||30.00|||B|G|G|AIR COMPRESSOR TESTED 30 GPM ALSO||3|MEL MEYER|RT 1 BOX 10 RENO|||3|NV003|JSWINGHOLM|1/16/2003|||F|11/11/1948|||261013.36|4369139.23

When I compared the two I saw no difference. Why the latter does not want to fit the schema I don't know.

I can send the schema again and the whole 80-row file to anyone willing to help me find the problems.

Rich
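Rather than comparing the two rows by eye, the same check can be done mechanically: parse the comma-and-quote row with a real CSV parser, split the pipe row on "|", and walk both lists until they diverge. A minimal sketch; the filenames are placeholders, each file assumed to hold just the one row.

#!/usr/bin/env python3
# Sketch: compare the Access comma-and-quote row with the pipe-delimited
# version field by field and report where they first disagree.
import csv

with open('orig_row.csv', newline='') as f:
    orig = next(csv.reader(f))                 # respects the double quotes
with open('pipe_row.csv') as f:
    pipe = f.read().rstrip('\n').split('|')

print('original: %d fields, transformed: %d fields' % (len(orig), len(pipe)))

for i, (a, b) in enumerate(zip(orig, pipe), start=1):
    if a != b:
        print('first mismatch at field %d: %r vs %r' % (i, a, b))
        break

Run against the pair of rows above, the first mismatch should surface at the blank fields dropped just before "SIERRA MANOR".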
Why does your transformed data repeat the first 5 columns of the original CSV? Why do values within a given field end up on different lines (e.g., SIERRA MANOR, 39.44)? Basically, your transform output as presented makes no sense to me; I don't even see how you could import any records into Postgres if they have that structure. Why can you not just import the CSV as generated by Access?

Ignoring the repeating "header" and the line-wrapping, the following differences appear:

Missing a null/blank field between "KAIPER, R L" and "SIERRA MANOR"
Missing a null/blank field between "SIERRA MANOR" and "11/15/1948"

Thus everything after "SIERRA MANOR" is off by two in column position (SIERRA MANOR itself is off by one), which seems to match what you are seeing during the import. You can add a "|" in those two locations and the record should then import. Repeat for all other records.

David J.

-----Original Message-----
From: Rich Shepard
Sent: Wednesday, February 23, 2011 9:11 AM
Subject: Re: [GENERAL] Finding Errors in .csv Input Data

Each of the 80 rows has extra columns, at least one. Here are the same two rows from the original .csv export and the transformation:

68670,724,"",0,,11/27/1948,"D","N","N","","H","C","32031","087","N18","18N",
"E20","20E","07","","","","MD",39.44,119.77,"NV003","M","KAIPER, R
L","","","","SIERRA
MANOR","","",11/15/1948,"D","",,106,,"",106,6.62,0,60,102,1,12.00,,30.00,,,"
B","G","G","AIR COMPRESSOR TESTED 30 GPM ALSO","","3","MEL MEYER","RT 1 BOX
10RENO,",,3,"NV003","JSWINGHOLM",1/16/2003,"",,"F",11/11/1948,,,261013.36,43
69139.23

68670|724||0||11/27/1948|D|N|N||H|C|32031|087|N18|18N|E20|20E|07||||MD|3
68670|724||0||9.44|119.77|NV003|M|KAIPER, R L|||SIERRA
68670|724||0||MANOR||11/15/1948|D|||106|||106|6.62|0|60|102|1|12.00||30.
68670|724||0||00|||B|G|G|AIR COMPRESSOR TESTED 30 GPM ALSO||3|MEL
68670|724||0||MEYER|RT 1 BOX 10
68670|724||0||RENO|||3|NV003|JSWINGHOLM|1/16/2003|||F|11/11/1948|||26101
68670|724||0||3.36|4369139.23

When I compared the two I saw no difference. Why the latter does not want to fit the schema I don't know.
On Wednesday, February 23, 2011 6:11:16 am Rich Shepard wrote:
> When I compared the two I saw no difference. Why the latter does not want
> to fit the schema I don't know.
>
> I can send the schema again and the whole 80-row file to anyone willing
> to help me find the problems.

As a quick test, try quoting KAIPER, R L --> "KAIPER, R L". If that does not work you can send me the 80-line csv file and I will see if anything stands out.

--
Adrian Klaver
adrian.klaver@gmail.com
On Wed, 23 Feb 2011, David Johnston wrote:
> Why can you not just import the CSV as generated by Access?

I don't want additional quotation marks on all text.

Loading the 80 rows in OO.o Calc does let me see the problems and I'm fixing them one-by-one. I didn't think of doing this.

Thanks,

Rich
On 02/23/2011 09:23 AM, Rich Shepard wrote:
> On Wed, 23 Feb 2011, David Johnston wrote:
>> Why can you not just import the CSV as generated by Access?
>
> I don't want additional quotation marks on all text.

I haven't followed this thread too closely, so maybe I'm missing something, but isn't this essentially what you need to do?

# cat /tmp/test.csv
1,"testing 123,456","hello world",42

create table testcsv(id int, f1 text, f2 text, f3 int);
copy testcsv from '/tmp/test.csv' with csv delimiter ',' quote '"';

select * from testcsv;
 id |       f1        |     f2      | f3
----+-----------------+-------------+----
  1 | testing 123,456 | hello world | 42
(1 row)

HTH,

Joe

--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
Rich Shepard <rshepard@appl-ecosys.com> writes:
> After a lot of cleaning with emacs and sed, the copy command accepted all
> but 80 rows of data. Now I need to figure out why postgres reports them as
> having too many columns.

Did you try pgloader yet?

  http://pgloader.projects.postgresql.org/
  http://pgfoundry.org/projects/pgloader/
  https://github.com/dimitri/pgloader
  http://packages.debian.org/sid/pgloader

Regards,

--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Sun, 6 Mar 2011, Dimitri Fontaine wrote:
> Did you try pgloader yet?

Nope. I did determine the reasons and fixed them so all rows read into the table.

Thanks for the suggestion,

Rich