Thread: copy command -- foiled by pg_atoi
Hi I'm using postgresql on RedHat/i386 from the RH6.1 RPMs (Linux 2.2.5-15 PostgreSQL-6.5.2). I've been evaluating it, and am quite excited about the possibilty of using this outstanding open-source code in my company's systems. I noticed the following and want to submit it for your consideration: When doing a 'copy' from a big text file, if there is a record, say half way down, which has an unparseable integer field (say 'X' in a column that shoudl be an integer) one gets the following error: ERROR: pg_atoi: error in "X": can't parse "X" However, I don't get a line number reference where the error occurred. This is a problem when you have half a million lines in your file, some of which may contain legitimate X's as well as non-legit ones. I don't even know which attribute failed. Simple example: (imagine a million rows with 12 columns and you're trying to find the problem:) ------- create table foo (x int); copy foo from stdin; 1 2 3 X 4 \. ----------- I noticed that in pg_atoi there is this code: if (errno) /* strtol must set ERANGE */ elog(ERROR, "pg_atoi: error reading \"%s\": %m", s); if (badp && *badp && (*badp != c)) elog(ERROR, "pg_atoi: error in \"%s\": can\'t parse \"%s\"",s, badp); This is called (via int4in or some such) from copy.c: values[i] = (Datum) (*fmgr_faddr(&in_functions[i]))(string, elements[i],typmod[i]); /* * Sanity check - by reference attributes cannot * return NULL */ if (!PointerIsValid(values[i]) && !(rel->rd_att->attrs[i]->attbyval)) elog(ERROR, "copy from line %d: Bad file format", lineno); So the lineno information is there, it's just not available when pg_atoi logs its error and bails out. Is it possible for pg_atoi to return an invalid pointer instead of completely bailing out, thus allowing the 'sanity check' code to print the line number? I have no idea how many things this might break... Regards Mike Beller CTO Tradeworx.com
This is also my problem. I'm getting '@!?àù§èé+*_|!&/%§¸' to load a table with more than 23,000 rows because I don't know in which line I have to look for the the error. It's obvious that psql reads the input file line by line, and of course psql knows in wich line the error occurred. This information would be very important to correct any syntax error in the file. Comments? Jose' Mike Beller wrote: > > Hi > > I'm using postgresql on RedHat/i386 from the RH6.1 RPMs > (Linux 2.2.5-15 PostgreSQL-6.5.2). I've been evaluating it, > and am quite excited about the possibilty of using this outstanding > open-source code in my company's systems. > > I noticed the following and want to submit it for your consideration: > > When doing a 'copy' from a big text file, if there is a record, say > half way down, which has an unparseable integer field (say 'X' > in a column that shoudl be an integer) one gets the following error: > > ERROR: pg_atoi: error in "X": can't parse "X" > > However, I don't get a line number reference where the error > occurred. This is a problem when you have half a million lines > in your file, some of which may contain legitimate X's as > well as non-legit ones. I don't even know which attribute failed. > > Simple example: (imagine a million rows with 12 columns and > you're trying to find the problem:) > > ------- > create table foo (x int); > copy foo from stdin; > 1 > 2 > 3 > X > 4 > \. > ----------- > > I noticed that in pg_atoi there is this code: > > if (errno) /* strtol must set ERANGE */ > elog(ERROR, "pg_atoi: error reading \"%s\": %m", s); > if (badp && *badp && (*badp != c)) > elog(ERROR, "pg_atoi: error in \"%s\": can\'t parse > \"%s\"",s, badp); > > This is called (via int4in or some such) from copy.c: > > values[i] = (Datum) (*fmgr_faddr(&in_functions[i]))(string, > elements[i],typmod[i]); > /* > * Sanity check - by reference attributes cannot > * return NULL > */ > if (!PointerIsValid(values[i]) && > !(rel->rd_att->attrs[i]->attbyval)) > elog(ERROR, "copy from line %d: Bad file format", > lineno); > > So the lineno information is there, it's just not available when pg_atoi > logs its error and bails out. > > Is it possible for pg_atoi to return an invalid pointer instead > of completely bailing out, thus allowing the 'sanity check' code > to print the line number? I have no idea how many things this > might break... > > Regards > > Mike Beller > CTO > Tradeworx.com > > ************
I've run into this a few times as well. My strategy to "hunt" down the offending line has been to do a "bisection" algorithm. Essentially, cut your file in half. If the first half loads ok, you know the problem is in the second half, and vice versa. Now cut the offending half in half again. Do the same. With 23,000 rows, you will do this sequence no more than 15 times, and you will have narrowed down the offender (assuming 23,000 rows). With 2 million rows (something we've had to contend with in the past), you'll nail it down in 21 iterations. Not pretty to say the least, but workable. Cheers, Thomas Jose Soares wrote: > > This is also my problem. I'm getting '@!?àù§èé+*_|!&/%§¸' to load a > table with more than 23,000 rows > because I don't know in which line I have to look for the the error. > It's obvious that psql reads the input file line by line, and of course > psql knows in wich line the error > occurred. This information would be very important to correct any syntax > error in the file. > > Comments? > > Jose' > > Mike Beller wrote: > > > > Hi > > > > I'm using postgresql on RedHat/i386 from the RH6.1 RPMs > > (Linux 2.2.5-15 PostgreSQL-6.5.2). I've been evaluating it, > > and am quite excited about the possibilty of using this outstanding > > open-source code in my company's systems. > > > > I noticed the following and want to submit it for your consideration: > > > > When doing a 'copy' from a big text file, if there is a record, say > > half way down, which has an unparseable integer field (say 'X' > > in a column that shoudl be an integer) one gets the following error: > > > > ERROR: pg_atoi: error in "X": can't parse "X" > > > > However, I don't get a line number reference where the error > > occurred. This is a problem when you have half a million lines > > in your file, some of which may contain legitimate X's as > > well as non-legit ones. I don't even know which attribute failed. > > > > Simple example: (imagine a million rows with 12 columns and > > you're trying to find the problem:) > > > > ------- > > create table foo (x int); > > copy foo from stdin; > > 1 > > 2 > > 3 > > X > > 4 > > \. > > ----------- > > > > I noticed that in pg_atoi there is this code: > > > > if (errno) /* strtol must set ERANGE */ > > elog(ERROR, "pg_atoi: error reading \"%s\": %m", s); > > if (badp && *badp && (*badp != c)) > > elog(ERROR, "pg_atoi: error in \"%s\": can\'t parse > > \"%s\"",s, badp); > > > > This is called (via int4in or some such) from copy.c: > > > > values[i] = (Datum) (*fmgr_faddr(&in_functions[i]))(string, > > elements[i],typmod[i]); > > /* > > * Sanity check - by reference attributes cannot > > * return NULL > > */ > > if (!PointerIsValid(values[i]) && > > !(rel->rd_att->attrs[i]->attbyval)) > > elog(ERROR, "copy from line %d: Bad file format", > > lineno); > > > > So the lineno information is there, it's just not available when pg_atoi > > logs its error and bails out. > > > > Is it possible for pg_atoi to return an invalid pointer instead > > of completely bailing out, thus allowing the 'sanity check' code > > to print the line number? I have no idea how many things this > > might break... > > > > Regards > > > > Mike Beller > > CTO > > Tradeworx.com > > > > ************ > > ************ -- ------------------------------------------------------------ Thomas Reinke Tel: (905) 331-2260 Director of Technology Fax: (905) 331-2504 E-Soft Inc. http://www.e-softinc.com
I have found that judicious placement of a few queries (selects, intentional errors, etc.) within a long sequence of inserts will help segment them for identification of offending lines. Hokie, but it helps me. Cheers, Ed Loehr Thomas Reinke wrote: > I've run into this a few times as well. My strategy to "hunt" down > the offending line has been to do a "bisection" algorithm. > > Essentially, cut your file in half. If the first half loads ok, > you know the problem is in the second half, and vice versa. > Now cut the offending half in half again. Do the same. > With 23,000 rows, you will do this sequence no more than 15 > times, and you will have narrowed down the offender (assuming > 23,000 rows). With 2 million rows (something we've had to contend > with in the past), you'll nail it down in 21 iterations. > > Not pretty to say the least, but workable. > > Cheers, Thomas > > Jose Soares wrote: > > > > This is also my problem. I'm getting '@!?àù§èé+*_|!&/%§¸' to load a > > table with more than 23,000 rows > > because I don't know in which line I have to look for the the error. > > It's obvious that psql reads the input file line by line, and of course > > psql knows in wich line the error > > occurred. This information would be very important to correct any syntax > > error in the file. > > > > Comments? > > > > Jose' > > > > Mike Beller wrote: > > > > > > Hi > > > > > > I'm using postgresql on RedHat/i386 from the RH6.1 RPMs > > > (Linux 2.2.5-15 PostgreSQL-6.5.2). I've been evaluating it, > > > and am quite excited about the possibilty of using this outstanding > > > open-source code in my company's systems. > > > > > > I noticed the following and want to submit it for your consideration: > > > > > > When doing a 'copy' from a big text file, if there is a record, say > > > half way down, which has an unparseable integer field (say 'X' > > > in a column that shoudl be an integer) one gets the following error: > > > > > > ERROR: pg_atoi: error in "X": can't parse "X" > > > > > > However, I don't get a line number reference where the error > > > occurred. This is a problem when you have half a million lines > > > in your file, some of which may contain legitimate X's as > > > well as non-legit ones. I don't even know which attribute failed. > > > > > > Simple example: (imagine a million rows with 12 columns and > > > you're trying to find the problem:) > > > > > > ------- > > > create table foo (x int); > > > copy foo from stdin; > > > 1 > > > 2 > > > 3 > > > X > > > 4 > > > \. > > > ----------- > > > > > > I noticed that in pg_atoi there is this code: > > > > > > if (errno) /* strtol must set ERANGE */ > > > elog(ERROR, "pg_atoi: error reading \"%s\": %m", s); > > > if (badp && *badp && (*badp != c)) > > > elog(ERROR, "pg_atoi: error in \"%s\": can\'t parse > > > \"%s\"",s, badp); > > > > > > This is called (via int4in or some such) from copy.c: > > > > > > values[i] = (Datum) (*fmgr_faddr(&in_functions[i]))(string, > > > elements[i],typmod[i]); > > > /* > > > * Sanity check - by reference attributes cannot > > > * return NULL > > > */ > > > if (!PointerIsValid(values[i]) && > > > !(rel->rd_att->attrs[i]->attbyval)) > > > elog(ERROR, "copy from line %d: Bad file format", > > > lineno); > > > > > > So the lineno information is there, it's just not available when pg_atoi > > > logs its error and bails out. > > > > > > Is it possible for pg_atoi to return an invalid pointer instead > > > of completely bailing out, thus allowing the 'sanity check' code > > > to print the line number? I have no idea how many things this > > > might break... > > > > > > Regards > > > > > > Mike Beller > > > CTO > > > Tradeworx.com > > > > > > ************ > > > > ************ > > -- > ------------------------------------------------------------ > Thomas Reinke Tel: (905) 331-2260 > Director of Technology Fax: (905) 331-2504 > E-Soft Inc. http://www.e-softinc.com > > ************
Folks-- Thanks for the ideas. But bisection just seemed too cumbersome. In the end I decided to write a data filter in perl which checks all the data for valid types before putting it into the DB! Mike Ed Loehr wrote: > > I have found that judicious placement of a few queries (selects, intentional > errors, etc.) within a long sequence of inserts will help segment them for > identification of offending lines. Hokie, but it helps me. ... > Thomas Reinke wrote: > > > I've run into this a few times as well. My strategy to "hunt" down > > the offending line has been to do a "bisection" algorithm. > > > > Jose Soares wrote: > > > > > > This is also my problem. I'm getting '@!?àù§èé+*_|!&/%§¸' to load a > > > table with more than 23,000 rows > > > because I don't know in which line I have to look for the the error.