Thread: copy command -- foiled by pg_atoi

copy command -- foiled by pg_atoi

From
Mike Beller
Date:
Hi

I'm using postgresql on RedHat/i386 from the RH6.1 RPMs
(Linux 2.2.5-15 PostgreSQL-6.5.2).  I've been evaluating it,
and am quite excited about the possibilty of using this outstanding
open-source code in my company's systems.

I noticed the following and want to submit it for your consideration:

When doing a 'copy' from a big text file, if there is a record, say
half way down, which has an unparseable integer field (say 'X'
in a column that shoudl be an integer) one gets the following error:

ERROR:  pg_atoi: error in "X": can't parse "X"

However, I don't get a line number reference where the error
occurred.  This is a problem when you have half a million lines
in your file, some of which  may contain legitimate X's as
well as non-legit ones.  I don't even know which attribute failed.

Simple example:  (imagine a million rows with 12 columns and
you're trying to find the problem:)

-------
create table foo (x int);
copy foo from stdin;
1
2
3
X
4
\.
-----------

I noticed that in pg_atoi there is this code:

     if (errno)                         /* strtol must set ERANGE */
            elog(ERROR, "pg_atoi: error reading \"%s\": %m", s);
     if (badp && *badp && (*badp != c))
            elog(ERROR, "pg_atoi: error in \"%s\": can\'t parse
\"%s\"",s, badp);


This is called (via int4in or some such) from copy.c:

    values[i] = (Datum) (*fmgr_faddr(&in_functions[i]))(string,
elements[i],typmod[i]);
           /*
          * Sanity check - by reference attributes cannot
           * return NULL
          */
          if (!PointerIsValid(values[i]) &&
!(rel->rd_att->attrs[i]->attbyval))
                  elog(ERROR, "copy from line %d: Bad file format",
lineno);

So the lineno information is there, it's just not available when pg_atoi
logs its error and bails out.

Is it possible for pg_atoi to return an invalid pointer instead
of completely bailing out, thus allowing the 'sanity check' code
to print the line number?  I have no idea how many things this
might break...

Regards


Mike Beller
CTO
Tradeworx.com

Re: [GENERAL] copy command -- foiled by pg_atoi

From
Jose Soares
Date:
This is also my problem. I'm getting '@!?àù§èé+*_|!&/%§¸' to load a
table with more than 23,000 rows
because I don't know in which line I have to look for the the error.
It's obvious that psql reads the input file line by line, and of course
psql knows in wich line the error
occurred. This information would be very important to correct any syntax
error in the file.

Comments?

Jose'


Mike Beller wrote:
>
> Hi
>
> I'm using postgresql on RedHat/i386 from the RH6.1 RPMs
> (Linux 2.2.5-15 PostgreSQL-6.5.2).  I've been evaluating it,
> and am quite excited about the possibilty of using this outstanding
> open-source code in my company's systems.
>
> I noticed the following and want to submit it for your consideration:
>
> When doing a 'copy' from a big text file, if there is a record, say
> half way down, which has an unparseable integer field (say 'X'
> in a column that shoudl be an integer) one gets the following error:
>
> ERROR:  pg_atoi: error in "X": can't parse "X"
>
> However, I don't get a line number reference where the error
> occurred.  This is a problem when you have half a million lines
> in your file, some of which  may contain legitimate X's as
> well as non-legit ones.  I don't even know which attribute failed.
>
> Simple example:  (imagine a million rows with 12 columns and
> you're trying to find the problem:)
>
> -------
> create table foo (x int);
> copy foo from stdin;
> 1
> 2
> 3
> X
> 4
> \.
> -----------
>
> I noticed that in pg_atoi there is this code:
>
>      if (errno)                         /* strtol must set ERANGE */
>             elog(ERROR, "pg_atoi: error reading \"%s\": %m", s);
>      if (badp && *badp && (*badp != c))
>             elog(ERROR, "pg_atoi: error in \"%s\": can\'t parse
> \"%s\"",s, badp);
>
> This is called (via int4in or some such) from copy.c:
>
>     values[i] = (Datum) (*fmgr_faddr(&in_functions[i]))(string,
> elements[i],typmod[i]);
>            /*
>           * Sanity check - by reference attributes cannot
>            * return NULL
>           */
>           if (!PointerIsValid(values[i]) &&
> !(rel->rd_att->attrs[i]->attbyval))
>                   elog(ERROR, "copy from line %d: Bad file format",
> lineno);
>
> So the lineno information is there, it's just not available when pg_atoi
> logs its error and bails out.
>
> Is it possible for pg_atoi to return an invalid pointer instead
> of completely bailing out, thus allowing the 'sanity check' code
> to print the line number?  I have no idea how many things this
> might break...
>
> Regards
>
> Mike Beller
> CTO
> Tradeworx.com
>
> ************

Re: [GENERAL] copy command -- foiled by pg_atoi

From
Thomas Reinke
Date:
I've run into this a few times as well. My strategy to "hunt" down
the offending line has been to do a "bisection" algorithm.

Essentially, cut your file in half. If the first half loads ok,
you know the problem is in the second half, and vice versa.
Now cut the offending half in half again. Do the same.
With 23,000 rows, you will do this sequence no more than 15
times, and you will have narrowed down the offender (assuming
23,000 rows). With 2 million rows (something we've had to contend
with in the past), you'll nail it down in 21 iterations.

Not pretty to say the least, but workable.

Cheers, Thomas

Jose Soares wrote:
>
> This is also my problem. I'm getting '@!?àù§èé+*_|!&/%§¸' to load a
> table with more than 23,000 rows
> because I don't know in which line I have to look for the the error.
> It's obvious that psql reads the input file line by line, and of course
> psql knows in wich line the error
> occurred. This information would be very important to correct any syntax
> error in the file.
>
> Comments?
>
> Jose'
>
> Mike Beller wrote:
> >
> > Hi
> >
> > I'm using postgresql on RedHat/i386 from the RH6.1 RPMs
> > (Linux 2.2.5-15 PostgreSQL-6.5.2).  I've been evaluating it,
> > and am quite excited about the possibilty of using this outstanding
> > open-source code in my company's systems.
> >
> > I noticed the following and want to submit it for your consideration:
> >
> > When doing a 'copy' from a big text file, if there is a record, say
> > half way down, which has an unparseable integer field (say 'X'
> > in a column that shoudl be an integer) one gets the following error:
> >
> > ERROR:  pg_atoi: error in "X": can't parse "X"
> >
> > However, I don't get a line number reference where the error
> > occurred.  This is a problem when you have half a million lines
> > in your file, some of which  may contain legitimate X's as
> > well as non-legit ones.  I don't even know which attribute failed.
> >
> > Simple example:  (imagine a million rows with 12 columns and
> > you're trying to find the problem:)
> >
> > -------
> > create table foo (x int);
> > copy foo from stdin;
> > 1
> > 2
> > 3
> > X
> > 4
> > \.
> > -----------
> >
> > I noticed that in pg_atoi there is this code:
> >
> >      if (errno)                         /* strtol must set ERANGE */
> >             elog(ERROR, "pg_atoi: error reading \"%s\": %m", s);
> >      if (badp && *badp && (*badp != c))
> >             elog(ERROR, "pg_atoi: error in \"%s\": can\'t parse
> > \"%s\"",s, badp);
> >
> > This is called (via int4in or some such) from copy.c:
> >
> >     values[i] = (Datum) (*fmgr_faddr(&in_functions[i]))(string,
> > elements[i],typmod[i]);
> >            /*
> >           * Sanity check - by reference attributes cannot
> >            * return NULL
> >           */
> >           if (!PointerIsValid(values[i]) &&
> > !(rel->rd_att->attrs[i]->attbyval))
> >                   elog(ERROR, "copy from line %d: Bad file format",
> > lineno);
> >
> > So the lineno information is there, it's just not available when pg_atoi
> > logs its error and bails out.
> >
> > Is it possible for pg_atoi to return an invalid pointer instead
> > of completely bailing out, thus allowing the 'sanity check' code
> > to print the line number?  I have no idea how many things this
> > might break...
> >
> > Regards
> >
> > Mike Beller
> > CTO
> > Tradeworx.com
> >
> > ************
>
> ************

--
------------------------------------------------------------
Thomas Reinke                            Tel: (905) 331-2260
Director of Technology                   Fax: (905) 331-2504
E-Soft Inc.                         http://www.e-softinc.com

Re: [GENERAL] copy command -- foiled by pg_atoi

From
Ed Loehr
Date:
I have found that judicious placement of a few queries (selects, intentional
errors, etc.) within a long sequence of inserts will help segment them for
identification of offending lines.  Hokie, but it helps me.

Cheers,
Ed Loehr

Thomas Reinke wrote:

> I've run into this a few times as well. My strategy to "hunt" down
> the offending line has been to do a "bisection" algorithm.
>
> Essentially, cut your file in half. If the first half loads ok,
> you know the problem is in the second half, and vice versa.
> Now cut the offending half in half again. Do the same.
> With 23,000 rows, you will do this sequence no more than 15
> times, and you will have narrowed down the offender (assuming
> 23,000 rows). With 2 million rows (something we've had to contend
> with in the past), you'll nail it down in 21 iterations.
>
> Not pretty to say the least, but workable.
>
> Cheers, Thomas
>
> Jose Soares wrote:
> >
> > This is also my problem. I'm getting '@!?àù§èé+*_|!&/%§¸' to load a
> > table with more than 23,000 rows
> > because I don't know in which line I have to look for the the error.
> > It's obvious that psql reads the input file line by line, and of course
> > psql knows in wich line the error
> > occurred. This information would be very important to correct any syntax
> > error in the file.
> >
> > Comments?
> >
> > Jose'
> >
> > Mike Beller wrote:
> > >
> > > Hi
> > >
> > > I'm using postgresql on RedHat/i386 from the RH6.1 RPMs
> > > (Linux 2.2.5-15 PostgreSQL-6.5.2).  I've been evaluating it,
> > > and am quite excited about the possibilty of using this outstanding
> > > open-source code in my company's systems.
> > >
> > > I noticed the following and want to submit it for your consideration:
> > >
> > > When doing a 'copy' from a big text file, if there is a record, say
> > > half way down, which has an unparseable integer field (say 'X'
> > > in a column that shoudl be an integer) one gets the following error:
> > >
> > > ERROR:  pg_atoi: error in "X": can't parse "X"
> > >
> > > However, I don't get a line number reference where the error
> > > occurred.  This is a problem when you have half a million lines
> > > in your file, some of which  may contain legitimate X's as
> > > well as non-legit ones.  I don't even know which attribute failed.
> > >
> > > Simple example:  (imagine a million rows with 12 columns and
> > > you're trying to find the problem:)
> > >
> > > -------
> > > create table foo (x int);
> > > copy foo from stdin;
> > > 1
> > > 2
> > > 3
> > > X
> > > 4
> > > \.
> > > -----------
> > >
> > > I noticed that in pg_atoi there is this code:
> > >
> > >      if (errno)                         /* strtol must set ERANGE */
> > >             elog(ERROR, "pg_atoi: error reading \"%s\": %m", s);
> > >      if (badp && *badp && (*badp != c))
> > >             elog(ERROR, "pg_atoi: error in \"%s\": can\'t parse
> > > \"%s\"",s, badp);
> > >
> > > This is called (via int4in or some such) from copy.c:
> > >
> > >     values[i] = (Datum) (*fmgr_faddr(&in_functions[i]))(string,
> > > elements[i],typmod[i]);
> > >            /*
> > >           * Sanity check - by reference attributes cannot
> > >            * return NULL
> > >           */
> > >           if (!PointerIsValid(values[i]) &&
> > > !(rel->rd_att->attrs[i]->attbyval))
> > >                   elog(ERROR, "copy from line %d: Bad file format",
> > > lineno);
> > >
> > > So the lineno information is there, it's just not available when pg_atoi
> > > logs its error and bails out.
> > >
> > > Is it possible for pg_atoi to return an invalid pointer instead
> > > of completely bailing out, thus allowing the 'sanity check' code
> > > to print the line number?  I have no idea how many things this
> > > might break...
> > >
> > > Regards
> > >
> > > Mike Beller
> > > CTO
> > > Tradeworx.com
> > >
> > > ************
> >
> > ************
>
> --
> ------------------------------------------------------------
> Thomas Reinke                            Tel: (905) 331-2260
> Director of Technology                   Fax: (905) 331-2504
> E-Soft Inc.                         http://www.e-softinc.com
>
> ************


Re: [GENERAL] copy command -- foiled by pg_atoi

From
Mike Beller
Date:
Folks--

Thanks for the ideas.  But bisection just seemed too cumbersome.
In the end I decided to write a data filter in perl which checks
all the data for valid types before putting it into the DB!

Mike


Ed Loehr wrote:
>
> I have found that judicious placement of a few queries (selects, intentional
> errors, etc.) within a long sequence of inserts will help segment them for
> identification of offending lines.  Hokie, but it helps me.
...
> Thomas Reinke wrote:
>
> > I've run into this a few times as well. My strategy to "hunt" down
> > the offending line has been to do a "bisection" algorithm.

> >
> > Jose Soares wrote:
> > >
> > > This is also my problem. I'm getting '@!?àù§èé+*_|!&/%§¸' to load a
> > > table with more than 23,000 rows
> > > because I don't know in which line I have to look for the the error.