Thread: copy losing information

copy losing information

From
"Silvela, Jaime \(Exchange\)"
Date:

This is the first time I post to the list. I’ve done a brief search and didn’t find my issue treated already, so here it goes. Apologies if this has been reported before.

 

I have a pretty big file, around 2 million rows, in tab-separated format, with 4 columns, that I read into a table in Postgres using the copy command.

I’ve started to notice missing info sometimes. I’ll truncate the table, read from the file, and notice that sometimes there are less rows in the table than in the file.

This is not well reproducible. If I truncate again, and reread, I may get all the lines, or I may get a different amount of missing lines.

 

I concluded that there was a bug in the copy command, and wrote a replacement in Ruby, using the pure-ruby Postgres-pr library.

I run into the same issue. Some lines seem to be dropped, but no exceptions nor SQL errors are reported by the program.

In order to improve throughput, in my ruby program I connect to the server just once, and send the INSERT statements to the server in batches of 2000.

 

I’ve checked that the file doesn’t contain any SQL escape sequences or anything else that would invalidate an INSERT.

 

The version running in the server is 8.1.3 on Linux 2.6.5 on an Intel platform.

 

The imports are being run from windows machines in the same network.

 

Has somebody seen this before?

 

Thanks

Jaime

Re: copy losing information

From
Tom Lane
Date:
"Silvela, Jaime \(Exchange\)" <JSilvela@Bear.com> writes:
> I've started to notice missing info sometimes. I'll truncate the table,
> read from the file, and notice that sometimes there are less rows in the
> table than in the file.

Have you made any attempt to determine *which* rows are missing?
I'm wondering about sloppy quoting allowing lines to get joined,
or some such.

            regards, tom lane

Re: copy losing information

From
"Silvela, Jaime \(Exchange\)"
Date:
No lines contain quotes. And the same file will sometimes be fully
imported, and sometimes lose data. I'm thinking that under heavy loads,
the database is discarding INSERTS.

thanks

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, July 26, 2006 4:14 PM
To: Silvela, Jaime (Exchange)
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] copy losing information

"Silvela, Jaime \(Exchange\)" <JSilvela@Bear.com> writes:
> I've started to notice missing info sometimes. I'll truncate the
table,
> read from the file, and notice that sometimes there are less rows in
the
> table than in the file.

Have you made any attempt to determine *which* rows are missing?
I'm wondering about sloppy quoting allowing lines to get joined,
or some such.

            regards, tom lane



***********************************************************************
Bear Stearns is not responsible for any recommendation, solicitation,
offer or agreement or any information about any transaction, customer
account or account activity contained in this communication.

Bear Stearns does not provide tax, legal or accounting advice.  You
should consult your own tax, legal and accounting advisors before
engaging in any transaction. In order for Bear Stearns to comply with
Internal Revenue Service Circular 230 (if applicable), you are notified
that any discussion of U.S. federal tax issues contained or referred to
herein is not intended or written to be used, and cannot be used, for
the purpose of:  (A) avoiding penalties that may be imposed under the
Internal Revenue Code; nor (B) promoting, marketing or recommending to
another party any transaction or matter addressed herein.
***********************************************************************

Re: copy losing information

From
Alvaro Herrera
Date:
Silvela, Jaime (Exchange) wrote:
> No lines contain quotes. And the same file will sometimes be fully
> imported, and sometimes lose data. I'm thinking that under heavy loads,
> the database is discarding INSERTS.

I don't think that's very likely.

How are you checking that the data is there?  Do you check the whole
table by way of a big, fat, unconstrained SELECT, or do you extract some
rows one by one?  My guess would be that maybe some insertions are not
making it into an index that may be later used for getting the data
during the examination.

If there are no indexes involved (no primary key either) then this
theory can be discarded quickly.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: copy losing information

From
Reece Hart
Date:
On Wed, 2006-07-26 at 12:48 -0400, Silvela, Jaime (Exchange) wrote:
This is the first time I post to the list. I’ve done a brief search and didn’t find my issue treated already, so here it goes. Apologies if this has been reported before.

What PG version and environment?  How about sending the output of  'select version()'  ?

copy or \copy? What arguments?

If you can release the data and get it to me (e.g., compressed email attachment, http, ftp), I volunteer to try a run on 8.1.4 on suse 10.0/x86_64.

-Reece

-- 
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

Re: copy losing information

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Silvela, Jaime (Exchange) wrote:
>> No lines contain quotes. And the same file will sometimes be fully
>> imported, and sometimes lose data. I'm thinking that under heavy loads,
>> the database is discarding INSERTS.

> I don't think that's very likely.

Especially not since he says he's using COPY --- any sort of error would
be all-or-nothing.

Personally I'm wondering about individual rows getting dropped on the
client side.

            regards, tom lane

Re: copy losing information

From
"Silvela, Jaime \(Exchange\)"
Date:
Thanks guys,

The output of select version() is
"PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.0.3"

Tom, how and why would INSERTs be dropped on the client side?
I'll be away next week, but when I get back I'll conduct iterations to
find out how often this happens and if there is a pattern.

Reece, is there an ftp site where I can drop the file? It's way too
large for email.

The copy to the table from the file has been tried in two ways:

"C:\Program Files\PostgreSQL\8.1\bin\psql" -c "\copy mytable (series,
ticker, date, value) from C:\temp\myfile.out" -h <server> -U <user>
<database>

And also through a ruby script that uses the ruby-native library to
connect to Postgres, and sends INSERTs in batches of 2000.

Both ways work unreliably, sometimes getting every line, sometimes
dropping a few lines. I check that doing a "select count(*)", and
comparing against the number of lines in the file. The table gets
truncated before the import, and has no sort of indexing.


In case it can help, here is the definition.

CREATE TABLE mytable
(
  series varchar(15),
  ticker varchar(20),
  date date,
  value numeric,
  variable varchar(20),
  msa varchar(3),
  "year" int4,
  "month" int2,
  freq varchar(2),
  geog varchar(6)
)


Thanks
Jaime

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Wednesday, July 26, 2006 5:05 PM
To: Alvaro Herrera
Cc: Silvela, Jaime (Exchange); pgsql-general@postgresql.org
Subject: Re: [GENERAL] copy losing information

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Silvela, Jaime (Exchange) wrote:
>> No lines contain quotes. And the same file will sometimes be fully
>> imported, and sometimes lose data. I'm thinking that under heavy
loads,
>> the database is discarding INSERTS.

> I don't think that's very likely.

Especially not since he says he's using COPY --- any sort of error would
be all-or-nothing.

Personally I'm wondering about individual rows getting dropped on the
client side.

            regards, tom lane



***********************************************************************
Bear Stearns is not responsible for any recommendation, solicitation,
offer or agreement or any information about any transaction, customer
account or account activity contained in this communication.

Bear Stearns does not provide tax, legal or accounting advice.  You
should consult your own tax, legal and accounting advisors before
engaging in any transaction. In order for Bear Stearns to comply with
Internal Revenue Service Circular 230 (if applicable), you are notified
that any discussion of U.S. federal tax issues contained or referred to
herein is not intended or written to be used, and cannot be used, for
the purpose of:  (A) avoiding penalties that may be imposed under the
Internal Revenue Code; nor (B) promoting, marketing or recommending to
another party any transaction or matter addressed herein.
***********************************************************************

Re: copy losing information

From
Tom Lane
Date:
"Silvela, Jaime \(Exchange\)" <JSilvela@bear.com> writes:
> Tom, how and why would INSERTs be dropped on the client side?

[ shrug... ] I don't know your code; I was thinking about garden variety
bugs in your ruby script.  However, if you can make it happen just
through psql \copy then that theory seems to lose its luster :-(

Something else that might be worth looking at: have you got any
user-written BEFORE INSERT triggers on that table?  A trigger that
sometimes returned NULL would explain the symptoms.

            regards, tom lane

Re: copy losing information

From
Reece Hart
Date:
[My ISP has had a power failure and my outgoing mail appears to bouncing, but I'm unsure -- apologies if you receive multiple copies.]

On Wed, 2006-07-26 at 13:38 -0700, Reece Hart wrote:
If you can release the data and get it to me (e.g., compressed email attachment, http, ftp), I volunteer to try a run on 8.1.4 on suse 10.0/x86_64.

Jaime-

I cannot replicate your observations.

There were 1915733 lines in the file you sent. In all tests below, select(*) returned this number.

- linux/x86 client: within psql shell, create table, \copy, select count(*), truncate, \copy, select count(*). OK.
- linux/x86 client: psql -dcsb-dev -c '\copy haver_msamo_raw(series,ticker,date,value) from test.out', then select(*).  OK
- win XP client: command line copy as you had posted, akin to above. Twice. OK
- linux/x86 client: truncate-copy-select(*) 35 times. all OK

All tests were with 8.1.4 server and clients. server_encoding and client_encoding are SQL_ASCII according to pg_settings.  On windows, I got some error about code page mismatch at the console, but I doubt that's relevant.

Am at a loss to explain what you see. I don't know enough about character encoding to know whether that might be a culprit. Since you apparently have a linux box (from your version() string), have you tried loading from the server box?

Good luck,
Reece



Some details:
tallac$ gzip -t test.out.gz
test.out.gz:     OK
tallac$ gzip -cd test.out.gz | wc -l
test.out.gz:     88.0%
1915733
tallac$ md5sum test.out.gz
929582602507880045c1795970c974e4  test.out.gz

And in PostgreSQL:
rkh@csb-dev=> select version();
                                                     version
------------------------------------------------------------------------------------------------------------------PostgreSQL 8.1.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.0.2 20050901 (prerelease) (SUSE Linux)
(1 row)

rkh@csb-dev=> TRUNCATE haver_msamo_raw ;
TRUNCATE TABLE

rkh@csb-dev=> \copy haver_msamo_raw(series,ticker,date,value) from copy/test.out
\.
rkh@csb-dev=> select count(*) from haver_msamo_raw ;
  count
---------1915733
(1 row)

Linux command line:
$ psql -dcsb-dev -c '\copy haver_msamo_raw(series,ticker,date,value) from test.out'

Windows XP command line (psql from postgresql org 8.1.4 binaries):
psql.exe -hcsb -dcsb-dev -Upostgres  -c "\copy haver_msamo_raw (series,ticker,date,value) from C:\temp\test.out"

Then I did this 35 times (in a shell script):
truncate rkh.haver_msamo_raw;
select 'before',count(*) from rkh.haver_msamo_raw;
\copy rkh.haver_msamo_raw(series,ticker,date,value) from test.out
select 'after',count(*) from rkh.haver_msamo_raw;
-- 
Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0

Re: copy losing information

From
Reece Hart
Date:
On Wed, 2006-07-26 at 13:38 -0700, Reece Hart wrote:
If you can release the data and get it to me (e.g., compressed email attachment, http, ftp), I volunteer to try a run on 8.1.4 on suse 10.0/x86_64.

Jaime-

I cannot replicate your observations.

There were 1915733 lines in the file you sent. In all tests below, select(*) returned this number.

- linux/x86 client: within psql shell, create table, \copy, select count(*), truncate, \copy, select count(*). OK.
- linux/x86 client: psql -dcsb-dev -c '\copy haver_msamo_raw(series,ticker,date,value) from test.out', then select(*).  OK
- win XP client: command line copy as you had posted, akin to above. Twice. OK
- linux/x86 client: truncate-copy-select(*) 35 times. all OK

All tests were with 8.1.4 server and clients. server_encoding and client_encoding are SQL_ASCII according to pg_settings.  On windows, I got some error about code page mismatch at the console, but I doubt that's relevant.

Am at a loss to explain what you see. I don't know enough about character encoding to know whether that might be a culprit. Since you apparently have a linux box (from your version() string), have you tried loading from the server box?

Good luck,
Reece



Some details:
tallac$ gzip -t test.out.gz
test.out.gz:     OK
tallac$ gzip -cd test.out.gz | wc -l
test.out.gz:     88.0%
1915733
tallac$ md5sum test.out.gz
929582602507880045c1795970c974e4  test.out.gz

And in PostgreSQL:
rkh@csb-dev=> select version();
                                                     version
------------------------------------------------------------------------------------------------------------------PostgreSQL 8.1.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.0.2 20050901 (prerelease) (SUSE Linux)
(1 row)

rkh@csb-dev=> TRUNCATE haver_msamo_raw ;
TRUNCATE TABLE

rkh@csb-dev=> \copy haver_msamo_raw(series,ticker,date,value) from copy/test.out
\.
rkh@csb-dev=> select count(*) from haver_msamo_raw ;
  count
---------1915733
(1 row)

Linux command line:
$ psql -dcsb-dev -c '\copy haver_msamo_raw(series,ticker,date,value) from test.out'

Windows XP command line (psql from postgresql org 8.1.4 binaries):
psql.exe -hcsb -dcsb-dev -Upostgres  -c "\copy haver_msamo_raw (series,ticker,date,value) from C:\temp\test.out"

Then I did this 35 times (in a shell script):
truncate rkh.haver_msamo_raw;
select 'before',count(*) from rkh.haver_msamo_raw;
\copy rkh.haver_msamo_raw(series,ticker,date,value) from test.out
select 'after',count(*) from rkh.haver_msamo_raw;

-- 
Reece Hart, Ph.D.                      rkh@gene.com, http://www.gene.com/
Genentech, Inc.                        650-225-6133 (voice), -5389 (fax)
Bioinformatics and Protein Engineering
1 DNA Way, MS-93                       http://harts.net/reece/
South San Francisco, CA  94080-4990    reece@harts.net, GPG:0x25EC91A0

Re: copy losing information

From
Tom Lane
Date:
"Silvela, Jaime \(Exchange\)" <JSilvela@Bear.com> writes:
> I have a pretty big file, around 2 million rows, in tab-separated
> format, with 4 columns, that I read into a table in Postgres using the
> copy command.

> I've started to notice missing info sometimes. I'll truncate the table,
> read from the file, and notice that sometimes there are less rows in the
> table than in the file.

This looks a lot like Dan Kavan's problem,
http://archives.postgresql.org/pgsql-admin/2006-09/msg00092.php
which we have now pretty well proven was a kernel bug.

> The version running in the server is 8.1.3 on Linux 2.6.5 on an Intel
> platform.

2.6.5 what exactly?  And what's the "Intel platform"?  Although I don't
have any details about the kernel bug, I can't help suspecting that it
only manifested on systems pretty similar to Dan's dual Opteron.

            regards, tom lane