Thread: copy losing information
This is the first time I post to the list. I’ve done a brief search and didn’t find my issue treated already, so here it goes. Apologies if this has been reported before.
I have a pretty big file, around 2 million rows, in tab-separated format, with 4 columns, that I read into a table in Postgres using the copy command.
I’ve started to notice missing info sometimes. I’ll truncate the table, read from the file, and notice that sometimes there are less rows in the table than in the file.
This is not well reproducible. If I truncate again, and reread, I may get all the lines, or I may get a different amount of missing lines.
I concluded that there was a bug in the copy command, and wrote a replacement in Ruby, using the pure-ruby Postgres-pr library.
I run into the same issue. Some lines seem to be dropped, but no exceptions nor SQL errors are reported by the program.
In order to improve throughput, in my ruby program I connect to the server just once, and send the INSERT statements to the server in batches of 2000.
I’ve checked that the file doesn’t contain any SQL escape sequences or anything else that would invalidate an INSERT.
The version running in the server is 8.1.3 on Linux 2.6.5 on an Intel platform.
The imports are being run from windows machines in the same network.
Has somebody seen this before?
Thanks
Jaime
"Silvela, Jaime \(Exchange\)" <JSilvela@Bear.com> writes: > I've started to notice missing info sometimes. I'll truncate the table, > read from the file, and notice that sometimes there are less rows in the > table than in the file. Have you made any attempt to determine *which* rows are missing? I'm wondering about sloppy quoting allowing lines to get joined, or some such. regards, tom lane
No lines contain quotes. And the same file will sometimes be fully imported, and sometimes lose data. I'm thinking that under heavy loads, the database is discarding INSERTS. thanks -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Wednesday, July 26, 2006 4:14 PM To: Silvela, Jaime (Exchange) Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] copy losing information "Silvela, Jaime \(Exchange\)" <JSilvela@Bear.com> writes: > I've started to notice missing info sometimes. I'll truncate the table, > read from the file, and notice that sometimes there are less rows in the > table than in the file. Have you made any attempt to determine *which* rows are missing? I'm wondering about sloppy quoting allowing lines to get joined, or some such. regards, tom lane *********************************************************************** Bear Stearns is not responsible for any recommendation, solicitation, offer or agreement or any information about any transaction, customer account or account activity contained in this communication. Bear Stearns does not provide tax, legal or accounting advice. You should consult your own tax, legal and accounting advisors before engaging in any transaction. In order for Bear Stearns to comply with Internal Revenue Service Circular 230 (if applicable), you are notified that any discussion of U.S. federal tax issues contained or referred to herein is not intended or written to be used, and cannot be used, for the purpose of: (A) avoiding penalties that may be imposed under the Internal Revenue Code; nor (B) promoting, marketing or recommending to another party any transaction or matter addressed herein. ***********************************************************************
Silvela, Jaime (Exchange) wrote: > No lines contain quotes. And the same file will sometimes be fully > imported, and sometimes lose data. I'm thinking that under heavy loads, > the database is discarding INSERTS. I don't think that's very likely. How are you checking that the data is there? Do you check the whole table by way of a big, fat, unconstrained SELECT, or do you extract some rows one by one? My guess would be that maybe some insertions are not making it into an index that may be later used for getting the data during the examination. If there are no indexes involved (no primary key either) then this theory can be discarded quickly. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
This is the first time I post to the list. I’ve done a brief search and didn’t find my issue treated already, so here it goes. Apologies if this has been reported before.
What PG version and environment? How about sending the output of 'select version()' ?
copy or \copy? What arguments?
If you can release the data and get it to me (e.g., compressed email attachment, http, ftp), I volunteer to try a run on 8.1.4 on suse 10.0/x86_64.
-Reece
-- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0 |
Alvaro Herrera <alvherre@commandprompt.com> writes: > Silvela, Jaime (Exchange) wrote: >> No lines contain quotes. And the same file will sometimes be fully >> imported, and sometimes lose data. I'm thinking that under heavy loads, >> the database is discarding INSERTS. > I don't think that's very likely. Especially not since he says he's using COPY --- any sort of error would be all-or-nothing. Personally I'm wondering about individual rows getting dropped on the client side. regards, tom lane
Thanks guys, The output of select version() is "PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 4.0.3" Tom, how and why would INSERTs be dropped on the client side? I'll be away next week, but when I get back I'll conduct iterations to find out how often this happens and if there is a pattern. Reece, is there an ftp site where I can drop the file? It's way too large for email. The copy to the table from the file has been tried in two ways: "C:\Program Files\PostgreSQL\8.1\bin\psql" -c "\copy mytable (series, ticker, date, value) from C:\temp\myfile.out" -h <server> -U <user> <database> And also through a ruby script that uses the ruby-native library to connect to Postgres, and sends INSERTs in batches of 2000. Both ways work unreliably, sometimes getting every line, sometimes dropping a few lines. I check that doing a "select count(*)", and comparing against the number of lines in the file. The table gets truncated before the import, and has no sort of indexing. In case it can help, here is the definition. CREATE TABLE mytable ( series varchar(15), ticker varchar(20), date date, value numeric, variable varchar(20), msa varchar(3), "year" int4, "month" int2, freq varchar(2), geog varchar(6) ) Thanks Jaime -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Wednesday, July 26, 2006 5:05 PM To: Alvaro Herrera Cc: Silvela, Jaime (Exchange); pgsql-general@postgresql.org Subject: Re: [GENERAL] copy losing information Alvaro Herrera <alvherre@commandprompt.com> writes: > Silvela, Jaime (Exchange) wrote: >> No lines contain quotes. And the same file will sometimes be fully >> imported, and sometimes lose data. I'm thinking that under heavy loads, >> the database is discarding INSERTS. > I don't think that's very likely. Especially not since he says he's using COPY --- any sort of error would be all-or-nothing. Personally I'm wondering about individual rows getting dropped on the client side. regards, tom lane *********************************************************************** Bear Stearns is not responsible for any recommendation, solicitation, offer or agreement or any information about any transaction, customer account or account activity contained in this communication. Bear Stearns does not provide tax, legal or accounting advice. You should consult your own tax, legal and accounting advisors before engaging in any transaction. In order for Bear Stearns to comply with Internal Revenue Service Circular 230 (if applicable), you are notified that any discussion of U.S. federal tax issues contained or referred to herein is not intended or written to be used, and cannot be used, for the purpose of: (A) avoiding penalties that may be imposed under the Internal Revenue Code; nor (B) promoting, marketing or recommending to another party any transaction or matter addressed herein. ***********************************************************************
"Silvela, Jaime \(Exchange\)" <JSilvela@bear.com> writes: > Tom, how and why would INSERTs be dropped on the client side? [ shrug... ] I don't know your code; I was thinking about garden variety bugs in your ruby script. However, if you can make it happen just through psql \copy then that theory seems to lose its luster :-( Something else that might be worth looking at: have you got any user-written BEFORE INSERT triggers on that table? A trigger that sometimes returned NULL would explain the symptoms. regards, tom lane
On Wed, 2006-07-26 at 13:38 -0700, Reece Hart wrote:
If you can release the data and get it to me (e.g., compressed email attachment, http, ftp), I volunteer to try a run on 8.1.4 on suse 10.0/x86_64.
Jaime-
I cannot replicate your observations.
There were 1915733 lines in the file you sent. In all tests below, select(*) returned this number.
- linux/x86 client: within psql shell, create table, \copy, select count(*), truncate, \copy, select count(*). OK.
- linux/x86 client: psql -dcsb-dev -c '\copy haver_msamo_raw(series,ticker,date,value) from test.out', then select(*). OK
- win XP client: command line copy as you had posted, akin to above. Twice. OK
- linux/x86 client: truncate-copy-select(*) 35 times. all OK
All tests were with 8.1.4 server and clients. server_encoding and client_encoding are SQL_ASCII according to pg_settings. On windows, I got some error about code page mismatch at the console, but I doubt that's relevant.
Am at a loss to explain what you see. I don't know enough about character encoding to know whether that might be a culprit. Since you apparently have a linux box (from your version() string), have you tried loading from the server box?
Good luck,
Reece
Some details:
tallac$ gzip -t test.out.gz test.out.gz: OK tallac$ gzip -cd test.out.gz | wc -l test.out.gz: 88.0% 1915733 tallac$ md5sum test.out.gz 929582602507880045c1795970c974e4 test.out.gz
And in PostgreSQL:
rkh@csb-dev=> select version(); version ------------------------------------------------------------------------------------------------------------------PostgreSQL 8.1.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.0.2 20050901 (prerelease) (SUSE Linux) (1 row) rkh@csb-dev=> TRUNCATE haver_msamo_raw ; TRUNCATE TABLE rkh@csb-dev=> \copy haver_msamo_raw(series,ticker,date,value) from copy/test.out \. rkh@csb-dev=> select count(*) from haver_msamo_raw ; count ---------1915733 (1 row)
Linux command line:
$ psql -dcsb-dev -c '\copy haver_msamo_raw(series,ticker,date,value) from test.out'
Windows XP command line (psql from postgresql org 8.1.4 binaries):
psql.exe -hcsb -dcsb-dev -Upostgres -c "\copy haver_msamo_raw (series,ticker,date,value) from C:\temp\test.out"
Then I did this 35 times (in a shell script):
truncate rkh.haver_msamo_raw; select 'before',count(*) from rkh.haver_msamo_raw; \copy rkh.haver_msamo_raw(series,ticker,date,value) from test.out select 'after',count(*) from rkh.haver_msamo_raw;
-- Reece Hart, http://harts.net/reece/, GPG:0x25EC91A0 |
If you can release the data and get it to me (e.g., compressed email attachment, http, ftp), I volunteer to try a run on 8.1.4 on suse 10.0/x86_64.
Jaime-
I cannot replicate your observations.
There were 1915733 lines in the file you sent. In all tests below, select(*) returned this number.
- linux/x86 client: within psql shell, create table, \copy, select count(*), truncate, \copy, select count(*). OK.
- linux/x86 client: psql -dcsb-dev -c '\copy haver_msamo_raw(series,ticker,date,value) from test.out', then select(*). OK
- win XP client: command line copy as you had posted, akin to above. Twice. OK
- linux/x86 client: truncate-copy-select(*) 35 times. all OK
All tests were with 8.1.4 server and clients. server_encoding and client_encoding are SQL_ASCII according to pg_settings. On windows, I got some error about code page mismatch at the console, but I doubt that's relevant.
Am at a loss to explain what you see. I don't know enough about character encoding to know whether that might be a culprit. Since you apparently have a linux box (from your version() string), have you tried loading from the server box?
Good luck,
Reece
Some details:
tallac$ gzip -t test.out.gz test.out.gz: OK tallac$ gzip -cd test.out.gz | wc -l test.out.gz: 88.0% 1915733 tallac$ md5sum test.out.gz 929582602507880045c1795970c974e4 test.out.gz
And in PostgreSQL:
rkh@csb-dev=> select version(); version ------------------------------------------------------------------------------------------------------------------PostgreSQL 8.1.4 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.0.2 20050901 (prerelease) (SUSE Linux) (1 row) rkh@csb-dev=> TRUNCATE haver_msamo_raw ; TRUNCATE TABLE rkh@csb-dev=> \copy haver_msamo_raw(series,ticker,date,value) from copy/test.out \. rkh@csb-dev=> select count(*) from haver_msamo_raw ; count ---------1915733 (1 row)
Linux command line:
$ psql -dcsb-dev -c '\copy haver_msamo_raw(series,ticker,date,value) from test.out'
Windows XP command line (psql from postgresql org 8.1.4 binaries):
psql.exe -hcsb -dcsb-dev -Upostgres -c "\copy haver_msamo_raw (series,ticker,date,value) from C:\temp\test.out"
Then I did this 35 times (in a shell script):
truncate rkh.haver_msamo_raw; select 'before',count(*) from rkh.haver_msamo_raw; \copy rkh.haver_msamo_raw(series,ticker,date,value) from test.out select 'after',count(*) from rkh.haver_msamo_raw;
-- Reece Hart, Ph.D. rkh@gene.com, http://www.gene.com/ Genentech, Inc. 650-225-6133 (voice), -5389 (fax) Bioinformatics and Protein Engineering 1 DNA Way, MS-93 http://harts.net/reece/ South San Francisco, CA 94080-4990 reece@harts.net, GPG:0x25EC91A0 |
"Silvela, Jaime \(Exchange\)" <JSilvela@Bear.com> writes: > I have a pretty big file, around 2 million rows, in tab-separated > format, with 4 columns, that I read into a table in Postgres using the > copy command. > I've started to notice missing info sometimes. I'll truncate the table, > read from the file, and notice that sometimes there are less rows in the > table than in the file. This looks a lot like Dan Kavan's problem, http://archives.postgresql.org/pgsql-admin/2006-09/msg00092.php which we have now pretty well proven was a kernel bug. > The version running in the server is 8.1.3 on Linux 2.6.5 on an Intel > platform. 2.6.5 what exactly? And what's the "Intel platform"? Although I don't have any details about the kernel bug, I can't help suspecting that it only manifested on systems pretty similar to Dan's dual Opteron. regards, tom lane