Thread: Problems with pg_dump
Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1. Now we have
the problem that pg_dump does not finish. We get the following error
message:

pg_dump: lost synchronization with server: got message type "5", length 842281016
pg_dump: SQL command to dump the contents of table "aggr_stddev_hour" failed: PQendcopy() failed.
pg_dump: Error message from server: lost synchronization with server: got message type "5", length 842281016
pg_dump: The command was: COPY messungen.aggr_stddev_hour (id, von, wert, counts) TO stdout;

The database size is about 10 GB. The OS is Red Hat Linux 7.1.

Help appreciated!

Stefan

--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------
Stefan Holzheu <stefan.holzheu@bitoek.uni-bayreuth.de> writes:
> Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1. Now we have
> the problem that pg_dump does not finish. We get the following error
> message:

> pg_dump: lost synchronization with server: got message type "5",
> length 842281016
> pg_dump: SQL command to dump the contents of table "aggr_stddev_hour"
> failed:

That's interesting. What are the column data types in that table? What
character set encoding are you using? Can you do a dump if you select
the dump-data-as-INSERT-commands option?

> The database size is about 10 GB.

How big is the particular table?

			regards, tom lane
>> Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1. Now we have
>> the problem that pg_dump does not finish. We get the following error
>> message:
>>
>> pg_dump: lost synchronization with server: got message type "5",
>> length 842281016
>> pg_dump: SQL command to dump the contents of table "aggr_stddev_hour"
>> failed:
>
> That's interesting. What are the column data types in that table? What
> character set encoding are you using? Can you do a dump if you select
> the dump-data-as-INSERT-commands option?

     Table "messungen.aggr_stddev_hour"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 id     | integer                  | not null
 von    | timestamp with time zone | not null
 wert   | double precision         |
 counts | integer                  | not null
Indexes:
    "idx_aggr_stddev_hour_id_von" unique, btree (id, von)

Encoding: LATIN9

The error does not always occur, and not always with the same table.
However, the error occurs only on that kind of aggregation tables.
There is a cron job keeping the tables up to date, starting every 10
minutes. The job does deletes and inserts on the table. Could this
somehow block the dump process? Normally it should not?

>> The database size is about 10 GB.
>
> How big is the particular table?

The table has 9,000,000 entries.

Stefan
Stefan Holzheu <stefan.holzheu@bitoek.uni-bayreuth.de> writes:
> pg_dump: lost synchronization with server: got message type "5",
> length 842281016

> The error does not always occur, and not always with the same table.

Oh, that's even more interesting. Is the failure message itself
consistent --- that is, is it always complaining about "message type 5"
and the same bizarre length value? The "length" looks like it's really
ASCII text ("2408" to be specific), so somehow libpq is misinterpreting
a piece of the COPY datastream as the start of a new message.

> However, the error occurs only on that kind of aggregation tables.
> There is a cron job keeping the tables up to date, starting every 10
> minutes. The job does deletes and inserts on the table. Could this
> somehow block the dump process? Normally it should not?

It's hard to see how another backend would have anything to do with
this, unless perhaps the error is dependent on a particular data value
that is sometimes present in the table and sometimes not. It looks to
me like either libpq or the backend is miscounting the number of data
bytes in the COPY datastream. Would it be possible for you to use a
packet sniffer to capture the communication between pg_dump and the
backend? If we could look at exactly what's going over the wire, it
would help to pin down the blame.

			regards, tom lane
>> pg_dump: lost synchronization with server: got message type "5",
>> length 842281016
>>
>> The error does not always occur, and not always with the same table.
>
> Oh, that's even more interesting. Is the failure message itself
> consistent --- that is, is it always complaining about "message type 5"
> and the same bizarre length value? The "length" looks like it's really
> ASCII text ("2408" to be specific), so somehow libpq is misinterpreting
> a piece of the COPY datastream as the start of a new message.

The failure looks similar, but the numbers are not exactly the same:

pg_dump: lost synchronization with server: got message type "4", length 858863113
pg_dump: SQL command to dump the contents of table "aggr_max_hour" failed: PQendcopy() failed.
pg_dump: Error message from server: lost synchronization with server: got message type "4", length 858863113
pg_dump: The command was: COPY messungen.aggr_max_hour (id, von, wert, counts) TO stdout;

>> However, the error occurs only on that kind of aggregation tables.
>> There is a cron job keeping the tables up to date, starting every 10
>> minutes. The job does deletes and inserts on the table. Could this
>> somehow block the dump process? Normally it should not?
>
> It's hard to see how another backend would have anything to do with
> this, unless perhaps the error is dependent on a particular data value
> that is sometimes present in the table and sometimes not. It looks to
> me like either libpq or the backend is miscounting the number of data
> bytes in the COPY datastream. Would it be possible for you to use a
> packet sniffer to capture the communication between pg_dump and the
> backend? If we could look at exactly what's going over the wire, it
> would help to pin down the blame.

Well, I've never used a packet sniffer, but if anyone tells me what to
capture, I'd be glad to.

PostgreSQL is really a great piece of software, keep on making it even
better!

Stefan
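[Tom's reading of the bogus lengths is easy to check by hand: treated as
four big-endian bytes, both reported values decode to printable ASCII.
A minimal C sketch of the decoding; the two values are the lengths
reported in this thread, everything else is illustrative:

    /* Reinterpret the bogus "length" values from the error messages
     * as four big-endian bytes; printable bytes come out as ASCII. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int lengths[] = { 842281016u, 858863113u };

        for (int i = 0; i < 2; i++)
        {
            unsigned int v = lengths[i];
            printf("%u = 0x%08X = ", v, v);
            for (int shift = 24; shift >= 0; shift -= 8)
            {
                unsigned char c = (v >> shift) & 0xFF;
                if (c >= 0x20 && c < 0x7F)
                    printf("'%c' ", c);
                else
                    printf("0x%02X ", c);
            }
            putchar('\n');
        }
        return 0;
    }

This prints 842281016 = 0x32343038 = '2' '4' '0' '8' (matching Tom's
"2408") and 858863113 = 0x33313609 = '3' '1' '6' followed by 0x09, a
tab. Tab is the field delimiter in COPY text format, which fits the
theory that libpq is reading a slice of the COPY datastream as if it
were a message header.]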
Stefan Holzheu <stefan.holzheu@bitoek.uni-bayreuth.de> writes:
> Well, I've never used a packet sniffer, but if anyone tells me what to
> capture, I'd be glad to.

Since we don't know where exactly in the table the error occurs, I
think you'll just have to capture the whole connection traffic between
pg_dump and the backend. You could help keep it to a reasonable size by
doing a selective dump of just one table, viz "pg_dump -t tablename
dbname", although I guess you might have to try a few times to
reproduce the error that way ...

			regards, tom lane
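[A note on the capture itself: tcpdump sees only IP packets, so the dump
would have to run over TCP loopback rather than the Unix socket, e.g.
"pg_dump -h 127.0.0.1 -t tablename dbname" with something like
"tcpdump -i lo -s 0 -w pgdump.pcap tcp port 5432" running as root at the
same time (which is what Stefan ends up doing below). As a rough
equivalent in C, a minimal libpcap sketch, assuming libpcap is installed
and the server listens on the default port 5432:

    /* Capture everything on the loopback interface involving TCP
     * port 5432 and write it to pgdump.pcap for later inspection.
     * Build with: gcc -o pgsniff pgsniff.c -lpcap */
    #include <pcap.h>
    #include <stdio.h>

    int main(void)
    {
        char errbuf[PCAP_ERRBUF_SIZE];
        pcap_t *p = pcap_open_live("lo", 65535, 0, 1000, errbuf);
        if (p == NULL)
        {
            fprintf(stderr, "pcap_open_live: %s\n", errbuf);
            return 1;
        }

        struct bpf_program prog;
        if (pcap_compile(p, &prog, "tcp port 5432", 1, 0) < 0 ||
            pcap_setfilter(p, &prog) < 0)
        {
            fprintf(stderr, "filter: %s\n", pcap_geterr(p));
            return 1;
        }

        pcap_dumper_t *out = pcap_dump_open(p, "pgdump.pcap");
        if (out == NULL)
        {
            fprintf(stderr, "pcap_dump_open: %s\n", pcap_geterr(p));
            return 1;
        }

        /* Run until interrupted; stop with Ctrl-C once pg_dump has
         * finished (or failed). */
        pcap_loop(p, -1, pcap_dump, (u_char *) out);

        pcap_dump_close(out);
        pcap_close(p);
        return 0;
    }]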
> pg_dump: lost synchronization with server: got message type "4",
> length 858863113
> pg_dump: SQL command to dump the contents of table "aggr_max_hour"
> failed: PQendcopy() failed.
> pg_dump: Error message from server: lost synchronization with server:
> got message type "4", length 858863113
> pg_dump: The command was: COPY messungen.aggr_max_hour (id, von, wert,
> counts) TO stdout;

>>> However, the error occurs only on that kind of aggregation tables.
>>> There is a cron job keeping the tables up to date, starting every 10
>>> minutes. The job does deletes and inserts on the table. Could this
>>> somehow block the dump process? Normally it should not?
>>
>> It's hard to see how another backend would have anything to do with
>> this, unless perhaps the error is dependent on a particular data value
>> that is sometimes present in the table and sometimes not. It looks to
>> me like either libpq or the backend is miscounting the number of data
>> bytes in the COPY datastream. Would it be possible for you to use a
>> packet sniffer to capture the communication between pg_dump and the
>> backend? If we could look at exactly what's going over the wire, it
>> would help to pin down the blame.
>
> Well, I've never used a packet sniffer, but if anyone tells me what to
> capture, I'd be glad to.

In order to capture the pg_dump traffic with tcpdump, we changed the
pg_dump command from

    pg_dump DB > ...

to

    pg_dump -h 127.0.0.2 DB > ...

Since then, the dump has completed without any errors. Before, over the
local socket connection, pg_dump failed about 50% of the time.

Anyway, we can live with that solution. But still I'd be curious to
know what the problem with local connections is. Is there a way to
capture traffic of local socket connections?

Stefan
Stefan Holzheu <stefan.holzheu@bitoek.uni-bayreuth.de> writes:
> [ pg_dump shows data corruption over local socket but not TCP ]

That is really odd.

> Anyway, we can live with that solution. But still I'd be curious to
> know what the problem with local connections is. Is there a way to
> capture traffic of local socket connections?

No, not that I know of. You could modify libpq to dump everything it
receives from the socket to a log file, and then compare the results
from local socket and TCP cases.

The whole thing is bizarre though. AFAICS it must represent a kernel
bug. What is your platform exactly?

			regards, tom lane
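[As a sketch of "modify libpq to dump everything it receives": one
alternative that avoids rebuilding libpq is an LD_PRELOAD shim that tees
everything the process reads into a log file. The library name, the log
path, and the choice to wrap both read() and recv() (libpq uses one or
the other depending on version and platform) are assumptions here, not
anything from the thread:

    /* Hypothetical LD_PRELOAD shim that tees everything a process
     * reads from descriptors above stderr into /tmp/pq-read.log.
     * Note it also logs ordinary file reads, which is acceptable for
     * a quick diagnostic. Build with:
     *   gcc -shared -fPIC -o teeread.so teeread.c -ldl
     * Use as:
     *   LD_PRELOAD=./teeread.so pg_dump -t aggr_stddev_hour DB > /dev/null
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int logfd = -1;

    static void log_bytes(const void *buf, ssize_t n)
    {
        if (logfd < 0)
            logfd = open("/tmp/pq-read.log",
                         O_WRONLY | O_CREAT | O_APPEND, 0600);
        if (logfd >= 0 && n > 0)
            write(logfd, buf, (size_t) n);
    }

    ssize_t read(int fd, void *buf, size_t count)
    {
        static ssize_t (*real_read)(int, void *, size_t);
        if (real_read == NULL)
            real_read = (ssize_t (*)(int, void *, size_t))
                dlsym(RTLD_NEXT, "read");

        ssize_t n = real_read(fd, buf, count);
        if (fd > 2)                     /* skip stdin/stdout/stderr */
            log_bytes(buf, n);
        return n;
    }

    ssize_t recv(int fd, void *buf, size_t count, int flags)
    {
        static ssize_t (*real_recv)(int, void *, size_t, int);
        if (real_recv == NULL)
            real_recv = (ssize_t (*)(int, void *, size_t, int))
                dlsym(RTLD_NEXT, "recv");

        ssize_t n = real_recv(fd, buf, count, flags);
        if (fd > 2)
            log_bytes(buf, n);
        return n;
    }

Running pg_dump once over the Unix socket and once over TCP with the
shim preloaded, then comparing the two logs at their first point of
divergence, would show whether the bytes were already corrupt when the
kernel delivered them, which is what a kernel bug would predict.]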
>> [ pg_dump shows data corruption over local socket but not TCP ]
>
> That is really odd.
>
>> Anyway, we can live with that solution. But still I'd be curious to
>> know what the problem with local connections is. Is there a way to
>> capture traffic of local socket connections?
>
> No, not that I know of. You could modify libpq to dump everything it
> receives from the socket to a log file, and then compare the results
> from local socket and TCP cases.
>
> The whole thing is bizarre though. AFAICS it must represent a kernel
> bug. What is your platform exactly?

It's a RH7.1 with the standard rpm kernel-2.4.2-2. The special thing is
a proprietary Infortrend kernel module, rd124f, necessary to access an
IFT-2101 RAID. Unfortunately, Infortrend does not supply a module for a
more recent Linux distribution. The machine was originally intended to
run Oracle on Windows...

I will check whether the error occurs on another machine with a
different kernel but an identical database and PostgreSQL version.

Regards,
Stefan