Thread: Problems with pg_dump

Problems with pg_dump

From
Stefan Holzheu
Date:
Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1.
Now we have the problem that pg_dump does not finish. We get the
following error message:

pg_dump: lost synchronization with server: got message type "5", length
842281016
pg_dump: SQL command to dump the contents of table "aggr_stddev_hour"
failed:
PQendcopy() failed.
pg_dump: Error message from server: lost synchronization with server: got
message type "5", length 842281016
pg_dump: The command was: COPY messungen.aggr_stddev_hour (id, von, wert,
counts) TO stdout;

The database size is about 10 GB. The OS is Red Hat Linux 7.1.

Help appreciated!

Stefan


--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------


Re: Problems with pg_dump

From
Tom Lane
Date:
Stefan Holzheu <stefan.holzheu@bitoek.uni-bayreuth.de> writes:
> Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1.
> Now we have the problem that pg_dump does not finish. We get the
> following error message:

> pg_dump: lost synchronization with server: got message type "5", length
> 842281016
> pg_dump: SQL command to dump the contents of table "aggr_stddev_hour"
> failed:

That's interesting.  What are the column data types in that table?  What
character set encoding are you using?  Can you do a dump if you select
the dump-data-as-INSERT-commands option?
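
(If I recall the 7.4 switches correctly, that is pg_dump's -d option, or
-D for INSERTs with explicit column names -- for example, with a
hypothetical output file:

    pg_dump -d -t aggr_stddev_hour DB > aggr_stddev_hour.sql

If that completes while the COPY-format dump fails, it would point at the
COPY datastream handling rather than at the data itself.)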

> The database size is about 10 GB.

How big is the particular table?

            regards, tom lane

Re: Problems with pg_dump

From
Stefan Holzheu
Date:
>
>>Two weeks ago we upgraded from PostgreSQL 7.3.2 to 7.4.1.
>>Now we have the problem that pg_dump does not finish. We get the
>>following error message:
>
>
>>pg_dump: lost synchronization with server: got message type "5", length
>>842281016
>>pg_dump: SQL command to dump the contents of table "aggr_stddev_hour"
>>failed:
>
>
> That's interesting.  What are the column data types in that table?  What
> character set encoding are you using?  Can you do a dump if you select
> the dump-data-as-INSERT-commands option?

      Table "messungen.aggr_stddev_hour"
 Column |           Type           | Modifiers
--------+--------------------------+-----------
 id     | integer                  | not null
 von    | timestamp with time zone | not null
 wert   | double precision         |
 counts | integer                  | not null
Indexes:
    "idx_aggr_stddev_hour_id_von" unique, btree (id, von)

Encoding: LATIN9

The error does not always occur, and not always with the same table.
However, the error occurs only on that kind of aggregation table. There
is a cron job keeping the tables up to date, starting every 10 minutes.
The job does deletes and inserts on the table. Could this somehow
interfere with the dump process? Normally it should not, should it?


>
>
>>The database size is about 10 GB.
>
>
> How big is the particular table?

The table has 9,000,000 entries.

Stefan

--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------


Re: Problems with pg_dump

From
Tom Lane
Date:
Stefan Holzheu <stefan.holzheu@bitoek.uni-bayreuth.de> writes:
> pg_dump: lost synchronization with server: got message type "5", length
> 842281016

> The error does not always occur, and not always with the same table.

Oh, that's even more interesting.  Is the failure message itself
consistent --- that is, is it always complaining about "message type 5"
and the same bizarre length value?  The "length" looks like it's really
ASCII text ("2408" to be specific), so somehow libpq is misinterpreting
a piece of the COPY datastream as the start of a new message.
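
(A quick sketch of that arithmetic -- not the libpq code, just an
illustration of how four ASCII data bytes, read as the v3 protocol's
big-endian int32 length word, yield exactly that number:

    #include <stdio.h>

    int main(void)
    {
        /* "2408" misread as a big-endian 4-byte message length */
        unsigned char b[4] = { '2', '4', '0', '8' };
        unsigned long len = ((unsigned long) b[0] << 24)
                          | ((unsigned long) b[1] << 16)
                          | ((unsigned long) b[2] << 8)
                          |  (unsigned long) b[3];
        printf("%lu\n", len);   /* prints 842281016 */
        return 0;
    }

The protocol frames each backend message as a one-byte type code followed
by an int32 length, so once the read pointer drifts into the middle of
COPY text, printable digits get parsed as a header.)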

> However, the error occurs only on that kind of aggregation table. There
> is a cron job keeping the tables up to date, starting every 10 minutes.
> The job does deletes and inserts on the table. Could this somehow
> interfere with the dump process? Normally it should not, should it?

It's hard to see how another backend would have anything to do with
this, unless perhaps the error is dependent on a particular data value
that is sometimes present in the table and sometimes not.  It looks to
me like either libpq or the backend is miscounting the number of data
bytes in the COPY datastream.  Would it be possible for you to use a
packet sniffer to capture the communication between pg_dump and the
backend?  If we could look at exactly what's going over the wire, it
would help to pin down the blame.

            regards, tom lane

Re: Problems with pg_dump

From
Stefan Holzheu
Date:
>
>>pg_dump: lost synchronization with server: got message type "5", length
>>842281016
>
>
>>The error does not always occur, and not always with the same table.
>
>
> Oh, that's even more interesting.  Is the failure message itself
> consistent --- that is, is it always complaining about "message type 5"
> and the same bizarre length value?  The "length" looks like it's really
> ASCII text ("2408" to be specific), so somehow libpq is misinterpreting
> a piece of the COPY datastream as the start of a new message.
>

The failure looks similar, but the numbers are not exactly the same:

pg_dump: lost synchronization with server: got message type "4", length
858863113
pg_dump: SQL command to dump the contents of table "aggr_max_hour"
failed: PQendcopy() failed.
pg_dump: Error message from server: lost synchronization with server:
got message type "4", length 858863113
pg_dump: The command was: COPY messungen.aggr_max_hour (id, von, wert,
counts) TO stdout;
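
(The same decoding applies here: 858863113 is 0x33313609, i.e. the ASCII
bytes '3', '1', '6' followed by a tab (0x09) -- ordinary tab-separated
COPY text again -- and the "type" byte '4' is 0x34, another digit.)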

>
>>However, the error occurs only on that kind of aggregation table. There
>>is a cron job keeping the tables up to date, starting every 10 minutes.
>>The job does deletes and inserts on the table. Could this somehow
>>interfere with the dump process? Normally it should not, should it?
>
>
> It's hard to see how another backend would have anything to do with
> this, unless perhaps the error is dependent on a particular data value
> that is sometimes present in the table and sometimes not.  It looks to
> me like either libpq or the backend is miscounting the number of data
> bytes in the COPY datastream.  Would it be possible for you to use a
> packet sniffer to capture the communication between pg_dump and the
> backend?  If we could look at exactly what's going over the wire, it
> would help to pin down the blame.
>
Well, I have never used a packet sniffer, but if anyone tells me what to
capture, I'd be glad to.

PostgreSQL is really a great piece of software, keep on making it even
better!

Stefan



--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------


Re: Problems with pg_dump

From
Tom Lane
Date:
Stefan Holzheu <stefan.holzheu@bitoek.uni-bayreuth.de> writes:
> Well, I have never used a packet sniffer, but if anyone tells me what
> to capture, I'd be glad to.

Since we don't know where exactly in the table the error occurs, I think
you'll just have to capture the whole connection traffic between pg_dump
and the backend.  You could help keep it to a reasonable size by doing a
selective dump of just one table, viz "pg_dump -t tablename dbname",
although I guess you might have to try a few times to reproduce the
error that way ...
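
For a TCP connection something along these lines should work (a sketch,
assuming the server listens on the default port 5432, and picking a
snaplen large enough that packets are not truncated):

    tcpdump -i lo -s 1600 -w pgdump.pcap tcp port 5432

and then, in another shell,

    pg_dump -h 127.0.0.1 -t aggr_stddev_hour DB > /dev/null

Note that tcpdump sees only TCP traffic; it cannot capture the
Unix-domain socket that pg_dump uses by default, so you would have to
force a TCP connection with -h as above.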

            regards, tom lane

Re: Problems with pg_dump

From
Stefan Holzheu
Date:
>
> pg_dump: lost synchronization with server: got message type "4", length
> 858863113
> pg_dump: SQL command to dump the contents of table "aggr_max_hour"
> failed: PQendcopy() failed.
> pg_dump: Error message from server: lost synchronization with server:
> got message type "4", length 858863113
> pg_dump: The command was: COPY messungen.aggr_max_hour (id, von, wert,
> counts) TO stdout;
>
>>
>>> However, the error occurs only on that kind of aggregation table.
>>> There is a cron job keeping the tables up to date, starting every 10
>>> minutes. The job does deletes and inserts on the table. Could this
>>> somehow interfere with the dump process? Normally it should not,
>>> should it?
>>
>>
>>
>> It's hard to see how another backend would have anything to do with
>> this, unless perhaps the error is dependent on a particular data value
>> that is sometimes present in the table and sometimes not.  It looks to
>> me like either libpq or the backend is miscounting the number of data
>> bytes in the COPY datastream.  Would it be possible for you to use a
>> packet sniffer to capture the communication between pg_dump and the
>> backend?  If we could look at exactly what's going over the wire, it
>> would help to pin down the blame.
>>
> Well, I have never used a packet sniffer, but if anyone tells me what
> to capture, I'd be glad to.
>
In order to capture the pg_dump traffic with tcpdump, we changed the
pg_dump command from

    pg_dump DB > ...

to

    pg_dump -h 127.0.0.2 DB > ...

Since then, the dump has completed without any errors. Before, over the
local socket, pg_dump failed about 50% of the time.

Anyway, we can live with that solution. But I'd still be curious to know
what the problem with local connections is. Is there a way to capture the
traffic of local socket connections?


Stefan




--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------

Re: Problems with pg_dump

From
Tom Lane
Date:
Stefan Holzheu <stefan.holzheu@bitoek.uni-bayreuth.de> writes:
> [ pg_dump shows data corruption over local socket but not TCP ]

That is really odd.

> Anyway, we can live with that solution. But still I'd be curious to know
> what is the problem with local connections. Is there a way to capture
> traffic of local socket connections?

No, not that I know of.  You could modify libpq to dump everything it
receives from the socket to a log file, and then compare the results
from local socket and TCP cases.
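
A minimal sketch of such a hack (hypothetical -- the function and field
names here are from memory of the 7.4 sources and may differ; the idea
is just to append every chunk read from the server to a file so the
local-socket and TCP byte streams can be diffed afterwards):

    #include <stdio.h>

    /* in src/interfaces/libpq/fe-misc.c, say */
    static void
    trace_recv(const char *buf, int len)
    {
        static FILE *f = NULL;

        if (f == NULL)
            f = fopen("/tmp/libpq-recv.log", "wb");
        if (f != NULL)
        {
            fwrite(buf, 1, len, f);
            fflush(f);
        }
    }

    /* ... then inside pqReadData(), after a successful read of
     * nread bytes into conn->inBuffer + conn->inEnd:
     *
     *     trace_recv(conn->inBuffer + conn->inEnd, nread);
     */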

The whole thing is bizarre though.  AFAICS it must represent a kernel
bug.  What is your platform exactly?

            regards, tom lane


Re: Problems with pg_dump

From
Stefan Holzheu
Date:
>>[ pg_dump shows data corruption over local socket but not TCP ]
>
>
> That is really odd.
>
>
>>Anyway, we can live with that solution. But still I'd be curious to know
>>what is the problem with local connections. Is there a way to capture
>>traffic of local socket connections?
>
>
> No, not that I know of.  You could modify libpq to dump everything it
> receives from the socket to a log file, and then compare the results
> from local socket and TCP cases.
>
> The whole thing is bizarre though.  AFAICS it must represent a kernel
> bug.  What is your platform exactly?
>

It's RH 7.1 with the standard RPM kernel 2.4.2-2. The special thing is a
proprietary Infortrend kernel module, rd124f, necessary to access an
IFT-2101 RAID.
Unfortunately, Infortrend does not supply a module for a more recent
Linux distribution. The machine was originally intended to run Oracle on
Windows...

I will check whether the error occurs on another machine with a
different kernel but an identical database and PostgreSQL version.

Regards, Stefan

--
-----------------------------
Dr. Stefan Holzheu
Tel.: 0921/55-5720
Fax.: 0921/55-5799
BITOeK Wiss. Sekretariat
Universitaet Bayreuth
D-95440 Bayreuth
-----------------------------