Thread: Restore killing the backend

Restore killing the backend

From: Stephen Bacon
Date: Mon, 29 Jul 2002 16:38:36 -0400
Hello,
   I'm trying to migrate from 7.1.3 to 7.2.1.
I pg_dump'ed all the data from 7.1.3, yet when I try psql dbname < dumpfile
it crashes, taking out postgres with it.
I am running under Red Hat Linux 7.3 (kernel-smp-2.4.18-5).
This has happened both with the RPM from Red Hat (on the CD) and the latest
one off the website (postgresql-7.2.1-2PGDG).
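
For reference, the commands involved are roughly the following - just a
sketch, with "dbname" and "dumpfile" as placeholders and any extra flags
omitted:

    # on the old 7.1.3 server: dump the database to a plain SQL script
    pg_dump dbname > dumpfile
    # on the new 7.2.1 server: create an empty database and feed it the script
    createdb dbname
    psql dbname < dumpfile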

Attached are the output from two different runs. The error occurs when
performing the COPY command to repopulate the data.

Also, this dump will successfully import into a 7.1.3 db. I've looked in
the dump file for "3835" (which it says it can't process) and the only
occurrence of that string is within a phone number (text field).
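
(For the curious, the search was nothing fancier than something like the
following, "dumpfile" being whatever the dump is called:)

    # show every line of the dump containing the string, with line numbers
    grep -n 3835 dumpfile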

Finally, after the error I usually have to restart the machine, as I get
segmentation faults from other apps - which hints to me that there's a
problem somewhere in an OS-level library.

Any ideas?

Thanks,
   -Steve


<errors>------------------------

Re: Restore killing the backend

From: Andrew Sullivan
Date:
On Mon, Jul 29, 2002 at 04:38:36PM -0400, Stephen Bacon wrote:

> I pg_dump'ed all the data from 7.1.3, yet when I try psql dbname < dumpfile
> it crashes, taking out postgres with it.

> Finally, after the error I usually have to restart the machine, as I get
> segmentation faults from other apps - which hints to me that there's a
> problem somewhere in an OS level library.
>
> Any ideas?

Test your memory.  It's possible you have some bad memory, and you
happen to exercise it by reading the big dumpfile and writing into
the database.  Once the bad bit is getting used, you keep running
into it; hence the subsequent segfaults.

There have been _a lot_ of hardware-related problems reported lately.
For production use, if your data is worth anything at all, buy ECC
memory.  It's worth it, even though it's expensive.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Restore killing the backend

From: Stephen Bacon
Date: Wed, 31 Jul 2002 15:15:37 -0400
> Test your memory.  It's possible you have some bad memory, and you
<snip>
> For production use, if your data is worth anything at all, buy ECC
> memory.  It's worth it, even though it's expensive.

Hello again.

Indeed we are using ECC memory - although if I knew how to test it, I
would - is there a utility for doing this?

For some reason the errors did not appear in my original post, so here
they are again (hopefully). I've also included the tail of the postgres log
file (I had debug_level set to 2), which shows (what I think is) a lot of
WAL activity and recommendations to increase WAL_FILES.
I increased wal_files to 8, wal_buffers to 15, and checkpoint_segments to
3, and yet the problem still occurs.
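
(In case it helps anyone following along: those are plain postgresql.conf
settings, and the values below are simply the ones tried above:)

    # postgresql.conf (PostgreSQL 7.2)
    wal_files = 8
    wal_buffers = 15
    checkpoint_segments = 3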

Next I'm going to break my import up into separate files and do it step
by step, but in the meantime does anyone have any ideas?
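
(One rough way to do that split, assuming GNU csplit is available - this just
cuts the dump at each COPY statement and is only a sketch:)

    # break the dump into numbered pieces, one per COPY block,
    # so each chunk can be fed to psql separately
    csplit dumpfile '/^COPY /' '{*}'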

Thanks,
  -Steve


Re: Restore killing the backend

From: Stephen Bacon
Date: Wed, 31 Jul 2002 15:45:19 -0400
Hello *again*.
Strangeness... something seems to be stripping the logs off the bottom of my
last two posts (so I'm trying a different email client).
Here's the third try - apologies for the wasted bandwidth.
-Steve


<crash with default WAL settings>

[[[output from psql attempting import]]]

<snip>

NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
'tbloutdischassess_pkey' for table 'tbloutdischassess'
CREATE
NOTICE: copy: line 4232076, Message from PostgreSQL backend:
The Postmaster has informed me that some other backend
died abnormally and possibly corrupted shared memory.
I have rolled back the current transaction and am
going to terminate your database system connection and exit.
Please reconnect to the database system and repeat your query.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
lost synchronization with server, resetting connection
connection to server was lost

[[[and a second attempt:]]]

<snip>

NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
'tblnatldischtier_pkey' for table 'tblnatldischtier'
CREATE
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
'tbloutdischassess_pkey' for table 'tbloutdischassess'
CREATE
ERROR: copy: line 5706, pg_atoi: error in ""3835": can't parse ""3835"
lost synchronization with server, resetting connection
ERROR: copy: line 1008, Bad float4 input format 'LN'
lost synchronization with server, resetting connection
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
lost synchronization with server, resetting connection
connection to server was lost


[[[ level 2 debug messages ]]]

DEBUG:  shmem_exit(0)
DEBUG:  exit(0)
DEBUG:  reaping dead processes
DEBUG:  child process (pid 11024) exited with exit code 0
DEBUG:  recycled transaction log file 0000000000000025
DEBUG:  recycled transaction log file 0000000000000026
DEBUG:  recycled transaction log file 0000000000000024
DEBUG:  proc_exit(0)
DEBUG:  shmem_exit(0)
DEBUG:  exit(0)
DEBUG:  reaping dead processes
DEBUG:  child process (pid 11025) exited with exit code 0
FATAL 2:  XLogWrite: write request 0/2D10C000 is past end of log 0/2D0FE000
DEBUG:  proc_exit(2)
DEBUG:  shmem_exit(2)
DEBUG:  exit(2)
DEBUG:  reaping dead processes
DEBUG:  child process (pid 11026) exited with exit code 2
DEBUG:  server process (pid 11026) exited with exit code 2
DEBUG:  terminating any other active server processes
DEBUG:  CleanupProc: sending SIGQUIT to process 11009
NOTICE:  copy: line 4232076, Message from PostgreSQL backend:
    The Postmaster has informed me that some other backend
    died abnormally and possibly corrupted shared memory.
    I have rolled back the current transaction and am
    going to terminate your database system connection and exit.
    Please reconnect to the database system and repeat your query.
DEBUG:  reaping dead processes
DEBUG:  child process (pid 11009) exited with exit code 1
DEBUG:  all server processes terminated; reinitializing shared memory and
semaphores
DEBUG:  shmem_exit(0)
invoking IpcMemoryCreate(size=728809472)

[[[ level 2 debug messages w/ increased WAL buffers ]]]

<snip>

DEBUG: StartTransactionCommand
DEBUG: query: COPY "tblirfpai_quality" FROM stdin;
DEBUG: ProcessUtility: COPY "tblirfpai_quality" FROM stdin;
DEBUG: reaping dead processes
DEBUG: child process (pid 4433) was terminated by signal 11
DEBUG: server process (pid 4433) was terminated by signal 11
DEBUG: terminating any other active server processes
DEBUG: all server processes terminated; reinitializing shared memory and
semaphores
DEBUG: shmem_exit(0)
invoking IpcMemoryCreate(size=417505280)
DEBUG: database system was interrupted at 2002-07-31 14:21:39 EDT
DEBUG: checkpoint record is at 1/9400808C
DEBUG: redo record is at 1/9300702C; undo record is at 0/0; shutdown FALSE
DEBUG: next transaction id: 1778; next oid: 39371344
DEBUG: database system was not properly shut down; automatic recovery in
progress
DEBUG: redo starts at 1/9300702C
DEBUG: reaping dead processes
DEBUG: startup process (pid 4434) was terminated by signal 11
DEBUG: aborting startup due to startup process failure
DEBUG: proc_exit(1)
DEBUG: shmem_exit(1)
DEBUG: exit(1)


Re: Restore killing the backend

From: Andrew Sullivan
Date:
On Wed, Jul 31, 2002 at 03:15:37PM -0400, Stephen Bacon wrote:
>
> Indeed we are using ECC memory - although if I knew how to test it, I
> would - is there a utility for doing this?

If you are using ECC, and your hardware and software support it, you
should see errors in your syslogs if ECC is having trouble.

Anyway, the whole point of ECC is to prevent bad data from getting
through, so I doubt very much that's the problem.  If you're using an
x86 architecture, you can try memtest86:  http://www.memtest86.com/
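
(If you want to check the logs, something along these lines should turn up
any ECC or machine-check complaints the kernel has logged - assuming they
end up in /var/log/messages on your system:)

    # look for memory-related kernel messages in syslog
    grep -i -e ecc -e mce -e 'machine check' /var/log/messages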

A


--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Restore killing the backend

From: Andrew Sullivan
Date:
On Wed, Jul 31, 2002 at 03:45:19PM -0400, Stephen Bacon wrote:
> <crash with default WAL settings>
>
> [[[output from psql attempting import]]]
>
> <snip>
>
> NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index
> 'tbloutdischassess_pkey' for table 'tbloutdischassess'
> CREATE
> NOTICE: copy: line 4232076, Message from PostgreSQL backend:
> The Postmaster has informed me that some other backend
                                      ^^^^^^^^^^

Who else is connected?

> ERROR: copy: line 5706, pg_atoi: error in ""3835": can't parse ""3835"
                                             ^

Looks like a delimiting problem.

> ERROR: copy: line 1008, Bad float4 input format 'LN'
                               ^^^^^               ^^

It appears your data source is bad.  Maybe you need to have a look at
it with an editor?

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110


Re: Restore killing the backend - solved

From: Stephen Bacon
Date:
Well!
Indeed the problem *was* with the data (I know; assuming makes an
ass...). I had assumed the data I was importing was that which had been
pg_dump'ed from the 7.1.3 db and so was "correct".
It turns out that after the pg_dump it had been processed through a perl
script (roughly sketched below) to:
   1) convert the string ":60.00" to ":59.99" - pg_dump had a bug with
      round-off of seconds, and psql would choke trying to set a time
      with a seconds value of 60
   2) convert the old-style column default of 'timestamp(now())' to
      CURRENT_TIMESTAMP
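
(I don't have the script handy to post; it amounted to something like the
one-liner below - a rough reconstruction, not the actual code, and the real
script was line-oriented, which is how it tripped over the ^Ms described
next:)

    # rewrite the dump: fix the :60.00 seconds round-off and the old-style default
    perl -pe 's/:60\.00/:59.99/g; s/timestamp\(now\(\)\)/CURRENT_TIMESTAMP/g' \
        dumpfile > dumpfile.fixed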

However - some of our tables contain data with embedded ^M's (ASCII 13)
(because they're being populated from a web page).

Well, the perl script would interpret the ^M as a newline and end up
converting the two lines:

114<TAB>3<TAB>re: Updates<TAB>Can you read over^M\
the help files?<TAB>6523

to three lines:

114<TAB>3<TAB>re: Updates<TAB>Can you read over
\
the help files?<TAB>6523

so the first line would have truncated data, and the second line would
continue the mess.

This would obviously cause problems during import.

What I can't figure out is why it would kill the back-end (and usually
end up making the OS (Red Hat Linux 7.3) unstable). I've been trying to make a
small example that repeatably shows this but of course it won't crash!
It just (properly) gives a "Fail to add null" error.

I'm going to try and make a more complex setup to get this thing to
"reliably" cause the crash because it seems there's a bug in here
somewhere. I'll post that when I get it.

So anyway... for those of you out there migrating up who process
the dump before psql db < dumpfile - watch out for embedded ^Ms!
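
(A quick way to check a dump for them before doing any line-oriented
post-processing - just a sketch:)

    # list the first few lines that contain an embedded carriage return (^M, ASCII 13)
    perl -ne 'print "$.: $_" if /\r/' dumpfile | head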

thanks,
  -Steve