Thread: pg_dump strangeness

pg_dump strangeness

From
"Lane Rollins"
Date:

I’m having an issue with pg_dump crashing one of my servers. I was running PG 7.2.1 and am now running 7.2.4 on RedHat 7.3 with up-to-date patches. It happens when I’m dumping a largish (for me) database. The database has two tables, one with 1.2 million entries and the other with 3.5 million entries; there are also about 700,000 blobs with signatures. The exact command I’m using is…

 

pg_dump -Fc -b docarc >docarc.cust

 

It usually doesn’t happen on the first iteration; it’s the second that brings the box down. I ran it by hand on the console Saturday and it slowly destabilized the system. I lost the title bars on the windows and then the GNOME task bar. Only the mouse cursor moved, but it did not respond to clicks or the keyboard. I was able to restart the box from a telnet session.

 

I added more memory to the box and that seems to be helping. It now takes four runs to kill the box.

 

Any clue to the root of the problem? OS, hardware, postgresql, something misconfigured???

 

Thanks,
Lane

 

 

From the system logfile -

 

Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000020

Mar 10 02:34:08 internal kernel:  printing eip:

Mar 10 02:34:08 internal kernel: c013bbee

Mar 10 02:34:08 internal kernel: *pde = 00000000

Mar 10 02:34:08 internal kernel: Oops: 0000

Mar 10 02:34:08 internal kernel: sis sisfb agpgart 8139too mii usb-ohci usbcore ext3 jbd dpt_i2o sd_mod scsi_mod

Mar 10 02:34:08 internal kernel: CPU:    0

Mar 10 02:34:08 internal kernel: EIP:    0010:[<c013bbee>]    Not tainted

Mar 10 02:34:08 internal kernel: EFLAGS: 00010286

Mar 10 02:34:08 internal kernel:

Mar 10 02:34:08 internal kernel: EIP is at block_read_full_page [kernel] 0xe (2.4.18-26.7.x)

Mar 10 02:34:08 internal kernel: eax: 00000000   ebx: e1025d34   ecx: 00000000 edx: 00000000

Mar 10 02:34:08 internal kernel: esi: c15d46f0   edi: c02d4a24   ebp: c15d470c esp: e2261d90

Mar 10 02:34:08 internal kernel: ds: 0018   es: 0018   ss: 0018

Mar 10 02:34:08 internal kernel: Process pg_dump (pid: 13593, stackpage=e2261000)

Mar 10 02:34:08 internal kernel: Stack: 00000001 ded15500 e1043540 c01cc410 dd36b600 c020dd17 e2260000 0000000c

Mar 10 02:34:08 internal kernel:        e2261eb0 0000000c e1043540 00000282 c01cc431 dd36b600 00000000 00000000

Mar 10 02:34:08 internal kernel:        c01cd39b 00000283 0000000c e1025d34 c15d46f0 c02d4a24 00001417 c0128a23

Mar 10 02:34:08 internal kernel: Call Trace: [<c01cc410>] sock_wfree [kernel] 0x0 (0xe2261d9c))

Mar 10 02:34:08 internal kernel: [<c020dd17>] unix_write_space [kernel] 0x37 (0xe2261da4))

Mar 10 02:34:08 internal kernel: [<c01cc431>] sock_wfree [kernel] 0x21 (0xe2261dc0))

Mar 10 02:34:08 internal kernel: [<c01cd39b>] kfree_skbmem [kernel] 0xb (0xe2261dd0))

Mar 10 02:34:08 internal kernel: [<c0128a23>] __remove_inode_page [kernel] 0x33(0xe2261dec))

Mar 10 02:34:08 internal kernel: [<e7946a20>] ext3_get_block [ext3] 0x0 (0xe2261df4))

Mar 10 02:34:08 internal kernel: [<c012fdac>] reclaim_page [kernel] 0x1ec (0xe2261dfc))

Mar 10 02:34:08 internal kernel: [<c0132171>] __alloc_pages_limit [kernel] 0x71(0xe2261e1c))

Mar 10 02:34:08 internal kernel: [<c0132239>] __alloc_pages [kernel] 0x99 (0xe2261e30))

Mar 10 02:34:08 internal kernel: [<c0126cb0>] do_anonymous_page [kernel] 0x50 (0xe2261e64))

Mar 10 02:34:08 internal kernel: [<e7948e65>] ext3_mark_iloc_dirty [ext3] 0x35 (0xe2261e68))

Mar 10 02:34:08 internal kernel: [<c0126da3>] do_no_page [kernel] 0x33 (0xe2261e88))

Mar 10 02:34:08 internal kernel: [<c01cb02c>] sys_recvfrom [kernel] 0xec (0xe2261eac))

Mar 10 02:34:08 internal kernel: [<c0126fea>] handle_mm_fault [kernel] 0xca (0xe2261ec0))

Mar 10 02:34:08 internal kernel: [<c01324a0>] __get_free_pages [kernel] 0x10 (0xe2261ee0))

Mar 10 02:34:08 internal kernel: [<c0146b83>] __pollwait [kernel] 0x33 (0xe2261ee4))

Mar 10 02:34:08 internal kernel: [<c011456a>] do_page_fault [kernel] 0x12a (0xe2261f08))

Mar 10 02:34:08 internal kernel: [<c01286e9>] do_brk [kernel] 0x249 (0xe2261f44))

Mar 10 02:34:08 internal kernel: [<c01cb05d>] sys_recv [kernel] 0x1d (0xe2261f6c))

Mar 10 02:34:08 internal kernel: [<c0127452>] sys_brk [kernel] 0xb2 (0xe2261f94))

Mar 10 02:34:08 internal kernel: [<c0114440>] do_page_fault [kernel] 0x0 (0xe2261fb0))

Mar 10 02:34:08 internal kernel: [<c0108a4c>] error_code [kernel] 0x34 (0xe2261fb8))

Re: pg_dump strangeness

From
Joseph Shraibman
Date:
Lane Rollins wrote:

Do you notice a lot of memory being allocated?  Use the free command.  Is uptime high?

What is the output of this command (assuming the db is being run by user 'postgres')?
ps -w -w -o pid,rss,size,args --sort size -u postgres
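
Since the box eventually hangs, it may help to capture these figures to a file at intervals rather than reading them off the screen. The following is only a minimal sketch: the function name snapshot_memory and the log path are hypothetical, and it assumes the database runs as user 'postgres' as stated above.

```shell
#!/bin/sh
# Append one timestamped memory snapshot to a log file.
# snapshot_memory is a hypothetical helper name used for illustration.
snapshot_memory() {
    log=$1
    {
        date
        # System-wide memory, as suggested above.
        free
        # Per-process memory for the postgres user, sorted by size.
        ps -w -w -o pid,rss,size,args --sort size -u postgres
        echo
    } >> "$log" 2>&1
}

# Example (assumed path and interval): sample every 30 seconds
# in a second terminal while pg_dump runs.
# while sleep 30; do snapshot_memory /tmp/pgdump-mem.log; done
```

Writing to a file means the last few snapshots should still be readable after a reboot, provided they reached the disk before the crash.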


Re: pg_dump strangeness

From
"Lane Rollins"
Date:
I tried stopping and starting postmaster to see if it would release any
memory and it didn't.

I'll try doing the ps later tonight and see what happens. But here is
some info from top if that helps at all.

The machine in a quiet state
  5:02pm  up  1:30,  1 user,  load average: 0.00, 0.00, 0.00
77 processes: 74 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  0.0% user,  0.1% system,  0.0% nice, 99.8% idle
Mem:  1015028K av,  205732K used,  809296K free,       0K shrd,   38036K buff
Swap:  136512K av,       0K used,  136512K free                   86884K cached


After 3 consecutive pg_dumps
  3:15pm  up  5:47,  4 users,  load average: 0.06, 0.13, 0.21
111 processes: 108 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  3.5% user,  1.7% system,  0.3% nice, 94.2% idle
Mem:  1015028K av,  997768K used,   17260K free,       0K shrd,   76016K buff
Swap:  136512K av,   10484K used,  126028K free                  801560K cached

Last update before it died
  3:27pm  up  6:00,  4 users,  load average: 1.65, 1.90, 1.23
110 processes: 106 sleeping, 4 running, 0 zombie, 0 stopped
CPU states: 20.0% user, 20.2% system,  0.1% nice, 59.4% idle
Mem:  1015028K av, 1006508K used,    8520K free,       0K shrd,   76968K buff
Swap:  136512K av,   10484K used,  126028K free                  726256K cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 3641 postgres  16   0  110M 110M 26840 R    29.8 11.1   5:20 postmaster
 1227 root       5 -10 26152 6128  3008 S <   3.3  0.6  40:35 X
 1538 laner     15   0 12096  11M  6124 S     1.5  1.1   2:02 rhn-applet


Thanks again,
Lane

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Joseph Shraibman
> Sent: Monday, March 10, 2003 4:38 PM
> To: Lane Rollins
>
> Lane Rollins wrote:
>
> Do you notice a lot of memory being allocated?  Use the free command.  Is
> uptime high?
>
> What is the output of this command (assuming the db is being run by user
> 'postgres')?
> ps -w -w -o pid,rss,size,args --sort size -u postgres
>




Re: pg_dump strangeness

From
Neil Conway
Date:
On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
> Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
> dereference at virtual address 00000020

Looks like a kernel bug -- there's not much we can do to help, AFAIK.
Have you tried applying any errata that RH have put out for your kernel,
and/or reporting the problem to the appropriate source? (lkml, RH, etc.)

Cheers,

Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC




Re: pg_dump strangeness

From
Stephen Robert Norris
Date:
On Tue, 2003-03-11 at 16:57, Neil Conway wrote:
> On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
> > Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
> > dereference at virtual address 00000020
>
> Looks like a kernel bug -- there's not much we can do to help, AFAIK.
> Have you tried applying any errata that RH have put out for your kernel,
> and/or reporting the problem to the appropriate source? (lkml, RH, etc.)
>
> Cheers,
>
> Neil

It could also be bad memory - try memtest86 for an hour or two.

    Stephen


Re: pg_dump strangeness

From
"Lane Rollins"
Date:
The problem seems to be either a bad mainboard or bad memory. The machine
started crashing very regularly and finally stopped even booting. I ended
up moving the RAID board, the drives and one of the sticks of memory to
another box, and so far it's still running.

I'll try running the memory tests tomorrow, when I'm not stuck in meetings
all day.

Thanks for the help and suggestions,
-Lane

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
> owner@postgresql.org] On Behalf Of Stephen Robert Norris
> Sent: Tuesday, March 11, 2003 9:19 PM
> To: Neil Conway
> Cc: Lane Rollins; PostgreSQL General
> Subject: Re: [GENERAL] pg_dump strangeness
>
> On Tue, 2003-03-11 at 16:57, Neil Conway wrote:
> > On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
> > > Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
> > > dereference at virtual address 00000020
> >
> > Looks like a kernel bug -- there's not much we can do to help, AFAIK.
> > Have you tried applying any errata that RH have put out for your kernel,
> > and/or reporting the problem to the appropriate source? (lkml, RH, etc.)
> >
> > Cheers,
> >
> > Neil
>
> It could also be bad memory - try memtest86 for an hour or two.
>
>     Stephen