Thread: pg_dump strangeness
I’m having an issue with pg_dump crashing one of my servers. I was running PG 7.2.1 and am now running 7.2.4 on RedHat 7.3 with up-to-date patches. It happens when I’m dumping a largish (for me) database. The database has two tables, one with 1.2 million entries and the other with 3.5 million entries; there are also about 700,000 blobs with signatures. The exact command I’m using is:
pg_dump -Fc -b docarc >docarc.cust
It usually doesn’t happen on the first iteration; it’s the second that brings the box down. I ran it by hand on the console Saturday and it slowly destabilized the system. I lost the title bars on the windows and then the GNOME task bar. Only the mouse cursor moved, but it did not respond to clicks or the keyboard. I was able to restart the box from a telnet session.
I added more memory to the box and that seems to be helping. It now takes four runs to kill the box.
Any clue to the root of the problem? OS, hardware, postgresql, something misconfigured???
Thanks,
Lane
From the system logfile -
Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000020
Mar 10 02:34:08 internal kernel: printing eip:
Mar 10 02:34:08 internal kernel: c013bbee
Mar 10 02:34:08 internal kernel: *pde = 00000000
Mar 10 02:34:08 internal kernel: Oops: 0000
Mar 10 02:34:08 internal kernel: sis sisfb agpgart 8139too mii usb-ohci usbcore ext3 jbd dpt_i2o sd_mod scsi_mod
Mar 10 02:34:08 internal kernel: CPU: 0
Mar 10 02:34:08 internal kernel: EIP: 0010:[<c013bbee>] Not tainted
Mar 10 02:34:08 internal kernel: EFLAGS: 00010286
Mar 10 02:34:08 internal kernel:
Mar 10 02:34:08 internal kernel: EIP is at block_read_full_page [kernel] 0xe (2.4.18-26.7.x)
Mar 10 02:34:08 internal kernel: eax: 00000000 ebx: e1025d34 ecx: 00000000 edx: 00000000
Mar 10 02:34:08 internal kernel: esi: c15d46f0 edi: c02d4a24 ebp: c15d470c esp: e2261d90
Mar 10 02:34:08 internal kernel: ds: 0018 es: 0018 ss: 0018
Mar 10 02:34:08 internal kernel: Process pg_dump (pid: 13593, stackpage=e2261000)
Mar 10 02:34:08 internal kernel: Stack: 00000001 ded15500 e1043540 c01cc410 dd36b600 c020dd17 e2260000 0000000c
Mar 10 02:34:08 internal kernel: e2261eb0 0000000c e1043540 00000282 c01cc431 dd36b600 00000000 00000000
Mar 10 02:34:08 internal kernel: c01cd39b 00000283 0000000c e1025d34 c15d46f0 c02d4a24 00001417 c0128a23
Mar 10 02:34:08 internal kernel: Call Trace: [<c01cc410>] sock_wfree [kernel] 0x0 (0xe2261d9c))
Mar 10 02:34:08 internal kernel: [<c020dd17>] unix_write_space [kernel] 0x37 (0xe2261da4))
Mar 10 02:34:08 internal kernel: [<c01cc431>] sock_wfree [kernel] 0x21 (0xe2261dc0))
Mar 10 02:34:08 internal kernel: [<c01cd39b>] kfree_skbmem [kernel] 0xb (0xe2261dd0))
Mar 10 02:34:08 internal kernel: [<c0128a23>] __remove_inode_page [kernel] 0x33(0xe2261dec))
Mar 10 02:34:08 internal kernel: [<e7946a20>] ext3_get_block [ext3] 0x0 (0xe2261df4))
Mar 10 02:34:08 internal kernel: [<c012fdac>] reclaim_page [kernel] 0x1ec (0xe2261dfc))
Mar 10 02:34:08 internal kernel: [<c0132171>] __alloc_pages_limit [kernel] 0x71(0xe2261e1c))
Mar 10 02:34:08 internal kernel: [<c0132239>] __alloc_pages [kernel] 0x99 (0xe2261e30))
Mar 10 02:34:08 internal kernel: [<c0126cb0>] do_anonymous_page [kernel] 0x50 (0xe2261e64))
Mar 10 02:34:08 internal kernel: [<e7948e65>] ext3_mark_iloc_dirty [ext3] 0x35 (0xe2261e68))
Mar 10 02:34:08 internal kernel: [<c0126da3>] do_no_page [kernel] 0x33 (0xe2261e88))
Mar 10 02:34:08 internal kernel: [<c01cb02c>] sys_recvfrom [kernel] 0xec (0xe2261eac))
Mar 10 02:34:08 internal kernel: [<c0126fea>] handle_mm_fault [kernel] 0xca (0xe2261ec0))
Mar 10 02:34:08 internal kernel: [<c01324a0>] __get_free_pages [kernel] 0x10 (0xe2261ee0))
Mar 10 02:34:08 internal kernel: [<c0146b83>] __pollwait [kernel] 0x33 (0xe2261ee4))
Mar 10 02:34:08 internal kernel: [<c011456a>] do_page_fault [kernel] 0x12a (0xe2261f08))
Mar 10 02:34:08 internal kernel: [<c01286e9>] do_brk [kernel] 0x249 (0xe2261f44))
Mar 10 02:34:08 internal kernel: [<c01cb05d>] sys_recv [kernel] 0x1d (0xe2261f6c))
Mar 10 02:34:08 internal kernel: [<c0127452>] sys_brk [kernel] 0xb2 (0xe2261f94))
Mar 10 02:34:08 internal kernel: [<c0114440>] do_page_fault [kernel] 0x0 (0xe2261fb0))
Mar 10 02:34:08 internal kernel: [<c0108a4c>] error_code [kernel] 0x34 (0xe2261fb8))
Lane Rollins wrote:

Do you notice a lot of memory being allocated? Use the free command. Is uptime high?

What is the output of this command (assuming the db is being run by user 'postgres')?

ps -w -w -o pid,rss,size,args --sort size -u postgres
I tried stopping and starting postmaster to see if it would release any memory, and it didn't. I'll try doing the ps later tonight and see what happens. But here is some info from top, if that helps at all.

The machine in a quiet state:

5:02pm up 1:30, 1 user, load average: 0.00, 0.00, 0.00
77 processes: 74 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 0.0% user, 0.1% system, 0.0% nice, 99.8% idle
Mem: 1015028K av, 205732K used, 809296K free, 0K shrd, 38036K buff
Swap: 136512K av, 0K used, 136512K free 86884K cached

After 3 consecutive pg_dumps:

3:15pm up 5:47, 4 users, load average: 0.06, 0.13, 0.21
111 processes: 108 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 3.5% user, 1.7% system, 0.3% nice, 94.2% idle
Mem: 1015028K av, 997768K used, 17260K free, 0K shrd, 76016K buff
Swap: 136512K av, 10484K used, 126028K free 801560K cached

Last update before it died:

3:27pm up 6:00, 4 users, load average: 1.65, 1.90, 1.23
110 processes: 106 sleeping, 4 running, 0 zombie, 0 stopped
CPU states: 20.0% user, 20.2% system, 0.1% nice, 59.4% idle
Mem: 1015028K av, 1006508K used, 8520K free, 0K shrd, 76968K buff
Swap: 136512K av, 10484K used, 126028K free 726256K cached

PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
3641 postgres 16 0 110M 110M 26840 R 29.8 11.1 5:20 postmaster
1227 root 5 -10 26152 6128 3008 S < 3.3 0.6 40:35 X
1538 laner 15 0 12096 11M 6124 S 1.5 1.1 2:02 rhn-applet

Thanks again,
Lane

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Joseph Shraibman
> Sent: Monday, March 10, 2003 4:38 PM
> To: Lane Rollins
>
> Lane Rollins wrote:
>
> Do you notice a lot of memory being allocated? Use the free command. Is uptime high?
>
> What is the output of this command (assuming the db is being run by user 'postgres')?
> ps -w -w -o pid,rss,size,args --sort size -u postgres
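Editor's note: the three top snapshots above are easier to compare once buffers and page cache are subtracted from the "used" figure, since on Linux that memory is normally reclaimable. The sketch below does that arithmetic on the Mem/Swap lines exactly as posted. It was not part of the thread; the function and variable names are hypothetical, and the regular expressions assume the old procps layout shown in this post.

```python
import re

# Assumed layout: the "Mem:" line as printed by the top version in the post,
# with the "cached" figure appearing at the end of the following "Swap:" line.
MEM_RE = re.compile(
    r"Mem:\s+(\d+)K av,\s+(\d+)K used,\s+(\d+)K free,\s+\d+K shrd,\s+(\d+)K buff"
)
CACHED_RE = re.compile(r"(\d+)K cached")


def app_memory_kb(snapshot: str) -> int:
    """Return 'used' minus buffers and page cache, in KB (hypothetical helper)."""
    mem = MEM_RE.search(snapshot)
    cached = CACHED_RE.search(snapshot)
    if mem is None or cached is None:
        raise ValueError("not a recognizable top memory summary")
    used = int(mem.group(2))
    buff = int(mem.group(4))
    return used - buff - int(cached.group(1))


# The three summaries copied from the post above.
SNAPSHOTS = {
    "quiet": "Mem: 1015028K av, 205732K used, 809296K free, 0K shrd, 38036K buff "
             "Swap: 136512K av, 0K used, 136512K free 86884K cached",
    "after 3 dumps": "Mem: 1015028K av, 997768K used, 17260K free, 0K shrd, 76016K buff "
                     "Swap: 136512K av, 10484K used, 126028K free 801560K cached",
    "before crash": "Mem: 1015028K av, 1006508K used, 8520K free, 0K shrd, 76968K buff "
                    "Swap: 136512K av, 10484K used, 126028K free 726256K cached",
}

if __name__ == "__main__":
    for label, text in SNAPSHOTS.items():
        print(f"{label}: {app_memory_kb(text)} KB used outside buffers/cache")
```

By this measure the box goes from roughly 80 MB to roughly 200 MB of non-cache memory in use, with most of the "used" total being file cache from the dumps. That is only a reading of the posted numbers and does not by itself explain the kernel oops.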
On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
> Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
> dereference at virtual address 00000020

Looks like a kernel bug -- there's not much we can do to help, AFAIK. Have you tried applying any errata that RH have put out for your kernel, and/or reporting the problem to the appropriate source? (lkml, RH, etc.)

Cheers,
Neil

--
Neil Conway <neilc@samurai.com> || PGP Key ID: DB3C29FC
On Tue, 2003-03-11 at 16:57, Neil Conway wrote:
> On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
> > Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
> > dereference at virtual address 00000020
>
> Looks like a kernel bug -- there's not much we can do to help, AFAIK.
> Have you tried applying any errata that RH have put out for your kernel,
> and/or reporting the problem to the appropriate source? (lkml, RH, etc.)
>
> Cheers,
>
> Neil

It could also be bad memory - try memtest86 for an hour or two.

Stephen
The problem seems to be either a bad mainboard or bad memory. The machine decided it was going to start crashing very regularly and finally stopped even booting. I ended up moving the RAID board, drives and one of the sticks of memory to another box, and so far it's still running. I'll try running the memory tests tomorrow. I'm not stuck in meetings all day.

Thanks for the help and suggestions,
-Lane

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Stephen Robert Norris
> Sent: Tuesday, March 11, 2003 9:19 PM
> To: Neil Conway
> Cc: Lane Rollins; PostgreSQL General
> Subject: Re: [GENERAL] pg_dump strangeness
>
> On Tue, 2003-03-11 at 16:57, Neil Conway wrote:
> > On Mon, 2003-03-10 at 19:15, Lane Rollins wrote:
> > > Mar 10 02:34:08 internal kernel: Unable to handle kernel NULL pointer
> > > dereference at virtual address 00000020
> >
> > Looks like a kernel bug -- there's not much we can do to help, AFAIK.
> > Have you tried applying any errata that RH have put out for your kernel,
> > and/or reporting the problem to the appropriate source? (lkml, RH, etc.)
> >
> > Cheers,
> >
> > Neil
>
> It could also be bad memory - try memtest86 for an hour or two.
>
> Stephen