Thread: Strange Postgresql crash

Strange Postgresql crash

From

Eric Rousse

Date:

16 November 2006, 16:51:59

Hello all,

I've been experiencing strange crash, never really took care of it since
it was happening only every 1-2 months or so. But lately, I've seen it a
lot in the past week and I have no clue about it, other than the backups.

So, here's some info about it and about my machine:

When: it crashes at night, at around 4AM, during the backup:

00 3 * * * root /export/dbsystem/pg_backup.sh va > /dev/null 2>&1
00 4 * * * root /export/dbsystem/pg_backup.sh b > /dev/null 2>&1

I move the vacuum to another time, just to make sure they are not in
conflict, who knows!

Which version: 7.3.16, I used the tar.gz version from the website.


Normally, in a crash the machine just hangs on a kernel panic. The
person on site always reboot the machine before taking a look at it. But
I never had any crash during the day or almost, maybe once, and the
kernel panic was talking about APCI. But during the night I'm not sure
if it's the same thing, I think I'll just disable the APCI from the
kernel and see if its okay.

Anyway, here's a quick log around 4AM, it doesn't say much...
10.1.1.54, is our monitoring machine, it only to the port using telnet.

"2006-11-16 03:55:39 [8681]   LOG:  pq_recvbuf: unexpected EOF on client
connection
2006-11-16 03:55:39 [8681]   LOG:  incomplete startup packet
2006-11-16 03:56:39 [8682]   LOG:  connection received: host=10.1.1.54
port=4754
2006-11-16 03:56:39 [8682]   LOG:  pq_recvbuf: unexpected EOF on client
connection
2006-11-16 03:56:39 [8682]   LOG:  incomplete startup packet
2006-11-16 03:57:39 [8684]   LOG:  connection received: host=10.1.1.54
port=4775
2006-11-16 03:57:39 [8684]   LOG:  pq_recvbuf: unexpected EOF on client
connection
2006-11-16 03:57:39 [8684]   LOG:  incomplete startup packet
2006-11-16 03:58:39 [8685]   LOG:  connection received: host=10.1.1.54
port=4828
2006-11-16 03:58:39 [8685]   LOG:  pq_recvbuf: unexpected EOF on client
connection
2006-11-16 03:58:39 [8685]   LOG:  incomplete startup packet
2006-11-16 03:59:24 [8132]   ERROR:  parser: parse error at or near
"WHEREligneid" at character 72
2006-11-16 03:59:24 [8132]   LOG:  statement: Update Appels Set
controller=4506500413, agentassignedligne='1012261'  WHEREligneid=4506500420
2006-11-16 03:59:49 [8686]   LOG:  connection received: host=10.1.1.54
port=4872
2006-11-16 03:59:50 [8686]   LOG:  pq_recvbuf: unexpected EOF on client
connection
2006-11-16 03:59:50 [8686]   LOG:  incomplete startup packet
2006-11-16 04:00:02 [8702]   LOG:  connection received: host=10.1.1.45
port=50457
2006-11-16 04:00:02 [8702]   LOG:  connection authorized: user=postgres
database=template1
2006-11-16 04:00:02 [8726]   LOG:  connection received: host=10.1.1.45
port=50458
2006-11-16 04:00:02 [8726]   LOG:  connection authorized: user=postgres
database=martin_test
2006-11-16 04:00:29 [8744]   LOG:  connection received: host=10.1.1.45
port=50459
2006-11-16 04:00:29 [8744]   LOG:  connection authorized: user=postgres
database=test
2006-11-16 04:00:29 [8762]   LOG:  connection received: host=10.1.1.45
port=50460
2006-11-16 04:00:29 [8762]   LOG:  connection authorized: user=postgres
database=wincentrex
2006-11-16 04:00:39 [8763]   LOG:  connection received: host=10.1.1.54
port=4894
2006-11-16 04:00:40 [8763]   LOG:  pq_recvbuf: unexpected EOF on client
connection
2006-11-16 04:00:40 [8763]   LOG:  incomplete startup packet
2006-11-16 04:02:26 [2534]   LOG:  database system was interrupted at
2006-11-16 03:57:36 EST
2006-11-16 04:02:26 [2534]   LOG:  checkpoint record is at C/6733EB68
2006-11-16 04:02:26 [2534]   LOG:  redo record is at C/6733EB68; undo
record is at 0/0; shutdown FALSE
2006-11-16 04:02:26 [2534]   LOG:  next transaction id: 2720349894; next
oid: 14377807
2006-11-16 04:02:26 [2534]   LOG:  database system was not properly shut
down; automatic recovery in progress
2006-11-16 04:02:26 [2534]   LOG:  redo starts at C/6733EBA8
2006-11-16 04:02:27 [2534]   LOG:  ReadRecord: record with zero length
at C/6735AB44
2006-11-16 04:02:27 [2534]   LOG:  redo done at C/6735AB20
2006-11-16 04:02:30 [2534]   LOG:  database system is ready"


Here's our active settings in postgresql.conf:

tcpip_socket = true
max_connections = 64
port = 5432
hostname_lookup = false
shared_buffers = 1520   # min max_connections*2 or 16, 8KB each
#shared_buffers = 12288  # min max_connections*2 or 16, 8KB each
max_fsm_relations = 1000        # min 10, fsm is free space map, ~40 bytes
max_fsm_pages = 10000           # min 1000, fsm is free space map, ~6 bytes
max_locks_per_transaction = 64  # min 10
wal_buffers = 8         # min 4, typically 8KB each
sort_mem = 32168                # min 64, size in KB
fsync = false
enable_seqscan = true
enable_indexscan = true
enable_tidscan = true
enable_sort = true
enable_nestloop = true
enable_mergejoin = true
enable_hashjoin = true

effective_cache_size =8000      # typically 8KB each
random_page_cost = 4            # units are one sequential page fetch cost
cpu_tuple_cost = 0.01           # (same)
cpu_index_tuple_cost = 0.001    # (same)
cpu_operator_cost = 0.0025      # (same)
log_connections = false
log_pid = true
log_statement = false
log_duration = false
log_timestamp = true
log_min_error_statement = notice # Values in order of increasing severity:
                                 #   debug5, debug4, debug3, debug2, debug1,
                                 #   info, notice, warning, error,
panic(off)
syslog = 0                      # range 0-2
syslog_facility = 'LOCAL0'
syslog_ident = 'postgres'
LC_MESSAGES = 'en_US'
LC_MONETARY = 'en_US'
LC_NUMERIC = 'en_US'
LC_TIME = 'en_US'

I tested my memory with memtest, and it's perfect. I also did some
stress test within Linux, using stress and donnie++ to see if it would
crash with APCI or not, while doing a dump... So far its okay.

The machine: Linux aquilonII 2.6.17-1.2142_FC4 #1 Tue Jul 11 22:41:14
EDT 2006 i686 i686 i386 GNU/Linux

Any one has a suggestion ?

--
Eric Rousse
514-655-1001

Telmatik inc.
204 Montarville, suite 250
Boucherville, QC, Canada
J4B 6S2

www.telmatik.com

Re: Strange Postgresql crash

From

brian

Date:

16 November 2006, 17:18:02

Eric Rousse wrote:
> Hello all,
>
> I've been experiencing strange crash, never really took care of it since
> it was happening only every 1-2 months or so. But lately, I've seen it a
> lot in the past week and I have no clue about it, other than the backups.
>
> So, here's some info about it and about my machine:
>
> When: it crashes at night, at around 4AM, during the backup:
>
> 00 3 * * * root /export/dbsystem/pg_backup.sh va > /dev/null 2>&1
> 00 4 * * * root /export/dbsystem/pg_backup.sh b > /dev/null 2>&1
>
> I move the vacuum to another time, just to make sure they are not in
> conflict, who knows!
>

Is there anything else running at that time? What does /etc/crontab
have? I ask because my fedora box has cron.daily scripts run at 4:02am
by default.

brian

Re: Strange Postgresql crash

From

Tom Lane

Date:

16 November 2006, 17:32:21

Eric Rousse <eric.rousse@telmatik.com> writes:
> ...
> 2006-11-16 04:00:39 [8763]   LOG:  connection received: host=10.1.1.54
> port=4894
> 2006-11-16 04:00:40 [8763]   LOG:  pq_recvbuf: unexpected EOF on client
> connection
> 2006-11-16 04:00:40 [8763]   LOG:  incomplete startup packet
> 2006-11-16 04:02:26 [2534]   LOG:  database system was interrupted at
> 2006-11-16 03:57:36 EST
> 2006-11-16 04:02:26 [2534]   LOG:  checkpoint record is at C/6733EB68
> ...

I think what you're seeing here is probably a kernel-level crash and
system reboot.  It's not any normal sort of Postgres problem, because
if it were you'd see the postmaster bleating about crash of one of its
child processes.  Here it appears that the postmaster and all its
children died at once leaving no messages behind --- and that just
doesn't happen without either manual intervention or a system crash.

If it seems to be triggered by running a PG backup, it could be that
you've got a disk hardware problem that only manifests when you try to
read a particular data block :-(.  Have you tried running "badblocks"?

            regards, tom lane

Re: Strange Postgresql crash

From

Eric Rousse

Date:

16 November 2006, 18:29:06

duh! right. I didn't thought about this one!! but the strange thing
though is that it doesn't happen frequently, only recently it started to
crash regularly.

here's the content of the crontab:

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/

# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly

00 3 * * * root /export/dbsystem/pg_backup.sh va > /dev/null 2>&1
00 4 * * * root /export/dbsystem/pg_backup.sh b > /dev/null 2>&1
00 5 * * * root rsync --password-file=/etc/.rs_sec -Cauzbqr
/export/dbsystem/backup/ rsync://root@aquilon/rsync/


I'll move cron.daily to 4:30

brian a écrit :
> Eric Rousse wrote:
>> Hello all,
>>
>> I've been experiencing strange crash, never really took care of it
>> since it was happening only every 1-2 months or so. But lately, I've
>> seen it a lot in the past week and I have no clue about it, other
>> than the backups.
>>
>> So, here's some info about it and about my machine:
>>
>> When: it crashes at night, at around 4AM, during the backup:
>>
>> 00 3 * * * root /export/dbsystem/pg_backup.sh va > /dev/null 2>&1
>> 00 4 * * * root /export/dbsystem/pg_backup.sh b > /dev/null 2>&1
>>
>> I move the vacuum to another time, just to make sure they are not in
>> conflict, who knows!
>>
>
> Is there anything else running at that time? What does /etc/crontab
> have? I ask because my fedora box has cron.daily scripts run at 4:02am
> by default.
>
> brian
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>


--
Eric Rousse
514-655-1001

Telmatik inc.
204 Montarville, suite 250
Boucherville, QC, Canada
J4B 6S2

www.telmatik.com

Re: Strange Postgresql crash

From

Eric Rousse

Date:

16 November 2006, 18:34:47

Hi Tom,

Yeah, that's what I suspect it seems more like a hardware/os issue. Since, I really have no proof against PostgreSQL, other
than my daily dump that crashes *sometimes*.

I didn't know about badblocks, I'll try this one. Last time I did a full fsck check on all the volumes, and everything was clean.
But I have a sata raid on this server, I never knew if a raid would actually replicate badblocks to the other disk ?

Thanks for your advice!

Tom Lane a écrit :

Eric Rousse <eric.rousse@telmatik.com> writes:

...
2006-11-16 04:00:39 [8763]   LOG:  connection received: host=10.1.1.54 
port=4894
2006-11-16 04:00:40 [8763]   LOG:  pq_recvbuf: unexpected EOF on client 
connection
2006-11-16 04:00:40 [8763]   LOG:  incomplete startup packet
2006-11-16 04:02:26 [2534]   LOG:  database system was interrupted at 
2006-11-16 03:57:36 EST
2006-11-16 04:02:26 [2534]   LOG:  checkpoint record is at C/6733EB68
...

I think what you're seeing here is probably a kernel-level crash and
system reboot.  It's not any normal sort of Postgres problem, because
if it were you'd see the postmaster bleating about crash of one of its
child processes.  Here it appears that the postmaster and all its
children died at once leaving no messages behind --- and that just
doesn't happen without either manual intervention or a system crash.

If it seems to be triggered by running a PG backup, it could be that
you've got a disk hardware problem that only manifests when you try to
read a particular data block :-(.  Have you tried running "badblocks"?
		regards, tom lane

-- 
Eric Rousse
514-655-1001

Telmatik inc.
204 Montarville, suite 250
Boucherville, QC, Canada
J4B 6S2

www.telmatik.com