Thread: Strange Postgresql crash
Hello all, I've been experiencing strange crash, never really took care of it since it was happening only every 1-2 months or so. But lately, I've seen it a lot in the past week and I have no clue about it, other than the backups. So, here's some info about it and about my machine: When: it crashes at night, at around 4AM, during the backup: 00 3 * * * root /export/dbsystem/pg_backup.sh va > /dev/null 2>&1 00 4 * * * root /export/dbsystem/pg_backup.sh b > /dev/null 2>&1 I move the vacuum to another time, just to make sure they are not in conflict, who knows! Which version: 7.3.16, I used the tar.gz version from the website. Normally, in a crash the machine just hangs on a kernel panic. The person on site always reboot the machine before taking a look at it. But I never had any crash during the day or almost, maybe once, and the kernel panic was talking about APCI. But during the night I'm not sure if it's the same thing, I think I'll just disable the APCI from the kernel and see if its okay. Anyway, here's a quick log around 4AM, it doesn't say much... 10.1.1.54, is our monitoring machine, it only to the port using telnet. "2006-11-16 03:55:39 [8681] LOG: pq_recvbuf: unexpected EOF on client connection 2006-11-16 03:55:39 [8681] LOG: incomplete startup packet 2006-11-16 03:56:39 [8682] LOG: connection received: host=10.1.1.54 port=4754 2006-11-16 03:56:39 [8682] LOG: pq_recvbuf: unexpected EOF on client connection 2006-11-16 03:56:39 [8682] LOG: incomplete startup packet 2006-11-16 03:57:39 [8684] LOG: connection received: host=10.1.1.54 port=4775 2006-11-16 03:57:39 [8684] LOG: pq_recvbuf: unexpected EOF on client connection 2006-11-16 03:57:39 [8684] LOG: incomplete startup packet 2006-11-16 03:58:39 [8685] LOG: connection received: host=10.1.1.54 port=4828 2006-11-16 03:58:39 [8685] LOG: pq_recvbuf: unexpected EOF on client connection 2006-11-16 03:58:39 [8685] LOG: incomplete startup packet 2006-11-16 03:59:24 [8132] ERROR: parser: parse error at or near "WHEREligneid" at character 72 2006-11-16 03:59:24 [8132] LOG: statement: Update Appels Set controller=4506500413, agentassignedligne='1012261' WHEREligneid=4506500420 2006-11-16 03:59:49 [8686] LOG: connection received: host=10.1.1.54 port=4872 2006-11-16 03:59:50 [8686] LOG: pq_recvbuf: unexpected EOF on client connection 2006-11-16 03:59:50 [8686] LOG: incomplete startup packet 2006-11-16 04:00:02 [8702] LOG: connection received: host=10.1.1.45 port=50457 2006-11-16 04:00:02 [8702] LOG: connection authorized: user=postgres database=template1 2006-11-16 04:00:02 [8726] LOG: connection received: host=10.1.1.45 port=50458 2006-11-16 04:00:02 [8726] LOG: connection authorized: user=postgres database=martin_test 2006-11-16 04:00:29 [8744] LOG: connection received: host=10.1.1.45 port=50459 2006-11-16 04:00:29 [8744] LOG: connection authorized: user=postgres database=test 2006-11-16 04:00:29 [8762] LOG: connection received: host=10.1.1.45 port=50460 2006-11-16 04:00:29 [8762] LOG: connection authorized: user=postgres database=wincentrex 2006-11-16 04:00:39 [8763] LOG: connection received: host=10.1.1.54 port=4894 2006-11-16 04:00:40 [8763] LOG: pq_recvbuf: unexpected EOF on client connection 2006-11-16 04:00:40 [8763] LOG: incomplete startup packet 2006-11-16 04:02:26 [2534] LOG: database system was interrupted at 2006-11-16 03:57:36 EST 2006-11-16 04:02:26 [2534] LOG: checkpoint record is at C/6733EB68 2006-11-16 04:02:26 [2534] LOG: redo record is at C/6733EB68; undo record is at 0/0; shutdown FALSE 2006-11-16 04:02:26 [2534] LOG: next transaction id: 2720349894; next oid: 14377807 2006-11-16 04:02:26 [2534] LOG: database system was not properly shut down; automatic recovery in progress 2006-11-16 04:02:26 [2534] LOG: redo starts at C/6733EBA8 2006-11-16 04:02:27 [2534] LOG: ReadRecord: record with zero length at C/6735AB44 2006-11-16 04:02:27 [2534] LOG: redo done at C/6735AB20 2006-11-16 04:02:30 [2534] LOG: database system is ready" Here's our active settings in postgresql.conf: tcpip_socket = true max_connections = 64 port = 5432 hostname_lookup = false shared_buffers = 1520 # min max_connections*2 or 16, 8KB each #shared_buffers = 12288 # min max_connections*2 or 16, 8KB each max_fsm_relations = 1000 # min 10, fsm is free space map, ~40 bytes max_fsm_pages = 10000 # min 1000, fsm is free space map, ~6 bytes max_locks_per_transaction = 64 # min 10 wal_buffers = 8 # min 4, typically 8KB each sort_mem = 32168 # min 64, size in KB fsync = false enable_seqscan = true enable_indexscan = true enable_tidscan = true enable_sort = true enable_nestloop = true enable_mergejoin = true enable_hashjoin = true effective_cache_size =8000 # typically 8KB each random_page_cost = 4 # units are one sequential page fetch cost cpu_tuple_cost = 0.01 # (same) cpu_index_tuple_cost = 0.001 # (same) cpu_operator_cost = 0.0025 # (same) log_connections = false log_pid = true log_statement = false log_duration = false log_timestamp = true log_min_error_statement = notice # Values in order of increasing severity: # debug5, debug4, debug3, debug2, debug1, # info, notice, warning, error, panic(off) syslog = 0 # range 0-2 syslog_facility = 'LOCAL0' syslog_ident = 'postgres' LC_MESSAGES = 'en_US' LC_MONETARY = 'en_US' LC_NUMERIC = 'en_US' LC_TIME = 'en_US' I tested my memory with memtest, and it's perfect. I also did some stress test within Linux, using stress and donnie++ to see if it would crash with APCI or not, while doing a dump... So far its okay. The machine: Linux aquilonII 2.6.17-1.2142_FC4 #1 Tue Jul 11 22:41:14 EDT 2006 i686 i686 i386 GNU/Linux Any one has a suggestion ? -- Eric Rousse 514-655-1001 Telmatik inc. 204 Montarville, suite 250 Boucherville, QC, Canada J4B 6S2 www.telmatik.com
Eric Rousse wrote: > Hello all, > > I've been experiencing strange crash, never really took care of it since > it was happening only every 1-2 months or so. But lately, I've seen it a > lot in the past week and I have no clue about it, other than the backups. > > So, here's some info about it and about my machine: > > When: it crashes at night, at around 4AM, during the backup: > > 00 3 * * * root /export/dbsystem/pg_backup.sh va > /dev/null 2>&1 > 00 4 * * * root /export/dbsystem/pg_backup.sh b > /dev/null 2>&1 > > I move the vacuum to another time, just to make sure they are not in > conflict, who knows! > Is there anything else running at that time? What does /etc/crontab have? I ask because my fedora box has cron.daily scripts run at 4:02am by default. brian
Eric Rousse <eric.rousse@telmatik.com> writes: > ... > 2006-11-16 04:00:39 [8763] LOG: connection received: host=10.1.1.54 > port=4894 > 2006-11-16 04:00:40 [8763] LOG: pq_recvbuf: unexpected EOF on client > connection > 2006-11-16 04:00:40 [8763] LOG: incomplete startup packet > 2006-11-16 04:02:26 [2534] LOG: database system was interrupted at > 2006-11-16 03:57:36 EST > 2006-11-16 04:02:26 [2534] LOG: checkpoint record is at C/6733EB68 > ... I think what you're seeing here is probably a kernel-level crash and system reboot. It's not any normal sort of Postgres problem, because if it were you'd see the postmaster bleating about crash of one of its child processes. Here it appears that the postmaster and all its children died at once leaving no messages behind --- and that just doesn't happen without either manual intervention or a system crash. If it seems to be triggered by running a PG backup, it could be that you've got a disk hardware problem that only manifests when you try to read a particular data block :-(. Have you tried running "badblocks"? regards, tom lane
duh! right. I didn't thought about this one!! but the strange thing though is that it doesn't happen frequently, only recently it started to crash regularly. here's the content of the crontab: SHELL=/bin/bash PATH=/sbin:/bin:/usr/sbin:/usr/bin MAILTO=root HOME=/ # run-parts 01 * * * * root run-parts /etc/cron.hourly 02 4 * * * root run-parts /etc/cron.daily 22 4 * * 0 root run-parts /etc/cron.weekly 42 4 1 * * root run-parts /etc/cron.monthly 00 3 * * * root /export/dbsystem/pg_backup.sh va > /dev/null 2>&1 00 4 * * * root /export/dbsystem/pg_backup.sh b > /dev/null 2>&1 00 5 * * * root rsync --password-file=/etc/.rs_sec -Cauzbqr /export/dbsystem/backup/ rsync://root@aquilon/rsync/ I'll move cron.daily to 4:30 brian a écrit : > Eric Rousse wrote: >> Hello all, >> >> I've been experiencing strange crash, never really took care of it >> since it was happening only every 1-2 months or so. But lately, I've >> seen it a lot in the past week and I have no clue about it, other >> than the backups. >> >> So, here's some info about it and about my machine: >> >> When: it crashes at night, at around 4AM, during the backup: >> >> 00 3 * * * root /export/dbsystem/pg_backup.sh va > /dev/null 2>&1 >> 00 4 * * * root /export/dbsystem/pg_backup.sh b > /dev/null 2>&1 >> >> I move the vacuum to another time, just to make sure they are not in >> conflict, who knows! >> > > Is there anything else running at that time? What does /etc/crontab > have? I ask because my fedora box has cron.daily scripts run at 4:02am > by default. > > brian > > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > -- Eric Rousse 514-655-1001 Telmatik inc. 204 Montarville, suite 250 Boucherville, QC, Canada J4B 6S2 www.telmatik.com
Hi Tom,
Yeah, that's what I suspect it seems more like a hardware/os issue. Since, I really have no proof against PostgreSQL, other
than my daily dump that crashes *sometimes*.
I didn't know about badblocks, I'll try this one. Last time I did a full fsck check on all the volumes, and everything was clean.
But I have a sata raid on this server, I never knew if a raid would actually replicate badblocks to the other disk ?
Thanks for your advice!
Tom Lane a écrit :
Yeah, that's what I suspect it seems more like a hardware/os issue. Since, I really have no proof against PostgreSQL, other
than my daily dump that crashes *sometimes*.
I didn't know about badblocks, I'll try this one. Last time I did a full fsck check on all the volumes, and everything was clean.
But I have a sata raid on this server, I never knew if a raid would actually replicate badblocks to the other disk ?
Thanks for your advice!
Tom Lane a écrit :
Eric Rousse <eric.rousse@telmatik.com> writes:... 2006-11-16 04:00:39 [8763] LOG: connection received: host=10.1.1.54 port=4894 2006-11-16 04:00:40 [8763] LOG: pq_recvbuf: unexpected EOF on client connection 2006-11-16 04:00:40 [8763] LOG: incomplete startup packet 2006-11-16 04:02:26 [2534] LOG: database system was interrupted at 2006-11-16 03:57:36 EST 2006-11-16 04:02:26 [2534] LOG: checkpoint record is at C/6733EB68 ...I think what you're seeing here is probably a kernel-level crash and system reboot. It's not any normal sort of Postgres problem, because if it were you'd see the postmaster bleating about crash of one of its child processes. Here it appears that the postmaster and all its children died at once leaving no messages behind --- and that just doesn't happen without either manual intervention or a system crash. If it seems to be triggered by running a PG backup, it could be that you've got a disk hardware problem that only manifests when you try to read a particular data block :-(. Have you tried running "badblocks"? regards, tom lane
-- Eric Rousse 514-655-1001 Telmatik inc. 204 Montarville, suite 250 Boucherville, QC, Canada J4B 6S2 www.telmatik.com