Thread: PosgreSQL is crashing with a signal 11 - Bug?

PosgreSQL is crashing with a signal 11 - Bug?

From
Rafael Martinez Guerrero
Date:
Hello

We have a problem with one of our central databases and we need your
help to find a solution.

----------------------
* Description:=20
----------------------
"PosgreSQL is crashing with a signal 11 and we do not think is a
hardware problem" :(

Since last week, we are having a big problem with one of our postgreSQL
installations. The database is not so big but it is used intensely with
different jobs running parallel transactions.

The first time the database crashed, we had been running 7.3.5 for a
long time without problems. Because the signal 11, we thought it was a
problem with defective memory, we changed RAM in the server and restored
the database from last backup. The memory was defective and we thought
we found the problem to our signal 11.

After some hours the database crashed again with the samme error. We did
not take any chances and moved the database to a new server. This did
not help and we got the samme problem after some hours. We updated to
7.3.7 hoping for the best but it did not help either.

Today it crashed again but this time we have logged more information
from the crash and we hope you can help us to find a solution to this
problem.=20

Below, some relevant information about our system. Please do not
hesitate to ask for more information if you need it.

------------------------
* OS/Machine/Filesystem:=20
------------------------
- Red Hat Enterprise Linux WS release 3 (Taroon Update 3)
  kernel 2.4.21-15.0.3.ELsmp

- Dell 2650:=20
  2 x Intel(R)Xeon(TM)CPU 2.40GHz
  2GB RAM
  PERCRAID Mirror 2 x 73GB

- LVM - ext3

-----------------------------
* Version / compilator / libc
-----------------------------
- PostgreSQL 7.3.7

- Options given to 'configure' script when PostgreSQL was built:
'--prefix=3D/local/opt/postgresql' '--mandir=3D/local/share/man'
'--with-openssl=3D/local' '--with-perl' '--with-java' 'CC=3Dcc-wrapper'
'LDFLAGS=3D-L/local/lib'

- gcc version 3.2.3 20030502 (Red Hat Linux 3.2.3-42)
- glibc-2.3.2-95.27 (RHEL3)

----------------------
* Configuration / DBSize
----------------------

- Relevant parameters that have been changed by us:

postgresql.conf:
max_connections =3D 600
superuser_reserved_connections =3D 2

shared_buffers =3D 8192
max_fsm_relations =3D 1000
max_fsm_pages =3D 20000
wal_buffers =3D 64=20

sort_mem =3D 2048
vacuum_mem =3D 32768=20
fsync =3D true

effective_cache_size =3D 131072=20
random_page_cost =3D 2

stats_start_collector =3D true
stats_command_string =3D true
stats_row_level =3D true
stats_block_level =3D true

autocommit=3Dfalse


/etc/sysctl.conf:
kernel.shmall =3D 134217728
kernel.shmmax =3D 134217728


-bash-2.05b# du -h
4.2M    ./base/1
3.6M    ./base/16975
4.0K    ./base/63684339/pgsql_tmp
2.3G    ./base/63684339
2.3G    ./base
168K    ./global
129M    ./pg_xlog
8.6M    ./pg_clog
2.4G    .

-----------------------------------------------------------
* Information from CORE dump we got without --enable-debug.
-----------------------------------------------------------

This GDB was configured as "i386-redhat-linux-gnu"...(no debugging
symbols found)...Using host libthread_db library
"/lib/tls/libthread_db.so.1".

Program terminated with signal 11, Segmentation fault.

[......]
(gdb) bt
#0  0xb734d07c in memcpy () from /lib/tls/libc.so.6
#1  0x0806bba8 in DataFill ()
#2  0x0806c3ee in heap_formtuple ()
#3  0x080d1af1 in ExecTargetList ()
#4  0x080d1cdb in ExecProject ()
#5  0x080d1d7d in ExecScan ()
#6  0x080d5b5e in ExecIndexScan ()
#7  0x080cfd91 in ExecProcNode ()
#8  0x082fcd08 in ?? ()
#9  0x00000000 in ?? ()
#10 0x082ff120 in ?? ()
#11 0x00000000 in ?? ()
#12 0x082f9a78 in ?? ()
#13 0xbfff8028 in ?? ()
#14 0x080d6e5a in ExecMergeJoin ()
Previous frame inner to this frame (corrupt stack?)
---------------------------------------------------------

-----------------------------------------------------------
* Information from CORE dump we got with --enable-debug. We have
compiled a new version of postgres and run it through gdb with the core
dump we had/got from postgres without --enable-debug.=20
-----------------------------------------------------------

#0  0xb734d07c in memcpy () from /lib/tls/libc.so.6

#1  0x0806bba8 in DataFill (data=3D0xb7489000 <Address 0xb7489000 out of
bounds>, tupleDesc=3D0x82fd554, value=3D0x82fd550, nulls=3D0xbfff7ec0 "  n =
 ",
infomask=3D0x836e904c, bit=3D0x836e904f "\003\f") at heaptuple.c:139

#2  0x0806c3ee in heap_formtuple (tupleDescriptor=3D0x82fd620,
value=3D0x82fd550, nulls=3D0xbfff7ec0 "  n  ") at heaptuple.c:623

#3  0x080d1af1 in ExecTargetList (targetlist=3D0x82fa250, nodomains=3D5,
targettype=3D0x82fd620, values=3D0x82fd550, econtext=3D0x82fd4e0,
isDone=3D0xbfff7f68) at execQual.c:2230

#4  0x080d1cdb in ExecScan (node=3D0x82fd528, accessMtd=3D0xbfff7f68) at
execScan.c:49

#5  0x080d1d7d in ExecScan (node=3D0x82fa140, accessMtd=3D0x80d58d4
<IndexNext+24>) at execScan.c:146

#6  0x080d5b5e in ExecIndexReScan (node=3D0x82fa140, exprCtxt=3D0xb72117d,
parent=3D0x0) at nodeIndexscan.c:284

#7  0x080d58d4 in IndexNext (node=3D0x0) at nodeIndexscan.c:87
Previous frame inner to this frame (corrupt stack?)

--------------------------------
* Relevant errors in the logfile:
--------------------------------
2004-09-07 15:59:08 [8100]   LOG:  statement: FETCH 1 FROM
"PgSQL_08456DE4"
2004-09-07 15:59:39 [7370]   DEBUG:  reaping dead processes
2004-09-07 15:59:39 [7370]   DEBUG:  child process (pid 8100) was
terminated by signal 11
2004-09-07 15:59:39 [7370]   LOG:  server process (pid 8100) was
terminated by signal 11
2004-09-07 15:59:39 [7370]   LOG:  terminating any other active server
processes
2004-09-07 15:59:39 [7370]   DEBUG:  CleanupProc: sending SIGQUIT to
process 7539
2004-09-07 15:59:39 [7370]   DEBUG:  CleanupProc: sending SIGQUIT to
process 7468
2004-09-07 15:59:39 [7370]   DEBUG:  CleanupProc: sending SIGQUIT to
process 7423
2004-09-07 15:59:39 [7468]   WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
2004-09-07 15:59:39 [7423]   WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
2004-09-07 15:59:39 [7539]   WARNING:  Message from PostgreSQL backend:
        The Postmaster has informed me that some other backend
        died abnormally and possibly corrupted shared memory.
        I have rolled back the current transaction and am
        going to terminate your database system connection and exit.
        Please reconnect to the database system and repeat your query.
2004-09-07 15:59:39 [7370]   DEBUG:  reaping dead processes
2004-09-07 15:59:39 [7370]   DEBUG:  child process (pid 7423) exited
with exit code 1
2004-09-07 15:59:39 [7370]   DEBUG:  reaping dead processes
2004-09-07 15:59:39 [7370]   DEBUG:  child process (pid 7468) exited
with exit code 1
2004-09-07 15:59:39 [7370]   DEBUG:  reaping dead processes
2004-09-07 15:59:39 [7370]   DEBUG:  child process (pid 7539) exited
with exit code 1
2004-09-07 15:59:39 [7370]   LOG:  all server processes terminated;
reinitializing shared memory and semaphores
2004-09-07 15:59:39 [7370]   DEBUG:  shmem_exit(0)
2004-09-07 15:59:39 [7370]   DEBUG:  invoking
IpcMemoryCreate(size=3D78536704)
------------------------------------------------------------------

-----------------------------------------------------------------------
* The child process (pid 8100) that crashed with signal 11 logged these
statements (We have a complete log for pid 8100, but it has a size of
2MB):
-----------------------------------------------------------------------

2004-09-07 15:58:50 [8100]   LOG:  statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100]   LOG:  statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100]   LOG:  statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100]   LOG:  statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100]   LOG:  statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100]   LOG:  statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100]   LOG:  statement: begin; select
getdatabaseencoding(); commit
2004-09-07 15:58:50 [8100]   LOG:  statement: select version()
2004-09-07 15:58:50 [8100]   LOG:  statement: select version()
2004-09-07 15:58:50 [8100]   LOG:  statement: select version()
2004-09-07 15:58:50 [8100]   LOG:  statement: BEGIN WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: BEGIN WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: BEGIN WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: BEGIN WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: SET CLIENT_ENCODING TO
'ISO_8859_1'
2004-09-07 15:58:50 [8100]   LOG:  statement: SET CLIENT_ENCODING TO
'ISO_8859_1'
2004-09-07 15:58:50 [8100]   LOG:  statement: SET CLIENT_ENCODING TO
'ISO_8859_1'
2004-09-07 15:58:50 [8100]   LOG:  statement: COMMIT WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: COMMIT WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: COMMIT WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: BEGIN WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: BEGIN WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: BEGIN WORK
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_user@ifi'
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_user@ifi'
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_user@ifi'
2004-09-07 15:58:50 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1700
2004-09-07 15:58:50 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1700
2004-09-07 15:58:50 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1700
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_fg@ifi'
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_fg@ifi'
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM spread_code WHERE code_str =3D 'NIS_fg@ifi'
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code , shell FROM posix_shell_code
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code , shell FROM posix_shell_code
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code , shell FROM posix_shell_code
2004-09-07 15:58:50 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1043
2004-09-07 15:58:50 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1043
2004-09-07 15:58:50 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1043
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT count ( ah . account_id ) , ah . spread , di . disk_id
, di . host_id , di . path FROM disk_info di LEFT JOIN account_home ah
ON di . disk_id =3D ah . disk_id AND ah . spread =3D 105 LEFT JOIN
account_info ai ON ah . account_id =3D ai . account_id GROUP BY di .
disk_id , di . host_id , di . path , ah . spread
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT count ( ah . account_id ) , ah . spread , di . disk_id
, di . host_id , di . path FROM disk_info di LEFT JOIN account_home ah
ON di . disk_id =3D ah . disk_id AND ah . spread =3D 105 LEFT JOIN
account_info ai ON ah . account_id =3D ai . account_id GROUP BY di .
disk_id , di . host_id , di . path , ah . spread
2004-09-07 15:58:50 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT count ( ah . account_id ) , ah . spread , di . disk_id
, di . host_id , di . path FROM disk_info di LEFT JOIN account_home ah
ON di . disk_id =3D ah . disk_id AND ah . spread =3D 105 LEFT JOIN
account_info ai ON ah . account_id =3D ai . account_id GROUP BY di .
disk_id , di . host_id , di . path , ah . spread
2004-09-07 15:58:53 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 20
2004-09-07 15:58:53 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 20
2004-09-07 15:58:53 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 20
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM value_domain_code WHERE code_str =3D
'account_names'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM value_domain_code WHERE code_str =3D
'account_names'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM value_domain_code WHERE code_str =3D
'account_names'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authentication_code WHERE code_str =3D
'crypt3-DES'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authentication_code WHERE code_str =3D
'crypt3-DES'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authentication_code WHERE code_str =3D
'crypt3-DES'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authoritative_system_code WHERE code_str =3D
'Cached'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authoritative_system_code WHERE code_str =3D
'Cached'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM authoritative_system_code WHERE code_str =3D
'Cached'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM person_name_code WHERE code_str =3D 'FULL'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM person_name_code WHERE code_str =3D 'FULL'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM person_name_code WHERE code_str =3D 'FULL'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_08456DE4"
CURSOR FOR SELECT ai . account_id , posix_uid , shell , gecos ,
entity_name , aa . auth_data , pg . posix_gid , pn . name , eq .
quarantine_type , eq . start_date , eq . disable_until , eq . end_date ,
ah . home , ah . disk_id FROM posix_user pu LEFT JOIN entity_quarantine
eq ON pu . account_id =3D eq . entity_id JOIN entity_spread es ON pu .
account_id =3D es . entity_id AND ( es . spread =3D 105 ) JOIN account_home
ah ON es . entity_id =3D ah . account_id JOIN account_info ai ON ai .
account_id =3D pu . account_id LEFT JOIN person_name pn ON pn . person_id
=3D ai . owner_id AND pn . source_system =3D 92 AND pn . name_variant =3D 1=
17
JOIN posix_group pg ON pu . gid =3D pg . group_id LEFT JOIN
account_authentication aa ON aa . account_id =3D pu . account_id AND aa .
method =3D 12 JOIN entity_name en ON en . entity_id =3D pu . account_id AND
en . value_domain =3D 14 ORDER BY ai . account_id
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_08456DE4"
CURSOR FOR SELECT ai . account_id , posix_uid , shell , gecos ,
entity_name , aa . auth_data , pg . posix_gid , pn . name , eq .
quarantine_type , eq . start_date , eq . disable_until , eq . end_date ,
ah . home , ah . disk_id FROM posix_user pu LEFT JOIN entity_quarantine
eq ON pu . account_id =3D eq . entity_id JOIN entity_spread es ON pu .
account_id =3D es . entity_id AND ( es . spread =3D 105 ) JOIN account_home
ah ON es . entity_id =3D ah . account_id JOIN account_info ai ON ai .
account_id =3D pu . account_id LEFT JOIN person_name pn ON pn . person_id
=3D ai . owner_id AND pn . source_system =3D 92 AND pn . name_variant =3D 1=
17
JOIN posix_group pg ON pu . gid =3D pg . group_id LEFT JOIN
account_authentication aa ON aa . account_id =3D pu . account_id AND aa .
method =3D 12 JOIN entity_name en ON en . entity_id =3D pu . account_id AND
en . value_domain =3D 14 ORDER BY ai . account_id
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_08456DE4"
CURSOR FOR SELECT ai . account_id , posix_uid , shell , gecos ,
entity_name , aa . auth_data , pg . posix_gid , pn . name , eq .
quarantine_type , eq . start_date , eq . disable_until , eq . end_date ,
ah . home , ah . disk_id FROM posix_user pu LEFT JOIN entity_quarantine
eq ON pu . account_id =3D eq . entity_id JOIN entity_spread es ON pu .
account_id =3D es . entity_id AND ( es . spread =3D 105 ) JOIN account_home
ah ON es . entity_id =3D ah . account_id JOIN account_info ai ON ai .
account_id =3D pu . account_id LEFT JOIN person_name pn ON pn . person_id
=3D ai . owner_id AND pn . source_system =3D 92 AND pn . name_variant =3D 1=
17
JOIN posix_group pg ON pu . gid =3D pg . group_id LEFT JOIN
account_authentication aa ON aa . account_id =3D pu . account_id AND aa .
method =3D 12 JOIN entity_name en ON en . entity_id =3D pu . account_id AND
en . value_domain =3D 14 ORDER BY ai . account_id
2004-09-07 15:58:54 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1082
2004-09-07 15:58:54 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1082
2004-09-07 15:58:54 [8100]   LOG:  statement: SELECT typname, -1 ,
typelem FROM pg_type WHERE oid =3D 1082
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_brk'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_brk'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_brk'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'permisjon'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'permisjon'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'permisjon'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autointsomm'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autointsomm'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autointsomm'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_sh'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_sh'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_sh'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_nystuden'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_nystuden'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_nystuden'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autoekstern'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autoekstern'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autoekstern'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'autostud'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'autostud'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'autostud'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'system'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'system'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'system'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_ftpuser'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_ftpuser'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_ftpuser'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autopassord'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autopassord'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'autopassord'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'teppe'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'teppe'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'teppe'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'generell'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'generell'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'generell'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'slutta'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'slutta'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'slutta'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_stengt'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_stengt'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'nologin_stengt'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'svakt_passord'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'svakt_passord'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D
'svakt_passord'
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: CLOSE "PgSQL_083326D4"
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'nologin'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'nologin'
2004-09-07 15:58:54 [8100]   LOG:  statement: DECLARE "PgSQL_083326D4"
CURSOR FOR SELECT code FROM quarantine_code WHERE code_str =3D 'nologin'
------------------------------------------------------

Thanks in advance, any help will be appreciated.

--=20
 Rafael Martinez, <r.m.guerrero@usit.uio.no>
 Center for Information Technology Services
 University of Oslo, Norway

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Tom Lane
Date:
Rafael Martinez Guerrero <r.m.guerrero@usit.uio.no> writes:
> * Information from CORE dump we got with --enable-debug. We have
> compiled a new version of postgres and run it through gdb with the core
> dump we had/got from postgres without --enable-debug.=20

Okay, theoretically that works, but it might be smarter to install the
debug build and get a fresh core dump that definitely corresponds to it.

> #0  0xb734d07c in memcpy () from /lib/tls/libc.so.6

> #1  0x0806bba8 in DataFill (data=3D0xb7489000 <Address 0xb7489000 out of
> bounds>, tupleDesc=3D0x82fd554, value=3D0x82fd550, nulls=3D0xbfff7ec0 "  n =
>  ",
> infomask=3D0x836e904c, bit=3D0x836e904f "\003\f") at heaptuple.c:139

If accurate, that says it's crashing here:

            /* fixed-length pass-by-reference */
            Assert(att[i]->attlen > 0);
            data_length = att[i]->attlen;
-->         memcpy(data, DatumGetPointer(value[i]), data_length);

which suggests either that att[i]->attlen is corrupt, or that the
computed length for the preceding column was wacko (leading to the
data pointer being moved to a silly address), or that the provided
value[i] is wrong.  In the context at hand none of these seem especially
likely, but one of them must be the case.  Can you look with jdb to
 see what the value of i is, and print out the contents of the *(att[i])
struct?  Also look at "data" and "value[i]" to see if they are sensible
pointers or not.

How reproducible is the crash --- does it happen every time you execute
this particular FETCH?

            regards, tom lane

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Rafael Martinez
Date:
On Tue, 2004-09-07 at 19:58, Tom Lane wrote:

> Rafael Martinez Guerrero <r.m.guerrero@usit.uio.no> writes:
> > * Information from CORE dump we got with --enable-debug. We have
> > compiled a new version of postgres and run it through gdb with the core
> > dump we had/got from postgres without --enable-debug.=20
>
> Okay, theoretically that works, but it might be smarter to install the
> debug build and get a fresh core dump that definitely corresponds to it.
>

It is late in Norway and we need to sleep, we will try this tomorrow
morning.


> > #0  0xb734d07c in memcpy () from /lib/tls/libc.so.6
>
> > #1  0x0806bba8 in DataFill (data=3D0xb7489000 <Address 0xb7489000 out of
> > bounds>, tupleDesc=3D0x82fd554, value=3D0x82fd550, nulls=3D0xbfff7ec0 "  n =
> >  ",
> > infomask=3D0x836e904c, bit=3D0x836e904f "\003\f") at heaptuple.c:139
>
> If accurate, that says it's crashing here:
>
>             /* fixed-length pass-by-reference */
>             Assert(att[i]->attlen > 0);
>             data_length = att[i]->attlen;
> -->         memcpy(data, DatumGetPointer(value[i]), data_length);
>
> which suggests either that att[i]->attlen is corrupt, or that the
> computed length for the preceding column was wacko (leading to the
> data pointer being moved to a silly address), or that the provided
> value[i] is wrong.  In the context at hand none of these seem especially
> likely, but one of them must be the case.  Can you look with jdb to
>  see what the value of i is, and print out the contents of the *(att[i])
> struct?  Also look at "data" and "value[i]" to see if they are sensible
> pointers or not.
>

I got this from one of our developers (from the core dump generated by
7.3.7 without --enable-debug):
--------------------------------------
(gdb) inspect i
$1 = 1

(gdb) inspect att[i]
$2 = 0x82fd6e8

(gdb) inspect *att[i]
$3 = {attrelid = 0, attname = {data = '\0' <repeats 63 times>,
alignmentDummy = 0}, atttypid = 1700, attstattarget = -1, attlen = -1,
attnum = 2, attndims = 0, attcacheoff = -1, atttypmod = 393220, attbyval
= 0 '\0', attstorage = 109 'm', attisset = 0 '\0', attalign = 105 'i',
attnotnull = 0 '\0', atthasdef = 0 '\0', attisdropped = 0 '\0',
attislocal = 1 '\001', attinhcount = 0}

(gdb) inspect data
$4 = 0xb7489000 <Address 0xb7489000 out of bounds>

(gdb) inspect value[i]
$5 = 3054556648


> How reproducible is the crash --- does it happen every time you execute
> this particular FETCH?
>

We are not sure about this. We did not log as much as we should in the
beginning. One thing is sure, the last time, it happens after this
FETCH. We have full logging on now and we will be able to know more
about this if/when it crash again.



>             regards, tom lane


Thanks for your help. I hope you/we will be able to find out this, right
now is a big crisis for us.

--
 Rafael Martinez, <r.m.guerrero@usit.uio.no>
 Center for Information Technology Services
 University of Oslo, Norway

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Rafael Martinez
Date:
On Tue, 2004-09-07 at 23:36, Tom Lane wrote:
> Rafael Martinez <r.m.guerrero@usit.uio.no> writes:
> > I got this from one of our developers (from the core dump generated by
> > 7.3.7 without --enable-debug):
>
> > (gdb) inspect *att[i]
> > $3 = {attrelid = 0, attname = {data = '\0' <repeats 63 times>,
> > alignmentDummy = 0}, atttypid = 1700, attstattarget = -1, attlen = -1,
> > attnum = 2, attndims = 0, attcacheoff = -1, atttypmod = 393220, attbyval
> > = 0 '\0', attstorage = 109 'm', attisset = 0 '\0', attalign = 105 'i',
> > attnotnull = 0 '\0', atthasdef = 0 '\0', attisdropped = 0 '\0',
> > attislocal = 1 '\001', attinhcount = 0}
>
> That looks reasonable ...
>
> > (gdb) inspect data
> > $4 = 0xb7489000 <Address 0xb7489000 out of bounds>
>
> > (gdb) inspect value[i]
> > $5 = 3054556648
>
> Hmm, what do you get from "x/10 3054556648" ?  Also, it'd be worth
> looking at the contents of *att[0] to see if that's also sensible,
> as well as value[0] and wherever that points (if it's a pointer).
>
>             regards, tom lane

(gdb) x/10 3054556648
0xb610d5e8:     0x2f00000c      0x00000002      0x30170000
0x020c6172
0xb610d5f8:     0x00000000      0x00000000      0x00ae0000
0x0006002b
0xb610d608:     0x2f1c0913      0x0404b70b

(gdb) inspect att[0]
$1 = 0x82fd660

(gdb) inspect *att[0]
$2 = {attrelid = 0, attname = {data = '\0' <repeats 63 times>,
alignmentDummy = 0}, atttypid = 1700, attstattarget = -1, attlen = -1,
attnum = 1, attndims = 0, attcacheoff = 0, atttypmod = 786436, attbyval
= 0 '\0', attstorage = 109 'm', attisset = 0 '\0', attalign = 105 'i',
attnotnull = 0 '\0', atthasdef = 0 '\0', attisdropped = 0 '\0',
attislocal = 1 '\001', attinhcount = 0}

(gdb) inspect value[0]
$3 = 3054556612

(gdb) inspect *value[0]
$4 = 12


--
 Rafael Martinez, <r.m.guerrero@usit.uio.no>
 Center for Information Technology Services
 University of Oslo, Norway

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Tom Lane
Date:
Rafael Martinez <r.m.guerrero@usit.uio.no> writes:
>> Hmm, what do you get from "x/10 3054556648" ?

> (gdb) x/10 3054556648
> 0xb610d5e8:     0x2f00000c      0x00000002      0x30170000     0x020c6172
> 0xb610d5f8:     0x00000000      0x00000000      0x00ae0000     0x0006002b
> 0xb610d608:     0x2f1c0913      0x0404b70b

Well, that's certainly not a sensible first word for a numeric field;
the first word should be a length and this obviously isn't.

A reasonable theory at this point is that the data on disk for this
table have gotten corrupted, probably in the way of a bad length value
for whatever field(s) lie between the two that are being extracted here.
That could result in a miscomputed address for the next field, which
seems to be what we're looking at.

What I would suggest doing next is backtracking to find out which
physical tuple this is on which disk page, and then dumping that out
with pg_filedump (or your tool of choice) so that we can verify or
disprove the hypothesis of bad stored data.  If it is bad data, we'll
want to examine the whole page anyway to see if we can see any pattern
of corruption.

You should be able to find out the physical tuple involved by looking at
the "ecxt_scantuple" field of ExecTargetList's econtext parameter.  Its
"val" field should point to something like this:

(gdb) p *econtext->ecxt_scantuple->val
$3 = {t_len = 276, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 1},
  t_tableOid = 863135, t_datamcxt = 0x0, t_data = 0xc2c0fc48}

t_tableOid is the source table OID, ip_blkid is the page number (divided
into high and low 16-bit halves for arcane reasons), and ip_posid is the
tuple number on that page.  You can also look at *t_data for additional
confirmation of what you are dealing with:

(gdb) p *econtext->ecxt_scantuple->val->t_data
$4 = {t_choice = {t_heap = {t_xmin = 42833, t_cmin = 0, t_xmax = 863136,
      t_field4 = {t_cmax = 0, t_xvac = 0}}, t_datum = {datum_len = 42833,
      datum_typmod = 0, datum_typeid = 863136}}, t_ctid = {ip_blkid = {
      bi_hi = 0, bi_lo = 0}, ip_posid = 1}, t_natts = 16, t_infomask = 2320,
  t_hoff = 32 ' ', t_bits = ""}

I'm using CVS tip to prepare this example, so the field layout is not
the same as what you'll see in 7.4, but there will be a t_ctid field
and it will probably have the same contents as what you saw in the
scantuple struct.

Once you have the table OID, discover its file node number:

regression=# select relfilenode from pg_class where oid = 863135;
 relfilenode
-------------
      863135
(1 row)

(These will often be the same, but don't assume so without verifying.)
And look up your database OID:

regression=# select oid from pg_database where datname = 'mydb';

Now the file you want to look at is $PGDATA/base/dboid/relfilenode.

If you are using pg_filedump (see http://sources.redhat.com/rhdb/)
then I'd recommend a command along the lines of

    pg_filedump -i -f -R pagenum $PGDATA/base/dboid/relfilenode

to dump the page in the most useful format.

We'll need to know the table schema ("\d tabname") also to interpret
what's in the dump.

            regards, tom lane

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Tom Lane
Date:
Rafael Martinez <r.m.guerrero@usit.uio.no> writes:
> I got this from one of our developers (from the core dump generated by
> 7.3.7 without --enable-debug):

> (gdb) inspect *att[i]
> $3 = {attrelid = 0, attname = {data = '\0' <repeats 63 times>,
> alignmentDummy = 0}, atttypid = 1700, attstattarget = -1, attlen = -1,
> attnum = 2, attndims = 0, attcacheoff = -1, atttypmod = 393220, attbyval
> = 0 '\0', attstorage = 109 'm', attisset = 0 '\0', attalign = 105 'i',
> attnotnull = 0 '\0', atthasdef = 0 '\0', attisdropped = 0 '\0',
> attislocal = 1 '\001', attinhcount = 0}

That looks reasonable ...

> (gdb) inspect data
> $4 = 0xb7489000 <Address 0xb7489000 out of bounds>

> (gdb) inspect value[i]
> $5 = 3054556648

Hmm, what do you get from "x/10 3054556648" ?  Also, it'd be worth
looking at the contents of *att[0] to see if that's also sensible,
as well as value[0] and wherever that points (if it's a pointer).

            regards, tom lane

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Tom Lane
Date:
Rafael Martinez <r.m.guerrero@usit.uio.no> writes:
> I send you the information you ask for.

This is really interesting.  Looking at tuple 44 in the pg_filedump
output:

>  Item  44 -- Length:   88  Offset: 4016 (0x0fb0)  Flags: USED
>   XID: min (34365810)  CMIN|XMAX: 0  CMAX|XVAC: 0
>   Block Id: 174  linp Index: 44   Attributes: 6   Size: 28
>   infomask: 0x0913
> (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_INVALID)
>   t_bits: [0]: 0x2f

>   0fb0: 72610c02 00000000 00000000 0000ae00  ra..............
>   0fc0: 2c000600 13091c2f 0cb70404 0c000000  ,....../........
>   0fd0: 04000000 00004869 0c000000 03000000  ......Hi........
>   0fe0: 00009681 0c000000 03000000 00001002  ................
>   0ff0: 0c000000 02000000 00001230 0c000000  ...........0....
>   1000: 02000000 00001730                    .......0

whereas what we saw in the gdb output was:

> (gdb) x/10 3054556648
> 0xb610d5e8:     0x2f00000c      0x00000002      0x30170000     0x020c6172
> 0xb610d5f8:     0x00000000      0x00000000      0x00ae0000     0x0006002b
> 0xb610d608:     0x2f1c0913      0x0404b70b

This evidently corresponds to data starting at offset 0ffc on the disk
page (the last few bytes of the gdb output match the start of tuple 43,
which is in the next higher part of the page --- note that the bytes are
being printed in opposite orders by pg_filedump and gdb).  So somehow,
the byte at page offset 0fff got changed from 00 to 2f in memory, though
not on disk.  If you still have the core dump, would you look at the
area surrounding address 3054556648 and see if there are any other
discrepancies between that and what is on disk?

            regards, tom lane

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Tom Lane
Date:
Kjetil Torgrim Homme <kjetilho@ifi.uio.no> writes:
> [Tom Lane]:
>> So somehow, the byte at page offset 0fff got changed
>> from 00 to 2f in memory, though not on disk.

> indeed interesting.  IMHO 0x0fff sounds more like a write to -1
> (relative to the next page) than a random memory error.

Note though that that offset is only special with respect to locations
on disk; it was not a special address in memory.  I'm wondering about
transient flakiness in your disk drive controller causing it to
sometimes return bad data.  Have you run any read/write disk tests?

            regards, tom lane

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Kjetil Torgrim Homme
Date:
[Tom Lane]:
>
>   Kjetil Torgrim Homme <kjetilho@ifi.uio.no> writes:
>   > indeed interesting.  IMHO 0x0fff sounds more like a write to -1
>   > (relative to the next page) than a random memory error.
>
>   Note though that that offset is only special with respect to
>   locations on disk; it was not a special address in memory.  I'm
>   wondering about transient flakiness in your disk drive controller
>   causing it to sometimes return bad data.  Have you run any
>   read/write disk tests?

well, the hardware RAID1 controller claims the disks are healthy, but
of course it would.  I was unable to find any software to do such
testing, the closest I got was badblocks(8), but it only wants to work
on a block device.

I hacked good old bonnie to write blocks of a random byte, then read
them back and see if they match.  I'll run a few such processes
concurrently, we'll see if it suffices to trigger (and discover) any
hardware bugs.
--
Kjetil T.

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Kjetil Torgrim Homme
Date:
we got a new coredump of 7.3.7 today.  this instance was running on a
freshly installed computer, to eliminate(?) all hardware issues.  it's
still the same brand and model, though.  the old system has been
running hard disk tests 30+ hours with no errors yet.

the core dump happens at the same place in the code, and this time we
got a complete backtrace:

(gdb) bt
#0  0xb734d07c in memcpy () from /lib/tls/libc.so.6
#1  0x0806bba8 in DataFill (data=3D0xb7488fff "", tupleDesc=3D0x82899a0,=20
    value=3D0x8289980, nulls=3D0xbfffd3c0 "    n    \"", infomask=3D0x8806b=
04c,=20
    bit=3D0x8806b04f "=EF\001") at heaptuple.c:139
#2  0x0806c3ee in heap_formtuple (tupleDescriptor=3D0x8279ec0, value=3D0x82=
89980,=20
    nulls=3D0xbfffd3c0 "    n    \"") at heaptuple.c:623
#3  0x080d1af1 in ExecTargetList (targetlist=3D0x8278298, nodomains=3D9,=20
    targettype=3D0x8279ec0, values=3D0x8289980, econtext=3D0x8279a60,=20
    isDone=3D0xbfffd468) at execQual.c:2230
#4  0x080d1cdb in ExecScan (node=3D0x827a208, accessMtd=3D0xbfffd468)
    at execScan.c:49
#5  0x080d1d7d in ExecScan (node=3D0x8278c70, accessMtd=3D0x80d7c58 <SeqNex=
t+24>)
    at execScan.c:146
#6  0x080d7cfb in InitScanRelation (node=3D0x82899a0, estate=3D0x8278c70,=
=20
    scanstate=3D0xbfffd4c8) at nodeSeqscan.c:162
#7  0x080cfd86 in ExecProcNode (node=3D0x8289bf8, parent=3D0x0)
    at execProcnode.c:315
#8  0x080cecf3 in ExecutePlan (estate=3D0x8279c90, plan=3D0x8278c70,=20
    operation=3DCMD_SELECT, numberTuples=3D0, direction=3D136878496,=20
    destfunc=3D0x82899c8) at execMain.c:964
#9  0x080ce392 in ExecutorEnd (queryDesc=3D0x82899a0, estate=3D0x0)
    at execMain.c:223
#10 0x0811d069 in ProcessQuery (parsetree=3D0x82899c8, plan=3D0x8278c70,=20
    dest=3DRemote, completionTag=3D0xbfffd610 "") at pquery.c:251
#11 0x0811b7ed in pg_exec_query_string (query_string=3D0xbfffd610, dest=3DR=
emote,=20
    parse_context=3D0x823d610) at postgres.c:844
#12 0x0811c64d in PostgresMain (argc=3D4, argv=3D0xbfffd850,=20
    username=3D0x8238c69 "cerebrum") at postgres.c:2018
#13 0x0810413d in DoBackend (port=3D0x8238b38) at postmaster.c:2304
#14 0x08103cb2 in BackendStartup (port=3D0x8238b38) at postmaster.c:1935
#15 0x08102dad in ServerLoop () at postmaster.c:1016
#16 0x081027ea in PostmasterMain (argc=3D1, argv=3D0x8220170) at postmaster=
.c:797
#17 0x080e1234 in main (argc=3D1, argv=3D0xbfffe204) at main.c:217



(gdb) print *att[i]
$20 =3D {attrelid =3D 0, attname =3D {
    data =3D "pageunits_total", '\0' <repeats 48 times>,=20
    alignmentDummy =3D 1701273968}, atttypid =3D 1700, attstattarget =3D -1=
,=20
  attlen =3D -1, attnum =3D 9, attndims =3D 0, attcacheoff =3D -1, atttypmo=
d =3D 393220,=20
  attbyval =3D 0 '\0', attstorage =3D 109 'm', attisset =3D 0 '\0',=20
  attalign =3D 105 'i', attnotnull =3D 0 '\0', atthasdef =3D 0 '\0',=20
  attisdropped =3D 0 '\0', attislocal =3D 1 '\001', attinhcount =3D 0}
(gdb) print i
$21 =3D 8
(gdb) x/10 value[i]
0xb7190928:     0x2f00000b      0x00000000      0x00200000      0x00000207
0xb7190938:     0x00000314      0x01bf913d      0x10120000      0x00090020
0xb7190948:     0xef201553      0x00000001


the relevant code again is:

       if (att[i]->attbyval)
           [...]
       else if (att[i]->attlen =3D=3D -1)
           [...]
       else if (att[i]->attlen =3D=3D -2)
           [...]
       else
       {
               /* fixed-length pass-by-reference */
               Assert(att[i]->attlen > 0);
               data_length =3D att[i]->attlen;
=3D=3D=3D>           memcpy(data, DatumGetPointer(value[i]), data_length);
       }

(gdb) print data_length
$25 =3D 788529163
(gdb) print att[i]->attlen
$26 =3D -1

how can att[i]->attlen possibly change in the interim?  but
data_length looks corrupted, too.

(gdb) print *att[i-1]
$27 =3D {attrelid =3D 0, attname =3D {
    data =3D "pageunits_paid", '\0' <repeats 49 times>,=20
    alignmentDummy =3D 1701273968}, atttypid =3D 1700, attstattarget =3D -1=
,=20
  attlen =3D -1, attnum =3D 8, attndims =3D 0, attcacheoff =3D -1, atttypmo=
d =3D 393220,=20
  attbyval =3D 0 '\0', attstorage =3D 109 'm', attisset =3D 0 '\0',=20
  attalign =3D 105 'i', attnotnull =3D 0 '\0', atthasdef =3D 0 '\0',=20
  attisdropped =3D 0 '\0', attislocal =3D 1 '\001', attinhcount =3D 0}

also:

(gdb) print data
$39 =3D 0xb7488fff ""

which doesn't seem very aligned for an integer.

(gdb) print data[1]
Cannot access memory at address 0xb7489000

thank you for any insights.
--=20
Kjetil T.

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Tom Lane
Date:
Kjetil Torgrim Homme <kjetilho@ifi.uio.no> writes:
> how can att[i]->attlen possibly change in the interim?  but
> data_length looks corrupted, too.

Unless you compiled with no optimization at all (-O0), the compiler
would likely fold the identical memcpy() calls in the different
if-branches together.  So I wouldn't put too much stock in the
reported line number.

It does seem striking that a 0x2f got dumped into the high byte of the
length word in both cases.  Have you checked to see what the
page-on-disk looks like?  I'd be interested to know if the offset of the
damaged byte within the page is again 0x0fff.

            regards, tom lane

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Kjetil Torgrim Homme
Date:
[Tom Lane]:
>
>   This evidently corresponds to data starting at offset 0ffc on the
>   disk page (the last few bytes of the gdb output match the start of
>   tuple 43, which is in the next higher part of the page --- note
>   that the bytes are being printed in opposite orders by pg_filedump
>   and gdb).  So somehow, the byte at page offset 0fff got changed
>   from 00 to 2f in memory, though not on disk.

indeed interesting.  IMHO 0x0fff sounds more like a write to -1
(relative to the next page) than a random memory error.

>   If you still have the core dump, would you look at the area
>   surrounding address 3054556648 and see if there are any other
>   discrepancies between that and what is on disk?

I can't see any further discrepancies:

(gdb) x/20 3054556572
0xb610d59c:     0x020c6172      0x00000000      0x00000000      0x00ae0000
0xb610d5ac:     0x0006002c      0x2f1c0913      0x0404b70c      0x0000000c
0xb610d5bc:     0x00000004      0x69480000      0x0000000c      0x00000003
0xb610d5cc:     0x81960000      0x0000000c      0x00000003      0x02100000
0xb610d5dc:     0x0000000c      0x00000002      0x30120000      0x2f00000c

>   0fb0: 72610c02 00000000 00000000 0000ae00  ra..............
>   0fc0: 2c000600 13091c2f 0cb70404 0c000000  ,....../........
>   0fd0: 04000000 00004869 0c000000 03000000  ......Hi........
>   0fe0: 00009681 0c000000 03000000 00001002  ................
>   0ff0: 0c000000 02000000 00001230 0c000000  ...........0....
>   1000: 02000000 00001730                    .......0

thank you for your tremendous assistance!

--
Kjetil T.

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Rafael Martinez Guerrero
Date:
On Fri, 2004-09-10 at 16:24, Tom Lane wrote:
> Kjetil Torgrim Homme <kjetilho@ifi.uio.no> writes:
> > how can att[i]->attlen possibly change in the interim?  but
> > data_length looks corrupted, too.
>=20
> Unless you compiled with no optimization at all (-O0), the compiler
> would likely fold the identical memcpy() calls in the different
> if-branches together.  So I wouldn't put too much stock in the
> reported line number.
>=20
> It does seem striking that a 0x2f got dumped into the high byte of the
> length word in both cases.  Have you checked to see what the
> page-on-disk looks like?  I'd be interested to know if the offset of the
> damaged byte within the page is again 0x0fff.
>=20

Hei Tom=20

Kjetil will answer you about this.=20

In the meant time we got new core dumps when taking a backup of the same
database.=20

Some more info I got from the departament in charge of this database:
-----------------------------------------------------------
We make a backup of our production server every 15 minutes.  Recently,
we've seen behaviour like this:

  [12/09/2004-05:46:00] PostgreSQL: starting backup_cluster01.sh: on
cerebellum.uio.no
  [12/09/2004-05:48:03] PostgreSQL: backup_cluster01.sh finnished on
cerebellum.uio.no
  [12/09/2004-06:01:00] PostgreSQL: starting backup_cluster01.sh: on
cerebellum.uio.no
  pg_dump: ERROR:  MemoryContextAlloc: invalid request size 1577058307
  pg_dump: lost synchronization with server, resetting connection
  pg_dump: SQL command to dump the contents of table
"paid_quota_history" failed: PQendcopy() failed.
  pg_dump: Error message from server: pg_dump: The command was: COPY
public.paid_quota_history (job_id, transaction_type, person_id, tstamp,
update_by, update
  _program, pageunits_free, pageunits_paid, pageunits_total) TO stdout;
  pg_dumpall: pg_dump failed on cerebrum_prod, exiting
  [12/09/2004-06:02:16] PostgreSQL: backup_cluster01.sh finnished on
cerebellum.uio.no

Every consecutive backup failes with the same message, and then
suddenly:

  [12/09/2004-08:46:00] PostgreSQL: starting backup_cluster01.sh: on
cerebellum.uio.no
  [12/09/2004-08:48:34] PostgreSQL: backup_cluster01.sh finnished on
cerebellum.uio.no

To me this looks like a cache somewhere that upon read contained some
incorrect data.  This cache was somehow flushed two-hours later, and
fresh data was read from disk.

Could this be postgres problem, or is it hardware/kernel related?
Upgrading from 7.3.5 to 7.3.7 to 7.4.5 does not help.  We have now
moved the database between 3 different Dell2650 servers, and replaced
memory chips on one system once.  Lately one or more postgres
processes received signal11 atleast once a day.  The problems started
about a week ago after stable production for about 9 months.

The backup failures above were accompanied by 4 core-dumps.  Backtrace
follows:

  #0  0xb734d07c in memcpy () from /lib/tls/libc.so.6
  #1  0x08174880 in set_var_from_num (num=3D0xb7021d24, dest=3D0x87b432fe)
at numeric.c:2673
  #2  0x08171927 in numeric_out (fcinfo=3D0xbfffc2d0) at numeric.c:373
  #3  0x081aa81d in FunctionCall3 (flinfo=3D0x82cc4e8, arg1=3D3221209808,
arg2=3D3221209808, arg3=3D3221209808) at fmgr.c:1016
  #4  0x080c78fb in CopyTo (rel=3D0xb6800bd0, attnumlist=3D0x82cb4a0,
binary=3D0 '\0', oids=3D0 '\0', delim=3D0x82232a8 "\t", null_print=3D0x81fc=
95d
"\\N")
      at copy.c:1096
  #5  0x080c7021 in DoCopy (stmt=3D0x2f000004) at copy.c:920
  #6  0x081507c5 in PortalRunUtility (portal=3D0x82bdfd8, query=3D0x82ba220,
dest=3D0x82ba1d8, completionTag=3D0xbfffc650 "") at pquery.c:772
  #7  0x08150a3e in PortalRunMulti (portal=3D0x82bdfd8, dest=3D0x82ba1d8,
altdest=3D0x82ba1d8, completionTag=3D0xbfffc650 "") at pquery.c:836
  #8  0x0815033c in PortalRun (portal=3D0x82bdfd8, count=3D2147483647,
dest=3D0x82ba1d8, altdest=3D0x82ba1d8, completionTag=3D0xbfffc650 "") at
pquery.c:494
  #9  0x0814d5f8 in exec_simple_query (
      query_string=3D0x82b9bc0 "COPY public.change_log (tstamp, change_id,
subject_entity, change_type_id, dest_entity, change_params, change_by,
change_program, description) TO stdout;") at postgres.c:873
  #10 0x0814f660 in PostgresMain (argc=3D4, argv=3D0x82701b8,
username=3D0x8270188 "postgres") at postgres.c:2868
  #11 0x0812f5ab in BackendFork (port=3D0x827d0a0) at postmaster.c:2564
  #12 0x0812f09e in BackendStartup (port=3D0x827d0a0) at postmaster.c:2207
  #13 0x0812d95f in ServerLoop () at postmaster.c:1119
  #14 0x0812d305 in PostmasterMain (argc=3D3, argv=3D0x826e1c0) at
postmaster.c:897
  #15 0x08104f10 in main (argc=3D3, argv=3D0xbfffd6c4) at main.c:214

We are currently in the process of moving the production server to an
IBM box, which should eliminate any Dell2650 spesific causes.
-----------------------------------------------------------

--=20
 Rafael Martinez, <r.m.guerrero@usit.uio.no>
 Center for Information Technology Services
 University of Oslo, Norway

Re: PosgreSQL is crashing with a signal 11 - Bug?

From
Tom Lane
Date:
Rafael Martinez Guerrero <r.m.guerrero@usit.uio.no> writes:
> To me this looks like a cache somewhere that upon read contained some
> incorrect data.  This cache was somehow flushed two-hours later, and
> fresh data was read from disk.

Very possible.

> Could this be postgres problem, or is it hardware/kernel related?

I'd wonder about bad RAM in the address range that the kernel uses for
disk block caching.  Or we could return to the theory I suggested a
bit ago: once in a while the disk controller messes up the last byte
of a transfer.

> The problems started about a week ago after stable production for
> about 9 months.

If you hadn't changed any software immediately beforehand, then that
sure sounds like hardware going bad.

            regards, tom lane