Thread: postmaster core ( finally I have it )
Hi all, since long time ( in the mean time I did Postgres upgrade four time and now I'm using 7.3.3 ) I'm having, at least once in a week, a signal 11 on a backend, and how you can immagine with the subseguent drop of all connections, finally now I have the core. The process killed made always the same select ( with different id_package ): SELECT id_publisher, publisher_name, id_package, package_name FROM v_psl_package_info WHERE id_package = 177; Here it is: bash-2.05$ file core core: ELF 32-bit LSB core file of 'postmaster' (signal 11), Intel 80386, version 1, from 'postmaster' bash-2.05$ gdb /usr/bin/postmaster core GNU gdb Red Hat Linux (5.2-2) Copyright 2002 Free Software Foundation, Inc. GDB is free software, covered by the GNU General Public License, and you are welcome to change it and/or distribute copies of it under certain conditions. Type "show copying" to see the conditions. There is absolutely no warranty for GDB. Type "show warranty" for details. This GDB was configured as "i386-redhat-linux"...(no debugging symbols found)... warning: core file may not match specified executable file. Core was generated by `postgres: push empdb 192.168'. Program terminated with signal 11, Segmentation fault. Reading symbols from /lib/libpam.so.0...(no debugging symbols found)...done. Loaded symbols for /lib/libpam.so.0 Reading symbols from /lib/libssl.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/libssl.so.2 Reading symbols from /lib/libcrypto.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/libcrypto.so.2 Reading symbols from /usr/kerberos/lib/libkrb5.so.3...(no debugging symbols found)...done. Loaded symbols for /usr/kerberos/lib/libkrb5.so.3 Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libz.so.1 Reading symbols from /usr/lib/libreadline.so.4...(no debugging symbols found)...done. Loaded symbols for /usr/lib/libreadline.so.4 Reading symbols from /lib/libtermcap.so.2...(no debugging symbols found)...done. Loaded symbols for /lib/libtermcap.so.2 Reading symbols from /lib/libcrypt.so.1...done. Loaded symbols for /lib/libcrypt.so.1 Reading symbols from /lib/libresolv.so.2...done. Loaded symbols for /lib/libresolv.so.2 Reading symbols from /lib/libnsl.so.1...done. Loaded symbols for /lib/libnsl.so.1 Reading symbols from /lib/libdl.so.2...done. Loaded symbols for /lib/libdl.so.2 Reading symbols from /lib/libm.so.6...done. Loaded symbols for /lib/libm.so.6 Reading symbols from /lib/libc.so.6...done. Loaded symbols for /lib/libc.so.6 Reading symbols from /usr/kerberos/lib/libcom_err.so.3...done. Loaded symbols for /usr/kerberos/lib/libcom_err.so.3 Reading symbols from /usr/kerberos/lib/libk5crypto.so.3...done. Loaded symbols for /usr/kerberos/lib/libk5crypto.so.3 Reading symbols from /lib/ld-linux.so.2...done. Loaded symbols for /lib/ld-linux.so.2 Reading symbols from /lib/libnss_files.so.2...done. Loaded symbols for /lib/libnss_files.so.2 #0 0x08171fdd in RelationBuildRuleLock () (gdb) where #0 0x08171fdd in RelationBuildRuleLock () #1 0x081722ab in RelationBuildDesc () #2 0x0817311c in RelationIdGetRelation () #3 0x08077b95 in relation_open () #4 0x08077d24 in heap_openrv () #5 0x080b1015 in addRangeTableEntry () #6 0x080ab6de in transformTableEntry () #7 0x080ab942 in transformFromClauseItem () #8 0x080ab321 in transformFromClause () #9 0x0809f639 in transformSelectStmt () #10 0x0809dcc4 in transformStmt () #11 0x0809da06 in parse_analyze () #12 0x08128a35 in pg_analyze_and_rewrite () #13 0x08128ce2 in pg_exec_query_string () #14 0x08129fc3 in PostgresMain () #15 0x08110344 in DoBackend () #16 0x0810fc2d in BackendStartup () #17 0x0810ed21 in ServerLoop () #18 0x0810e6f9 in PostmasterMain () #19 0x080e9cdd in main () #20 0x40249336 in __libc_start_main (main=0x80e9ab0 <main>, argc=3, ubp_av=0xbffffb04, init=0x806b2ec <_init>, fini=0x8188830 <_fini>, rtld_fini=0x4000d2fc <_dl_fini>, stack_end=0xbffffafc) at ../sysdeps/generic/libc-start.c:129 (gdb) quit and here all the activity performed by the process: Jul 22 20:00:02 sdr2 postgres[25079]: [2896523] LOG: connection received: host= xxx.xxx.xxx.xxx port=53809 Jul 22 20:00:02 sdr2 postgres[25079]: [2896524] LOG: connection authorized: user=push database=empdb Jul 22 20:00:02 sdr2 postgres[25079]: [2896525] DEBUG: /usr/bin/postmaster child[25079]: starting with ( Jul 22 20:00:02 sdr2 postgres[25079]: [2896526] DEBUG: ^Ipostgres Jul 22 20:00:02 sdr2 postgres[25079]: [2896527] DEBUG: ^I-v131072 Jul 22 20:00:02 sdr2 postgres[25079]: [2896528] DEBUG: ^I-p Jul 22 20:00:02 sdr2 postgres[25079]: [2896529] DEBUG: ^Iempdb Jul 22 20:00:02 sdr2 postgres[25079]: [2896530] DEBUG: ) Jul 22 20:00:02 sdr2 postgres[25079]: [2896531] DEBUG: InitPostgres Jul 22 20:00:02 sdr2 postgres[25079]: [2896532] DEBUG: StartTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896533-1] LOG: query: set datestyle to 'ISO'; select version(), case when pg_encoding_to_char(1) = 'SQL_ASCII' then 'UNKNOWN' else Jul 22 20:00:02 sdr2 postgres[25079]: [2896533-2] getdatabaseencoding() end; Jul 22 20:00:02 sdr2 postgres[25079]: [2896534] DEBUG: ProcessUtility Jul 22 20:00:02 sdr2 postgres[25079]: [2896535] DEBUG: CommitTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896536] DEBUG: StartTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896537] DEBUG: ProcessQuery Jul 22 20:00:02 sdr2 postgres[25079]: [2896538] DEBUG: CommitTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896539] LOG: duration: 0.035168 sec Jul 22 20:00:02 sdr2 postgres[25079]: [2896540] DEBUG: StartTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896541] LOG: query: set client_encoding = 'UNICODE'; show autocommit Jul 22 20:00:02 sdr2 postgres[25079]: [2896542] DEBUG: ProcessUtility Jul 22 20:00:02 sdr2 postgres[25079]: [2896543] DEBUG: CommitTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896544] DEBUG: StartTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896545] DEBUG: ProcessUtility Jul 22 20:00:02 sdr2 postgres[25079]: [2896546] DEBUG: CommitTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896547] LOG: duration: 0.003609 sec Jul 22 20:00:02 sdr2 postgres[25079]: [2896548] DEBUG: StartTransactionCommand Jul 22 20:00:02 sdr2 postgres[25079]: [2896549] LOG: query: SELECT id_publisher, publisher_name, id_package, package_name FROM v_psl_package_info WHERE id_package = 177 Do you need others informations or may I do something in order to permit you to isolate the problem ? Regards Gaetano
"Mendola Gaetano" <mendola@bigfoot.com> writes: > The process killed made always the same select ( with different > id_package ): > SELECT id_publisher, publisher_name, id_package, package_name > FROM v_psl_package_info > WHERE id_package = 177; > (gdb) where > #0 0x08171fdd in RelationBuildRuleLock () > #1 0x081722ab in RelationBuildDesc () > #2 0x0817311c in RelationIdGetRelation () > #3 0x08077b95 in relation_open () > #4 0x08077d24 in heap_openrv () > #5 0x080b1015 in addRangeTableEntry () > #6 0x080ab6de in transformTableEntry () > #7 0x080ab942 in transformFromClauseItem () > #8 0x080ab321 in transformFromClause () > #9 0x0809f639 in transformSelectStmt () > #10 0x0809dcc4 in transformStmt () > #11 0x0809da06 in parse_analyze () Hmm. RelationBuildRuleLock is concerned with reading the rule information for a table; specifically, it's evidently trying to read the rules for v_psl_package_info. I suspect some form of data corruption in the pg_rewrite row(s) for this table. Do you see any misbehavior when you do select * from pg_rewrite where ev_class = 'v_psl_package_info'::regclass In particular, are any of the columns NULL in these rows? regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us> > "Mendola Gaetano" <mendola@bigfoot.com> writes: > > The process killed made always the same select ( with different > > id_package ): > > > SELECT id_publisher, publisher_name, id_package, package_name > > FROM v_psl_package_info > > WHERE id_package = 177; > > > (gdb) where > > #0 0x08171fdd in RelationBuildRuleLock () > > #1 0x081722ab in RelationBuildDesc () > > #2 0x0817311c in RelationIdGetRelation () > > #3 0x08077b95 in relation_open () > > #4 0x08077d24 in heap_openrv () > > #5 0x080b1015 in addRangeTableEntry () > > #6 0x080ab6de in transformTableEntry () > > #7 0x080ab942 in transformFromClauseItem () > > #8 0x080ab321 in transformFromClause () > > #9 0x0809f639 in transformSelectStmt () > > #10 0x0809dcc4 in transformStmt () > > #11 0x0809da06 in parse_analyze () > > Hmm. RelationBuildRuleLock is concerned with reading the rule > information for a table; specifically, it's evidently trying to > read the rules for v_psl_package_info. I suspect some form of > data corruption in the pg_rewrite row(s) for this table. Do you > see any misbehavior when you do > > select * from pg_rewrite where ev_class = 'v_psl_package_info'::regclass All seems good. > In particular, are any of the columns NULL in these rows? I did: select * from pg_rewrite where ev_class = 'v_psl_package_info'::regclass and ( rulename is null or ev_class is null or ev_attr isnull or ev_type is null or is_instead is null or ev_qual is null or ev_action is null ); and I had: No rows. v_psl_package_info is a view, shall we look in another direction ? thank you Gaetano
"Mendola Gaetano" <mendola@bigfoot.com> writes: > From: "Tom Lane" <tgl@sss.pgh.pa.us> >> I suspect some form of >> data corruption in the pg_rewrite row(s) for this table. Do you >> see any misbehavior when you do >> >> select * from pg_rewrite where ev_class = 'v_psl_package_info'::regclass > All seems good. I really don't see any other explanation for a crash in that routine than problems with the pg_rewrite data. Do you get this any time you try to do something with the v_psl_package_info view, or is it a once-in-a-while problem? If the latter, I wonder whether it could be an issue with a flaky disk sector ... regards, tom lane
From: "Tom Lane" <tgl@sss.pgh.pa.us> > "Mendola Gaetano" <mendola@bigfoot.com> writes: > > From: "Tom Lane" <tgl@sss.pgh.pa.us> > >> I suspect some form of > >> data corruption in the pg_rewrite row(s) for this table. Do you > >> see any misbehavior when you do > >> > >> select * from pg_rewrite where ev_class = 'v_psl_package_info'::regclass > > > All seems good. > > I really don't see any other explanation for a crash in that routine > than problems with the pg_rewrite data. > > Do you get this any time you try to do something with the > v_psl_package_info view, or is it a once-in-a-while problem? > If the latter, I wonder whether it could be an issue with a > flaky disk sector ... Is once-in-a-while but always at 00 minutes. This select is performed each 20 minutes and the core happen always at 00 never at 20 and never at 40! I have scheduled also a vacuumdb -z each 15 minutes so only at 00 these are running together. I had another core at 21:00 today and this is the log of the vacuum ( there was that error just during the pg_rewrite vacuum!): INFO: --Relation pg_catalog.pg_description-- INFO: Pages 12: Changed 0, Empty 0; Tup 1390: Vac 0, Keep 0, UnUsed 1. Total CPU 0.00s/0.00u sec elapsed 0.20 sec. INFO: --Relation pg_toast.pg_toast_16416-- INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.01 sec. INFO: Analyzing pg_catalog.pg_description INFO: --Relation pg_catalog.pg_group-- INFO: Pages 1: Changed 0, Empty 0; Tup 3: Vac 0, Keep 0, UnUsed 9. Total CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: --Relation pg_toast.pg_toast_1261-- INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.01 sec. INFO: Analyzing pg_catalog.pg_group INFO: --Relation pg_catalog.pg_proc-- INFO: Pages 111: Changed 0, Empty 0; Tup 1822: Vac 0, Keep 0, UnUsed 272. Total CPU 0.02s/0.00u sec elapsed 0.29 sec. INFO: --Relation pg_toast.pg_toast_1255-- INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: Analyzing pg_catalog.pg_proc INFO: --Relation pg_catalog.pg_rewrite-- INFO: Pages 47: Changed 0, Empty 0; Tup 231: Vac 0, Keep 0, UnUsed 74. Total CPU 0.00s/0.00u sec elapsed 0.07 sec. INFO: --Relation pg_toast.pg_toast_16410-- INFO: Pages 22: Changed 0, Empty 0; Tup 77: Vac 0, Keep 0, UnUsed 33. Total CPU 0.00s/0.00u sec elapsed 0.03 sec. INFO: Analyzing pg_catalog.pg_rewrite WARNING: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormallyand possibly corrupted shared memory. I have rolled back the current transaction and am going to terminateyour database system connection and exit. Please reconnect to the database system and repeat your query. server closed the connection unexpectedly This probably means the server terminated abnormally before or whileprocessing the request. connection to server was lost vacuumdb: vacuum empdb failed I think we are near the solution now. Thank you Gaetano
"Mendola Gaetano" <mendola@bigfoot.com> writes: > Is once-in-a-while but always at 00 minutes. This select is performed each > 20 minutes and > the core happen always at 00 never at 20 and never at 40! Now that is very interesting ... why would that be? Could we see the definition of this view? > I had another core at 21:00 today and this is the log of the vacuum > ( there was that error just during the pg_rewrite vacuum!): I think that's just coincidence. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > "Mendola Gaetano" <mendola@bigfoot.com> writes: > > Is once-in-a-while but always at 00 minutes. This select is performed each > > 20 minutes and > > the core happen always at 00 never at 20 and never at 40! > > Now that is very interesting ... why would that be? > > Could we see the definition of this view? Of course here it is: CREATE OR REPLACE VIEW v_psl_package_info AS SELECT pb.id_publisher AS id_publisher, pb.name AS publisher_name, pk.id_package AS id_package, pk.name AS package_name FROM v_packages pk JOIN v_publishers pb USING (id_publisher); CREATE OR REPLACE VIEW v_packages AS SELECT p.id_package AS id_package, p.id_publisher AS id_publisher, p.name AS name, p.information AS information, p.description AS description, sp_lookup_key('package_type', p.id_package_type) AS TYPE, sp_lookup_key('target', p.id_target) AS target, p.port AS port, p.priority AS priority, sp_lookup_key('fec', p.id_fec) AS fec, p.input_group AS input_group, p.output_group AS output_group, CASE WHEN p.updatable THEN 1 ELSE 0 END AS updatable, p.checksum AS checksum, p.version AS version, p.start_file AS start_file, p.view_target_group AS view_target_group, p.target_group AS target_group, CASEWHEN p.auto_listen THEN 1 ELSE 0 END AS auto_listen, CASE WHENp.public_flag THEN 1 ELSE 0 END AS public_flag, p.needed_version AS needed_version, p.logic_version AS logic_version, p.package_size AS package_size, ps.id_drm_process AS id_drm_process, ps.id_cas_service AS id_cas_service, ps.id_cas_settings AS id_cas_settings, ps.id_drm_service AS id_drm_service FROM packages p LEFT OUTER JOIN package_security ps USING (id_package) ORDER BY p.id_publisher, p.name; CREATE OR REPLACE VIEW v_publishers AS SELECT p.id_publisher AS id_publisher, p.login AS login, p.password AS password, p.ftp_password AS ftp_password, p.name AS name, p.address ASaddress, p.city AS city, sp_lookup_descr('country', p.id_country) AS country, p.zip AS zip, p.phone_number AS phone_number, p.fax_number AS fax_number, p.email AS email, p.web_site AS web_site, p.description AS description, v.id_pid ASid_pid, v.id_transponder AS id_transponder, v.transponder AS transponder, v.pid AS pid, v.satellite AS satellite, v.frequency AS downlink_freq, v.polarization AS polarization, v.fec AS fec, v.symbol_rate AS symbol_rate, v.description AS txp_description, a.multicast_ipAS multicast_ip FROM publishers p JOIN v_pids v USING (id_pid) JOIN addresses a USING (id_publisher) WHERE id_publisher<>0ORDER BY p.name; CREATE OR REPLACE VIEW v_pids AS SELECT p.id_pid AS id_pid, t.id_transponder AS id_transponder, t.name AS transponder, p.pid AS pid, sp_lookup_descr('orbital_ptn', t.orbital_pos) AS satellite, t.frequency AS frequency, t.polarization AS polarization, t.fec AS fec, t.symbol_rate AS symbol_rate, t.description AS description FROM pids p JOIN transponders t USING (id_transponder); and the code for the function sp_lookup_key: CREATE OR REPLACE FUNCTION sp_lookup_key ( TEXT,INTEGER ) RETURNS TEXT AS' DECLARE lookup_name ALIAS FOR $1; my_id_key ALIAS FOR $2; my_id_lookup INTEGER; my_key TEXT; BEGIN SELECT INTO my_id_lookup id_lookup FROM v_lookup_tables WHERE name = lookup_name::varchar; IF NOT FOUND THEN RETURN NULL; END IF; SELECT INTO my_key key FROM lookup_tables WHERE id_table = my_id_lookup AND id_key = my_id_key; IF NOT FOUND THEN RETURN NULL; END IF; RETURN my_key; END; ' LANGUAGE 'plpgsql' WITH ( iscachable ); > > I had another core at 21:00 today and this is the log of the vacuum > > ( there was that error just during the pg_rewrite vacuum!): > > I think that's just coincidence. > > regards, tom lane May be was coincidence, I'm going to avoid the vacuum and that select concurreny and see what happen in these next days. Just to add some others information, this is what happens 22 Jul at 20:00 ( this time was during the pg_attribute vacuum) INFO: --Relation pg_catalog.pg_description-- INFO: Pages 12: Changed 0, Empty 0; Tup 1390: Vac 0, Keep 0, UnUsed 1. Total CPU 0.01s/0.00u sec elapsed 0.03 sec. INFO: --Relation pg_toast.pg_toast_16416-- INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: Analyzing pg_catalog.pg_description INFO: --Relation pg_catalog.pg_group-- INFO: Pages 1: Changed 0, Empty 0; Tup 3: Vac 0, Keep 0, UnUsed 9. Total CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: --Relation pg_toast.pg_toast_1261-- INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: Analyzing pg_catalog.pg_group INFO: --Relation pg_catalog.pg_proc-- INFO: Pages 110: Changed 0, Empty 0; Tup 1821: Vac 0, Keep 0, UnUsed 269. Total CPU 0.01s/0.00u sec elapsed 0.22 sec. INFO: --Relation pg_toast.pg_toast_1255-- INFO: Pages 0: Changed 0, Empty 0; Tup 0: Vac 0, Keep 0, UnUsed 0. Total CPU 0.00s/0.00u sec elapsed 0.00 sec. INFO: Analyzing pg_catalog.pg_proc INFO: --Relation pg_catalog.pg_rewrite-- INFO: Pages 47: Changed 0, Empty 0; Tup 231: Vac 0, Keep 0, UnUsed 73. Total CPU 0.01s/0.00u sec elapsed 0.18 sec. INFO: --Relation pg_toast.pg_toast_16410-- INFO: Pages 22: Changed 0, Empty 0; Tup 77: Vac 0, Keep 0, UnUsed 33. Total CPU 0.00s/0.00u sec elapsed 0.02 sec. INFO: Analyzing pg_catalog.pg_rewrite INFO: --Relation pg_catalog.pg_type-- INFO: Pages 10: Changed 0, Empty 0; Tup 523: Vac 0, Keep 0, UnUsed 27. Total CPU 0.00s/0.00u sec elapsed 0.03 sec. INFO: Analyzing pg_catalog.pg_type INFO: --Relation pg_catalog.pg_attribute-- WARNING: Message from PostgreSQL backend: The Postmaster has informed me that some other backend died abnormallyand possibly corrupted shared memory. I have rolled back the current transaction and am going to terminateyour database system connection and exit. Please reconnect to the database system and repeat your query. server closed the connection unexpectedly This probably means the server terminated abnormally before or whileprocessing the request. connection to server was lost vacuumdb: vacuum empdb failed Thank you Gaetano