Thread: Abnormal termination of PostgreSQL
Tom Lane wrote: > Bill Thoen <bthoen@gisnet.com> writes: > >> Does anyone know what happened and how I can fix it? >> > Well, you evidently hit a bug, but there's not nearly enough info here > to guess whether the bug is in Postgres, PostGIS, GEOS, or GDAL, or > perhaps in your custom build process for one of these. > > ... > > Then send the trace to pgsql-bugs. You'll need to provide some details > about the table and the query, too. > Is this what you need to see? (I attached GDB to the running postgres process and let it continue until it crashed, and then entered 'bt' and this is what I got.) Program received signal SIGSEGV, Segmentation fault. 0x08081357 in slot_deform_tuple (slot=0x92d3618, natts=4) at heaptuple.c:1502 1502 off = att_align_pointer(off, thisatt->attalign, -1, (gdb) bt #0 0x08081357 in slot_deform_tuple (slot=0x92d3618, natts=4) at heaptuple.c:1502 #1 0x08081739 in slot_getattr (slot=0x92d3618, attnum=4, isnull=0x92dd753 "") at heaptuple.c:1625 #2 0x08168525 in ExecProject (projInfo=0x92dd470, isDone=0xbfed09e8) at execQual.c:4601 #3 0x0816e5f6 in ExecScan (node=0x92d3798, accessMtd=0x8179a70 <SeqNext>) at execScan.c:143 #4 0x08179a69 in ExecSeqScan (node=0x92d3798) at nodeSeqscan.c:130 #5 0x08167888 in ExecProcNode (node=0x92d3798) at execProcnode.c:334 #6 0x08165b23 in ExecutorRun (queryDesc=0x92a3b50, direction=ForwardScanDirection, count=0) at execMain.c:1248 #7 0x082024ab in ProcessQuery (plan=0x929c798, params=<value optimized out>, dest=0x92be820, completionTag=0xbfed0c8a "") at pquery.c:179 #8 0x082026ac in PortalRunMulti (portal=0x92ca8b0, isTopLevel=<value optimized out>, dest=0x92be820, altdest=0x92be820, completionTag=0xbfed0c8a "") at pquery.c:1242 #9 0x08202e2c in PortalRun (portal=0x92ca8b0, count=2147483647, isTopLevel=-45 '�', dest=0x92be820, altdest=0x92be820, completionTag=0xbfed0c8a "") at pquery.c:813 #10 0x081fdcf7 in exec_simple_query ( query_string=0x9272d18 "update corn0715 set ncs_yield_factor=ncs_yield_factor/10000, transitional_yield=transitional_yield/100, fsa_yield=fsa_yield/100, aqpproved_yield=aqpproved_yield/100, prev_approved_yield=prev_approved_"...) at postgres.c:986 #11 0x081ff44c in PostgresMain (argc=4, argv=0x921b590, username=0x921b570 "bthoen") at postgres.c:3572 #12 0x081d253f in ServerLoop () at postmaster.c:3207 ---Type <return> to continue, or q <return> to quit--- #13 0x081d3492 in PostmasterMain (argc=3, argv=0x9218378) at postmaster.c:1029 #14 0x08188126 in main (argc=3, argv=0x9218378) at main.c:188 (gdb) Here's the backstory. The process that's crashing is an SQL update query on the table (corn0715). There are 104 fields in the table and the raw record length is about 648 characters. I have just imported it as a delimited file with COPY and there are about 1.7 million records. There are no indexes yet, and even though the database has been "spatialized" by PostGIS this particular table has no spatial component. The database contains only this table and the two reference tables installed by PostGIS. Also, this crash occurs not only when running the update query, but also when trying to create an index or selecting data from all rows or even when running pg_dump. It takes about 10 minuets to occur, which is about the same time that it took to SELECT count(*) FROM corn0715; (which did work, BTW). So I suspect it's an "end-of-table" problem. Here's the query: UPDATE corn0715 SET ncs_yield_factor=ncs_yield_factor/10000, transitional_yield=transitional_yield/100, fsa_yield=fsa_yield/100, aqpproved_yield=aqpproved_yield/100, prev_approved_yield=prev_approved_yield/100, annual_yield_1=annual_yield_1/100, yield_acres_1=yield_acres_1/100,annual_yield_2=annual_yield_2/100, yield_acres_2=yield_acres_2/100,annual_yield_3=annual_yield_3/100, yield_acres_3=yield_acres_3/100, annual_yield_4=annual_yield_4/100, yield_acres_4=yield_acres_4/100, annual_yield_5=annual_yield_5/100, yield_acres_5=yield_acres_5/100, annual_yield_6=annual_yield_6/100, yield_acres_6=yield_acres_6/100, annual_yield_7=annual_yield_7/100, yield_acres_7=yield_acres_7/100, annual_yield_8=annual_yield_8/100, yield_acres_8=yield_acres_8/100, annual_yield_9=annual_yield_9/100, yield_acres_9=yield_acres_9/100, annual_yield_10=annual_yield_10/100, yield_acres_10=yield_acres_10/100, rate_yield=rate_yield/100, average_yield=average_yield/100, perennial_transitional_yield_factor=perennial_transitional_yield_factor/100; I compiled PostgreSQL 8.3.3, PostGIS 1.3.3, GEOS 3.0.0, Proj4 4.6.0 and GDAL 1.5.2 from source using gcc 4.3.0 on a newly installed Fedora Core 9 system. All software is the latest version and the disc drive was wiped clean before I started, so there are no old libraries or configurations lurking in the background. I did have some issues compiling GEOS, PostGIS and GDAL. Apparently, there have been some changes in gcc, particularly in the include files, and I was getting errors like 'memcpy' not declared, and the like. I fixed that (I think) with a diff patch, and everything seemed to compile and install OK. I don't really understand how to interpret the results from GDB, but if somebody who knows could take a look at this and give me some suggestions, I'd appreciate it. If you need samples of the table, or the SQL scripts I used to load it, I can provide those too. If you want to see the entire table you'll need at least about 1.1 GB of disc space. TIA, - Bill Thoen
Bill Thoen <bthoen@gisnet.com> writes: > Is this what you need to see? (I attached GDB to the running postgres > process and let it continue until it crashed, and then entered 'bt' and > this is what I got.) Hmm ... interesting but not conclusive. The most likely explanation for this is corrupt data on-disk, but how it got that way is unclear. Can you reproduce the problem if you import the same data into a new table? > Here's the backstory. The process that's crashing is an SQL update > query on the table (corn0715). There are 104 fields in the table and the > raw record length is about 648 characters. I have just imported it as a > delimited file with COPY and there are about 1.7 million records. Can we see the exact table declaration? I'm wondering if the table has any columns of datatypes that are defined by the add-on modules. > I compiled PostgreSQL 8.3.3, PostGIS 1.3.3, GEOS 3.0.0, Proj4 4.6.0 and > GDAL 1.5.2 from source using gcc 4.3.0 on a newly installed Fedora Core > 9 system. All software is the latest version and the disc drive was > wiped clean before I started, so there are no old libraries or > configurations lurking in the background. > I did have some issues compiling GEOS, PostGIS and GDAL. Apparently, > there have been some changes in gcc, particularly in the include files, > and I was getting errors like 'memcpy' not declared, and the like. I > fixed that (I think) with a diff patch, and everything seemed to compile > and install OK. This seems pretty suspicious to me. gcc 4.3 shouldn't have resulted in any major changes in system header layout. What I am wondering is if GEOS and GDAL have any direct dependencies on Postgres, and if so whether they've been updated to work with 8.3. regards, tom lane
Tom Lane wrote: > This seems pretty suspicious to me. gcc 4.3 shouldn't have resulted in > any major changes in system header layout. What I am wondering is if > GEOS and GDAL have any direct dependencies on Postgres, and if so > whether they've been updated to work with 8.3. FWIW I compile with gcc 4.3 without any problem, and there are buildfarm members that do so as well. Not these add-ons though. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Sun, Jun 22, 2008 at 02:17:47PM -0400, Tom Lane wrote: > Hmm ... interesting but not conclusive. The most likely explanation > for this is corrupt data on-disk, but how it got that way is unclear. > Can you reproduce the problem if you import the same data into a new > table? I'll try that and let you know. > Can we see the exact table declaration? I'm wondering if the table has > any columns of datatypes that are defined by the add-on modules. Sure. It's just char(), integer and double precision; nothing fancy. Here it is: drop table corn0715 ; create table corn0715 ( record_type char(2), reporting_org char(2), fips_st_cd char(2), company char(3), policy_num char(7), crop_year char(4), crop_cd char(4), insurance_plan_cd char(2), fips_cou_cd char(3), unit_num char(5), type_cd char(3), practice_cd char(3), coverage_flag char(1), record_number char(3), t_yield_map_area char(3), ncs_yield_factor double precision, written_agreement_type char(2), written_agreement_num char(8), written_agreement_proc_flag char(2), yield_indicator char(2), transitional_yield double precision, fsa_yield double precision, aqpproved_yield double precision, prev_approved_yield double precision, yield_year_1 char(4), yield_type_1 char(2), annual_yield_1 double precision, yield_acres_1 double precision, revenue_yield_code_1 integer, yield_year_2 char(4), yield_type_2 char(2), annual_yield_2 double precision, yield_acres_2 double precision, revenue_yield_code_2 integer, yield_year_3 char(4), yield_type_3 char(2), annual_yield_3 double precision, yield_acres_3 double precision, revenue_yield_code_3 integer, yield_year_4 char(4), yield_type_4 char(2), annual_yield_4 double precision, yield_acres_4 double precision, revenue_yield_code_4 integer, yield_year_5 char(4), yield_type_5 char(2), annual_yield_5 double precision, yield_acres_5 double precision, revenue_yield_code_5 integer, yield_year_6 char(4), yield_type_6 char(2), annual_yield_6 double precision, yield_acres_6 double precision, revenue_yield_code_6 integer, yield_year_7 char(4), yield_type_7 char(2), annual_yield_7 double precision, yield_acres_7 double precision, revenue_yield_code_7 integer, yield_year_8 char(4), yield_type_8 char(2), annual_yield_8 double precision, yield_acres_8 double precision, revenue_yield_code_8 integer, yield_year_9 char(4), yield_type_9 char(2), annual_yield_9 double precision, yield_acres_9 double precision, revenue_yield_code_9 integer, yield_year_10 char(4), yield_type_10 char(2), annual_yield_10 double precision, yield_acres_10 double precision, revenue_yield_code_10 integer, rate_st char(2), rate_cou char(3), farm_nbr char(7), yield_limitation_flag char(2), excessive_yield_edit_bypass char(1), year_with_yield_refs integer, applicable_option_codes char(16), rate_yield double precision, average_yield double precision, prev_yield_limitation_flag char(2), yield_index char(5), perennial_year_of_set_out char(6), perennial_leaf_year char(2), perennial_density integer, perennial_block_num integer, perennial_transitional_yield_factor double precision, perennial_special_cases char(3), perennial_other char(1), perennial_year_grafting char(6), year_1_skip_row_code char(5), year_2_skip_row_code char(5), year_3_skip_row_code char(5), year_4_skip_row_code char(5), year_5_skip_row_code char(5), year_6_skip_row_code char(5), year_7_skip_row_code char(5), year_8_skip_row_code char(5), year_9_skip_row_code char(5), year_10_skip_row_code char(5) );
On Sun, Jun 22, 2008 at 02:17:47PM -0400, Tom Lane wrote: > Hmm ... interesting but not conclusive. The most likely explanation > for this is corrupt data on-disk, but how it got that way is unclear. > Can you reproduce the problem if you import the same data into a new > table? Arrg! Looks like the first load was corrupted as you suspected.. When I tried it a second time, it seems to have worked just fine. Well, at least I learned how to use GDB, so the time spent wasn't a total waste. I noticed searching through Google for postgres error messages like the one I had that one of the main things you have to teach people is how to use GDB. ;-) Thanks, Tom! I really appreciate your help. - Bill Thoen
Bill Thoen <bthoen@gisnet.com> writes: > On Sun, Jun 22, 2008 at 02:17:47PM -0400, Tom Lane wrote: >> Can you reproduce the problem if you import the same data into a new >> table? > Arrg! Looks like the first load was corrupted as you suspected.. When I > tried it a second time, it seems to have worked just fine. Hmm. Unless you have reason to think that your hardware is flaky, I dislike writing this off as "just a glitch". I think most likely you got bit by a bug somewhere. Still, if we can't reproduce it it's gonna be mighty hard to find. If you have the time and interest, it might be worth repeating the whole setup sequence starting from initdb. I'm speculating that installing PostGIS might have somehow left corruption in server memory that later manifested as the visible problem. regards, tom lane
On Sun, Jun 22, 2008 at 06:47:14PM -0400, Tom Lane wrote: > Hmm. Unless you have reason to think that your hardware is flaky, > I dislike writing this off as "just a glitch". I think most likely > you got bit by a bug somewhere. Still, if we can't reproduce it > it's gonna be mighty hard to find. > > If you have the time and interest, it might be worth repeating the > whole setup sequence starting from initdb. I'm speculating that > installing PostGIS might have somehow left corruption in server memory > that later manifested as the visible problem. > I'm going to have a lot riding on this project if it flies, so if you think there might something lurking in the background, I'll wipe the database and rebuild it and we'll see. The data load is all scripted anyway, so it shouldn't take more than an hour or so. I'll let you know either way tomorrow morning.
Tom Lane wrote: > If you have the time and interest, it might be worth repeating the > whole setup sequence starting from initdb. I'm speculating that > installing PostGIS might have somehow left corruption in server memory > that later manifested as the visible problem. > Tom, I'm not sure if I did this right, but a rebuild of that data cluster worked fine. Here's the steps I took (please correct me if I messed something up here): I stopped the server, then just deleted /usr/local/pgsql/data, followed by: # su - postgres $ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data Then I updated pg_hba.conf to the correct access settings, and started the postmaster service again. Following that I logged into psql and set a password for postgres and created a lower-privileged user account (bthoen) to use with the database I will be building (vfm). Then I created the database and "spatialized" it. $ createdb -O bthoen vfm $ createlang plpgsql vfm $ psql -f /usr/local/pgsql/share/lwpostgis.sql -d vfm Finally I created the corn0715 table and populated it, ran my update script and indexed the table. Finally I did a 'vacuum full' on it to pack it back down. The table seems fine this time, and there was no segfaulting nonsense at any time. But the server log was a lot more active. Whe I loaded this data the first time and had that corruption prroblem, there was hardly anything posted to the log. However, this time I noticed when I ran the postgis script that there was some weirdness written to the log that doesn't look right . Does this look like geos or postgis is a bit broken to you? NOTICE: type "histogram2d" is not yet defined DETAIL: Creating a shell type definition. NOTICE: return type histogram2d is only a shell NOTICE: argument type histogram2d is only a shell NOTICE: argument type histogram2d is only a shell NOTICE: type "spheroid" is not yet defined DETAIL: Creating a shell type definition. NOTICE: return type spheroid is only a shell NOTICE: argument type spheroid is only a shell NOTICE: argument type spheroid is only a shell NOTICE: type "geometry" is not yet defined DETAIL: Creating a shell type definition. NOTICE: return type geometry is only a shell NOTICE: argument type geometry is only a shell NOTICE: argument type geometry is only a shell NOTICE: return type geometry is only a shell NOTICE: return type geometry is only a shell NOTICE: argument type geometry is only a shell NOTICE: argument type geometry is only a shell NOTICE: type "box3d" is not yet definedDETAIL: Creating a shell type definition. NOTICE: argument type box3d is only a shell NOTICE: return type box3d is only a shell NOTICE: argument type box3d is only a shell NOTICE: type "chip" is not yet defined DETAIL: Creating a shell type definition. NOTICE: return type chip is only a shell NOTICE: argument type chip is only a shell NOTICE: argument type chip is only a shell NOTICE: type "box2d" is not yet defined DETAIL: Creating a shell type definition. NOTICE: return type box2d is only a shell NOTICE: argument type box2d is only a shell NOTICE: argument type box2d is only a shell NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "spatial_ref_sys_pkey" for table "spatial_ref_sys" NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "geometry_columns_pk" for table "geometry_columns" ERROR: table "corn0715" does not exist STATEMENT: drop table corn0715 ; LOG: checkpoints are occurring too frequently (25 seconds apart) HINT: Consider increasing the configuration parameter "checkpoint_segments". LOG: checkpoints are occurring too frequently (25 seconds apart) HINT: Consider increasing the configuration parameter "checkpoint_segments". . . . (lots of checkpoints static with the COPY loading, but otherwise things looked normal from here on.) So I think postgres is all right, but I'm not so sure about postgis. I did have problems trying to compile GEOS (and GDAL) so maybe that's where the trouble lies. Any thoughts? - Bill Thoen
Bill Thoen <bthoen@gisnet.com> writes: > The table seems fine this time, and there was no segfaulting nonsense at > any time. But the server log was a lot more active. Whe I loaded this > data the first time and had that corruption prroblem, there was hardly > anything posted to the log. However, this time I noticed when I ran the > postgis script that there was some weirdness written to the log that > doesn't look right . Does this look like geos or postgis is a bit broken > to you? No, that's more or less the sort of output I'd expect to see from loading the postgis datatypes. There's nothing wrong here that I can tell. So, whatever that glitch was, it seems we don't know how to trigger it. Oh well ... regards, tom lane