Thread: problem creating index in 6.5.3
If PostgreSQL failed to compile on your computer or you found a bug that
is likely to be specific to one platform then please fill out this form
and e-mail it to pgsql-ports@postgresql.org.

To report any other bug, fill out the form below and e-mail it to
pgsql-bugs@postgresql.org.

If you not only found the problem but solved it and generated a patch
then e-mail it to pgsql-patches@postgresql.org instead. Please use the
command "diff -c" to generate the patch.

You may also enter a bug report at http://www.postgresql.org/ instead of
e-mailing this form.

============================================================================
                        POSTGRESQL BUG REPORT TEMPLATE
============================================================================

Your name          : Karl DeBisschop
Your email address : kdebisschop@alert.infoplease.com

System Configuration
---------------------
  Architecture (example: Intel Pentium)         : Intel Pentium III 450MHz x2 SMP

  Operating System (example: Linux 2.0.26 ELF)  : Linux version 2.2.7-1.23smp
    (root@jiangsup.var.com) (gcc version egcs-2.91.66 19990314/Linux
    (egcs-1.1.2 release)) #1 SMP Thu Jul 8 15:23:01 PDT 1999

  PostgreSQL version (example: PostgreSQL-6.5.2): PostgreSQL-6.5.3
    (from postgresql-6.5.3-1.i386.rpm distributed by www.PostgreSQL.org)

  Compiler used (example: gcc 2.8.0)            :

Please enter a FULL description of your problem:
------------------------------------------------

Cannot create index. Database consists of:

+------------------+----------------------------------+----------+
| webadmin         | abusers                          | table    |
| kdebisschop      | daily                            | table    |
| webadmin         | monthly                          | table    |
| postgres         | q11999                           | table    |
| kdebisschop      | q21999                           | table    |
| postgres         | q41998                           | table    |
| webadmin         | users_into_db_temp               | table    |
+------------------+----------------------------------+----------+

Table    = daily (also q11999, q21999, q41998, users_into_db_temp, abusers)
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| n                                | int4                             |     4 |
| date                             | date                             |     4 |
| cookie                           | char()                           |     1 |
| id                               | text                             |   var |
+----------------------------------+----------------------------------+-------+

Table    = monthly
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| month                            | date                             |     4 |
| totalusers                       | int4                             |     4 |
| newusers                         | int4                             |     4 |
+----------------------------------+----------------------------------+-------+

Table 'daily' has 10,122,805 tuples and consumes 872,333,312 bytes.

(One part of trying to resolve this has been to move data into the
q[1-4](199n) tables - daily was formerly split into two files - but this
has not seemed to help.)

The problem manifests itself as:

webusers=> CREATE INDEX zdaily_id ON daily (id);
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
We have lost the connection to the backend, so further processing is
impossible.  Terminating.

It runs for probably a little less than an hour and then dies. I have set
debug to level 3 (I was unable to find documentation on what levels are
available) and no message was printed to the log regarding the cause of
the failure. (That is, I grepped the log for 'daily' and 'webusers' - the
only occurrence was connecting and issuing the query - no mention after
that.)

The disk has plenty of space:

$ df -k .
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda7              6123224   3312116   2495032  57% /disk/1

==> 2,554,912,768 bytes

During index generation, it looks like another 1,101,922,304 bytes of temp
sort filespace is consumed, which still leaves over 1.5GB free.

I tried to take snapshots of the index space as it cycled through each
pass, in the hope that I could find out a little about what was going on
when it died. This last snapshot is from just before the backend
terminated:

[postgres@sterno data]$ df -k base/webusers; ls -l base/webusers/pg_sorttemp21771.* base/webusers/z*
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda7              6123224   4187180   1619968  72% /disk/1
-rw-------   1 postgres postgres         0 Dec 17 12:29 base/webusers/pg_sorttemp21771.0
-rw-------   1 postgres postgres  78675968 Dec 17 12:27 base/webusers/pg_sorttemp21771.1
-rw-------   1 postgres postgres         0 Dec 17 12:29 base/webusers/pg_sorttemp21771.10
-rw-------   1 postgres postgres  28639232 Dec 17 12:29 base/webusers/pg_sorttemp21771.11
-rw-------   1 postgres postgres         0 Dec 17 12:29 base/webusers/pg_sorttemp21771.12
-rw-------   1 postgres postgres         0 Dec 17 12:26 base/webusers/pg_sorttemp21771.13
-rw-------   1 postgres postgres         0 Dec 17 12:29 base/webusers/pg_sorttemp21771.2
-rw-------   1 postgres postgres  78675968 Dec 17 12:27 base/webusers/pg_sorttemp21771.3
-rw-------   1 postgres postgres         0 Dec 17 12:29 base/webusers/pg_sorttemp21771.4
-rw-------   1 postgres postgres  78708736 Dec 17 12:28 base/webusers/pg_sorttemp21771.5
-rw-------   1 postgres postgres         0 Dec 17 12:29 base/webusers/pg_sorttemp21771.6
-rw-------   1 postgres postgres  78675968 Dec 17 12:29 base/webusers/pg_sorttemp21771.7
-rw-------   1 postgres postgres         0 Dec 17 12:29 base/webusers/pg_sorttemp21771.8
-rw-------   1 postgres postgres  78708736 Dec 17 12:29 base/webusers/pg_sorttemp21771.9
-rw-------   1 postgres postgres 472956928 Dec 17 12:33 base/webusers/zdaily_id

This did work for us using 6.5.1 compiled in house. Our problems started
with the 6.5.3 RPM.

Other than taking the dump of this data and copying it into a new
database, I have found no way to reproduce this problem. This is the
largest single table we have in production. It does happen with a fresh
copy into a totally new database, so I think the problem is in the DBMS.

Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------

If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------
Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> The problem manifests itself as

> webusers=> CREATE INDEX zdaily_id ON daily (id);
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.

Is a corefile dumped in the database directory when this happens?
If so, can you get a backtrace from it with gdb?

> This did work for us using 6.5.1 compiled in house. Our problems
> started with the 6.5.3 RPM.

Hmm.  We have been seeing some reports recently of other odd behavior
with the Linux RPMs (see the "ordering RH6.1" thread on pgsql-hackers).
Would you take the time to compile 6.5.3 from source and see if you
still see the same behavior?

If you do this, please compile the backend with full debug+assert
support, which is to say
(a) add --enable-cassert to the configure command;
(b) do make COPT=-g all instead of just make all.
That should make the gdb backtrace a lot more informative, if a
coredump happens...

			regards, tom lane
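For reference, Tom's two build changes amount to roughly the sequence below. This is only a sketch: the unpack location is a placeholder, and it assumes the usual 6.5.x source layout where configure lives under src/; only the --enable-cassert and COPT=-g parts come from Tom's mail.

    # Debug+assert build of 6.5.3 from source (paths are illustrative)
    cd /usr/local/src/postgresql-6.5.3/src
    ./configure --enable-cassert     # compile assertion checks into the backend
    make COPT=-g all                 # keep debug symbols so gdb backtraces are informative
    make install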
> cc: pgsql-bugs@postgreSQL.org
> Date: Fri, 17 Dec 1999 13:59:13 -0500
> From: Tom Lane <tgl@sss.pgh.pa.us>
>
> Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> > The problem manifests itself as
>
> > webusers=> CREATE INDEX zdaily_id ON daily (id);
> > pqReadData() -- backend closed the channel unexpectedly.
> >         This probably means the backend terminated abnormally
> >         before or while processing the request.
>
> Is a corefile dumped in the database directory when this happens?
> If so, can you get a backtrace from it with gdb?

No corefile is dumped.

> > This did work for us using 6.5.1 compiled in house. Our problems
> > started with the 6.5.3 RPM.
>
> Hmm.  We have been seeing some reports recently of other odd behavior
> with the Linux RPMs (see the "ordering RH6.1" thread on pgsql-hackers).
> Would you take the time to compile 6.5.3 from source and see if you
> still see the same behavior?

In progress. Stumbling over the fact that our VA Research machine is not
recognized by config.guess -- found out how to fix config.guess to correct
that (I probably should have checked over at FSF first). Just patching the
SRPM now.

> If you do this, please compile the backend with full debug+assert
> support, which is to say
> (a) add --enable-cassert to the configure command;
> (b) do make COPT=-g all instead of just make all.
> That should make the gdb backtrace a lot more informative, if
> a coredump happens...
>
> 			regards, tom lane

Okey dokey.

karl
Recompiled with
  (a) --enable-cassert added to the configure command;
  (b) make COPT=-g all instead of just make all,
on the same machine that the backend runs on. Still the same behaviour.

Saw a small notice in the log file:

proc_exit(0) [#0]
shmem_exit(0) [#0]
exit(0)
proc_exit(0) [#0]
shmem_exit(0) [#0]
exit(0)

Still no core.

I'm not really sure there is a linkage with the ordering thing - remember
that this worked fine a week ago, before I upgraded to 6.5.3. Unless the
backend reaches out to different libraries than in 6.5.1, I think it's
unlikely to be RH6.x libraries. Also, even though RH libs can be odd, this
is not a Red Hat machine - it's VA Linux, which uses RPMs but is in many
respects different from Red Hat.

So I'm still stumped. For laughs, I think I'll try it on a much less
powerful RH6.0 box and see what I get.

Let me know if there's anything else I can do to help diagnose.

Karl
> From: Tom Lane <tgl@sss.pgh.pa.us>
>
> Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> > The problem manifests itself as
>
> > webusers=> CREATE INDEX zdaily_id ON daily (id);
> > pqReadData() -- backend closed the channel unexpectedly.
> >         This probably means the backend terminated abnormally
> >         before or while processing the request.
>
> Is a corefile dumped in the database directory when this happens?
> If so, can you get a backtrace from it with gdb?
>
> > This did work for us using 6.5.1 compiled in house. Our problems
> > started with the 6.5.3 RPM.
>
> Hmm.  We have been seeing some reports recently of other odd behavior
> with the Linux RPMs (see the "ordering RH6.1" thread on pgsql-hackers).
> Would you take the time to compile 6.5.3 from source and see if you
> still see the same behavior?
>
> If you do this, please compile the backend with full debug+assert
> support, which is to say
> (a) add --enable-cassert to the configure command;
> (b) do make COPT=-g all instead of just make all.
> That should make the gdb backtrace a lot more informative, if
> a coredump happens...

Finally got a tiny little error message:

sterno.infoplease.com:/u/kdebisschop> psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily (id)"
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
We have lost the connection to the backend, so further processing is impossible.  Terminating.
[2]+  Done          psql -p 5433 webusers -c "CREATE INDEX zdaily_date_n ON daily (date,n)"
sterno.infoplease.com:/u/kdebisschop> psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily USING hash (id)"
ERROR:  cannot open segment 1 of relation zdaily_id
sterno.infoplease.com:/u/kdebisschop>

Still no core dump. Each of these takes an hour or so to croak.

Any ideas would be welcome.

Karl DeBisschop
Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> sterno.infoplease.com:/u/kdebisschop> psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily (id)"
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.
> We have lost the connection to the backend, so further processing is impossible.  Terminating.
> [2]+  Done          psql -p 5433 webusers -c "CREATE INDEX zdaily_date_n ON daily (date,n)"
> sterno.infoplease.com:/u/kdebisschop> psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily USING hash (id)"
> ERROR:  cannot open segment 1 of relation zdaily_id
> sterno.infoplease.com:/u/kdebisschop>

> Still no core dump. Each of these takes an hour or so to croak.

Puzzling...

The lack of a core dump is probably the thing we ought to be paying
attention to here --- most ordinary sorts of unexpected-backend-crash
bugs should dump core.

One possibility is that the kernel is killing that backend process,
most likely for having exceeded a system limit on runtime or disk blocks
written; do you have such limits enabled on your machine?  Or maybe it
is running out of memory/swap space; how big does the process get before
terminating?

Another possibility is that the backend has detected a fatal error
condition and is doing a more-or-less-orderly shutdown.  However if that
were true then there ought to be an error message in the postmaster
logfile; I see no such message in your prior mails.

Also, I just noticed that you are trying to index a "text" field.
How long are the entries in the field?  6.5.* does not have adequate
defenses against overly-long index entries --- the maximum safe size
is 2700 bytes IIRC, but this isn't tested for in the current release.

			regards, tom lane
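Two of the checks Tom asks about can be made straight from a shell on the server. The commands below are only one way to do it and are not part of the thread itself; the database and column names come from the bug report, and it is assumed that length() behaves on text columns in 6.5 as it does in later releases.

    # As the user the backend runs as (postgres):
    ulimit -a                                              # any cpu-time, file-size, or memory limits in effect?
    psql webusers -c "SELECT max(length(id)) FROM daily;"  # longest indexed value, vs. the ~2700-byte safe limit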
> From: Tom Lane <tgl@sss.pgh.pa.us>
>
> Puzzling...
>
> The lack of a core dump is probably the thing we ought to be paying
> attention to here --- most ordinary sorts of unexpected-backend-crash
> bugs should dump core.

Seems reasonable. To the extent that my frustration over this problem has
become an emotional issue, a lot of that has to do with the fact that I
can't seem to get a clue what the problem is.

> One possibility is that the kernel is killing that backend process,
> most likely for having exceeded a system limit on runtime or disk blocks
> written; do you have such limits enabled on your machine?  Or maybe it
> is running out of memory/swap space; how big does the process get before
> terminating?

No quotas are established. This is not the largest, nor the longest,
process that runs routinely on this machine. Plus, it ran under 6.5.1, but
stopped running once I installed 6.5.3. As far as machine limits, I'm
pretty sure we are not reaching any. See for yourself if you want:

========================================================================
[postgres@sterno webusers]$ ps -u postgres -o sz,vsz,rss,pid,time,stat,args; df -lk; ls -sk1 z* pg_so*; cat /proc/meminfo
  SZ   VSZ   RSS   PID     TIME STAT COMMAND
5263 21052  1008 30040 00:00:01 S    /usr/bin/postmaster -i -d 3 -B 2048 -D/var/lib/pgsql -o -S 10240
3085 12340   524  5713 00:00:00 S    /opt/postgresql/bin/postmaster -p 5433 -B 1024 -S -D/opt/postgres-6.5.1-webusers-db-o -S 5120
 423  1692   964  8558 00:00:00 S    -bash
3349 13396 10308  8869 00:01:39 R    /opt/postgresql/bin/postgres localhost webadmin webusers CREATE
 421  1684   944  8873 00:00:00 S    -bash
 663  2652   956  8916 00:00:00 R    ps -u postgres -o sz,vsz,rss,pid,time,stat,args
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda1              1981000   1114966    763620  59% /
/dev/sda7              6123224   3344200   2462948  58% /disk/1
 19608 pg_sorttemp8869.0
 66184 pg_sorttemp8869.1
 19352 pg_sorttemp8869.10
 66184 pg_sorttemp8869.11
 19480 pg_sorttemp8869.12
 66152 pg_sorttemp8869.13
 19640 pg_sorttemp8869.2
 66184 pg_sorttemp8869.3
 19416 pg_sorttemp8869.4
 66184 pg_sorttemp8869.5
 19384 pg_sorttemp8869.6
 66184 pg_sorttemp8869.7
 19576 pg_sorttemp8869.8
 66184 pg_sorttemp8869.9
224072 zdaily_cookie
235708 zdaily_date_n
    16 zdaily_id
        total:      used:     free:   shared:  buffers:    cached:
Mem:  1061101568 1058770944  2330624  59154432 383434752  615337984
Swap:  526401536   18931712 507469824
MemTotal:   1036232 kB
MemFree:       2276 kB
MemShared:    57768 kB
Buffers:     374448 kB
Cached:      600916 kB
SwapTotal:   514064 kB
SwapFree:    495576 kB
========================================================================

MemFree bounces around some, occasionally becoming rather large, but that
is an OS function. There's always plenty of swap to draw from. The process
size basically does not change during operation.

> Another possibility is that the backend has detected a fatal error
> condition and is doing a more-or-less-orderly shutdown.  However if that
> were true then there ought to be an error message in the postmaster
> logfile; I see no such message in your prior mails.

There have been no such messages.

> Also, I just noticed that you are trying to index a "text" field.
> How long are the entries in the field?  6.5.* does not have adequate
> defenses against overly-long index entries --- the maximum safe size
> is 2700 bytes IIRC, but this isn't tested for in the current release.

Text field max size is 32 characters.

I have done a little more poking around. Just to confirm that size is the
problem, I recreated the schema and imported just 30 or so records. The
index works fine.
Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> Text field max size is 32 characters.

Drat, another good theory down the drain ;-)

I think it's time to start wheeling out the heavy artillery. Do you know
how to use gdb?  I think we might be able to get some useful info this way:

1. Start a postgres process running doing the CREATE INDEX.  Use ps to
   determine the PID of the backend process.

2. gdb /usr/local/pgsql/bin/postgres (or wherever you keep the
   executable).  You will need to run this as the postgres user.

3. In gdb:
	attach PID-of-backend-process
	break exit
	break _exit		(just belt and suspenders stuff)
	cont
   (Depending on your system, it might not let you set a breakpoint at
   exit, in which case try setting breaks at "proc_exit" and "quickdie".)

4. Wait for backend to fail.

5. If you hit any of the breakpoints, use "bt" to get a backtrace.
   Otherwise, gdb should at least tell you what signal caused the program
   to terminate ... which will be more info than we have now ...

			regards, tom lane
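Put together, the procedure looks roughly like the session sketched below. The gdb commands are exactly the ones Tom lists; the backend PID is a placeholder, and the binary path is the one visible in Karl's earlier ps output rather than anything specified here.

    # Terminal 1: start the failing statement so there is a backend to attach to
    psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily (id)" &
    ps -u postgres -o pid,args | grep 'bin/postgres'   # note the backend PID

    # Terminal 2, as the postgres user:
    gdb /opt/postgresql/bin/postgres
    (gdb) attach <backend-PID>
    (gdb) break exit
    (gdb) break _exit
    (gdb) cont
    # ... wait for the failure; if a breakpoint is hit:
    (gdb) bt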
gdb session follows:

GNU gdb 4.17.0.11 with Linux support
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...

(gdb) attach 9923
Attaching to program `/opt/postgresql/bin/postgres', Pid 9923
Reading symbols from /lib/libcrypt.so.1...done.
Reading symbols from /lib/libnsl.so.1...done.
Reading symbols from /lib/libdl.so.2...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /usr/lib/libreadline.so.3...done.
Reading symbols from /usr/lib/libhistory.so.3...done.
Reading symbols from /lib/libtermcap.so.2...done.
Reading symbols from /usr/lib/libncurses.so.4...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
0x80d8363 in SpinAcquire ()
(gdb) break exit
Breakpoint 1 at 0x2abb0532: file exit.c, line 40.
(gdb) break _exit
Breakpoint 2 at 0x2ac1e910
(gdb) cont
Continuing.

Breakpoint 1, exit (status=0) at exit.c:40
exit.c:40: No such file or directory.
(gdb) bt
#0  exit (status=0) at exit.c:40
#1  0x80d6694 in proc_exit ()
#2  0x81075ef in elog ()
#3  0x80775f6 in _bt_buildadd ()
#4  0x80777ea in _bt_merge ()
#5  0x8077af5 in _bt_leafbuild ()
#6  0x807340d in btbuild ()
#7  0x810800f in fmgr_c ()
#8  0x8108227 in fmgr ()
#9  0x807e60b in index_build ()
#10 0x807de2b in index_create ()
#11 0x8085376 in DefineIndex ()
#12 0x80e006d in ProcessUtility ()
#13 0x80de096 in pg_exec_query_dest ()
#14 0x80ddfb7 in pg_exec_query ()
#15 0x80df03c in PostgresMain ()
#16 0x80c91ba in DoBackend ()
#17 0x80c8cda in BackendStartup ()
#18 0x80c8429 in ServerLoop ()
#19 0x80c7f67 in PostmasterMain ()
#20 0x80a0986 in main ()
#21 0x2aba7cb3 in __libc_start_main (main=0x80a0920 <main>, argc=5,
    argv=0x7ffffd34, init=0x8061360 <_init>, fini=0x810d63c <_fini>,
    rtld_fini=0x2aab5350 <_dl_fini>, stack_end=0x7ffffd2c)
    at ../sysdeps/generic/libc-start.c:78
(gdb) cont
Continuing.

Breakpoint 2, 0x2ac1e910 in _exit ()
(gdb) bt
#0  0x2ac1e910 in _exit ()
#1  0x2abb05f8 in exit () at exit.c:70
#2  0x80d6694 in proc_exit ()
#3  0x81075ef in elog ()
#4  0x80775f6 in _bt_buildadd ()
#5  0x80777ea in _bt_merge ()
#6  0x8077af5 in _bt_leafbuild ()
#7  0x807340d in btbuild ()
#8  0x810800f in fmgr_c ()
#9  0x8108227 in fmgr ()
#10 0x807e60b in index_build ()
#11 0x807de2b in index_create ()
#12 0x8085376 in DefineIndex ()
#13 0x80e006d in ProcessUtility ()
#14 0x80de096 in pg_exec_query_dest ()
#15 0x80ddfb7 in pg_exec_query ()
#16 0x80df03c in PostgresMain ()
#17 0x80c91ba in DoBackend ()
#18 0x80c8cda in BackendStartup ()
#19 0x80c8429 in ServerLoop ()
#20 0x80c7f67 in PostmasterMain ()
#21 0x80a0986 in main ()
#22 0x2aba7cb3 in __libc_start_main (main=0x80a0920 <main>, argc=5,
    argv=0x7ffffd34, init=0x8061360 <_init>, fini=0x810d63c <_fini>,
    rtld_fini=0x2aab5350 <_dl_fini>, stack_end=0x7ffffd2c)
    at ../sysdeps/generic/libc-start.c:78
(gdb) cont
Continuing.

Program exited normally.
(gdb)
Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> (gdb) bt
> #0  exit (status=0) at exit.c:40
> #1  0x80d6694 in proc_exit ()
> #2  0x81075ef in elog ()
> #3  0x80775f6 in _bt_buildadd ()
> #4  0x80777ea in _bt_merge ()
> #5  0x8077af5 in _bt_leafbuild ()
> #6  0x807340d in btbuild ()

So the darn thing *is* doing an elog(FATAL)!  If you look at _bt_buildadd
(in backend/access/nbtree/nbtsort.c) there are two calls to elog:

    elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
    elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)");

If you'd built the backend with -g then we'd know which one this is, but
at this point it probably doesn't matter much.

OK, now we have two questions: (1) why is it failing, and (2) why isn't
the elog message showing up in your postmaster log?

(You may also be wondering (3) why isn't psql reporting the elog message,
but that one is a known bug in 6.5.* libpq --- it forgets to report
pending ERROR messages if it sees a backend disconnect.  The message ought
to have made it to your postmaster log, though.)

I have to go do some real work right now :-(, but this needs looking into.

			regards, tom lane
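For anyone following along in the source, the two candidate calls are easy to locate by grepping for the message text Tom quotes; the path prefix below simply assumes an unpacked 6.5.3 tree.

    grep -n "failed to add item to the page in _bt_sort" \
        postgresql-6.5.3/src/backend/access/nbtree/nbtsort.c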