Thread: problem creating index in 6.5.3

problem creating index in 6.5.3

From: Karl DeBisschop

If PostgreSQL failed to compile on your computer or you found a bug that
is likely to be specific to one platform then please fill out this form
and e-mail it to pgsql-ports@postgresql.org.

To report any other bug, fill out the form below and e-mail it to
pgsql-bugs@postgresql.org.

If you not only found the problem but solved it and generated a patch,
then e-mail it to pgsql-patches@postgresql.org instead.  Please use the
command "diff -c" to generate the patch.

You may also enter a bug report at http://www.postgresql.org/ instead of
e-mailing this form.

============================================================================
                        POSTGRESQL BUG REPORT TEMPLATE
============================================================================


Your name               : Karl DeBisschop
Your email address      : kdebisschop@alert.infoplease.com


System Configuration
---------------------
  Architecture (example: Intel Pentium)         :   Intel Pentium III 450MHz x2 SMP

  Operating System (example: Linux 2.0.26 ELF)  :   Linux version 2.2.7-1.23smp (root@jiangsup.var.com) (gcc version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release)) #1 SMP Thu Jul 8 15:23:01 PDT 1999

  PostgreSQL version (example: PostgreSQL-6.5.2):   PostgreSQL-6.5.3 (from postgresql-6.5.3-1.i386.rpm distributed by www.PostgreSQL.org)

  Compiler used (example:  gcc 2.8.0)           :


Please enter a FULL description of your problem:
------------------------------------------------

Cannot create index. Database consists of:

 +------------------+----------------------------------+----------+
 | webadmin         | abusers                          | table    |
 | kdebisschop      | daily                            | table    |
 | webadmin         | monthly                          | table    |
 | postgres         | q11999                           | table    |
 | kdebisschop      | q21999                           | table    |
 | postgres         | q41998                           | table    |
 | webadmin         | users_into_db_temp               | table    |
 +------------------+----------------------------------+----------+


Table    = daily (also q11999,q21999,q41998,users_into_db_temp,abusers)
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| n                                | int4                             |     4 |
| date                             | date                             |     4 |
| cookie                           | char()                           |     1 |
| id                               | text                             |   var |
+----------------------------------+----------------------------------+-------+

Table    = monthly
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| month                            | date                             |     4 |
| totalusers                       | int4                             |     4 |
| newusers                         | int4                             |     4 |
+----------------------------------+----------------------------------+-------+

Table 'daily' has 10,122,805 tuples and consumes 872,333,312 bytes

(One part of trying to resolve this has been to move data into the
q[1-4](199n) tables - daily was formerly split into two files - but
this has not seemed to help)

The problem manifests itself as

webusers=> CREATE INDEX zdaily_id ON daily (id);
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
We have lost the connection to the backend, so further processing is impossible.  Terminating.

It runs for probably a little less than an hour, then dies.

I have set debug to level 3 (I was unable to find documentation on what
levels are available) and no message was printed to the log regarding
the cause of the failure.  (That is, I grepped the log for 'daily' and
'webusers' -- the only occurrences were the connection and the query
itself, with no mention after that.)
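
For reference, the postmaster here is started with -d 3 along the
following lines (the logfile redirection is an assumption about this
particular setup):

  $ postmaster -i -d 3 -B 2048 -D /var/lib/pgsql -o '-S 10240' > server.log 2>&1 &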

The disk has plenty of space:
$ df -k .
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda7              6123224   3312116   2495032  57% /disk/1

==> 2,554,912,768 bytes free (2,495,032 available 1k-blocks x 1,024)

During index generation, it looks like another 1,101,922,304 bytes of
temp sort filespace is consumed, which still leaves over 1.5GB free.  I
tried to take snapshots of the index space as it cycled through each
pass, in the hope that I could find out a little about what was going on
when it died.  This last snapshot is from just before the backend terminated:


[postgres@sterno data]$ df -k base/webusers;ls -l base/webusers/pg_sorttemp21771.* base/webusers/z*
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda7              6123224   4187180   1619968  72% /disk/1
-rw-------   1 postgres postgres        0 Dec 17 12:29 base/webusers/pg_sorttemp21771.0
-rw-------   1 postgres postgres 78675968 Dec 17 12:27 base/webusers/pg_sorttemp21771.1
-rw-------   1 postgres postgres        0 Dec 17 12:29 base/webusers/pg_sorttemp21771.10
-rw-------   1 postgres postgres 28639232 Dec 17 12:29 base/webusers/pg_sorttemp21771.11
-rw-------   1 postgres postgres        0 Dec 17 12:29 base/webusers/pg_sorttemp21771.12
-rw-------   1 postgres postgres        0 Dec 17 12:26 base/webusers/pg_sorttemp21771.13
-rw-------   1 postgres postgres        0 Dec 17 12:29 base/webusers/pg_sorttemp21771.2
-rw-------   1 postgres postgres 78675968 Dec 17 12:27 base/webusers/pg_sorttemp21771.3
-rw-------   1 postgres postgres        0 Dec 17 12:29 base/webusers/pg_sorttemp21771.4
-rw-------   1 postgres postgres 78708736 Dec 17 12:28 base/webusers/pg_sorttemp21771.5
-rw-------   1 postgres postgres        0 Dec 17 12:29 base/webusers/pg_sorttemp21771.6
-rw-------   1 postgres postgres 78675968 Dec 17 12:29 base/webusers/pg_sorttemp21771.7
-rw-------   1 postgres postgres        0 Dec 17 12:29 base/webusers/pg_sorttemp21771.8
-rw-------   1 postgres postgres 78708736 Dec 17 12:29 base/webusers/pg_sorttemp21771.9
-rw-------   1 postgres postgres 472956928 Dec 17 12:33 base/webusers/zdaily_id

This did work for us using 6.5.1 compiled in house.  Our problems
started with the 6.5.3 RPM.

Other than taking the dump of this data and copying it into a new
database, I have found no way to reproduce this problem.  This is the
largest single table we have in production.  It does happen with a
fresh copy into a totally new database, so I think the problem is in
the DBMS.

Please describe a way to repeat the problem.   Please try to provide a
concise reproducible example, if at all possible:
----------------------------------------------------------------------



If you know how this problem might be fixed, list the solution below:
---------------------------------------------------------------------




Re: [BUGS] problem creating index in 6.5.3

From: Tom Lane

Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> The problem manifests itself as

> webusers=> CREATE INDEX zdaily_id ON daily (id);
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.

Is a corefile dumped in the database directory when this happens?
If so, can you get a backtrace from it with gdb?

> This did work for us using 6.5.1 compiled in house.  Our problems
> started with the 6.5.3 RPM.

Hmm.  We have been seeing some reports recently of other odd behavior
with the Linux RPMs (see the "ordering RH6.1" thread on pgsql-hackers).
Would you take the time to compile 6.5.3 from source and see if you
still see the same behavior?

If you do this, please compile the backend with full debug+assert
support, which is to say
  (a) add --enable-cassert to the configure command;
  (b) do   make COPT=-g all  instead of just make all.
That should make the gdb backtrace a lot more informative, if
a coredump happens...
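
In other words, roughly (assuming a source tree unpacked from the
6.5.3 tarball, with the install location as configured):

  $ cd postgresql-6.5.3/src
  $ ./configure --enable-cassert
  $ make COPT=-g all
  $ make install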

            regards, tom lane

Re: [BUGS] problem creating index in 6.5.3

From: Karl DeBisschop

>   cc: pgsql-bugs@postgreSQL.org
>   Date: Fri, 17 Dec 1999 13:59:13 -0500
>   From: Tom Lane <tgl@sss.pgh.pa.us>
>
>   Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
>   > The problem manifests itself as
>
>   > webusers=> CREATE INDEX zdaily_id ON daily (id);
>   > pqReadData() -- backend closed the channel unexpectedly.
>   >         This probably means the backend terminated abnormally
>   >         before or while processing the request.
>
>   Is a corefile dumped in the database directory when this happens?
>   If so, can you get a backtrace from it with gdb?

No corefile is dumped.

>   > This did work for us using 6.5.1 compiled in house.  Our problems
>   > started with the 6.5.3 RPM.
>
>   Hmm.  We have been seeing some reports recently of other odd behavior
>   with the Linux RPMs (see the "ordering RH6.1" thread on pgsql-hackers).
>   Would you take the time to compile 6.5.3 from source and see if you
>   still see the same behavior?

In progress.  Stumbling over the fact that our VA Research box is not
recognized by config.guess -- I found out how to fix config.guess to
correct that (I probably should have checked over at the FSF first).
Just patching the SRPM now.

>   If you do this, please compile the backend with full debug+assert
>   support, which is to say
>     (a) add --enable-cassert to the configure command;
>     (b) do   make COPT=-g all  instead of just make all.
>   That should make the gdb backtrace a lot more informative, if
>   a coredump happens...
>
>               regards, tom lane


okey dokey.

karl

Re: [BUGS] problem creating index in 6.5.3

From: Karl DeBisschop

Recompiled with

  (a) add --enable-cassert to the configure command;
  (b) do   make COPT=-g all  instead of just make all.

on the same machine that the backend runs on.

Still the same behaviour.  I saw a small notice in the log file:

proc_exit(0) [#0]
shmem_exit(0) [#0]
exit(0)
proc_exit(0) [#0]
shmem_exit(0) [#0]
exit(0)

Still no core.

I'm not really sure there is a linkage with the ordering thing --
remember that this worked fine a week ago, before I upgraded to 6.5.3.
Unless the backend reaches out to different libraries than it did in
6.5.1, I think it's unlikely to be the RH6.n libraries.  Also, even
though RH libs can be odd, this is not a Red Hat machine -- it's VA
Linux, which uses RPMs but is in many respects different from Red Hat.

So I'm still stumped.  For laughs, I think I'll try it on a much less
powerful RH6.0 box and see what I get.

Let me know if there's anything else I can do to help diagnose.

Karl

Re: [BUGS] problem creating index in 6.5.3

From: Karl DeBisschop

>   From: Tom Lane <tgl@sss.pgh.pa.us>
>
>   Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
>   > The problem manifests itself as
>
>   > webusers=> CREATE INDEX zdaily_id ON daily (id);
>   > pqReadData() -- backend closed the channel unexpectedly.
>   >         This probably means the backend terminated abnormally
>   >         before or while processing the request.
>
>   Is a corefile dumped in the database directory when this happens?
>   If so, can you get a backtrace from it with gdb?
>
>   > This did work for us using 6.5.1 compiled in house.  Our problems
>   > started with the 6.5.3 RPM.
>
>   Hmm.  We have been seeing some reports recently of other odd behavior
>   with the Linux RPMs (see the "ordering RH6.1" thread on pgsql-hackers).
>   Would you take the time to compile 6.5.3 from source and see if you
>   still see the same behavior?
>
>   If you do this, please compile the backend with full debug+assert
>   support, which is to say
>     (a) add --enable-cassert to the configure command;
>     (b) do   make COPT=-g all  instead of just make all.
>   That should make the gdb backtrace a lot more informative, if
>   a coredump happens...

Finally got a tiny little error message:

sterno.infoplease.com:/u/kdebisschop> psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily (id)"
pqReadData() -- backend closed the channel unexpectedly.
        This probably means the backend terminated abnormally
        before or while processing the request.
We have lost the connection to the backend, so further processing is impossible.  Terminating.
[2]+  Done                    psql -p 5433 webusers -c "CREATE INDEX zdaily_date_n ON daily (date,n)"
sterno.infoplease.com:/u/kdebisschop> psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily USING hash (id)"
ERROR:  cannot open segment 1 of relation zdaily_id
sterno.infoplease.com:/u/kdebisschop>

Still no core dump.  Each of these takes an hour or so to croak.

Any ideas would be welcome.

Karl DeBisschop

Re: [BUGS] problem creating index in 6.5.3

From: Tom Lane

Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> sterno.infoplease.com:/u/kdebisschop> psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily (id)"
> pqReadData() -- backend closed the channel unexpectedly.
>         This probably means the backend terminated abnormally
>         before or while processing the request.
> We have lost the connection to the backend, so further processing is impossible.  Terminating.
> [2]+  Done                    psql -p 5433 webusers -c "CREATE INDEX zdaily_date_n ON daily (date,n)"
> sterno.infoplease.com:/u/kdebisschop> psql -p 5433 webusers -c "CREATE INDEX zdaily_id ON daily USING hash (id)"
> ERROR:  cannot open segment 1 of relation zdaily_id
> sterno.infoplease.com:/u/kdebisschop>

> Still no core dump.  Each of these takes an hour or so to croak.

Puzzling...

The lack of a core dump is probably the thing we ought to be paying
attention to here --- most ordinary sorts of unexpected-backend-crash
bugs should dump core.

One possibility is that the kernel is killing that backend process,
most likely for having exceeded a system limit on runtime or disk blocks
written; do you have such limits enabled on your machine?  Or maybe it
is running out of memory/swap space; how big does the process get before
terminating?
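
For example, something along these lines should show both (the PID
being whatever ps reports for the backend):

  $ ulimit -a                              # per-process resource limits
  $ ps -o vsz,rss,args -p <backend-PID>    # backend size over time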

Another possibility is that the backend has detected a fatal error
condition and is doing a more-or-less-orderly shutdown.  However if that
were true then there ought to be an error message in the postmaster
logfile; I see no such message in your prior mails.

Also, I just noticed that you are trying to index a "text" field.
How long are the entries in the field?  6.5.* does not have adequate
defenses against overly-long index entries --- the maximum safe size
is 2700 bytes IIRC, but this isn't tested for in the current release.
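
A quick way to check, using the table and column from your report
(a sketch -- any equivalent query would do):

  $ psql webusers -c "SELECT max(length(id)) FROM daily"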

            regards, tom lane

Re: [BUGS] problem creating index in 6.5.3

From: Karl DeBisschop

>   From: Tom Lane <tgl@sss.pgh.pa.us>
>
>   Puzzling...
>
>   The lack of a core dump is probably the thing we ought to be paying
>   attention to here --- most ordinary sorts of unexpected-backend-crash
>   bugs should dump core.

Seems reasonable.  To the extent that my frustration over this problem
has become an emotional issue, a lot of that has to do with the fact
that I can't seem to get a clue about what the problem is.

>   One possibility is that the kernel is killing that backend process,
>   most likely for having exceeded a system limit on runtime or disk blocks
>   written; do you have such limits enabled on your machine?  Or maybe it
>   is running out of memory/swap space; how big does the process get before
>   terminating?

No quotas are established.  This is neither the largest nor the
longest-running process that runs routinely on this machine.  Plus, it
ran under 6.5.1, but stopped running once I installed 6.5.3.

As far as machine limits go, I'm pretty sure we are not reaching any.
See for yourself if you want:

========================================================================
[postgres@sterno webusers]$ ps -u postgres -o sz,vsz,rss,pid,time,stat,args;df -lk; ls -sk1 z* pg_so*;cat /proc/meminfo
   SZ   VSZ  RSS   PID     TIME STAT COMMAND
 5263 21052 1008 30040 00:00:01 S    /usr/bin/postmaster -i -d 3 -B 2048 -D/var/lib/pgsql -o -S 10240
 3085 12340  524  5713 00:00:00 S    /opt/postgresql/bin/postmaster -p 5433 -B 1024 -S -D/opt/postgres-6.5.1-webusers-db -o -S 5120
  423  1692  964  8558 00:00:00 S    -bash
 3349 13396 10308 8869 00:01:39 R    /opt/postgresql/bin/postgres localhost webadmin webusers CREATE
  421  1684  944  8873 00:00:00 S    -bash
  663  2652  956  8916 00:00:00 R    ps -u postgres -o sz,vsz,rss,pid,time,stat,args
Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/sda1              1981000   1114966    763620  59% /
/dev/sda7              6123224   3344200   2462948  58% /disk/1
 19608 pg_sorttemp8869.0
 66184 pg_sorttemp8869.1
 19352 pg_sorttemp8869.10
 66184 pg_sorttemp8869.11
 19480 pg_sorttemp8869.12
 66152 pg_sorttemp8869.13
 19640 pg_sorttemp8869.2
 66184 pg_sorttemp8869.3
 19416 pg_sorttemp8869.4
 66184 pg_sorttemp8869.5
 19384 pg_sorttemp8869.6
 66184 pg_sorttemp8869.7
 19576 pg_sorttemp8869.8
 66184 pg_sorttemp8869.9
224072 zdaily_cookie
235708 zdaily_date_n
    16 zdaily_id
        total:    used:    free:  shared: buffers:  cached:
Mem:  1061101568 1058770944  2330624 59154432 383434752 615337984
Swap: 526401536 18931712 507469824
MemTotal:   1036232 kB
MemFree:       2276 kB
MemShared:    57768 kB
Buffers:     374448 kB
Cached:      600916 kB
SwapTotal:   514064 kB
SwapFree:    495576 kB
========================================================================

MemFree bounces around some, occasionally becoming rather large, but
that is an OS function.  There's always plenty of swap to draw from.

The process size basically does not change during operation.

>   Another possibility is that the backend has detected a fatal error
>   condition and is doing a more-or-less-orderly shutdown.  However if that
>   were true then there ought to be an error message in the postmaster
>   logfile; I see no such message in your prior mails.

There have been no such messages.

>   Also, I just noticed that you are trying to index a "text" field.
>   How long are the entries in the field?  6.5.* does not have adequate
>   defenses against overly-long index entries --- the maximum safe size
>   is 2700 bytes IIRC, but this isn't tested for in the current release.

Text field max size is 32 characters.

I have done a little more poking around.  Just to confirm whether size
is the problem, I recreated the schema and imported just 30 or so
records.  The index builds fine.

Re: [BUGS] problem creating index in 6.5.3

From: Tom Lane

Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> Text field max size is 32 characters.

Drat, another good theory down the drain ;-)

I think it's time to start wheeling out the heavy artillery.  Do you
know how to use gdb?  I think we might be able to get some useful info
this way:

1. Start a postgres process running the CREATE INDEX.  Use ps to
determine the PID of the backend process.

2. gdb /usr/local/pgsql/bin/postgres (or wherever you keep the
executable).  You will need to run this as the postgres user.

3. In gdb:
        attach PID-of-backend-process
        break exit
        break _exit        (just belt and suspenders stuff)
        cont
(Depending on your system, it might not let you set a breakpoint at
exit, in which case try setting breaks at "proc_exit" and "quickdie".)

4. Wait for backend to fail.

5. If you hit any of the breakpoints, use "bt" to get a backtrace.
Otherwise, gdb should at least tell you what signal caused the program
to terminate ... which will be more info than we have now ...

            regards, tom lane

Re: [BUGS] problem creating index in 6.5.3

From: Karl DeBisschop

gdb session follows:

GNU gdb 4.17.0.11 with Linux support
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux"...
(gdb) attach 9923
Attaching to program `/opt/postgresql/bin/postgres', Pid 9923
Reading symbols from /lib/libcrypt.so.1...done.
Reading symbols from /lib/libnsl.so.1...done.
Reading symbols from /lib/libdl.so.2...done.
Reading symbols from /lib/libm.so.6...done.
Reading symbols from /usr/lib/libreadline.so.3...done.
Reading symbols from /usr/lib/libhistory.so.3...done.
Reading symbols from /lib/libtermcap.so.2...done.
Reading symbols from /usr/lib/libncurses.so.4...done.
Reading symbols from /lib/libc.so.6...done.
Reading symbols from /lib/ld-linux.so.2...done.
0x80d8363 in SpinAcquire ()
(gdb) break exit
Breakpoint 1 at 0x2abb0532: file exit.c, line 40.
(gdb) break _exit
Breakpoint 2 at 0x2ac1e910
(gdb) cont
Continuing.

Breakpoint 1, exit (status=0) at exit.c:40
exit.c:40: No such file or directory.
(gdb) bt
#0  exit (status=0) at exit.c:40
#1  0x80d6694 in proc_exit ()
#2  0x81075ef in elog ()
#3  0x80775f6 in _bt_buildadd ()
#4  0x80777ea in _bt_merge ()
#5  0x8077af5 in _bt_leafbuild ()
#6  0x807340d in btbuild ()
#7  0x810800f in fmgr_c ()
#8  0x8108227 in fmgr ()
#9  0x807e60b in index_build ()
#10 0x807de2b in index_create ()
#11 0x8085376 in DefineIndex ()
#12 0x80e006d in ProcessUtility ()
#13 0x80de096 in pg_exec_query_dest ()
#14 0x80ddfb7 in pg_exec_query ()
#15 0x80df03c in PostgresMain ()
#16 0x80c91ba in DoBackend ()
#17 0x80c8cda in BackendStartup ()
#18 0x80c8429 in ServerLoop ()
#19 0x80c7f67 in PostmasterMain ()
#20 0x80a0986 in main ()
#21 0x2aba7cb3 in __libc_start_main (main=0x80a0920 <main>, argc=5,
    argv=0x7ffffd34, init=0x8061360 <_init>, fini=0x810d63c <_fini>,
    rtld_fini=0x2aab5350 <_dl_fini>, stack_end=0x7ffffd2c)
    at ../sysdeps/generic/libc-start.c:78
(gdb) cont
Continuing.

Breakpoint 2, 0x2ac1e910 in _exit ()
(gdb) bt
#0  0x2ac1e910 in _exit ()
#1  0x2abb05f8 in exit () at exit.c:70
#2  0x80d6694 in proc_exit ()
#3  0x81075ef in elog ()
#4  0x80775f6 in _bt_buildadd ()
#5  0x80777ea in _bt_merge ()
#6  0x8077af5 in _bt_leafbuild ()
#7  0x807340d in btbuild ()
#8  0x810800f in fmgr_c ()
#9  0x8108227 in fmgr ()
#10 0x807e60b in index_build ()
#11 0x807de2b in index_create ()
#12 0x8085376 in DefineIndex ()
#13 0x80e006d in ProcessUtility ()
#14 0x80de096 in pg_exec_query_dest ()
#15 0x80ddfb7 in pg_exec_query ()
#16 0x80df03c in PostgresMain ()
#17 0x80c91ba in DoBackend ()
#18 0x80c8cda in BackendStartup ()
#19 0x80c8429 in ServerLoop ()
#20 0x80c7f67 in PostmasterMain ()
#21 0x80a0986 in main ()
#22 0x2aba7cb3 in __libc_start_main (main=0x80a0920 <main>, argc=5,
---Type <return> to continue, or q <return> to quit---
    argv=0x7ffffd34, init=0x8061360 <_init>, fini=0x810d63c <_fini>,
    rtld_fini=0x2aab5350 <_dl_fini>, stack_end=0x7ffffd2c)
    at ../sysdeps/generic/libc-start.c:78
(gdb) cont
Continuing.

Program exited normally.
(gdb)

Re: [BUGS] problem creating index in 6.5.3

From: Tom Lane

Karl DeBisschop <kdebisschop@range.infoplease.com> writes:
> (gdb) bt
> #0  exit (status=0) at exit.c:40
> #1  0x80d6694 in proc_exit ()
> #2  0x81075ef in elog ()
> #3  0x80775f6 in _bt_buildadd ()
> #4  0x80777ea in _bt_merge ()
> #5  0x8077af5 in _bt_leafbuild ()
> #6  0x807340d in btbuild ()

So the darn thing *is* doing an elog(FATAL)!  If you look at
_bt_buildadd (in backend/access/nbtree/nbtsort.c) there are two
calls to elog:
        elog(FATAL, "btree: failed to add item to the page in _bt_sort (1)");
        elog(FATAL, "btree: failed to add item to the page in _bt_sort (2)");
If you'd built the backend with -g then we'd know which one this
is, but at this point it probably doesn't matter much.

OK, now we have two questions: (1) why is it failing, and (2) why
isn't the elog message showing up in your postmaster log?  (You
may also be wondering (3) why isn't psql reporting the elog message,
but that one is a known bug in 6.5.* libpq --- it forgets to
report pending ERROR messages if it sees a backend disconnect.
The message ought to have made it to your postmaster log, though.)
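
Something like the following might be worth a try -- the logfile path
here is a placeholder for wherever the postmaster's stderr actually
goes:

  $ grep 'failed to add item' <postmaster-logfile>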

I have to go do some real work right now :-(, but this needs looking
into.

            regards, tom lane