Thread: backend crash following load command

backend crash following load command

From
"Merlin Moncure"
Date:
We are getting a backend crash after issuing a load command.  It's
pretty easy to recreate -- so easy that I suspect something is being
overlooked.  This is on a pg 8.2 build roughly two weeks old.

Basic m.o. is:
1. create pic .so
2. load .so and call a function in it (from psql).
3. recompile .so with no changes to source
4. load again and crash.

Note that merely touching the file without recompiling does not cause a
crash.  Example strace output:

Process 4808 attached - interrupt to quit
recv(8, "Q\0\0\0\35load '/pgtest/pgfuncs'\n;\0", 8192, 0) = 30
gettimeofday({1164652467, 63440}, NULL) = 0
write(2, "LOG:  statement: load '/pgtest/p"..., 43) = 43
_llseek(3, 0, [16384], SEEK_CUR)        = 0
close(3)                                = 0
[snip]
_llseek(32, 0, [122880], SEEK_CUR)      = 0
close(32)                               = 0
_llseek(33, 0, [8192], SEEK_CUR)        = 0
close(33)                               = 0
_llseek(34, 0, [32768], SEEK_CUR)       = 0
close(34)                               = 0
stat64("/pgtest/pgfuncs", 0xbfec3820)   = -1 ENOENT (No such file or directory)
stat64("/pgtest/pgfuncs.so", {st_mode=S_IFREG|0755, st_size=4490, ...}) = 0
stat64("/pgtest/pgfuncs.so", {st_mode=S_IFREG|0755, st_size=4490, ...}) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---


detailed steps to create the problem follow:

1. Create the C function
// pgfuncs.c
#include "postgres.h"
#include "fmgr.h"

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

PG_FUNCTION_INFO_V1(addone);

Datum addone(PG_FUNCTION_ARGS)
{
  PG_RETURN_INT32(PG_GETARG_INT32(0) + 1);
}
// end pgfuncs.c


2. compile it
PG_SERVER_INC=/usr/local/pgsql/include/server
gcc -fpic -shared -I $PG_SERVER_INC -o /pgtest/pgfuncs.so pgfuncs.c


3. create the addone func (using a fresh psql session)
CREATE OR REPLACE FUNCTION addone(INTEGER) RETURNS INTEGER
  AS '/pgtest/pgfuncs', 'addone' LANGUAGE C STRICT;


4. Execute addone, which will load pgfuncs.so
funcy=# select addone(1);
 addone
--------
      2
(1 row)


5. Try to reload the library (this works)
funcy=# LOAD '/pgtest/pgfuncs';
LOAD


6. Recompile pgfuncs.so
Follow the same steps that are outlined in Step 2.


7. Issue a LOAD 'library' command
funcy=# LOAD '/pgtest/pgfuncs';
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

merlin

Re: backend crash following load command

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> We are getting a backend crash after issuing a load command.

No crash from your example here (on Fedora Core 5).  What platform and
gcc are you using exactly?  Can you provide a stack trace from the crash?

            regards, tom lane

Re: backend crash following load command

From
"Merlin Moncure"
Date:
On 11/28/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Merlin Moncure" <mmoncure@gmail.com> writes:
> > We are getting a backend crash after issuing a load command.
>
> No crash from your example here (on Fedora Core 5).  What platform and
> gcc are you using exactly?  Can you provide a stack trace from the crash?

ok, an update on this.  we actually covered up the bug in reducing the
problem to our test case.  our make system used cp -f to overwrite the
.so file in use by postgresql.  interestingly, this will cause a crash
on the .so reload via LOAD.  There may be a perfectly normal reason
for this.

so,
1. compile just about any c function
2. create a function/load it
3. recompile and cp -f over the one in use (cp works ok)
4. reload...crash

merlin

Re: backend crash following load command

From
Martijn van Oosterhout
Date:
On Tue, Nov 28, 2006 at 02:38:18PM -0500, Merlin Moncure wrote:
> On 11/28/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >"Merlin Moncure" <mmoncure@gmail.com> writes:
> >> We are getting a backend crash after issuing a load command.
> >
> >No crash from your example here (on Fedora Core 5).  What platform and
> >gcc are you using exactly?  Can you provide a stack trace from the crash?
>
> ok, an update on this.  we actually covered up the bug in reducing the
> problem to our test case.  our make system used cp -f to overwrite the
> .so file in use by postgresql.  interestingly, this will cause a crash
> on the .so reload via LOAD.  There may be a perfectly normal reason
> for this.

Err, that means copy is just rewriting the executable code in the
backend of the server, while it's running, which understandably
crashes. Probably while trying to unload the old library. I suppose the
answer is: don't do that.

The protection of ETXTBSY only applies to code started via exec().

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: backend crash following load command

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> ok, an update on this.  we actually covered up the bug in reducing the
> problem to our test case.  our make system used cp -f to overwrite the
> .so file in use by postgresql.

With that I can reproduce it --- I think it is a glibc bug.  The crash
occurs inside dlsym() while trying to look up "_PG_fini".

(gdb) bt
#0  0x0000003bf1a08b31 in do_lookup_x () from /lib64/ld-linux-x86-64.so.2
#1  0x0000003bf1a08e6f in _dl_lookup_symbol_x ()
   from /lib64/ld-linux-x86-64.so.2
#2  0x0000003bf1cff5ee in do_sym () from /lib64/libc.so.6
#3  0x0000003bf2101334 in dlsym_doit () from /lib64/libdl.so.2
#4  0x0000003bf1a0ca36 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#5  0x0000003bf210173d in _dlerror_run () from /lib64/libdl.so.2
#6  0x0000003bf21012ea in dlsym () from /lib64/libdl.so.2
#7  0x000000000061f414 in load_file (filename=Variable "filename" is not available.
) at dfmgr.c:352
#8  0x00000000005a3d4c in PortalRunUtility (portal=0x98a828, query=0x9564f0,
    dest=0x956798, completionTag=0x7fffb624e4e0 "") at pquery.c:1063

I'd suggest putting together a simple stand-alone test case and filing
a bug report against glibc.  You probably just need

    dlopen(...);
    system("cp -f over the .so file");
    dlsym(...);

            regards, tom lane
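
A stand-alone test along the lines Tom outlines might look like the
following sketch.  The function name probe_symbol_after_command and its
return convention are made up for illustration; only dlopen(), system(),
dlsym(), dlerror() and dlclose() are real library calls.  RTLD_NOW is an
arbitrary choice here, not necessarily the flag the backend uses.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: open lib_path, run shell_cmd (for example a cp that
 * replaces the library file), then look up symbol in the still-open handle.
 *
 * Returns 1 if the symbol was found, 0 if dlsym() reported it missing, and
 * -1 if dlopen() or the shell command failed.  If the problem reproduces,
 * the process is expected to die with SIGSEGV inside dlsym() instead of
 * returning at all.
 */
int probe_symbol_after_command(const char *lib_path, const char *shell_cmd,
                               const char *symbol)
{
    void *handle = dlopen(lib_path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }

    /* Replace (or otherwise disturb) the file while it is still mapped. */
    if (system(shell_cmd) != 0) {
        fprintf(stderr, "command failed: %s\n", shell_cmd);
        dlclose(handle);
        return -1;
    }

    dlerror();                      /* clear any stale error state */
    (void) dlsym(handle, symbol);   /* the call under test */
    int found = (dlerror() == NULL);

    dlclose(handle);
    return found;
}
```

A caller would pass the library path, a shell command that replaces the
file, and a symbol name such as "_PG_fini".  On older glibc versions the
program has to be linked with -ldl.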

Re: backend crash following load command

From
Martijn van Oosterhout
Date:
On Tue, Nov 28, 2006 at 03:23:36PM -0500, Tom Lane wrote:
> I'd suggest putting together a simple stand-alone test case and filing
> a bug report against glibc.  You probably just need
>
>     dlopen(...);
>     system("cp -f over the .so file");
>     dlsym(...);

How can glibc do anything about this? dlopen() mmaps the .so into
memory and the cp overwrites what was mmaped, changing what is in
memory.

Ideally, the cp should fail with ETXTBSY, but that doesn't happen, so
what else can you do?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: backend crash following load command

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Tue, Nov 28, 2006 at 03:23:36PM -0500, Tom Lane wrote:
>> I'd suggest putting together a simple stand-alone test case and filing
>> a bug report against glibc.

> How can glibc do anything about this? dlopen() mmaps the .so into
> memory and the cp overwrites what was mmaped, changing what is in
> memory.

The test case I was using involved a cp -f that overwrote the .so with
the exact same data (ie, I didn't bother recompiling, just cp -f a
second time from the compilation output file).  So if the above were
the explanation there should have been no crash; moreover, if that were
the explanation then the cp-without-dash-f case should crash too.

I suspect that glibc is playing some undocumented games and is getting
confused because the file's inode number has changed.

            regards, tom lane

Re: backend crash following load command

From
"Merlin Moncure"
Date:
On 11/28/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > On Tue, Nov 28, 2006 at 03:23:36PM -0500, Tom Lane wrote:
> >> I'd suggest putting together a simple stand-alone test case and filing
> >> a bug report against glibc.
>
> > How can glibc do anything about this? dlopen() mmaps the .so into
> > memory and the cp overwrites what was mmaped, changing what is in
> > memory.
>
> The test case I was using involved a cp -f that overwrote the .so with
> the exact same data (ie, I didn't bother recompiling, just cp -f a
> second time from the compilation output file).  So if the above were
> the explanation there should have been no crash; moreover, if that were
> the explanation then the cp-without-dash-f case should crash too.
>
> I suspect that glibc is playing some undocumented games and is getting
> confused because the file's inode number has changed.

also, if what Martijn is saying is correct, wouldn't that make the
LOAD command unsupportably dangerous?  The postgresql documentation
suggests its use is for updating libraries in exactly this way
(emphasis mine):

This command loads a shared library file into the PostgreSQL server's
address space. If the file had been loaded previously, it is first
unloaded. This command is primarily useful to unload and reload a
shared library file that *has been changed since the server first
loaded it*. To make use of the shared library, function(s) in it need
to be declared using the CREATE FUNCTION command.

merlin

Re: backend crash following load command

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Err, that means copy is just rewriting the executable code in the
> backend of the server, while it's running, which understandably
> crashes.

No, I don't think so.  "cp -f" means "unlink the old file and create a
new one", as opposed to plain cp which would overwrite in place.  Your
theory would explain an observation that plain cp causes a crash while
cp -f does not, but that's the exact opposite of Merlin's report.

The actual situation is that the mmap is referencing a file that's
disappeared from the directory structure (but still exists on disk,
as long as it's held open).  dlsym seems unable to cope with that
case.  I call that a bug --- it'd be OK for it to return a failure
indication, but not to SIGSEGV.

            regards, tom lane

Re: backend crash following load command

From
Martijn van Oosterhout
Date:
On Tue, Nov 28, 2006 at 04:09:11PM -0500, Tom Lane wrote:
> The mmap man page is pretty vague on the subject, but I wonder whether
> the shlib isn't effectively treated as copy-on-write --- that is, any
> attempted overwrite of the file happens only after the mmap region has
> been fully copied.  Without that, it'd be impossible to update core
> shared libraries like libc.so without a system reboot, but Linux doesn't
> seem to need that.

Hmm? To upgrade libc.so you merely need to delete the old one and
install the new one, there's no need to preserve the inode. The mmap()
is private, but no, Linux does not keep a backup copy of the shared
library if you overwrite it. The behaviour of overwriting the backing
store of a private mapping is explicitly undefined.

I did some digging. At one point there was protection for overwriting
shared libraries, you could pass MAP_DENYWRITE to mmap(), which would
cause any writes to the file to fail with ETXTBSY, just like it does
for normal executables. However:

MAP_DENYWRITE
    This flag is ignored. (Long ago, it signalled that attempts to
    write to the underlying file should fail with ETXTBUSY.  But this
    was a source of denial-of-service attacks.)
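
The point about private mappings can be checked with a small experiment.
This is a sketch, not a statement of what every kernel does: the helper
names, the result struct and the file used are illustrative assumptions.
It maps a file with MAP_PRIVATE, overwrites the file in place (open with
O_TRUNC, then write), and records whether the path still names the same
inode and whether the mapped bytes changed.  The first result can be
relied on; the second is the unspecified behaviour described above, so
the code only reports it.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* What the experiment observed. */
struct overwrite_result {
    int same_inode;      /* 1 if path still names the inode that was mapped */
    int mapping_changed; /* 1 if the mapped bytes differ from the old contents */
};

/* Create or truncate path in place and write len bytes, the way plain cp
 * does when the target is writable.  Returns 0 on success, -1 on error. */
static int write_in_place(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, data, len);
    int close_rc = close(fd);
    return (written == (ssize_t) len && close_rc == 0) ? 0 : -1;
}

/*
 * Map path with MAP_PRIVATE, overwrite the file in place with new_data of
 * the same length, and record what happened.  len must be greater than
 * zero.  Returns 0 on success, -1 on error.
 */
int overwrite_under_mapping(const char *path, const char *old_data,
                            const char *new_data, size_t len,
                            struct overwrite_result *res)
{
    struct stat before, after;

    if (write_in_place(path, old_data, len) != 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &before) != 0) {
        close(fd);
        return -1;
    }
    char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping keeps the file referenced */
    if (map == MAP_FAILED)
        return -1;

    if (write_in_place(path, new_data, len) != 0 || stat(path, &after) != 0) {
        munmap(map, len);
        return -1;
    }

    res->same_inode = (before.st_dev == after.st_dev &&
                       before.st_ino == after.st_ino);
    /* Whether the change shows through is unspecified; only record it. */
    res->mapping_changed = (memcmp(map, old_data, len) != 0);

    munmap(map, len);
    return 0;
}
```

Because the inode is unchanged, the mapping is still backed by the file
that was just rewritten, which is why an in-place overwrite of a loaded
.so is unsafe.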

> I suspect that this issue is specific to dlsym() and has nothing to do
> with the safeness of ordinary usage of a shared library.  The reason
> 8.2 is getting bit is that it tries to do a dlsym() lookup during shlib
> unload, which we never did before.  (Merlin, I assume you have been
> doing the same things with 8.1 and before without a problem?)

I wouldn't be surprised if this were the problem. People testing shared
libraries would probably not be testing what happened between the time
the shared-library was overwritten and the LOAD command was reexecuted.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: backend crash following load command

From
Tom Lane
Date:
Martijn van Oosterhout <kleptog@svana.org> writes:
> Hmm? To upgrade libc.so you merely need to delete the old one and
> install the new one, there's no need to preserve the inode. The mmap()
> is private, but no, Linux does not keep a backup copy of the shared
> library if you overwrite it. The behaviour of overwriting the backing
> store of a private mapping is explicitly undefined.

Right, but isn't "cp -f" doing exactly that --- deleting the old one and
installing the new one?

[ experiments a bit... ]  Oh, that's interesting.  I was under the
impression that "cp -f" would always unlink the target file, but
on my machine (reasonably up-to-date Fedora 5, x86_64), this happens
only if it can't do open("foo", O_WRONLY|O_TRUNC).  If the existing
file is overwritable then there is no difference between cp and cp -f
... and *both* crash the backend.  If I "chmod -w" the .so file so that
cp -f is forced to unlink it first, then the backend does not crash!

This is at variance with what Merlin reported --- so I'm asking again
just what platform he's on.  He might want to strace cp to see whether
it's doing an unlink or not in his scenario.

Anyway, on my machine, the behavior is consistent with Martijn's theory.
I suspect the kernel is effectively unmapping the .so when the
overwrite occurs, and then dlsym() naturally SIGSEGV's while trying to
look into the mapped area.  If so, the early-PG_fini-lookup approach
wouldn't really fix anything.

The best solution for Merlin is probably to do "rm" then "cp" to install
a new version of the .so, instead of relying on "cp -f" to do it safely.

            regards, tom lane
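
Why "rm" then "cp" is safe can be sketched in C.  The names
replace_by_unlink and replace_under_mapping, the result struct and the
file mode are assumptions made for this illustration.  The sketch maps a
file privately, replaces it by unlinking the old name and creating a
brand-new file, and then checks that the path names a different inode
while the existing mapping still shows the old bytes.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* What the experiment observed. */
struct replace_result {
    int new_inode;       /* 1 if path now names a different inode */
    int mapping_intact;  /* 1 if the old mapping still shows the old bytes */
};

/* "rm" then "cp": unlink whatever is at path, then create a new file
 * there.  Returns 0 on success, -1 on error. */
static int replace_by_unlink(const char *path, const char *data, size_t len)
{
    if (unlink(path) != 0 && errno != ENOENT)
        return -1;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0755);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, data, len);
    int close_rc = close(fd);
    return (written == (ssize_t) len && close_rc == 0) ? 0 : -1;
}

/*
 * Create path with old_data, map it with MAP_PRIVATE, replace it via
 * unlink-and-create with new_data, and record what happened.  len must be
 * greater than zero.  Returns 0 on success, -1 on error.
 */
int replace_under_mapping(const char *path, const char *old_data,
                          const char *new_data, size_t len,
                          struct replace_result *res)
{
    struct stat before, after;

    if (replace_by_unlink(path, old_data, len) != 0)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &before) != 0) {
        close(fd);
        return -1;
    }
    char *map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping keeps the old inode alive */
    if (map == MAP_FAILED)
        return -1;

    if (replace_by_unlink(path, new_data, len) != 0 ||
        stat(path, &after) != 0) {
        munmap(map, len);
        return -1;
    }

    res->new_inode = (before.st_dev != after.st_dev ||
                      before.st_ino != after.st_ino);
    res->mapping_intact = (memcmp(map, old_data, len) == 0);

    munmap(map, len);
    return 0;
}
```

The old file stays on disk for as long as it is mapped, so the running
backend keeps seeing the old library until it is unloaded and the new
file is loaded.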

Re: backend crash following load command

From
"Merlin Moncure"
Date:
On 11/28/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> This is at variance with what Merlin reported --- so I'm asking again
> just what platform he's on.  He might want to strace cp to see whether
> it's doing an unlink or not in his scenario.

this is centos 32 bit.
[root@boron esilo]# uname -a
Linux boron.esilo.com 2.6.9-42.0.3.EL #1 Fri Oct 6 05:59:54 CDT 2006 i686 athlon i386 GNU/Linux

merlin

Disadvantage of SQL Joins

From
Ranjan Kumar Baisak
Date:
Can anybody please tell me whether there are any disadvantages of SQL
joins in terms of space and time and how postgres has implemented SQL
joins? I am under the impression that SQL join takes more time for
execution as well as space because database internally builds Cartesian
product and then evaluates for condition. Recently I normalized my DB
and changed my application Data model which resulted in writing lots of
inner join queries. Later I found that DB server is consuming more
memory. And once my Database also crashed. I am assuming that because of
normalization and inner joins, my DB crashed.
Can anybody please give a thought into my assumption?

- R

Re: Disadvantage of SQL Joins

From
Scott Ribe
Date:
> I am under the impression that SQL join takes more time for
> execution as well as space because database internally builds Cartesian
> product and then evaluates for condition.

No, that's a conceptual description, but the actual process is more
optimized, often far more so if you have the appropriate indices.

> Recently I normalized my DB
> and changed my application Data model which resulted in writing lots of
> inner join queries. Later I found that DB server is consuming more
> memory.

That can certainly happen. There is a tradeoff there, in that reduction of
redundant data and the problems of maintaining it may require more CPU
cycles and RAM to get data back out.

> And once my Database also crashed. I am assuming that because of
> normalization and inner joins, my DB crashed.

Not likely. When you construct your joins, do be careful about the join
conditions. A common mistake is to leave out a condition in the where clause
which then results in the actual Cartesian product being requested. The more
tables involved in a join, the easier it is to make such a mistake--I think
we've all done this at one time or another.


--
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice



Re: backend crash following load command

From
Tom Lane
Date:
"Merlin Moncure" <mmoncure@gmail.com> writes:
> also, if what Martijn is saying is correct, wouldn't that make the
> LOAD command unsupportably dangerous?

If you have write access to a file that you can LOAD, then you can
already put garbage into the backend's memory space, so I don't see
this as a security hole.  It'd be unfortunate if true though.

The mmap man page is pretty vague on the subject, but I wonder whether
the shlib isn't effectively treated as copy-on-write --- that is, any
attempted overwrite of the file happens only after the mmap region has
been fully copied.  Without that, it'd be impossible to update core
shared libraries like libc.so without a system reboot, but Linux doesn't
seem to need that.

I suspect that this issue is specific to dlsym() and has nothing to do
with the safeness of ordinary usage of a shared library.  The reason
8.2 is getting bit is that it tries to do a dlsym() lookup during shlib
unload, which we never did before.  (Merlin, I assume you have been
doing the same things with 8.1 and before without a problem?)

Hmm ... would it be worth doing the lookup of _PG_fini during library
load instead of unload, and saving the result?  This'd be a waste of
cycles if the library were never unloaded, which is much the normal
case, but library load probably isn't a critical path anyway.

            regards, tom lane
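
The idea of resolving _PG_fini when the library is loaded, rather than
when it is unloaded, could be sketched roughly as follows.  The struct
and function names are hypothetical and this is not the code in dfmgr.c;
it only shows the shape of the idea.  As noted elsewhere in the thread,
it would not help if the mapped file itself has been overwritten in
place.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Signature assumed for the optional unload hook. */
typedef void (*fini_fn)(void);

/* Hypothetical bookkeeping for one loaded library. */
struct loaded_lib {
    void   *handle;  /* dlopen() handle, or NULL if not loaded */
    fini_fn fini;    /* _PG_fini resolved at load time, NULL if absent */
};

/* Load the library and look up _PG_fini immediately, so that no dlsym()
 * call is needed later at unload time.  Returns 0 on success, -1 if the
 * library could not be opened. */
int lib_load(const char *path, struct loaded_lib *lib)
{
    lib->fini = NULL;
    lib->handle = dlopen(path, RTLD_NOW);
    if (lib->handle == NULL)
        return -1;

    /* The cast through void ** is the usual way to store a dlsym()
     * result in a function pointer. */
    *(void **) (&lib->fini) = dlsym(lib->handle, "_PG_fini");
    return 0;
}

/* Call the cached hook, if any, and unload the library. */
void lib_unload(struct loaded_lib *lib)
{
    if (lib->handle == NULL)
        return;
    if (lib->fini != NULL)
        lib->fini();
    dlclose(lib->handle);
    lib->handle = NULL;
    lib->fini = NULL;
}
```

The cost is one extra symbol lookup per load, which matches the tradeoff
described above: wasted work if the library is never unloaded, but load
is not a critical path.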