Thread: autovacuum process (PID ...) was terminated by signal 11

autovacuum process (PID ...) was terminated by signal 11

From
Brian Hirt
Date:
Hi.

When I'm doing a database load of a 5gb database, autovacuum always
segfaults shortly after the load finishes.  The load is being done
via slony
during the initial copy set command while building a slave, not through
pg_restore.

LOG:  autovacuum process (PID ...) was terminated by signal 11
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server
process
DETAIL:  The postmaster has commanded this server process to roll
back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.

This problem sounds similiar to these reports:

http://archives.postgresql.org/pgsql-bugs/2005-09/msg00200.php

and

http://archives.postgresql.org/pgsql-bugs/2005-11/msg00276.php


here is the stack for the autovacuum process.

Program received signal SIGSEGV, Segmentation fault.
CopySnapshot (snapshot=0x0) at tqual.c:1301
1301            newsnap = (Snapshot) palloc(sizeof(SnapshotData) +
(gdb) bt
#0  CopySnapshot (snapshot=0x0) at tqual.c:1301
#1  0x0812388c in _SPI_execute_plan (plan=0x83cd3e0, Values=0x841adf0,
     Nulls=0x841ae00 "  ", snapshot=0x0, crosscheck_snapshot=0x0,
     read_only=1 '\001', tcount=1) at spi.c:1433
#2  0x0812215a in SPI_execute_plan (plan=0x83cd3e0, Values=0x841adf0,
     Nulls=0x841ae00 "  ", read_only=1 '\001', tcount=1) at spi.c:336
#3  0x403cc0ef in ?? ()
#4  0x403c90e9 in ?? ()
#5  0x403c8668 in ?? ()
#6  0x403c84e7 in ?? ()
#7  0x403c83f1 in ?? ()
#8  0x403c7651 in ?? ()
#9  0x403c442a in ?? ()
#10 0x081137c8 in ExecMakeFunctionResult (fcache=0x83fcb88,
     econtext=0x83fcb18, isNull=0xbfffe94b "?\020???", isDone=0x0)
     at execQual.c:1095
#11 0x0811554e in ExecEvalExprSwitchContext (expression=0x83fcb88,
     econtext=0x83fcb18, isNull=0xbfffe94b "?\020???", isDone=0x0)
     at execQual.c:2864
#12 0x080b20e6 in FormIndexDatum (indexInfo=0x83ad7d0, slot=0x83fea98,
     estate=0x83fca90, values=0xbfffea10, isnull=0xbfffe9f0 "??:\b\n")
     at index.c:971
#13 0x080dd832 in compute_index_stats (onerel=0x401f6f58,
totalrows=11323,
---Type <return> to continue, or q <return> to quit---
     indexdata=0x83ad3d0, nindexes=4, rows=0x83e9940, numrows=3000,
     col_context=0x83fab88) at analyze.c:540
#14 0x080dd4b7 in analyze_rel (relid=32465292, vacstmt=0x83a48d8)
     at analyze.c:387
#15 0x08109365 in vacuum (vacstmt=0x83a48d8, relids=0x83a5540) at
vacuum.c:476
#16 0x0815b4bd in autovacuum_do_vac_analyze (relids=0x83a5540,
dovacuum=0,
     doanalyze=1, freeze=0) at autovacuum.c:907
#17 0x0815b16d in do_autovacuum (dbentry=0x82fb410) at autovacuum.c:681
#18 0x0815acb7 in AutoVacMain (argc=0, argv=0x0) at autovacuum.c:423
#19 0x0815aa01 in autovac_start () at autovacuum.c:170
#20 0x0815fe84 in ServerLoop () at postmaster.c:1269
#21 0x0815f80f in PostmasterMain (argc=2, argv=0x82d6158) at
postmaster.c:943
#22 0x0812a9fb in main (argc=2, argv=0x82d6158) at main.c:256
#23 0x42017589 in __libc_start_main () from /lib/i686/libc.so.6
(gdb)


--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project

Re: autovacuum process (PID ...) was terminated by signal 11

From
Tom Lane
Date:
Brian Hirt <bhirt@mobygames.com> writes:
> When I'm doing a database load of a 5gb database, autovacuum always
> segfaults shortly after the load finishes.

This sure looks like the same bug already fixed in 8.1.1:

2005-11-28 08:35  alvherre

    * src/backend/postmaster/autovacuum.c: Set a snapshot before
    running analyze on a single table, to avoid a crash when analyzing
    tables with expressional indexes.

    Per report from Frank van Vugt.

            regards, tom lane

Re: autovacuum process (PID ...) was terminated by signal 11

From
Brian Hirt
Date:
that's strange, because I'm running 8.1.1.

[root@pgdb01 root]# /usr/pg-8.1/bin/postmaster --version
postmaster (PostgreSQL) 8.1.1

Is there more information i can provide to help find the problem?

On Jan 4, 2006, at 10:04 AM, Tom Lane wrote:

> Brian Hirt <bhirt@mobygames.com> writes:
>> When I'm doing a database load of a 5gb database, autovacuum always
>> segfaults shortly after the load finishes.
>
> This sure looks like the same bug already fixed in 8.1.1:
>
> 2005-11-28 08:35  alvherre
>
>     * src/backend/postmaster/autovacuum.c: Set a snapshot before
>     running analyze on a single table, to avoid a crash when analyzing
>     tables with expressional indexes.
>
>     Per report from Frank van Vugt.
>
>             regards, tom lane

--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project

Re: autovacuum process (PID ...) was terminated by signal 11

From
Jaime Casanova
Date:
On 1/4/06, Brian Hirt <bhirt@mobygames.com> wrote:
> that's strange, because I'm running 8.1.1.
>

what Tom is saying is that a patch was applied after 8.1.1 was
launched... it will be fixed in 8.1.2 and if you are installing from
sources you can apply yourself the patch to your tree source recompile
and you will have your problem solved now...

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: autovacuum process (PID ...) was terminated by signal 11

From
Tom Lane
Date:
Brian Hirt <bhirt@mobygames.com> writes:
> Is there more information i can provide to help find the problem?

How about the schema of the table in question?  If the backtrace is
to be trusted, it's OID 32465292.

Does it crash if you ANALYZE that table manually?

            regards, tom lane

Re: autovacuum process (PID ...) was terminated by signal 11

From
Michael Fuhr
Date:
On Wed, Jan 04, 2006 at 12:20:28PM -0500, Jaime Casanova wrote:
> On 1/4/06, Brian Hirt <bhirt@mobygames.com> wrote:
> > that's strange, because I'm running 8.1.1.
>
> what Tom is saying is that a patch was applied after 8.1.1 was
> launched...

Is that what Tom is saying?  The commit message he posted had a
date of 2005-11-28; 8.1.1 wasn't tagged until 2005-12-08.

--
Michael Fuhr

Re: autovacuum process (PID ...) was terminated by signal 11

From
Jaime Casanova
Date:
On 1/4/06, Michael Fuhr <mike@fuhr.org> wrote:
> On Wed, Jan 04, 2006 at 12:20:28PM -0500, Jaime Casanova wrote:
> > On 1/4/06, Brian Hirt <bhirt@mobygames.com> wrote:
> > > that's strange, because I'm running 8.1.1.
> >
> > what Tom is saying is that a patch was applied after 8.1.1 was
> > launched...
>
> Is that what Tom is saying?  The commit message he posted had a
> date of 2005-11-28; 8.1.1 wasn't tagged until 2005-12-08.
>
> --
> Michael Fuhr
>

sorry... you are right... my memory is not as good as some years ago...

--
regards,
Jaime Casanova
(DBA: DataBase Aniquilator ;)

Re: autovacuum process (PID ...) was terminated by signal 11

From
Brian Hirt
Date:
Tom,

I can analyze that table without problems.  I don't know if it's the
same table every time.   I'm trying to set up a development
environment where i can test this stuff better without messing up our
production systems. The table does have an expresional index.

basement=# select relname from pg_class where oid = 32465292;
     relname
---------------
developer_aka
(1 row)

basement=# ANALYZE verbose developer_aka;
ANALYZE
basement=# \d developer_aka
                                         Table "public.developer_aka"
       Column      |          Type
|                           Modifiers
------------------+------------------------
+----------------------------------------------------------------
developer_id     | integer                |
developer_aka_id | integer                | not null default nextval
(('developer_id_seq'::text)::regclass)
first_name       | character varying(255) |
last_name        | character varying(255) |
search_name      | character varying(255) |
Indexes:
     "developer_aka_pkey" PRIMARY KEY, btree (developer_aka_id)
     "developer_aka_developer_id_inde" btree (developer_id)
     "developer_aka_name_idx" btree (name_idx(first_name, last_name))
     "developer_aka_search" btree (search_name)
Foreign-key constraints:
     "$1" FOREIGN KEY (developer_id) REFERENCES developer(developer_id)
Triggers:
     developer_modified AFTER INSERT OR UPDATE ON developer_aka FOR
EACH ROW EXECUTE PROCEDURE add_developer_mod()

function definition:

CREATE FUNCTION name_idx(character varying, character varying)
RETURNS character varying
     AS $_$
DECLARE
   n varchar;
BEGIN
   select upper($1) || ' ' || upper($2) into n;
   return n;
END;
$_$
     LANGUAGE plpgsql IMMUTABLE;



--brian

On Jan 4, 2006, at 10:25 AM, Tom Lane wrote:

> Brian Hirt <bhirt@mobygames.com> writes:
>> Is there more information i can provide to help find the problem?
>
> How about the schema of the table in question?  If the backtrace is
> to be trusted, it's OID 32465292.
>
> Does it crash if you ANALYZE that table manually?
>
>             regards, tom lane
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project

Re: autovacuum process (PID ...) was terminated by signal 11

From
Tom Lane
Date:
Brian Hirt <bhirt@mobygames.com> writes:
> I can analyze that table without problems.  I don't know if it's the
> same table every time.   I'm trying to set up a development
> environment where i can test this stuff better without messing up our
> production systems. The table does have an expresional index.

I've managed to reproduce this: the triggering condition is that a
single autovac iteration has to VACUUM one table and then ANALYZE
(no vac) another table that has an expressional index with a plpgsql
function.  It looks like we missed a path of control where
ActiveSnapshot has to be re-set-up, but I'm not clear where.

            regards, tom lane

Re: autovacuum process (PID ...) was terminated by signal 11

From
Brian Hirt
Date:
Cool,

I was just writing to let you know I created an easily reproducible
test case too, but I guess you don't need that now.

Let me know if there is anything else I can do to help.

Also, if a patch is produced, I'd love to get a copy of it. We just
upgraded our production servers to 8.1.1 this morning (this issue
never came up during testing) and I'l like to get this in there
because it's likely to happen again.

--brian

On Jan 4, 2006, at 11:31 AM, Tom Lane wrote:

> Brian Hirt <bhirt@mobygames.com> writes:
>> I can analyze that table without problems.  I don't know if it's the
>> same table every time.   I'm trying to set up a development
>> environment where i can test this stuff better without messing up our
>> production systems. The table does have an expresional index.
>
> I've managed to reproduce this: the triggering condition is that a
> single autovac iteration has to VACUUM one table and then ANALYZE
> (no vac) another table that has an expressional index with a plpgsql
> function.  It looks like we missed a path of control where
> ActiveSnapshot has to be re-set-up, but I'm not clear where.
>
>             regards, tom lane

--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project

Re: autovacuum process (PID ...) was terminated by signal 11

From
Tom Lane
Date:
Brian Hirt <bhirt@mobygames.com> writes:
> I was just writing to let you know I created an easily reproducible
> test case too, but I guess you don't need that now.

Does your test case agree with my description of the problem?
(If you're not sure, crank up log_min_messages and watch the log
to see what autovacuum does before it crashes.)

The problem I'm seeing seems to be that vacuum() exits with a
transaction started but no snapshot set (StartTransactionCommand will
leave ActiveSnapshot set to NULL), if it's in use_own_xacts mode.
autovac thinks it only has to set the snapshot once, but that's really
not the case given this behavior.

I'm unsure whether to fix this by adding a CopySnapshot operation right
in vacuum(), or in autovacuum.c.  The former is probably cleaner but it
clutters vacuum.c with something only autovac really needs.  Any
opinions?

> Also, if a patch is produced, I'd love to get a copy of it. We just
> upgraded our production servers to 8.1.1 this morning (this issue
> never came up during testing) and I'l like to get this in there
> because it's likely to happen again.

Definitely a big risk of that :-(.  It'll be a one-liner patch in any
case, we just have to decide where ...

            regards, tom lane

Re: autovacuum process (PID ...) was terminated by signal 11

From
Tom Lane
Date:
Brian Hirt <bhirt@mobygames.com> writes:
> Also, if a patch is produced, I'd love to get a copy of it.

I concluded that patching vacuum.c was much the cleanest way to do it.
Here's the patch against 8.1 branch.

            regards, tom lane

Index: src/backend/commands/vacuum.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/commands/vacuum.c,v
retrieving revision 1.317.2.1
diff -c -r1.317.2.1 vacuum.c
*** src/backend/commands/vacuum.c    22 Nov 2005 18:23:08 -0000    1.317.2.1
--- src/backend/commands/vacuum.c    4 Jan 2006 19:10:35 -0000
***************
*** 510,515 ****
--- 510,523 ----
           * PostgresMain().
           */
          StartTransactionCommand();
+         /*
+          * Re-establish the transaction snapshot.  This is wasted effort
+          * when we are called as a normal utility command, because the
+          * new transaction will be dropped immediately by PostgresMain();
+          * but it's necessary if we are called from autovacuum because
+          * autovacuum might continue on to do an ANALYZE-only call.
+          */
+         ActiveSnapshot = CopySnapshot(GetTransactionSnapshot());
      }

      if (vacstmt->vacuum)

Re: autovacuum process (PID ...) was terminated by signal 11

From
Brian Hirt
Date:
On Jan 4, 2006, at 11:47 AM, Tom Lane wrote:

> Brian Hirt <bhirt@mobygames.com> writes:
>> I was just writing to let you know I created an easily reproducible
>> test case too, but I guess you don't need that now.
>
> Does your test case agree with my description of the problem?
> (If you're not sure, crank up log_min_messages and watch the log
> to see what autovacuum does before it crashes.)
>

Yes, I see the same exact same thing you describe.

>
> Definitely a big risk of that :-(.  It'll be a one-liner patch in any
> case, we just have to decide where ...
>

I applied your patch and I no longer see the problem with my test case.


Thanks for the quick responses and help!

--brian

--------------------------------------------
MobyGames
http://www.mobygames.com
The world's largest and most comprehensive
gaming database project

Re: autovacuum process (PID ...) was terminated by signal 11

From
Frank van Vugt
Date:
> I concluded that patching vacuum.c was much the cleanest way to do it.
> Here's the patch against 8.1 branch.

Great, it looks like this patch fixes the remainder of my original problem as
well !

( http://archives.postgresql.org/pgsql-bugs/2005-11/msg00276.php )



--
Best,




Frank.