Thread: Slave suddenly stopped with DB log messages "will not overwrite a used ItemId" and "heap_insert_redo: failed to add tuple"



Hi, dear pgsql-general


The details are as follows:

1. environment

DB Master

$ cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m

$ uname -av
Linux l-xxxxx1.xx.cnx 3.14.29-3.centos6.x86_64 #1 SMP Tue Jan 20 17:48:32 CST 2015 x86_64 x86_64 x86_64 GNU/Linux

$ psql -U postgres
psql (9.3.5)
Type "help" for help.

postgres=# select version();
                                                   version                                                   
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3), 64-bit
(1 row)

$ pg_config
BINDIR = /opt/pg93/bin
DOCDIR = /opt/pg93/share/doc/postgresql
HTMLDIR = /opt/pg93/share/doc/postgresql
INCLUDEDIR = /opt/pg93/include
PKGINCLUDEDIR = /opt/pg93/include/postgresql
INCLUDEDIR-SERVER = /opt/pg93/include/postgresql/server
LIBDIR = /opt/pg93/lib
PKGLIBDIR = /opt/pg93/lib/postgresql
LOCALEDIR = /opt/pg93/share/locale
MANDIR = /opt/pg93/share/man
SHAREDIR = /opt/pg93/share/postgresql
SYSCONFDIR = /opt/pg93/etc/postgresql
PGXS = /opt/pg93/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/opt/pg93' '--with-perl' '--with-libxml' '--with-libxslt' '--with-ossp-uuid' 'CFLAGS= -march=core2 -O2 '
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -march=core2 -O2  -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,--as-needed -Wl,-rpath,'/opt/pg93/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgport -lpgcommon -lxslt -lxml2 -lz -lreadline -lcrypt -ldl -lm
VERSION = PostgreSQL 9.3.5



DB Slave

$ cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m

$ uname -av
Linux l-xxxx2.xx.cnx 3.14.31-3.centos6.x86_64 #1 SMP Mon Feb 2 15:26:04 CST 2015 x86_64 x86_64 x86_64 GNU/Linux

$ psql -U postgres
psql (9.3.5)
Type "help" for help.

postgres=# select version();
                                                   version                                                   
--------------------------------------------------------------------------------------------------------------
 PostgreSQL 9.3.5 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3), 64-bit
(1 row)

postgres=# show log_line_prefix ;
       log_line_prefix       
------------------------------
 [%u %d %a %h %m %p %c %l %x]
(1 row)
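
A note for reading the log lines in section 2 with this prefix: %p is the process ID and %c is the session ID, which the PostgreSQL documentation describes as two hexadecimal numbers separated by a dot, the process start time (seconds since the Unix epoch) and the process ID. A minimal Python sketch for decoding it follows. The function name decode_session_id is only illustrative, and the sample value is the session ID of the startup process as it appears in the log below.

```python
from datetime import datetime, timezone


def decode_session_id(session_id: str):
    """Split a %c session ID into (process start time in UTC, PID)."""
    start_hex, pid_hex = session_id.split(".")
    started = datetime.fromtimestamp(int(start_hex, 16), tz=timezone.utc)
    return started, int(pid_hex, 16)


# Session ID of the startup process that raised the PANIC.
started, pid = decode_session_id("54d08abc.918")
print(started.isoformat(), pid)  # the PID part decodes to 2328, matching %p
```

Decoded this way, the start time is 2015-02-03 08:45:48 UTC, so the startup process had been replaying WAL for about two days before it hit the PANIC on 2015-02-05.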


$ pg_config
BINDIR = /opt/pg93/bin
DOCDIR = /opt/pg93/share/doc/postgresql
HTMLDIR = /opt/pg93/share/doc/postgresql
INCLUDEDIR = /opt/pg93/include
PKGINCLUDEDIR = /opt/pg93/include/postgresql
INCLUDEDIR-SERVER = /opt/pg93/include/postgresql/server
LIBDIR = /opt/pg93/lib
PKGLIBDIR = /opt/pg93/lib/postgresql
LOCALEDIR = /opt/pg93/share/locale
MANDIR = /opt/pg93/share/man
SHAREDIR = /opt/pg93/share/postgresql
SYSCONFDIR = /opt/pg93/etc/postgresql
PGXS = /opt/pg93/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/opt/pg93' '--with-perl' '--with-libxml' '--with-libxslt' '--with-ossp-uuid' 'CFLAGS= -march=core2 -O2 '
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -march=core2 -O2  -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv
CFLAGS_SL = -fpic
LDFLAGS = -L../../../src/common -Wl,--as-needed -Wl,-rpath,'/opt/pg93/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgport -lpgcommon -lxslt -lxml2 -lz -lreadline -lcrypt -ldl -lm
VERSION = PostgreSQL 9.3.5



2. the DB log in the Slave's log_directory

[ 2015-02-05 15:38:51.406 CST 2328 54d08abc.918 6 0]WARNING: will not overwrite a used ItemId
[ 2015-02-05 15:38:51.406 CST 2328 54d08abc.918 7 0]CONTEXT: xlog redo insert: rel 38171461/16384/57220350; tid 1778398/9
[ 2015-02-05 15:38:51.406 CST 2328 54d08abc.918 8 0]PANIC: heap_insert_redo: failed to add tuple
[ 2015-02-05 15:38:51.406 CST 2328 54d08abc.918 9 0]CONTEXT: xlog redo insert: rel 38171461/16384/57220350; tid 1778398/9
[ 2015-02-05 15:38:51.765 CST 2320 54d08abb.910 6 0]LOG: startup process (PID 2328) was terminated by signal 6: Aborted
[ 2015-02-05 15:38:51.765 CST 2320 54d08abb.910 7 0]LOG: terminating any other active server processes
[DBusesr DBname [unknown] 192.168.xxx.x 2015-02-05 15:38:51.765 CST 61450 54d31d48.f00a 3 0]WARNING: terminating connection because of crash of another server process
[DBusesr DBname [unknown] 192.168.xxx.x 2015-02-05 15:38:51.765 CST 61450 54d31d48.f00a 4 0]DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[DBusesr DBname [unknown] 192.168.xxx.x 2015-02-05 15:38:51.765 CST 61450 54d31d48.f00a 5 0]HINT: In a moment you should be able to reconnect to the database and repeat your command.
[DBusesr DBname [unknown] 192.168.xxx.x 2015-02-05 15:38:51.765 CST 51208 54d315b6.c808 7 0]WARNING: terminating connection because of crash of another server process
[DBusesr DBname [unknown] 192.168.xxx.x 2015-02-05 15:38:51.765 CST 51208 54d315b6.c808 8 0]DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[DBusesr DBname [unknown] 192.168.xxx.x 2015-02-05 15:38:51.765 CST 51208 54d315b6.c808 9 0]HINT: In a moment you should be able to reconnect to the database and repeat your command.
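
As a reading aid for the CONTEXT lines above: in this redo message, rel is printed as tablespace OID/database OID/relfilenode, and tid as block number/line pointer offset. The sketch below (Python, with a hypothetical helper named locate_block) turns that into the segment file and byte offset of the affected page. It assumes the default 8 kB block size and 1 GB segment size, which is only a guess based on the CONFIGURE line above listing no block-size or segment-size options.

```python
import re

BLOCK_SIZE = 8192        # assumption: default block size (BLCKSZ)
SEGMENT_BYTES = 1 << 30  # assumption: default 1 GB relation segment size

CONTEXT_RE = re.compile(
    r"xlog redo insert: rel (\d+)/(\d+)/(\d+); tid (\d+)/(\d+)"
)


def locate_block(context_line: str) -> dict:
    """Map an 'xlog redo insert' CONTEXT line to a segment file and offset."""
    m = CONTEXT_RE.search(context_line)
    if m is None:
        raise ValueError("not an 'xlog redo insert' CONTEXT line")
    spc, db, relfilenode, block, item = map(int, m.groups())
    blocks_per_segment = SEGMENT_BYTES // BLOCK_SIZE
    segment, block_in_segment = divmod(block, blocks_per_segment)
    # Segment 0 has no suffix; later segments are named <relfilenode>.<n>.
    file_name = str(relfilenode) if segment == 0 else f"{relfilenode}.{segment}"
    return {
        "tablespace_oid": spc,
        "database_oid": db,
        "relfilenode": relfilenode,
        "block": block,
        "line_pointer": item,
        "segment_file": file_name,
        "byte_offset": block_in_segment * BLOCK_SIZE,
    }


line = "CONTEXT: xlog redo insert: rel 38171461/16384/57220350; tid 1778398/9"
print(locate_block(line))
```

Under those assumptions the page is in segment file 57220350.13 at byte offset 609992704. The tablespace OID is neither 1663 (pg_default) nor 1664 (pg_global), so the relation lives in a user-created tablespace.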


The slave was running but stopped suddenly, and I have not started it again!

Has anyone encountered the same problem? Could you tell me why it happened and how to avoid it?

If you need some more detailed information, please tell me and I'll give it to you.


Thanks

Best Regards!

On 03/02/2015 02:49 AM, hailong Li wrote:
>
>
> Hi, dear pgsql-general
>
>
> The details are as follows:
>
> The slave was running but stopped suddenly, and I have not started it again!
>
> Has anyone encountered the same problem? Could you tell me why it happened and how to avoid it?
>
> If you need some more detailed information, please tell me and I'll give it to you.

So what sort of replication (streaming, archiving, synchronous, etc.) were
you doing?

Was the streaming happening across a local network or a remote network?

Was there a hardware issue on either of the machines?

>
>
> Thanks
>
> Best Regards!
>


--
Adrian Klaver
adrian.klaver@aklaver.com




2015-03-02 23:21 GMT+08:00 Adrian Klaver <adrian.klaver@aklaver.com>:
> So what sort of replication (streaming, archiving, synchronous, etc.) were you doing?

Streaming.

> Was the streaming happening across a local network or a remote network?

Local network.

> Was there a hardware issue on either of the machines?

I did not find any hardware problem on either machine at that time, but some of the data is in a tablespace located on the SSD of the master machine.

$ psql -U postgres
psql (9.3.5)
Type "help" for help.

postgres=# \db+
                           List of tablespaces
    Name    |  Owner   |   Location    | Access privileges | Description
------------+----------+---------------+-------------------+-------------
 pg_default | postgres |               |                   |
 pg_global  | postgres |               |                   |
 pgtblspc   | laser    | /ssd/pgtblspc |                   |
(3 rows)


Finally, I made a new slave instance on the slave server, and it has worked fine so far.




Thanks

Best Regards!




On 3/3/15 6:52 AM, hailong Li wrote:
>
> Finally, I made a new slave instance on the slave server, and it has worked
> fine so far.

Just so you're aware, that error means there was page level corruption
either on the replica or possibly on the master, or the replication
stream or WAL files got corrupted. You probably have either a hardware
or a configuration problem somewhere.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




2015-03-05 16:34 GMT+08:00 Jim Nasby <Jim.Nasby@bluetreble.com>:
> On 3/3/15 6:52 AM, hailong Li wrote:
> > Finally, I made a new slave instance on the slave server, and it has worked
> > fine so far.
>
> Just so you're aware, that error means there was page level corruption either on the replica or possibly on the master, or the replication stream or WAL files got corrupted. You probably have either a hardware or a configuration problem somewhere.

Actually, there were three slave nodes replicating from the same master, and they all stopped at nearly the same time. So I think page-level corruption on the master is more likely than corruption on the slaves, and an SSD hardware problem is more likely than a configuration problem.
 