Thread: BUG #7600: Database crash with data corruption

BUG #7600: Database crash with data corruption

From
hankiewicz@gmail.com
Date:
The following bug has been logged on the website:

Bug reference:      7600
Logged by:          Database crash with data corruption
Email address:      hankiewicz@gmail.com
PostgreSQL version: 9.2.1
Operating system:   Linux
Description:        =


We experienced database crash. =

Our configuration:
Single postgres server instance with 17 databases installed on it. Postgres
server is installed in chroot on Linux Centos.

The list of steps to reproduce bug:
1) Create two database on single server instance (production and sandbox)
2) On production perform normal operations (we use Trac with over 50 active
users)
3) on the sandbox database start explain plan of query
4) while explain plan is executing drop sandbox database
5) after recovery was completed we have discovered that sequences on
production database had wrong values

Our log:
ERROR:  database "sandbox" is being accessed by other users
DETAIL:  There are 2 other sessions using the database.
STATEMENT:  DROP DATABASE sandbox;
    =

ERROR:  database "sandbox" is being accessed by other users
DETAIL:  There is 1 other session using the database.
STATEMENT:  DROP DATABASE sandbox;
    =

LOG:  server process (PID 1482) was terminated by signal 2: Killed
DETAIL:  Failed process was running: EXPLAIN (ANALYZE off, VERBOSE off,
COSTS on, BUFFERS off, TIMING off )select * from ticket_custom tcu join =

            (ticket ti left join =

                (select tcin.field,tcin.time as tc_time, tcin.ticket, tcin.oldvalue =

                from ticket_change tcin =

                where tcin.field =3D 'status' and tcin.time =3D (select max(tcmax.time)=
 from
ticket_change tcmax where tcmax.field =3D 'status' and tcmax.ticket =3D
tcin.ticket)) tc
            on ti.id =3D tc.ticket) con =

        on tcu.ticket =3D con.id =

        join ticket_custom tcu2 on tcu2.ticket =3D con.id =

        join ticket_custom tcu3 on tcu3.ticket =3D con.id =

    where =

        tcu.ticket =3D con.id and =

        tcu.name =3D 'scen' and
        tcu2.name =3D 'err_type' and
        tcu3.name =3D 'err_details' and
        con.component not in ('xxxx', 'xxx xxxx') and
    --    tcu.value not in ('',' ') and =

        con.status <> 'closed' and =

        con.type =3D 'defect' and
        con.owner <> 'TEST_TEAM'
        order by 1 asc,5 asc
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2012-10-12 13:12:25
CEST
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 6/9678A688
WARNING:  could not open directory "base/29337": No such file or directory
CONTEXT:  xlog redo drop db: dir 29337/1663
WARNING:  some useless files may be left behind in old database directory
"base/29337"
CONTEXT:  xlog redo drop db: dir 29337/1663
LOG:  record with zero length at 6/96799D40
LOG:  redo done at 6/96799D10
LOG:  last completed transaction was at log time 2012-10-12
11:12:55.824318+00
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
LOG:  duration: 6533.616 ms  statement: DROP DATABASE sandbox;

Re: BUG #7600: Database crash with data corruption

From
Tom Lane
Date:
hankiewicz@gmail.com writes:
> The list of steps to reproduce bug:
> 1) Create two database on single server instance (production and sandbox)
> 2) On production perform normal operations (we use Trac with over 50 active
> users)
> 3) on the sandbox database start explain plan of query
> 4) while explain plan is executing drop sandbox database
> 5) after recovery was completed we have discovered that sequences on
> production database had wrong values

It's normal for sequences to be advanced by a few counts (up to 32)
after a database crash.  You seem to have done a "kill -9" on the
backend that was running the EXPLAIN, which would count as a crash.
So I don't see any bug here.

            regards, tom lane

Re: BUG #7600: Database crash with data corruption

From
Craig Ringer
Date:
On 10/12/2012 09:35 PM, hankiewicz@gmail.com wrote:
> 5) after recovery was completed we have discovered that sequences on
> production database had wrong values

To follow up on Tom's explanation, if you're relying on sequences not
having "holes" then your design is dangerously mistaken. A simple
ROLLBACK throws away any IDs that were obtained from sequences in that
transaction.

If you're saying that the sequence is wrong in that it's LOWER than IDs
present in the table, then that's a problem.

--
Craig Ringer

Re: BUG #7600: Database crash with data corruption

From
Craig Ringer
Date:
On 10/15/2012 04:17 PM, Michał Hankiewicz wrote:
> You misunderstand me.
> According to the server logs I have sent in the first message process
> received signal 2 (it is SIGINT) and according to the documentation this
> signal should not couse server crash. Wrong values of sequences does not
> mean hole in generated values, but sequence started to generate the same
> values once again causing unique constraint violation.

OK, that sounds more like a problem. It wasn't clear from your original
post that it was replaying used sequence values.

SIGINT should normally just terminate the statement, eg:

regress=# SELECT pg_sleep(100);
ERROR:  canceling statement due to user request

so I agree that something isn't right here.

Is this something you've been able to reproduce? Or is it a one-off
incident so far?

--
Craig Ringer

Re: BUG #7600: Database crash with data corruption

From
Michał Hankiewicz
Date:
You misunderstand me.
According to the server logs I have sent in the first message process received signal 2 (it is SIGINT) and according to the documentation this signal should not couse server crash. Wrong values of sequences does not mean hole in generated values, but sequence started to generate the same values once again causing unique constraint violation.

regards
Michał Hankiewicz

On Sat, Oct 13, 2012 at 1:10 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
On 10/12/2012 09:35 PM, hankiewicz@gmail.com wrote:
5) after recovery was completed we have discovered that sequences on
production database had wrong values

To follow up on Tom's explanation, if you're relying on sequences not having "holes" then your design is dangerously mistaken. A simple ROLLBACK throws away any IDs that were obtained from sequences in that transaction.

If you're saying that the sequence is wrong in that it's LOWER than IDs present in the table, then that's a problem.

--
Craig Ringer

Re: BUG #7600: Database crash with data corruption

From
Michał Hankiewicz
Date:
On Mon, Oct 15, 2012 at 3:31 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:

OK, that sounds more like a problem. It wasn't clear from your original post that it was replaying used sequence values.

SIGINT should normally just terminate the statement, eg:

regress=# SELECT pg_sleep(100);
ERROR:  canceling statement due to user request

so I agree that something isn't right here.

Is this something you've been able to reproduce? Or is it a one-off incident so far?

--
Craig Ringer


It is just one-off incident. We have increased sequences values and since friday everything is running fine.
We haven't tried to reproduce this situation.

Michał Hankiewicz