Thread: BUG #7600: Database crash with data corruption
The following bug has been logged on the website: Bug reference: 7600 Logged by: Database crash with data corruption Email address: hankiewicz@gmail.com PostgreSQL version: 9.2.1 Operating system: Linux Description: = We experienced database crash. = Our configuration: Single postgres server instance with 17 databases installed on it. Postgres server is installed in chroot on Linux Centos. The list of steps to reproduce bug: 1) Create two database on single server instance (production and sandbox) 2) On production perform normal operations (we use Trac with over 50 active users) 3) on the sandbox database start explain plan of query 4) while explain plan is executing drop sandbox database 5) after recovery was completed we have discovered that sequences on production database had wrong values Our log: ERROR: database "sandbox" is being accessed by other users DETAIL: There are 2 other sessions using the database. STATEMENT: DROP DATABASE sandbox; = ERROR: database "sandbox" is being accessed by other users DETAIL: There is 1 other session using the database. STATEMENT: DROP DATABASE sandbox; = LOG: server process (PID 1482) was terminated by signal 2: Killed DETAIL: Failed process was running: EXPLAIN (ANALYZE off, VERBOSE off, COSTS on, BUFFERS off, TIMING off )select * from ticket_custom tcu join = (ticket ti left join = (select tcin.field,tcin.time as tc_time, tcin.ticket, tcin.oldvalue = from ticket_change tcin = where tcin.field =3D 'status' and tcin.time =3D (select max(tcmax.time)= from ticket_change tcmax where tcmax.field =3D 'status' and tcmax.ticket =3D tcin.ticket)) tc on ti.id =3D tc.ticket) con = on tcu.ticket =3D con.id = join ticket_custom tcu2 on tcu2.ticket =3D con.id = join ticket_custom tcu3 on tcu3.ticket =3D con.id = where = tcu.ticket =3D con.id and = tcu.name =3D 'scen' and tcu2.name =3D 'err_type' and tcu3.name =3D 'err_details' and con.component not in ('xxxx', 'xxx xxxx') and -- tcu.value not in ('',' ') and = con.status <> 'closed' and = con.type =3D 'defect' and con.owner <> 'TEST_TEAM' order by 1 asc,5 asc LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. LOG: all server processes terminated; reinitializing LOG: database system was interrupted; last known up at 2012-10-12 13:12:25 CEST LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 6/9678A688 WARNING: could not open directory "base/29337": No such file or directory CONTEXT: xlog redo drop db: dir 29337/1663 WARNING: some useless files may be left behind in old database directory "base/29337" CONTEXT: xlog redo drop db: dir 29337/1663 LOG: record with zero length at 6/96799D40 LOG: redo done at 6/96799D10 LOG: last completed transaction was at log time 2012-10-12 11:12:55.824318+00 LOG: autovacuum launcher started LOG: database system is ready to accept connections LOG: duration: 6533.616 ms statement: DROP DATABASE sandbox;
hankiewicz@gmail.com writes: > The list of steps to reproduce bug: > 1) Create two database on single server instance (production and sandbox) > 2) On production perform normal operations (we use Trac with over 50 active > users) > 3) on the sandbox database start explain plan of query > 4) while explain plan is executing drop sandbox database > 5) after recovery was completed we have discovered that sequences on > production database had wrong values It's normal for sequences to be advanced by a few counts (up to 32) after a database crash. You seem to have done a "kill -9" on the backend that was running the EXPLAIN, which would count as a crash. So I don't see any bug here. regards, tom lane
On 10/12/2012 09:35 PM, hankiewicz@gmail.com wrote: > 5) after recovery was completed we have discovered that sequences on > production database had wrong values To follow up on Tom's explanation, if you're relying on sequences not having "holes" then your design is dangerously mistaken. A simple ROLLBACK throws away any IDs that were obtained from sequences in that transaction. If you're saying that the sequence is wrong in that it's LOWER than IDs present in the table, then that's a problem. -- Craig Ringer
On 10/15/2012 04:17 PM, MichaÅ Hankiewicz wrote: > You misunderstand me. > According to the server logs I have sent in the first message process > received signal 2 (it is SIGINT) and according to the documentation this > signal should not couse server crash. Wrong values of sequences does not > mean hole in generated values, but sequence started to generate the same > values once again causing unique constraint violation. OK, that sounds more like a problem. It wasn't clear from your original post that it was replaying used sequence values. SIGINT should normally just terminate the statement, eg: regress=# SELECT pg_sleep(100); ERROR: canceling statement due to user request so I agree that something isn't right here. Is this something you've been able to reproduce? Or is it a one-off incident so far? -- Craig Ringer
You misunderstand me.
According to the server logs I have sent in the first message process received signal 2 (it is SIGINT) and according to the documentation this signal should not couse server crash. Wrong values of sequences does not mean hole in generated values, but sequence started to generate the same values once again causing unique constraint violation.
regards
Michał Hankiewicz
According to the server logs I have sent in the first message process received signal 2 (it is SIGINT) and according to the documentation this signal should not couse server crash. Wrong values of sequences does not mean hole in generated values, but sequence started to generate the same values once again causing unique constraint violation.
regards
Michał Hankiewicz
On Sat, Oct 13, 2012 at 1:10 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
On 10/12/2012 09:35 PM, hankiewicz@gmail.com wrote:To follow up on Tom's explanation, if you're relying on sequences not having "holes" then your design is dangerously mistaken. A simple ROLLBACK throws away any IDs that were obtained from sequences in that transaction.5) after recovery was completed we have discovered that sequences on
production database had wrong values
If you're saying that the sequence is wrong in that it's LOWER than IDs present in the table, then that's a problem.
--
Craig Ringer
On Mon, Oct 15, 2012 at 3:31 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
OK, that sounds more like a problem. It wasn't clear from your original post that it was replaying used sequence values.
SIGINT should normally just terminate the statement, eg:
regress=# SELECT pg_sleep(100);
ERROR: canceling statement due to user request
so I agree that something isn't right here.
Is this something you've been able to reproduce? Or is it a one-off incident so far?
--
Craig Ringer
It is just one-off incident. We have increased sequences values and since friday everything is running fine.
We haven't tried to reproduce this situation.
Michał Hankiewicz