Thread: destroyed db/index (corruption)

destroyed db/index (corruption)

From
Michael Monnerie
Date:
dear list,

I've found database corruptions occuring with openSUSE 10.1, postgreSQL
version # rpm -qa|grep postg
postgresql-libs-8.1.4-1.2
postgresql-server-8.1.4-1.2
postgresql-8.1.4-1.2

I found this in database logs:
2007-02-05 13:15:51 CET DB=bayes_pg3 HOST=195.202.170.130(54789)
SESSTRT=2007-02-04 05:01:15 CET FEHLER:  duplizierter Schlüssel
verletzt Unique-Constraint »bayes_seen_pkey«

translated it says "duplicate key violates unique constraint", and it's
true, I have 2 rows with the same Primary Key in that table (make dump,
then reload fails on it).

The DB is a bayes DB for spamassassin. I've had this error before, and
dumped/edited dump/reloaded the db. But it keeps coming again. So there
might be hardware or software problems.

The hardware is a server running one AMD Opteron dual core 2210, with
8GB RAM, and VMware. We have www, db, mailserver and mailrelays running
within a VM, and the only problem I see is with the db, so it doesn't
look like a hardware problem.

I could believe it's a software problem. But which? Could a bad config
cause such problems? I've set this in postgresql.conf:

listen_addresses = '*'
port = 5432
max_connections = 200
shared_buffers = 50000
max_prepared_transactions = 0
work_mem = 128000
maintenance_work_mem = 128000
max_fsm_pages = 60000
max_fsm_relations = 1000
max_files_per_process = 5000
vacuum_cost_delay = 100
vacuum_cost_page_hit = 1
vacuum_cost_page_miss = 10
vacuum_cost_page_dirty = 20
vacuum_cost_limit = 10000
bgwriter_delay = 1000
bgwriter_lru_percent = 5.0
bgwriter_lru_maxpages = 100
bgwriter_all_percent = 1.333
bgwriter_all_maxpages = 100
fsync = off
wal_buffers = 8
commit_delay = 3000
commit_siblings = 25
checkpoint_segments = 4
checkpoint_timeout = 300
checkpoint_warning = 30
enable_bitmapscan = on
enable_hashagg = on
enable_hashjoin = on
enable_indexscan = on
enable_mergejoin = on
enable_nestloop = on
enable_seqscan = on
enable_sort = on
enable_tidscan = on
effective_cache_size = 64000
geqo = on
log_destination = 'stderr'
redirect_stderr = on
log_directory = '/var/log/'
log_filename = 'postgresql.%Y'
log_truncate_on_rotation = off
log_rotation_age = 0
log_rotation_size = 0
log_min_duration_statement = 5000
log_connections = off
log_disconnections = off
log_duration = off
log_line_prefix = '%t DB=%d HOST=%r SESSTRT=%s '
log_statement = 'ddl'
stats_start_collector = on
stats_block_level = on
stats_row_level = on
autovacuum = on
autovacuum_naptime = 3600
autovacuum_vacuum_threshold = 5000
autovacuum_analyze_threshold = 2500
autovacuum_vacuum_scale_factor = 0.4
autovacuum_analyze_scale_factor = 0.2
autovacuum_vacuum_cost_delay = -1
autovacuum_vacuum_cost_limit = -1
lc_messages = 'de_DE.UTF-8'
lc_monetary = 'de_DE.UTF-8'
lc_numeric = 'de_DE.UTF-8'
lc_time = 'de_DE.UTF-8'

log_filename = 'postgresql.%Y'

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:        "curl -s http://zmi.at/zmi4.asc | gpg --import"
// Fingerprint: EA39 8918 EDFF 0A68 ACFB  11B7 BA2D 060F 1C6F E6B0
// Keyserver: www.keyserver.net                   Key-ID: 1C6FE6B0

Attachment

Re: destroyed db/index (corruption)

From
Achilleas Mantzios
Date:
Στις Δευτέρα 05 Φεβρουάριος 2007 16:00, ο/η Michael Monnerie έγραψε:
> dear list,
>
> I've found database corruptions occuring with openSUSE 10.1, postgreSQL
> version # rpm -qa|grep postg
> postgresql-libs-8.1.4-1.2
> postgresql-server-8.1.4-1.2
> postgresql-8.1.4-1.2
>
> I found this in database logs:
> 2007-02-05 13:15:51 CET DB=bayes_pg3 HOST=195.202.170.130(54789)
> SESSTRT=2007-02-04 05:01:15 CET FEHLER:  duplizierter Schlüssel
> verletzt Unique-Constraint »bayes_seen_pkey«
>
> translated it says "duplicate key violates unique constraint", and it's
> true, I have 2 rows with the same Primary Key in that table (make dump,
> then reload fails on it).

This ERROR is normally thrown when you *try* to violate a unique constraint.
Can you demonstrate the presence of the UNIQUE constraint on your table
and the pair of identical key values rows?
Post a description (\d) of your table, and then the select stmt which return
the 2 bad rows.

I might be wrong, but i would bet the whole issue began because
some rows were inserted by a client (other than the reloading psql) during
reload but before the UNIQUE KEY DDL
was executed, or smth like that, or simply because this bayes_seen value is
attempted to be inserted again
(for reasons beyond our scope)
>
> The DB is a bayes DB for spamassassin. I've had this error before, and
> dumped/edited dump/reloaded the db. But it keeps coming again. So there
> might be hardware or software problems.
>
> The hardware is a server running one AMD Opteron dual core 2210, with
> 8GB RAM, and VMware. We have www, db, mailserver and mailrelays running
> within a VM, and the only problem I see is with the db, so it doesn't
> look like a hardware problem.
>
> I could believe it's a software problem. But which? Could a bad config
> cause such problems? I've set this in postgresql.conf:
>
> listen_addresses = '*'
> port = 5432
> max_connections = 200
> shared_buffers = 50000
> max_prepared_transactions = 0
> work_mem = 128000
> maintenance_work_mem = 128000
> max_fsm_pages = 60000
> max_fsm_relations = 1000
> max_files_per_process = 5000
> vacuum_cost_delay = 100
> vacuum_cost_page_hit = 1
> vacuum_cost_page_miss = 10
> vacuum_cost_page_dirty = 20
> vacuum_cost_limit = 10000
> bgwriter_delay = 1000
> bgwriter_lru_percent = 5.0
> bgwriter_lru_maxpages = 100
> bgwriter_all_percent = 1.333
> bgwriter_all_maxpages = 100
> fsync = off
> wal_buffers = 8
> commit_delay = 3000
> commit_siblings = 25
> checkpoint_segments = 4
> checkpoint_timeout = 300
> checkpoint_warning = 30
> enable_bitmapscan = on
> enable_hashagg = on
> enable_hashjoin = on
> enable_indexscan = on
> enable_mergejoin = on
> enable_nestloop = on
> enable_seqscan = on
> enable_sort = on
> enable_tidscan = on
> effective_cache_size = 64000
> geqo = on
> log_destination = 'stderr'
> redirect_stderr = on
> log_directory = '/var/log/'
> log_filename = 'postgresql.%Y'
> log_truncate_on_rotation = off
> log_rotation_age = 0
> log_rotation_size = 0
> log_min_duration_statement = 5000
> log_connections = off
> log_disconnections = off
> log_duration = off
> log_line_prefix = '%t DB=%d HOST=%r SESSTRT=%s '
> log_statement = 'ddl'
> stats_start_collector = on
> stats_block_level = on
> stats_row_level = on
> autovacuum = on
> autovacuum_naptime = 3600
> autovacuum_vacuum_threshold = 5000
> autovacuum_analyze_threshold = 2500
> autovacuum_vacuum_scale_factor = 0.4
> autovacuum_analyze_scale_factor = 0.2
> autovacuum_vacuum_cost_delay = -1
> autovacuum_vacuum_cost_limit = -1
> lc_messages = 'de_DE.UTF-8'
> lc_monetary = 'de_DE.UTF-8'
> lc_numeric = 'de_DE.UTF-8'
> lc_time = 'de_DE.UTF-8'
>
> log_filename = 'postgresql.%Y'
>
> mfg zmi

--
Achilleas Mantzios

Re: destroyed db/index (corruption)

From
Michael Monnerie
Date:
On Montag, 5. Februar 2007 15:22 Achilleas Mantzios wrote:
> This ERROR is normally thrown when you *try* to violate a unique
> constraint. Can you demonstrate the presence of the UNIQUE constraint
> on your table and the pair of identical key values rows?
> Post a description (\d) of your table, and then the select stmt which
> return the 2 bad rows.

Kalispera Achilleas,

I will do this soon. Just now I'm upgrading that VM from SUSE 10.1 to
10.2, so later this night I'll reload the databases.

> I might be wrong, but i would bet the whole issue began because
> some rows were inserted by a client (other than the reloading psql)
> during reload but before the UNIQUE KEY DDL
> was executed, or smth like that, or simply because this bayes_seen
> value is attempted to be inserted again
> (for reasons beyond our scope)

No, I started those DBs fresh with the command

pg_restore -v pg.fulldump.bayes_power2u.sqlz | \
psql -U postgres bayes_pg3 -f -

Is there an easy way to find out which key is duplicated while I have
that DB running? The log doesn't show it, and the last time I did
reload with trial-and-error:
- start reload
- it stops somewhere complaining about line 34587328 or similiar
- make some tail & head foo, or vi, to delete the one duplicate line
- start again...

It would be nice to have a simpler process. Could it be that the stored
procedure could cause a problem? I took all that from the SpamAssassin
Wiki, without modification.

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:        "curl -s http://zmi.at/zmi4.asc | gpg --import"
// Fingerprint: EA39 8918 EDFF 0A68 ACFB  11B7 BA2D 060F 1C6F E6B0
// Keyserver: www.keyserver.net                   Key-ID: 1C6FE6B0

Attachment

Re: destroyed db/index (corruption)

From
Achilleas Mantzios
Date:
Στις Δευτέρα 05 Φεβρουάριος 2007 23:36, ο/η Michael Monnerie έγραψε:
> On Montag, 5. Februar 2007 15:22 Achilleas Mantzios wrote:
> > This ERROR is normally thrown when you *try* to violate a unique
> > constraint. Can you demonstrate the presence of the UNIQUE constraint
> > on your table and the pair of identical key values rows?
> > Post a description (\d) of your table, and then the select stmt which
> > return the 2 bad rows.
>
> Kalispera Achilleas,
:)
>
> I will do this soon. Just now I'm upgrading that VM from SUSE 10.1 to
> 10.2, so later this night I'll reload the databases.
>
> > I might be wrong, but i would bet the whole issue began because
> > some rows were inserted by a client (other than the reloading psql)
> > during reload but before the UNIQUE KEY DDL
> > was executed, or smth like that, or simply because this bayes_seen
> > value is attempted to be inserted again
> > (for reasons beyond our scope)
>
> No, I started those DBs fresh with the command
>
> pg_restore -v pg.fulldump.bayes_power2u.sqlz | \
> psql -U postgres bayes_pg3 -f -
>

Thats what i mean, postmaster is alive, and accepting connections
during the above command, so what would prevent a 2nd client (from your mail
server setup most likely)
from inserting a value that is later present in the dump,
which is later attempted to be COPY'ed to the table
and ofcource that violates the PRIMARY KEY CONSTRAINT?
In other words, do you have full control of your DB when that happens?
If some 2nd client, not controled by you, inserts a value that is later
found in the dump, it is completely normal that this ERROR will be thrown.
Have you tried, for instance, disabling all unwanted access to postmaster,
e.g. through data/pg_hba.conf ?

> Is there an easy way to find out which key is duplicated while I have
> that DB running? The log doesn't show it, and the last time I did
> reload with trial-and-error:
> - start reload
> - it stops somewhere complaining about line 34587328 or similiar
> - make some tail & head foo, or vi, to delete the one duplicate line
> - start again...
>

The "steps" you mentioned earlier are not what the average admin whould do.
For the average admin that would trigger some kind of panic mode!
It is simply not normal to have this ERROR produced by the dump alone.
Try to isolate your DB, by controling access to it, and try to replicate
the problem (if any).

> It would be nice to have a simpler process. Could it be that the stored
> procedure could cause a problem? I took all that from the SpamAssassin
> Wiki, without modification.
>
> mfg zmi

--
Achilleas Mantzios

Re: destroyed db/index (corruption) [SOLVED]

From
Michael Monnerie
Date:
On Dienstag, 6. Februar 2007 08:46 Achilleas Mantzios wrote:
> Thats what i mean, postmaster is alive, and accepting connections
> during the above command, so what would prevent a 2nd client (from
> your mail server setup most likely)
> from inserting a value that is later present in the dump,
> which is later attempted to be COPY'ed to the table
> and ofcource that violates the PRIMARY KEY CONSTRAINT?
> In other words, do you have full control of your DB when that
> happens? If some 2nd client, not controled by you, inserts a value
> that is later found in the dump, it is completely normal that this
> ERROR will be thrown. Have you tried, for instance, disabling all
> unwanted access to postmaster, e.g. through data/pg_hba.conf ?

After your tip, I rechecked the old logs. There, I had
"Tabelle enthält duplizierte Werte"
(table contains duplicate values)
but now the log message was
"duplizierter Schlüssel verletzt Unique-Constraint »bayes_token_pkey«"
(duplicate key violates unique constraint)

The latter message is just an INFO, though its logged as ERROR. It
happens when you try to INSERT a primary key that's already there, but
this got only logged once:
ZUSAMMENHANG:  SQL-Anweisung »INSERT INTO bayes_token (id, token,
spam_count, ham_count, atime) VALUES ( $1 ,  $2 ,  $3 ,  $4 ,  $5 )«
        PL/pgSQL function "put_tokens" line 18 at SQL statement
All other times just the violation was logged.

I double checked with a dump and reload to a fresh db - there is no
spoon [The Matrix], and also no duplicate values exist.

The error I got december 2006 with the duplicate values was really a
situation where values seem to have been inserted during a reload, but
before the index was there. I was just nervous and mixed the two
different warnings.

But I'd like to suggest changing the ERROR to WARNING for "duplicate key
violates unique constraint", as it's not serious - but it sounds like a
big problem.

Efharisto, Achilleas!

mfg zmi
--
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0676/846 914 666                      .network.your.ideas.
// PGP Key:        "curl -s http://zmi.at/zmi4.asc | gpg --import"
// Fingerprint: EA39 8918 EDFF 0A68 ACFB  11B7 BA2D 060F 1C6F E6B0
// Keyserver: www.keyserver.net                   Key-ID: 1C6FE6B0

Attachment

Re: destroyed db/index (corruption) [SOLVED]

From
Achilleas Mantzios
Date:
Στις Πέμπτη 08 Φεβρουάριος 2007 00:07, ο/η Michael Monnerie έγραψε:
>
> But I'd like to suggest changing the ERROR to WARNING for "duplicate key
> violates unique constraint", as it's not serious - but it sounds like a
> big problem.
>
I dont think that would be appropriate. Attempt to violate a UNIQUE CONSTRAINT
is an erroneous act.
> Efharisto, Achilleas!
Bitte Sen!
>
> mfg zmi

--
Achilleas Mantzios