Thread: postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"
Hoping someone may be able to offer some guidance on this recurring problem. I am providing this "problem report" to the general list as I understand the bugs list requires a set of reproduction steps we do not yet have. Please advise if I have the process wrong. I have tried to provide all known relevant info here. Thanks in advance for any insights.
--> A description of what you are trying to achieve and what results you expect.:
We are experiencing intermittent DB corruption in postgres 9.5.14. We are trying to identify and eliminate all sources. We are using two independent services for data replication, Slony-I v2.2.6 and a custom service developed in-house. Both are based on COPY operations. DB corruption is observed when COPY operations fail with an error of the form: 'invalid byte sequence for encoding "UTF8"'. This occurs with a frequency ranging between a few days and several weeks. Each incident is followed by a race to find and repair or remove corrupted data, which we are getting good at. Across well over a dozen incidents, the great majority originally showed corruption in a single VARCHAR(2000) column (value) in a single table (alert_attribute). In this time, we read about suspected and real problems with TOAST functionality and so made the decision to change alert_attribute.value to PLAIN storage. Since that change was made, most new incidents show corruption in the alert_attribute.name column instead (VARCHAR(200)). Another table (alert_instance) has been impacted as well. See below for their schemas.

We have looked high and low through system logs and device reporting utility output for any sign of hardware failures. We haven't turned up anything yet. We also tried rebuilding an entire DB from scratch. That did not seem to help.

We have not been performing routine reindexing. This is a problem we are working to correct. Normally our master DB serves for an 8-12 week period without reindexing before we failover to a peer. Before assuming the master role, the peer always begins by truncating the alert_instance and alert_attribute tables and loading all data from the current master.

Hardware specs are listed below. For storage, we have 8 INTEL SSDSA2BW12 direct-attached disks. We can provide additional info as needed.
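[Editor's note: for reference, the storage change described above is a single DDL statement; a minimal sketch against the ams.alert_attribute schema shown below:]

-- switch the column away from TOAST-able storage, as described above;
-- this only affects rows written afterwards: existing rows keep their
-- current representation until they are next updated
ALTER TABLE ams.alert_attribute ALTER COLUMN value SET STORAGE PLAIN;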
ams=# \d ams.alert_attribute
Table "ams.alert_attribute"
Column | Type | Modifiers
-------------------+-------------------------+-----------
alert_instance_id | integer | not null
name | character varying(200) | not null
data_type | smallint | not null
value | character varying(2000) |
Indexes:
    "pk_alert_attributes" PRIMARY KEY, btree (alert_instance_id, name), tablespace "tbls5"
    "idx_aa_aval" btree (name, value)
Foreign-key constraints:
    "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '2', 'kk')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('2')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_attribute FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_attribute FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=#
ams=# \d ams.alert_instance
Table "ams.alert_instance"
Column | Type | Modifiers
---------------------+--------------------------------+-----------
alert_instance_id | integer | not null
alert_definition_id | integer | not null
alert_instance_key | character varying(500) | not null
start_active_date | timestamp(0) without time zone | not null
stop_active_date | timestamp(0) without time zone |
active | smallint | not null
acknowledged | smallint | not null
ack_clear_time | timestamp(0) without time zone |
user_set_clear_time | smallint |
category_id | integer | not null
condition_start | timestamp(0) without time zone | not null
unack_reason | character varying(1) |
viewer_visible | smallint | not null
Indexes:
    "pk_alert_instance" PRIMARY KEY, btree (alert_instance_id), tablespace "tbls5"
    "idx_alert_inst_1" btree (alert_instance_key, alert_definition_id, alert_instance_id, active, acknowledged, ack_clear_time), tablespace "tbls5"
    "idx_alert_inst_cat_id" btree (category_id), tablespace "tbls5"
    "idx_alert_inst_def_id" btree (alert_definition_id), tablespace "tbls5"
Check constraints:
    "ck_alert_inst_acked" CHECK (acknowledged = 0 OR acknowledged = 1)
    "ck_alert_inst_active" CHECK (active = 0 OR active = 1)
    "ck_alert_inst_set_cl_tm" CHECK (user_set_clear_time = 0 OR user_set_clear_time = 1)
    "ck_alert_inst_viewer_vis" CHECK (viewer_visible = 0 OR viewer_visible = 1)
Foreign-key constraints:
    "fk_alert_inst_cat_id" FOREIGN KEY (category_id) REFERENCES ams.category(category_id)
    "fk_alert_inst_def_id" FOREIGN KEY (alert_definition_id) REFERENCES ams.alert_definition(alert_definition_id)
    "fk_alert_inst_unack_reason" FOREIGN KEY (unack_reason) REFERENCES ams.unack_reason(unack_reason)
Referenced by:
    TABLE "ams.alert_attribute" CONSTRAINT "fk_alert_attr_instance_id" FOREIGN KEY (alert_instance_id) REFERENCES ams.alert_instance(alert_instance_id) ON DELETE CASCADE
Triggers:
    _ams_cluster_logtrigger AFTER INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.logtrigger('_ams_cluster', '1', 'k')
    _ams_cluster_truncatetrigger BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.log_truncate('1')
Disabled user triggers:
    _ams_cluster_denyaccess BEFORE INSERT OR DELETE OR UPDATE ON ams.alert_instance FOR EACH ROW EXECUTE PROCEDURE _ams_cluster.denyaccess('_ams_cluster')
    _ams_cluster_truncatedeny BEFORE TRUNCATE ON ams.alert_instance FOR EACH STATEMENT EXECUTE PROCEDURE _ams_cluster.deny_truncate()

ams=#

--> PostgreSQL version number you are running:
9.5.14

--> How you installed PostgreSQL:
We use a customized installer. It copies a “make install”-based distribution into place on a target install host.
Our compiler is gcc 5.4.0.

CFLAGS: -m32 -march=opteron -mno-3dnow -ggdb -O2 -Wall

$P/CONFIGURE_OPTIONS := \
    --with-perl \
    --with-openssl \
    --with-libxml \
    --with-libxslt \
    --with-includes="$(COMMON)/include/perl $(COMMON)/include" \
    --with-libs="$(COMMON)/lib" --prefix=$($P/PREFIX) \
    --enable-debug \
    --disable-nls

--> Changes made to the settings in the postgresql.conf file: see Server Configuration for a quick way to list them all.
ams=# SELECT version();
version
------------------------------------------------------------------------------------------------------------------------
PostgreSQL 9.5.14 on i686-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9aka9.0.3) 5.4.0 20160609, 32-bit
(1 row)

ams=# SELECT name, current_setting(name), SOURCE
ams-# FROM pg_settings
ams-# WHERE SOURCE NOT IN ('default', 'override');
name | current_setting | source
--------------------------------+--------------------+----------------------
application_name | psql | client
autovacuum_vacuum_cost_delay | 100ms | configuration file
checkpoint_completion_target | 0.9 | configuration file
client_encoding | SQL_ASCII | client
cpu_index_tuple_cost | 0.001 | configuration file
cpu_operator_cost | 0.0005 | configuration file
cpu_tuple_cost | 0.002 | configuration file
DateStyle | ISO, MDY | configuration file
default_statistics_target | 150 | configuration file
default_text_search_config | pg_catalog.english | configuration file
effective_cache_size | 8007MB | configuration file
lc_messages | C | configuration file
lc_monetary | C | configuration file
lc_numeric | C | configuration file
lc_time | C | configuration file
listen_addresses | 0.0.0.0 | configuration file
log_autovacuum_min_duration | 0 | configuration file
log_checkpoints | on | configuration file
log_connections | on | configuration file
log_disconnections | on | configuration file
log_line_prefix | %m %p %u %h %d | configuration file
log_min_duration_statement | 1min | configuration file
log_min_messages | notice | configuration file
log_temp_files | 0 | configuration file
maintenance_work_mem | 512MB | configuration file
max_connections | 600 | configuration file
max_stack_depth | 2MB | environment variable
max_wal_size | 15360TB | configuration file
random_page_cost | 2.5 | configuration file
restart_after_crash | off | configuration file
shared_buffers | 2GB | configuration file
ssl | on | configuration file
ssl_ca_file | root.crt | configuration file
standard_conforming_strings | on | configuration file
statement_timeout | 1min | configuration file
superuser_reserved_connections | 10 | configuration file
syslog_ident | postgresql9.5-1.12 | configuration file
tcp_keepalives_idle | 0 | configuration file
unix_socket_permissions | 0770 | configuration file
work_mem | 30MB | configuration file
(40 rows)

ams=#

--> Operating system and version:
Ubuntu 16.04.5 LTS

--> What program you're using to connect to PostgreSQL:
Most DML is performed by a standalone JVM using the PostgreSQL JDBC Driver v42.0.0.

--> Is there anything relevant or unusual in the PostgreSQL server logs?:
We have sometimes but not always seen "invalid memory alloc request size" errors around the time of these events. Our experience suggests that "invalid memory alloc request size" errors are a symptom specifically of index corruption and are remediated by reindexing.

--> For questions about any kind of error:
--> What you were doing when the error happened / how to cause the error:
Periodic DML (inserts and updates) focused primarily on two DB tables (ams.alert_instance and ams.alert_attribute) with hourly batched transfers of inactive alerts to history tables (deletes from ams.alert_instance, ams.alert_attribute and corresponding inserts to ams.alert_instance_hist, ams.alert_attribute_hist within the same transaction).

--> The EXACT TEXT of the error message you're getting, if there is one: (Copy and paste the message to the email, do not send a screenshot)
2018-12-27 07:25:24.527 GMT 30462 appuser 127.0.0.1 ams ERROR: invalid byte sequence for encoding "UTF8": 0x86
The specific identified byte varies with each incident.

--> CPU Info
2*4-core Intel(R) Xeon(R) CPU E31270 @ 3.40GHz

--> Mem Info
16GB RAM

--> Storage Info
root@ams4.dfw.netmgmt:~# lsscsi -l
[0:0:0:0] disk ATA INTEL SSDSA2BW12 0362 /dev/sda
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:1:0] disk ATA INTEL SSDSA2BW12 0362 /dev/sdb
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:2:0] disk ATA INTEL SSDSA2BW12 0362 /dev/sdc
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:3:0] disk ATA INTEL SSDSA2BW12 0362 /dev/sdd
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:4:0] disk ATA INTEL SSDSA2BW12 0362 /dev/sde
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:5:0] disk ATA INTEL SSDSA2BW12 0362 /dev/sdf
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:6:0] disk ATA INTEL SSDSA2BW12 0362 /dev/sdg
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
[0:0:7:0] disk ATA INTEL SSDSA2BW12 0362 /dev/sdh
  state=running queue_depth=32 scsi_level=7 type=0 device_blocked=0 timeout=30
root@ams4.dfw.netmgmt:~#

root@ams4.dfw.netmgmt:~# modinfo md_mod
filename: /lib/modules/4.14.68-4.14.3-amd64-3adf3675665129fa/kernel/drivers/md/md-mod.ko
alias: block-major-9-*
alias: md
description: MD RAID framework
license: GPL
depends:
retpoline: Y
intree: Y
name: md_mod
vermagic: 4.14.68-4.14.3-amd64-3adf3675665129fa SMP mod_unload modversions
parm: start_dirty_degraded:int
parm: create_on_open:bool
root@ams4.dfw.netmgmt:~#
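[Editor's note: the reindex remediation mentioned in the server-logs answer above would amount to something like the following sketch against the two affected tables; note that REINDEX blocks writes to the table while it runs:]

REINDEX TABLE ams.alert_attribute;
REINDEX TABLE ams.alert_instance;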
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"
I think the experts will chime in soon, but why do you think this is db corruption and not just bad input?
https://github.com/postgres/postgres/blob/master/src/pl/plperl/expected/plperl_lc_1.out

Or it may also be an encoding issue.
https://pganalyze.com/docs/log-insights/app-errors/U137

Can you do a pg_dump and restore on a parallel instance? Does it result in failure?

We also ask the app to log data (temporarily) inserted so that we could figure out directly if there was bad data upstream, or have validations to prevent inserts when there is bad data.

Also, in our case the query was stuck at "PARSE" (if you do ps aux | grep postgres) and in some cases did result in OOM. But upgrading the client and using session mode pooling in pgbouncer worked for us.

Regards,
Vijay
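[Editor's note: a sketch of the dump-and-restore check suggested above, assuming a scratch cluster listening on port 5433 (port and file names are hypothetical); the restore's COPY FROM re-validates every value against the target encoding, so it will surface the same "invalid byte sequence" error if bad data is present:]

# dump the suspect database
pg_dump -Fc -f ams.dump ams

# restore into a scratch cluster and watch for encoding errors
createdb -p 5433 ams_check
pg_restore -p 5433 -d ams_check ams.dump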
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"
Vijaykumar Jain <vjain@opentable.com> wrote on 03/25/2019 03:07:19 PM:
> but why do you think this as db corruption and not just a bad input?
> https://github.com/postgres/postgres/blob/master/src/pl/plperl/expected/plperl_lc_1.out
This looked interesting to me in the settings below:
> client_encoding | SQL_ASCII | client
Unless you have set this explicitly, it will use the default encoding for the database. If it hasn't been explicitly set, then the source database (assuming that that output was from the source) is SQL_ASCII.
Double check the database encoding for the source database and target database. I'm wondering if you have SQL_ASCII for the source, and UTF8 for the target. If that is the case, you can take invalid UTF8 characters into the source, and they will fail to replicate to the target. That's not a Postgres problem, but an encoding mismatch.
Brad
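[Editor's note: to make the mismatch Brad describes concrete, a minimal psql sketch (the enc_test database name is hypothetical); a SQL_ASCII database accepts arbitrary bytes that a UTF8 database rejects at input time:]

-- SQL_ASCII performs no encoding validation, so any byte value is accepted
CREATE DATABASE enc_test ENCODING 'SQL_ASCII' LC_COLLATE 'C' LC_CTYPE 'C' TEMPLATE template0;
\c enc_test
SELECT E'\x86' AS raw_byte;   -- accepted; the 0x86 byte passes through unvalidated

-- the same literal against a UTF8 database fails input validation with:
-- ERROR:  invalid byte sequence for encoding "UTF8": 0x86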
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"
Vijaykumar Jain <vjain@opentable.com> wrote:

> but why do you think this as db corruption and not just a bad input?
> https://github.com/postgres/postgres/blob/master/src/pl/plperl/expected/plperl_lc_1.out
>
> can you do a pg_dump and restore on a parallel instance? does it
> result in failure?
>
> we also ask the app to log data (temporarily) inserted so that we
> could figure out directly if there was bad data upstream or have
> validations to prevent inserts when there is bad data.
>
> also, in our case the query was stuck at "PARSE" (if you do ps aux |
> grep postgres) and in some cases did result in oom.
> but upgrading the client and using session mode pooling in pgbouncer
> worked for us.

tptignor: We depend on a JVM which works through the v42.0.0 postgres driver. We haven't taken a hard look yet at the driver, but certainly would if there was a reason. I would have expected the server to catch bad data on input from the driver in the same way it catches bad data on COPY in.

> Regards,
> Vijay
Re: postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"
Thomas Tignor wrote:
> We are experiencing intermittent DB corruption in postgres 9.5.14. We are trying to
> identify and eliminate all sources. We are using two independent services for data
> replication, Slony-I v2.2.6 and a custom service developed in-house. Both are based
> on COPY operations. DB corruption is observed when COPY operations fail with an error
> of the form: 'invalid byte sequence for encoding "UTF8"'.
> This occurs with a frequency ranging between a few days and several weeks.
It might be a storage problem, but it could also be a problem with how the data
get into the database in the first place.
I would consider it unlikely that there is a bug in the TOAST code that changes
just one byte.
You could use block checksums to see if the storage is at fault, but unfortunately
it requires an "initdb", dump and restore to switch to checksums.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
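[Editor's note: whether a cluster already has data page checksums can be read from pg_control; a sketch, with a hypothetical data directory path:]

# "Data page checksum version: 0" means checksums are disabled in this cluster
pg_controldata /var/lib/postgresql/9.5/main | grep -i checksum

# a replacement cluster can be initialized with checksums enabled from the start
initdb --data-checksums -D /path/to/new/data/dir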
Re: [External] postgres 9.5 DB corruption: invalid byte sequence for encoding "UTF8"
Thomas Tignor <tptignor@yahoo.com> wrote on 03/25/2019 08:25:49 PM:
>
> Hi Brad,
> Thanks for writing. As I mentioned to Vijay, the "source" is a JVM
> using the postgres v42.0.0 JDBC driver. I do not believe we have any
> explicit encoding set, and so I expect the client encoding is
> SQL_ASCII. The DB is most definitely UTF8.
These statements are contradictory.
The value of client_encoding from your select on pg_settings is SQL_ASCII.
The docs clearly state:
https://www.postgresql.org/docs/9.5/runtime-config-client.html
"Sets the client-side encoding (character set). The default is to use the database encoding. "
If you haven't set client_encoding explicitly, then it is using the database encoding.
Run this on both the primary and replica to confirm the DB's are indeed both in UTF8 encoding:
SELECT datname, pg_encoding_to_char(encoding) FROM pg_database ;
Brad
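[Editor's note: a quick way to run Brad's check on both ends of the replication pair; the statements below are illustrative and should be executed against both the Slony origin and the subscriber:]

-- per-database encodings; the origin and subscriber should match
SELECT datname, pg_encoding_to_char(encoding) FROM pg_database;

-- and from the application / replication session itself
SHOW server_encoding;
SHOW client_encoding;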