Thread: BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database
BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 15640 Logged by: Mikhail Kiyanov Email address: m.kiianov@rtk-dc.ru PostgreSQL version: 9.6.11 Operating system: Debian GNU/Linux 9.1 (stretch) Description: Hi! We are currently having a problems with our production database. The problem is that database goes to crash and after that put into recovery mode with the following errors. Please see log files. 2019-02-18 05:30:35 GMT [40461]: [28685-1] db=vc_production,user=estore LOCATION: exec_execute_message, postgres.c:2052 2019-02-18 05:30:36 GMT [45739]: [1-1] db=,user= LOG: 00000: automatic analyze of table "template1.pg_catalog.pg_shdepend" system usage: CPU 0.00s/0.00u sec elapsed 0.07 sec 2019-02-18 05:30:36 GMT [45739]: [2-1] db=,user= LOCATION: do_analyze_rel, analyze.c:665 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL: XX000: cannot read pg_class without having selected a database 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION: ScanPgRelation, relcache.c:314 2019-02-18 05:30:36 GMT [40267]: [7-1] db=,user= LOG: 00000: autovacuum launcher process (PID 40378) exited with exit code 1 2019-02-18 05:30:36 GMT [40267]: [8-1] db=,user= LOCATION: LogChildExit, postmaster.c:3552 2019-02-18 05:30:36 GMT [40267]: [9-1] db=,user= LOG: 00000: terminating any other active server processes 2019-02-18 05:30:36 GMT [40267]: [10-1] db=,user= LOCATION: HandleChildCrash, postmaster.c:3294 2019-02-18 05:30:36 GMT [44893]: [5-1] db=vc_production,user=vcfront WARNING: 57P02: terminating connection because of crash of another server process 2019-02-18 05:30:36 GMT [44893]: [6-1] db=vc_production,user=vcfront DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. It happened several times. And everytime it happens at different time. Some addtitional information about our server: postgres=# select version(); version -------------------------------------------------------------------------------------------------------------------------------------- PostgreSQL 9.6.11 on x86_64-pc-linux-gnu (Debian 9.6.11-1.pgdg90+1), compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit (1 row) $ cat /etc/postgresql/9.6/main/postgresql.conf shared_preload_libraries = 'pg_stat_statements, timescaledb, pg_wait_sampling, repmgr, auto_explain' track_io_timing = on listen_addresses = '0.0.0.0' port = 5432 superuser_reserved_connections = 3 unix_socket_group = '' unix_socket_permissions = 0777 log_filename = 'postgresql-%a.log' log_rotation_age = 1d log_truncate_on_rotation = on log_line_prefix = '%t [%p]: [%l-1] db=%d,user=%u ' log_lock_waits = on log_temp_files = 0 log_connections = on log_checkpoints = on log_disconnections = on log_directory = '/storage/log/postgresql' log_error_verbosity = verbose logging_collector = on log_min_duration_statement = 1000 log_autovacuum_min_duration = 10 checkpoint_completion_target = 0.9 checkpoint_timeout = 30min effective_cache_size = 200GB shared_buffers = 32GB fsync = on autovacuum_max_workers = 25 bgwriter_delay = 10 bgwriter_lru_maxpages = 800 vacuum_cost_delay = 1 autovacuum_analyze_scale_factor = 0.01 data_directory = '/var/lib/postgresql/9.6/main' hba_file = '/etc/postgresql/9.6/main/pg_hba.conf' # master specific wal_level = 'hot_standby' max_wal_senders = 10 max_replication_slots = 9 wal_keep_segments = 100 max_connections = 500 work_mem = 400MB max_wal_size = 140GB min_wal_size = 40GB autovacuum_work_mem = 11GB autovacuum_vacuum_scale_factor = 0.01 autovacuum_freeze_max_age = 1000000000 autovacuum_multixact_freeze_max_age = 800000000 autovacuum_naptime = 1 maintenance_work_mem = 4096MB random_page_cost = 1 synchronous_commit = off # Extentions auto_explain.log_min_duration = '5s' vc_production=# select extname,extversion from pg_extension; extname | extversion --------------------+------------ plpgsql | 1.0 pageinspect | 1.5 citext | 1.3 pg_buffercache | 1.2 pg_stat_statements | 1.4 pgstattuple | 1.4 pg_repack | 1.4.4 pg_wait_sampling | 1.1 timescaledb | 0.8.0 (9 rows) Some tables have personal autovacuum settings: vc_production=# \d+ estore.events_0 Table "estore.events_0" Column | Type | Modifiers | Storage | Stats target | Description ------------+-----------------------------+------------------------------------------------------------+----------+--------------+------------- id | bigint | not null default nextval('estore.events_id_seq'::regclass) | plain | | camera_uid | text | not null | extended | | source | smallint | not null | plain | | kind | integer | not null | plain | | begin | timestamp without time zone | not null | plain | | end | timestamp without time zone | | plain | | properties | jsonb | not null | extended | | is_active | boolean | not null | plain | | Indexes: "events_0_pkey" PRIMARY KEY, btree (id), tablespace "local_01_ts" "events_0_camera_uid_begin_idx" btree (camera_uid, begin), tablespace "local_01_ts" "events_0_camera_uid_end_idx" btree (camera_uid, "end"), tablespace "local_01_ts" "events_0_camera_uid_expr_idx" btree (camera_uid, (properties ->> 'uid'::text)), tablespace "local_01_ts" "events_0_camera_uid_kind_source_idx" btree (camera_uid, kind, source), tablespace "local_01_ts" "events_0_camera_uid_source_begin_end_idx" btree (camera_uid, source, begin, "end"), tablespace "local_01_ts" "events_0_end_idx" btree ("end"), tablespace "local_01_ts" "events_0_src_knd_cmr_uid_prpt_idx" btree (source, kind, camera_uid, (properties ->> 'channel_uid'::text)), tablespace "local_01_ts" Check constraints: "events_0_camera_uid_check" CHECK (camera_uid >= '0'::text AND camera_uid < '1'::text) Inherits: estore.events Tablespace: "estore_ts" Options: autovacuum_vacuum_scale_factor=0.001 vc_production=# \d+ violet.alarm_events Table "violet.alarm_events" Column | Type | Modifiers | Storage | Stats target | Description ------------+-----------------------------+------------------------------------------------------------------+----------+--------------+------------- id | bigint | not null default nextval('violet.alarm_events_id_seq'::regclass) | plain | | camera_id | integer | not null | plain | | kind | integer | not null | plain | | begin | timestamp without time zone | not null | plain | | end | timestamp without time zone | | plain | | uid | text | not null | extended | | properties | jsonb | not null default '{}'::jsonb | extended | | created_at | timestamp without time zone | not null default now() | plain | | updated_at | timestamp without time zone | not null default now() | plain | | deleted_at | timestamp without time zone | | plain | | Indexes: "alarm_events_new_pkey" PRIMARY KEY, btree (id), tablespace "violet_ts" "alarm_events_new_begin_camera_id_idx" btree (begin, camera_id), tablespace "violet_ts" "alarm_events_new_begin_idx" btree (begin), tablespace "violet_ts" "alarm_events_new_camera_id_idx" btree (camera_id), tablespace "violet_ts" "alarm_events_new_camera_id_uid_idx" btree (camera_id, uid), tablespace "violet_ts" Foreign-key constraints: "alarm_events_camera_id_fkey" FOREIGN KEY (camera_id) REFERENCES violet.cameras(id) Tablespace: "violet_ts" Options: autovacuum_vacuum_cost_delay=0, autovacuum_vacuum_scale_factor=0.00001 This is a physical server with: $ free -m total used free shared buff/cache available Mem: 515897 47349 4112 34023 464436 431171 Swap: 0 0 0 $ grep 'processor' -c /proc/cpuinfo 80 model name : Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz The database located on ssd disks. This database is mission critical for our business. Would you be so kind to help us with our problem.
Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database
From
Alvaro Herrera
Date:
On 2019-Feb-18, PG Bug reporting form wrote: > 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL: XX000: cannot read pg_class without having selected a database > 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION: ScanPgRelation, relcache.c:314 Hmm, I wonder if this is a bug introduced in 6a46aba1cd6d (in 9.6.10) that only hits autovacuum launcher after a relcache init removal at just the wrong time, or something like that? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database
From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes: > 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL: XX000: cannot read > pg_class without having selected a database > 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION: ScanPgRelation, > relcache.c:314 > 2019-02-18 05:30:36 GMT [40267]: [7-1] db=,user= LOG: 00000: autovacuum > launcher process (PID 40378) exited with exit code 1 Hm. So something is trying to access pg_class (perhaps only on the way to reading something else) inside the autovacuum launcher, which it cannot do because the launcher isn't connected to any specific database. I have not heard of such behavior in core Postgres, which makes me think that the problem is in one of your extensions. You mention > shared_preload_libraries = 'pg_stat_statements, timescaledb, > pg_wait_sampling, repmgr, auto_explain' I'd suggest removing entries from shared_preload_libraries one at a time to see if the problem stops occurring. (Obviously, you have to restart the postmaster each time you change the list.) Once you've found the culprit, closer analysis would be possible. regards, tom lane
Re: BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2019-Feb-18, PG Bug reporting form wrote: >> 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL: XX000: cannot read pg_class without having selected a database >> 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION: ScanPgRelation, relcache.c:314 > Hmm, I wonder if this is a bug introduced in 6a46aba1cd6d (in 9.6.10) > that only hits autovacuum launcher after a relcache init removal at just > the wrong time, or something like that? I think/hope that the launcher isn't connected to sinval, either. regards, tom lane
Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database
From
Andres Freund
Date:
Hi, On 2019-02-18 11:29:49 -0500, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > On 2019-Feb-18, PG Bug reporting form wrote: > >> 2019-02-18 05:30:36 GMT [40378]: [2-1] db=,user= FATAL: XX000: cannot read pg_class without having selected a database > >> 2019-02-18 05:30:36 GMT [40378]: [3-1] db=,user= LOCATION: ScanPgRelation, relcache.c:314 > > > Hmm, I wonder if this is a bug introduced in 6a46aba1cd6d (in 9.6.10) > > that only hits autovacuum launcher after a relcache init removal at just > > the wrong time, or something like that? I don't immediately see how that'd be the culprit - we'd processed sinvals in some cases before that change too. I'm inclined to think Tom's guess is the right one, but ... My initial guess would be that one of the extensions registered a relcache callback, and does catalog accesses in an unconditional manner. > I think/hope that the launcher isn't connected to sinval, either. Hm, I don't think that's true? The launcher does InitPostgres() and everything (albeit without a database). And it needs to process relmapper updates, otherwise it can't access pg_database. And that's accessed via a normal heap_open(). Greetings, Andres Freund
Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database
From
Alvaro Herrera
Date:
On 2019-Feb-18, Andres Freund wrote: > My initial guess would be that one of the extensions registered a > relcache callback, and does catalog accesses in an unconditional manner. Hmm but why is extension code running in autovacuum launcher? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database
From
Andres Freund
Date:
On 2019-02-18 13:44:12 -0300, Alvaro Herrera wrote: > On 2019-Feb-18, Andres Freund wrote: > > > > My initial guess would be that one of the extensions registered a > > relcache callback, and does catalog accesses in an unconditional manner. > > Hmm but why is extension code running in autovacuum launcher? shared_preload_libraries + CacheRegisterRelcacheCallback().
Re: BUG #15640: FATAL: XX000: cannot read pg_class without having selected a database
From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2019-Feb-18, Andres Freund wrote: >> My initial guess would be that one of the extensions registered a >> relcache callback, and does catalog accesses in an unconditional manner. > Hmm but why is extension code running in autovacuum launcher? Anything registered in shared_preload_libraries is going to be active in every process. I don't particularly believe Andres' theory that CacheRegisterRelcacheCallback is involved, but it seems very likely that one of these extensions has hooked into some code path that the launcher can traverse, and it's doing more than is safe there. If this weren't a production database, I'd suggest changing that error to a PANIC so we could get a stack trace. That would almost certainly pinpoint the culprit. regards, tom lane
Re: BUG #15640: FATAL: XX000: cannot read pg_class without havingselected a database
From
Andres Freund
Date:
On 2019-02-18 12:32:43 -0500, Tom Lane wrote: > I don't particularly believe Andres' theory that > CacheRegisterRelcacheCallback is involved It's how I'd broken postgres temporarily while converting citus from a fork into an extension :)