Thread: BUG #15299: relation does not exist errors
The following bug has been logged on the website:

Bug reference:      15299
Logged by:          Jeff Frost
Email address:      jeff@pgexperts.com
PostgreSQL version: 9.5.13
Operating system:   Ubuntu 14.04
Description:

We recently upgraded some production servers to 9.5.13 on Saturday (7/21/2018) and they ran fine for 2 days. On Wednesday morning, we started seeing some strange issues where postgres was reporting tons of "relation" does not exist errors. Here are some snippets from the logs:

2018-07-25 14:33:53.243 GMT,"app_db","webapp",34887,"127.0.0.1:56740",5b587d3b.8847,90363,"PARSE",2018-07-25 13:38:03 GMT,287/3021104,0,ERROR,42P01,"relation ""login_informations"" does not exist",,,,,,"SELECT ""login_informations"".* FROM ""login_informations"" WHERE ""login_informations"".""id"" = 735211 LIMIT 1",37,,"sidekiq 4.1.2 procore [0 of 5 busy]"
2018-07-25 14:33:53.251 GMT,"app_db","webapp",34887,"127.0.0.1:56740",5b587d3b.8847,90364,"PARSE",2018-07-25 13:38:03 GMT,287/3021107,0,ERROR,42P01,"relation ""login_informations"" does not exist",,,,,,"SELECT ""login_informations"".* FROM ""login_informations"" WHERE ""login_informations"".""id"" = 735211 LIMIT 1",37,,"sidekiq 4.1.2 procore [0 of 5 busy]"
2018-07-25 14:33:53.252 GMT,"app_db","webapp",34887,"127.0.0.1:56740",5b587d3b.8847,90365,"PARSE",2018-07-25 13:38:03 GMT,287/3021109,0,ERROR,42P01,"relation ""login_informations"" does not exist",,,,,,"SELECT ""login_informations"".* FROM ""login_informations"" WHERE ""login_informations"".""id"" = 735211 LIMIT 1",37,,"sidekiq 4.1.2 procore [0 of 5 busy]"
2018-07-25 14:33:53.258 GMT,"app_db","webapp",34887,"127.0.0.1:56740",5b587d3b.8847,90366,"PARSE",2018-07-25 13:38:03 GMT,287/3021110,0,ERROR,42P01,"relation ""login_informations"" does not exist",,,,,,"SELECT ""login_informations"".* FROM ""login_informations"" WHERE ""login_informations"".""id"" = 735211 LIMIT 1",37,,"sidekiq 4.1.2 procore [0 of 5 busy]"
2018-07-25 14:33:53.295 GMT,"app_db","webapp",34887,"127.0.0.1:56740",5b587d3b.8847,90367,"PARSE",2018-07-25 13:38:03 GMT,287/3021111,0,ERROR,42P01,"relation ""login_informations"" does not exist",,,,,,"SELECT ""login_informations"".* FROM ""login_informations"" WHERE ""login_informations"".""id"" = 735211 LIMIT 1",37,,"sidekiq 4.1.2 procore [0 of 5 busy]"
2018-07-25 14:33:53.296 GMT,"app_db","webapp",34887,"127.0.0.1:56740",5b587d3b.8847,90368,"PARSE",2018-07-25 13:38:03 GMT,287/3021112,0,ERROR,42P01,"relation ""permission_templates"" does not exist",,,,,,"SELECT ""permission_templates"".* FROM ""permission_templates"" WHERE ""permission_templates"".""id"" = 137880 LIMIT 1",39,,"sidekiq 4.1.2 procore [0 of 5 busy]"

Connecting to the DB via psql and issuing a \d login_informations showed that the table was there. Running one of the queries from the logs by hand worked fine.

Since we use pgbouncer as a connection pooler, we then connected using psql through pgbouncer on one of the affected hosts and were able to reproduce the issue, so the theory was that some postgresql backends had lost track of relations. We stopped/started pgbouncer to allow those backends to exit and the issue was resolved.

A little about the DB in case this is helpful:

DB size: 12TB
DB load: 25k xacts/sec
Characterization: 80% read / 20% write
Replication role: primary with 1 direct replica and a few cascaded replicas

So far haven't been able to reproduce synthetically, but let me know if there is any more info we can pull from the server if this happens again.
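For reference, the check through the pooler looked roughly like this (prompts abbreviated and the exact statements approximate):

-- psql connected directly to postgres: the table is visible and the query works
app_db=# \d login_informations
app_db=# SELECT "login_informations".* FROM "login_informations" WHERE "login_informations"."id" = 735211 LIMIT 1;
-- returns the row

-- psql connected to the same database through pgbouncer on an affected host:
app_db=# SELECT "login_informations".* FROM "login_informations" WHERE "login_informations"."id" = 735211 LIMIT 1;
ERROR:  relation "login_informations" does not exist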
"PG Bug reporting form" <noreply@postgresql.org> writes:
> We recently upgraded some production servers to 9.5.13 on Saturday
> (7/21/2018) and they ran fine for 2 days. On Wednesday morning, we started
> seeing some strange issues where postgres was reporting tons of "relation"
> does not exist errors. Here are some snippets from the logs:

> 2018-07-25 14:33:53.243 GMT,"app_db","webapp",34887,"127.0.0.1:56740",5b587d3b.8847,90363,"PARSE",2018-07-25 13:38:03 GMT,287/3021104,0,ERROR,42P01,"relation ""login_informations"" does not exist",,,,,,"SELECT ""login_informations"".* FROM ""login_informations"" WHERE ""login_informations"".""id"" = 735211 LIMIT 1",37,,"sidekiq 4.1.2 procore [0 of 5 busy]"

> Connecting to the DB via psql and issuing a \d login_informations showed
> that the table was there. Running one of the queries from the logs by hand
> worked fine.

My first thought would be something to do with a restrictive search_path.

			regards, tom lane
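PS: purely for illustration (the exact path value involved is a guess), a session whose search_path omits "public" fails in exactly this way:

SET search_path TO pg_catalog;   -- anything that leaves "public" out of the path
SELECT * FROM login_informations WHERE id = 735211 LIMIT 1;
-- ERROR:  relation "login_informations" does not exist
SELECT * FROM public.login_informations WHERE id = 735211 LIMIT 1;
-- fine, because the reference is schema-qualified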
> On Jul 26, 2018, at 9:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> "PG Bug reporting form" <noreply@postgresql.org> writes:
>> We recently upgraded some production servers to 9.5.13 on Saturday
>> (7/21/2018) and they ran fine for 2 days. On Wednesday morning, we started
>> seeing some strange issues where postgres was reporting tons of "relation"
>> does not exist errors. Here are some snippets from the logs:
>
>> 2018-07-25 14:33:53.243 GMT,"app_db","webapp",34887,"127.0.0.1:56740",5b587d3b.8847,90363,"PARSE",2018-07-25 13:38:03 GMT,287/3021104,0,ERROR,42P01,"relation ""login_informations"" does not exist",,,,,,"SELECT ""login_informations"".* FROM ""login_informations"" WHERE ""login_informations"".""id"" = 735211 LIMIT 1",37,,"sidekiq 4.1.2 procore [0 of 5 busy]"
>
>> Connecting to the DB via psql and issuing a \d login_informations showed
>> that the table was there. Running one of the queries from the logs by hand
>> worked fine.
>
> My first thought would be something to do with a restrictive search_path.

Good thought. As far as I can tell we never change it from the default, and everything in that DB is in the public schema.
Jeff Frost <jeff@pgexperts.com> writes:
> On Jul 26, 2018, at 9:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> My first thought would be something to do with a restrictive search_path.

> Good thought. As far as I can tell we never change it from the default and everything in that DB is in the public schema.

Right, but what I'm half-theorizing is that the patches for CVE-2018-1058 affected you somehow. Maybe something "transiently" installed pg_catalog (only) as search_path and forgot to remove it again. It would be easy to explain it that way if the errors had been coming out of autovacuum worker processes, for instance, though your log didn't match that exactly.

If you see it again it would be a good idea to look very closely at what happened right before the errors started.

			regards, tom lane
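PS: a quick check, if you can catch one of the affected backends again (e.g. by connecting through pgbouncer), would be something along these lines:

SHOW search_path;
SELECT name, setting, source FROM pg_settings WHERE name = 'search_path';
-- the default is "$user", public; seeing '' or pg_catalog here would support this theory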
> On Jul 26, 2018, at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Jeff Frost <jeff@pgexperts.com> writes:
>> On Jul 26, 2018, at 9:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> My first thought would be something to do with a restrictive search_path.
>
>> Good thought. As far as I can tell we never change it from the default and everything in that DB is in the public schema.
>
> Right, but what I'm half-theorizing is that the patches for CVE-2018-1058
> affected you somehow. Maybe something "transiently" installed pg_catalog
> (only) as search_path and forgot to remove it again. It would be easy
> to explain it that way if the errors had been coming out of autovacuum
> worker processes, for instance, though your log didn't match that exactly.
>
> If you see it again it would be a good idea to look very closely at what
> happened right before the errors started.

Oh, I see what you're saying. Don't the autovacuum processes connect to their own backends, which then exit after they're complete, though?

I have the logs around, so if you know what I should look for, I'd be happy to dig.
On 2018-07-26 13:18:33 -0700, Jeff Frost wrote:
> Oh, I see what you're saying. Don't the autovacuum processes connect to their own backends, which then exit after they're complete, though?

Yes, that's right. Creating an extension or calling a function that internally does a SET search_path seems like a more likely cause.

- Andres
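To illustrate the distinction (hypothetical functions, not anything from the reported schema): a SET clause attached to the function reverts when the call returns, whereas a set_config()/SET executed inside the body sticks to the session:

-- reverts automatically when the call returns
CREATE FUNCTION harmless() RETURNS void
  LANGUAGE plpgsql
  SET search_path = pg_catalog
  AS $$ BEGIN PERFORM 1; END $$;

-- leaves the session's search_path changed after the call
CREATE FUNCTION leaky() RETURNS void
  LANGUAGE plpgsql
  AS $$ BEGIN PERFORM set_config('search_path', 'pg_catalog', false); END $$;

SELECT leaky();
SHOW search_path;   -- now pg_catalog, and it stays that way until the session resets it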
> On Jul 26, 2018, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
>
> On 2018-07-26 13:18:33 -0700, Jeff Frost wrote:
>> Oh, I see what you're saying. Don't the autovacuum processes connect to their own backends, which then exit after they're complete, though?
>
> Yes, that's right. Creating an extension or calling a function that
> internally does a SET search_path seems like a more likely cause.

I grepped our code and didn't find anywhere that we set search_path, though it's possible some ruby gem we use could be doing it. I'll go spelunking there next!
> On Jul 26, 2018, at 1:53 PM, Jeff Frost <jeff@pgexperts.com> wrote:
>
> On Jul 26, 2018, at 1:23 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>> On 2018-07-26 13:18:33 -0700, Jeff Frost wrote:
>>> Oh, I see what you're saying. Don't the autovacuum processes connect to their own backends, which then exit after they're complete, though?
>>
>> Yes, that's right. Creating an extension or calling a function that
>> internally does a SET search_path seems like a more likely cause.
>
> I grepped our code and didn't find anywhere that we set search_path, though it's possible some ruby gem we use could be doing it. I'll go spelunking there next!

Wanted to close the loop on this. As usual, Tom was right.

We have a rake task in the application that is used to pull a production copy of structure.sql, and it uses pg_dump under the covers. Because it uses the same conninfo as the app, it happily goes through pgbouncer. So, when that rake task is run, pg_dump connects through pgbouncer and happily sets that restrictive search_path. It then creates the schema-only dump and, upon completion, that connection is returned to the pool as a big fat booby trap for the next application connection.
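For anyone who hits this later: as I understand it, pg_dump since the CVE-2018-1058 fixes pins a restrictive search_path on its own connection (roughly the first statement below), and if the pooler hands that server connection to the next client without resetting it (for example in transaction pooling, where pgbouncer skips server_reset_query unless server_reset_query_always is set), the client inherits the setting. A rough re-enactment in SQL:

SELECT pg_catalog.set_config('search_path', '', false);   -- approximately what current pg_dump runs on its connection
SELECT * FROM login_informations WHERE id = 735211 LIMIT 1;
-- ERROR:  relation "login_informations" does not exist
DISCARD ALL;   -- what pgbouncer's default server_reset_query would normally issue before reuse
SELECT * FROM login_informations WHERE id = 735211 LIMIT 1;
-- works again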