how to check that recovery is complete - Mailing list pgsql-general

From Dmitry O Litvintsev
Subject how to check that recovery is complete
Date
Msg-id SA0PR09MB71613DE760DBADC8712FDF17B9EE0@SA0PR09MB7161.namprd09.prod.outlook.com
Whole thread Raw
Responses Re: how to check that recovery is complete
List pgsql-general
Hi,

I have a workflow where I recover from PITR backup and run a query on it. The program that runs query
checks that it can connect to database in a loop, until it can, and then runs the query.
This has worked fine far. Recently I upgraded to 11 and I see that I can connect to DB while recovery is
not complete yet. See this:

< 2020-11-05 03:34:36.114 CST  >LOG:  starting archive recovery
< 2020-11-05 03:34:36.590 CST  >LOG:  restored log file "00000001000002EF000000F9" from archive
< 2020-11-05 03:34:36.641 CST  >LOG:  redo starts at 2EF/F9000028
...
< 2020-11-05 03:34:46.392 CST  >LOG:  restored log file "00000001000002F000000008" from archive
< 2020-11-05 03:34:46.658 CST 127.0.0.1 >FATAL:  the database system is starting up
< 2020-11-05 03:34:47.028 CST  >LOG:  restored log file "00000001000002F000000009" from archive

You can see above fail to connect , but sometime into recover I see;

< 2020-11-05 04:07:51.987 CST  >LOG:  restored log file "00000001000002F200000029" from archive
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >ERROR:  canceling statement due to conflict with recovery
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >DETAIL:  User query might have needed to see row versions that must be
removed.
< 2020-11-05 04:08:23.195 CST 127.0.0.1 >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.195 CST  >FATAL:  terminating connection due to administrator command
< 2020-11-05 04:08:23.195 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.195 CST  >FATAL:  terminating connection due to administrator command
< 2020-11-05 04:08:23.195 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:23.232 CST  >LOG:  background worker "parallel worker" (PID 13577) exited with exit code 1
< 2020-11-05 04:08:23.244 CST  >LOG:  background worker "parallel worker" (PID 13578) exited with exit code 1
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >FATAL:  terminating connection due to conflict with recovery
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >DETAIL:  User query might have needed to see row versions that must be
removed.
< 2020-11-05 04:08:23.244 CST 127.0.0.1 >HINT:  In a moment you should be able to reconnect to the database and repeat
yourcommand. 
< 2020-11-05 04:08:25.354 CST  >LOG:  restored log file "00000001000002F20000002A" from archive
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >ERROR:  canceling statement due to conflict with recovery
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >DETAIL:  User query might have needed to see row versions that must be
removed.
< 2020-11-05 04:08:55.555 CST 127.0.0.1 >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.556 CST  >FATAL:  terminating connection due to administrator command
< 2020-11-05 04:08:55.556 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.561 CST  >FATAL:  terminating connection due to administrator command
< 2020-11-05 04:08:55.561 CST  >STATEMENT:  select count(*) from file
< 2020-11-05 04:08:55.640 CST  >LOG:  background worker "parallel worker" (PID 13683) exited with exit code 1
< 2020-11-05 04:08:55.653 CST  >LOG:  background worker "parallel worker" (PID 13684) exited with exit code 1
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >FATAL:  terminating connection due to conflict with recovery
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >DETAIL:  User query might have needed to see row versions that must be
removed.
< 2020-11-05 04:08:55.653 CST 127.0.0.1 >HINT:  In a moment you should be able to reconnect to the database and repeat
yourcommand. 
< 2020-11-05 04:09:00.307 CST  >LOG:  restored log file "00000001000002F20000002B" from archive

As you can see a query "select count(*) from file" failed due to table not being restored yet. BUT connection was
allowed 
before DB was ready.... Only few hours after ;

< 2020-11-05 09:31:30.319 CST  >LOG:  archive recovery complete
< 2020-11-05 09:34:51.729 CST  >LOG:  database system is ready to accept connections

After which the query runs fine without errors. This is bad because I see that select count(*) takes progressively
longer to execute and the count(*) is not what I am interested in. I run the "real" query after that. As a result I add
hours to program execution time.

Is there a more robust method to it? Ideally I do not want to be able to connect to db until :

< 2020-11-05 09:31:30.319 CST  >LOG:  archive recovery complete
< 2020-11-05 09:34:51.729 CST  >LOG:  database system is ready to accept connections


And I believe this was the behavior before upgrade. If connection can't be disabled, how can I detect
condition "database system is ready to accept connections"

I believe "pg_isready" utility would succeed once it can connect. And as can see I could
connect way before DB is really ready.

Thanks!
Dmitry



pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: Fwd: JSONB order?
Next
From: Gabi Draghici
Date:
Subject: pgagent