Thread: Reindex "locked" standby database

Reindex "locked" standby database

From

Martín Fernández

Date:

15 December 2021, 06:15:27

Hello pg hackers!

Today we had to run a `REINDEX table CONCURRENTLY my_table;` in our production database due to considerable index
bloat.We used to deal with this problem in the past by using pg_repack but we stopped using it because our data
replicationtool doesn’t support “re creating” tables in the way that pg_repack does it (we are using Fivetran). 

The reindex went fine in the primary database and in one of our standby. The other standby that we also operate for
somereason ended up in a state where all transactions were locked by the WAL process and the WAL process was not able
tomake any progress. In order to solve this issue we had to move traffic from the “bad” standby to the healthy one and
thenkill all transactions that were running in the “bad” standby. After that, replication was able to resume
successfully. 

I’m just trying to understand what could have caused this issue. I was not able to identify any queries in the standby
thatwould be locking the WAL process. Any insight would be more than welcome! 

We are running pg12 in our cluster and standbys are replicating using physical replication lots.

Best,
Martín

Re: Reindex "locked" standby database

From

Michael Paquier

Date:

15 December 2021, 06:37:35

On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote:
> The reindex went fine in the primary database and in one of our
> standby. The other standby that we also operate for some reason
> ended up in a state where all transactions were locked by the WAL
> process and the WAL process was not able to make any progress. In
> order to solve this issue we had to move traffic from the “bad”
> standby to the healthy one and then kill all transactions that were
> running in the “bad” standby. After that, replication was able to
> resume successfully.

You are referring to the startup process that replays WAL, right?
Without having an idea about the type of workload your primary and/or
standbys are facing, as well as an idea of the configuration you are
using on both (hot_standby_feedback for one), I have no direct idea,
but that could be a conflict caused by a concurrent vacuum.

Seeing where things got stuck could also be useful, perhaps with a
backtrace of the area where it happens and some information around
it.

> I’m just trying to understand what could have caused this issue. I
> was not able to identify any queries in the standby that would be
> locking the WAL process. Any insight would be more than welcome!

That's not going to be easy without more information, I am afraid.
--
Michael

Attachment

signature.asc

Re: Reindex "locked" standby database

From

Mladen Gogala

Date:

15 December 2021, 06:52:36

On 12/14/21 22:37, Michael Paquier wrote:
> You are referring to the startup process that replays WAL, right?
> Without having an idea about the type of workload your primary and/or
> standbys are facing, as well as an idea of the configuration you are
> using on both (hot_standby_feedback for one), I have no direct idea,
> but that could be a conflict caused by a concurrent vacuum.

Hi Michael,

I am preparing for a standby deployment. I don't have a standby yet and, 
therefore, I don't have any standby problems. Would it be advisable to 
turn vacuum off on the standby? Applying WAL will also, in theory, 
populate the statistics which is also held in the database blocks.

Regards

-- 
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

Re: Reindex "locked" standby database

From

Martín Fernández

Date:

15 December 2021, 07:10:45

Micheal,

Thanks for much for the quick response.

On 15 Dec 2021, at 00:37, Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote:
The reindex went fine in the primary database and in one of our
standby. The other standby that we also operate for some reason
ended up in a state where all transactions were locked by the WAL
process and the WAL process was not able to make any progress. In
order to solve this issue we had to move traffic from the “bad”
standby to the healthy one and then kill all transactions that were
running in the “bad” standby. After that, replication was able to
resume successfully.

You are referring to the startup process that replays WAL, right?

That is correct, I’m talking about the startup process that replays the WAL files.

Without having an idea about the type of workload your primary and/or
standbys are facing, as well as an idea of the configuration you are
using on both (hot_standby_feedback for one), I have no direct idea,

Primary handle IOT data ingestion. The table that we had to REINDEX gets updated every time a new message arrives in the system so updated are happening very often on that table, thus, the index/table bloat. The standby at any point in time would be receiving queries that would take advantage of the indexes that were being re indexed. hot_standby_feedback is currently turned OFF on the standbys.

but that could be a conflict caused by a concurrent vacuum.

Seeing where things got stuck could also be useful, perhaps with a
backtrace of the area where it happens and some information around
it.

I’m just trying to understand what could have caused this issue. I
was not able to identify any queries in the standby that would be
locking the WAL process. Any insight would be more than welcome!

That's not going to be easy without more information, I am afraid.
--
Michael

Re: Reindex "locked" standby database

From

Martín Fernández

Date:

15 December 2021, 07:20:51


> On 15 Dec 2021, at 00:52, Mladen Gogala <gogala.mladen@gmail.com> wrote:
>
> On 12/14/21 22:37, Michael Paquier wrote:
>> You are referring to the startup process that replays WAL, right?
>> Without having an idea about the type of workload your primary and/or
>> standbys are facing, as well as an idea of the configuration you are
>> using on both (hot_standby_feedback for one), I have no direct idea,
>> but that could be a conflict caused by a concurrent vacuum.
>
> Hi Michael,
>
> I am preparing for a standby deployment. I don't have a standby yet and, therefore, I don't have any standby
problems.Would it be advisable to turn vacuum off on the standby? Applying WAL will also, in theory, populate the
statisticswhich is also held in the database blocks. 

Take this with grain of salt since I’m far from being an expert :) , just trying to help. To my knowledge, assuming you
wouldbe running a physical standby, vacuum operations wouldn’t run there, only in the primary. Changes would get
propagatedvia physical replication to your standby (blocks that change due to vacuuming on the primary).  

Hope that helps.

>
> Regards
>
> --
> Mladen Gogala
> Database Consultant
> Tel: (347) 321-1217
> https://dbwhisperer.wordpress.com
>
>
>