Thread: Reindex "locked" standby database
Hello pg hackers! Today we had to run a `REINDEX table CONCURRENTLY my_table;` in our production database due to considerable index bloat.We used to deal with this problem in the past by using pg_repack but we stopped using it because our data replicationtool doesn’t support “re creating” tables in the way that pg_repack does it (we are using Fivetran). The reindex went fine in the primary database and in one of our standby. The other standby that we also operate for somereason ended up in a state where all transactions were locked by the WAL process and the WAL process was not able tomake any progress. In order to solve this issue we had to move traffic from the “bad” standby to the healthy one and thenkill all transactions that were running in the “bad” standby. After that, replication was able to resume successfully. I’m just trying to understand what could have caused this issue. I was not able to identify any queries in the standby thatwould be locking the WAL process. Any insight would be more than welcome! We are running pg12 in our cluster and standbys are replicating using physical replication lots. Best, Martín
On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote: > The reindex went fine in the primary database and in one of our > standby. The other standby that we also operate for some reason > ended up in a state where all transactions were locked by the WAL > process and the WAL process was not able to make any progress. In > order to solve this issue we had to move traffic from the “bad” > standby to the healthy one and then kill all transactions that were > running in the “bad” standby. After that, replication was able to > resume successfully. You are referring to the startup process that replays WAL, right? Without having an idea about the type of workload your primary and/or standbys are facing, as well as an idea of the configuration you are using on both (hot_standby_feedback for one), I have no direct idea, but that could be a conflict caused by a concurrent vacuum. Seeing where things got stuck could also be useful, perhaps with a backtrace of the area where it happens and some information around it. > I’m just trying to understand what could have caused this issue. I > was not able to identify any queries in the standby that would be > locking the WAL process. Any insight would be more than welcome! That's not going to be easy without more information, I am afraid. -- Michael
Attachment
On 12/14/21 22:37, Michael Paquier wrote: > You are referring to the startup process that replays WAL, right? > Without having an idea about the type of workload your primary and/or > standbys are facing, as well as an idea of the configuration you are > using on both (hot_standby_feedback for one), I have no direct idea, > but that could be a conflict caused by a concurrent vacuum. Hi Michael, I am preparing for a standby deployment. I don't have a standby yet and, therefore, I don't have any standby problems. Would it be advisable to turn vacuum off on the standby? Applying WAL will also, in theory, populate the statistics which is also held in the database blocks. Regards -- Mladen Gogala Database Consultant Tel: (347) 321-1217 https://dbwhisperer.wordpress.com
Micheal,
Thanks for much for the quick response.
That is correct, I’m talking about the startup process that replays the WAL files.On 15 Dec 2021, at 00:37, Michael Paquier <michael@paquier.xyz> wrote:On Wed, Dec 15, 2021 at 12:15:27AM -0300, Martín Fernández wrote:The reindex went fine in the primary database and in one of our
standby. The other standby that we also operate for some reason
ended up in a state where all transactions were locked by the WAL
process and the WAL process was not able to make any progress. In
order to solve this issue we had to move traffic from the “bad”
standby to the healthy one and then kill all transactions that were
running in the “bad” standby. After that, replication was able to
resume successfully.
You are referring to the startup process that replays WAL, right?
Without having an idea about the type of workload your primary and/or
standbys are facing, as well as an idea of the configuration you are
using on both (hot_standby_feedback for one), I have no direct idea,
Primary handle IOT data ingestion. The table that we had to REINDEX gets updated every time a new message arrives in the system so updated are happening very often on that table, thus, the index/table bloat. The standby at any point in time would be receiving queries that would take advantage of the indexes that were being re indexed. hot_standby_feedback is currently turned OFF on the standbys.
but that could be a conflict caused by a concurrent vacuum.
Seeing where things got stuck could also be useful, perhaps with a
backtrace of the area where it happens and some information around
it.I’m just trying to understand what could have caused this issue. I
was not able to identify any queries in the standby that would be
locking the WAL process. Any insight would be more than welcome!
That's not going to be easy without more information, I am afraid.
--
Michael
> On 15 Dec 2021, at 00:52, Mladen Gogala <gogala.mladen@gmail.com> wrote: > > On 12/14/21 22:37, Michael Paquier wrote: >> You are referring to the startup process that replays WAL, right? >> Without having an idea about the type of workload your primary and/or >> standbys are facing, as well as an idea of the configuration you are >> using on both (hot_standby_feedback for one), I have no direct idea, >> but that could be a conflict caused by a concurrent vacuum. > > Hi Michael, > > I am preparing for a standby deployment. I don't have a standby yet and, therefore, I don't have any standby problems.Would it be advisable to turn vacuum off on the standby? Applying WAL will also, in theory, populate the statisticswhich is also held in the database blocks. Take this with grain of salt since I’m far from being an expert :) , just trying to help. To my knowledge, assuming you wouldbe running a physical standby, vacuum operations wouldn’t run there, only in the primary. Changes would get propagatedvia physical replication to your standby (blocks that change due to vacuuming on the primary). Hope that helps. > > Regards > > -- > Mladen Gogala > Database Consultant > Tel: (347) 321-1217 > https://dbwhisperer.wordpress.com > > >