Re: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead - Mailing list pgsql-bugs

From Kyotaro Horiguchi
Subject Re: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead
Msg-id 20220225.105921.2129677102787844421.horikyota.ntt@gmail.com
In response to Re: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-bugs
At Fri, 25 Feb 2022 10:51:28 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> (I don't think this is a bug report..)
> 
> At Thu, 24 Feb 2022 19:51:41 +0530, Anurag Shrivastava <anurag.shrivastava@hevodata.com> wrote in 
> > Hi Postgres team,
> > We are facing an issue where we are unable to read data from a logical
> > replication slot after a certain period of time. Every time after dropping
> > the slot, it works fine for a few days and then again we are not able to
> > read from the slot(we have been unable to find any queries that might be
> > causing this issue). Each time the walsender process gets stuck with
> > wait_event_type: IO and wait_event: WALRead. We've tried this with
> > pgoutput, test_decoding and wal2json, and with all three we have faced
> > the same issue.
> > Is there a way to be able to read the data from the same slot?
> 
> I guess you can see lines in server log like this.
> 
> could not read from log segment %s, offset %d: %m
> could not read from log segment %s, offset %d: read %d of %d
> 
> If so, you may have a bad block in the pg_wal partition that the
> server periodically steps on.  Dropping the slot skips past the bad
> block so that replication can start again, but WAL recycling later
> places the bad block in a future WAL segment file, and the problem
> repeats.
> 
> (I'm not sure there's a case where a write succeeds but a read fails
> on the same block.)

(correction)

Even if you don't see those lines, there could be a case where pread()
gets stuck on the bad block at the first attempt, before the log lines
are emitted.

You may be able to see which WAL file walsender is stuck on with the
following query.

select slot_name, confirmed_flush_lsn, pg_walfile_name(confirmed_flush_lsn) from pg_replication_slots;
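
To relate an LSN returned by that query to the "log segment %s, offset %d"
form used in the messages above, something like the following sketch may
help.  It is only an illustration, not server code: it assumes the default
16MB wal_segment_size and timeline 1, and the helper names parse_lsn and
locate_lsn are made up for this example.  Exactly on a segment boundary the
file name may differ from what pg_walfile_name() reports on some versions.

```python
# Sketch: map an LSN such as '2A/3C000148' to its WAL segment file name
# and the byte offset inside that file.
# Assumptions (not from the thread): 16MB segments, timeline 1.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default wal_segment_size


def parse_lsn(lsn: str) -> int:
    """Convert the textual 'hi/lo' hexadecimal LSN form to a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def locate_lsn(lsn: str, timeline: int = 1,
               segment_size: int = WAL_SEGMENT_SIZE):
    """Return (segment file name, byte offset within that segment)."""
    pos = parse_lsn(lsn)
    segno = pos // segment_size
    # Number of segments that fit in one 4GB "xlogid" range.
    segments_per_xlogid = 0x100000000 // segment_size
    name = "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )
    return name, pos % segment_size


if __name__ == "__main__":
    print(locate_lsn("2A/3C000148"))
```

If the offset computed for confirmed_flush_lsn keeps landing in the same
segment file whenever walsender hangs, that would be consistent with the
bad-block guess; it does not prove it.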

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


