Thread: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead

Wal sender process not moving past wait_event_type: IO and wait_event: WALRead

From

Anurag Shrivastava

Date:

24 February 2022, 17:21:41

Hi Postgres team,

We are facing an issue where we are unable to read data from a logical replication slot after a certain period of time. Every time after dropping the slot, it works fine for a few days and then we are again unable to read from the slot (we have been unable to find any queries that might be causing this issue). Each time, the walsender process gets stuck on wait_event_type: IO and wait_event: WALRead. We've tried this with pgoutput, test_decoding and wal2json, and we have faced the same issue with all three.

Is there a way to be able to read the data from the same slot?

result for select * from pg_stat_activity:

datid            | 16404
datname          | *******
pid              | 115407
usesysid         | 2354767601
usename          | *****
application_name | PostgreSQL JDBC Driver
client_addr      | <binary>
client_hostname  | -
client_port      | 36322
backend_start    | 1645710936593
xact_start       | -
query_start      | -
state_change     | 1645710936941
wait_event_type  | IO
wait_event       | WALRead
state            | active
backend_xid      | -
backend_xmin     | -
query            | -
backend_type     | walsender



Regards,

Anurag Shrivastava,

Software Development Engineer, Hevo data

Re: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead

From

Kyotaro Horiguchi

Date:

25 February 2022, 04:51:28

(I don't think this is a bug report..)

At Thu, 24 Feb 2022 19:51:41 +0530, Anurag Shrivastava <anurag.shrivastava@hevodata.com> wrote in 
> Hi Postgres team,
> We are facing an issue where we are unable to read data from a logical
> replication slot after a certain period of time. Every time after dropping
> the slot, it works fine for a few days and then we are again unable to
> read from the slot (we have been unable to find any queries that might be
> causing this issue). Each time, the walsender process gets stuck on
> wait_event_type: IO and wait_event: WALRead. We've tried this with
> pgoutput, test_decoding and wal2json, and we have faced the same issue
> with all three.
> Is there a way to be able to read the data from the same slot?

I guess you can see lines like these in the server log.

could not read from log segment %s, offset %d: %m
could not read from log segment %s, offset %d: read %d of %d
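
As a rough way to check for these messages, the following Python sketch scans server log lines for the message prefix quoted above. The function name `find_wal_read_errors` and the regular expression are illustrative assumptions, not part of PostgreSQL, and the exact message wording may differ between server versions.

```python
import re

# Matches the common prefix of both messages quoted above. The trailing part
# is either an OS error string (%m) or a short-read note ("read N of M").
WAL_READ_ERROR = re.compile(
    r"could not read from log segment (.+?), offset (\d+): (.*)"
)


def find_wal_read_errors(lines):
    """Return (segment, offset, detail) for every matching log line."""
    hits = []
    for line in lines:
        match = WAL_READ_ERROR.search(line)
        if match:
            segment, offset, detail = match.groups()
            hits.append((segment, int(offset), detail.strip()))
    return hits
```

It can be fed any iterable of lines, for example an open server log file; where that file lives depends on the installation's logging settings.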

If so, there's a possibility that you have a bad block in the pg_wal
partition and periodically step on that block.  Dropping the slot
skips the bad block so that replication can start, but WAL recycling
places the bad block in a future WAL segment file again, and the
problem repeats.

(I'm not sure there's a case where a write succeeds but a read fails on
the same block.)
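
One way to test the bad-block hypothesis outside PostgreSQL is to read a suspect file block by block and note the offsets where the read fails. This is a minimal sketch, assuming a Unix-like system and the default 8 kB WAL block size; the helper name `scan_file` is hypothetical. It reads through the OS page cache, so a clean pass does not prove the disk is healthy, and if the underlying read hangs on a block, this script may hang at the same place.

```python
import os

BLOCK_SIZE = 8192  # assumption: default WAL block size


def scan_file(path, block_size=BLOCK_SIZE):
    """Read `path` block by block; return (bytes_read, failed_offsets)."""
    failed_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                # pread reads at an explicit offset without moving the file position
                data = os.pread(fd, block_size, offset)
                bytes_read += len(data)
            except OSError:
                # Record the failing offset and continue with the next block
                failed_offsets.append(offset)
            offset += block_size
    finally:
        os.close(fd)
    return bytes_read, failed_offsets
```

Pointing it at a segment file under pg_wal requires read permission on the data directory, so it would typically be run as the operating-system user that owns the cluster.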

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead

From

Kyotaro Horiguchi

Date:

25 February 2022, 04:59:21

At Fri, 25 Feb 2022 10:51:28 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in 
> (I don't think this is a bug report..)
> 
> At Thu, 24 Feb 2022 19:51:41 +0530, Anurag Shrivastava <anurag.shrivastava@hevodata.com> wrote in 
> > Hi Postgres team,
> > We are facing an issue where we are unable to read data from a logical
> > replication slot after a certain period of time. Every time after dropping
> > the slot, it works fine for a few days and then we are again unable to
> > read from the slot (we have been unable to find any queries that might be
> > causing this issue). Each time, the walsender process gets stuck on
> > wait_event_type: IO and wait_event: WALRead. We've tried this with
> > pgoutput, test_decoding and wal2json, and we have faced the same issue
> > with all three.
> > Is there a way to be able to read the data from the same slot?
> 
> I guess you can see lines like these in the server log.
> 
> could not read from log segment %s, offset %d: %m
> could not read from log segment %s, offset %d: read %d of %d
> 
> If so, there's a possibility that you have a bad block in the pg_wal
> partition and periodically step on that block.  Dropping the slot
> skips the bad block so that replication can start, but WAL recycling
> places the bad block in a future WAL segment file again, and the
> problem repeats.
> 
> (I'm not sure there's a case where a write succeeds but a read fails on
> the same block.)

(correction)

Even if you don't see those lines, there could be a case where pread
gets stuck on the bad block at the first attempt, before emitting the
log lines.

Maybe you can see which block the walsender is stuck on with the
following query.

select slot_name, confirmed_flush_lsn, pg_walfile_name(confirmed_flush_lsn) from pg_replication_slots;
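
To go from an LSN reported by that query to a byte position inside the segment file, the arithmetic can be sketched in Python as below. This is an illustration under stated assumptions: the default 16 MB wal_segment_size, timeline 1, and a hypothetical helper name `wal_location`. It uses plain floor division, so it may differ from pg_walfile_name for an LSN that falls exactly on a segment boundary; on a live server the built-in pg_walfile_name_offset function returns the file name and offset directly.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default wal_segment_size (16 MB)
TIMELINE_ID = 1                      # assumption: the server is on timeline 1


def wal_location(lsn, timeline=TIMELINE_ID, seg_size=WAL_SEGMENT_SIZE):
    """Map an LSN string such as '0/16B3748' to (segment file name, byte offset)."""
    high, low = lsn.split("/")
    position = (int(high, 16) << 32) | int(low, 16)  # 64-bit WAL position
    segment_number = position // seg_size
    segments_per_id = 0x100000000 // seg_size  # segments per 4 GB of WAL
    file_name = "%08X%08X%08X" % (
        timeline,
        segment_number // segments_per_id,
        segment_number % segments_per_id,
    )
    return file_name, position % seg_size


print(wal_location("0/16B3748"))  # ('000000010000000000000001', 7026504)
```

Dividing the returned offset by the WAL block size (8192 by default) gives the block number within the segment file.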

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center