Thread: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead
Wal sender process not moving past wait_event_type: IO and wait_event: WALRead
From
Anurag Shrivastava
Date:
Hi Postgres team,
We are facing an issue where we are unable to read data from a logical replication slot after a certain period of time. Every time after dropping the slot, it works fine for a few days and then again we are not able to read from the slot(we have been unable to find any queries that might be causing this issue). Each time the walsender process is getting stuck wait event type: IO and wait_event: WALRead. We've tried this with pgoutput, test_decoding and wal2json, with all three we have faced same issue.
Is there a way to be able to read the data from the same slot?
result for select * from pg_stat_activity:
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16404 | ******* | 115407 | 2354767601 | ***** | PostgreSQL JDBC Driver | <binary> | - | 36322 | 1645710936593 | - | - | 1645710936941 | IO | WALRead | active | - | - | - | walsender |
1645710936593 | - | - | 1645710936941 | IO | WALRead | active | - | - | - | walsender |
Regards,
Anurag Shrivastava,
Software Development Engineer, Hevo data
Re: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead
From
Kyotaro Horiguchi
Date:
(I don't think this is a bug report..) At Thu, 24 Feb 2022 19:51:41 +0530, Anurag Shrivastava <anurag.shrivastava@hevodata.com> wrote in > Hi Postgres team, > We are facing an issue where we are unable to read data from a logical > replication slot after a certain period of time. Every time after dropping > the slot, it works fine for a few days and then again we are not able to > read from the slot(we have been unable to find any queries that might be > causing this issue). Each time the walsender process is getting stuck > wait event type: IO and wait_event: WALRead. We've tried this with > pgoutput, test_decoding and wal2json, with all three we have faced > same issue. > Is there a way to be able to read the data from the same slot? I guess you can see lines in server log like this. could not read from log segment %s, offset %d: %m could not read from log segment %s, offset %d: read %d of %d If so, there's a possibility that you have a bad block in the pg_wal partition and periodically step on that block. The slot drop causes to skip the bad block to allow start replication, but WAL recycling places the bad block in a future WAL segment file again and that repeats. (I'm not sure there's a case where write scceeds but read fails on the same block.) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Re: Wal sender process not moving past wait_event_type: IO and wait_event: WALRead
From
Kyotaro Horiguchi
Date:
At Fri, 25 Feb 2022 10:51:28 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in > (I don't think this is a bug report..) > > At Thu, 24 Feb 2022 19:51:41 +0530, Anurag Shrivastava <anurag.shrivastava@hevodata.com> wrote in > > Hi Postgres team, > > We are facing an issue where we are unable to read data from a logical > > replication slot after a certain period of time. Every time after dropping > > the slot, it works fine for a few days and then again we are not able to > > read from the slot(we have been unable to find any queries that might be > > causing this issue). Each time the walsender process is getting stuck > > wait event type: IO and wait_event: WALRead. We've tried this with > > pgoutput, test_decoding and wal2json, with all three we have faced > > same issue. > > Is there a way to be able to read the data from the same slot? > > I guess you can see lines in server log like this. > > could not read from log segment %s, offset %d: %m > could not read from log segment %s, offset %d: read %d of %d > > If so, there's a possibility that you have a bad block in the pg_wal > partition and periodically step on that block. The slot drop causes > to skip the bad block to allow start replication, but WAL recycling > places the bad block in a future WAL segment file again and that > repeats. > > (I'm not sure there's a case where write scceeds but read fails on the > same block.) (correction) Even if you didn't have that lines, there could be a case where pread is stucking on the bad block at the first trial before emitting the log lines. Maybe you can see what block walsender is stucking on by the following query. select slot_name, confirmed_flush_lsn, pg_walfile_name(confirmed_flush_lsn) from pg_replication_slots; regards. -- Kyotaro Horiguchi NTT Open Source Software Center