I think we are getting the error (ERROR: could not find logical decoding starting point) because we wouldn't have waited for WAL to become available before reading it. It could happen due to the following code:

    WalSndWaitForWal()
    {
        ...
        if (streamingDoneReceiving && streamingDoneSending &&
            !pq_is_send_pending())
            break;
        ...
    }
Now, it seems that in the 0003 patch, instead of resetting the flags streamingDoneSending and streamingDoneReceiving before starting replication, we should reset them before creating logical slots, because we need to read the WAL during that time as well to find the consistent point.
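To make the ordering issue concrete, here is a toy model of it. This is only a sketch, not the walsender code: the two flag names and the shape of the condition are taken from the snippet above, while wait_would_break_early(), finish_previous_stream(), reset_streaming_flags() and create_slot_can_wait() are hypothetical helpers invented for illustration, and pq_is_send_pending() is replaced by a plain boolean argument.

```c
#include <stdbool.h>

/* Stand-ins for the flags named in the quoted snippet (toy model only). */
bool streamingDoneSending = false;
bool streamingDoneReceiving = false;

/*
 * Hypothetical helper: mirrors the quoted condition. Returns true if the
 * wait loop would break out without waiting for more WAL.
 */
bool
wait_would_break_early(bool send_pending)
{
    return streamingDoneReceiving && streamingDoneSending && !send_pending;
}

/* Hypothetical helper: state left behind once an earlier stream has ended. */
void
finish_previous_stream(void)
{
    streamingDoneSending = true;
    streamingDoneReceiving = true;
}

/* Hypothetical helper: the reset whose placement is being discussed. */
void
reset_streaming_flags(void)
{
    streamingDoneSending = false;
    streamingDoneReceiving = false;
}

/*
 * Hypothetical helper: models slot creation, which has to read WAL to find
 * a consistent point. Returns true if it is able to wait for WAL.
 */
bool
create_slot_can_wait(bool reset_first)
{
    if (reset_first)
        reset_streaming_flags();
    return !wait_would_break_early(false);
}
```

In this model, if a previous stream has finished and the flags are only reset later, at start of replication, slot creation sees stale flags and the wait breaks out immediately. Resetting before slot creation avoids that.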
Thanks for the suggestion, Amit. I've been looking into this recently and couldn't figure out the cause until now.
I quickly made the fix in 0003. Seems like it resolved the "could not find logical decoding starting point" errors.
On Tue, 1 Aug 2023 at 09:32, vignesh C <vignesh21@gmail.com> wrote:
I agree that the "no copy in progress" issue has nothing to do with the 0001 patch. This issue is present with the 0002 patch. When the tablesync worker has to apply transactions after the table is synced, it sends feedback of writepos, applypos and flushpos, which results in a "No copy in progress" error because the stream has already ended. Fixed it by exiting the streaming loop if the tablesync worker is done with the synchronization. The attached 0004 patch has the changes for the same. The rest of the v22 patches are the same patches that were posted by Melih in the earlier mail.
Thanks for the fix. I placed it into 0002 with a slight change as follows:
IMHO relsync_completed simply means the same as streaming_done; that's why I wanted to check that flag instead of adding a goto statement. Does it make sense to you as well?
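For comparison, the two ways of leaving the loop can be sketched as below. This is a simplified model rather than the patch itself: the name streaming_done comes from the discussion above, but run_loop_with_flag(), run_loop_with_goto(), sync_done_at and the feedback counter are hypothetical and only stand in for the real apply loop and its feedback messages.

```c
#include <stdbool.h>

/*
 * Hypothetical model: exit by checking a flag in the loop condition.
 * Returns how many feedback messages were sent before the loop ended.
 */
int
run_loop_with_flag(int sync_done_at, int max_iters)
{
    bool streaming_done = false;
    int  feedback_sent = 0;

    for (int i = 0; i < max_iters && !streaming_done; i++)
    {
        if (i == sync_done_at)
        {
            /* Sync finished: mark the stream done, send nothing further. */
            streaming_done = true;
            continue;
        }
        feedback_sent++;
    }
    return feedback_sent;
}

/* Hypothetical model: the same exit expressed with an explicit goto. */
int
run_loop_with_goto(int sync_done_at, int max_iters)
{
    int feedback_sent = 0;

    for (int i = 0; i < max_iters; i++)
    {
        if (i == sync_done_at)
            goto done;
        feedback_sent++;
    }
done:
    return feedback_sent;
}
```

Both variants stop sending feedback at the same point in this model. The flag version just reuses the loop's existing exit condition instead of adding a second exit path.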