On 3/17/22 20:15, Tomas Vondra wrote:
> Hmm, this seems to have failed on wrasse [1], due to a timeout when
> waiting for tablesync to complete:
>
> 2022-03-17 17:39:28.247 CET [19962:1] LOG: logical replication table
> synchronization worker for subscription "sub2", table "tab1" has started
> 2022-03-17 17:39:28.258 CET [19964:1] LOG: logical replication table
> synchronization worker for subscription "sub2", table "tab4" has started
>
> In the previous runs this completed pretty much immediately (less than a
> second), but this time the workers got stuck, so the script keeps
> looping on the $synced_query. There's nothing in the log, so either it's
> some sort of lock wait or infinite loop.
>
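To make the failure mode concrete: the wait in the test is essentially a poll-with-timeout loop. Below is a rough Python sketch of that pattern, not the actual Perl TAP code. The names `poll_until` and `check` and the default timeout and interval values are made up for illustration; `check` stands in for running $synced_query and testing its result.

```python
import time


def poll_until(check, timeout_s=180.0, interval_s=0.1,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() repeatedly until it returns True or timeout_s elapses.

    Returns True if check() succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        # Hypothetical stand-in for "run $synced_query and compare the result".
        if check():
            return True
        # Give up once the deadline has passed; the caller reports a timeout.
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

If a tablesync worker never reaches the synced state, `check()` never returns true and the loop only ends at the deadline, which would match a timeout with nothing else in the log.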
> However, this fails in 013_partition.pl, which was not modified in this
> commit. And there have been multiple successful runs since it was
> modified (in c91f71b9dc). So it's not clear if this is a pre-existing
> issue and we just happened to hit it now, or maybe it's introduced by
> either c91f71b9dc or 5a07966225. But neither of these commits touched
> tablesync at all, so I'm puzzled how it could happen.
>
And sure enough - now it passed on wrasse, while lapwing failed with
exactly the same symptoms. Clearly some sort of race condition, but I've
been unable to reproduce that :-(

I'll try on my rpi4 once I get back home next week, but maybe we could
try reproducing this on one of the machines that triggered this so far.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company