We had an issue with Postgres logical replication after an 8-hour transaction that a rogue user ran at night.
All logical replication walsenders were 100% CPU-bound and stuck in this:
unlink("pg_replslot/data/xid-1719052643-lsn-4854E-FE000000.spill") = -1 ENOENT (No such file or directory) <0.000008>
unlink("pg_replslot/data/xid-1719052643-lsn-4854E-FF000000.spill") = -1 ENOENT (No such file or directory) <0.000008>
unlink("pg_replslot/data/xid-1719052643-lsn-4854F-0.spill") = -1 ENOENT (No such file or directory) <0.000010>
unlink("pg_replslot/data/xid-1719052643-lsn-4854F-1000000.spill") = -1 ENOENT (No such file or directory) <0.000008>
unlink("pg_replslot/data/xid-1719052643-lsn-4854F-2000000.spill") = -1 ENOENT (No such file or directory) <0.000008>
After stopping the publisher (which wasn't easy either: we had to change the directory permissions so it would crash), increasing the logical replication memory to a few hundred GB, chowning the directory back to postgres, and starting it again, it got through in 20 minutes at most. Before that, with 30 GB committed to logical replication, it was so slow that basic math told us we would need about two weeks to clear the replication lag.
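For anyone hitting the same thing: the setting that controls when logical decoding starts spilling changes to these .spill files is logical_decoding_work_mem, so the change on the publisher looks roughly like this (the value is illustrative, not the exact one we used):

# postgresql.conf on the publisher; illustrative value only
logical_decoding_work_mem = '300GB'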
There were hundreds of thousands of files in this directory, but the number of ENOENT unlink() calls was much higher than the number of files that actually existed.
Maybe ReorderBufferRestoreCleanup() could be optimized somehow? Or run in parallel?
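For context, here is a stand-alone sketch of what I believe the cleanup loop does (this is not the actual reorderbuffer.c code; the constants are simplified and the values in main() are just taken from the trace above): it attempts one unlink() per 16 MB WAL segment between the transaction's first and final LSN, whether or not a spill file was ever written for that segment, which would explain why ENOENT dominates.

/*
 * Stand-alone sketch of the pattern I see in ReorderBufferRestoreCleanup()
 * (src/backend/replication/logical/reorderbuffer.c).  This is NOT the real
 * PostgreSQL code, just an illustration of the behaviour: for a spilled
 * transaction it walks every WAL segment between the transaction's first
 * and final LSN and unlink()s the candidate spill file for each one,
 * ignoring ENOENT.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)  /* default 16 MB */

static void
spill_cleanup_sketch(const char *slot, unsigned xid,
                     uint64_t first_lsn, uint64_t final_lsn)
{
    uint64_t    first = first_lsn / WAL_SEGMENT_SIZE;
    uint64_t    last = final_lsn / WAL_SEGMENT_SIZE;

    /* one unlink() attempt per segment, whether or not the file exists */
    for (uint64_t seg = first; seg <= last; seg++)
    {
        char        path[256];
        uint64_t    seg_start = seg * WAL_SEGMENT_SIZE;

        snprintf(path, sizeof(path), "pg_replslot/%s/xid-%u-lsn-%X-%X.spill",
                 slot, xid,
                 (unsigned) (seg_start >> 32), (unsigned) seg_start);

        if (unlink(path) != 0 && errno != ENOENT)
            perror(path);       /* the real code reports the error instead */
    }
}

int
main(void)
{
    /* slot name and xid taken from the trace above; LSN range made up */
    spill_cleanup_sketch("data", 1719052643,
                         UINT64_C(0x4854EFE000000), UINT64_C(0x4854F02000000));
    return 0;
}

If that reading is right, most of the work is unlink() calls on files that never existed, so skipping segments that were never actually serialized would remove the bulk of it.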