Craig Ringer wrote:
> Hi all
>
> I think I found a couple of logical decoding issues while writing tests for
> failover slots.
>
> Despite the docs' claim that a logical slot will replay data "exactly
> once", a slot's confirmed_lsn can go backwards and the SQL functions can
> replay the same data more than once.
>
> We don't mark a slot as dirty if only
> its confirmed_lsn is advanced, so it isn't flushed to disk. For failover
> slots this means it also doesn't get replicated via WAL. After a master
> crash, or for failover slots after a promote event, the confirmed_lsn will
> go backwards. Users of the SQL interface must keep track of the safely
> locally flushed slot position themselves and throw the repeated data away.
> Unlike with the walsender protocol it has no way to ask the server to skip
> that data.
>
> Worse, because we don't dirty the slot even a *clean shutdown* causes slot
> confirmed_lsn to go backwards. That's a bug IMO. We should force a flush of
> all slots at the shutdown checkpoint, whether dirty or not, to address it.
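For illustration, the client-side bookkeeping described in the quoted message might look roughly like the sketch below. It assumes the client sees whole transactions, each tagged with its commit LSN in the usual "hi/lo" hexadecimal text form, and that they arrive in commit order. ReplayFilter and parse_lsn are hypothetical names, not part of PostgreSQL, and a real client would also have to store the applied position durably together with the data it applies.

```python
def parse_lsn(text):
    """Convert an LSN in 'hi/lo' hex form (e.g. '0/16B3748') to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


class ReplayFilter:
    """Hypothetical client-side filter that drops replayed transactions."""

    def __init__(self, applied_lsn="0/0"):
        # Highest commit LSN the client has already applied. In a real
        # client this must be persisted atomically with the applied data.
        self.applied = parse_lsn(applied_lsn)

    def accept(self, commit_lsn):
        """Return True if the transaction is new, False if it is a replay."""
        lsn = parse_lsn(commit_lsn)
        if lsn <= self.applied:
            return False  # already applied; the slot position went backwards
        self.applied = lsn
        return True
```

A client would call accept() once per decoded transaction and discard anything for which it returns False.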
Why don't we mark the slot dirty when confirmed_lsn advances? If we fix that, doesn't it fix the other problems too?
Yes, it does.
That'll cause slots to be written out at checkpoints when they otherwise wouldn't have to be, but I'd rather be doing a little more work in this case. Compared to the disk activity from WAL decoding etc., the effect should be undetectable anyway.
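A toy model of the behaviour under discussion, purely as a sketch: ToySlot, its fields and lsn_after_restart are made up for illustration and do not mirror the actual slot code. It only shows why a position that advances without dirtying the slot is lost across a restart, and why dirtying it on advance is enough for the next checkpoint to persist it.

```python
class ToySlot:
    """Hypothetical in-memory model of a slot; not PostgreSQL's structures."""

    def __init__(self):
        self.confirmed_lsn = 0  # in-memory position
        self.on_disk_lsn = 0    # position last written to disk
        self.dirty = False

    def advance(self, lsn, mark_dirty):
        # Move the confirmed position forward; optionally dirty the slot.
        if lsn > self.confirmed_lsn:
            self.confirmed_lsn = lsn
            if mark_dirty:
                self.dirty = True

    def checkpoint(self):
        # Only dirty slots are written out at a checkpoint.
        if self.dirty:
            self.on_disk_lsn = self.confirmed_lsn
            self.dirty = False

    def restart(self):
        # After a restart, the slot is reloaded from its on-disk state.
        self.confirmed_lsn = self.on_disk_lsn
        self.dirty = False


def lsn_after_restart(mark_dirty):
    """Advance, checkpoint, restart, and report the surviving position."""
    slot = ToySlot()
    slot.advance(100, mark_dirty)
    slot.checkpoint()
    slot.restart()
    return slot.confirmed_lsn
```

In this model lsn_after_restart(False) falls back to the old position, while lsn_after_restart(True) keeps the advanced one, at the cost of one extra slot write per checkpoint.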
Andres? Any objection to dirtying a slot when its confirmed_lsn advances, so we write it out at the next checkpoint?