Hi,
Sorry for the late response. The current patch fixes only scenario 1 listed below; it does not address scenario 2. We also need a fix in unix_latch.c, where the remaining sleep time is evaluated when the latch is woken by other events (or result=0). Here too, the latch may go into a long sleep if the system time shifts into the past.
Scenario-1: current_time (2015) -> changed_to_past (1995) -> stays-here-for-half-day -> corrected to current_time (2015)
Scenario-2: current_time (2015) -> changed_to_future (2020) -> stays-here-for-half-day -> corrected to current_time (2015)
Results:
Scenario-1: Auto-vacuuming is not done from the moment the system time is changed to 1995 until it is corrected back to the current time. In this case, that is half a day.
Scenario-2: Auto-vacuuming keeps running while the system time is shifted to the future. However, after the time is corrected back to the current time (2020 -> 2015), auto-vacuuming goes into a 5-year sleep. Although the current patch fixes waking up from that sleep, it does not allow an auto-vacuum worker to be launched, because the dblist still holds the previously set time, i.e. 2020.
Proposed Fixes:
autovacuum.c: I call rebuild_database_list if a time shift is detected. A time shift is detected if the evaluated sleep time is zero or greater than autovacuum_naptime. Currently the list is rebuilt only if the time shifts to the future; I added a check to rebuild it also when the sleep time is greater than autovacuum_naptime. Secondly, I included the patch from Alvaro and changed the default value of 300 seconds to autovacuum_naptime. This avoids multiple wakeups if autovacuum_naptime is set to more than 300 seconds.
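To illustrate the detection rule described above, here is a minimal sketch. It is not the code from the patch: the function names, the millisecond units and the plain integer timestamps are assumptions made only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from autovacuum.c): time to sleep until the
 * next scheduled worker, clamped to zero if that time is already past.
 */
static int64_t
sleep_until_next_worker_ms(int64_t now_ms, int64_t next_worker_ms)
{
    int64_t diff = next_worker_ms - now_ms;

    return (diff < 0) ? 0 : diff;
}

/*
 * Hypothetical check: suspect a clock shift when the computed sleep is
 * zero (scheduled time already passed, e.g. clock jumped forward) or is
 * longer than the nap time (scheduled time is too far ahead, e.g. clock
 * was moved back after entries were scheduled).
 */
static bool
time_shift_suspected(int64_t sleep_ms, int64_t naptime_ms)
{
    return sleep_ms <= 0 || sleep_ms > naptime_ms;
}
```

In this sketch the caller would rebuild the database list whenever time_shift_suspected() returns true, which covers both scenario 1 and scenario 2.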
unix_latch.c: The current implementation evaluates the remaining sleep time using "cur_timeout = timeout - (cur_time - start_time)". If the time is shifted back to the past, cur_timeout evaluates to a very long time (e.g. start_time=2015 and cur_time=1995 gives cur_timeout = timeout - (-20 years) = timeout + 20 years). To avoid this wrong calculation, I added a check and treat this case as a timeout.
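The guard described above can be sketched roughly as follows. This is an illustration only, not the actual unix_latch.c code: remaining_timeout_ms is a hypothetical name, and times are assumed to be plain millisecond counts.

```c
#include <stdint.h>

/*
 * Hypothetical helper: remaining timeout after an early wakeup.
 * Returns 0 to mean "treat as timed out".
 */
static int64_t
remaining_timeout_ms(int64_t timeout_ms, int64_t start_ms, int64_t cur_ms)
{
    int64_t elapsed_ms = cur_ms - start_ms;

    /* Clock moved backwards: elapsed time is not usable, report timeout. */
    if (elapsed_ms < 0)
        return 0;

    /* Timeout already reached (this also covers a forward clock jump). */
    if (elapsed_ms >= timeout_ms)
        return 0;

    return timeout_ms - elapsed_ms;
}
```

Without the first check, a negative elapsed time would be added to the timeout, which is the "timeout + 20 years" case from the example.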
With the fixes mentioned above, auto-vacuuming will be robust enough to handle any system time changes. We tested the scenarios in our setup and they seem to work fine. I hope these are valid fixes and that they do not affect any other flows.
Please review and share your comments/suggestions.
PS: In our product the database is used in update-heavy mode with limited disk space, so we need to be robust against such time changes to avoid system failures due to a full disk.