Thread: Auto-vacuum is not running in 9.1.12

Auto-vacuum is not running in 9.1.12

From
Prakash Itnal
Date:
Hello,

Recently we encountered a issue where the disc space is continuously increasing towards 100%. Then a manual vacuum freed the disc space. But again it is increasing. When digged more it is found that auto-vacuuming was not running or it is either stucked/hanged.

Version: 9.1.12
Auto vacuum is enabled: check configuration details attached file.
Auto vacuum daemon running.
From stats it shows that auto-vacuum last run almost more than month back.
There are no error logs from database. 

The attached file has all these details. If any other details needed please let me know. I will try to collect it and share. Please help to analyze why auto-vacuum stopped suddenly?


--
Cheers,
Prakash
Attachment

Re: Auto-vacuum is not running in 9.1.12

From
Alvaro Herrera
Date:
Prakash Itnal wrote:
> Hello,
> 
> Recently we encountered a issue where the disc space is continuously
> increasing towards 100%. Then a manual vacuum freed the disc space. But
> again it is increasing. When digged more it is found that auto-vacuuming
> was not running or it is either stucked/hanged.

Hm, we have seen this on Windows, I think.

Is the "stats collector" process running?  Is it stuck?

If you attach to process 6504 (autovac launcher), what's the backtrace?

> 4) Last run auto-vacuum:
> SELECT now(), schemaname, relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count FROM
pg_stat_user_tables;
> 
>               now              | schemaname |    relname    | last_vacuum |        last_autovacuum        |
vacuum_count| autovacuum_count 
 
>
-------------------------------+------------+---------------+-------------+-------------------------------+--------------+------------------
>  2015-06-10 01:03:03.574212+02 | public     | abcd          |             | 2015-04-18 00:52:35.008874+02 |
0 |                2
 
>  2015-06-10 01:03:03.574212+02 | public     | xyz           |             | 2015-05-02 06:01:35.220651+02 |
0 |               20
 
> 
> NOTE: I changed the relname for above two tables due to confidentiality.

Are there dead tuples in tables?  Maybe vacuums are getting executed and
these values are not updated, for instance?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Auto-vacuum is not running in 9.1.12

From
Prakash Itnal
Date:
Hi,

@Avaro Herrera, Thanks for quick reply. I was on leave and hence not able to reply soon.

This issue was observed on customer site. However after long discussion and digging into what happened around the date 2nd May 2015, we got to know that NTP server suddenly went back in time to 1995. It remained there for some time until it is noticed and corrected. So after correcting NTP server time the whole cluster is synced to current date. After this change in time the auto-vacuum stopped. Since auto-vacuuming is triggered periodically, I doubt if this time change has affected any timer! 

So I suspect the time change is the root cause! It would be great if someone can clarify if this is the root cause for auto-vacuum stopped.


On Wed, Jun 10, 2015 at 8:19 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Prakash Itnal wrote:
> Hello,
>
> Recently we encountered a issue where the disc space is continuously
> increasing towards 100%. Then a manual vacuum freed the disc space. But
> again it is increasing. When digged more it is found that auto-vacuuming
> was not running or it is either stucked/hanged.

Hm, we have seen this on Windows, I think.

Is the "stats collector" process running?  Is it stuck?

If you attach to process 6504 (autovac launcher), what's the backtrace?

> 4) Last run auto-vacuum:
> SELECT now(), schemaname, relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count FROM pg_stat_user_tables;
>
>               now              | schemaname |    relname    | last_vacuum |        last_autovacuum        | vacuum_count | autovacuum_count
> -------------------------------+------------+---------------+-------------+-------------------------------+--------------+------------------
>  2015-06-10 01:03:03.574212+02 | public     | abcd          |             | 2015-04-18 00:52:35.008874+02 |            0 |                2
>  2015-06-10 01:03:03.574212+02 | public     | xyz           |             | 2015-05-02 06:01:35.220651+02 |            0 |               20
>
> NOTE: I changed the relname for above two tables due to confidentiality.

Are there dead tuples in tables?  Maybe vacuums are getting executed and
these values are not updated, for instance?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Cheers,
Prakash

Re: Auto-vacuum is not running in 9.1.12

From
Prakash Itnal
Date:
Hi,

Currently the issue is easily reproducible. Steps to reproduce:
* Set some aggressive values for auto-vacuuming.
* Run a heavy database update/delete/insert queries. This leads to invoking auto-vacuuming in quick successions. 
* Change the system time to older for eg. 1995-01-01

Suddenly auto-vacuuming stops working. Even after changing system time back to current time, the auto-vacuuming did not resume.

So the question is, "does postrges supports system time changes?". 


On Tue, Jun 16, 2015 at 10:12 AM, Prakash Itnal <prakash074@gmail.com> wrote:
Hi,

@Avaro Herrera, Thanks for quick reply. I was on leave and hence not able to reply soon.

This issue was observed on customer site. However after long discussion and digging into what happened around the date 2nd May 2015, we got to know that NTP server suddenly went back in time to 1995. It remained there for some time until it is noticed and corrected. So after correcting NTP server time the whole cluster is synced to current date. After this change in time the auto-vacuum stopped. Since auto-vacuuming is triggered periodically, I doubt if this time change has affected any timer! 

So I suspect the time change is the root cause! It would be great if someone can clarify if this is the root cause for auto-vacuum stopped.


On Wed, Jun 10, 2015 at 8:19 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Prakash Itnal wrote:
> Hello,
>
> Recently we encountered a issue where the disc space is continuously
> increasing towards 100%. Then a manual vacuum freed the disc space. But
> again it is increasing. When digged more it is found that auto-vacuuming
> was not running or it is either stucked/hanged.

Hm, we have seen this on Windows, I think.

Is the "stats collector" process running?  Is it stuck?

If you attach to process 6504 (autovac launcher), what's the backtrace?

> 4) Last run auto-vacuum:
> SELECT now(), schemaname, relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count FROM pg_stat_user_tables;
>
>               now              | schemaname |    relname    | last_vacuum |        last_autovacuum        | vacuum_count | autovacuum_count
> -------------------------------+------------+---------------+-------------+-------------------------------+--------------+------------------
>  2015-06-10 01:03:03.574212+02 | public     | abcd          |             | 2015-04-18 00:52:35.008874+02 |            0 |                2
>  2015-06-10 01:03:03.574212+02 | public     | xyz           |             | 2015-05-02 06:01:35.220651+02 |            0 |               20
>
> NOTE: I changed the relname for above two tables due to confidentiality.

Are there dead tuples in tables?  Maybe vacuums are getting executed and
these values are not updated, for instance?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Cheers,
Prakash



--
Cheers,
Prakash

Re: Auto-vacuum is not running in 9.1.12

From
Haribabu Kommi
Date:
On Wed, Jun 17, 2015 at 2:17 PM, Prakash Itnal <prakash074@gmail.com> wrote:
> Hi,
>
> Currently the issue is easily reproducible. Steps to reproduce:
> * Set some aggressive values for auto-vacuuming.
> * Run a heavy database update/delete/insert queries. This leads to invoking
> auto-vacuuming in quick successions.
> * Change the system time to older for eg. 1995-01-01
>
> Suddenly auto-vacuuming stops working. Even after changing system time back
> to current time, the auto-vacuuming did not resume.
>
> So the question is, "does postrges supports system time changes?".

PostgreSQL uses the system time to check whether it reached the
specific nap time to trigger the autovacuum worker.

Is it possible for you to check what autovauum launcher is doing even
after the system time is reset to current time?

I can think of a case where the "launcher_determine_sleep" function
returns a big sleep value because of system time change.
Because of that it is possible that the launcher is not generating
workers to do the vacuum. May be I am wrong.


Regards,
Hari Babu
Fujitsu Australia



Re: Auto-vacuum is not running in 9.1.12

From
Alvaro Herrera
Date:
Prakash Itnal wrote:

> Currently the issue is easily reproducible. Steps to reproduce:
> * Set some aggressive values for auto-vacuuming.
> * Run a heavy database update/delete/insert queries. This leads to invoking
> auto-vacuuming in quick successions.
> * Change the system time to older for eg. 1995-01-01
> 
> Suddenly auto-vacuuming stops working. Even after changing system time back
> to current time, the auto-vacuuming did not resume.
> 
> So the question is, "does postrges supports system time changes?".

So Tom Lane just pinged me about this.  As far as I can tell, the
problem is that the clock goes backwards 20 years, then autovacuum
figures that it needs to sleep for 20 years until the next vacuum is
scheduled.  Then regardless of the clock moving forwards again, the
sleep is not affected by this.  (I didn't actually test this.)

A simple workaround would be to stop autovacuum then restart it, that
is, set autovacuum=off in postgresql.conf, send SIGHUP to postmaster
which should stop the autovacuum launcher, then set autovacuum=on and
SIGHUP again, which would start a new launcher.

As a proposed code fix, we could just clamp the sleep time to, say, 5
minutes.  In that case the launcher would wake up even if the next
database check is still a ways off, but that should be pretty harmless
-- we just compute a new sleep period and continue sleeping.  (That is,
notwitshstanding the fact that the sleep time should never really go
above a minute in most configurations.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Auto-vacuum is not running in 9.1.12

From
Tom Lane
Date:
Haribabu Kommi <kommi.haribabu@gmail.com> writes:
> I can think of a case where the "launcher_determine_sleep" function
> returns a big sleep value because of system time change.
> Because of that it is possible that the launcher is not generating
> workers to do the vacuum. May be I am wrong.

I talked with Alvaro about this and we agreed that's most likely what
happened.  The launcher tracks future times-to-wake-up as absolute times,
so shortly after the system clock went backwards, it could have computed
that the next time to wake up was 20 years in the future, and issued a
sleep() call for 20 years.  Fixing the system clock after that would not
have caused it to wake up again.

It looks like a SIGHUP (pg_ctl reload) ought to be enough to wake it up,
or of course you could restart the server.

In HEAD this doesn't seem like it could cause an indefinite sleep because
if nothing else, sinval queue overrun would eventually wake the launcher
even without any manual action from the DBA.  But the loop logic is
different in 9.1.

launcher_determine_sleep() does have a minimum sleep time, and it seems
like we could fairly cheaply guard against this kind of scenario by also
enforcing a maximum sleep time (of say 5 or 10 minutes).  Not quite
convinced whether it's worth the trouble though.
        regards, tom lane



Re: Auto-vacuum is not running in 9.1.12

From
Alvaro Herrera
Date:
Tom Lane wrote:

> In HEAD this doesn't seem like it could cause an indefinite sleep because
> if nothing else, sinval queue overrun would eventually wake the launcher
> even without any manual action from the DBA.  But the loop logic is
> different in 9.1.

Just waiting for the sinval queue to overrun doesn't sound like a great
mechanism to me.

> launcher_determine_sleep() does have a minimum sleep time, and it seems
> like we could fairly cheaply guard against this kind of scenario by also
> enforcing a maximum sleep time (of say 5 or 10 minutes).  Not quite
> convinced whether it's worth the trouble though.

Yeah, the case is pretty weird and I'm not really sure that the server
ought to be expected to behave.  But if this is actually the only part
of the server that misbehaves because of sudden gigantic time jumps, I
think it's fair to patch it.  Here's a proposed patch.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Auto-vacuum is not running in 9.1.12

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Yeah, the case is pretty weird and I'm not really sure that the server
> ought to be expected to behave.  But if this is actually the only part
> of the server that misbehaves because of sudden gigantic time jumps, I
> think it's fair to patch it.  Here's a proposed patch.

I think the parenthetical bit in the comment should read "plus any
fractional second, for simplicity".  Also, maybe replace "300" by
a #define constant MAX_AUTOVAC_SLEEPTIME for readability?
        regards, tom lane



Re: Auto-vacuum is not running in 9.1.12

From
Cédric Villemain
Date:

Le 17/06/2015 23:10, Alvaro Herrera a écrit :
> Tom Lane wrote:
>> launcher_determine_sleep() does have a minimum sleep time, and it seems
>> like we could fairly cheaply guard against this kind of scenario by also
>> enforcing a maximum sleep time (of say 5 or 10 minutes).  Not quite
>> convinced whether it's worth the trouble though.
> 
> Yeah, the case is pretty weird and I'm not really sure that the server
> ought to be expected to behave.  But if this is actually the only part
> of the server that misbehaves because of sudden gigantic time jumps, I
> think it's fair to patch it.  Here's a proposed patch.

Does that still respect an autovacuum_naptime (GUC) greater than 5 minutes ?
-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation



Re: Auto-vacuum is not running in 9.1.12

From
Alvaro Herrera
Date:
Cédric Villemain wrote:
> 
> 
> Le 17/06/2015 23:10, Alvaro Herrera a écrit :
> > Tom Lane wrote:
> >> launcher_determine_sleep() does have a minimum sleep time, and it seems
> >> like we could fairly cheaply guard against this kind of scenario by also
> >> enforcing a maximum sleep time (of say 5 or 10 minutes).  Not quite
> >> convinced whether it's worth the trouble though.
> > 
> > Yeah, the case is pretty weird and I'm not really sure that the server
> > ought to be expected to behave.  But if this is actually the only part
> > of the server that misbehaves because of sudden gigantic time jumps, I
> > think it's fair to patch it.  Here's a proposed patch.
> 
> Does that still respect an autovacuum_naptime (GUC) greater than 5 minutes ?

Yes, sure.  The launcher will wake up, see time-to-vacuum further away,
then continue sleeping.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Auto-vacuum is not running in 9.1.12

From
Robert Haas
Date:
On Wed, Jun 17, 2015 at 5:10 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Yeah, the case is pretty weird and I'm not really sure that the server
> ought to be expected to behave.  But if this is actually the only part
> of the server that misbehaves because of sudden gigantic time jumps, I
> think it's fair to patch it.  Here's a proposed patch.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Auto-vacuum is not running in 9.1.12

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Yeah, the case is pretty weird and I'm not really sure that the server
> > ought to be expected to behave.  But if this is actually the only part
> > of the server that misbehaves because of sudden gigantic time jumps, I
> > think it's fair to patch it.  Here's a proposed patch.
> 
> I think the parenthetical bit in the comment should read "plus any
> fractional second, for simplicity".  Also, maybe replace "300" by
> a #define constant MAX_AUTOVAC_SLEEPTIME for readability?

Okay, done.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Auto-vacuum is not running in 9.1.12

From
Andres Freund
Date:
On 2015-06-17 18:10:42 -0300, Alvaro Herrera wrote:
> Yeah, the case is pretty weird and I'm not really sure that the server
> ought to be expected to behave.  But if this is actually the only part
> of the server that misbehaves because of sudden gigantic time jumps, I
> think it's fair to patch it.  Here's a proposed patch.

We probably should go through the server and look at the various sleeps
and make sure thy all have a upper limit. I doubt this is the only
location without one.

Greetings,

Andres Freund



Re: Auto-vacuum is not running in 9.1.12

From
Prakash Itnal
Date:
Hi,

Sorry for the late response. The current patch only fixes the scenario-1 listed below. It will not address the scenario-2. Also we need a fix in unix_latch.c where the remaining sleep time is evaluated, if latch is woken by other events (or result=0). Here to it is possible the latch might go in long sleep if time shifts to past time.

Scenario-1: current_time (2015) -> changed_to_past (1995) -> stays-here-for-half-day -> corrected to current_time (2015)
Scenario-2: current_time (2015) -> changed_to_future (2020) -> stays-here-for-half-day -> corrected to current_time (2015)

Results: 
Scenario-1: Auto-vacuuming not done from the time system time changed to 1995 until it is corrected to current time. In current context half-day.
Scenario-2: Auto-vacuuming keeps running if system time shifts to future. However after correcting time back to current time (from 2020->2015), the auto-vacuuming goes into 5 year sleep. Though current patch fixes waking up from sleep it will not allow to launch auto-vacuum worker as the dblist still holds previously set time i.e. 2020.

Proposed Fixes:
autovacuum.c: I will rebuild_database_list if time shift is detected. The time-shift is detected if sleep time evaluated is zero or greater than autovacuum_naptime. Currently the list is rebuilt only if time shifts to future. I added a check to rebuild it if sleep time is greater than autovacuum_naptime. Secondly I included the patch from Alvaro and changed the default 300 seconds value to autovacuum_naptime. This will avoid multiple wakeups if autovacuum_naptime is set to greater than 300 seconds.

unix_latch.c: Current implementation evaluates the remaining sleep time using "cur_timeout = timeout - (start_time - cur_time)". If the time is shifted back to past then cur_timeout will be evaluated to long time (for eg. start_time=2015 and cur_time=1995 then cur_timeout=timeout - (-20 years) = timeout + 20years). To avoid this wrong calculation I added a check and treat it as timeout.

With above mentioned fixes the auto-vacuuming will be robust enough to handle any system time changes. We tested the scenarios in our setup and they seem to work fine. I hope these are valid fixes and they do not affect any other flows.

Please review and share your review comments/suggestions.

PS: In our product database is used in update-heavy mode with limited disc space. So we need to be robust to handle such time changes to avoid any system failures due to disc full.

On Fri, Jun 19, 2015 at 10:28 PM, Andres Freund <andres@anarazel.de> wrote:
On 2015-06-17 18:10:42 -0300, Alvaro Herrera wrote:
> Yeah, the case is pretty weird and I'm not really sure that the server
> ought to be expected to behave.  But if this is actually the only part
> of the server that misbehaves because of sudden gigantic time jumps, I
> think it's fair to patch it.  Here's a proposed patch.

We probably should go through the server and look at the various sleeps
and make sure thy all have a upper limit. I doubt this is the only
location without one.

Greetings,

Andres Freund



--
Cheers,
Prakash
Attachment

Re: Auto-vacuum is not running in 9.1.12

From
Tom Lane
Date:
Prakash Itnal <prakash074@gmail.com> writes:
> Sorry for the late response. The current patch only fixes the scenario-1
> listed below. It will not address the scenario-2. Also we need a fix in
> unix_latch.c where the remaining sleep time is evaluated, if latch is woken
> by other events (or result=0). Here to it is possible the latch might go in
> long sleep if time shifts to past time.

Forcing WL_TIMEOUT if the clock goes backwards seems like quite a bad
idea to me.  That seems like a great way to make a bad situation worse,
ie it induces failures where there were none before.
        regards, tom lane



Re: Auto-vacuum is not running in 9.1.12

From
Prakash Itnal
Date:
Hi,

To my understanding it will probably not open doors for worst situations! Please correct if my below understanding is correct.

The latch will wake up under below three situations:
a) Socket error (=> result is set to negative number)
b) timeout (=> result is set to TIMEOUT)
c) some event arrived on socket (=> result is set to non-zero value, if caller registers for arrived events otherwise no value is set)

Given the above conditions, the result can be zero only if there is an unregistered event which breaks the latch (*). In such case, current implementation evaluates the remaining sleep time. This calculation is making the situation worst, if time goes back. 

The time difference between cur_time (current time) and start_time (time when latch started) should always be a positive integer because cur_time is always greater than start_time under all normal conditions. 

    delta_timeout = cur_time - start_time;

The difference can be negative only if time shifts to past. So it is possible to detect if time shifted to past. When it is possible to detect can it be possible to correct? I think we can correct and prevent long sleeps due to time shifts.

Currently I treat it as TIMEOUT, though conceptually it is not. The ideal solution would be to leave this decision to the caller of WaitLatch(). With my little knowledge of postgres code, I think TIMEOUT would be fine! 


(*) The above description is true only for timed wait. If latch is started with blocking wait (no timeout) then above logic is not applicable.

On Sat, Jun 20, 2015 at 10:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Prakash Itnal <prakash074@gmail.com> writes:
> Sorry for the late response. The current patch only fixes the scenario-1
> listed below. It will not address the scenario-2. Also we need a fix in
> unix_latch.c where the remaining sleep time is evaluated, if latch is woken
> by other events (or result=0). Here to it is possible the latch might go in
> long sleep if time shifts to past time.

Forcing WL_TIMEOUT if the clock goes backwards seems like quite a bad
idea to me.  That seems like a great way to make a bad situation worse,
ie it induces failures where there were none before.

                        regards, tom lane



--
Cheers,
Prakash

Re: Auto-vacuum is not running in 9.1.12

From
Prakash Itnal
Date:
Hi Tom/Alvaro,

Kindly let us know if the correction provided in previous mail is fine or not! Current code any way handle scenario-1 whereas it is still vulnerable to scenario-2.

From previous mail:
Scenario-1: current_time (2015) -> changed_to_past (1995) -> stays-here-for-half-day -> corrected to current_time (2015)
Scenario-2: current_time (2015) -> changed_to_future (2020) -> stays-here-for-half-day -> corrected to current_time (2015)

We are waiting for your response.

On Sun, Jun 21, 2015 at 2:56 PM, Prakash Itnal <prakash074@gmail.com> wrote:
Hi,

To my understanding it will probably not open doors for worst situations! Please correct if my below understanding is correct.

The latch will wake up under below three situations:
a) Socket error (=> result is set to negative number)
b) timeout (=> result is set to TIMEOUT)
c) some event arrived on socket (=> result is set to non-zero value, if caller registers for arrived events otherwise no value is set)

Given the above conditions, the result can be zero only if there is an unregistered event which breaks the latch (*). In such case, current implementation evaluates the remaining sleep time. This calculation is making the situation worst, if time goes back. 

The time difference between cur_time (current time) and start_time (time when latch started) should always be a positive integer because cur_time is always greater than start_time under all normal conditions. 

    delta_timeout = cur_time - start_time;

The difference can be negative only if time shifts to past. So it is possible to detect if time shifted to past. When it is possible to detect can it be possible to correct? I think we can correct and prevent long sleeps due to time shifts.

Currently I treat it as TIMEOUT, though conceptually it is not. The ideal solution would be to leave this decision to the caller of WaitLatch(). With my little knowledge of postgres code, I think TIMEOUT would be fine! 


(*) The above description is true only for timed wait. If latch is started with blocking wait (no timeout) then above logic is not applicable.

On Sat, Jun 20, 2015 at 10:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Prakash Itnal <prakash074@gmail.com> writes:
> Sorry for the late response. The current patch only fixes the scenario-1
> listed below. It will not address the scenario-2. Also we need a fix in
> unix_latch.c where the remaining sleep time is evaluated, if latch is woken
> by other events (or result=0). Here to it is possible the latch might go in
> long sleep if time shifts to past time.

Forcing WL_TIMEOUT if the clock goes backwards seems like quite a bad
idea to me.  That seems like a great way to make a bad situation worse,
ie it induces failures where there were none before.

                        regards, tom lane



--
Cheers,
Prakash



--
Cheers,
Prakash