Thread: autovacuum daemon stops doing work after about an hour

autovacuum daemon stops doing work after about an hour

From

Vivek Khera

Date:

02 December 2003, 16:38:23

I took advantage of last weekend to upgrade from 7.2.4 to 7.4.0 on a
new faster box.

Now I'm trying to implement pg_autovacuum.  It seems to work ok, but
after about an hour or so, it does nothing.  The process still is
running, but nothing is sent to the log file.

I'm running the daemon as distributed with PG 7.4 release as follows:

pg_autovacuum -d4 -V 0.15 -A 1 -U postgres -L /var/tmp/autovacuum.log -D

the last few lines of the log are:

[2003-12-02 11:43:58 AM] VACUUM ANALYZE "public"."msg_recipients"
[2003-12-02 12:24:33 PM] select relfilenode,reltuples,relpages from pg_class where relfilenode=18588239
[2003-12-02 12:24:33 PM]   table name:     vkmlm."public"."msg_recipients"
[2003-12-02 12:24:33 PM]      relfilenode: 18588239;   relisshared: 0
[2003-12-02 12:24:33 PM]      reltuples: 9;  relpages: 529132
[2003-12-02 12:24:33 PM]      curr_analyze_count:  1961488; cur_delete_count:   1005040
[2003-12-02 12:24:33 PM]      ins_at_last_analyze: 1961488; del_at_last_vacuum: 1005040
[2003-12-02 12:24:33 PM]      insert_threshold:    509; delete_threshold    1001
[2003-12-02 12:24:33 PM] Performing: VACUUM ANALYZE "public"."user_list"
[2003-12-02 12:24:33 PM] VACUUM ANALYZE "public"."user_list"
[2003-12-02 12:43:19 PM] select relfilenode,reltuples,relpages from pg_class where relfilenode=18588202
[2003-12-02 12:43:19 PM]   table name:     vkmlm."public"."user_list"
[2003-12-02 12:43:19 PM]      relfilenode: 18588202;   relisshared: 0
[2003-12-02 12:43:19 PM]      reltuples: 9;  relpages: 391988
[2003-12-02 12:43:19 PM]      curr_analyze_count:  1159843; cur_delete_count:   1118540
[2003-12-02 12:43:19 PM]      ins_at_last_analyze: 1159843; del_at_last_vacuum: 1118540
[2003-12-02 12:43:19 PM]      insert_threshold:    509; delete_threshold    1001

Then it just sits there.  I started it at 11:35am, and it is now
3:30pm.

I did the same last night at about 10:58pm, and it ran and did work until
11:57pm, then sat there until I killed/restarted pg_autovacuum this
morning at 11:35.  The process is not using any CPU time.

I just killed/restarted it and it found work to do on my busy tables
which I'd expect.

I'm running Postgres 7.4 release on FreeBSD 4.9-RELEASE.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD       +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/

Re: autovacuum daemon stops doing work after about an hour

From

"Matthew T. O'Connor"

Date:

04 December 2003, 01:41:40

On Tue, 2003-12-02 at 15:37, Vivek Khera wrote:
> Now I'm trying to implement pg_autovacuum.  It seems to work ok, but
> after about an hour or so, it does nothing.  The process still is
> running, but nothing is sent to the log file.
>
> I'm running the daemon as distributed with PG 7.4 release as follows:
>
> pg_autovacuum -d4 -V 0.15 -A 1 -U postgres -L /var/tmp/autovacuum.log -D
>
> the last few lines of the log are:
>
> [2003-12-02 11:43:58 AM] VACUUM ANALYZE "public"."msg_recipients"
> [2003-12-02 12:24:33 PM] select relfilenode,reltuples,relpages from pg_class where relfilenode=18588239
> [2003-12-02 12:24:33 PM]   table name:     vkmlm."public"."msg_recipients"
> [2003-12-02 12:24:33 PM]      relfilenode: 18588239;   relisshared: 0
> [2003-12-02 12:24:33 PM]      reltuples: 9;  relpages: 529132
> [2003-12-02 12:24:33 PM]      curr_analyze_count:  1961488; cur_delete_count:   1005040
> [2003-12-02 12:24:33 PM]      ins_at_last_analyze: 1961488; del_at_last_vacuum: 1005040
> [2003-12-02 12:24:33 PM]      insert_threshold:    509; delete_threshold    1001
> [2003-12-02 12:24:33 PM] Performing: VACUUM ANALYZE "public"."user_list"
> [2003-12-02 12:24:33 PM] VACUUM ANALYZE "public"."user_list"
> [2003-12-02 12:43:19 PM] select relfilenode,reltuples,relpages from pg_class where relfilenode=18588202
> [2003-12-02 12:43:19 PM]   table name:     vkmlm."public"."user_list"
> [2003-12-02 12:43:19 PM]      relfilenode: 18588202;   relisshared: 0
> [2003-12-02 12:43:19 PM]      reltuples: 9;  relpages: 391988
> [2003-12-02 12:43:19 PM]      curr_analyze_count:  1159843; cur_delete_count:   1118540
> [2003-12-02 12:43:19 PM]      ins_at_last_analyze: 1159843; del_at_last_vacuum: 1118540
> [2003-12-02 12:43:19 PM]      insert_threshold:    509; delete_threshold    1001
>
> Then it just sits there.  I started it at 11:35am, and it is now
> 3:30pm.

Weird.... Alphabetically speaking, is vkmlm."public"."user_list" be the
last table in the last schema in the last database?  You are running
with -d4, so you would get a message about going to sleep shortly after
dealing with the last table, but you didn't get the sleep message, so I
don't think the problem is that pg_autovacuum is sleeping for an
inordinate amount time.

> I did the same last night at about 10:58pm, and it ran and did work until
> 11:57pm, then sat there until I killed/restarted pg_autovacuum this
> morning at 11:35.  The process is not using any CPU time.
>
> I just killed/restarted it and it found work to do on my busy tables
> which I'd expect.

when you kill it, do you get a core file?  Could you do a backtrace and
see where pg_autovacuum is hung up?

> I'm running Postgres 7.4 release on FreeBSD 4.9-RELEASE.

I don't run FreeBSD, so I haven't tested with FreeBSD.  Recently Craig
Boston reported and submitted a patch for a crash on FreeBSD, but that
doesn't sound like your problem.  Could be some other type of platform
dependent problem.

Re: autovacuum daemon stops doing work after about an hour

From

Vivek Khera

Date:

04 December 2003, 12:34:36

>>>>> "MTO" == Matthew T O'Connor <matthew@zeut.net> writes:

>> Then it just sits there.  I started it at 11:35am, and it is now
>> 3:30pm.

MTO> Weird.... Alphabetically speaking, is vkmlm."public"."user_list" be the
MTO> last table in the last schema in the last database?  You are running

conveniently, yes it is...

MTO> with -d4, so you would get a message about going to sleep shortly after
MTO> dealing with the last table, but you didn't get the sleep message, so I
MTO> don't think the problem is that pg_autovacuum is sleeping for an
MTO> inordinate amount time.

The only sleep logged was

[2003-12-03 04:47:13 PM] 1 All DBs checked in: 84996853 usec, will sleep for 469 secs.

Here's all it did on yesterday afternoon's "hour of work":

[2003-12-03 04:45:48 PM] Performing: ANALYZE "public"."url_track"
[2003-12-03 04:46:27 PM] Performing: ANALYZE "public"."msg_recipients"
[2003-12-03 04:46:55 PM] Performing: ANALYZE "public"."deliveries"
[2003-12-03 04:46:55 PM] Performing: ANALYZE "public"."user_list"
[2003-12-03 04:47:12 PM] Performing: ANALYZE "public"."sessions"
[2003-12-03 04:55:02 PM] Performing: ANALYZE "public"."url_track"
[2003-12-03 04:55:22 PM] Performing: VACUUM ANALYZE "public"."msg_recipients"
[2003-12-03 05:40:11 PM] Performing: VACUUM ANALYZE "public"."user_list"

then 18 minutes later, it reported:

[2003-12-03 05:58:25 PM] select relfilenode,reltuples,relpages from pg_class where relfilenode=18588202
[2003-12-03 05:58:25 PM]   table name:     vkmlm."public"."user_list"
[2003-12-03 05:58:25 PM]      relfilenode: 18588202;   relisshared: 0
[2003-12-03 05:58:25 PM]      reltuples: 9;  relpages: 427920
[2003-12-03 05:58:25 PM]      curr_analyze_count:  2559236; cur_delete_count:   2475824
[2003-12-03 05:58:25 PM]      ins_at_last_analyze: 2559236; del_at_last_vacuum: 2475824
[2003-12-03 05:58:25 PM]      insert_threshold:    509; delete_threshold    1001

and stopped doing anything.

MTO> when you kill it, do you get a core file?  Could you do a backtrace and
MTO> see where pg_autovacuum is hung up?

nope.  unfortunately my PG libs are without debugging, too.  I'll
rebuild pg_autovacuum with debugging and run it under gdb so I can see
where it gets stuck.

I'll report back when I find something.  I just wanted to check first
if anyone else ran into this.

Re: autovacuum daemon stops doing work after about an hour

From

Gaetano Mendola

Date:

04 December 2003, 14:36:52

Vivek Khera wrote:

>>>>>>"MTO" == Matthew T O'Connor <matthew@zeut.net> writes:
>
>
>>>Then it just sits there.  I started it at 11:35am, and it is now
>>>3:30pm.
>
>
> MTO> Weird.... Alphabetically speaking, is vkmlm."public"."user_list" be the
> MTO> last table in the last schema in the last database?  You are running
>
> conveniently, yes it is...
>
> MTO> with -d4, so you would get a message about going to sleep shortly after
> MTO> dealing with the last table, but you didn't get the sleep message, so I
> MTO> don't think the problem is that pg_autovacuum is sleeping for an
> MTO> inordinate amount time.
>
> The only sleep logged was
>
> [2003-12-03 04:47:13 PM] 1 All DBs checked in: 84996853 usec, will sleep for 469 secs.

What I seen is:


# tail -f auto.log
[2003-12-04 07:10:18 PM] reltuples: 72;  relpages: 1
[2003-12-04 07:10:18 PM] curr_analyze_count:  72; cur_delete_count:   0
[2003-12-04 07:10:18 PM] ins_at_last_analyze: 72; del_at_last_vacuum: 0
[2003-12-04 07:10:18 PM] insert_threshold:    572; delete_threshold    536
[2003-12-04 07:10:18 PM] table name:     empdb."public"."contracts"
[2003-12-04 07:10:18 PM] relfilenode: 17784;   relisshared: 0
[2003-12-04 07:10:18 PM] reltuples: 347;  relpages: 5
[2003-12-04 07:10:18 PM] curr_analyze_count:  347; cur_delete_count:   0
[2003-12-04 07:10:18 PM] ins_at_last_analyze: 347; del_at_last_vacuum: 0
[2003-12-04 07:10:18 PM] insert_threshold:    847; delete_threshold    673


[ 5 minutes of delay ]  <----- LOOK THIS


[2003-12-04 07:10:18 PM] 503 All DBs checked in: 179396 usec, will sleep
for 300 secs.
[2003-12-04 07:15:19 PM] 504 All DBs checked in: 98814 usec, will sleep
for 300 secs.

I think is a good Idea put a fflush after:

fprintf(LOGOUTPUT, "[%s] %s\n", timebuffer, logentry);


Regards
Gaetano Mendola

Re: autovacuum daemon stops doing work after about an hour

From

Vivek Khera

Date:

04 December 2003, 15:37:10

>>>>> "MTO" == Matthew T O'Connor <matthew@zeut.net> writes:

>> I'm running Postgres 7.4 release on FreeBSD 4.9-RELEASE.

MTO> I don't run FreeBSD, so I haven't tested with FreeBSD.  Recently Craig
MTO> Boston reported and submitted a patch for a crash on FreeBSD, but that
MTO> doesn't sound like your problem.  Could be some other type of platform
MTO> dependent problem.

Oh lucky me.

I think I found it.  I compiled with -g -O and ran it under gdb, so
the output is line buffered.  The last thing it prints out now is
this:

[2003-12-04 02:11:17 PM] 3 All DBs checked in: -786419782 usec, will sleep for -1272 secs.

since sleep() takes an unsigned int as its parameter, we are actually
sleeping for 4294966024 seconds == 136 years.

I recall reading about the negative time to test the dbs
somewhere... I guess I'll get on debugging that.  The time keeper in
this box is pretty darned accurate otherwise (using ntpd).

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD       +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/

Re: autovacuum daemon stops doing work after about an hour

From

Vivek Khera

Date:

04 December 2003, 15:46:01

>>>>> "MTO" == Matthew T O'Connor <matthew@zeut.net> writes:

MTO> I don't run FreeBSD, so I haven't tested with FreeBSD.  Recently Craig
MTO> Boston reported and submitted a patch for a crash on FreeBSD, but that

some more debugging data:

(gdb) print now
$2 = {tv_sec = 1070565077, tv_usec = 216477}
(gdb) print then
$3 = {tv_sec = 1070561568, tv_usec = 668963}
(gdb) print diff
$4 = -5459981371352
(gdb) print sleep_secs
$5 = -1272

so for some reason, instead of calculating 3508547514 as the diff, it
got a hugely negative number.

I'll bet it has something to do with the compiler...  more debugging
to follow (without -O compilation...)



MTO> ---------------------------(end of broadcast)---------------------------
MTO> TIP 7: don't forget to increase your free space map settings

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD       +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/

Re: autovacuum daemon stops doing work after about an hour

From

"Matthew T. O'Connor"

Date:

04 December 2003, 17:00:03

>>>>>> "MTO" == Matthew T O'Connor <matthew@zeut.net> writes:
>
> MTO> I don't run FreeBSD, so I haven't tested with FreeBSD.  Recently
> Craig MTO> Boston reported and submitted a patch for a crash on FreeBSD,
> but that
>
> some more debugging data:
>
> (gdb) print now
> $2 = {tv_sec = 1070565077, tv_usec = 216477}
> (gdb) print then
> $3 = {tv_sec = 1070561568, tv_usec = 668963}
> (gdb) print diff
> $4 = -5459981371352
> (gdb) print sleep_secs
> $5 = -1272
>
> so for some reason, instead of calculating 3508547514 as the diff, it
> got a hugely negative number.
>
> I'll bet it has something to do with the compiler...  more debugging to
> follow (without -O compilation...)

Could this be the recently reported bug where time goes backwards on
FreeBSD?  Can anyone who knows more about this problem chime in, I know it
was recently discussed on Hackers.

The simple fix is to just make sure it's a positive number.  If not, then
just sleep for some small positive amount.  I can make a patch for this,
probably sometime this weekend.

Thanks for tracking this down.

Re: autovacuum daemon stops doing work after about an hour

From

Vivek Khera

Date:

04 December 2003, 17:20:55

>>>>> "MTO" == Matthew T O'Connor <matthew@zeut.net> writes:

MTO> Could this be the recently reported bug where time goes backwards on
MTO> FreeBSD?  Can anyone who knows more about this problem chime in, I know it
MTO> was recently discussed on Hackers.


Time does not go backwards -- the now and then variables are properly
incrementing in time as you see from the debugging output.

The error appears to be with the computation of the "diff".  It is
either a C programming error, or a compiler error.  I'm not a C "cop"
so I can't tell you which it is.

Witness this program, below, compiled as "cc -g -o t t.c" and the
output here:

% ./t
seconds = 3509
seconds1 = 3509000000
useconds = -452486
stepped diff = 3508547514
seconds2 = -785967296
seconds3 = 3509000000
diff = -786419782
long long diff = 3508547514
%

apperantly, if you compute (now.tv_sec - then.tv_sec) * 1000000 all at
once, it overflows since the RHS is all computed using longs rather
than long longs.  Fix is to cast at least one of the values to long
long on the RHS, as in the computation of seconds3 below.  compare
that to the computation of seconds2 and you'll see that this is the
cause.

I'd be curious to see the output of this program on other platforms
and other compilers.  I'm using gcc 2.95.4 as shipped with FreeBSD
4.8+.

That all being said, you should never sleep less than the base time,
and never for more than a max amount, perhaps 1 hour?


--cut here--
#include <sys/time.h>
#include <stdio.h>

int
main()
{
  struct timeval now, then;
  long long diff = 0;
  long long seconds, seconds1, seconds2, seconds3, useconds;

  now.tv_sec = 1070565077L;
  now.tv_usec = 216477L;

  then.tv_sec = 1070561568L;
  then.tv_usec = 668963L;

  seconds = now.tv_sec - then.tv_sec;
  printf("seconds = %lld\n",seconds);
  seconds1 = seconds * 1000000;
  printf("seconds1 = %lld\n",seconds1);
  useconds = now.tv_usec - then.tv_usec;
  printf("useconds = %lld\n",useconds);

  diff = seconds1 + useconds;
  printf("stepped diff = %lld\n",diff);

  /* this appears to be the culprit... it should be same as seconds1 */
  seconds2 = (now.tv_sec - then.tv_sec) * 1000000;
  printf("seconds2 = %lld\n",seconds2);

  /* seems we need to cast long's to long long's for this computation */
  seconds3 = ((long long)now.tv_sec - (long long)then.tv_sec) * 1000000;
  printf("seconds3 = %lld\n",seconds3);


  diff = (now.tv_sec - then.tv_sec) * 1000000 + (now.tv_usec - then.tv_usec);
  printf ("diff = %lld\n",diff);

  diff = ((long long)now.tv_sec - (long long)then.tv_sec) * 1000000 + (now.tv_usec - then.tv_usec);
  printf ("long long diff = %lld\n",diff);

  exit(0);
}


--cut here--

Re: autovacuum daemon stops doing work after about an hour

From

Vivek Khera

Date:

04 December 2003, 17:22:40

Actually, you can simplify the fix thusly:

  diff = (long long)(now.tv_sec - then.tv_sec) * 1000000 + (now.tv_usec - then.tv_usec);

Re: autovacuum daemon stops doing work after about an

From

Larry Rosenman

Date:

04 December 2003, 17:26:22


--On Thursday, December 04, 2003 16:20:22 -0500 Vivek Khera
<khera@kcilink.com> wrote:

>>>>>> "MTO" == Matthew T O'Connor <matthew@zeut.net> writes:
>
> MTO> Could this be the recently reported bug where time goes backwards on
> MTO> FreeBSD?  Can anyone who knows more about this problem chime in, I
> know it MTO> was recently discussed on Hackers.
>
>
> Time does not go backwards -- the now and then variables are properly
> incrementing in time as you see from the debugging output.
>
> The error appears to be with the computation of the "diff".  It is
> either a C programming error, or a compiler error.  I'm not a C "cop"
> so I can't tell you which it is.
>
> Witness this program, below, compiled as "cc -g -o t t.c" and the
> output here:
>
> % ./t
> seconds = 3509
> seconds1 = 3509000000
> useconds = -452486
> stepped diff = 3508547514
> seconds2 = -785967296
> seconds3 = 3509000000
> diff = -786419782
> long long diff = 3508547514
> %
>
> apperantly, if you compute (now.tv_sec - then.tv_sec) * 1000000 all at
> once, it overflows since the RHS is all computed using longs rather
> than long longs.  Fix is to cast at least one of the values to long
> long on the RHS, as in the computation of seconds3 below.  compare
> that to the computation of seconds2 and you'll see that this is the
> cause.
>
> I'd be curious to see the output of this program on other platforms
> and other compilers.  I'm using gcc 2.95.4 as shipped with FreeBSD
> 4.8+.
this is with the UnixWare compiler:
$ cc -O -o testvk testvk.c
$ ./testvk
seconds = 3509
seconds1 = 3509000000
useconds = -452486
stepped diff = 3508547514
seconds2 = -785967296
seconds3 = 3509000000
diff = -786419782
long long diff = 3508547514
$


I think this is a C bug.



>
> That all being said, you should never sleep less than the base time,
> and never for more than a max amount, perhaps 1 hour?
>
>
> --cut here--
># include <sys/time.h>
># include <stdio.h>
>
> int
> main()
> {
>   struct timeval now, then;
>   long long diff = 0;
>   long long seconds, seconds1, seconds2, seconds3, useconds;
>
>   now.tv_sec = 1070565077L;
>   now.tv_usec = 216477L;
>
>   then.tv_sec = 1070561568L;
>   then.tv_usec = 668963L;
>
>   seconds = now.tv_sec - then.tv_sec;
>   printf("seconds = %lld\n",seconds);
>   seconds1 = seconds * 1000000;
>   printf("seconds1 = %lld\n",seconds1);
>   useconds = now.tv_usec - then.tv_usec;
>   printf("useconds = %lld\n",useconds);
>
>   diff = seconds1 + useconds;
>   printf("stepped diff = %lld\n",diff);
>
>   /* this appears to be the culprit... it should be same as seconds1 */
>   seconds2 = (now.tv_sec - then.tv_sec) * 1000000;
>   printf("seconds2 = %lld\n",seconds2);
>
>   /* seems we need to cast long's to long long's for this computation */
>   seconds3 = ((long long)now.tv_sec - (long long)then.tv_sec) * 1000000;
>   printf("seconds3 = %lld\n",seconds3);
>
>
>   diff = (now.tv_sec - then.tv_sec) * 1000000 + (now.tv_usec -
> then.tv_usec);   printf ("diff = %lld\n",diff);
>
>   diff = ((long long)now.tv_sec - (long long)then.tv_sec) * 1000000 +
> (now.tv_usec - then.tv_usec);   printf ("long long diff = %lld\n",diff);
>
>   exit(0);
> }
>
>
> --cut here--
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>



--
Larry Rosenman                     http://www.lerctr.org/~ler
Phone: +1 972-414-9812                 E-Mail: ler@lerctr.org
US Mail: 1905 Steamboat Springs Drive, Garland, TX 75044-6749

Re: autovacuum daemon stops doing work after about an

From

Vivek Khera

Date:

04 December 2003, 17:37:45

>>>>> "LR" == Larry Rosenman <ler@lerctr.org> writes:

>> I'd be curious to see the output of this program on other platforms
>> and other compilers.  I'm using gcc 2.95.4 as shipped with FreeBSD
>> 4.8+.
LR> this is with the UnixWare compiler:
LR> $ cc -O -o testvk testvk.c
LR> $ ./testvk
LR> seconds = 3509
LR> seconds1 = 3509000000
LR> useconds = -452486
LR> stepped diff = 3508547514
LR> seconds2 = -785967296
LR> seconds3 = 3509000000
LR> diff = -786419782
LR> long long diff = 3508547514
LR> $

LR> I think this is a C bug.

Upon further reflection, I think so to.  The entire RHS is long's so
the arithmetic is done in longs, then assigned to a long long when
done (after things have overflowed).  Forcing any one of the RHS
values to be long long causes the arithmetic to all be done using long
longs, and then you get the numbers you expect.

I think you only notice this in autovacuum when it takes a long time
to complete the work, like my example of about 3500 seconds.

Re: autovacuum daemon stops doing work after about an

From

Bruce Momjian

Date:

07 December 2003, 17:16:44

This has been fixed and will be in 7.4.1.

---------------------------------------------------------------------------

Vivek Khera wrote:
> >>>>> "LR" == Larry Rosenman <ler@lerctr.org> writes:
>
> >> I'd be curious to see the output of this program on other platforms
> >> and other compilers.  I'm using gcc 2.95.4 as shipped with FreeBSD
> >> 4.8+.
> LR> this is with the UnixWare compiler:
> LR> $ cc -O -o testvk testvk.c
> LR> $ ./testvk
> LR> seconds = 3509
> LR> seconds1 = 3509000000
> LR> useconds = -452486
> LR> stepped diff = 3508547514
> LR> seconds2 = -785967296
> LR> seconds3 = 3509000000
> LR> diff = -786419782
> LR> long long diff = 3508547514
> LR> $
>
> LR> I think this is a C bug.
>
> Upon further reflection, I think so to.  The entire RHS is long's so
> the arithmetic is done in longs, then assigned to a long long when
> done (after things have overflowed).  Forcing any one of the RHS
> values to be long long causes the arithmetic to all be done using long
> longs, and then you get the numbers you expect.
>
> I think you only notice this in autovacuum when it takes a long time
> to complete the work, like my example of about 3500 seconds.
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: autovacuum daemon stops doing work after about an hour

From

Gaetano Mendola

Date:

07 December 2003, 20:47:32

Gaetano Mendola wrote:
> Vivek Khera wrote:
>
>>>>>>> "MTO" == Matthew T O'Connor <matthew@zeut.net> writes:
>>
>>
>>
>>>> Then it just sits there.  I started it at 11:35am, and it is now
>>>> 3:30pm.
>>
>>
>>
>> MTO> Weird.... Alphabetically speaking, is vkmlm."public"."user_list"
>> be the
>> MTO> last table in the last schema in the last database?  You are running
>>
>> conveniently, yes it is...
>>
>> MTO> with -d4, so you would get a message about going to sleep shortly
>> after
>> MTO> dealing with the last table, but you didn't get the sleep
>> message, so I
>> MTO> don't think the problem is that pg_autovacuum is sleeping for an
>> MTO> inordinate amount time.
>>
>> The only sleep logged was
>>
>> [2003-12-03 04:47:13 PM] 1 All DBs checked in: 84996853 usec, will
>> sleep for 469 secs.
>
>
> What I seen is:
>
>
> # tail -f auto.log
> [2003-12-04 07:10:18 PM] reltuples: 72;  relpages: 1
> [2003-12-04 07:10:18 PM] curr_analyze_count:  72; cur_delete_count:   0
> [2003-12-04 07:10:18 PM] ins_at_last_analyze: 72; del_at_last_vacuum: 0
> [2003-12-04 07:10:18 PM] insert_threshold:    572; delete_threshold    536
> [2003-12-04 07:10:18 PM] table name:     empdb."public"."contracts"
> [2003-12-04 07:10:18 PM] relfilenode: 17784;   relisshared: 0
> [2003-12-04 07:10:18 PM] reltuples: 347;  relpages: 5
> [2003-12-04 07:10:18 PM] curr_analyze_count:  347; cur_delete_count:   0
> [2003-12-04 07:10:18 PM] ins_at_last_analyze: 347; del_at_last_vacuum: 0
> [2003-12-04 07:10:18 PM] insert_threshold:    847; delete_threshold    673
>
>
> [ 5 minutes of delay ]  <----- LOOK THIS
>
>
> [2003-12-04 07:10:18 PM] 503 All DBs checked in: 179396 usec, will sleep
> for 300 secs.
> [2003-12-04 07:15:19 PM] 504 All DBs checked in: 98814 usec, will sleep
> for 300 secs.
>
> I think is a good Idea put a fflush after:
>
> fprintf(LOGOUTPUT, "[%s] %s\n", timebuffer, logentry);

Was I wrong ? If you are watching in tail the log believeme is really
annoying.

Regards
Gaetano Mendola

Re: autovacuum daemon stops doing work after about an hour

From

Christopher Browne

Date:

07 December 2003, 21:29:30

The world rejoiced as mendola@bigfoot.com (Gaetano Mendola) wrote:
> I think is a good Idea put a fflush after:
>
> fprintf(LOGOUTPUT, "[%s] %s\n", timebuffer, logentry);

I thought I had put fflush()es at all the interesting locations...

Apparently it was an error to not go to the effort of making sure it
worked well on FreeBSD.  (It was on my list, but I never got the Round
Tuits...)  There's an AMD-64 box coming in soon, targeted at FreeBSD,
so that should change...
--
let name="cbbrowne" and tld="acm.org" in name ^ "@" ^ tld;;
http://www3.sympatico.ca/cbbrowne/linux.html
What would a chair look like, if your knees bent the other way?

Re: autovacuum daemon stops doing work after about an hour

From

"Matthew T. O'Connor"

Date:

08 December 2003, 01:28:03

Christopher Browne wrote:

>The world rejoiced as mendola@bigfoot.com (Gaetano Mendola) wrote:
>
>
>>I think is a good Idea put a fflush after:
>>
>>fprintf(LOGOUTPUT, "[%s] %s\n", timebuffer, logentry);
>>
>>
>
>I thought I had put fflush()es at all the interesting locations...
>
>

I just looked through the code, I think there are fflush()es at all but
one interesting locations.  The last log_entry call before sleeping
doesn't have an fflush call after it.  I'll submit a patch that adds it.

>Apparently it was an error to not go to the effort of making sure it
>worked well on FreeBSD.  (It was on my list, but I never got the Round
>Tuits...)  There's an AMD-64 box coming in soon, targeted at FreeBSD,
>so that should change...
>
>
Yeah, FreeBSD testing would have been nice, but I don't have access to
any FreeBSD boxes so.....

Re: autovacuum daemon stops doing work after about an hour

From

Vivek Khera

Date:

08 December 2003, 14:41:46

>>>>> "MTO" == Matthew T O'Connor <matthew@zeut.net> writes:

MTO> Yeah, FreeBSD testing would have been nice, but I don't have access to
MTO> any FreeBSD boxes so.....

FWIW, with the fflush() added after that sleep, and the fix to the
long long computation of sleep time to keep it from overflowing,
pg_autovacuum has been working flawlessly on my FreeBSD 4.9 + PG 7.4.0
production server.  I'm just still playing with tuning pg_autovacuum
to keep it from vacuuming my busy tables *too* often.

Just a question: will my test program show negative sleep 'diff' on
your linux box?  I can't imagine that it would give different results
than on freebsd.

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD       +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/