Thread: reducing statistics write overhead

reducing statistics write overhead

From

Martin Pihlak

Date:

05 September 2008, 16:00:46

Howdy,

The statistics collector currently dumps the stats file at every 500ms.  This
is a major problem if the file becomes large -- occasionally we've been forced
to disable stats collection to cope with it.  Another issue is that while the
file is frequently written, it is seldom read. Typically once a minute -
autovacuum plus possible user initiated stats queries.

So, as a simple optimization I am proposing that the file should be only
written when some backend requests statistics. This would significantly reduce
the undesired write traffic at the cost of slightly slower stats access.

Attached is a WIP patch, which basically implements this:

Disable periodic writing of the stats file. Introduce new stats message type -
PGSTAT_MTYPE_INQUIRY. Backends send this to notify collector that stats is needed.
Pid of the requestor is provided in the message. Backend then installs an alarm
handler and starts a timer.  Collector processes the messages and compiles a list
of pids to be notified. If there are any, the stats file is written and SIGALRM
is sent to the requestors. Backend then proceeds to read the stats file a usual.

Thoughts, comments?

regards,
Martin
Index: src/backend/postmaster/pgstat.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.181
diff -c -r1.181 pgstat.c
*** src/backend/postmaster/pgstat.c    25 Aug 2008 18:55:43 -0000    1.181
--- src/backend/postmaster/pgstat.c    5 Sep 2008 18:36:24 -0000
***************
*** 85,90 ****
--- 85,92 ----
  #define PGSTAT_SELECT_TIMEOUT    2        /* How often to check for postmaster
                                           * death; in seconds. */

+ #define PGSTAT_WRITE_TIMEOUT    5        /* How long to wait for the collector
+                                          * to finish writing the file; in seconds. */

  /* ----------
   * The initial size hints for the hash tables used in the collector.
***************
*** 94,100 ****
  #define PGSTAT_TAB_HASH_SIZE    512
  #define PGSTAT_FUNCTION_HASH_SIZE    512

-
  /* ----------
   * GUC parameters
   * ----------
--- 96,101 ----
***************
*** 201,208 ****
--- 202,213 ----
   */
  static PgStat_GlobalStats globalStats;

+ /* Last time we wrote the stats file */
+ static TimestampTz last_statwrite;
+
  static volatile bool need_exit = false;
  static volatile bool need_statwrite = false;
+ static volatile bool stats_ready = false;
  static volatile bool got_SIGHUP = false;

  /*
***************
*** 213,218 ****
--- 218,229 ----
  static instr_time total_func_time;


+ #define PGSTAT_MAX_INQUIRIES    16        /* Max number of outstanding stats inquiries */
+
+ /* List of backends that need notifications */
+ static pid_t inquiring_backends[PGSTAT_MAX_INQUIRIES];
+ static int num_inquiries = 0;
+
  /* ----------
   * Local function forward declarations
   * ----------
***************
*** 223,229 ****

  NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]);
  static void pgstat_exit(SIGNAL_ARGS);
! static void force_statwrite(SIGNAL_ARGS);
  static void pgstat_beshutdown_hook(int code, Datum arg);
  static void pgstat_sighup_handler(SIGNAL_ARGS);

--- 234,240 ----

  NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]);
  static void pgstat_exit(SIGNAL_ARGS);
! static void pgstat_alarm_handler(SIGNAL_ARGS);
  static void pgstat_beshutdown_hook(int code, Datum arg);
  static void pgstat_sighup_handler(SIGNAL_ARGS);

***************
*** 254,259 ****
--- 265,271 ----
  static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
  static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
  static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
+ static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);


  /* ------------------------------------------------------------
***************
*** 2463,2468 ****
--- 2475,2496 ----
      hdr->m_type = mtype;
  }

+ /* ----------
+  * pgstat_send_inquiry() -
+  *
+  *        Notify collector that we need fresh data.
+  * ----------
+  */
+ static void
+ pgstat_send_inquiry(void)
+ {
+     PgStat_MsgInquiry msg;
+
+     elog(DEBUG1, "pgstat_send_inquiry: I am %u", getpid());
+     pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
+     msg.m_pid = getpid();
+     pgstat_send(&msg, sizeof(msg));
+ }

  /* ----------
   * pgstat_send() -
***************
*** 2538,2545 ****
  NON_EXEC_STATIC void
  PgstatCollectorMain(int argc, char *argv[])
  {
-     struct itimerval write_timeout;
-     bool        need_timer = false;
      int            len;
      PgStat_Msg    msg;

--- 2566,2571 ----
***************
*** 2577,2583 ****
      pqsignal(SIGINT, SIG_IGN);
      pqsignal(SIGTERM, SIG_IGN);
      pqsignal(SIGQUIT, pgstat_exit);
-     pqsignal(SIGALRM, force_statwrite);
      pqsignal(SIGPIPE, SIG_IGN);
      pqsignal(SIGUSR1, SIG_IGN);
      pqsignal(SIGUSR2, SIG_IGN);
--- 2603,2608 ----
***************
*** 2598,2608 ****
       */
      need_statwrite = true;

-     /* Preset the delay between status file writes */
-     MemSet(&write_timeout, 0, sizeof(struct itimerval));
-     write_timeout.it_value.tv_sec = PGSTAT_STAT_INTERVAL / 1000;
-     write_timeout.it_value.tv_usec = (PGSTAT_STAT_INTERVAL % 1000) * 1000;
-
      /*
       * Read in an existing statistics stats file or initialize the stats to
       * zero.
--- 2623,2628 ----
***************
*** 2628,2634 ****
       */
      for (;;)
      {
!         int            got_data;

          /*
           * Quit if we get SIGQUIT from the postmaster.
--- 2648,2654 ----
       */
      for (;;)
      {
!         int            i, got_data;

          /*
           * Quit if we get SIGQUIT from the postmaster.
***************
*** 2649,2670 ****
          }

          /*
!          * If time to write the stats file, do so.    Note that the alarm
!          * interrupt isn't re-enabled immediately, but only after we next
!          * receive a stats message; so no cycles are wasted when there is
!          * nothing going on.
           */
          if (need_statwrite)
          {
              /* Check for postmaster death; if so we'll write file below */
              if (!PostmasterIsAlive(true))
                  break;

!             pgstat_write_statsfile(false);
              need_statwrite = false;
-             need_timer = true;
          }

          /*
           * Wait for a message to arrive; but not for more than
           * PGSTAT_SELECT_TIMEOUT seconds. (This determines how quickly we will
--- 2669,2708 ----
          }

          /*
!          * If time to write the stats file, do so.
           */
          if (need_statwrite)
          {
+             TimestampTz now = GetCurrentTimestamp();
+             long sec;
+             int us;
+
              /* Check for postmaster death; if so we'll write file below */
              if (!PostmasterIsAlive(true))
                  break;

!             /* Write the file, but only if enough time has passed since last write */
!             if (TimestampDifferenceExceeds(last_statwrite, now, PGSTAT_STAT_INTERVAL))
!             {
!                 pgstat_write_statsfile(false);
!                 last_statwrite = now;
!             }
!
!             TimestampDifference(now, GetCurrentTimestamp(), &sec, &us);
!             elog(DEBUG1, "pgstat_write_statsfile: written in %ldms",
!                 (long)(sec*1000 + us/1000.0));
!
              need_statwrite = false;
          }

+         /* Notify inquiring backends that we are done with the file */
+         for (i = 0; i < num_inquiries; i++)
+         {
+             elog(DEBUG1, "signalling backend %u", (unsigned)inquiring_backends[i]);
+             kill(inquiring_backends[i], SIGALRM);
+         }
+         num_inquiries = 0;
+
          /*
           * Wait for a message to arrive; but not for more than
           * PGSTAT_SELECT_TIMEOUT seconds. (This determines how quickly we will
***************
*** 2797,2817 ****
                      pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
                      break;

!                 default:
                      break;
-             }

!             /*
!              * If this is the first message after we wrote the stats file the
!              * last time, enable the alarm interrupt to make it be written
!              * again later.
!              */
!             if (need_timer)
!             {
!                 if (setitimer(ITIMER_REAL, &write_timeout, NULL))
!                     ereport(ERROR,
!                     (errmsg("could not set statistics collector timer: %m")));
!                 need_timer = false;
              }
          }
          else
--- 2835,2846 ----
                      pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
                      break;

!                 case PGSTAT_MTYPE_INQUIRY:
!                     pgstat_recv_inquiry((PgStat_MsgInquiry *)&msg, len);
                      break;

!                 default:
!                     break;
              }
          }
          else
***************
*** 2841,2853 ****
      need_exit = true;
  }

- /* SIGALRM signal handler for collector process */
- static void
- force_statwrite(SIGNAL_ARGS)
- {
-     need_statwrite = true;
- }
-
  /* SIGHUP handler for collector process */
  static void
  pgstat_sighup_handler(SIGNAL_ARGS)
--- 2870,2875 ----
***************
*** 2855,2860 ****
--- 2877,2888 ----
      got_SIGHUP = true;
  }

+ /* SIGALRM signal handler for backend process */
+ static void
+ pgstat_alarm_handler(SIGNAL_ARGS)
+ {
+     stats_ready = true;
+ }

  /*
   * Lookup the hash table entry for the specified database. If no hash
***************
*** 3297,3307 ****
--- 3325,3368 ----
  static void
  backend_read_statsfile(void)
  {
+     TimestampTz start;
+     pqsigfunc oldalarm;
+     static struct itimerval stat_timeout;
+     int count = 0;
+     long sec;
+     int usec;
+
      /* already read it? */
      if (pgStatDBHash)
          return;
      Assert(!pgStatRunningInCollector);

+     start = GetCurrentTimestamp();
+
+     /* These need only be done once */
+     pqsignal(SIGALRM, pgstat_alarm_handler);
+     stat_timeout.it_value.tv_sec = PGSTAT_WRITE_TIMEOUT;
+
+     stats_ready = false;
+     if (setitimer(ITIMER_REAL, &stat_timeout, NULL))
+        ereport(ERROR, (errmsg("could not set statistics collector poll timer: %m")));
+
+     while (!stats_ready)
+     {
+         CHECK_FOR_INTERRUPTS();
+
+         pgstat_send_inquiry();
+         pg_usleep(5000);
+         count += 1;
+     }
+
+     /* Stats ready or timeout - either way we have to proceed */
+
+     TimestampDifference(start, GetCurrentTimestamp(), &sec, &usec);
+     elog(DEBUG1, "backend_read_statsfile: %u: waited for %lu ms, looped %d times.",
+         getpid(),
+         (unsigned long)(sec*1000 + usec/1000.0), count);
+
      /* Autovacuum launcher wants stats about all databases */
      if (IsAutoVacuumLauncherProcess())
          pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
***************
*** 3779,3781 ****
--- 3840,3870 ----
                             HASH_REMOVE, NULL);
      }
  }
+
+ /* ----------
+  * pgstat_recv_inquiry() -
+  *
+  *    Process stat inquiry requests.
+  * ----------
+  */
+ static void
+ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+ {
+     int i;
+
+     /* filter out duplicates */
+     for (i = 0; i < num_inquiries; i++)
+         if (inquiring_backends[i] == msg->m_pid)
+             return;
+
+     if (num_inquiries >= PGSTAT_MAX_INQUIRIES)
+     {
+         /* List is full - just drop the message, it'll be resent */
+         elog(DEBUG1, "pgstat_recv_inquiry: too many outstanding inquiries - %d",
+             num_inquiries);
+         return;
+     }
+     inquiring_backends[num_inquiries++] = msg->m_pid;
+     need_statwrite = true;
+ }
+
Index: src/include/pgstat.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/pgstat.h,v
retrieving revision 1.78
diff -c -r1.78 pgstat.h
*** src/include/pgstat.h    15 Aug 2008 08:37:40 -0000    1.78
--- src/include/pgstat.h    5 Sep 2008 18:36:27 -0000
***************
*** 42,48 ****
      PGSTAT_MTYPE_ANALYZE,
      PGSTAT_MTYPE_BGWRITER,
      PGSTAT_MTYPE_FUNCSTAT,
!     PGSTAT_MTYPE_FUNCPURGE
  } StatMsgType;

  /* ----------
--- 42,49 ----
      PGSTAT_MTYPE_ANALYZE,
      PGSTAT_MTYPE_BGWRITER,
      PGSTAT_MTYPE_FUNCSTAT,
!     PGSTAT_MTYPE_FUNCPURGE,
!     PGSTAT_MTYPE_INQUIRY
  } StatMsgType;

  /* ----------
***************
*** 385,390 ****
--- 386,404 ----


  /* ----------
+  * PgStat_MsgInquiry            Sent by the backend to tell the collector
+  *                                to write the stats file.
+  * ----------
+  */
+
+ typedef struct PgStat_MsgInquiry
+ {
+     PgStat_MsgHdr m_hdr;
+     int            m_pid;
+ } PgStat_MsgInquiry;
+
+
+ /* ----------
   * PgStat_Msg                    Union over all possible messages.
   * ----------
   */
***************
*** 402,407 ****
--- 416,422 ----
      PgStat_MsgBgWriter msg_bgwriter;
      PgStat_MsgFuncstat msg_funcstat;
      PgStat_MsgFuncpurge msg_funcpurge;
+     PgStat_MsgInquiry msg_inquiry;
  } PgStat_Msg;

Re: reducing statistics write overhead

From

Tom Lane

Date:

05 September 2008, 16:23:26

Martin Pihlak <martin.pihlak@gmail.com> writes:
> So, as a simple optimization I am proposing that the file should be
> only written when some backend requests statistics. This would
> significantly reduce the undesired write traffic at the cost of
> slightly slower stats access.

How necessary is this given the recent fixes to allow the stats file to
be kept on a ramdisk?

> Attached is a WIP patch, which basically implements this:

This patch breaks deadlock checking and statement_timeout, because
backends already use SIGALRM.  You can't just take over that signal.
It's possible that you could get things to work by treating this as an
additional reason for SIGALRM, but that code is unreasonably complex
already.  I'd suggest finding some other way.
        regards, tom lane

Re: reducing statistics write overhead

From

Joshua Drake

Date:

05 September 2008, 16:29:16

On Fri, 05 Sep 2008 15:23:18 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Martin Pihlak <martin.pihlak@gmail.com> writes:
> > So, as a simple optimization I am proposing that the file should be
> > only written when some backend requests statistics. This would
> > significantly reduce the undesired write traffic at the cost of
> > slightly slower stats access.
>
> How necessary is this given the recent fixes to allow the stats file
> to be kept on a ramdisk?

From an usability and integration perspective this patch is a nice
touch. On demand is a nice feature when used correctly.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

05 September 2008, 16:51:44

Tom Lane wrote:
> Martin Pihlak <martin.pihlak@gmail.com> writes:
>> So, as a simple optimization I am proposing that the file should be
>> only written when some backend requests statistics. This would
>> significantly reduce the undesired write traffic at the cost of
>> slightly slower stats access.
> 
> How necessary is this given the recent fixes to allow the stats file to
> be kept on a ramdisk?
> 

Ramdisk helps, but requires additional effort to set up. Also the stats
file has a tendency to creep up on you -- as the database evolves the file
size gradually increases and suddenly the DBA is left wondering what
happened to performance.

>> Attached is a WIP patch, which basically implements this:
> 
> This patch breaks deadlock checking and statement_timeout, because
> backends already use SIGALRM.  You can't just take over that signal.
> It's possible that you could get things to work by treating this as an
> additional reason for SIGALRM, but that code is unreasonably complex
> already.  I'd suggest finding some other way.
> 

I suspected that, but somehow managed to overlook it :( I guess it was
too tempting to use it. I'll start looking for alternatives.

regards,
Martin

Re: reducing statistics write overhead

From

Euler Taveira de Oliveira

Date:

05 September 2008, 20:31:40

Martin Pihlak escreveu:
> I suspected that, but somehow managed to overlook it :( I guess it was
> too tempting to use it. I'll start looking for alternatives.
> 
If you can't afford a 500 msec pgstat time, then you need to make it
tunable. Another ideas are (i) turn on/off pgstat per table or database
and (ii) make the pgstat time tunable per table or database. You can use
the reloptions column to store these info. These workarounds are much
simpler than that you proposed and they're almost for free.


--  Euler Taveira de Oliveira http://www.timbira.com/

Re: reducing statistics write overhead

From

Tom Lane

Date:

05 September 2008, 21:48:15

Euler Taveira de Oliveira <euler@timbira.com> writes:
> If you can't afford a 500 msec pgstat time, then you need to make it
> tunable. Another ideas are (i) turn on/off pgstat per table or database
> and (ii) make the pgstat time tunable per table or database. You can use
> the reloptions column to store these info. These workarounds are much
> simpler than that you proposed and they're almost for free.

For normal usage on-demand dumping would be a really good thing; it'd
cut the overhead of having stats on tremendously, especially for people
who don't really use 'em.  The particular signaling proposed here is
bogus, but if Martin can make it work in a cleaner fashion I think it's
likely a good idea.
        regards, tom lane

Re: reducing statistics write overhead

From

"Asko Oja"

Date:

06 September 2008, 04:22:27

On Sat, Sep 6, 2008 at 2:29 AM, Euler Taveira de Oliveira <euler@timbira.com> wrote:

Martin Pihlak escreveu:
> I suspected that, but somehow managed to overlook it :( I guess it was
> too tempting to use it. I'll start looking for alternatives.
>
If you can't afford a 500 msec pgstat time, then you need to make it
tunable.

Additional parameter in config file. Not good.

Another ideas are (i) turn on/off pgstat per table or database
and (ii) make the pgstat time tunable per table or database. You can use
the reloptions column to store these info. These workarounds are much
simpler than that you proposed and they're almost for free.

Does not seem simple to me. Why would dba's want extra management work. We want all the stats to be there as we don't know when we need to look at it.

--
Euler Taveira de Oliveira
http://www.timbira.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: reducing statistics write overhead

From

Simon Riggs

Date:

06 September 2008, 12:49:11

On Fri, 2008-09-05 at 15:23 -0400, Tom Lane wrote:

> How necessary is this given the recent fixes to allow the stats file to
> be kept on a ramdisk?

I would prefer this approach and back-out the other change.

On-demand is cheaper and easier to use.

> > Attached is a WIP patch, which basically implements this:
> 
> This patch breaks deadlock checking and statement_timeout, because
> backends already use SIGALRM.  You can't just take over that signal.
> It's possible that you could get things to work by treating this as an
> additional reason for SIGALRM, but that code is unreasonably complex
> already.  I'd suggest finding some other way.

There are other ways already in use in backend, so just use those.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support

Re: reducing statistics write overhead

From

Tom Lane

Date:

06 September 2008, 14:53:00

Simon Riggs <simon@2ndQuadrant.com> writes:
> On Fri, 2008-09-05 at 15:23 -0400, Tom Lane wrote:
>> How necessary is this given the recent fixes to allow the stats file to
>> be kept on a ramdisk?

> I would prefer this approach and back-out the other change.

Even if we get on-demand done, I wouldn't see it as a reason to back out
the statfile relocation work.  In an environment where the stats are
demanded frequently, you could still need that for performance.

(In fact, maybe this patch ought to include some sort of maximum update
rate tunable?  The worst case behavior could actually be WORSE than now.)
        regards, tom lane

Re: reducing statistics write overhead

From

Alvaro Herrera

Date:

06 September 2008, 15:02:44

Tom Lane escribió:

> (In fact, maybe this patch ought to include some sort of maximum update
> rate tunable?  The worst case behavior could actually be WORSE than now.)

Some sort of "if stats were requested in the last 500 ms, just tell the
requester to read the existing file".

Things that come to mind:

- autovacuum could use a more frequent stats update in certain cases

- Maybe we oughta have separate files, one for each database?  That way
we'd reduce unnecessary I/O traffic for both the reader and the writer.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: reducing statistics write overhead

From

Tom Lane

Date:

06 September 2008, 15:07:29

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Some sort of "if stats were requested in the last 500 ms, just tell the
> requester to read the existing file".

Hmm, I was thinking of delaying both the write and the reply signal
until 500ms had elapsed.  But the above behavior would certainly be
easier to implement, and would probably be good enough (TM).

> - Maybe we oughta have separate files, one for each database?  That way
> we'd reduce unnecessary I/O traffic for both the reader and the writer.

The signaling would become way too complex, I think.  Also what do you
do about shared tables?
        regards, tom lane

Re: reducing statistics write overhead

From

Tom Lane

Date:

06 September 2008, 15:12:46

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Some sort of "if stats were requested in the last 500 ms, just tell the
> requester to read the existing file".

> Things that come to mind:

> - autovacuum could use a more frequent stats update in certain cases

BTW, we could implement that by, instead of having a global tunable,
including a field in the request message saying how stale an existing
file is acceptable for this requestor.  500ms might be the standard
value but autovac could use a smaller number.
        regards, tom lane

Re: reducing statistics write overhead

From

"Asko Oja"

Date:

06 September 2008, 16:45:19

Too frequent read protection is already handled in the patch but these comments might lead it into new directions. Current implementation had this same limit that file was written no more than once per 500 ms.

On Sat, Sep 6, 2008 at 9:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Some sort of "if stats were requested in the last 500 ms, just tell the
> requester to read the existing file".

> Things that come to mind:

> - autovacuum could use a more frequent stats update in certain cases

BTW, we could implement that by, instead of having a global tunable,
including a field in the request message saying how stale an existing
file is acceptable for this requestor. 500ms might be the standard
value but autovac could use a smaller number.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: reducing statistics write overhead

From

Alvaro Herrera

Date:

06 September 2008, 19:31:12

Tom Lane escribió:
> Alvaro Herrera <alvherre@commandprompt.com> writes:

> > - Maybe we oughta have separate files, one for each database?  That way
> > we'd reduce unnecessary I/O traffic for both the reader and the writer.
> 
> The signaling would become way too complex, I think.  Also what do you
> do about shared tables?

They are already stored in a "separate database" (denoted with
InvalidOid dbid), and autovacuum grabs it separately.  I admit I don't
know what do regular backends do about it.

As for signalling, maybe we could implement something like we do for the
postmaster signal stuff: the requestor stores a dbid in shared memory
and sends a SIGUSR2 to pgstat or some such.  We'd have enough shmem
space for a reasonable number of requests, and pgstat consumes them from
there into local memory (similar to what Andrew proposes for
LISTEN/NOTIFY); it stores the dbid and PID of the requestor.  As soon as
the request has been fulfilled, pgstat responds by <fill in magical
mechanism that Martin is about to propose> to that particular backend.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: reducing statistics write overhead

From

Tom Lane

Date:

06 September 2008, 19:38:58

Alvaro Herrera <alvherre@commandprompt.com> writes:
> As for signalling, maybe we could implement something like we do for the
> postmaster signal stuff: the requestor stores a dbid in shared memory
> and sends a SIGUSR2 to pgstat or some such.

No, no, no.  Martin already had a perfectly sane design for that
direction of signalling: send a special stats message to the collector.
That can carry whatever baggage it needs to.  It's the reverse direction
of "the data you requested is available now, sir" that is tricky.
And I fear that having to keep track of multiple stats-collector output
files would make it very significantly trickier --- both for the stats
collector's own bookkeeping and for the signaling mechanism itself.
I don't believe it's gonna be worth that.
        regards, tom lane

Re: reducing statistics write overhead

From

Tom Lane

Date:

06 September 2008, 19:45:47

I wrote:
> No, no, no.  Martin already had a perfectly sane design for that
> direction of signalling: send a special stats message to the collector.

Actually ... given that the stats message mechanism is designed to be
lossy under high load, maybe that isn't so sane.  At the very least
there would have to be timeout-and-resend logic on the backend side.

I dislike the alternative of communicating through shared memory,
though.  Right now the stats collector isn't even connected to shared
memory.
        regards, tom lane

Re: reducing statistics write overhead

From

Magnus Hagander

Date:

07 September 2008, 10:09:33

Martin Pihlak wrote:
>>> Attached is a WIP patch, which basically implements this:
>> This patch breaks deadlock checking and statement_timeout, because
>> backends already use SIGALRM.  You can't just take over that signal.
>> It's possible that you could get things to work by treating this as an
>> additional reason for SIGALRM, but that code is unreasonably complex
>> already.  I'd suggest finding some other way.
>>
> 
> I suspected that, but somehow managed to overlook it :( I guess it was
> too tempting to use it. I'll start looking for alternatives.

I wrote a patch for this some time back, that was actually applied.
Turns out it didn't work, and I ran out of time to fix it, so it was
backed out again. And then I forgot about it :-) If you look through the
cvs history of pgstat you should be able to find it - maybe it can give
you some further ideas.

//Magnus

Re: reducing statistics write overhead

From

Magnus Hagander

Date:

07 September 2008, 10:11:30

Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
>> As for signalling, maybe we could implement something like we do for the
>> postmaster signal stuff: the requestor stores a dbid in shared memory
>> and sends a SIGUSR2 to pgstat or some such.
> 
> No, no, no.  Martin already had a perfectly sane design for that
> direction of signalling: send a special stats message to the collector.
> That can carry whatever baggage it needs to.  It's the reverse direction
> of "the data you requested is available now, sir" that is tricky.

IIRC, my previous patch looked at the inode of the stats file, then sent
of the "gimme a new file" signal, and then read the file once the inode
change.

But also IIRC, that's the area where there was a problem - sometimes it
didn't properly pick up changes...

//Magnus

Re: reducing statistics write overhead

From

Dimitri Fontaine

Date:

07 September 2008, 15:03:02

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

Le 7 sept. 08 à 00:45, Tom Lane a écrit :
> I dislike the alternative of communicating through shared memory,
> though.  Right now the stats collector isn't even connected to shared
> memory.

Maybe Markus Wanner work for Postgres-R internal messaging, now it has
been reworked to follow your advices, could be of some use here?
http://archives.postgresql.org/pgsql-hackers/2008-07/msg01114.php
http://archives.postgresql.org/pgsql-hackers/2008-07/msg01420.php

Regards,
- --
dim



- --
Dimitri Fontaine
PostgreSQL DBA, Architecte



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Darwin)

iEYEARECAAYFAkjEF0sACgkQlBXRlnbh1bl/FACeORN+NjEFC9wi22suNaSoWmi5
LBEAnj9Qo2E6GWqVjdtsSCG7JILBPmX6
=5jPo
-----END PGP SIGNATURE-----

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

07 September 2008, 16:47:42

Magnus Hagander wrote:
> I wrote a patch for this some time back, that was actually applied.
> Turns out it didn't work, and I ran out of time to fix it, so it was
> backed out again. And then I forgot about it :-) If you look through the
> cvs history of pgstat you should be able to find it - maybe it can give
> you some further ideas.

Got it - this was 1.126. Looks very familiar indeed :)

I had also previously experimented with stat() based polling but ran into
the same issues - no portable high resolution timestamp on files. I guess
stat() is unusable unless we can live with 1 second update interval for the
stats (eg. backend reads the file if it is within 1 second of the request).

One alternative is to include a timestamp in the stats file header - the
backend can then wait on that -- check the timestamp, sleep, resend the
request, loop. Not particularly elegant, but easy to implement. Would this
be acceptable?

regards,
Martin

Re: reducing statistics write overhead

From

Tom Lane

Date:

07 September 2008, 16:52:56

Martin Pihlak <martin.pihlak@gmail.com> writes:
> I had also previously experimented with stat() based polling but ran into
> the same issues - no portable high resolution timestamp on files. I guess
> stat() is unusable unless we can live with 1 second update interval for the
> stats (eg. backend reads the file if it is within 1 second of the request).

> One alternative is to include a timestamp in the stats file header - the
> backend can then wait on that -- check the timestamp, sleep, resend the
> request, loop. Not particularly elegant, but easy to implement. Would this
> be acceptable?

Timestamp within the file is certainly better than trying to rely on
filesystem timestamps.  I doubt 1 sec resolution is good enough, and
I'd also be worried about issues like clock skew between the
postmaster's time and the filesystem's time.
        regards, tom lane

Re: reducing statistics write overhead

From

Magnus Hagander

Date:

08 September 2008, 04:51:08

Martin Pihlak wrote:
> Magnus Hagander wrote:
>> I wrote a patch for this some time back, that was actually applied.
>> Turns out it didn't work, and I ran out of time to fix it, so it was
>> backed out again. And then I forgot about it :-) If you look through the
>> cvs history of pgstat you should be able to find it - maybe it can give
>> you some further ideas.
> 
> Got it - this was 1.126. Looks very familiar indeed :)

:-)


> I had also previously experimented with stat() based polling but ran into
> the same issues - no portable high resolution timestamp on files. I guess
> stat() is unusable unless we can live with 1 second update interval for the
> stats (eg. backend reads the file if it is within 1 second of the request).

If the filesystem has inodes, noticing that the inode changes should
usually be enough, no? Since we always create a new file and rename it
into place, there is no way that inode can be reused right away. But it
does require that the stat info is not cached somewhere in between...


//Magnus

Re: reducing statistics write overhead

From

Magnus Hagander

Date:

08 September 2008, 04:52:12

Tom Lane wrote:
> Martin Pihlak <martin.pihlak@gmail.com> writes:
>> I had also previously experimented with stat() based polling but ran into
>> the same issues - no portable high resolution timestamp on files. I guess
>> stat() is unusable unless we can live with 1 second update interval for the
>> stats (eg. backend reads the file if it is within 1 second of the request).
> 
>> One alternative is to include a timestamp in the stats file header - the
>> backend can then wait on that -- check the timestamp, sleep, resend the
>> request, loop. Not particularly elegant, but easy to implement. Would this
>> be acceptable?
> 
> Timestamp within the file is certainly better than trying to rely on
> filesystem timestamps.  I doubt 1 sec resolution is good enough, and

We'd need half a second resolution just to keep up with the level we
have *now*, don't we?

> I'd also be worried about issues like clock skew between the
> postmaster's time and the filesystem's time.

Can that even happen on a local filesystem? I guess you could put the
file on NFS though, but that seems to be.. eh. sub-optimal.. in more
than one way..

//Magnus

Re: reducing statistics write overhead

From

Tom Lane

Date:

08 September 2008, 09:15:18

Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> I'd also be worried about issues like clock skew between the
>> postmaster's time and the filesystem's time.

> Can that even happen on a local filesystem? I guess you could put the
> file on NFS though, but that seems to be.. eh. sub-optimal.. in more
> than one way..

Lots of people use NAS/SAN type setups.
        regards, tom lane

Re: reducing statistics write overhead

From

Magnus Hagander

Date:

08 September 2008, 10:05:21

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Tom Lane wrote:
>>> I'd also be worried about issues like clock skew between the
>>> postmaster's time and the filesystem's time.
> 
>> Can that even happen on a local filesystem? I guess you could put the
>> file on NFS though, but that seems to be.. eh. sub-optimal.. in more
>> than one way..
> 
> Lots of people use NAS/SAN type setups.

For NAS it could happen, but certainly not for SAN, no? SANs have all
the filesystem processing in the kernel of the server...

I still agree that it's not a good thing to rely on, though :-)

//Magnus

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

08 September 2008, 10:25:30

Tom Lane wrote:
> Timestamp within the file is certainly better than trying to rely on
> filesystem timestamps.  I doubt 1 sec resolution is good enough, and
> I'd also be worried about issues like clock skew between the
> postmaster's time and the filesystem's time.
>

Attached is a patch which adds a timestamp to pgstat.stat file header,
backend_read_statsfile uses this to determine if the file is fresh.
During the wait loop, the stats request message is retransmitted to
compensate for possible loss of message(s).

The collector only writes the file mostly at PGSTAT_STAT_INTERVAL frequency,
currently no extra custom timeouts can be passed in the message. This can
of course be added if need arises.

regards,
Martin


Index: src/backend/postmaster/pgstat.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.181
diff -c -r1.181 pgstat.c
*** src/backend/postmaster/pgstat.c    25 Aug 2008 18:55:43 -0000    1.181
--- src/backend/postmaster/pgstat.c    8 Sep 2008 12:17:21 -0000
***************
*** 85,90 ****
--- 85,94 ----
  #define PGSTAT_SELECT_TIMEOUT    2        /* How often to check for postmaster
                                           * death; in seconds. */

+ #define PGSTAT_POLL_RETRY_DELAY    10        /* How long to pause between statistics
+                                          * update requests; in milliseconds. */
+
+ #define PGSTAT_POLL_RETRIES        500        /* How many times to repeat stats inquiry */

  /* ----------
   * The initial size hints for the hash tables used in the collector.
***************
*** 201,206 ****
--- 205,215 ----
   */
  static PgStat_GlobalStats globalStats;

+ /* Last time the collector wrote the stats file */
+ static TimestampTz last_statwrite;
+ /* Last time a backend requested a new file */
+ static TimestampTz last_statrequest;
+
  static volatile bool need_exit = false;
  static volatile bool need_statwrite = false;
  static volatile bool got_SIGHUP = false;
***************
*** 223,229 ****

  NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]);
  static void pgstat_exit(SIGNAL_ARGS);
- static void force_statwrite(SIGNAL_ARGS);
  static void pgstat_beshutdown_hook(int code, Datum arg);
  static void pgstat_sighup_handler(SIGNAL_ARGS);

--- 232,237 ----
***************
*** 254,259 ****
--- 262,268 ----
  static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
  static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
  static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
+ static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);


  /* ------------------------------------------------------------
***************
*** 2463,2468 ****
--- 2472,2491 ----
      hdr->m_type = mtype;
  }

+ /* ----------
+  * pgstat_send_inquiry() -
+  *
+  *        Notify collector that we need fresh data.
+  * ----------
+  */
+ static void
+ pgstat_send_inquiry(void)
+ {
+     PgStat_MsgInquiry msg;
+
+     pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
+     pgstat_send(&msg, sizeof(msg));
+ }

  /* ----------
   * pgstat_send() -
***************
*** 2538,2545 ****
  NON_EXEC_STATIC void
  PgstatCollectorMain(int argc, char *argv[])
  {
-     struct itimerval write_timeout;
-     bool        need_timer = false;
      int            len;
      PgStat_Msg    msg;

--- 2561,2566 ----
***************
*** 2571,2583 ****

      /*
       * Ignore all signals usually bound to some action in the postmaster,
!      * except SIGQUIT and SIGALRM.
       */
      pqsignal(SIGHUP, pgstat_sighup_handler);
      pqsignal(SIGINT, SIG_IGN);
      pqsignal(SIGTERM, SIG_IGN);
      pqsignal(SIGQUIT, pgstat_exit);
-     pqsignal(SIGALRM, force_statwrite);
      pqsignal(SIGPIPE, SIG_IGN);
      pqsignal(SIGUSR1, SIG_IGN);
      pqsignal(SIGUSR2, SIG_IGN);
--- 2592,2603 ----

      /*
       * Ignore all signals usually bound to some action in the postmaster,
!      * except SIGQUIT
       */
      pqsignal(SIGHUP, pgstat_sighup_handler);
      pqsignal(SIGINT, SIG_IGN);
      pqsignal(SIGTERM, SIG_IGN);
      pqsignal(SIGQUIT, pgstat_exit);
      pqsignal(SIGPIPE, SIG_IGN);
      pqsignal(SIGUSR1, SIG_IGN);
      pqsignal(SIGUSR2, SIG_IGN);
***************
*** 2598,2608 ****
       */
      need_statwrite = true;

-     /* Preset the delay between status file writes */
-     MemSet(&write_timeout, 0, sizeof(struct itimerval));
-     write_timeout.it_value.tv_sec = PGSTAT_STAT_INTERVAL / 1000;
-     write_timeout.it_value.tv_usec = (PGSTAT_STAT_INTERVAL % 1000) * 1000;
-
      /*
       * Read in an existing statistics stats file or initialize the stats to
       * zero.
--- 2618,2623 ----
***************
*** 2649,2668 ****
          }

          /*
!          * If time to write the stats file, do so.    Note that the alarm
!          * interrupt isn't re-enabled immediately, but only after we next
!          * receive a stats message; so no cycles are wasted when there is
!          * nothing going on.
           */
          if (need_statwrite)
          {
              /* Check for postmaster death; if so we'll write file below */
              if (!PostmasterIsAlive(true))
                  break;

!             pgstat_write_statsfile(false);
              need_statwrite = false;
-             need_timer = true;
          }

          /*
--- 2664,2687 ----
          }

          /*
!          * If time to write the stats file, do so.
           */
          if (need_statwrite)
          {
+             TimestampTz now = GetCurrentTimestamp();
+
              /* Check for postmaster death; if so we'll write file below */
              if (!PostmasterIsAlive(true))
                  break;

!             /* Write the file, but only if enough time has passed since last write */
!             if (TimestampDifferenceExceeds(last_statwrite, now, PGSTAT_STAT_INTERVAL))
!             {
!                 pgstat_write_statsfile(false);
!                 last_statwrite = now;
!             }
!
              need_statwrite = false;
          }

          /*
***************
*** 2797,2817 ****
                      pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
                      break;

!                 default:
                      break;
-             }

!             /*
!              * If this is the first message after we wrote the stats file the
!              * last time, enable the alarm interrupt to make it be written
!              * again later.
!              */
!             if (need_timer)
!             {
!                 if (setitimer(ITIMER_REAL, &write_timeout, NULL))
!                     ereport(ERROR,
!                     (errmsg("could not set statistics collector timer: %m")));
!                 need_timer = false;
              }
          }
          else
--- 2816,2827 ----
                      pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
                      break;

!                 case PGSTAT_MTYPE_INQUIRY:
!                     pgstat_recv_inquiry((PgStat_MsgInquiry *)&msg, len);
                      break;

!                 default:
!                     break;
              }
          }
          else
***************
*** 2841,2853 ****
      need_exit = true;
  }

- /* SIGALRM signal handler for collector process */
- static void
- force_statwrite(SIGNAL_ARGS)
- {
-     need_statwrite = true;
- }
-
  /* SIGHUP handler for collector process */
  static void
  pgstat_sighup_handler(SIGNAL_ARGS)
--- 2851,2856 ----
***************
*** 2937,2949 ****
      PgStat_StatFuncEntry *funcentry;
      FILE       *fpout;
      int32        format_id;
      const char *tmpfile = permanent?PGSTAT_STAT_PERMANENT_TMPFILE:pgstat_stat_tmpname;
      const char *statfile = permanent?PGSTAT_STAT_PERMANENT_FILENAME:pgstat_stat_filename;

      /*
       * Open the statistics temp file to write out the current values.
       */
!     fpout = fopen(tmpfile, PG_BINARY_W);
      if (fpout == NULL)
      {
          ereport(LOG,
--- 2940,2953 ----
      PgStat_StatFuncEntry *funcentry;
      FILE       *fpout;
      int32        format_id;
+     TimestampTz    ts;
      const char *tmpfile = permanent?PGSTAT_STAT_PERMANENT_TMPFILE:pgstat_stat_tmpname;
      const char *statfile = permanent?PGSTAT_STAT_PERMANENT_FILENAME:pgstat_stat_filename;

      /*
       * Open the statistics temp file to write out the current values.
       */
!     fpout = AllocateFile(tmpfile, PG_BINARY_W);
      if (fpout == NULL)
      {
          ereport(LOG,
***************
*** 2954,2963 ****
      }

      /*
!      * Write the file header --- currently just a format ID.
       */
      format_id = PGSTAT_FILE_FORMAT_ID;
      fwrite(&format_id, sizeof(format_id), 1, fpout);

      /*
       * Write global stats struct
--- 2958,2969 ----
      }

      /*
!      * Write the file header --- currently just a format ID and a timestamp.
       */
      format_id = PGSTAT_FILE_FORMAT_ID;
+     ts = GetCurrentTimestamp();
      fwrite(&format_id, sizeof(format_id), 1, fpout);
+     fwrite(&ts, sizeof(ts), 1, fpout);

      /*
       * Write global stats struct
***************
*** 3017,3026 ****
                  (errcode_for_file_access(),
                 errmsg("could not write temporary statistics file \"%s\": %m",
                        tmpfile)));
!         fclose(fpout);
          unlink(tmpfile);
      }
!     else if (fclose(fpout) < 0)
      {
          ereport(LOG,
                  (errcode_for_file_access(),
--- 3023,3032 ----
                  (errcode_for_file_access(),
                 errmsg("could not write temporary statistics file \"%s\": %m",
                        tmpfile)));
!         FreeFile(fpout);
          unlink(tmpfile);
      }
!     else if (FreeFile(fpout) < 0)
      {
          ereport(LOG,
                  (errcode_for_file_access(),
***************
*** 3108,3113 ****
--- 3114,3127 ----
          goto done;
      }

+     /* Skip over the timestamp */
+     if (fseek(fpin, sizeof(TimestampTz), SEEK_CUR) != 0)
+     {
+         ereport(pgStatRunningInCollector ? LOG : WARNING,
+                 (errmsg("corrupted pgstat.stat file")));
+         goto done;
+     }
+
      /*
       * Read global stats struct
       */
***************
*** 3297,3307 ****
--- 3311,3365 ----
  static void
  backend_read_statsfile(void)
  {
+     TimestampTz start;
+     bool stats_ready = false;
+     int count;
+
      /* already read it? */
      if (pgStatDBHash)
          return;
      Assert(!pgStatRunningInCollector);

+     start = GetCurrentTimestamp();
+
+     /*
+      * If the stats requests are sent too frequently the collector
+      * might refuse to write the file and cause a longish wait here.
+      * So we reuse the old file if we know that not enough time has
+      * passsed since last request.
+      */
+     if (!TimestampDifferenceExceeds(last_statrequest, start, PGSTAT_STAT_INTERVAL))
+         stats_ready = true;
+     last_statrequest = start;
+
+     /*
+      * Loop until fresh stats file is available or we ran out of time.
+      * The stats inquiry message is resent in case collector drops our
+      * first message.
+      */
+     for (count = 0; !stats_ready && count < PGSTAT_POLL_RETRIES; count++)
+     {
+         TimestampTz ts;
+         FILE *fp;
+
+         CHECK_FOR_INTERRUPTS();
+         pgstat_send_inquiry();
+         pg_usleep(PGSTAT_POLL_RETRY_DELAY * 1000);
+
+         /* Read the timestamp from beginning of the stats file */
+         if ((fp = AllocateFile(pgstat_stat_filename, PG_BINARY_R)) != NULL)
+         {
+             fseek(fp, sizeof(int32), SEEK_SET);
+             fread(&ts, sizeof(ts), 1, fp);
+             if (!ferror(fp) && TimestampDifferenceExceeds(start, ts, 0))
+                 stats_ready = true;
+             FreeFile(fp);
+         }
+     }
+
+     if (count >= PGSTAT_POLL_RETRIES)
+         elog(WARNING, "pgstat wait timeout");
+
      /* Autovacuum launcher wants stats about all databases */
      if (IsAutoVacuumLauncherProcess())
          pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
***************
*** 3779,3781 ****
--- 3837,3852 ----
                             HASH_REMOVE, NULL);
      }
  }
+
+ /* ----------
+  * pgstat_recv_inquiry() -
+  *
+  *    Process stat inquiry requests.
+  * ----------
+  */
+ static void
+ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+ {
+     need_statwrite = true;
+ }
+
Index: src/include/pgstat.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/pgstat.h,v
retrieving revision 1.78
diff -c -r1.78 pgstat.h
*** src/include/pgstat.h    15 Aug 2008 08:37:40 -0000    1.78
--- src/include/pgstat.h    8 Sep 2008 12:17:23 -0000
***************
*** 42,48 ****
      PGSTAT_MTYPE_ANALYZE,
      PGSTAT_MTYPE_BGWRITER,
      PGSTAT_MTYPE_FUNCSTAT,
!     PGSTAT_MTYPE_FUNCPURGE
  } StatMsgType;

  /* ----------
--- 42,49 ----
      PGSTAT_MTYPE_ANALYZE,
      PGSTAT_MTYPE_BGWRITER,
      PGSTAT_MTYPE_FUNCSTAT,
!     PGSTAT_MTYPE_FUNCPURGE,
!     PGSTAT_MTYPE_INQUIRY
  } StatMsgType;

  /* ----------
***************
*** 385,390 ****
--- 386,403 ----


  /* ----------
+  * PgStat_MsgInquiry            Sent by the backend to tell the collector
+  *                                to write the stats file.
+  * ----------
+  */
+
+ typedef struct PgStat_MsgInquiry
+ {
+     PgStat_MsgHdr m_hdr;
+ } PgStat_MsgInquiry;
+
+
+ /* ----------
   * PgStat_Msg                    Union over all possible messages.
   * ----------
   */
***************
*** 402,407 ****
--- 415,421 ----
      PgStat_MsgBgWriter msg_bgwriter;
      PgStat_MsgFuncstat msg_funcstat;
      PgStat_MsgFuncpurge msg_funcpurge;
+     PgStat_MsgInquiry msg_inquiry;
  } PgStat_Msg;


***************
*** 413,419 ****
   * ------------------------------------------------------------
   */

! #define PGSTAT_FILE_FORMAT_ID    0x01A5BC97

  /* ----------
   * PgStat_StatDBEntry            The collector's data per database
--- 427,433 ----
   * ------------------------------------------------------------
   */

! #define PGSTAT_FILE_FORMAT_ID    0x01A5BC98

  /* ----------
   * PgStat_StatDBEntry            The collector's data per database

Re: reducing statistics write overhead

From

Tom Lane

Date:

08 September 2008, 11:06:15

Martin Pihlak <martin.pihlak@gmail.com> writes:
> Attached is a patch which adds a timestamp to pgstat.stat file header,
> backend_read_statsfile uses this to determine if the file is fresh.
> During the wait loop, the stats request message is retransmitted to
> compensate for possible loss of message(s).

> The collector only writes the file mostly at PGSTAT_STAT_INTERVAL frequency,
> currently no extra custom timeouts can be passed in the message. This can
> of course be added if need arises.

Hmm.  With the timestamp in the file, ISTM that we could put all the
intelligence on the reader side.  Reader checks file, sends message if
it's too stale.  The collector just writes when it's told to, no
filtering.  In this design, rate limiting relies on the readers to not
be unreasonable in how old a file they'll accept; and there's no problem
with different readers having different requirements.

A possible problem with this idea is that two readers might send request
messages at about the same time, resulting in two writes where there
need have been only one.  However I think that could be fixed if we add
a little more information to the request messages and have the collector
remember the file timestamp it last wrote out.  There are various ways
you could design it but what comes to mind here is for readers to send
a timestamp defined as minimum acceptable file timestamp (ie, their
current clock time less whatever slop they want to allow).  Then,
when the collector gets a request with that timestamp <= its last
write timestamp, it knows it need not do anything.

With signaling like that, there's no excess writes *and* no delay in
responding to a new live write request.  It's sort of annoying that
the readers have to sleep for an arbitrary interval though.  If we could
get rid of that part...
        regards, tom lane

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

08 September 2008, 18:10:58

Tom Lane wrote:
> Hmm.  With the timestamp in the file, ISTM that we could put all the
> intelligence on the reader side.  Reader checks file, sends message if
... snip ...
> remember the file timestamp it last wrote out.  There are various ways
> you could design it but what comes to mind here is for readers to send
> a timestamp defined as minimum acceptable file timestamp (ie, their
> current clock time less whatever slop they want to allow).  Then,
> when the collector gets a request with that timestamp <= its last
> write timestamp, it knows it need not do anything.

Attached is a patch that implements the described signalling. Additionally
following non-related changes have been made:
1. fopen/fclose replaced with AllocateFile/FreeFile
2. pgstat_report_stat() now also checks if there are functions to report
before returning. Previous behavior was to return immediately if no table
stats were available for reporting.

> responding to a new live write request.  It's sort of annoying that
> the readers have to sleep for an arbitrary interval though.  If we could
> get rid of that part...
>
It is, but I guess its unavoidable if we have to wait for the file to be
written. I'll try to do some load testing tomorrow, to see if something
nasty comes out.

regards,
Martin

Index: src/backend/postmaster/pgstat.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/postmaster/pgstat.c,v
retrieving revision 1.181
diff -c -r1.181 pgstat.c
*** src/backend/postmaster/pgstat.c    25 Aug 2008 18:55:43 -0000    1.181
--- src/backend/postmaster/pgstat.c    8 Sep 2008 21:00:17 -0000
***************
*** 85,90 ****
--- 85,97 ----
  #define PGSTAT_SELECT_TIMEOUT    2        /* How often to check for postmaster
                                           * death; in seconds. */

+ #define PGSTAT_POLL_RETRY_DELAY    5        /* How long to pause between statistics
+                                          * update requests; in milliseconds. */
+
+ #define PGSTAT_POLL_TIME        5000    /* How long are we allowed to poll for
+                                          * the stats file; in milliseconds. */
+
+ #define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_POLL_TIME / PGSTAT_POLL_RETRY_DELAY)

  /* ----------
   * The initial size hints for the hash tables used in the collector.
***************
*** 159,164 ****
--- 166,177 ----
  static HTAB *pgStatFunctions = NULL;

  /*
+  * Indicates if backend has some function stats which it hasn't yet
+  * sent to the collector.
+  */
+ static bool have_function_stats = false;
+
+ /*
   * Tuple insertion/deletion counts for an open transaction can't be propagated
   * into PgStat_TableStatus counters until we know if it is going to commit
   * or abort.  Hence, we keep these counts in per-subxact structs that live
***************
*** 201,208 ****
   */
  static PgStat_GlobalStats globalStats;

  static volatile bool need_exit = false;
- static volatile bool need_statwrite = false;
  static volatile bool got_SIGHUP = false;

  /*
--- 214,225 ----
   */
  static PgStat_GlobalStats globalStats;

+ /* Last time the collector wrote the stats file */
+ static TimestampTz last_statwrite;
+ /* Latest statistics request from backend */
+ static TimestampTz last_statrequest;
+
  static volatile bool need_exit = false;
  static volatile bool got_SIGHUP = false;

  /*
***************
*** 223,229 ****

  NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]);
  static void pgstat_exit(SIGNAL_ARGS);
- static void force_statwrite(SIGNAL_ARGS);
  static void pgstat_beshutdown_hook(int code, Datum arg);
  static void pgstat_sighup_handler(SIGNAL_ARGS);

--- 240,245 ----
***************
*** 254,259 ****
--- 270,276 ----
  static void pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len);
  static void pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len);
  static void pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len);
+ static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);


  /* ------------------------------------------------------------
***************
*** 655,663 ****
      int            i;

      /* Don't expend a clock check if nothing to do */
!     /* Note: we ignore pending function stats in this test ... OK? */
!     if (pgStatTabList == NULL ||
!         pgStatTabList->tsa_used == 0)
          return;

      /*
--- 672,679 ----
      int            i;

      /* Don't expend a clock check if nothing to do */
!     if ((pgStatTabList == NULL || pgStatTabList->tsa_used == 0)
!         && !have_function_stats)
          return;

      /*
***************
*** 823,828 ****
--- 839,846 ----
      if (msg.m_nentries > 0)
          pgstat_send(&msg, offsetof(PgStat_MsgFuncstat, m_entry[0]) +
                      msg.m_nentries * sizeof(PgStat_FunctionEntry));
+
+     have_function_stats = false;
  }


***************
*** 1341,1346 ****
--- 1359,1367 ----
          fs->f_numcalls++;
      fs->f_time = f_total;
      INSTR_TIME_ADD(fs->f_time_self, f_self);
+
+     /* indicate that we have something to upload */
+     have_function_stats = true;
  }


***************
*** 2463,2468 ****
--- 2484,2505 ----
      hdr->m_type = mtype;
  }

+ /* ----------
+  * pgstat_send_inquiry() -
+  *
+  *        Notify collector that we need fresh data.
+  *        ts is used to specify the minimum valid timestamp for the file.
+  * ----------
+  */
+ static void
+ pgstat_send_inquiry(TimestampTz ts)
+ {
+     PgStat_MsgInquiry msg;
+
+     pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
+     msg.inquiry_time = ts;
+     pgstat_send(&msg, sizeof(msg));
+ }

  /* ----------
   * pgstat_send() -
***************
*** 2538,2545 ****
  NON_EXEC_STATIC void
  PgstatCollectorMain(int argc, char *argv[])
  {
-     struct itimerval write_timeout;
-     bool        need_timer = false;
      int            len;
      PgStat_Msg    msg;

--- 2575,2580 ----
***************
*** 2571,2583 ****

      /*
       * Ignore all signals usually bound to some action in the postmaster,
!      * except SIGQUIT and SIGALRM.
       */
      pqsignal(SIGHUP, pgstat_sighup_handler);
      pqsignal(SIGINT, SIG_IGN);
      pqsignal(SIGTERM, SIG_IGN);
      pqsignal(SIGQUIT, pgstat_exit);
-     pqsignal(SIGALRM, force_statwrite);
      pqsignal(SIGPIPE, SIG_IGN);
      pqsignal(SIGUSR1, SIG_IGN);
      pqsignal(SIGUSR2, SIG_IGN);
--- 2606,2617 ----

      /*
       * Ignore all signals usually bound to some action in the postmaster,
!      * except SIGQUIT
       */
      pqsignal(SIGHUP, pgstat_sighup_handler);
      pqsignal(SIGINT, SIG_IGN);
      pqsignal(SIGTERM, SIG_IGN);
      pqsignal(SIGQUIT, pgstat_exit);
      pqsignal(SIGPIPE, SIG_IGN);
      pqsignal(SIGUSR1, SIG_IGN);
      pqsignal(SIGUSR2, SIG_IGN);
***************
*** 2594,2609 ****
      init_ps_display("stats collector process", "", "", "");

      /*
-      * Arrange to write the initial status file right away
-      */
-     need_statwrite = true;
-
-     /* Preset the delay between status file writes */
-     MemSet(&write_timeout, 0, sizeof(struct itimerval));
-     write_timeout.it_value.tv_sec = PGSTAT_STAT_INTERVAL / 1000;
-     write_timeout.it_value.tv_usec = (PGSTAT_STAT_INTERVAL % 1000) * 1000;
-
-     /*
       * Read in an existing statistics stats file or initialize the stats to
       * zero.
       */
--- 2628,2633 ----
***************
*** 2645,2668 ****
          {
              ProcessConfigFile(PGC_SIGHUP);
              got_SIGHUP = false;
!             need_statwrite = true;
          }

          /*
!          * If time to write the stats file, do so.    Note that the alarm
!          * interrupt isn't re-enabled immediately, but only after we next
!          * receive a stats message; so no cycles are wasted when there is
!          * nothing going on.
           */
!         if (need_statwrite)
          {
              /* Check for postmaster death; if so we'll write file below */
              if (!PostmasterIsAlive(true))
                  break;
-
              pgstat_write_statsfile(false);
-             need_statwrite = false;
-             need_timer = true;
          }

          /*
--- 2669,2687 ----
          {
              ProcessConfigFile(PGC_SIGHUP);
              got_SIGHUP = false;
!             last_statrequest = last_statwrite;
          }

          /*
!          * Write the stats file if so requested and not satisfied by
!          * existing file.
           */
!         if (TimestampDifferenceExceeds(last_statwrite, last_statrequest, 0))
          {
              /* Check for postmaster death; if so we'll write file below */
              if (!PostmasterIsAlive(true))
                  break;
              pgstat_write_statsfile(false);
          }

          /*
***************
*** 2797,2817 ****
                      pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
                      break;

!                 default:
                      break;
-             }

!             /*
!              * If this is the first message after we wrote the stats file the
!              * last time, enable the alarm interrupt to make it be written
!              * again later.
!              */
!             if (need_timer)
!             {
!                 if (setitimer(ITIMER_REAL, &write_timeout, NULL))
!                     ereport(ERROR,
!                     (errmsg("could not set statistics collector timer: %m")));
!                 need_timer = false;
              }
          }
          else
--- 2816,2827 ----
                      pgstat_recv_funcpurge((PgStat_MsgFuncpurge *) &msg, len);
                      break;

!                 case PGSTAT_MTYPE_INQUIRY:
!                     pgstat_recv_inquiry((PgStat_MsgInquiry *)&msg, len);
                      break;

!                 default:
!                     break;
              }
          }
          else
***************
*** 2841,2853 ****
      need_exit = true;
  }

- /* SIGALRM signal handler for collector process */
- static void
- force_statwrite(SIGNAL_ARGS)
- {
-     need_statwrite = true;
- }
-
  /* SIGHUP handler for collector process */
  static void
  pgstat_sighup_handler(SIGNAL_ARGS)
--- 2851,2856 ----
***************
*** 2943,2949 ****
      /*
       * Open the statistics temp file to write out the current values.
       */
!     fpout = fopen(tmpfile, PG_BINARY_W);
      if (fpout == NULL)
      {
          ereport(LOG,
--- 2946,2952 ----
      /*
       * Open the statistics temp file to write out the current values.
       */
!     fpout = AllocateFile(tmpfile, PG_BINARY_W);
      if (fpout == NULL)
      {
          ereport(LOG,
***************
*** 2954,2963 ****
      }

      /*
!      * Write the file header --- currently just a format ID.
       */
      format_id = PGSTAT_FILE_FORMAT_ID;
      fwrite(&format_id, sizeof(format_id), 1, fpout);

      /*
       * Write global stats struct
--- 2957,2968 ----
      }

      /*
!      * Write the file header --- currently just a format ID and a timestamp.
       */
      format_id = PGSTAT_FILE_FORMAT_ID;
      fwrite(&format_id, sizeof(format_id), 1, fpout);
+     last_statwrite = GetCurrentTimestamp();
+     fwrite(&last_statwrite, sizeof(last_statwrite), 1, fpout);

      /*
       * Write global stats struct
***************
*** 3017,3026 ****
                  (errcode_for_file_access(),
                 errmsg("could not write temporary statistics file \"%s\": %m",
                        tmpfile)));
!         fclose(fpout);
          unlink(tmpfile);
      }
!     else if (fclose(fpout) < 0)
      {
          ereport(LOG,
                  (errcode_for_file_access(),
--- 3022,3031 ----
                  (errcode_for_file_access(),
                 errmsg("could not write temporary statistics file \"%s\": %m",
                        tmpfile)));
!         FreeFile(fpout);
          unlink(tmpfile);
      }
!     else if (FreeFile(fpout) < 0)
      {
          ereport(LOG,
                  (errcode_for_file_access(),
***************
*** 3108,3113 ****
--- 3113,3126 ----
          goto done;
      }

+     /* Skip over the timestamp */
+     if (fseek(fpin, sizeof(TimestampTz), SEEK_CUR) != 0)
+     {
+         ereport(pgStatRunningInCollector ? LOG : WARNING,
+                 (errmsg("corrupted pgstat.stat file")));
+         goto done;
+     }
+
      /*
       * Read global stats struct
       */
***************
*** 3297,3307 ****
--- 3310,3363 ----
  static void
  backend_read_statsfile(void)
  {
+     TimestampTz file_min_ts;
+     bool stats_ready = false;
+     int count;
+
      /* already read it? */
      if (pgStatDBHash)
          return;
      Assert(!pgStatRunningInCollector);

+     /*
+      * This is the earliest timestamp we can use. Usually we can tolerate
+      * up to PGSTAT_STAT_INTERVAL ms stale files.
+      */
+     file_min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
+         -PGSTAT_STAT_INTERVAL);
+
+     /*
+      * Loop until fresh enough stats file is available or we ran out of time.
+      * The stats inquiry message is resent in case collector drops our first
+      * message.
+      */
+     for (count = 0; !stats_ready && count < PGSTAT_POLL_LOOP_COUNT; count++)
+     {
+         TimestampTz ts;
+         FILE *fp;
+
+         CHECK_FOR_INTERRUPTS();
+
+         /* Read the timestamp from beginning of the stats file */
+         if ((fp = AllocateFile(pgstat_stat_filename, PG_BINARY_R)) != NULL)
+         {
+             fseek(fp, sizeof(int32), SEEK_SET);
+             fread(&ts, sizeof(ts), 1, fp);
+             if (!ferror(fp) && TimestampDifferenceExceeds(file_min_ts, ts, 0))
+                 stats_ready = true;
+             FreeFile(fp);
+         }
+
+         if (!stats_ready)
+         {
+             pgstat_send_inquiry(file_min_ts);
+             pg_usleep(PGSTAT_POLL_RETRY_DELAY * 1000);
+         }
+     }
+
+     if (!stats_ready)
+         elog(WARNING, "pgstat wait timeout");
+
      /* Autovacuum launcher wants stats about all databases */
      if (IsAutoVacuumLauncherProcess())
          pgStatDBHash = pgstat_read_statsfile(InvalidOid, false);
***************
*** 3779,3781 ****
--- 3835,3851 ----
                             HASH_REMOVE, NULL);
      }
  }
+
+ /* ----------
+  * pgstat_recv_inquiry() -
+  *
+  *    Process stat inquiry requests.
+  * ----------
+  */
+ static void
+ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
+ {
+     if (msg->inquiry_time > last_statrequest)
+         last_statrequest = msg->inquiry_time;
+ }
+
Index: src/include/pgstat.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/pgstat.h,v
retrieving revision 1.78
diff -c -r1.78 pgstat.h
*** src/include/pgstat.h    15 Aug 2008 08:37:40 -0000    1.78
--- src/include/pgstat.h    8 Sep 2008 21:00:20 -0000
***************
*** 42,48 ****
      PGSTAT_MTYPE_ANALYZE,
      PGSTAT_MTYPE_BGWRITER,
      PGSTAT_MTYPE_FUNCSTAT,
!     PGSTAT_MTYPE_FUNCPURGE
  } StatMsgType;

  /* ----------
--- 42,49 ----
      PGSTAT_MTYPE_ANALYZE,
      PGSTAT_MTYPE_BGWRITER,
      PGSTAT_MTYPE_FUNCSTAT,
!     PGSTAT_MTYPE_FUNCPURGE,
!     PGSTAT_MTYPE_INQUIRY
  } StatMsgType;

  /* ----------
***************
*** 385,390 ****
--- 386,404 ----


  /* ----------
+  * PgStat_MsgInquiry            Sent by the backend to tell the collector
+  *                                to write the stats file.
+  * ----------
+  */
+
+ typedef struct PgStat_MsgInquiry
+ {
+     PgStat_MsgHdr    m_hdr;
+     TimestampTz        inquiry_time;
+ } PgStat_MsgInquiry;
+
+
+ /* ----------
   * PgStat_Msg                    Union over all possible messages.
   * ----------
   */
***************
*** 402,407 ****
--- 416,422 ----
      PgStat_MsgBgWriter msg_bgwriter;
      PgStat_MsgFuncstat msg_funcstat;
      PgStat_MsgFuncpurge msg_funcpurge;
+     PgStat_MsgInquiry msg_inquiry;
  } PgStat_Msg;


***************
*** 413,419 ****
   * ------------------------------------------------------------
   */

! #define PGSTAT_FILE_FORMAT_ID    0x01A5BC97

  /* ----------
   * PgStat_StatDBEntry            The collector's data per database
--- 428,434 ----
   * ------------------------------------------------------------
   */

! #define PGSTAT_FILE_FORMAT_ID    0x01A5BC98

  /* ----------
   * PgStat_StatDBEntry            The collector's data per database

Re: reducing statistics write overhead

From

Alvaro Herrera

Date:

08 September 2008, 20:36:30

Martin Pihlak escribió:
> Tom Lane wrote:
> > Hmm.  With the timestamp in the file, ISTM that we could put all the
> > intelligence on the reader side.  Reader checks file, sends message if

> Attached is a patch that implements the described signalling.

Are we keeping the idea that the reader sends a stat message when it
needs to read stats?  What about the lossiness of the transport?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

09 September 2008, 10:12:23

Alvaro Herrera wrote:
>> Attached is a patch that implements the described signalling.
> 
> Are we keeping the idea that the reader sends a stat message when it
> needs to read stats?  What about the lossiness of the transport?
> 

As the message is resent in the wait loop, the collector should receive
it sooner or later. And initial testing shows that its not really easy to
make the collector lose messages.

I used a modified version of the patch to run a simple load test on a 4 core
amd64 Linux box:

- patch modified so that pgstat_send_inquiry() is sent only once - before wait loop, so it times out if message is
lost.
- stats file bloated to ~ 5MB by creating 40k tables.
- 4 pgbench instances running: -c 500 -t 500
- 2 clients constantly pulling stats
- all cores near 100% busy, tx traffic over loopback ~ 200kB/sec.

Most of the stats requests required 1 or 2 file wait iterations (5ms sleep each).
Occasionally 3, but no timeouts (ie. no lost messages). Maybe other platforms are
more sensitive, but I have none available for testing at the moment.

regards,
Martin

Re: reducing statistics write overhead

From

Tom Lane

Date:

02 November 2008, 21:18:19

Martin Pihlak <martin.pihlak@gmail.com> writes:
> Attached is a patch that implements the described signalling. Additionally
> following non-related changes have been made:
> 1. fopen/fclose replaced with AllocateFile/FreeFile
> 2. pgstat_report_stat() now also checks if there are functions to report
> before returning. Previous behavior was to return immediately if no table
> stats were available for reporting.

Applied with minor corrections.
        regards, tom lane

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

07 January 2009, 06:37:03

Alvaro Herrera wrote:
> Tom Lane escribió:
> 
>> (In fact, maybe this patch ought to include some sort of maximum update
>> rate tunable?  The worst case behavior could actually be WORSE than now.)
> 
> Some sort of "if stats were requested in the last 500 ms, just tell the
> requester to read the existing file".
> 
> Things that come to mind:
> 
> - autovacuum could use a more frequent stats update in certain cases
> 

[ digging up an old tread ... ]

I recently tested the on-demand stats write on an installation with over 100
databases on a single cluster (ca 5MB stats file). As I saw no visible reduction
in the stats file updates, I started investigating. Turned out that autovacuum
was configured with 1minute naptime, and was constantly walking the database list
and launching workers. Each worker does several stats file requests, and often
it turns out that the file is older than the allowed 10ms. Increasing the naptime
somewhat alleviates the problem, but still introduces peaks.

As I understand the autovacuum workers need up to date stats to minimize the
risk of re-vacuuming a table (in case it was already vacuumed by someone else).
However, with the visibility map based vacuums the cost of re-vacuuming  should
be reasonably low. So I'd propose to increase the max allowed stats age for
autovacuum workers. Perhaps even as high as to allow reusing of the file requested
by the launcher. Or am I missing something?

regards,
Martin

Re: reducing statistics write overhead

From

Tom Lane

Date:

07 January 2009, 10:44:32

Martin Pihlak <martin.pihlak@gmail.com> writes:
> As I understand the autovacuum workers need up to date stats to minimize the
> risk of re-vacuuming a table (in case it was already vacuumed by someone else).

I never understood why autovacuum should need a particularly short fuse
on the stats file age to start with.  If the launcher is launching
multiple workers into the same database with only a few milliseconds
between them, isn't the launcher pretty broken anyhow?  ISTM that stats
files even several seconds old ought to be plenty good enough.
        regards, tom lane

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

12 January 2009, 05:38:59

Tom Lane wrote:
> I never understood why autovacuum should need a particularly short fuse
> on the stats file age to start with.  If the launcher is launching
> multiple workers into the same database with only a few milliseconds
> between them, isn't the launcher pretty broken anyhow?  ISTM that stats
> files even several seconds old ought to be plenty good enough.
> 

I was thinking that the launcher should only request fresh stats at wakeup,
the workers could then reuse that file. This could be implemented by calling
pgstat_clear_snapshot only at launcher wakeup and setting max stats age to
to autovacuum_naptime for the workers.

This would effectively disable the table vacuum recheck though. If this is
OK, I'll start working on it.

regards,
Martin

Re: reducing statistics write overhead

From

Alvaro Herrera

Date:

21 January 2009, 08:52:38

Martin Pihlak escribió:
> I wrote:
> > I was thinking that the launcher should only request fresh stats at wakeup,
> > the workers could then reuse that file. This could be implemented by calling
> > pgstat_clear_snapshot only at launcher wakeup and setting max stats age to
> > to autovacuum_naptime for the workers.
> 
> Attached is a patch that increases the autovacuum stats age tolerance to
> autovacuum_naptime. This is handled by autovac_refresh_stats() by not clearing
> the stats snapshot unless nap time elapsed or explicitly forced by an error
> or SIGHUP.

You missed putting back the BUG comment that used to be there about
this.

In other words I think this is a bad idea, because there is a very wide
window for a table to be vacuumed twice.  Since naptime can be
arbitrarily large, this is an arbitrarily large bug.  I'm sure there are
other ways to fix this, but please propose those before this patch.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

21 January 2009, 11:18:59

I wrote:
> I was thinking that the launcher should only request fresh stats at wakeup,
> the workers could then reuse that file. This could be implemented by calling
> pgstat_clear_snapshot only at launcher wakeup and setting max stats age to
> to autovacuum_naptime for the workers.
>

Attached is a patch that increases the autovacuum stats age tolerance to
autovacuum_naptime. This is handled by autovac_refresh_stats() by not clearing
the stats snapshot unless nap time elapsed or explicitly forced by an error
or SIGHUP.

For the time being, I left the table vacuum recheck in place. Removing the
table_recheck_autovac function requires some further work. I have started on
this, but decided to defer until it is clear whether the whole approach is
acceptable or not.

regards,
Martin

*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
***************
*** 44,54 ****
   * Note that there can be more than one worker in a database concurrently.
   * They will store the table they are currently vacuuming in shared memory, so
   * that other workers avoid being blocked waiting for the vacuum lock for that
!  * table.  They will also reload the pgstats data just before vacuuming each
!  * table, to avoid vacuuming a table that was just finished being vacuumed by
!  * another worker and thus is no longer noted in shared memory.  However,
!  * there is a window (caused by pgstat delay) on which a worker may choose a
!  * table that was already vacuumed; this is a bug in the current design.
   *
   * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
   * Portions Copyright (c) 1994, Regents of the University of California
--- 44,53 ----
   * Note that there can be more than one worker in a database concurrently.
   * They will store the table they are currently vacuuming in shared memory, so
   * that other workers avoid being blocked waiting for the vacuum lock for that
!  * table. There is a possibility that a worker might pick up a table that was
!  * already vacuumed by another process. This isn't really a problem, as the
!  * odds of this happening are low and the revacuum is made cheap by the use of
!  * visibility map.
   *
   * Portions Copyright (c) 1996-2009, PostgreSQL Global Development Group
   * Portions Copyright (c) 1994, Regents of the University of California
***************
*** 120,128 **** int            autovacuum_vac_cost_limit;

  int            Log_autovacuum_min_duration = -1;

- /* how long to keep pgstat data in the launcher, in milliseconds */
- #define STATS_READ_DELAY 1000
-

  /* Flags to tell if we are in an autovacuum process */
  static bool am_autovacuum_launcher = false;
--- 119,124 ----
***************
*** 298,304 **** static void avl_sighup_handler(SIGNAL_ARGS);
  static void avl_sigusr1_handler(SIGNAL_ARGS);
  static void avl_sigterm_handler(SIGNAL_ARGS);
  static void avl_quickdie(SIGNAL_ARGS);
! static void autovac_refresh_stats(void);



--- 294,300 ----
  static void avl_sigusr1_handler(SIGNAL_ARGS);
  static void avl_sigterm_handler(SIGNAL_ARGS);
  static void avl_quickdie(SIGNAL_ARGS);
! static void autovac_refresh_stats(bool force);



***************
*** 500,509 **** AutoVacLauncherMain(int argc, char *argv[])
          DatabaseList = NULL;

          /*
!          * Make sure pgstat also considers our stat data as gone.  Note: we
!          * mustn't use autovac_refresh_stats here.
           */
!         pgstat_clear_snapshot();

          /* Now we can allow interrupts again */
          RESUME_INTERRUPTS();
--- 496,504 ----
          DatabaseList = NULL;

          /*
!          * Make sure pgstat also considers our stat data as gone.
           */
!         autovac_refresh_stats(true);

          /* Now we can allow interrupts again */
          RESUME_INTERRUPTS();
***************
*** 598,603 **** AutoVacLauncherMain(int argc, char *argv[])
--- 593,601 ----
          if (got_SIGTERM)
              break;

+         /* Refresh stats. Force it, if reloaded via SIGHUP */
+         autovac_refresh_stats(got_SIGHUP);
+
          if (got_SIGHUP)
          {
              got_SIGHUP = false;
***************
*** 851,859 **** rebuild_database_list(Oid newdb)
      int            nelems;
      HTAB       *dbhash;

-     /* use fresh stats */
-     autovac_refresh_stats();
-
      newcxt = AllocSetContextCreate(AutovacMemCxt,
                                     "AV dblist",
                                     ALLOCSET_DEFAULT_MINSIZE,
--- 849,854 ----
***************
*** 1078,1086 **** do_start_worker(void)
                                     ALLOCSET_DEFAULT_MAXSIZE);
      oldcxt = MemoryContextSwitchTo(tmpcxt);

-     /* use fresh stats */
-     autovac_refresh_stats();
-
      /* Get a list of databases */
      dblist = get_database_list();

--- 1073,1078 ----
***************
*** 2145,2158 **** do_autovacuum(void)
          }

          /*
!          * Check whether pgstat data still says we need to vacuum this table.
!          * It could have changed if something else processed the table while
!          * we weren't looking.
!          *
!          * Note: we have a special case in pgstat code to ensure that the stats
!          * we read are as up-to-date as possible, to avoid the problem that
!          * somebody just finished vacuuming this table.  The window to the race
!          * condition is not closed but it is very small.
           */
          MemoryContextSwitchTo(AutovacMemCxt);
          tab = table_recheck_autovac(relid, table_toast_map);
--- 2137,2145 ----
          }

          /*
!          * Check whether we still need to vacuum this table.  It could have
!          * changed if something else processed the table while we weren't
!          * looking.
           */
          MemoryContextSwitchTo(AutovacMemCxt);
          tab = table_recheck_autovac(relid, table_toast_map);
***************
*** 2377,2385 **** table_recheck_autovac(Oid relid, HTAB *table_toast_map)
      PgStat_StatDBEntry *dbentry;
      bool        wraparound;

-     /* use fresh stats */
-     autovac_refresh_stats();
-
      shared = pgstat_fetch_stat_dbentry(InvalidOid);
      dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);

--- 2364,2369 ----
***************
*** 2816,2844 **** AutoVacuumShmemInit(void)
   *        Refresh pgstats data for an autovacuum process
   *
   * Cause the next pgstats read operation to obtain fresh data, but throttle
!  * such refreshing in the autovacuum launcher.    This is mostly to avoid
!  * rereading the pgstats files too many times in quick succession when there
!  * are many databases.
!  *
!  * Note: we avoid throttling in the autovac worker, as it would be
!  * counterproductive in the recheck logic.
   */
  static void
! autovac_refresh_stats(void)
  {
!     if (IsAutoVacuumLauncherProcess())
!     {
!         static TimestampTz last_read = 0;
!         TimestampTz current_time;

!         current_time = GetCurrentTimestamp();

!         if (!TimestampDifferenceExceeds(last_read, current_time,
!                                         STATS_READ_DELAY))
!             return;

!         last_read = current_time;
!     }

      pgstat_clear_snapshot();
  }
--- 2800,2822 ----
   *        Refresh pgstats data for an autovacuum process
   *
   * Cause the next pgstats read operation to obtain fresh data, but throttle
!  * the refreshing so it only occurs at autovacuum_naptime intervals. This is
!  * mostly to avoid rereading the pgstats files too many times in quick
!  * succession when there are many databases.
   */
  static void
! autovac_refresh_stats(bool force)
  {
!     static TimestampTz last_read = 0;
!     TimestampTz current_time;

!     current_time = GetCurrentTimestamp();

!     if (!force && !TimestampDifferenceExceeds(last_read, current_time,
!                                               autovacuum_naptime * 1000))
!         return;

!     last_read = current_time;

      pgstat_clear_snapshot();
  }
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 3388,3396 **** backend_read_statsfile(void)
       * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL msec
       * before now.  This indirectly ensures that the collector needn't write
       * the file more often than PGSTAT_STAT_INTERVAL.  In an autovacuum
!      * worker, however, we want a lower delay to avoid using stale data, so we
!      * use PGSTAT_RETRY_DELAY (since the number of worker is low, this
!      * shouldn't be a problem).
       *
       * Note that we don't recompute min_ts after sleeping; so we might end up
       * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In practice
--- 3388,3395 ----
       * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL msec
       * before now.  This indirectly ensures that the collector needn't write
       * the file more often than PGSTAT_STAT_INTERVAL.  In an autovacuum
!      * worker, however, we can reuse the stats file that was requested by the
!      * launcher.
       *
       * Note that we don't recompute min_ts after sleeping; so we might end up
       * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In practice
***************
*** 3400,3406 **** backend_read_statsfile(void)
       */
      if (IsAutoVacuumWorkerProcess())
          min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
!                                              -PGSTAT_RETRY_DELAY);
      else
          min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                               -PGSTAT_STAT_INTERVAL);
--- 3399,3405 ----
       */
      if (IsAutoVacuumWorkerProcess())
          min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
!                                              -autovacuum_naptime * 1000);
      else
          min_ts = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
                                               -PGSTAT_STAT_INTERVAL);

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

21 January 2009, 12:16:09

Alvaro Herrera wrote:
> You missed putting back the BUG comment that used to be there about
> this.
> 

This was deliberate, I did mention the condition in the comment at
the beginning of the file. This actually makes it a feature :)

Seriously though, do you think that this is still a problem? Given
the rare occurrence of the revacuum and the fact that it is made
cheap by visibility map? In my initial testing, I couldn't reproduce
the revacuum. But I'll keep at it.

> In other words I think this is a bad idea, because there is a very wide
> window for a table to be vacuumed twice.  Since naptime can be
> arbitrarily large, this is an arbitrarily large bug.  I'm sure there are
> other ways to fix this, but please propose those before this patch.
> 

I was wondering that maybe the stats subsystem shouldn't be used for
vacuum tracking at all. It maybe convenient to use, but has several
deficiencies (pobig file, lossy, no crash safety, etc). Could we move
vacuum tracking to pg_class instead?

regards,
Martin

Re: reducing statistics write overhead

From

Alvaro Herrera

Date:

21 January 2009, 12:28:16

Martin Pihlak escribió:
> Alvaro Herrera wrote:
> > You missed putting back the BUG comment that used to be there about
> > this.
> 
> This was deliberate, I did mention the condition in the comment at
> the beginning of the file. This actually makes it a feature :)
> 
> Seriously though, do you think that this is still a problem? Given
> the rare occurrence of the revacuum and the fact that it is made
> cheap by visibility map?

Hmm, maybe it's no longer an issue with the visibility map, yes.

> I was wondering that maybe the stats subsystem shouldn't be used for
> vacuum tracking at all. It maybe convenient to use, but has several
> deficiencies (pobig file, lossy, no crash safety, etc). Could we move
> vacuum tracking to pg_class instead?

I agree that pgstats is not ideal (we've said this from the very
beginning), but I doubt that updating pg_class is the answer; you'd be
generating thousands of dead tuples there.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: reducing statistics write overhead

From

Heikki Linnakangas

Date:

21 January 2009, 14:26:12

Alvaro Herrera wrote:
> Martin Pihlak escribió:
>> Alvaro Herrera wrote:
>>> You missed putting back the BUG comment that used to be there about
>>> this.
>> This was deliberate, I did mention the condition in the comment at
>> the beginning of the file. This actually makes it a feature :)
>>
>> Seriously though, do you think that this is still a problem? Given
>> the rare occurrence of the revacuum and the fact that it is made
>> cheap by visibility map?
> 
> Hmm, maybe it's no longer an issue with the visibility map, yes.

You still have to scan all indexes, so it's still not free by any means.

(I haven't been paying attention to what kind of a risk we're talking 
about..)

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: reducing statistics write overhead

From

Martin Pihlak

Date:

21 January 2009, 15:45:16

Alvaro Herrera wrote:
> I agree that pgstats is not ideal (we've said this from the very
> beginning), but I doubt that updating pg_class is the answer; you'd be
> generating thousands of dead tuples there.
> 

But we already do update pg_class after vacuum -- in vac_update_relstats().
Hmm, that performs a heap_inplace_update() ... I assume that this is cheap,
but have no idea as if it is suitable for the purpouse.

regards,
Martin

Re: reducing statistics write overhead

From

Alvaro Herrera

Date:

21 January 2009, 16:17:22

Martin Pihlak escribió:
> Alvaro Herrera wrote:
> > I agree that pgstats is not ideal (we've said this from the very
> > beginning), but I doubt that updating pg_class is the answer; you'd be
> > generating thousands of dead tuples there.
> 
> But we already do update pg_class after vacuum -- in vac_update_relstats().
> Hmm, that performs a heap_inplace_update() ... I assume that this is cheap,
> but have no idea as if it is suitable for the purpouse.

Oh, sorry, I thought you were suggesting to use pg_class to store number
of tuples dead/alive/etc.

I had a patch to introduce a new type of table, which would only be used
for non-transactional updates.  That would allow what you're proposing.
I think we discussed something similar to what you propose and rejected
it for some reason I can't recall offhand.  Search the archives for
pg_class_nt and pg_ntclass, that might give you some ideas.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: reducing statistics write overhead

From

Tom Lane

Date:

22 January 2009, 17:53:41

Alvaro Herrera <alvherre@commandprompt.com> writes:
> Martin Pihlak escribi�:
>> [ patch to fool with stats refresh logic in autovac ]

(1) I still don't understand why we don't just make the launcher request
a new stats file once per naptime cycle, and then allow the workers to
work from that.

(2) The current code in autovacuum.c seems to be redundant with the
logic that now exists in the stats mechanism itself to decide whether a
stats file is too old.
        regards, tom lane

Re: reducing statistics write overhead

From

Alvaro Herrera

Date:

22 January 2009, 18:01:11

Tom Lane escribió:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Martin Pihlak escribi�:
> >> [ patch to fool with stats refresh logic in autovac ]
> 
> (1) I still don't understand why we don't just make the launcher request
> a new stats file once per naptime cycle, and then allow the workers to
> work from that.

The problem is workers that spend too much time on a single database.
If the stats at the time they both start say that a given table must be
vacuumed (consider for example that the first one spent too much time
vacuuming some other big table), they could end up both vacuuming that
table.  The second vacuum would be a waste.

This could be solved if the workers kept the whole history of tables
that they have vacuumed.  Currently we keep only a single table (the one
being vacuumed right now).  I proposed writing these history files back
when workers were first implemented, but the idea was shot down before
flying very far because it was way too complex (the rest of the patch
was more than complex enough.)  Maybe we can implement this now.

> (2) The current code in autovacuum.c seems to be redundant with the
> logic that now exists in the stats mechanism itself to decide whether a
> stats file is too old.

Hmm, yeah, possibly.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: reducing statistics write overhead

From

Euler Taveira de Oliveira

Date:

22 January 2009, 19:02:39

Alvaro Herrera escreveu:
> This could be solved if the workers kept the whole history of tables
> that they have vacuumed.  Currently we keep only a single table (the one
> being vacuumed right now).  I proposed writing these history files back
> when workers were first implemented, but the idea was shot down before
> flying very far because it was way too complex (the rest of the patch
> was more than complex enough.)  Maybe we can implement this now.
> 
[I don't remember your proposal...] Isn't it just add a circular linked list
at AutoVacuumShmemStruct? Of course some lock mechanism needs to exist to
guarantee that we don't write at the same time. The size of this linked list
would be scale by a startup-time-guc or a reasonable fixed value.


--  Euler Taveira de Oliveira http://www.timbira.com/

Re: reducing statistics write overhead

From

Alvaro Herrera

Date:

22 January 2009, 19:14:52

Euler Taveira de Oliveira escribió:
> Alvaro Herrera escreveu:
> > This could be solved if the workers kept the whole history of tables
> > that they have vacuumed.  Currently we keep only a single table (the one
> > being vacuumed right now).  I proposed writing these history files back
> > when workers were first implemented, but the idea was shot down before
> > flying very far because it was way too complex (the rest of the patch
> > was more than complex enough.)  Maybe we can implement this now.
> > 
> [I don't remember your proposal...] Isn't it just add a circular linked list
> at AutoVacuumShmemStruct? Of course some lock mechanism needs to exist to
> guarantee that we don't write at the same time. The size of this linked list
> would be scale by a startup-time-guc or a reasonable fixed value.

Well, the problem is precisely how to size the list.  I don't like the
idea of keeping an arbitrary number in memory; it adds another
mostly-useless tunable that we'll need to answer questions about for all
eternity.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: reducing statistics write overhead

From

Euler Taveira de Oliveira

Date:

23 January 2009, 00:31:57

Alvaro Herrera escreveu:
> Euler Taveira de Oliveira escribió:
>> Alvaro Herrera escreveu:
>>> This could be solved if the workers kept the whole history of tables
>>> that they have vacuumed.  Currently we keep only a single table (the one
>>> being vacuumed right now).  I proposed writing these history files back
>>> when workers were first implemented, but the idea was shot down before
>>> flying very far because it was way too complex (the rest of the patch
>>> was more than complex enough.)  Maybe we can implement this now.
>>>
>> [I don't remember your proposal...] Isn't it just add a circular linked list
>> at AutoVacuumShmemStruct? Of course some lock mechanism needs to exist to
>> guarantee that we don't write at the same time. The size of this linked list
>> would be scale by a startup-time-guc or a reasonable fixed value.
> 
> Well, the problem is precisely how to size the list.  I don't like the
> idea of keeping an arbitrary number in memory; it adds another
> mostly-useless tunable that we'll need to answer questions about for all
> eternity.
> 
[Poking the code a little...] You're right. We could do that but it isn't an
elegant solution. What about tracking that information at table_oids?

struct table_oids {bool skipit;    /* initially false */Oid relid;
};


--  Euler Taveira de Oliveira http://www.timbira.com/

Re: reducing statistics write overhead

From

Tom Lane

Date:

23 January 2009, 00:59:10

Euler Taveira de Oliveira <euler@timbira.com> writes:
> Alvaro Herrera escreveu:
>> Well, the problem is precisely how to size the list.  I don't like the
>> idea of keeping an arbitrary number in memory; it adds another
>> mostly-useless tunable that we'll need to answer questions about for all
>> eternity.

Is it so hard?  In particular, rather than making it a tunable, what say
we freeze the list size at exactly two, ie each AV worker advertises its
current and most recent target table in shared memory.  Other workers
avoid re-vacuuming those.  Then the most work you can "waste" by extra
vacuuming is less than the maximum allowed stats file age.  I'd have no
problem whatsoever in letting that run into multiple seconds, as long
as it doesn't get into minutes or hours.
        regards, tom lane