Thread: [HACKERS] Logging idle checkpoints

[HACKERS] Logging idle checkpoints

From

Vik Fearing

Date:

02 October 2017, 04:19:33

I recently had a sad because I noticed that checkpoint counts were
increasing in pg_stat_bgwriter, but weren't accounted for in my logs
with log_checkpoints enabled.

After some searching, I found that it was the idle checkpoints that
weren't being logged.  I think this is a missed trick in 6ef2eba3f57.

Attached is a one-liner fix.  I realize how imminent we are from
releasing v10 but I hope there is still time for such a minor issue as this.
-- 
Vik Fearing                                          +33 6 46 75 15 36
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

log-idle-checkpoints.patch

Re: [HACKERS] Logging idle checkpoints

From

Andres Freund

Date:

02 October 2017, 04:27:46

Hi,

On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
> I recently had a sad because I noticed that checkpoint counts were
> increasing in pg_stat_bgwriter, but weren't accounted for in my logs
> with log_checkpoints enabled.
> 
> After some searching, I found that it was the idle checkpoints that
> weren't being logged.  I think this is a missed trick in 6ef2eba3f57.
> 
> Attached is a one-liner fix.  I realize how imminent we are from
> releasing v10 but I hope there is still time for such a minor issue as this.


> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index dd028a12a4..75f6bd4cc1 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -8724,7 +8724,7 @@ CreateCheckPoint(int flags)
>              WALInsertLockRelease();
>              LWLockRelease(CheckpointLock);
>              END_CRIT_SECTION();
> -            ereport(DEBUG1,
> +            ereport(log_checkpoints ? LOG : DEBUG1,
>                      (errmsg("checkpoint skipped because system is idle")));
>              return;
>          }

I'd be ok with applying this now, or in 10.1 - but I do think we should
fix this before 11.  If nobody protests I'll push later today, so we can
get some bf cycles for the very remote case that this causes problems.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Michael Paquier

Date:

02 October 2017, 04:39:18

On Mon, Oct 2, 2017 at 7:27 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
> I'd be ok with applying this now, or in 10.1 - but I do think we should
> fix this before 11.  If nobody protests I'll push later today, so we can
> get some bf cycles for the very remote case that this causes problems.

This point has been discussed during review and removed from the patch
(adding Stephen in the loop here):
https://www.postgresql.org/message-id/CAOuzzgq8pHneMHy6JiNiG6Xm5V=cm+K2wCd2W-SCtpJDg7Xn3g@mail.gmail.com
Actually, shouldn't we make BgWriterStats a bit smarter? We could add
a counter for skipped checkpoints in v11 (too late for v10).
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Andres Freund

Date:

02 October 2017, 04:41:48

On 2017-10-02 07:39:18 +0900, Michael Paquier wrote:
> On Mon, Oct 2, 2017 at 7:27 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
> > I'd be ok with applying this now, or in 10.1 - but I do think we should
> > fix this before 11.  If nobody protests I'll push later today, so we can
> > get some bf cycles for the very remote case that this causes problems.
> 
> This point has been discussed during review and removed from the patch
> (adding Stephen in the loop here):
> https://www.postgresql.org/message-id/CAOuzzgq8pHneMHy6JiNiG6Xm5V=cm+K2wCd2W-SCtpJDg7Xn3g@mail.gmail.com

I find that reasoning unconvincing. log_checkpoints is enabled after
all. And we're not talking about 10 log messages a second. There's
plenty systems that analyze the logs that'd possibly be affected by
this.


> Actually, shouldn't we make BgWriterStats a bit smarter? We could add
> a counter for skipped checkpoints in v11 (too late for v10).

Wouldn't hurt, but seems orthogonal.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Michael Paquier

Date:

02 October 2017, 04:43:31

On Mon, Oct 2, 2017 at 7:41 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-10-02 07:39:18 +0900, Michael Paquier wrote:
>> On Mon, Oct 2, 2017 at 7:27 AM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
>> > I'd be ok with applying this now, or in 10.1 - but I do think we should
>> > fix this before 11.  If nobody protests I'll push later today, so we can
>> > get some bf cycles for the very remote case that this causes problems.
>>
>> This point has been discussed during review and removed from the patch
>> (adding Stephen in the loop here):
>> https://www.postgresql.org/message-id/CAOuzzgq8pHneMHy6JiNiG6Xm5V=cm+K2wCd2W-SCtpJDg7Xn3g@mail.gmail.com
>
> I find that reasoning unconvincing. log_checkpoints is enabled after
> all. And we're not talking about 10 log messages a second. There's
> plenty systems that analyze the logs that'd possibly be affected by
> this.

No real objections from here, actually.

>> Actually, shouldn't we make BgWriterStats a bit smarter? We could add
>> a counter for skipped checkpoints in v11 (too late for v10).
>
> Wouldn't hurt, but seems orthogonal.

Sure.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Andres Freund

Date:

02 October 2017, 09:14:23

On 2017-10-02 07:43:31 +0900, Michael Paquier wrote:
> On Mon, Oct 2, 2017 at 7:41 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-10-02 07:39:18 +0900, Michael Paquier wrote:
> >> On Mon, Oct 2, 2017 at 7:27 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > On 2017-10-02 00:19:33 +0200, Vik Fearing wrote:
> >> > I'd be ok with applying this now, or in 10.1 - but I do think we should
> >> > fix this before 11.  If nobody protests I'll push later today, so we can
> >> > get some bf cycles for the very remote case that this causes problems.
> >>
> >> This point has been discussed during review and removed from the patch
> >> (adding Stephen in the loop here):
> >> https://www.postgresql.org/message-id/CAOuzzgq8pHneMHy6JiNiG6Xm5V=cm+K2wCd2W-SCtpJDg7Xn3g@mail.gmail.com
> >
> > I find that reasoning unconvincing. log_checkpoints is enabled after
> > all. And we're not talking about 10 log messages a second. There's
> > plenty systems that analyze the logs that'd possibly be affected by
> > this.
> 
> No real objections from here, actually.

Vik, because there was some, even though mild, objections, I'd rather
not push this right now. Stephen deserves a chance to reply.  So this'll
have to wait for 10.1, sorry :(

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Stephen Frost

Date:

02 October 2017, 21:01:30

Vik, all,

* Vik Fearing (vik.fearing@2ndquadrant.com) wrote:
> I recently had a sad because I noticed that checkpoint counts were
> increasing in pg_stat_bgwriter, but weren't accounted for in my logs
> with log_checkpoints enabled.

> After some searching, I found that it was the idle checkpoints that
> weren't being logged.  I think this is a missed trick in 6ef2eba3f57.

> Attached is a one-liner fix.  I realize how imminent we are from
> releasing v10 but I hope there is still time for such a minor issue as this.

Idle checkpoints aren't, well, really checkpoints though.  If anything,
seems like we shouldn't be including skipped checkpoints in the
pg_stat_bgwriter count because we aren't actually doing a checkpoint.

I certainly don't care for the idea of adding log messages saying we
aren't doing anything just to match a count that's incorrectly claiming
that checkpoints are happening when they aren't.

The down-thread suggestion of keeping track of skipped checkpoints might
be interesting, but I'm not entirely convinced it really is.  We have
time to debate that, of course, but I don't really see how that's
helpful.  At the moment, it seems like the suggestion to add that column
is based on the assumption that we're going to start logging skipped
checkpoints and having that column would allow us to match up the count
between the new column and the "skipped checkpoint" messages in the logs
and I can not help but feel that this is a ridiculous amount of effort
being put into the analysis of something that *didn't* happen.

Thanks!

Stephen

Re: [HACKERS] Logging idle checkpoints

From

Michael Paquier

Date:

03 October 2017, 07:23:08

On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
> I certainly don't care for the idea of adding log messages saying we
> aren't doing anything just to match a count that's incorrectly claiming
> that checkpoints are happening when they aren't.
>
> The down-thread suggestion of keeping track of skipped checkpoints might
> be interesting, but I'm not entirely convinced it really is.  We have
> time to debate that, of course, but I don't really see how that's
> helpful.  At the moment, it seems like the suggestion to add that column
> is based on the assumption that we're going to start logging skipped
> checkpoints and having that column would allow us to match up the count
> between the new column and the "skipped checkpoint" messages in the logs
> and I can not help but feel that this is a ridiculous amount of effort
> being put into the analysis of something that *didn't* happen.

Being able to look at how many checkpoints are skipped can be used as
a tuning indicator of max_wal_size and checkpoint_timeout, or in short
increase them if those remain idle. Since their introduction in
335feca4, m_timed_checkpoints and m_requested_checkpoints track the
number of checkpoint requests, not if a checkpoint has been actually
executed or not, I am not sure that this should be changed after 10
years. So, to put it in other words, wouldn't we want a way to track
checkpoints that are *executed*, meaning that we could increment a
counter after doing the skip checks in CreateRestartPoint() and
CreateCheckPoint()?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Kyotaro HORIGUCHI

Date:

03 October 2017, 10:17:00

At Tue, 3 Oct 2017 10:23:08 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQ3Q1J_wBC7yPXk39dO0RGvbM4-nYp2gMrCJ7pfPJXcYw@mail.gmail.com>
> On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > I certainly don't care for the idea of adding log messages saying we
> > aren't doing anything just to match a count that's incorrectly claiming
> > that checkpoints are happening when they aren't.
> >
> > The down-thread suggestion of keeping track of skipped checkpoints might
> > be interesting, but I'm not entirely convinced it really is.  We have
> > time to debate that, of course, but I don't really see how that's
> > helpful.  At the moment, it seems like the suggestion to add that column
> > is based on the assumption that we're going to start logging skipped
> > checkpoints and having that column would allow us to match up the count
> > between the new column and the "skipped checkpoint" messages in the logs
> > and I can not help but feel that this is a ridiculous amount of effort
> > being put into the analysis of something that *didn't* happen.
> 
> Being able to look at how many checkpoints are skipped can be used as
> a tuning indicator of max_wal_size and checkpoint_timeout, or in short
> increase them if those remain idle.

We ususally adjust the GUCs based on how often checkpoint is
*executed* and how many of the executed checkpoints have been
triggered by xlog progress (or with shorter interval than
timeout). It seems enough. Counting skipped checkpoints gives
just a rough estimate of how long the system was getting no
substantial updates. I doubt that users get something valuable by
counting skipped checkpoints.

> Since their introduction in
> 335feca4, m_timed_checkpoints and m_requested_checkpoints track the
> number of checkpoint requests, not if a checkpoint has been actually
> executed or not, I am not sure that this should be changed after 10
> years. So, to put it in other words, wouldn't we want a way to track
> checkpoints that are *executed*, meaning that we could increment a
> counter after doing the skip checks in CreateRestartPoint() and
> CreateCheckPoint()?

This sounds reasonable to me.

CreateRestartPoint() is already returning ckpt_performed, it is
used to let checkpointer retry in 15 seconds rather than waiting
the next checkpoint_timeout. Checkpoint might deserve the same
treatment on skipping.

By the way RestartCheckPoint emits DEBUG2 messages on skipping.
Although restartpoint has different characteristics from
checkpoint, if we change the message level for CreateCheckPoint
(currently DEBUG1), CreateRestartPoint might should get the same
change.  (Elsewise at least they ought to have the same message
level?)

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Stephen Frost

Date:

03 October 2017, 18:22:27

Greetings,

* Kyotaro HORIGUCHI (horiguchi.kyotaro@lab.ntt.co.jp) wrote:
> At Tue, 3 Oct 2017 10:23:08 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQ3Q1J_wBC7yPXk39dO0RGvbM4-nYp2gMrCJ7pfPJXcYw@mail.gmail.com>
> > On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > > I certainly don't care for the idea of adding log messages saying we
> > > aren't doing anything just to match a count that's incorrectly claiming
> > > that checkpoints are happening when they aren't.
> > >
> > > The down-thread suggestion of keeping track of skipped checkpoints might
> > > be interesting, but I'm not entirely convinced it really is.  We have
> > > time to debate that, of course, but I don't really see how that's
> > > helpful.  At the moment, it seems like the suggestion to add that column
> > > is based on the assumption that we're going to start logging skipped
> > > checkpoints and having that column would allow us to match up the count
> > > between the new column and the "skipped checkpoint" messages in the logs
> > > and I can not help but feel that this is a ridiculous amount of effort
> > > being put into the analysis of something that *didn't* happen.
> >
> > Being able to look at how many checkpoints are skipped can be used as
> > a tuning indicator of max_wal_size and checkpoint_timeout, or in short
> > increase them if those remain idle.
>
> We ususally adjust the GUCs based on how often checkpoint is
> *executed* and how many of the executed checkpoints have been
> triggered by xlog progress (or with shorter interval than
> timeout). It seems enough. Counting skipped checkpoints gives
> just a rough estimate of how long the system was getting no
> substantial updates. I doubt that users get something valuable by
> counting skipped checkpoints.

Yeah, I tend to agree.  I don't really see how counting skipped
checkpoints helps to size max_wal_size or even checkpoint_timeout.  The
whole point here is that nothing is happening and if nothing is
happening then there's no real need to adjust max_wal_size or
checkpoint_timeout or, well, much of anything really..

> > Since their introduction in
> > 335feca4, m_timed_checkpoints and m_requested_checkpoints track the
> > number of checkpoint requests, not if a checkpoint has been actually
> > executed or not, I am not sure that this should be changed after 10
> > years. So, to put it in other words, wouldn't we want a way to track
> > checkpoints that are *executed*, meaning that we could increment a
> > counter after doing the skip checks in CreateRestartPoint() and
> > CreateCheckPoint()?
>
> This sounds reasonable to me.

I agree that tracking executed checkpoints is valuable, but, and perhaps
I'm missing something, isn't that the same as tracking non-skipped
checkpoints?  I suppose we could have both, if we really feel the need,
provided that doesn't result in more work or effort being done than
simply keeping the count.  I'd hate to end up in a situation where we're
writing things out unnecessairly just to keep track of checkpoints that
were requested but ultimately skipped because there wasn't anything to
do.

Thanks!

Stephen

Re: [HACKERS] Logging idle checkpoints

From

Kyotaro HORIGUCHI

Date:

05 October 2017, 11:00:41

At Tue, 3 Oct 2017 08:22:27 -0400, Stephen Frost <sfrost@snowman.net> wrote in
<20171003122227.GJ4628@tamriel.snowman.net>
> Greetings,
> 
> * Kyotaro HORIGUCHI (horiguchi.kyotaro@lab.ntt.co.jp) wrote:
> > At Tue, 3 Oct 2017 10:23:08 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQ3Q1J_wBC7yPXk39dO0RGvbM4-nYp2gMrCJ7pfPJXcYw@mail.gmail.com>
> > > On Tue, Oct 3, 2017 at 12:01 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > > Since their introduction in
> > > 335feca4, m_timed_checkpoints and m_requested_checkpoints track the
> > > number of checkpoint requests, not if a checkpoint has been actually
> > > executed or not, I am not sure that this should be changed after 10
> > > years. So, to put it in other words, wouldn't we want a way to track
> > > checkpoints that are *executed*, meaning that we could increment a
> > > counter after doing the skip checks in CreateRestartPoint() and
> > > CreateCheckPoint()?
> > 
> > This sounds reasonable to me.
> 
> I agree that tracking executed checkpoints is valuable, but, and perhaps
> I'm missing something, isn't that the same as tracking non-skipped
> checkpoints? I suppose we could have both, if we really feel the need,
> provided that doesn't result in more work or effort being done than
> simply keeping the count.  I'd hate to end up in a situation where we're
> writing things out unnecessairly just to keep track of checkpoints that
> were requested but ultimately skipped because there wasn't anything to
> do.

I'm fine with counting both executed and skipped. But perhaps the
time of lastest checkpoint fits the concern better, like
vacuum. It is seen in control file but not in system views. If we
have count skipped checkpoints, I'd like to see the time (or LSN)
of last checkpoint in system views.
 checkpoints_timed     | bigint                   |           |          |  checkpoints_req       | bigint
    |           |          | 
 
+ checkpoints_skipped   | bigint
+ last_checkpint        | timestamp with time zone or LSN?


# This reminded me of a concern. I'd like to count vacuums that
# are required but skipped by lock-failure, or killed by other
# backend.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Alvaro Herrera

Date:

05 October 2017, 17:41:42

Kyotaro HORIGUCHI wrote:

> # This reminded me of a concern. I'd like to count vacuums that
> # are required but skipped by lock-failure, or killed by other
> # backend.

We clearly need to improve the stats and logs related to vacuuming work
executed, both by autovacuum and manually invoked.  One other item I
have in my head is to report numbers related to the truncation phase of
a vacuum run, since in some cases it causes horrible and hard to
diagnose problems.  (Also, add an reloption to stop vacuum from doing
the truncation phase at all -- for some usage patterns that is a serious
problem.)

However, please do open a new thread about it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Logging idle checkpoints

From

Kyotaro HORIGUCHI

Date:

05 October 2017, 18:02:41

At Thu, 5 Oct 2017 13:41:42 +0200, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote in
<20171005114142.dupjeqe2cnplhgkx@alvherre.pgsql>
> Kyotaro HORIGUCHI wrote:
> 
> > # This reminded me of a concern. I'd like to count vacuums that
> > # are required but skipped by lock-failure, or killed by other
> > # backend.
> 
> We clearly need to improve the stats and logs related to vacuuming work
> executed, both by autovacuum and manually invoked.  One other item I
> have in my head is to report numbers related to the truncation phase of
> a vacuum run, since in some cases it causes horrible and hard to
> diagnose problems.  (Also, add an reloption to stop vacuum from doing
> the truncation phase at all -- for some usage patterns that is a serious
> problem.)
> 
> However, please do open a new thread about it.

Thanks! Will do after a bit time of organization of the thougts.

reagareds,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

10 October 2017, 16:26:16

Hello.
Once in a while I am asked about table bloat. In most cases the
cause is long lasting transactions and vacuum canceling in some
cases. Whatever the case users don't have enough clues to why
they have bloated tables.

At the top of the annoyances list for users would be that they
cannot know whether autovacuum decided that a table needs vacuum
or not. I suppose that it could be shown in pg_stat_*_tables.
n_mod_since_analyze | 20000
+ vacuum_requred | true last_vacuum | 2017-10-10 17:21:54.380805+09

If vacuum_required remains true for a certain time, it means that
vacuuming stopped halfway or someone is killing it repeatedly.
That status could be shown in the same view.
n_mod_since_analyze | 20000
+ vacuum_requred | true last_vacuum | 2017-10-10 17:21:54.380805+09 last_autovacuum
| 2017-10-10 17:21:54.380805+09

+ last_autovacuum_status | Killed by lock conflict

Where the "Killed by lock conflict" would be one of the followings.
- Completed (oldest xmin = 8023) - May not be fully truncated (yielded at 1324 of 6447 expected) - Truncation skipped
-Skipped by lock failure - Killed by lock conflict

If we want more formal expression, we can show the values in the
following shape. And adding some more values could be useful.
n_mod_since_analyze | 20000
+ vacuum_requred | true
+ last_vacuum_oldest_xid | 8023
+ last_vacuum_left_to_truncate | 5123
+ last_vacuum_truncated | 387 last_vacuum | 2017-10-10 17:21:54.380805+09 last_autovacuum
| 2017-10-10 17:21:54.380805+09

+ last_autovacuum_status | Killed by lock conflict
... autovacuum_count | 128
+ incomplete_autovacuum_count | 53

# The last one might be needless..

Where the "Killed by lock conflict" is one of the followings.
- Completed - Truncation skipped - Partially truncated - Skipped - Killed by lock conflict

This seems enough to find the cause of a table bloat. The same
discussion could be applied to analyze but it might be the
another issue.

There may be a better way to indicate the vacuum soundness. Any
opinions and suggestions are welcome.

I'm going to make a patch to do the 'formal' one for the time
being.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More stats about skipped vacuums

From

Masahiko Sawada

Date:

20 October 2017, 16:15:16

On Tue, Oct 10, 2017 at 7:26 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello.
> Once in a while I am asked about table bloat. In most cases the
> cause is long lasting transactions and vacuum canceling in some
> cases. Whatever the case users don't have enough clues to why
> they have bloated tables.
>
> At the top of the annoyances list for users would be that they
> cannot know whether autovacuum decided that a table needs vacuum
> or not. I suppose that it could be shown in pg_stat_*_tables.
>
>   n_mod_since_analyze | 20000
> + vacuum_requred      | true
>   last_vacuum         | 2017-10-10 17:21:54.380805+09
>
> If vacuum_required remains true for a certain time, it means that
> vacuuming stopped halfway or someone is killing it repeatedly.
> That status could be shown in the same view.

Because the table statistics is updated at end of the vacuum I think
that the autovacuum will process the table at the next cycle if it has
stopped halfway or has killed. So you mean that vacuum_required is for
uses who want to reclaim garbage without wait for autovacuum retry?

>   n_mod_since_analyze         | 20000
> + vacuum_requred              | true
>   last_vacuum                 | 2017-10-10 17:21:54.380805+09
>   last_autovacuum             | 2017-10-10 17:21:54.380805+09
> + last_autovacuum_status      | Killed by lock conflict
>
> Where the "Killed by lock conflict" would be one of the followings.
>
>   - Completed (oldest xmin = 8023)
>   - May not be fully truncated (yielded at 1324 of 6447 expected)
>   - Truncation skipped
>   - Skipped by lock failure
>   - Killed by lock conflict
>
>
> If we want more formal expression, we can show the values in the
> following shape. And adding some more values could be useful.
>
>   n_mod_since_analyze          | 20000
> + vacuum_requred               | true
> + last_vacuum_oldest_xid       | 8023
> + last_vacuum_left_to_truncate | 5123
> + last_vacuum_truncated        | 387
>   last_vacuum                  | 2017-10-10 17:21:54.380805+09
>   last_autovacuum              | 2017-10-10 17:21:54.380805+09
> + last_autovacuum_status       | Killed by lock conflict
> ...
>   autovacuum_count             | 128
> + incomplete_autovacuum_count  | 53
>
> # The last one might be needless..

I'm not sure that the above informations will help for users or DBA
but personally I sometimes want to have the number of index scans of
the last autovacuum in the pg_stat_user_tables view. That value
indicates how efficiently vacuums performed and would be a signal to
increase the setting of autovacuum_work_mem for user.

> Where the "Killed by lock conflict" is one of the followings.
>
>    - Completed
>    - Truncation skipped
>    - Partially truncated
>    - Skipped
>    - Killed by lock conflict
>
> This seems enough to find the cause of a table bloat. The same
> discussion could be applied to analyze but it might be the
> another issue.
>
> There may be a better way to indicate the vacuum soundness. Any
> opinions and suggestions are welcome.
>
> I'm going to make a patch to do the 'formal' one for the time
> being.
>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

26 October 2017, 12:06:30

Mmm. I've failed to create a brand-new thread..

Thank you for the comment.

At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAkaw-u0feAVN_VrKZA5tvzp7jT=mQCQP-SvMegKXHHaw@mail.gmail.com>
> On Tue, Oct 10, 2017 at 7:26 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello.
> > Once in a while I am asked about table bloat. In most cases the
> > cause is long lasting transactions and vacuum canceling in some
> > cases. Whatever the case users don't have enough clues to why
> > they have bloated tables.
> >
> > At the top of the annoyances list for users would be that they
> > cannot know whether autovacuum decided that a table needs vacuum
> > or not. I suppose that it could be shown in pg_stat_*_tables.
> >
> >   n_mod_since_analyze | 20000
> > + vacuum_requred      | true
> >   last_vacuum         | 2017-10-10 17:21:54.380805+09
> >
> > If vacuum_required remains true for a certain time, it means that
> > vacuuming stopped halfway or someone is killing it repeatedly.
> > That status could be shown in the same view.
> 
> Because the table statistics is updated at end of the vacuum I think
> that the autovacuum will process the table at the next cycle if it has
> stopped halfway or has killed. So you mean that vacuum_required is for
> uses who want to reclaim garbage without wait for autovacuum retry?

It could be used for the purpose and for just knowing that a
table is left for a long time needing a vacuum and it would be a
trigger for users to take measures to deal with the situation.

> >   n_mod_since_analyze         | 20000
> > + vacuum_requred              | true
> >   last_vacuum                 | 2017-10-10 17:21:54.380805+09
> >   last_autovacuum             | 2017-10-10 17:21:54.380805+09
> > + last_autovacuum_status      | Killed by lock conflict
> >
> > Where the "Killed by lock conflict" would be one of the followings.
> >
> >   - Completed (oldest xmin = 8023)
> >   - May not be fully truncated (yielded at 1324 of 6447 expected)
> >   - Truncation skipped
> >   - Skipped by lock failure
> >   - Killed by lock conflict
> >
> >
> > If we want more formal expression, we can show the values in the
> > following shape. And adding some more values could be useful.
> >
> >   n_mod_since_analyze          | 20000
> > + vacuum_requred               | true
> > + last_vacuum_oldest_xid       | 8023
> > + last_vacuum_left_to_truncate | 5123
> > + last_vacuum_truncated        | 387
> >   last_vacuum                  | 2017-10-10 17:21:54.380805+09
> >   last_autovacuum              | 2017-10-10 17:21:54.380805+09
> > + last_autovacuum_status       | Killed by lock conflict
> > ...
> >   autovacuum_count             | 128
> > + incomplete_autovacuum_count  | 53
> >
> > # The last one might be needless..
> 
> I'm not sure that the above informations will help for users or DBA
> but personally I sometimes want to have the number of index scans of
> the last autovacuum in the pg_stat_user_tables view. That value
> indicates how efficiently vacuums performed and would be a signal to
> increase the setting of autovacuum_work_mem for user.

Btree and all existing index AMs (except brin) seem to visit the
all pages in every index scan so it would be valuable. Instead
the number of visited index pages during a table scan might be
usable. It is more relevant to performance than the number of
scans, on the other hand it is a bit difficult to get something
worth from the number in a moment. I'll show the number of scans
in the first cut.

> > Where the "Killed by lock conflict" is one of the followings.
> >
> >    - Completed
> >    - Truncation skipped
> >    - Partially truncated
> >    - Skipped
> >    - Killed by lock conflict
> >
> > This seems enough to find the cause of a table bloat. The same
> > discussion could be applied to analyze but it might be the
> > another issue.
> >
> > There may be a better way to indicate the vacuum soundness. Any
> > opinions and suggestions are welcome.
> >
> > I'm going to make a patch to do the 'formal' one for the time
> > being.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

30 October 2017, 17:57:50

At Thu, 26 Oct 2017 15:06:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20171026.150630.115694437.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAkaw-u0feAVN_VrKZA5tvzp7jT=mQCQP-SvMegKXHHaw@mail.gmail.com>
> > >   n_mod_since_analyze          | 20000
> > > + vacuum_requred               | true
> > > + last_vacuum_oldest_xid       | 8023
> > > + last_vacuum_left_to_truncate | 5123
> > > + last_vacuum_truncated        | 387
> > >   last_vacuum                  | 2017-10-10 17:21:54.380805+09
> > >   last_autovacuum              | 2017-10-10 17:21:54.380805+09
> > > + last_autovacuum_status       | Killed by lock conflict
> > > ...
> > >   autovacuum_count             | 128
> > > + incomplete_autovacuum_count  | 53
> > >
> > > # The last one might be needless..
> > 
> > I'm not sure that the above informations will help for users or DBA
> > but personally I sometimes want to have the number of index scans of
> > the last autovacuum in the pg_stat_user_tables view. That value
> > indicates how efficiently vacuums performed and would be a signal to
> > increase the setting of autovacuum_work_mem for user.
> 
> Btree and all existing index AMs (except brin) seem to visit the
> all pages in every index scan so it would be valuable. Instead
> the number of visited index pages during a table scan might be
> usable. It is more relevant to performance than the number of
> scans, on the other hand it is a bit difficult to get something
> worth from the number in a moment. I'll show the number of scans
> in the first cut.
> 
> > > Where the "Killed by lock conflict" is one of the followings.
> > >
> > >    - Completed
> > >    - Truncation skipped
> > >    - Partially truncated
> > >    - Skipped
> > >    - Killed by lock conflict
> > >
> > > This seems enough to find the cause of a table bloat. The same
> > > discussion could be applied to analyze but it might be the
> > > another issue.
> > >
> > > There may be a better way to indicate the vacuum soundness. Any
> > > opinions and suggestions are welcome.
> > >
> > > I'm going to make a patch to do the 'formal' one for the time
> > > being.

Done with small modifications. In the attached patch
pg_stat_all_tables has the following new columns. Documentations
is not provided at this stage.

----- n_mod_since_analyze     | 0
+ vacuum_required         | not requried last_vacuum             |  last_autovacuum         | 2017-10-30
18:51:32.060551+09last_analyze            |  last_autoanalyze        | 2017-10-30 18:48:33.414711+09 vacuum_count
    | 0
 
+ last_vacuum_truncated   | 0
+ last_vacuum_untruncated | 0
+ last_vacuum_index_scans | 0
+ last_vacuum_oldest_xmin | 2134
+ last_vacuum_status      | agressive vacuum completed
+ autovacuum_fail_count   | 0 autovacuum_count        | 5 analyze_count           | 0 autoanalyze_count       | 1
-----
Where each column shows the following infomation.

+ vacuum_required         | not requried
VACUUM requirement status. Takes the following values.
 - partial   Partial (or normal) will be performed by the next autovacuum.   The word "partial" is taken from the
commentfor   vacuum_set_xid_limits.
 
 - aggressive   Aggressive scan will be performed by the next autovacuum.
 - required   Any type of autovacuum will be performed. The type of scan is   unknown because the view failed to take
therequired lock on   the table. (AutoVacuumrequirement())
 
 - not required   Next autovacuum won't perform scan on this relation.
 - not required (lock not acquired)
   Autovacuum should be disabled and the distance to   freeze-limit is not known because required lock is not
available.
 - close to freeze-limit xid   Shown while autovacuum is disabled. The table is in the   manual vacuum window to avoid
anti-wraparoundautovacuum.
 

+ last_vacuum_truncated | 0
 The number of truncated pages in the last completed (auto)vacuum.

+ last_vacuum_untruncated | 0 The number of pages the last completed (auto)vacuum tried to truncate but could not for
somereason.
 

+ last_vacuum_index_scans | 0 The number of index scans performed in the last completed (auto)vacuum.

+ last_vacuum_oldest_xmin | 2134 The oldest xmin used in the last completed (auto)vacuum.

+ last_vacuum_status      | agressive vacuum completed
 The finish status of the last vacuum. Takes the following values. (pg_stat_get_last_vacuum_status())
  - completed    The last partial (auto)vacuum is completed.
  - vacuum full completed    The last VACUUM FULL is completed.
  - aggressive vacuum completed    The last aggressive (auto)vacuum is completed.
  - error while $progress    The last vacuum stopped by error while $progress.    The $progress one of the vacuum
progressphases.
 
  - canceled while $progress    The last vacuum was canceled while $progress
    This is caused by user cancellation of manual vacuum or    killed by another backend who wants to acquire lock on
the   relation.
 
  - skipped - lock unavailable    The last autovacuum on the relation was skipped because    required lock was not
available.
  - unvacuumable    A past autovacuum tried vacuum on the relation but it is not    vacuumable for reasons of ownership
oraccessibility problem.    (Such relations are not shown in pg_stat_all_tables..)
 

+ autovacuum_fail_count   | 0 The number of successive failure of vacuum on the relation. Reset to zero by completed
vacuum.

======

In the patch, vacrelstats if pointed from a static variable and
cancel reporting is performed in PG_CATCH() section in vacuum().
Every unthrown error like lock acquisition failure is reported by
explicit pgstat_report_vacuum() with the corresponding finish
code.

Vacuum requirement status is calculated in AutoVacuumRequirment()
and returned as a string. Access share lock on the target
relation is required but it returns only available values if the
lock is not available. I decided to return incomplete (coarse
grained) result than wait for a lock that isn't known to be
relased in a short time for a perfect result.


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 336748b61559bee66328a241394b365ebaacba6a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 27 Oct 2017 17:36:12 +0900
Subject: [PATCH] Add several vacuum information in pg_stat_*_tables.

---src/backend/catalog/system_views.sql |   7 ++src/backend/commands/cluster.c       |   2
+-src/backend/commands/vacuum.c       | 105 ++++++++++++++++++++++--src/backend/commands/vacuumlazy.c    | 103
+++++++++++++++++++++---src/backend/postmaster/autovacuum.c | 115
++++++++++++++++++++++++++src/backend/postmaster/pgstat.c     |  80
+++++++++++++++---src/backend/utils/adt/pgstatfuncs.c | 152
++++++++++++++++++++++++++++++++++-src/include/catalog/pg_proc.h       |  14 ++++src/include/commands/vacuum.h        |
 4 +-src/include/pgstat.h                 |  38 ++++++++-src/include/postmaster/autovacuum.h  |   1
+src/test/regress/expected/rules.out |  21 +++++12 files changed, 606 insertions(+), 36 deletions(-)
 

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index dc40cde..452bf5d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -523,11 +523,18 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_live_tuples(C.oid) AS n_live_tup,
   pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,            pg_stat_get_mod_since_analyze(C.oid) AS
n_mod_since_analyze,
+            pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,            pg_stat_get_last_vacuum_time(C.oid) as
last_vacuum,           pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid)as last_analyze,            pg_stat_get_last_autoanalyze_time(C.oid) as
last_autoanalyze,           pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
 
+            pg_stat_get_last_vacuum_truncated(C.oid) AS last_vacuum_truncated,
+            pg_stat_get_last_vacuum_untruncated(C.oid) AS last_vacuum_untruncated,
+            pg_stat_get_last_vacuum_index_scans(C.oid) AS last_vacuum_index_scans,
+            pg_stat_get_last_vacuum_oldest_xmin(C.oid) AS last_vacuum_oldest_xmin,
+            pg_stat_get_last_vacuum_status(C.oid) AS last_vacuum_status,
+            pg_stat_get_autovacuum_fail_count(C.oid) AS autovacuum_fail_count,
pg_stat_get_autovacuum_count(C.oid)AS autovacuum_count,            pg_stat_get_analyze_count(C.oid) AS analyze_count,
        pg_stat_get_autoanalyze_count(C.oid) AS autoanalyze_count
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 48f1e6e..403b76d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -850,7 +850,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,     */
vacuum_set_xid_limits(OldHeap,0, 0, 0, 0,                          &OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
 
-                          NULL);
+                          NULL, NULL, NULL);    /*     * FreezeXid will become the table's new relfrozenxid, and that
mustn'tgo
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index cbd6e9b..a0c5a12 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -35,6 +35,7 @@#include "catalog/pg_inherits_fn.h"#include "catalog/pg_namespace.h"#include "commands/cluster.h"
+#include "commands/progress.h"#include "commands/vacuum.h"#include "miscadmin.h"#include "nodes/makefuncs.h"
@@ -367,6 +368,9 @@ vacuum(int options, List *relations, VacuumParams *params,    }    PG_CATCH();    {
+        /* report the final status of this vacuum */
+        lazy_vacuum_cancel_handler();
+        in_vacuum = false;        VacuumCostActive = false;        PG_RE_THROW();
@@ -585,6 +589,10 @@ get_all_vacuum_rels(void) *     Xmax. * - mxactFullScanLimit is a value against which a table's
relminmxidvalue is *     compared to produce a full-table vacuum, as with xidFullScanLimit.
 
+ * - aggressive is set if it is not NULL and set true if the table needs
+ *   aggressive scan.
+ * - close_to_wrap_around_limit is set if it is not NULL and set true if it is
+ *   in anti-anti-wraparound window. * * xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is *
notinterested.
 
@@ -599,9 +607,11 @@ vacuum_set_xid_limits(Relation rel,                      TransactionId *freezeLimit,
      TransactionId *xidFullScanLimit,                      MultiXactId *multiXactCutoff,
 
-                      MultiXactId *mxactFullScanLimit)
+                      MultiXactId *mxactFullScanLimit,
+                      bool *aggressive, bool *close_to_wrap_around_limit){    int            freezemin;
+    int            freezemax;    int            mxid_freezemin;    int            effective_multixact_freeze_max_age;
 TransactionId limit;
 
@@ -701,11 +711,13 @@ vacuum_set_xid_limits(Relation rel,    *multiXactCutoff = mxactLimit;
-    if (xidFullScanLimit != NULL)
+    if (xidFullScanLimit != NULL || aggressive != NULL)    {        int            freezetable;
+        bool        maybe_anti_wrapround = false;
-        Assert(mxactFullScanLimit != NULL);
+        /* these two output should be requested together  */
+        Assert(xidFullScanLimit == NULL || mxactFullScanLimit != NULL);        /*         * Determine the table freeze
ageto use: as specified by the caller,
 
@@ -717,7 +729,14 @@ vacuum_set_xid_limits(Relation rel,        freezetable = freeze_table_age;        if (freezetable
<0)            freezetable = vacuum_freeze_table_age;
 
-        freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
+
+        freezemax = autovacuum_freeze_max_age * 0.95;
+        if (freezemax < freezetable)
+        {
+            /* We may be in anti-anti-warparound window */
+            freezetable = freezemax;
+            maybe_anti_wrapround = true;
+        }        Assert(freezetable >= 0);        /*
@@ -728,7 +747,8 @@ vacuum_set_xid_limits(Relation rel,        if (!TransactionIdIsNormal(limit))            limit =
FirstNormalTransactionId;
-        *xidFullScanLimit = limit;
+        if (xidFullScanLimit)
+            *xidFullScanLimit = limit;        /*         * Similar to the above, determine the table freeze age to use
for
@@ -741,10 +761,20 @@ vacuum_set_xid_limits(Relation rel,        freezetable = multixact_freeze_table_age;        if
(freezetable< 0)            freezetable = vacuum_multixact_freeze_table_age;
 
-        freezetable = Min(freezetable,
-                          effective_multixact_freeze_max_age * 0.95);
+
+        freezemax = effective_multixact_freeze_max_age * 0.95;
+        if (freezemax < freezetable)
+        {
+            /* We may be in anti-anti-warparound window */
+            freezetable = freezemax;
+            maybe_anti_wrapround = true;
+        }        Assert(freezetable >= 0);
+        /* We may be in anti-anti-warparound window */
+        if (effective_multixact_freeze_max_age * 0.95 < freezetable)
+            maybe_anti_wrapround = true;
+        /*         * Compute MultiXact limit causing a full-table vacuum, being careful         * to generate a valid
MultiXactvalue.
 
@@ -753,11 +783,38 @@ vacuum_set_xid_limits(Relation rel,        if (mxactLimit < FirstMultiXactId)
mxactLimit= FirstMultiXactId;
 
-        *mxactFullScanLimit = mxactLimit;
+        if (mxactFullScanLimit)
+            *mxactFullScanLimit = mxactLimit;
+
+        /*
+         * We request an aggressive scan if the table's frozen Xid is now
+         * older than or equal to the requested Xid full-table scan limit; or
+         * if the table's minimum MultiXactId is older than or equal to the
+         * requested mxid full-table scan limit.
+         */
+        if (aggressive)
+        {
+            *aggressive =
+                TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
+                                              limit);
+            *aggressive |=
+                MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
+                                            mxactLimit);
+
+            /* set close_to_wrap_around_limit if requested */
+            if (close_to_wrap_around_limit)
+                *close_to_wrap_around_limit =
+                    (*aggressive && maybe_anti_wrapround);
+        }
+        else
+        {
+            Assert (!close_to_wrap_around_limit);
+        }    }    else    {        Assert(mxactFullScanLimit == NULL);
+        Assert(aggressive == NULL);    }}
@@ -1410,6 +1467,9 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if (!onerel)
{
+        pgstat_report_vacuum(relid, false,
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_LOCK_FAILED,
+                             InvalidTransactionId, 0, 0);        PopActiveSnapshot();
CommitTransactionCommand();       return false;
 
@@ -1441,6 +1501,12 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
(errmsg("skipping\"%s\" --- only table or database owner can vacuum it",
RelationGetRelationName(onerel))));       relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET,
+                             InvalidTransactionId, 0, 0);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1458,6 +1524,12 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
(errmsg("skipping\"%s\" --- cannot vacuum non-tables or special system tables",
RelationGetRelationName(onerel))));       relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET,
+                             InvalidTransactionId, 0, 0);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1473,6 +1545,12 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if
(RELATION_IS_OTHER_TEMP(onerel))   {        relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET,
+                             InvalidTransactionId, 0, 0);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1486,6 +1564,12 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if
(onerel->rd_rel->relkind== RELKIND_PARTITIONED_TABLE)    {        relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET,
+                             InvalidTransactionId, 0, 0);
+        PopActiveSnapshot();        CommitTransactionCommand();        /* It's OK to proceed with ANALYZE on this
table*/
 
@@ -1531,6 +1615,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)     */    if
(options& VACOPT_FULL)    {
 
+        bool isshared = onerel->rd_rel->relisshared;
+        /* close relation before vacuuming, but hold lock until commit */        relation_close(onerel, NoLock);
onerel = NULL;
 
@@ -1538,6 +1624,9 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)        /* VACUUM
FULLis now a variant of CLUSTER; see cluster.c */        cluster_rel(relid, InvalidOid, false,
(options& VACOPT_VERBOSE) != 0);
 
+        pgstat_report_vacuum(relid, isshared, 0, 0, 0, 0, 0,
+                             PGSTAT_VACUUM_FULL_FINISHED,
+                             InvalidTransactionId, 0, 0);    }    else        lazy_vacuum_rel(onerel, options, params,
vac_strategy);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 172d213..372d661 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -55,6 +55,7 @@#include "postmaster/autovacuum.h"#include "storage/bufmgr.h"#include "storage/freespace.h"
+#include "storage/ipc.h"#include "storage/lmgr.h"#include "utils/lsyscache.h"#include "utils/memutils.h"
@@ -105,6 +106,8 @@typedef struct LVRelStats{
+    Oid            reloid;            /* oid of the target relation */
+    bool        shared;            /* is shared relation? */    /* hasindex = true means two-pass strategy; false
meansone-pass */    bool        hasindex;    /* Overall statistics about rel */
 
@@ -119,6 +122,7 @@ typedef struct LVRelStats    double        new_rel_tuples; /* new estimated total # of tuples */
double       new_dead_tuples;    /* new estimated total # of dead tuples */    BlockNumber pages_removed;
 
+    BlockNumber pages_not_removed;    double        tuples_deleted;    BlockNumber nonempty_pages; /* actually, last
nonemptypage + 1 */    /* List of TIDs of tuples we intend to delete */
 
@@ -138,6 +142,7 @@ static int    elevel = -1;static TransactionId OldestXmin;static TransactionId FreezeLimit;static
MultiXactIdMultiXactCutoff;
 
+static LVRelStats *current_lvstats;static BufferAccessStrategy vac_strategy;
@@ -216,6 +221,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    else        elevel =
DEBUG2;
+    current_lvstats = NULL;    pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
RelationGetRelid(onerel));
@@ -227,29 +233,30 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
params->multixact_freeze_min_age,                         params->multixact_freeze_table_age,
&OldestXmin,&FreezeLimit, &xidFullScanLimit,
 
-                          &MultiXactCutoff, &mxactFullScanLimit);
+                          &MultiXactCutoff, &mxactFullScanLimit,
+                          &aggressive, NULL);
-    /*
-     * We request an aggressive scan if the table's frozen Xid is now older
-     * than or equal to the requested Xid full-table scan limit; or if the
-     * table's minimum MultiXactId is older than or equal to the requested
-     * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
-     */
-    aggressive = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
-                                               xidFullScanLimit);
-    aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
-                                              mxactFullScanLimit);
+    /* force aggressive scan if DISABLE_PAGE_SKIPPING was specified */    if (options & VACOPT_DISABLE_PAGE_SKIPPING)
     aggressive = true;    vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats));
 
+    vacrelstats->reloid = RelationGetRelid(onerel);
+    vacrelstats->shared = onerel->rd_rel->relisshared;    vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
vacrelstats->old_rel_tuples= onerel->rd_rel->reltuples;    vacrelstats->num_index_scans = 0;
vacrelstats->pages_removed= 0;
 
+    vacrelstats->pages_not_removed = 0;    vacrelstats->lock_waiter_detected = false;
+    /*
+     * Register current vacrelstats so that final status can be reported on
+     * interrupts
+     */
+    current_lvstats = vacrelstats;
+    /* Open all indexes of the relation */    vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
vacrelstats->hasindex= (nindexes > 0);
 
@@ -280,8 +287,15 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,     * Optionally truncate the
relation.    */    if (should_attempt_truncation(vacrelstats))
 
+    {        lazy_truncate_heap(onerel, vacrelstats);
+        /* just paranoia */
+        if (vacrelstats->rel_pages >= vacrelstats->nonempty_pages)
+            vacrelstats->pages_not_removed +=
+                vacrelstats->rel_pages - vacrelstats->nonempty_pages;
+    }
+    /* Report that we are now doing final cleanup */    pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
                    PROGRESS_VACUUM_PHASE_FINAL_CLEANUP);
 
@@ -339,10 +353,22 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    if (new_live_tuples < 0)
    new_live_tuples = 0;    /* just in case */
 
-    pgstat_report_vacuum(RelationGetRelid(onerel),
+    /* vacuum successfully finished. nothing to do on exit */
+    current_lvstats = NULL;
+
+    pgstat_report_vacuum(vacrelstats->reloid,                         onerel->rd_rel->relisshared,
   new_live_tuples,
 
-                         vacrelstats->new_dead_tuples);
+                         vacrelstats->new_dead_tuples,
+                         vacrelstats->pages_removed,
+                         vacrelstats->pages_not_removed,
+                         vacrelstats->num_index_scans,
+                         OldestXmin,
+                         aggressive ?
+                         PGSTAT_VACUUM_AGGRESSIVE_FINISHED :
+                         PGSTAT_VACUUM_FINISHED,
+                         0, 0);
+    pgstat_progress_end_command();    /* and log the action if appropriate */
@@ -2205,3 +2231,54 @@ heap_page_is_all_visible(Relation rel, Buffer buf,    return all_visible;}
+
+/*
+ * lazy_vacuum_cancel_handler - report interrupted vacuum status
+ */
+void
+lazy_vacuum_cancel_handler(void)
+{
+    LVRelStats *stats = current_lvstats;
+    LocalPgBackendStatus *local_beentry;
+    PgBackendStatus *beentry;
+    int                phase;
+    int                err;
+
+    current_lvstats = NULL;
+
+    /* we have nothing to report */
+    if (!stats)
+        return;
+
+    /* get vacuum progress stored in backend status */
+    local_beentry = pgstat_fetch_stat_local_beentry(MyBackendId);
+    if (!local_beentry)
+        return;
+
+    beentry = &local_beentry->backendStatus;
+
+    Assert (beentry && beentry->st_progress_command == PROGRESS_COMMAND_VACUUM);
+
+    phase = beentry->st_progress_param[PROGRESS_VACUUM_PHASE];
+
+    /* we can reach here both on interrupt and error */
+    if (geterrcode() == ERRCODE_QUERY_CANCELED)
+        err = PGSTAT_VACUUM_CANCELED;
+    else
+        err = PGSTAT_VACUUM_ERROR;
+
+    /*
+     * vacuum has been canceled, report stats numbers without normalization
+     * here. (But currently they are not used.)
+     */
+    pgstat_report_vacuum(stats->reloid,
+                         stats->shared,
+                         stats->new_rel_tuples,
+                         stats->new_dead_tuples,
+                         stats->pages_removed,
+                         stats->pages_not_removed,
+                         stats->num_index_scans,
+                         OldestXmin,
+                         err,
+                         phase, geterrcode());
+}
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c04c0b5..6c32d0b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -831,6 +831,121 @@ shutdown:}/*
+ * Returns status string of auto vacuum on the relation
+ */
+char *
+AutoVacuumRequirement(Oid reloid)
+{
+    Relation classRel;
+    Relation rel;
+    TupleDesc    pg_class_desc;
+    HeapTuple tuple;
+    Form_pg_class classForm;
+    AutoVacOpts *relopts;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatDBEntry *shared;
+    PgStat_StatDBEntry *dbentry;
+    int            effective_multixact_freeze_max_age;
+    bool        dovacuum;
+    bool        doanalyze;
+    bool        wraparound;
+    bool        aggressive;
+    bool        xid_calculated = false;
+    bool        in_anti_wa_window = false;
+    char       *ret = "not requried";
+
+    /* Compute the multixact age for which freezing is urgent. */
+    effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+
+    /* Fetch the pgclass entry for this relation */
+    tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(reloid));
+    if (!HeapTupleIsValid(tuple))
+        elog(ERROR, "cache lookup failed for relation %u", reloid);
+    classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+    /* extract relopts for autovacuum */
+    classRel = heap_open(RelationRelationId, AccessShareLock);
+    pg_class_desc = RelationGetDescr(classRel);
+    relopts = extract_autovac_opts(tuple, pg_class_desc);
+    heap_close(classRel, AccessShareLock);
+
+    /* Fetch the pgstat shared entry and entry for this database */
+    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+
+    /* Fetch the pgstat entry for this table */
+    tabentry = get_pgstat_tabentry_relid(reloid, classForm->relisshared,
+                                         shared, dbentry);
+
+    /*
+     * Check if the relation needs vacuum. This function is intended to
+     * suggest aggresive vacuum for the last 5% window in
+     * autovacuum_freeze_max_age so the variable wraparound is ignored
+     * here. See vacuum_set_xid_limits for details.
+     */
+    relation_needs_vacanalyze(reloid, relopts, classForm, tabentry,
+                              effective_multixact_freeze_max_age,
+                              &dovacuum, &doanalyze, &wraparound);
+    ReleaseSysCache(tuple);
+
+    /* get further information if needed */
+    rel = NULL;
+
+    /* don't get stuck with lock  */
+    if (ConditionalLockRelationOid(reloid, AccessShareLock))
+        rel = try_relation_open(reloid, NoLock);
+
+    if (rel)
+    {
+        TransactionId OldestXmin, FreezeLimit;
+        MultiXactId MultiXactCutoff;
+
+        vacuum_set_xid_limits(rel,
+                              vacuum_freeze_min_age,
+                              vacuum_freeze_table_age,
+                              vacuum_multixact_freeze_min_age,
+                              vacuum_multixact_freeze_table_age,
+                              &OldestXmin, &FreezeLimit, NULL,
+                              &MultiXactCutoff, NULL,
+                              &aggressive, &in_anti_wa_window);
+
+        xid_calculated = true;
+        relation_close(rel, AccessShareLock);
+    }
+
+    /* choose the proper message according to the calculation above */
+    if (xid_calculated)
+    {
+        if (dovacuum)
+        {
+            /* we don't care anti-wraparound if autovacuum is on */
+            if (aggressive)
+                ret = "aggressive";
+            else
+                ret = "partial";
+        }
+        else if (in_anti_wa_window)
+            ret = "close to freeze-limit xid";
+        /* otherwise just "not requried" */
+    }
+    else
+    {
+        /*
+         * failed to compute xid limits. show less-grained messages. We can
+         * use just "required" in the autovacuum case is enough to distinguish
+         * from full-grained messages, but we require additional words in the
+         * case where autovacuum is turned off.
+         */
+        if (dovacuum)
+            ret = "required";
+        else
+            ret = "not required (lock not acquired)";
+    }
+
+    return ret;
+}
+
+/* * Determine the time to sleep, based on the database list. * * The "canlaunch" parameter indicates whether we can
starta worker right now,
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3a0b49c..721b172 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1403,7 +1403,13 @@ pgstat_report_autovac(Oid dboid) */voidpgstat_report_vacuum(Oid tableoid, bool shared,
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                     PgStat_Counter pages_removed,
+                     PgStat_Counter pages_not_removed,
+                     PgStat_Counter num_index_scans,
+                     TransactionId    oldestxmin,
+                     PgStat_Counter status, PgStat_Counter last_phase,
+                     PgStat_Counter errcode){    PgStat_MsgVacuum msg;
@@ -1417,6 +1423,13 @@ pgstat_report_vacuum(Oid tableoid, bool shared,    msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples= livetuples;    msg.m_dead_tuples = deadtuples;
 
+    msg.m_pages_removed = pages_removed;
+    msg.m_pages_not_removed = pages_not_removed;
+    msg.m_num_index_scans = num_index_scans;
+    msg.m_oldest_xmin = oldestxmin;
+    msg.m_vacuum_status = status;
+    msg.m_vacuum_last_phase = last_phase;
+    msg.m_vacuum_errcode = errcode;    pgstat_send(&msg, sizeof(msg));}
@@ -4576,17 +4589,25 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)    if (!found)
{       result->numscans = 0;
 
+        result->tuples_returned = 0;        result->tuples_fetched = 0;        result->tuples_inserted = 0;
result->tuples_updated= 0;        result->tuples_deleted = 0;        result->tuples_hot_updated = 0;
 
+        result->n_live_tuples = 0;        result->n_dead_tuples = 0;        result->changes_since_analyze = 0;
+        result->n_pages_removed = 0;
+        result->n_pages_not_removed = 0;
+        result->n_index_scans = 0;
+        result->oldest_xmin = InvalidTransactionId;
+        result->blocks_fetched = 0;        result->blocks_hit = 0;
+        result->vacuum_timestamp = 0;        result->vacuum_count = 0;        result->autovac_vacuum_timestamp = 0;
@@ -4595,6 +4616,11 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->analyze_count= 0;        result->autovac_analyze_timestamp = 0;        result->autovac_analyze_count = 0;
 
+
+        result->vacuum_status = 0;
+        result->vacuum_last_phase = 0;
+        result->vacuum_errcode = 0;
+        result->vacuum_failcount = 0;    }    return result;
@@ -5979,18 +6005,50 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)    tabentry = pgstat_get_tab_entry(dbentry,
msg->m_tableoid,true);
 
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
+    tabentry->vacuum_status = msg->m_vacuum_status;
+    tabentry->vacuum_last_phase = msg->m_vacuum_last_phase;
+    tabentry->vacuum_errcode = msg->m_vacuum_errcode;
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
+    /*
+     * We store the numbers only when the vacuum has been completed. They
+     * might be usable to find how much the stopped vacuum processed but we
+     * choose not to show them rather than show bogus numbers.
+     */
+    switch ((StatVacuumStatus)msg->m_vacuum_status)    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
+    case PGSTAT_VACUUM_FINISHED:
+    case PGSTAT_VACUUM_FULL_FINISHED:
+    case PGSTAT_VACUUM_AGGRESSIVE_FINISHED:
+        tabentry->n_live_tuples = msg->m_live_tuples;
+        tabentry->n_dead_tuples = msg->m_dead_tuples;
+        tabentry->n_pages_removed = msg->m_pages_removed;
+        tabentry->n_pages_not_removed = msg->m_pages_not_removed;
+        tabentry->n_index_scans = msg->m_num_index_scans;
+        tabentry->oldest_xmin = msg->m_oldest_xmin;
+        tabentry->vacuum_failcount = 0;
+
+        if (msg->m_autovacuum)
+        {
+            tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
+            tabentry->autovac_vacuum_count++;
+        }
+        else
+        {
+            tabentry->vacuum_timestamp = msg->m_vacuumtime;
+            tabentry->vacuum_count++;
+        }
+        break;
+
+    case PGSTAT_VACUUM_ERROR:
+    case PGSTAT_VACUUM_CANCELED:
+    case PGSTAT_VACUUM_SKIP_LOCK_FAILED:
+        tabentry->vacuum_failcount++;
+        break;
+
+    case PGSTAT_VACUUM_SKIP_NONTARGET:
+    default:
+        /* don't increment failure count for non-target tables */
+        break;    }}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 8d9e7c1..bddc243 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -23,6 +23,7 @@#include "pgstat.h"#include "postmaster/bgworker_internals.h"#include "postmaster/postmaster.h"
+#include "postmaster/autovacuum.h"#include "storage/proc.h"#include "storage/procarray.h"#include "utils/acl.h"
@@ -194,6 +195,156 @@ pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS)    PG_RETURN_INT64(result);}
+Datum
+pg_stat_get_vacuum_necessity(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+
+    PG_RETURN_TEXT_P(cstring_to_text(AutoVacuumRequirement(relid)));
+}
+
+Datum
+pg_stat_get_last_vacuum_truncated(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int64) (tabentry->n_pages_removed);
+
+    PG_RETURN_INT64(result);
+}
+
+Datum
+pg_stat_get_last_vacuum_untruncated(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int64) (tabentry->n_pages_not_removed);
+
+    PG_RETURN_INT64(result);
+}
+
+Datum
+pg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int32        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int32) (tabentry->n_index_scans);
+
+    PG_RETURN_INT32(result);
+}
+
+Datum
+pg_stat_get_last_vacuum_oldest_xmin(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    TransactionId    result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = InvalidTransactionId;
+    else
+        result = (int32) (tabentry->oldest_xmin);
+
+    return TransactionIdGetDatum(result);
+}
+
+Datum
+pg_stat_get_last_vacuum_status(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    char        *result = "unknown";
+    PgStat_StatTabEntry *tabentry;
+
+    /*
+     * status string. this must be synced with the strings shown by the
+     * statistics view "pg_stat_progress_vacuum"
+     */
+    static char *phasestr[] =
+        {"initialization",
+         "scanning heap",
+         "vacuuming indexes",
+         "vacuuming heap",
+         "cleaning up indexes",
+         "trucating heap",
+         "performing final cleanup"};
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) != NULL)
+    {
+        int                    phase;
+        StatVacuumStatus    status;
+
+        status = tabentry->vacuum_status;
+        switch (status)
+        {
+        case PGSTAT_VACUUM_FINISHED:
+            result = "completed";
+            break;
+        case PGSTAT_VACUUM_ERROR:
+        case PGSTAT_VACUUM_CANCELED:
+            phase = tabentry->vacuum_last_phase;
+            /* number of elements of phasestr above */
+            if (phase >= 0 && phase <= 7)
+                result = psprintf("%s while %s",
+                                  status == PGSTAT_VACUUM_CANCELED ?
+                                  "canceled" : "error",
+                                  phasestr[phase]);
+            else
+                result = psprintf("unknown vacuum phase: %d", phase);
+            break;
+        case PGSTAT_VACUUM_SKIP_LOCK_FAILED:
+            result = "skipped - lock unavailable";
+            break;
+
+        case PGSTAT_VACUUM_AGGRESSIVE_FINISHED:
+            result = "aggressive vacuum completed";
+            break;
+
+        case PGSTAT_VACUUM_FULL_FINISHED:
+            result = "vacuum full completed";
+            break;
+
+        case PGSTAT_VACUUM_SKIP_NONTARGET:
+            result = "unvacuumable";
+            break;
+
+        default:
+            result = "unknown status";
+            break;
+        }
+    }
+
+    PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+Datum
+pg_stat_get_autovacuum_fail_count(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int32) (tabentry->vacuum_failcount);
+
+    PG_RETURN_INT32(result);
+}Datumpg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
@@ -210,7 +361,6 @@ pg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)    PG_RETURN_INT64(result);}
-Datumpg_stat_get_blocks_hit(PG_FUNCTION_ARGS){
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 93c031a..5a1c77d 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2887,6 +2887,20 @@ DATA(insert OID = 3317 (  pg_stat_get_wal_receiver    PGNSP PGUID 12 1 0 0 0 f f
fDESCR("statistics:information about WAL receiver");DATA(insert OID = 6118 (  pg_stat_get_subscription    PGNSP PGUID
121 0 0 0 f f f f f f s r 1 0 2249 "26" "{26,26,26,23,3220,1184,1184,3220,1184}" "{i,o,o,o,o,o,o,o,o}"
"{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}"_null_
_null_pg_stat_get_subscription _null_ _null_ _null_ ));DESCR("statistics: information about subscription");
 
+DATA(insert OID = 3419 (  pg_stat_get_vacuum_necessity    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_
_null__null_ _null_ _null_ pg_stat_get_vacuum_necessity _null_ _null_ _null_ ));
 
+DESCR("statistics: true if needs vacuum");
+DATA(insert OID = 3420 (  pg_stat_get_last_vacuum_untruncated    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_untruncated _null_ _null_ _null_ ));
 
+DESCR("statistics: pages left untruncated in the last vacuum");
+DATA(insert OID = 3421 (  pg_stat_get_last_vacuum_truncated    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_truncated _null_ _null_ _null_ ));
 
+DESCR("statistics: pages truncated in the last vacuum");
+DATA(insert OID = 3422 (  pg_stat_get_last_vacuum_index_scans    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_index_scans _null_ _null_ _null_ ));
 
+DESCR("statistics: number of index scans in the last vacuum");
+DATA(insert OID = 3423 (  pg_stat_get_last_vacuum_oldest_xmin    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 28 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_oldest_xmin _null_ _null_ _null_ ));
 
+DESCR("statistics: The oldest xmin used in the last vacuum");
+DATA(insert OID = 3424 (  pg_stat_get_last_vacuum_status    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_
_null__null_ _null_ _null_ pg_stat_get_last_vacuum_status _null_ _null_ _null_ ));
 
+DESCR("statistics: ending status of the last vacuum");
+DATA(insert OID = 3425 (  pg_stat_get_autovacuum_fail_count    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_autovacuum_fail_count _null_ _null_ _null_ ));
 
+DESCR("statistics: number of successively failed vacuum trials");DATA(insert OID = 2026 (  pg_backend_pid
 PGNSP PGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_
_null_));DESCR("statistics: current backend PID");DATA(insert OID = 1937 (  pg_stat_get_backend_pid        PGNSP PGUID
121 0 0 0 f f f f t f s r 1 0 23 "23" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_pid _null_ _null_ _null_
));
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 7a7b793..6091bab 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -182,13 +182,15 @@ extern void vacuum_set_xid_limits(Relation rel,                      TransactionId *freezeLimit,
                   TransactionId *xidFullScanLimit,                      MultiXactId *multiXactCutoff,
 
-                      MultiXactId *mxactFullScanLimit);
+                      MultiXactId *mxactFullScanLimit,
+                      bool *aggressive, bool *in_wa_window);extern void vac_update_datfrozenxid(void);extern void
vacuum_delay_point(void);/*in commands/vacuumlazy.c */extern void lazy_vacuum_rel(Relation onerel, int options,
      VacuumParams *params, BufferAccessStrategy bstrategy);
 
+extern void lazy_vacuum_cancel_handler(void);/* in commands/analyze.c */extern void analyze_rel(Oid relid, RangeVar
*relation,int options,
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..bab8332 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -67,6 +67,20 @@ typedef enum StatMsgType    PGSTAT_MTYPE_DEADLOCK} StatMsgType;
+/*
+ * The exit status stored in vacuum report.
+ */
+typedef enum StatVacuumStatus
+{
+    PGSTAT_VACUUM_FINISHED,
+    PGSTAT_VACUUM_CANCELED,
+    PGSTAT_VACUUM_ERROR,
+    PGSTAT_VACUUM_SKIP_LOCK_FAILED,
+    PGSTAT_VACUUM_SKIP_NONTARGET,
+    PGSTAT_VACUUM_AGGRESSIVE_FINISHED,
+    PGSTAT_VACUUM_FULL_FINISHED
+} StatVacuumStatus;
+/* ---------- * The data type used for counters. * ----------
@@ -369,6 +383,13 @@ typedef struct PgStat_MsgVacuum    TimestampTz m_vacuumtime;    PgStat_Counter m_live_tuples;
PgStat_Counterm_dead_tuples;
 
+    PgStat_Counter m_pages_removed;
+    PgStat_Counter m_pages_not_removed;
+    PgStat_Counter m_num_index_scans;
+    TransactionId  m_oldest_xmin;
+    PgStat_Counter m_vacuum_status;
+    PgStat_Counter m_vacuum_last_phase;
+    PgStat_Counter m_vacuum_errcode;} PgStat_MsgVacuum;
@@ -629,6 +650,10 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter n_live_tuples;    PgStat_Counter
n_dead_tuples;   PgStat_Counter changes_since_analyze;
 
+    PgStat_Counter n_pages_removed;
+    PgStat_Counter n_pages_not_removed;
+    PgStat_Counter n_index_scans;
+    TransactionId  oldest_xmin;    PgStat_Counter blocks_fetched;    PgStat_Counter blocks_hit;
@@ -641,6 +666,11 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter analyze_count;    TimestampTz
autovac_analyze_timestamp;   /* autovacuum initiated */    PgStat_Counter autovac_analyze_count;
 
+
+    PgStat_Counter    vacuum_status;
+    PgStat_Counter    vacuum_last_phase;
+    PgStat_Counter    vacuum_errcode;
+    PgStat_Counter    vacuum_failcount;} PgStat_StatTabEntry;
@@ -1165,7 +1195,13 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type textern void
pgstat_report_autovac(Oiddboid);extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples);
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                     PgStat_Counter pages_removed,
+                     PgStat_Counter pages_not_removed,
+                     PgStat_Counter num_index_scans,
+                     TransactionId oldextxmin,
+                     PgStat_Counter status, PgStat_Counter last_phase,
+                     PgStat_Counter errcode);extern void pgstat_report_analyze(Relation rel,
PgStat_Counterlivetuples, PgStat_Counter deadtuples,                      bool resetcounter);
 
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index 3469915..848a322 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -49,6 +49,7 @@ extern int    Log_autovacuum_min_duration;extern bool AutoVacuumingActive(void);extern bool
IsAutoVacuumLauncherProcess(void);externbool IsAutoVacuumWorkerProcess(void);
 
+extern char *AutoVacuumRequirement(Oid reloid);#define IsAnyAutoVacuumProcess() \    (IsAutoVacuumLauncherProcess() ||
IsAutoVacuumWorkerProcess())
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f1c1b44..fb1ea49 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1759,11 +1759,18 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid)AS n_dead_tup,    pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
 
+    pg_stat_get_vacuum_necessity(c.oid) AS vacuum_required,    pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid)AS last_autovacuum,    pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
pg_stat_get_last_autoanalyze_time(c.oid)AS last_autoanalyze,    pg_stat_get_vacuum_count(c.oid) AS vacuum_count,
 
+    pg_stat_get_last_vacuum_truncated(c.oid) AS last_vacuum_truncated,
+    pg_stat_get_last_vacuum_untruncated(c.oid) AS last_vacuum_untruncated,
+    pg_stat_get_last_vacuum_index_scans(c.oid) AS last_vacuum_index_scans,
+    pg_stat_get_last_vacuum_oldest_xmin(c.oid) AS last_vacuum_oldest_xmin,
+    pg_stat_get_last_vacuum_status(c.oid) AS last_vacuum_status,
+    pg_stat_get_autovacuum_fail_count(c.oid) AS autovacuum_fail_count,    pg_stat_get_autovacuum_count(c.oid) AS
autovacuum_count,   pg_stat_get_analyze_count(c.oid) AS analyze_count,    pg_stat_get_autoanalyze_count(c.oid) AS
autoanalyze_count
@@ -1906,11 +1913,18 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,   pg_stat_all_tables.n_mod_since_analyze,
 
+    pg_stat_all_tables.vacuum_required,    pg_stat_all_tables.last_vacuum,    pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,   pg_stat_all_tables.last_autoanalyze,    pg_stat_all_tables.vacuum_count,
 
+    pg_stat_all_tables.last_vacuum_truncated,
+    pg_stat_all_tables.last_vacuum_untruncated,
+    pg_stat_all_tables.last_vacuum_index_scans,
+    pg_stat_all_tables.last_vacuum_oldest_xmin,
+    pg_stat_all_tables.last_vacuum_status,
+    pg_stat_all_tables.autovacuum_fail_count,    pg_stat_all_tables.autovacuum_count,
pg_stat_all_tables.analyze_count,   pg_stat_all_tables.autoanalyze_count
 
@@ -1949,11 +1963,18 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,   pg_stat_all_tables.n_mod_since_analyze,
 
+    pg_stat_all_tables.vacuum_required,    pg_stat_all_tables.last_vacuum,    pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,   pg_stat_all_tables.last_autoanalyze,    pg_stat_all_tables.vacuum_count,
 
+    pg_stat_all_tables.last_vacuum_truncated,
+    pg_stat_all_tables.last_vacuum_untruncated,
+    pg_stat_all_tables.last_vacuum_index_scans,
+    pg_stat_all_tables.last_vacuum_oldest_xmin,
+    pg_stat_all_tables.last_vacuum_status,
+    pg_stat_all_tables.autovacuum_fail_count,    pg_stat_all_tables.autovacuum_count,
pg_stat_all_tables.analyze_count,   pg_stat_all_tables.autoanalyze_count
 
-- 
2.9.2


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

15 November 2017, 13:13:01

On Mon, Oct 30, 2017 at 8:57 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Thu, 26 Oct 2017 15:06:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote
in<20171026.150630.115694437.horiguchi.kyotaro@lab.ntt.co.jp>
 
>> At Fri, 20 Oct 2017 19:15:16 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoAkaw-u0feAVN_VrKZA5tvzp7jT=mQCQP-SvMegKXHHaw@mail.gmail.com>
>> > >   n_mod_since_analyze          | 20000
>> > > + vacuum_requred               | true
>> > > + last_vacuum_oldest_xid       | 8023
>> > > + last_vacuum_left_to_truncate | 5123
>> > > + last_vacuum_truncated        | 387
>> > >   last_vacuum                  | 2017-10-10 17:21:54.380805+09
>> > >   last_autovacuum              | 2017-10-10 17:21:54.380805+09
>> > > + last_autovacuum_status       | Killed by lock conflict
>> > > ...
>> > >   autovacuum_count             | 128
>> > > + incomplete_autovacuum_count  | 53
>> > >
>> > > # The last one might be needless..
>> >
>> > I'm not sure that the above informations will help for users or DBA
>> > but personally I sometimes want to have the number of index scans of
>> > the last autovacuum in the pg_stat_user_tables view. That value
>> > indicates how efficiently vacuums performed and would be a signal to
>> > increase the setting of autovacuum_work_mem for user.
>>
>> Btree and all existing index AMs (except brin) seem to visit the
>> all pages in every index scan so it would be valuable. Instead
>> the number of visited index pages during a table scan might be
>> usable. It is more relevant to performance than the number of
>> scans, on the other hand it is a bit difficult to get something
>> worth from the number in a moment. I'll show the number of scans
>> in the first cut.
>>
>> > > Where the "Killed by lock conflict" is one of the followings.
>> > >
>> > >    - Completed
>> > >    - Truncation skipped
>> > >    - Partially truncated
>> > >    - Skipped
>> > >    - Killed by lock conflict
>> > >
>> > > This seems enough to find the cause of a table bloat. The same
>> > > discussion could be applied to analyze but it might be the
>> > > another issue.
>> > >
>> > > There may be a better way to indicate the vacuum soundness. Any
>> > > opinions and suggestions are welcome.
>> > >
>> > > I'm going to make a patch to do the 'formal' one for the time
>> > > being.
>
> Done with small modifications. In the attached patch
> pg_stat_all_tables has the following new columns. Documentations
> is not provided at this stage.
>
> -----
>   n_mod_since_analyze     | 0
> + vacuum_required         | not requried
>   last_vacuum             |
>   last_autovacuum         | 2017-10-30 18:51:32.060551+09
>   last_analyze            |
>   last_autoanalyze        | 2017-10-30 18:48:33.414711+09
>   vacuum_count            | 0
> + last_vacuum_truncated   | 0
> + last_vacuum_untruncated | 0
> + last_vacuum_index_scans | 0
> + last_vacuum_oldest_xmin | 2134
> + last_vacuum_status      | agressive vacuum completed
> + autovacuum_fail_count   | 0
>   autovacuum_count        | 5
>   analyze_count           | 0
>   autoanalyze_count       | 1
> -----
> Where each column shows the following infomation.
>
> + vacuum_required         | not requried
>
>  VACUUM requirement status. Takes the following values.
>
>   - partial
>     Partial (or normal) will be performed by the next autovacuum.
>     The word "partial" is taken from the comment for
>     vacuum_set_xid_limits.
>
>   - aggressive
>     Aggressive scan will be performed by the next autovacuum.
>
>   - required
>     Any type of autovacuum will be performed. The type of scan is
>     unknown because the view failed to take the required lock on
>     the table. (AutoVacuumrequirement())
>
>   - not required
>     Next autovacuum won't perform scan on this relation.
>
>   - not required (lock not acquired)
>
>     Autovacuum should be disabled and the distance to
>     freeze-limit is not known because required lock is not
>     available.
>
>   - close to freeze-limit xid
>     Shown while autovacuum is disabled. The table is in the
>     manual vacuum window to avoid anti-wraparound autovacuum.
>
> + last_vacuum_truncated | 0
>
>   The number of truncated pages in the last completed
>   (auto)vacuum.
>
> + last_vacuum_untruncated | 0
>   The number of pages the last completed (auto)vacuum tried to
>   truncate but could not for some reason.
>
> + last_vacuum_index_scans | 0
>   The number of index scans performed in the last completed
>   (auto)vacuum.
>
> + last_vacuum_oldest_xmin | 2134
>   The oldest xmin used in the last completed (auto)vacuum.
>
> + last_vacuum_status      | agressive vacuum completed
>
>   The finish status of the last vacuum. Takes the following
>   values. (pg_stat_get_last_vacuum_status())
>
>    - completed
>      The last partial (auto)vacuum is completed.
>
>    - vacuum full completed
>      The last VACUUM FULL is completed.
>
>    - aggressive vacuum completed
>      The last aggressive (auto)vacuum is completed.
>
>    - error while $progress
>      The last vacuum stopped by error while $progress.
>      The $progress one of the vacuum progress phases.
>
>    - canceled while $progress
>      The last vacuum was canceled while $progress
>
>      This is caused by user cancellation of manual vacuum or
>      killed by another backend who wants to acquire lock on the
>      relation.
>
>    - skipped - lock unavailable
>      The last autovacuum on the relation was skipped because
>      required lock was not available.
>
>    - unvacuumable
>      A past autovacuum tried vacuum on the relation but it is not
>      vacuumable for reasons of ownership or accessibility problem.
>      (Such relations are not shown in pg_stat_all_tables..)
>
> + autovacuum_fail_count   | 0
>   The number of successive failure of vacuum on the relation.
>   Reset to zero by completed vacuum.
>
> ======
>
> In the patch, vacrelstats if pointed from a static variable and
> cancel reporting is performed in PG_CATCH() section in vacuum().
> Every unthrown error like lock acquisition failure is reported by
> explicit pgstat_report_vacuum() with the corresponding finish
> code.
>
> Vacuum requirement status is calculated in AutoVacuumRequirment()
> and returned as a string. Access share lock on the target
> relation is required but it returns only available values if the
> lock is not available. I decided to return incomplete (coarse
> grained) result than wait for a lock that isn't known to be
> relased in a short time for a perfect result.
            pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
+           pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,            pg_stat_get_last_vacuum_time(C.oid) as
last_vacuum,           pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid)as last_analyze,            pg_stat_get_last_autoanalyze_time(C.oid) as
last_autoanalyze,           pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
 
+           pg_stat_get_last_vacuum_truncated(C.oid) AS last_vacuum_truncated,
+           pg_stat_get_last_vacuum_untruncated(C.oid) AS
last_vacuum_untruncated,
+           pg_stat_get_last_vacuum_index_scans(C.oid) AS
last_vacuum_index_scans,
+           pg_stat_get_last_vacuum_oldest_xmin(C.oid) AS
last_vacuum_oldest_xmin,
+           pg_stat_get_last_vacuum_status(C.oid) AS last_vacuum_status,
+           pg_stat_get_autovacuum_fail_count(C.oid) AS autovacuum_fail_count,
Please use spaces instead of tabs. Indentation is not consistent.

+       case PGSTAT_VACUUM_CANCELED:
+           phase = tabentry->vacuum_last_phase;
+           /* number of elements of phasestr above */
+           if (phase >= 0 && phase <= 7)
+               result = psprintf("%s while %s",
+                                 status == PGSTAT_VACUUM_CANCELED ?
+                                 "canceled" : "error",
+                                 phasestr[phase]);
Such complication is not necessary. The phase parameter is updated by
individual calls of pgstat_progress_update_param(), so the information
showed here overlaps with the existing information in the "phase"
field.

@@ -210,7 +361,6 @@ pg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)   PG_RETURN_INT64(result);}

-Datumpg_stat_get_blocks_hit(PG_FUNCTION_ARGS)
Noise diff.

Thinking about trying to et something into core by the end of the
commit fest, this patch presents multiple concepts at once which could
be split into separate patches for simplicity:
1) Additional data fields to help in debugging completed vacuums.
2) Tracking of interrupted vacuum jobs in progress table.
3) Get state of vacuum job on error.

However, progress reports are here to allow users to do decisions
based on the activity of how things are working. This patch proposes
to add multiple new fields:
- oldest Xmin.
- number of index scans.
- number of pages truncated.
- number of pages that should have been truncated, but are not truncated.
Among all this information, as Sawada-san has already mentioned
upthread, the more index scans the less dead tuples you can store at
once, so autovacuum_work_mem ought to be increases. This is useful for
tuning and should be documented properly if reported to give
indications about vacuum behavior. The rest though, could indicate how
aggressive autovacuum is able to remove tail blocks and do its work.
But what really matters for users to decide if autovacuum should be
more aggressive is tracking the number of dead tuples, something which
is already evaluated.

Tracking the number of failed vacuum attempts is also something
helpful to understand how much the job is able to complete. As there
is already tracking vacuum jobs that have completed, it could be
possible, instead of logging activity when a vacuum job has failed, to
track the number of *begun* jobs on a relation. Then it is possible to
guess how many have failed by taking the difference between those that
completed properly. Having counters per failure types could also be a
possibility.

For this commit fest, I would suggest a patch that simply adds
tracking for the number of index scans done, with documentation to
give recommendations about parameter tuning. i am switching the patch
as "waiting on author".
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

16 November 2017, 16:34:02

Thank you for reviewing this.

At Wed, 15 Nov 2017 16:13:01 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQm_WCKuUf5RD0CzeMuMO907ZPKP7mBh-3t2zSJ9jn+PA@mail.gmail.com>
>              pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
> +           pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,
>              pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
>              pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
>              pg_stat_get_last_analyze_time(C.oid) as last_analyze,
>              pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
>              pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
> Please use spaces instead of tabs. Indentation is not consistent.

Done. Thank you for pointing. (whitespace-mode showed me some
similar inconsistencies at the other places in the file...)

> +       case PGSTAT_VACUUM_CANCELED:
> +           phase = tabentry->vacuum_last_phase;
> +           /* number of elements of phasestr above */
> +           if (phase >= 0 && phase <= 7)
> +               result = psprintf("%s while %s",
> +                                 status == PGSTAT_VACUUM_CANCELED ?
> +                                 "canceled" : "error",
> +                                 phasestr[phase]);
> Such complication is not necessary. The phase parameter is updated by
> individual calls of pgstat_progress_update_param(), so the information
> showed here overlaps with the existing information in the "phase"
> field.

The "phase" is pg_stat_progress_vacuum's? If "complexy" means
phasestr[phase], the "phase" cannot be overlap with
last_vacuum_status since pg_stat_progress_vacuum's entry has
already gone when someone looks into pg_stat_all_tables and see a
failed vacuum status. Could you give a bit specific comment?

> @@ -210,7 +361,6 @@ pg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
>     PG_RETURN_INT64(result);
>  }
> 
> -
>  Datum
>  pg_stat_get_blocks_hit(PG_FUNCTION_ARGS)
> Noise diff.

Removed.

> Thinking about trying to et something into core by the end of the
> commit fest, this patch presents multiple concepts at once which could
> be split into separate patches for simplicity:
> 1) Additional data fields to help in debugging completed vacuums.
> 2) Tracking of interrupted vacuum jobs in progress table.
> 3) Get state of vacuum job on error.
> 
> However, progress reports are here to allow users to do decisions
> based on the activity of how things are working. This patch proposes
> to add multiple new fields:
> - oldest Xmin.
> - number of index scans.
> - number of pages truncated.
> - number of pages that should have been truncated, but are not truncated.
> Among all this information, as Sawada-san has already mentioned
> upthread, the more index scans the less dead tuples you can store at
> once, so autovacuum_work_mem ought to be increases. This is useful for
> tuning and should be documented properly if reported to give
> indications about vacuum behavior. The rest though, could indicate how
> aggressive autovacuum is able to remove tail blocks and do its work.
> But what really matters for users to decide if autovacuum should be
> more aggressive is tracking the number of dead tuples, something which
> is already evaluated.

Hmm. I tend to agree. Such numbers are better to be shown as
average of the last n vacuums or maximum. I decided to show
last_vacuum_index_scan only and I think that someone can record
it continuously to elsewhere if wants.

> Tracking the number of failed vacuum attempts is also something
> helpful to understand how much the job is able to complete. As there
> is already tracking vacuum jobs that have completed, it could be
> possible, instead of logging activity when a vacuum job has failed, to
> track the number of *begun* jobs on a relation. Then it is possible to
> guess how many have failed by taking the difference between those that
> completed properly. Having counters per failure types could also be a
> possibility.

Maybe pg_stat_all_tables is not the place to hold such many kinds
of vacuum specific information. pg_stat_vacuum_all_tables or
something like?

> For this commit fest, I would suggest a patch that simply adds
> tracking for the number of index scans done, with documentation to
> give recommendations about parameter tuning. i am switching the patch
> as "waiting on author".

Ok, the patch has been split into the following four parts. (Not
split by function, but by the kind of information to add.)
The first one is that.

0001. Adds pg_stat_all_tables.last_vacuum_index_scans. Documentation is added.

0002. Adds pg_stat_all_tables.vacuum_required. And primitive documentation.

0003. Adds pg_stat_all_tables.last_vacuum_status/autovacuum_fail_count  plus primitive documentation.

0004. truncation information stuff.


One concern on pg_stat_all_tables view is the number of
predefined functions it uses. Currently 20 functions and this
patch adds more seven. I feel it's better that at least the
functions this patch adds are merged into one function..

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f0132151fddddb6f8439b82465ba31e64bc3b8ad Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 16 Nov 2017 15:27:53 +0900
Subject: [PATCH 1/4] Show index scans of the last vacuum in pg_stat_all_tables

This number is already shown in the autovacuum completion log or the
result of VACUUM VERBOSE, but the number is useful to see whether
maintenance_work_mem is large enough so this patch adds the number in
pg_stat_all_tables view.
---doc/src/sgml/config.sgml             |  9 +++++++++doc/src/sgml/monitoring.sgml         |  5
+++++src/backend/catalog/system_views.sql|  1 +src/backend/commands/vacuumlazy.c    |  3
++-src/backend/postmaster/pgstat.c     |  6 +++++-src/backend/utils/adt/pgstatfuncs.c  | 14
++++++++++++++src/include/catalog/pg_proc.h       |  2 ++src/include/pgstat.h                 |  5
++++-src/test/regress/expected/rules.out |  3 +++9 files changed, 45 insertions(+), 3 deletions(-)
 

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc1752f..41f0858 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1496,6 +1496,15 @@ include_dir 'conf.d'        performance for vacuuming and for restoring database dumps.
</para>      <para>
 
+         Vacuuming scans all index pages to remove index entries that pointed
+         the removed tuples. In order to finish vacuuming by as few index
+         scans as possible, the removed tuples are remembered in working
+         memory. If this setting is not large enough, vacuuming runs
+         additional index scans to vacate the memory and it might cause a
+         performance problem. That behavior can be monitored
+         in <xref linkend="pg-stat-all-tables-view">.
+       </para>
+       <para>        Note that when autovacuum runs, up to        <xref linkend="guc-autovacuum-max-workers"> times
thismemory        may be allocated, so be careful not to set the default value
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6f82033..71823c5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2576,6 +2576,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i      (not
counting<command>VACUUM FULL</command>)</entry>    </row>    <row>
 
+     <entry><structfield>last_vacuum_index_scans</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of splitted index scans performed during the last vacuum on this table</entry>
+    </row>
+    <row>     <entry><structfield>autovacuum_count</structfield></entry>     <entry><type>bigint</type></entry>
<entry>Numberof times this table has been vacuumed by the autovacuum
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 394aea8..cf6621d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -528,6 +528,7 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_last_analyze_time(C.oid) as last_analyze,
         pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,            pg_stat_get_vacuum_count(C.oid) AS
vacuum_count,
+            pg_stat_get_last_vacuum_index_scans(C.oid) AS last_vacuum_index_scans,
pg_stat_get_autovacuum_count(C.oid)AS autovacuum_count,            pg_stat_get_analyze_count(C.oid) AS analyze_count,
        pg_stat_get_autoanalyze_count(C.oid) AS autoanalyze_count
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 6587db7..c482c8e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -342,7 +342,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),                        onerel->rd_rel->relisshared,
  new_live_tuples,
 
-                         vacrelstats->new_dead_tuples);
+                         vacrelstats->new_dead_tuples,
+                         vacrelstats->num_index_scans);    pgstat_progress_end_command();    /* and log the action if
appropriate*/
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..5f3fdf6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1403,7 +1403,8 @@ pgstat_report_autovac(Oid dboid) */voidpgstat_report_vacuum(Oid tableoid, bool shared,
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                     PgStat_Counter num_index_scans){    PgStat_MsgVacuum msg;
@@ -1417,6 +1418,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,    msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples= livetuples;    msg.m_dead_tuples = deadtuples;
 
+    msg.m_num_index_scans = num_index_scans;    pgstat_send(&msg, sizeof(msg));}
@@ -4585,6 +4587,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples= 0;        result->n_dead_tuples = 0;        result->changes_since_analyze = 0;
 
+        result->n_index_scans = 0;        result->blocks_fetched = 0;        result->blocks_hit = 0;
result->vacuum_timestamp= 0;
 
@@ -5981,6 +5984,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)    tabentry->n_live_tuples =
msg->m_live_tuples;   tabentry->n_dead_tuples = msg->m_dead_tuples;
 
+    tabentry->n_index_scans = msg->m_num_index_scans;    if (msg->m_autovacuum)    {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 8d9e7c1..2956356 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -194,6 +194,20 @@ pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS)    PG_RETURN_INT64(result);}
+Datum
+pg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int32        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int32) (tabentry->n_index_scans);
+
+    PG_RETURN_INT32(result);
+}Datumpg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0330c04..f3b606b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2887,6 +2887,8 @@ DATA(insert OID = 3317 (  pg_stat_get_wal_receiver    PGNSP PGUID 12 1 0 0 0 f f
fDESCR("statistics:information about WAL receiver");DATA(insert OID = 6118 (  pg_stat_get_subscription    PGNSP PGUID
121 0 0 0 f f f f f f s r 1 0 2249 "26" "{26,26,26,23,3220,1184,1184,3220,1184}" "{i,o,o,o,o,o,o,o,o}"
"{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}"_null_
_null_pg_stat_get_subscription _null_ _null_ _null_ ));DESCR("statistics: information about subscription");
 
+DATA(insert OID = 3281 (  pg_stat_get_last_vacuum_index_scans    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_index_scans _null_ _null_ _null_ ));
 
+DESCR("statistics: number of index scans in the last vacuum");DATA(insert OID = 2026 (  pg_backend_pid
PGNSPPGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_
));DESCR("statistics:current backend PID");DATA(insert OID = 1937 (  pg_stat_get_backend_pid        PGNSP PGUID 12 1 0
00 f f f f t f s r 1 0 23 "23" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_pid _null_ _null_ _null_ ));
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..3ab5f4a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -369,6 +369,7 @@ typedef struct PgStat_MsgVacuum    TimestampTz m_vacuumtime;    PgStat_Counter m_live_tuples;
PgStat_Counterm_dead_tuples;
 
+    PgStat_Counter m_num_index_scans;} PgStat_MsgVacuum;
@@ -629,6 +630,7 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter n_live_tuples;    PgStat_Counter
n_dead_tuples;   PgStat_Counter changes_since_analyze;
 
+    PgStat_Counter n_index_scans;    PgStat_Counter blocks_fetched;    PgStat_Counter blocks_hit;
@@ -1165,7 +1167,8 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type textern void
pgstat_report_autovac(Oiddboid);extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples);
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                     PgStat_Counter num_index_scans);extern void pgstat_report_analyze(Relation rel,
  PgStat_Counter livetuples, PgStat_Counter deadtuples,                      bool resetcounter);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f1c1b44..c334d20 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1764,6 +1764,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_last_analyze_time(c.oid) AS
last_analyze,   pg_stat_get_last_autoanalyze_time(c.oid) AS last_autoanalyze,    pg_stat_get_vacuum_count(c.oid) AS
vacuum_count,
+    pg_stat_get_last_vacuum_index_scans(c.oid) AS last_vacuum_index_scans,    pg_stat_get_autovacuum_count(c.oid) AS
autovacuum_count,   pg_stat_get_analyze_count(c.oid) AS analyze_count,    pg_stat_get_autoanalyze_count(c.oid) AS
autoanalyze_count
@@ -1911,6 +1912,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_analyze,
pg_stat_all_tables.last_autoanalyze,   pg_stat_all_tables.vacuum_count,
 
+    pg_stat_all_tables.last_vacuum_index_scans,    pg_stat_all_tables.autovacuum_count,
pg_stat_all_tables.analyze_count,   pg_stat_all_tables.autoanalyze_count
 
@@ -1954,6 +1956,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_analyze,
pg_stat_all_tables.last_autoanalyze,   pg_stat_all_tables.vacuum_count,
 
+    pg_stat_all_tables.last_vacuum_index_scans,    pg_stat_all_tables.autovacuum_count,
pg_stat_all_tables.analyze_count,   pg_stat_all_tables.autoanalyze_count
 
-- 
2.9.2

From 9619edd924c13337843f3fde221096b389701012 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 16 Nov 2017 16:18:54 +0900
Subject: [PATCH 2/4] Add vacuum_required to pg_stat_all_tables

If vacuum of a table has been failed for a long time for some reasons,
it is hard for uses to distinguish between that the server judged
vacuuming of the table is not required and that rquired but failed.
This offers convenient way to check that as the first step of trouble
shooting.
---doc/src/sgml/config.sgml             |   5 +-doc/src/sgml/maintenance.sgml        |   4
+-doc/src/sgml/monitoring.sgml        |   5 ++src/backend/catalog/system_views.sql |   1
+src/backend/commands/cluster.c      |   2 +-src/backend/commands/vacuum.c        |  69
++++++++++++++++++---src/backend/commands/vacuumlazy.c   |  14 +----src/backend/postmaster/autovacuum.c  | 115
+++++++++++++++++++++++++++++++++++src/backend/utils/adt/pgstatfuncs.c |   9 +++src/include/catalog/pg_proc.h        |
2 +src/include/commands/vacuum.h        |   3 +-src/include/postmaster/autovacuum.h  |   1
+src/test/regress/expected/rules.out |   3 +13 files changed, 210 insertions(+), 23 deletions(-)
 

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 41f0858..7262ffb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6579,7 +6579,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;       <para>
<command>VACUUM</command>performs an aggressive scan if the table's
<structname>pg_class</structname>.<structfield>relfrozenxid</structfield>field has reached
 
-        the age specified by this setting.  An aggressive scan differs from
+        the age specified by this setting. It is indicated
+        as <quote>aggressive</quote> in vacuum_required
+        of <xref linkend="pg-stat-all-tables-view">. An aggressive scan
+        differs from        a regular <command>VACUUM</command> in that it visits every page that might        contain
unfrozenXIDs or MXIDs, not just those that might contain dead        tuples.  The default is 150 million transactions.
Althoughusers can
 
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 1a37905..d045b09 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -514,7 +514,9 @@    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an    anti-wraparound
autovacuumwould be triggered at that point anyway, and    the 0.95 multiplier leaves some breathing room to run a
manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
+    <command>VACUUM</command> before that happens. It is indicated
+    as <quote>close to freeze-limit xid</quote> in vacuum_required
+    of <xref linkend="pg-stat-all-tables-view">. As a rule of thumb,    <command>vacuum_freeze_table_age</command>
shouldbe set to a value somewhat    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that    a
regularlyscheduled <command>VACUUM</command> or an autovacuum triggered by
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 71823c5..e8a8f77 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2547,6 +2547,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<entry>Estimatednumber of rows modified since this table was last analyzed</entry>    </row>    <row>
 
+     <entry><structfield>vacuum_required</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Vacuum requirement status. "partial", "aggressive", "required", "not requried" or "close to freeze-limit
xid".</entry>
+    </row>
+    <row>     <entry><structfield>last_vacuum</structfield></entry>     <entry><type>timestamp with time
zone</type></entry>    <entry>Last time at which this table was manually vacuumed
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index cf6621d..97bafb8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -523,6 +523,7 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_live_tuples(C.oid) AS n_live_tup,
 pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,            pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
 
+            pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,            pg_stat_get_last_vacuum_time(C.oid) as
last_vacuum,           pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid)as last_analyze,
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 48f1e6e..403b76d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -850,7 +850,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,     */
vacuum_set_xid_limits(OldHeap,0, 0, 0, 0,                          &OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
 
-                          NULL);
+                          NULL, NULL, NULL);    /*     * FreezeXid will become the table's new relfrozenxid, and that
mustn'tgo
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index cbd6e9b..f51dcdb 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -585,6 +585,10 @@ get_all_vacuum_rels(void) *     Xmax. * - mxactFullScanLimit is a value against which a table's
relminmxidvalue is *     compared to produce a full-table vacuum, as with xidFullScanLimit.
 
+ * - aggressive is set if it is not NULL and set true if the table needs
+ *   aggressive scan.
+ * - close_to_wrap_around_limit is set if it is not NULL and set true if it is
+ *   in anti-anti-wraparound window. * * xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is *
notinterested.
 
@@ -599,9 +603,11 @@ vacuum_set_xid_limits(Relation rel,                      TransactionId *freezeLimit,
      TransactionId *xidFullScanLimit,                      MultiXactId *multiXactCutoff,
 
-                      MultiXactId *mxactFullScanLimit)
+                      MultiXactId *mxactFullScanLimit,
+                      bool *aggressive, bool *close_to_wrap_around_limit){    int            freezemin;
+    int            freezemax;    int            mxid_freezemin;    int            effective_multixact_freeze_max_age;
 TransactionId limit;
 
@@ -701,11 +707,13 @@ vacuum_set_xid_limits(Relation rel,    *multiXactCutoff = mxactLimit;
-    if (xidFullScanLimit != NULL)
+    if (xidFullScanLimit != NULL || aggressive != NULL)    {        int            freezetable;
+        bool        maybe_anti_wrapround = false;
-        Assert(mxactFullScanLimit != NULL);
+        /* these two output should be requested together  */
+        Assert(xidFullScanLimit == NULL || mxactFullScanLimit != NULL);        /*         * Determine the table freeze
ageto use: as specified by the caller,
 
@@ -717,7 +725,14 @@ vacuum_set_xid_limits(Relation rel,        freezetable = freeze_table_age;        if (freezetable
<0)            freezetable = vacuum_freeze_table_age;
 
-        freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
+
+        freezemax = autovacuum_freeze_max_age * 0.95;
+        if (freezemax < freezetable)
+        {
+            /* We may be in anti-anti-warparound window */
+            freezetable = freezemax;
+            maybe_anti_wrapround = true;
+        }        Assert(freezetable >= 0);        /*
@@ -728,7 +743,8 @@ vacuum_set_xid_limits(Relation rel,        if (!TransactionIdIsNormal(limit))            limit =
FirstNormalTransactionId;
-        *xidFullScanLimit = limit;
+        if (xidFullScanLimit)
+            *xidFullScanLimit = limit;        /*         * Similar to the above, determine the table freeze age to use
for
@@ -741,10 +757,20 @@ vacuum_set_xid_limits(Relation rel,        freezetable = multixact_freeze_table_age;        if
(freezetable< 0)            freezetable = vacuum_multixact_freeze_table_age;
 
-        freezetable = Min(freezetable,
-                          effective_multixact_freeze_max_age * 0.95);
+
+        freezemax = effective_multixact_freeze_max_age * 0.95;
+        if (freezemax < freezetable)
+        {
+            /* We may be in anti-anti-warparound window */
+            freezetable = freezemax;
+            maybe_anti_wrapround = true;
+        }        Assert(freezetable >= 0);
+        /* We may be in anti-anti-warparound window */
+        if (effective_multixact_freeze_max_age * 0.95 < freezetable)
+            maybe_anti_wrapround = true;
+        /*         * Compute MultiXact limit causing a full-table vacuum, being careful         * to generate a valid
MultiXactvalue.
 
@@ -753,11 +779,38 @@ vacuum_set_xid_limits(Relation rel,        if (mxactLimit < FirstMultiXactId)
mxactLimit= FirstMultiXactId;
 
-        *mxactFullScanLimit = mxactLimit;
+        if (mxactFullScanLimit)
+            *mxactFullScanLimit = mxactLimit;
+
+        /*
+         * We request an aggressive scan if the table's frozen Xid is now
+         * older than or equal to the requested Xid full-table scan limit; or
+         * if the table's minimum MultiXactId is older than or equal to the
+         * requested mxid full-table scan limit.
+         */
+        if (aggressive)
+        {
+            *aggressive =
+                TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
+                                              limit);
+            *aggressive |=
+                MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
+                                            mxactLimit);
+
+            /* set close_to_wrap_around_limit if requested */
+            if (close_to_wrap_around_limit)
+                *close_to_wrap_around_limit =
+                    (*aggressive && maybe_anti_wrapround);
+        }
+        else
+        {
+            Assert (!close_to_wrap_around_limit);
+        }    }    else    {        Assert(mxactFullScanLimit == NULL);
+        Assert(aggressive == NULL);    }}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c482c8e..4274043 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -227,18 +227,10 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
params->multixact_freeze_min_age,                         params->multixact_freeze_table_age,
&OldestXmin,&FreezeLimit, &xidFullScanLimit,
 
-                          &MultiXactCutoff, &mxactFullScanLimit);
+                          &MultiXactCutoff, &mxactFullScanLimit,
+                          &aggressive, NULL);
-    /*
-     * We request an aggressive scan if the table's frozen Xid is now older
-     * than or equal to the requested Xid full-table scan limit; or if the
-     * table's minimum MultiXactId is older than or equal to the requested
-     * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
-     */
-    aggressive = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
-                                               xidFullScanLimit);
-    aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
-                                              mxactFullScanLimit);
+    /* force aggressive scan if DISABLE_PAGE_SKIPPING was specified */    if (options & VACOPT_DISABLE_PAGE_SKIPPING)
     aggressive = true;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 48765bb..abbf660 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -831,6 +831,121 @@ shutdown:}/*
+ * Returns status string of auto vacuum on the relation
+ */
+char *
+AutoVacuumRequirement(Oid reloid)
+{
+    Relation classRel;
+    Relation rel;
+    TupleDesc    pg_class_desc;
+    HeapTuple tuple;
+    Form_pg_class classForm;
+    AutoVacOpts *relopts;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatDBEntry *shared;
+    PgStat_StatDBEntry *dbentry;
+    int            effective_multixact_freeze_max_age;
+    bool        dovacuum;
+    bool        doanalyze;
+    bool        wraparound;
+    bool        aggressive;
+    bool        xid_calculated = false;
+    bool        in_anti_wa_window = false;
+    char       *ret = "not requried";
+
+    /* Compute the multixact age for which freezing is urgent. */
+    effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+
+    /* Fetch the pgclass entry for this relation */
+    tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(reloid));
+    if (!HeapTupleIsValid(tuple))
+        elog(ERROR, "cache lookup failed for relation %u", reloid);
+    classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+    /* extract relopts for autovacuum */
+    classRel = heap_open(RelationRelationId, AccessShareLock);
+    pg_class_desc = RelationGetDescr(classRel);
+    relopts = extract_autovac_opts(tuple, pg_class_desc);
+    heap_close(classRel, AccessShareLock);
+
+    /* Fetch the pgstat shared entry and entry for this database */
+    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+
+    /* Fetch the pgstat entry for this table */
+    tabentry = get_pgstat_tabentry_relid(reloid, classForm->relisshared,
+                                         shared, dbentry);
+
+    /*
+     * Check if the relation needs vacuum. This function is intended to
+     * suggest aggresive vacuum for the last 5% window in
+     * autovacuum_freeze_max_age so the variable wraparound is ignored
+     * here. See vacuum_set_xid_limits for details.
+     */
+    relation_needs_vacanalyze(reloid, relopts, classForm, tabentry,
+                              effective_multixact_freeze_max_age,
+                              &dovacuum, &doanalyze, &wraparound);
+    ReleaseSysCache(tuple);
+
+    /* get further information if needed */
+    rel = NULL;
+
+    /* don't get stuck with lock  */
+    if (ConditionalLockRelationOid(reloid, AccessShareLock))
+        rel = try_relation_open(reloid, NoLock);
+
+    if (rel)
+    {
+        TransactionId OldestXmin, FreezeLimit;
+        MultiXactId MultiXactCutoff;
+
+        vacuum_set_xid_limits(rel,
+                              vacuum_freeze_min_age,
+                              vacuum_freeze_table_age,
+                              vacuum_multixact_freeze_min_age,
+                              vacuum_multixact_freeze_table_age,
+                              &OldestXmin, &FreezeLimit, NULL,
+                              &MultiXactCutoff, NULL,
+                              &aggressive, &in_anti_wa_window);
+
+        xid_calculated = true;
+        relation_close(rel, AccessShareLock);
+    }
+
+    /* choose the proper message according to the calculation above */
+    if (xid_calculated)
+    {
+        if (dovacuum)
+        {
+            /* we don't care anti-wraparound if autovacuum is on */
+            if (aggressive)
+                ret = "aggressive";
+            else
+                ret = "partial";
+        }
+        else if (in_anti_wa_window)
+            ret = "close to freeze-limit xid";
+        /* otherwise just "not requried" */
+    }
+    else
+    {
+        /*
+         * failed to compute xid limits. show less-grained messages. We can
+         * use just "required" in the autovacuum case is enough to distinguish
+         * from full-grained messages, but we require additional words in the
+         * case where autovacuum is turned off.
+         */
+        if (dovacuum)
+            ret = "required";
+        else
+            ret = "not required (lock not acquired)";
+    }
+
+    return ret;
+}
+
+/* * Determine the time to sleep, based on the database list. * * The "canlaunch" parameter indicates whether we can
starta worker right now,
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2956356..ab80794 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -23,6 +23,7 @@#include "pgstat.h"#include "postmaster/bgworker_internals.h"#include "postmaster/postmaster.h"
+#include "postmaster/autovacuum.h"#include "storage/proc.h"#include "storage/procarray.h"#include "utils/acl.h"
@@ -195,6 +196,14 @@ pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS)}Datum
+pg_stat_get_vacuum_necessity(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+
+    PG_RETURN_TEXT_P(cstring_to_text(AutoVacuumRequirement(relid)));
+}
+
+Datumpg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS){    Oid            relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f3b606b..6b84c9a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2887,6 +2887,8 @@ DATA(insert OID = 3317 (  pg_stat_get_wal_receiver    PGNSP PGUID 12 1 0 0 0 f f
fDESCR("statistics:information about WAL receiver");DATA(insert OID = 6118 (  pg_stat_get_subscription    PGNSP PGUID
121 0 0 0 f f f f f f s r 1 0 2249 "26" "{26,26,26,23,3220,1184,1184,3220,1184}" "{i,o,o,o,o,o,o,o,o}"
"{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}"_null_
_null_pg_stat_get_subscription _null_ _null_ _null_ ));DESCR("statistics: information about subscription"); 
+DATA(insert OID = 2579 (  pg_stat_get_vacuum_necessity    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_
_null__null_ _null_ _null_ pg_stat_get_vacuum_necessity _null_ _null_ _null_ ));
 
+DESCR("statistics: true if needs vacuum");DATA(insert OID = 3281 (  pg_stat_get_last_vacuum_index_scans    PGNSP PGUID
121 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_index_scans _null_
_null__null_ ));DESCR("statistics: number of index scans in the last vacuum");DATA(insert OID = 2026 (  pg_backend_pid
             PGNSP PGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_
_null__null_ ));
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 60586b2..84bec74 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -182,7 +182,8 @@ extern void vacuum_set_xid_limits(Relation rel,                      TransactionId *freezeLimit,
                 TransactionId *xidFullScanLimit,                      MultiXactId *multiXactCutoff,
 
-                      MultiXactId *mxactFullScanLimit);
+                      MultiXactId *mxactFullScanLimit,
+                      bool *aggressive, bool *in_wa_window);extern void vac_update_datfrozenxid(void);extern void
vacuum_delay_point(void);
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index 3469915..848a322 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -49,6 +49,7 @@ extern int    Log_autovacuum_min_duration;extern bool AutoVacuumingActive(void);extern bool
IsAutoVacuumLauncherProcess(void);externbool IsAutoVacuumWorkerProcess(void);
 
+extern char *AutoVacuumRequirement(Oid reloid);#define IsAnyAutoVacuumProcess() \    (IsAutoVacuumLauncherProcess() ||
IsAutoVacuumWorkerProcess())
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c334d20..2144269 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1759,6 +1759,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid)AS n_dead_tup,    pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
 
+    pg_stat_get_vacuum_necessity(c.oid) AS vacuum_required,    pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid)AS last_autovacuum,    pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
 
@@ -1907,6 +1908,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,   pg_stat_all_tables.n_mod_since_analyze,
 
+    pg_stat_all_tables.vacuum_required,    pg_stat_all_tables.last_vacuum,    pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1951,6 +1953,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,   pg_stat_all_tables.n_mod_since_analyze,
 
+    pg_stat_all_tables.vacuum_required,    pg_stat_all_tables.last_vacuum,    pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
-- 
2.9.2

From 8c76950d665cabd095b5eed34ad25854eb2dd5a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 16 Nov 2017 17:05:21 +0900
Subject: [PATCH 3/4] Add vacuum execution status in pg_stat_all_tables

The main objective of this patch is showing how vacuuming is
failing. It is sometimes very hard to diagnose since autovacuum stops
silently in most cases. This patch leaves the reason for vacuum
failure in pg_stat_all_tables and how many times it is continuing to
fail.
---doc/src/sgml/monitoring.sgml         | 10 +++++src/backend/catalog/system_views.sql |  2
+src/backend/commands/vacuum.c       | 33 ++++++++++++++src/backend/commands/vacuumlazy.c    | 72
++++++++++++++++++++++++++++++-src/backend/postmaster/pgstat.c     | 62
+++++++++++++++++++++------src/backend/utils/adt/pgstatfuncs.c | 83
++++++++++++++++++++++++++++++++++++src/include/catalog/pg_proc.h       |  4 ++src/include/commands/vacuum.h        |
1+src/include/pgstat.h                 | 25 ++++++++++-src/test/regress/expected/rules.out  |  6 +++10 files changed,
283insertions(+), 15 deletions(-)
 

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e8a8f77..e2bf2d2 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2586,6 +2586,16 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<entry>Numberof splitted index scans performed during the last vacuum on this table</entry>    </row>    <row>
 
+     <entry><structfield>last_vacuum_status</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>The status of last autovacuum.</entry>
+    </row>
+    <row>
+     <entry><structfield>autovacuum_fail_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>The number of continuously failed vacuum trials. Cleared to zero if completed.</entry>
+    </row>
+    <row>     <entry><structfield>autovacuum_count</structfield></entry>     <entry><type>bigint</type></entry>
<entry>Numberof times this table has been vacuumed by the autovacuum
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 97bafb8..cd0ea69 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -530,6 +530,8 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_last_autoanalyze_time(C.oid) as
last_autoanalyze,           pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
pg_stat_get_last_vacuum_index_scans(C.oid)AS last_vacuum_index_scans,
 
+            pg_stat_get_last_vacuum_status(C.oid) AS last_vacuum_status,
+            pg_stat_get_autovacuum_fail_count(C.oid) AS autovacuum_fail_count,
pg_stat_get_autovacuum_count(C.oid)AS autovacuum_count,            pg_stat_get_analyze_count(C.oid) AS analyze_count,
        pg_stat_get_autoanalyze_count(C.oid) AS autoanalyze_count
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f51dcdb..ac7c2ac 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -35,6 +35,7 @@#include "catalog/pg_inherits_fn.h"#include "catalog/pg_namespace.h"#include "commands/cluster.h"
+#include "commands/progress.h"#include "commands/vacuum.h"#include "miscadmin.h"#include "nodes/makefuncs.h"
@@ -367,6 +368,9 @@ vacuum(int options, List *relations, VacuumParams *params,    }    PG_CATCH();    {
+        /* report the final status of this vacuum */
+        lazy_vacuum_cancel_handler();
+        in_vacuum = false;        VacuumCostActive = false;        PG_RE_THROW();
@@ -1463,6 +1467,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if (!onerel)
{
+        pgstat_report_vacuum(relid, false, 0, 0, 0,
+                             PGSTAT_VACUUM_SKIP_LOCK_FAILED, 0, 0);        PopActiveSnapshot();
CommitTransactionCommand();       return false;
 
@@ -1494,6 +1500,11 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
(errmsg("skipping\"%s\" --- only table or database owner can vacuum it",
RelationGetRelationName(onerel))));       relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET, 0, 0);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1511,6 +1522,12 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
(errmsg("skipping\"%s\" --- cannot vacuum non-tables or special system tables",
RelationGetRelationName(onerel))));       relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0,
+                             PGSTAT_VACUUM_SKIP_NONTARGET, 0, 0);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1526,6 +1543,12 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if
(RELATION_IS_OTHER_TEMP(onerel))   {        relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0,
+                             PGSTAT_VACUUM_SKIP_NONTARGET, 0, 0);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1539,6 +1562,12 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if
(onerel->rd_rel->relkind== RELKIND_PARTITIONED_TABLE)    {        relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0,
+                             PGSTAT_VACUUM_SKIP_NONTARGET, 0, 0);
+        PopActiveSnapshot();        CommitTransactionCommand();        /* It's OK to proceed with ANALYZE on this
table*/
 
@@ -1584,6 +1613,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)     */    if
(options& VACOPT_FULL)    {
 
+        bool isshared = onerel->rd_rel->relisshared;
+        /* close relation before vacuuming, but hold lock until commit */        relation_close(onerel, NoLock);
onerel = NULL;
 
@@ -1591,6 +1622,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)        /* VACUUM
FULLis now a variant of CLUSTER; see cluster.c */        cluster_rel(relid, InvalidOid, false,
(options& VACOPT_VERBOSE) != 0);
 
+        pgstat_report_vacuum(relid, isshared, 0, 0, 0,
+                             PGSTAT_VACUUM_FULL_FINISHED, 0, 0);    }    else        lazy_vacuum_rel(onerel, options,
params,vac_strategy);
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4274043..af38962 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -105,6 +105,8 @@typedef struct LVRelStats{
+    Oid            reloid;            /* oid of the target relation */
+    bool        shared;            /* is shared relation? */    /* hasindex = true means two-pass strategy; false
meansone-pass */    bool        hasindex;    /* Overall statistics about rel */
 
@@ -138,6 +140,7 @@ static int    elevel = -1;static TransactionId OldestXmin;static TransactionId FreezeLimit;static
MultiXactIdMultiXactCutoff;
 
+static LVRelStats *current_lvstats;static BufferAccessStrategy vac_strategy;
@@ -216,6 +219,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    else        elevel =
DEBUG2;
+    current_lvstats = NULL;    pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
RelationGetRelid(onerel));
@@ -236,12 +240,20 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    vacrelstats = (LVRelStats
*)palloc0(sizeof(LVRelStats));
 
+    vacrelstats->reloid = RelationGetRelid(onerel);
+    vacrelstats->shared = onerel->rd_rel->relisshared;    vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
vacrelstats->old_rel_tuples= onerel->rd_rel->reltuples;    vacrelstats->num_index_scans = 0;
vacrelstats->pages_removed= 0;    vacrelstats->lock_waiter_detected = false;
 
+    /*
+     * Register current vacrelstats so that final status can be reported on
+     * interrupts
+     */
+    current_lvstats = vacrelstats;
+    /* Open all indexes of the relation */    vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
vacrelstats->hasindex= (nindexes > 0);
 
@@ -331,11 +343,19 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    if (new_live_tuples < 0)
    new_live_tuples = 0;    /* just in case */
 
-    pgstat_report_vacuum(RelationGetRelid(onerel),
+    /* vacuum successfully finished. nothing to do on exit */
+    current_lvstats = NULL;
+
+    pgstat_report_vacuum(vacrelstats->reloid,                         onerel->rd_rel->relisshared,
   new_live_tuples,                         vacrelstats->new_dead_tuples,
 
-                         vacrelstats->num_index_scans);
+                         vacrelstats->num_index_scans,
+                         aggressive ?
+                         PGSTAT_VACUUM_AGGRESSIVE_FINISHED :
+                         PGSTAT_VACUUM_FINISHED,
+                         0, 0);
+    pgstat_progress_end_command();    /* and log the action if appropriate */
@@ -2198,3 +2218,51 @@ heap_page_is_all_visible(Relation rel, Buffer buf,    return all_visible;}
+
+/*
+ * lazy_vacuum_cancel_handler - report interrupted vacuum status
+ */
+void
+lazy_vacuum_cancel_handler(void)
+{
+    LVRelStats *stats = current_lvstats;
+    LocalPgBackendStatus *local_beentry;
+    PgBackendStatus *beentry;
+    int                phase;
+    int                err;
+
+    current_lvstats = NULL;
+
+    /* we have nothing to report */
+    if (!stats)
+        return;
+
+    /* get vacuum progress stored in backend status */
+    local_beentry = pgstat_fetch_stat_local_beentry(MyBackendId);
+    if (!local_beentry)
+        return;
+
+    beentry = &local_beentry->backendStatus;
+
+    Assert (beentry && beentry->st_progress_command == PROGRESS_COMMAND_VACUUM);
+
+    phase = beentry->st_progress_param[PROGRESS_VACUUM_PHASE];
+
+    /* we can reach here both on interrupt and error */
+    if (geterrcode() == ERRCODE_QUERY_CANCELED)
+        err = PGSTAT_VACUUM_CANCELED;
+    else
+        err = PGSTAT_VACUUM_ERROR;
+
+    /*
+     * vacuum has been canceled, report stats numbers without normalization
+     * here. (But currently they are not used.)
+     */
+    pgstat_report_vacuum(stats->reloid,
+                         stats->shared,
+                         stats->new_rel_tuples,
+                         stats->new_dead_tuples,
+                         stats->num_index_scans,
+                         err,
+                         phase, geterrcode());
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5f3fdf6..540c580 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1404,7 +1404,9 @@ pgstat_report_autovac(Oid dboid)voidpgstat_report_vacuum(Oid tableoid, bool shared,
     PgStat_Counter livetuples, PgStat_Counter deadtuples,
 
-                     PgStat_Counter num_index_scans)
+                     PgStat_Counter num_index_scans,
+                     PgStat_Counter status, PgStat_Counter last_phase,
+                     PgStat_Counter errcode){    PgStat_MsgVacuum msg;
@@ -1419,6 +1421,9 @@ pgstat_report_vacuum(Oid tableoid, bool shared,    msg.m_live_tuples = livetuples;
msg.m_dead_tuples= deadtuples;    msg.m_num_index_scans = num_index_scans;
 
+    msg.m_vacuum_status = status;
+    msg.m_vacuum_last_phase = last_phase;
+    msg.m_vacuum_errcode = errcode;    pgstat_send(&msg, sizeof(msg));}
@@ -4598,6 +4603,11 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->analyze_count= 0;        result->autovac_analyze_timestamp = 0;        result->autovac_analyze_count = 0;
 
+
+        result->vacuum_status = 0;
+        result->vacuum_last_phase = 0;
+        result->vacuum_errcode = 0;
+        result->vacuum_failcount = 0;    }    return result;
@@ -5982,19 +5992,47 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)    tabentry = pgstat_get_tab_entry(dbentry,
msg->m_tableoid,true);
 
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-    tabentry->n_index_scans = msg->m_num_index_scans;
+    tabentry->vacuum_status = msg->m_vacuum_status;
+    tabentry->vacuum_last_phase = msg->m_vacuum_last_phase;
+    tabentry->vacuum_errcode = msg->m_vacuum_errcode;
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
+    /*
+     * We store the numbers only when the vacuum has been completed. They
+     * might be usable to find how much the stopped vacuum processed but we
+     * choose not to show them rather than show bogus numbers.
+     */
+    switch ((StatVacuumStatus)msg->m_vacuum_status)    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
+    case PGSTAT_VACUUM_FINISHED:
+    case PGSTAT_VACUUM_FULL_FINISHED:
+    case PGSTAT_VACUUM_AGGRESSIVE_FINISHED:
+        tabentry->n_live_tuples = msg->m_live_tuples;
+        tabentry->n_dead_tuples = msg->m_dead_tuples;
+        tabentry->n_index_scans = msg->m_num_index_scans;
+        tabentry->vacuum_failcount = 0;
+
+        if (msg->m_autovacuum)
+        {
+            tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
+            tabentry->autovac_vacuum_count++;
+        }
+        else
+        {
+            tabentry->vacuum_timestamp = msg->m_vacuumtime;
+            tabentry->vacuum_count++;
+        }
+        break;
+
+    case PGSTAT_VACUUM_ERROR:
+    case PGSTAT_VACUUM_CANCELED:
+    case PGSTAT_VACUUM_SKIP_LOCK_FAILED:
+        tabentry->vacuum_failcount++;
+        break;
+
+    case PGSTAT_VACUUM_SKIP_NONTARGET:
+    default:
+        /* don't increment failure count for non-target tables */
+        break;    }}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ab80794..bc5d370 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -219,6 +219,89 @@ pg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS)}Datum
+pg_stat_get_last_vacuum_status(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    char        *result = "unknown";
+    PgStat_StatTabEntry *tabentry;
+
+    /*
+     * status string. this must be synced with the strings shown by the
+     * statistics view "pg_stat_progress_vacuum"
+     */
+    static char *phasestr[] =
+        {"initialization",
+         "scanning heap",
+         "vacuuming indexes",
+         "vacuuming heap",
+         "cleaning up indexes",
+         "trucating heap",
+         "performing final cleanup"};
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) != NULL)
+    {
+        int                    phase;
+        StatVacuumStatus    status;
+
+        status = tabentry->vacuum_status;
+        switch (status)
+        {
+        case PGSTAT_VACUUM_FINISHED:
+            result = "completed";
+            break;
+        case PGSTAT_VACUUM_ERROR:
+        case PGSTAT_VACUUM_CANCELED:
+            phase = tabentry->vacuum_last_phase;
+            /* number of elements of phasestr above */
+            if (phase >= 0 && phase <= 7)
+                result = psprintf("%s while %s",
+                                  status == PGSTAT_VACUUM_CANCELED ?
+                                  "canceled" : "error",
+                                  phasestr[phase]);
+            else
+                result = psprintf("unknown vacuum phase: %d", phase);
+            break;
+        case PGSTAT_VACUUM_SKIP_LOCK_FAILED:
+            result = "skipped - lock unavailable";
+            break;
+
+        case PGSTAT_VACUUM_AGGRESSIVE_FINISHED:
+            result = "aggressive vacuum completed";
+            break;
+
+        case PGSTAT_VACUUM_FULL_FINISHED:
+            result = "vacuum full completed";
+            break;
+
+        case PGSTAT_VACUUM_SKIP_NONTARGET:
+            result = "unvacuumable";
+            break;
+
+        default:
+            result = "unknown status";
+            break;
+        }
+    }
+
+    PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+Datum
+pg_stat_get_autovacuum_fail_count(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int32) (tabentry->vacuum_failcount);
+
+    PG_RETURN_INT32(result);
+}
+
+Datumpg_stat_get_blocks_fetched(PG_FUNCTION_ARGS){    Oid            relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6b84c9a..a51e321 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2891,6 +2891,10 @@ DATA(insert OID = 2579 (  pg_stat_get_vacuum_necessity    PGNSP PGUID 12 1 0 0 0
fDESCR("statistics:true if needs vacuum");DATA(insert OID = 3281 (  pg_stat_get_last_vacuum_index_scans    PGNSP PGUID
121 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_index_scans _null_
_null__null_ ));DESCR("statistics: number of index scans in the last vacuum");
 
+DATA(insert OID = 3420 (  pg_stat_get_last_vacuum_status    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_
_null__null_ _null_ _null_ pg_stat_get_last_vacuum_status _null_ _null_ _null_ ));
 
+DESCR("statistics: ending status of the last vacuum");
+DATA(insert OID = 3421 (  pg_stat_get_autovacuum_fail_count    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_autovacuum_fail_count _null_ _null_ _null_ ));
 
+DESCR("statistics: number of successively failed vacuum trials");DATA(insert OID = 2026 (  pg_backend_pid
 PGNSP PGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_
_null_));DESCR("statistics: current backend PID");DATA(insert OID = 1937 (  pg_stat_get_backend_pid        PGNSP PGUID
121 0 0 0 f f f f t f s r 1 0 23 "23" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_pid _null_ _null_ _null_
));
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 84bec74..da3107a 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -190,6 +190,7 @@ extern void vacuum_delay_point(void);/* in commands/vacuumlazy.c */extern void
lazy_vacuum_rel(Relationonerel, int options,                VacuumParams *params, BufferAccessStrategy bstrategy);
 
+extern void lazy_vacuum_cancel_handler(void);/* in commands/analyze.c */extern void analyze_rel(Oid relid, RangeVar
*relation,int options,
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3ab5f4a..62c2369 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -67,6 +67,20 @@ typedef enum StatMsgType    PGSTAT_MTYPE_DEADLOCK} StatMsgType;
+/*
+ * The exit status stored in vacuum report.
+ */
+typedef enum StatVacuumStatus
+{
+    PGSTAT_VACUUM_FINISHED,
+    PGSTAT_VACUUM_CANCELED,
+    PGSTAT_VACUUM_ERROR,
+    PGSTAT_VACUUM_SKIP_LOCK_FAILED,
+    PGSTAT_VACUUM_SKIP_NONTARGET,
+    PGSTAT_VACUUM_AGGRESSIVE_FINISHED,
+    PGSTAT_VACUUM_FULL_FINISHED
+} StatVacuumStatus;
+/* ---------- * The data type used for counters. * ----------
@@ -370,6 +384,9 @@ typedef struct PgStat_MsgVacuum    PgStat_Counter m_live_tuples;    PgStat_Counter m_dead_tuples;
PgStat_Counter m_num_index_scans;
 
+    PgStat_Counter m_vacuum_status;
+    PgStat_Counter m_vacuum_last_phase;
+    PgStat_Counter m_vacuum_errcode;} PgStat_MsgVacuum;
@@ -643,6 +660,10 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter analyze_count;    TimestampTz
autovac_analyze_timestamp;   /* autovacuum initiated */    PgStat_Counter autovac_analyze_count;
 
+    PgStat_Counter    vacuum_status;
+    PgStat_Counter    vacuum_last_phase;
+    PgStat_Counter    vacuum_errcode;
+    PgStat_Counter    vacuum_failcount;} PgStat_StatTabEntry;
@@ -1168,7 +1189,9 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type textern void
pgstat_report_autovac(Oiddboid);extern void pgstat_report_vacuum(Oid tableoid, bool shared,
PgStat_Counterlivetuples, PgStat_Counter deadtuples,
 
-                     PgStat_Counter num_index_scans);
+                     PgStat_Counter num_index_scans,
+                     PgStat_Counter status, PgStat_Counter last_phase,
+                     PgStat_Counter errcode);extern void pgstat_report_analyze(Relation rel,
PgStat_Counterlivetuples, PgStat_Counter deadtuples,                      bool resetcounter);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 2144269..f0a8416 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1766,6 +1766,8 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_last_autoanalyze_time(c.oid) AS
last_autoanalyze,   pg_stat_get_vacuum_count(c.oid) AS vacuum_count,    pg_stat_get_last_vacuum_index_scans(c.oid) AS
last_vacuum_index_scans,
+    pg_stat_get_last_vacuum_status(c.oid) AS last_vacuum_status,
+    pg_stat_get_autovacuum_fail_count(c.oid) AS autovacuum_fail_count,    pg_stat_get_autovacuum_count(c.oid) AS
autovacuum_count,   pg_stat_get_analyze_count(c.oid) AS analyze_count,    pg_stat_get_autoanalyze_count(c.oid) AS
autoanalyze_count
@@ -1915,6 +1917,8 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_autoanalyze,
pg_stat_all_tables.vacuum_count,   pg_stat_all_tables.last_vacuum_index_scans,
 
+    pg_stat_all_tables.last_vacuum_status,
+    pg_stat_all_tables.autovacuum_fail_count,    pg_stat_all_tables.autovacuum_count,
pg_stat_all_tables.analyze_count,   pg_stat_all_tables.autoanalyze_count
 
@@ -1960,6 +1964,8 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_autoanalyze,
pg_stat_all_tables.vacuum_count,   pg_stat_all_tables.last_vacuum_index_scans,
 
+    pg_stat_all_tables.last_vacuum_status,
+    pg_stat_all_tables.autovacuum_fail_count,    pg_stat_all_tables.autovacuum_count,
pg_stat_all_tables.analyze_count,   pg_stat_all_tables.autoanalyze_count
 
-- 
2.9.2

From 6052a2e9c0c01e53fa083f9e63e1cee610ae09a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 16 Nov 2017 17:47:16 +0900
Subject: [PATCH 4/4] Add truncation information and oldestxmin topg_stat_all_tables

This patch adds truncated and tried-but-not-truncated pages in the
last vacuum. This is intended to use to find uncertain failure of
truncation or unwanted aggressive trancation.

This also adds oldestxmin, find whether a long transaction hindered
vacuuming or not.
---doc/src/sgml/monitoring.sgml         | 15 ++++++++++++src/backend/catalog/system_views.sql |  3
+++src/backend/commands/vacuum.c       | 25 +++++++++++---------src/backend/commands/vacuumlazy.c    | 15
++++++++++++src/backend/postmaster/pgstat.c     | 12 ++++++++++src/backend/utils/adt/pgstatfuncs.c  | 45
++++++++++++++++++++++++++++++++++++src/include/catalog/pg_proc.h       |  6 +++++src/include/pgstat.h
| 9 ++++++++src/test/regress/expected/rules.out  |  9 ++++++++9 files changed, 128 insertions(+), 11 deletions(-)
 

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e2bf2d2..d496fe8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2581,11 +2581,26 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i      (not
counting<command>VACUUM FULL</command>)</entry>    </row>    <row>
 
+     <entry><structfield>last_vacuum_truncated</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number actually truncated pages during the last vacuum on this table</entry>
+    </row>
+    <row>
+     <entry><structfield>last_vacuum_untruncated</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number tried but actually not truncated pages during the last vacuum on this table</entry>
+    </row>
+    <row>     <entry><structfield>last_vacuum_index_scans</structfield></entry>
<entry><type>integer</type></entry>    <entry>Number of splitted index scans performed during the last vacuum on this
table</entry>   </row>    <row>
 
+     <entry><structfield>last_vacuum_oldext_xmin</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>The oldest xmin used by the last vacuum on this table</entry>
+    </row>
+    <row>     <entry><structfield>last_vacuum_status</structfield></entry>     <entry><type>text</type></entry>
<entry>Thestatus of last autovacuum.</entry>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index cd0ea69..0eb3a76 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -529,7 +529,10 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_last_analyze_time(C.oid) as last_analyze,
          pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,            pg_stat_get_vacuum_count(C.oid) AS
vacuum_count,
+            pg_stat_get_last_vacuum_truncated(C.oid) AS last_vacuum_truncated,
+            pg_stat_get_last_vacuum_untruncated(C.oid) AS last_vacuum_untruncated,
pg_stat_get_last_vacuum_index_scans(C.oid)AS last_vacuum_index_scans,
 
+            pg_stat_get_last_vacuum_oldest_xmin(C.oid) AS last_vacuum_oldest_xmin,
pg_stat_get_last_vacuum_status(C.oid)AS last_vacuum_status,            pg_stat_get_autovacuum_fail_count(C.oid) AS
autovacuum_fail_count,           pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ac7c2ac..a0c5a12 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1467,8 +1467,9 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if (!onerel)
{
-        pgstat_report_vacuum(relid, false, 0, 0, 0,
-                             PGSTAT_VACUUM_SKIP_LOCK_FAILED, 0, 0);
+        pgstat_report_vacuum(relid, false,
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_LOCK_FAILED,
+                             InvalidTransactionId, 0, 0);        PopActiveSnapshot();
CommitTransactionCommand();       return false;
 
@@ -1503,7 +1504,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
pgstat_report_vacuum(RelationGetRelid(onerel),                            onerel->rd_rel->relisshared,
 
-                             0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET, 0, 0);
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET,
+                             InvalidTransactionId, 0, 0);        PopActiveSnapshot();
CommitTransactionCommand();
@@ -1525,8 +1527,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
pgstat_report_vacuum(RelationGetRelid(onerel),                            onerel->rd_rel->relisshared,
 
-                             0, 0, 0,
-                             PGSTAT_VACUUM_SKIP_NONTARGET, 0, 0);
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET,
+                             InvalidTransactionId, 0, 0);        PopActiveSnapshot();
CommitTransactionCommand();
@@ -1546,8 +1548,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
pgstat_report_vacuum(RelationGetRelid(onerel),                            onerel->rd_rel->relisshared,
 
-                             0, 0, 0,
-                             PGSTAT_VACUUM_SKIP_NONTARGET, 0, 0);
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET,
+                             InvalidTransactionId, 0, 0);        PopActiveSnapshot();
CommitTransactionCommand();
@@ -1565,8 +1567,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
pgstat_report_vacuum(RelationGetRelid(onerel),                            onerel->rd_rel->relisshared,
 
-                             0, 0, 0,
-                             PGSTAT_VACUUM_SKIP_NONTARGET, 0, 0);
+                             0, 0, 0, 0, 0, PGSTAT_VACUUM_SKIP_NONTARGET,
+                             InvalidTransactionId, 0, 0);        PopActiveSnapshot();
CommitTransactionCommand();
@@ -1622,8 +1624,9 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)        /* VACUUM
FULLis now a variant of CLUSTER; see cluster.c */        cluster_rel(relid, InvalidOid, false,
(options& VACOPT_VERBOSE) != 0);
 
-        pgstat_report_vacuum(relid, isshared, 0, 0, 0,
-                             PGSTAT_VACUUM_FULL_FINISHED, 0, 0);
+        pgstat_report_vacuum(relid, isshared, 0, 0, 0, 0, 0,
+                             PGSTAT_VACUUM_FULL_FINISHED,
+                             InvalidTransactionId, 0, 0);    }    else        lazy_vacuum_rel(onerel, options, params,
vac_strategy);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index af38962..fcd1e3e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -121,6 +121,7 @@ typedef struct LVRelStats    double        new_rel_tuples; /* new estimated total # of tuples */
double       new_dead_tuples;    /* new estimated total # of dead tuples */    BlockNumber pages_removed;
 
+    BlockNumber pages_not_removed;    double        tuples_deleted;    BlockNumber nonempty_pages; /* actually, last
nonemptypage + 1 */    /* List of TIDs of tuples we intend to delete */
 
@@ -246,6 +247,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    vacrelstats->old_rel_tuples
=onerel->rd_rel->reltuples;    vacrelstats->num_index_scans = 0;    vacrelstats->pages_removed = 0;
 
+    vacrelstats->pages_not_removed = 0;    vacrelstats->lock_waiter_detected = false;    /*
@@ -284,8 +286,15 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,     * Optionally truncate the
relation.    */    if (should_attempt_truncation(vacrelstats))
 
+    {        lazy_truncate_heap(onerel, vacrelstats);
+        /* just paranoia */
+        if (vacrelstats->rel_pages >= vacrelstats->nonempty_pages)
+            vacrelstats->pages_not_removed +=
+                vacrelstats->rel_pages - vacrelstats->nonempty_pages;
+    }
+    /* Report that we are now doing final cleanup */    pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
                    PROGRESS_VACUUM_PHASE_FINAL_CLEANUP);
 
@@ -350,7 +359,10 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
onerel->rd_rel->relisshared,                        new_live_tuples,
vacrelstats->new_dead_tuples,
+                         vacrelstats->pages_removed,
+                         vacrelstats->pages_not_removed,                         vacrelstats->num_index_scans,
+                         OldestXmin,                         aggressive ?
PGSTAT_VACUUM_AGGRESSIVE_FINISHED:                         PGSTAT_VACUUM_FINISHED,
 
@@ -2262,7 +2274,10 @@ lazy_vacuum_cancel_handler(void)                         stats->shared,
stats->new_rel_tuples,                        stats->new_dead_tuples,
 
+                         stats->pages_removed,
+                         stats->pages_not_removed,                         stats->num_index_scans,
+                         OldestXmin,                         err,                         phase, geterrcode());}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 540c580..2d3a6ae 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1404,7 +1404,10 @@ pgstat_report_autovac(Oid dboid)voidpgstat_report_vacuum(Oid tableoid, bool shared,
      PgStat_Counter livetuples, PgStat_Counter deadtuples,
 
+                     PgStat_Counter pages_removed,
+                     PgStat_Counter pages_not_removed,                     PgStat_Counter num_index_scans,
+                     TransactionId    oldestxmin,                     PgStat_Counter status, PgStat_Counter
last_phase,                    PgStat_Counter errcode){
 
@@ -1420,7 +1423,10 @@ pgstat_report_vacuum(Oid tableoid, bool shared,    msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples= livetuples;    msg.m_dead_tuples = deadtuples;
 
+    msg.m_pages_removed = pages_removed;
+    msg.m_pages_not_removed = pages_not_removed;    msg.m_num_index_scans = num_index_scans;
+    msg.m_oldest_xmin = oldestxmin;    msg.m_vacuum_status = status;    msg.m_vacuum_last_phase = last_phase;
msg.m_vacuum_errcode= errcode;
 
@@ -4592,7 +4598,10 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples= 0;        result->n_dead_tuples = 0;        result->changes_since_analyze = 0;
 
+        result->n_pages_removed = 0;
+        result->n_pages_not_removed = 0;        result->n_index_scans = 0;
+        result->oldest_xmin = InvalidTransactionId;        result->blocks_fetched = 0;        result->blocks_hit = 0;
     result->vacuum_timestamp = 0;
 
@@ -6008,7 +6017,10 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)    case PGSTAT_VACUUM_AGGRESSIVE_FINISHED:
   tabentry->n_live_tuples = msg->m_live_tuples;        tabentry->n_dead_tuples = msg->m_dead_tuples;
 
+        tabentry->n_pages_removed = msg->m_pages_removed;
+        tabentry->n_pages_not_removed = msg->m_pages_not_removed;        tabentry->n_index_scans =
msg->m_num_index_scans;
+        tabentry->oldest_xmin = msg->m_oldest_xmin;        tabentry->vacuum_failcount = 0;        if
(msg->m_autovacuum)
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index bc5d370..769a196 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -204,6 +204,36 @@ pg_stat_get_vacuum_necessity(PG_FUNCTION_ARGS)}Datum
+pg_stat_get_last_vacuum_truncated(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int64) (tabentry->n_pages_removed);
+
+    PG_RETURN_INT64(result);
+}
+
+Datum
+pg_stat_get_last_vacuum_untruncated(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int64) (tabentry->n_pages_not_removed);
+
+    PG_RETURN_INT64(result);
+}
+
+Datumpg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS){    Oid            relid = PG_GETARG_OID(0);
@@ -219,6 +249,21 @@ pg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS)}Datum
+pg_stat_get_last_vacuum_oldest_xmin(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    TransactionId    result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = InvalidTransactionId;
+    else
+        result = (int32) (tabentry->oldest_xmin);
+
+    return TransactionIdGetDatum(result);
+}
+
+Datumpg_stat_get_last_vacuum_status(PG_FUNCTION_ARGS){    Oid            relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index a51e321..a3623dd 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2889,6 +2889,12 @@ DATA(insert OID = 6118 (  pg_stat_get_subscription    PGNSP PGUID 12 1 0 0 0 f f
fDESCR("statistics:information about subscription");DATA(insert OID = 2579 (  pg_stat_get_vacuum_necessity    PGNSP
PGUID12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_vacuum_necessity _null_
_null__null_ ));DESCR("statistics: true if needs vacuum");
 
+DATA(insert OID = 3422 (  pg_stat_get_last_vacuum_untruncated    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_untruncated _null_ _null_ _null_ ));
 
+DESCR("statistics: pages left untruncated in the last vacuum");
+DATA(insert OID = 3423 (  pg_stat_get_last_vacuum_truncated    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_truncated _null_ _null_ _null_ ));
 
+DESCR("statistics: pages truncated in the last vacuum");
+DATA(insert OID = 3424 (  pg_stat_get_last_vacuum_oldest_xmin    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 28 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_oldest_xmin _null_ _null_ _null_ ));
 
+DESCR("statistics: The oldest xmin used in the last vacuum");DATA(insert OID = 3281 (
pg_stat_get_last_vacuum_index_scans   PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_
_null_pg_stat_get_last_vacuum_index_scans _null_ _null_ _null_ ));DESCR("statistics: number of index scans in the last
vacuum");DATA(insertOID = 3420 (  pg_stat_get_last_vacuum_status    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_status _null_ _null_ _null_ ));
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 62c2369..5b8bf7e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -383,7 +383,10 @@ typedef struct PgStat_MsgVacuum    TimestampTz m_vacuumtime;    PgStat_Counter m_live_tuples;
PgStat_Counterm_dead_tuples;
 
+    PgStat_Counter m_pages_removed;
+    PgStat_Counter m_pages_not_removed;    PgStat_Counter m_num_index_scans;
+    TransactionId  m_oldest_xmin;    PgStat_Counter m_vacuum_status;    PgStat_Counter m_vacuum_last_phase;
PgStat_Counterm_vacuum_errcode;
 
@@ -647,7 +650,10 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter n_live_tuples;    PgStat_Counter
n_dead_tuples;   PgStat_Counter changes_since_analyze;
 
+    PgStat_Counter n_pages_removed;
+    PgStat_Counter n_pages_not_removed;    PgStat_Counter n_index_scans;
+    TransactionId  oldest_xmin;    PgStat_Counter blocks_fetched;    PgStat_Counter blocks_hit;
@@ -1189,7 +1195,10 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type textern void
pgstat_report_autovac(Oiddboid);extern void pgstat_report_vacuum(Oid tableoid, bool shared,
PgStat_Counterlivetuples, PgStat_Counter deadtuples,
 
+                     PgStat_Counter pages_removed,
+                     PgStat_Counter pages_not_removed,                     PgStat_Counter num_index_scans,
+                     TransactionId oldextxmin,                     PgStat_Counter status, PgStat_Counter last_phase,
                 PgStat_Counter errcode);extern void pgstat_report_analyze(Relation rel,
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f0a8416..fb1ea49 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1765,7 +1765,10 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_last_analyze_time(c.oid) AS
last_analyze,   pg_stat_get_last_autoanalyze_time(c.oid) AS last_autoanalyze,    pg_stat_get_vacuum_count(c.oid) AS
vacuum_count,
+    pg_stat_get_last_vacuum_truncated(c.oid) AS last_vacuum_truncated,
+    pg_stat_get_last_vacuum_untruncated(c.oid) AS last_vacuum_untruncated,
pg_stat_get_last_vacuum_index_scans(c.oid)AS last_vacuum_index_scans,
 
+    pg_stat_get_last_vacuum_oldest_xmin(c.oid) AS last_vacuum_oldest_xmin,    pg_stat_get_last_vacuum_status(c.oid) AS
last_vacuum_status,   pg_stat_get_autovacuum_fail_count(c.oid) AS autovacuum_fail_count,
pg_stat_get_autovacuum_count(c.oid)AS autovacuum_count,
 
@@ -1916,7 +1919,10 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_analyze,
pg_stat_all_tables.last_autoanalyze,   pg_stat_all_tables.vacuum_count,
 
+    pg_stat_all_tables.last_vacuum_truncated,
+    pg_stat_all_tables.last_vacuum_untruncated,    pg_stat_all_tables.last_vacuum_index_scans,
+    pg_stat_all_tables.last_vacuum_oldest_xmin,    pg_stat_all_tables.last_vacuum_status,
pg_stat_all_tables.autovacuum_fail_count,   pg_stat_all_tables.autovacuum_count,
 
@@ -1963,7 +1969,10 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_analyze,
pg_stat_all_tables.last_autoanalyze,   pg_stat_all_tables.vacuum_count,
 
+    pg_stat_all_tables.last_vacuum_truncated,
+    pg_stat_all_tables.last_vacuum_untruncated,    pg_stat_all_tables.last_vacuum_index_scans,
+    pg_stat_all_tables.last_vacuum_oldest_xmin,    pg_stat_all_tables.last_vacuum_status,
pg_stat_all_tables.autovacuum_fail_count,   pg_stat_all_tables.autovacuum_count,
 
-- 
2.9.2

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

18 November 2017, 19:23:20

On Thu, Nov 16, 2017 at 7:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Wed, 15 Nov 2017 16:13:01 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQm_WCKuUf5RD0CzeMuMO907ZPKP7mBh-3t2zSJ9jn+PA@mail.gmail.com>
>>              pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
>> +           pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,
>>              pg_stat_get_last_vacuum_time(C.oid) as last_vacuum,
>>              pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
>>              pg_stat_get_last_analyze_time(C.oid) as last_analyze,
>>              pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
>>              pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
>> Please use spaces instead of tabs. Indentation is not consistent.
>
> Done. Thank you for pointing. (whitespace-mode showed me some
> similar inconsistencies at the other places in the file...)

Yes, I am aware of those which get introduced here and there. Let's
not make things worse..

>> +       case PGSTAT_VACUUM_CANCELED:
>> +           phase = tabentry->vacuum_last_phase;
>> +           /* number of elements of phasestr above */
>> +           if (phase >= 0 && phase <= 7)
>> +               result = psprintf("%s while %s",
>> +                                 status == PGSTAT_VACUUM_CANCELED ?
>> +                                 "canceled" : "error",
>> +                                 phasestr[phase]);
>> Such complication is not necessary. The phase parameter is updated by
>> individual calls of pgstat_progress_update_param(), so the information
>> showed here overlaps with the existing information in the "phase"
>> field.
>
> The "phase" is pg_stat_progress_vacuum's? If "complexy" means
> phasestr[phase], the "phase" cannot be overlap with
> last_vacuum_status since pg_stat_progress_vacuum's entry has
> already gone when someone looks into pg_stat_all_tables and see a
> failed vacuum status. Could you give a bit specific comment?

I mean that if you tend to report this information, you should just
use a separate column for it. Having a single column report two
informations, which are here the type of error and potentially the
moment where it appeared are harder to parse.

>> However, progress reports are here to allow users to do decisions
>> based on the activity of how things are working. This patch proposes
>> to add multiple new fields:
>> - oldest Xmin.
>> - number of index scans.
>> - number of pages truncated.
>> - number of pages that should have been truncated, but are not truncated.
>> Among all this information, as Sawada-san has already mentioned
>> upthread, the more index scans the less dead tuples you can store at
>> once, so autovacuum_work_mem ought to be increases. This is useful for
>> tuning and should be documented properly if reported to give
>> indications about vacuum behavior. The rest though, could indicate how
>> aggressive autovacuum is able to remove tail blocks and do its work.
>> But what really matters for users to decide if autovacuum should be
>> more aggressive is tracking the number of dead tuples, something which
>> is already evaluated.
>
> Hmm. I tend to agree. Such numbers are better to be shown as
> average of the last n vacuums or maximum. I decided to show
> last_vacuum_index_scan only and I think that someone can record
> it continuously to elsewhere if wants.

As a user, what would you make of those numbers? How would they help
in tuning autovacuum for a relation? We need to clear up those
questions before thinking if there are cases where those are useful.

>> Tracking the number of failed vacuum attempts is also something
>> helpful to understand how much the job is able to complete. As there
>> is already tracking vacuum jobs that have completed, it could be
>> possible, instead of logging activity when a vacuum job has failed, to
>> track the number of *begun* jobs on a relation. Then it is possible to
>> guess how many have failed by taking the difference between those that
>> completed properly. Having counters per failure types could also be a
>> possibility.
>
> Maybe pg_stat_all_tables is not the place to hold such many kinds
> of vacuum specific information. pg_stat_vacuum_all_tables or
> something like?

What do you have in mind? pg_stat_all_tables already includes counters
about the number of vacuums and analyze runs completed. I guess that
the number of failures, and the types of failures ought to be similar
counters at the same level.

>> For this commit fest, I would suggest a patch that simply adds
>> tracking for the number of index scans done, with documentation to
>> give recommendations about parameter tuning. i am switching the patch
>> as "waiting on author".
>
> Ok, the patch has been split into the following four parts. (Not
> split by function, but by the kind of information to add.)
> The first one is that.
>
> 0001. Adds pg_stat_all_tables.last_vacuum_index_scans. Documentation is added.
>
> 0002. Adds pg_stat_all_tables.vacuum_required. And primitive documentation.
>
> 0003. Adds pg_stat_all_tables.last_vacuum_status/autovacuum_fail_count
>    plus primitive documentation.
>
> 0004. truncation information stuff.
>
> One concern on pg_stat_all_tables view is the number of
> predefined functions it uses. Currently 20 functions and this
> patch adds more seven. I feel it's better that at least the
> functions this patch adds are merged into one function..

For the scope of this commit fest, why not focusing only on 0001 with
the time that remains? This at least is something I am sure will be
useful.
       <para>
+         Vacuuming scans all index pages to remove index entries that pointed
+         the removed tuples. In order to finish vacuuming by as few index
+         scans as possible, the removed tuples are remembered in working
+         memory. If this setting is not large enough, vacuuming runs
+         additional index scans to vacate the memory and it might cause a
+         performance problem. That behavior can be monitored
+         in <xref linkend="pg-stat-all-tables-view">.
+       </para>
Why not making that the third paragraph, after autovacuum_work_mem has
been mentioned for the first time? This could be reworded as well.
Short idea:
Vacuum scans all index pages to remove index entries that pointed to
dead tuples. Finishing vacuum with a minimal number of index scans
reduces the time it takes to complete it, and a new scan is triggered
once the in-memory storage for dead tuple pointers gets full, whose
size is defined by autovacuum_work_mem. So increasing this parameter
can make the operation finish more quickly. This can be monitored with
pg_stat_all_tables.
            pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
+            pg_stat_get_last_vacuum_index_scans(C.oid) AS
last_vacuum_index_scans,            pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
Counters with counters, and last vacuum info with last vacuum info, no?
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

21 November 2017, 13:09:57

Thank you for the comments.

At Sat, 18 Nov 2017 22:23:20 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQV1Emkj=5VFzui250T6v+xcpRQ2RfHu_oQMbdXnZw3mA@mail.gmail.com>
> On Thu, Nov 16, 2017 at 7:34 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Wed, 15 Nov 2017 16:13:01 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQm_WCKuUf5RD0CzeMuMO907ZPKP7mBh-3t2zSJ9jn+PA@mail.gmail.com>
> >> Please use spaces instead of tabs. Indentation is not consistent.
> >
> > Done. Thank you for pointing. (whitespace-mode showed me some
> > similar inconsistencies at the other places in the file...)
> 
> Yes, I am aware of those which get introduced here and there. Let's
> not make things worse..

Year, I agree with it.

> >> +       case PGSTAT_VACUUM_CANCELED:
> >> +           phase = tabentry->vacuum_last_phase;
> >> +           /* number of elements of phasestr above */
> >> +           if (phase >= 0 && phase <= 7)
> >> +               result = psprintf("%s while %s",
> >> +                                 status == PGSTAT_VACUUM_CANCELED ?
> >> +                                 "canceled" : "error",
> >> +                                 phasestr[phase]);
> >> Such complication is not necessary. The phase parameter is updated by
> >> individual calls of pgstat_progress_update_param(), so the information
> >> showed here overlaps with the existing information in the "phase"
> >> field.
> >
> > The "phase" is pg_stat_progress_vacuum's? If "complexy" means
> > phasestr[phase], the "phase" cannot be overlap with
> > last_vacuum_status since pg_stat_progress_vacuum's entry has
> > already gone when someone looks into pg_stat_all_tables and see a
> > failed vacuum status. Could you give a bit specific comment?
> 
> I mean that if you tend to report this information, you should just
> use a separate column for it. Having a single column report two
> informations, which are here the type of error and potentially the
> moment where it appeared are harder to parse.

Thanks for the explanation. Ok, now "last_vacuum_status" just
show how the last vacuum or autovacuum finished, in "completed",
"error", "canceled" and "skipped".  "last_vacuum_status_detail"
shows the phase at exiting if "error" or "canceled". They are
still in a bit complex relationship. (pgstatfuncs.c) "error" and
"cancel" could be unified since the error code is already shown
in log.

last_vac_status | last_vac_stat_detail 
================+=======================
"completed"     | (null)/"aggressive"/"full" + "partially truncated" +               | "not a target"
"skipped"       | "lock failure"
"error"         | <errcode> + <phase>
"canceled"      | <phase>

> >> However, progress reports are here to allow users to do decisions
> >> based on the activity of how things are working. This patch proposes
> >> to add multiple new fields:
> >> - oldest Xmin.
> >> - number of index scans.
> >> - number of pages truncated.
> >> - number of pages that should have been truncated, but are not truncated.
> >> Among all this information, as Sawada-san has already mentioned
> >> upthread, the more index scans the less dead tuples you can store at
> >> once, so autovacuum_work_mem ought to be increases. This is useful for
> >> tuning and should be documented properly if reported to give
> >> indications about vacuum behavior. The rest though, could indicate how
> >> aggressive autovacuum is able to remove tail blocks and do its work.
> >> But what really matters for users to decide if autovacuum should be
> >> more aggressive is tracking the number of dead tuples, something which
> >> is already evaluated.
> >
> > Hmm. I tend to agree. Such numbers are better to be shown as
> > average of the last n vacuums or maximum. I decided to show
> > last_vacuum_index_scan only and I think that someone can record
> > it continuously to elsewhere if wants.
> 
> As a user, what would you make of those numbers? How would they help
> in tuning autovacuum for a relation? We need to clear up those
> questions before thinking if there are cases where those are useful.

Ah, I found what you meant. The criteria to choose the numbers in
the previous patch was just what is not logged, and usable to
find whether something wrong is happening on vacuum. So # of
index scans was not in the list. The objective here is to find
the health of vacuum just by looking into stats views.

vacuum_required: apparently cannot be logged, and it is not so                easy to calculate.

last_vacuum_index_scans: It is shown in the log, but I agree that it                is usable to find
maintenance_work_memis too small.
 

last_vacuum_status: It is logged, but it is likely for users to                examine it after something bad has
happend.               "complete" in this column immediately shows that                vacuum on the table is perfectly
working.

last_vacuum_status_detail: The cause of cancel or skipping is not                logged but always it is hard to find
outwhat is                wrong. This narrows the area for users and/or support         to investigate.
 

autovacuum_fail_count: When vacuum has not executed for a long                time, users cannot tell wheter vacuum is
not               required at all or vacuum trials have been                skipped/canceled. This makes distinction
between               the two cases.
 

last_vacuum_untruncated: This is not shown in a log entry. Uses can                find that trailing empty pages are
leftuntruncted.
 

last_vacuum_truncated: This is shown in the log. This is just here in                order to be compared to untruncte
since# untruncated                solely doesn't have meaning.                Or conversely can find that relations are
              *unwantedly* truncated (as my understanding of the                suggestion from Alvaro)
 

last_vacuum_oldest_xmin: A problem very frequently happens is table                bloat caused by long transactions.
           
 

> >> Tracking the number of failed vacuum attempts is also something
> >> helpful to understand how much the job is able to complete. As there
> >> is already tracking vacuum jobs that have completed, it could be
> >> possible, instead of logging activity when a vacuum job has failed, to
> >> track the number of *begun* jobs on a relation. Then it is possible to
> >> guess how many have failed by taking the difference between those that
> >> completed properly. Having counters per failure types could also be a
> >> possibility.
> >
> > Maybe pg_stat_all_tables is not the place to hold such many kinds
> > of vacuum specific information. pg_stat_vacuum_all_tables or
> > something like?
> 
> What do you have in mind? pg_stat_all_tables already includes counters

Nothing specific in my mind.

> about the number of vacuums and analyze runs completed. I guess that
> the number of failures, and the types of failures ought to be similar
> counters at the same level.

Yes, my concern here is how many column we can allow in a stats
view. I think I'm a bit too warried about that.

> >> For this commit fest, I would suggest a patch that simply adds
> >> tracking for the number of index scans done, with documentation to
> >> give recommendations about parameter tuning. i am switching the patch
> >> as "waiting on author".
> >
> > Ok, the patch has been split into the following four parts. (Not
> > split by function, but by the kind of information to add.)
> > The first one is that.
> >
> > 0001. Adds pg_stat_all_tables.last_vacuum_index_scans. Documentation is added.
> >
> > 0002. Adds pg_stat_all_tables.vacuum_required. And primitive documentation.
> >
> > 0003. Adds pg_stat_all_tables.last_vacuum_status/autovacuum_fail_count
> >    plus primitive documentation.
> >
> > 0004. truncation information stuff.
> >
> > One concern on pg_stat_all_tables view is the number of
> > predefined functions it uses. Currently 20 functions and this
> > patch adds more seven. I feel it's better that at least the
> > functions this patch adds are merged into one function..
> 
> For the scope of this commit fest, why not focusing only on 0001 with
> the time that remains? This at least is something I am sure will be
> useful.

Year, so I separated the 0001 patch, but it was not my intention in
this thread. It was 0002 and 0003 so I'd like *show* them with 0001
and focusing on 0001 for this commit fest is fine to me.

>         <para>
> +         Vacuuming scans all index pages to remove index entries that pointed
> +         the removed tuples. In order to finish vacuuming by as few index
> +         scans as possible, the removed tuples are remembered in working
> +         memory. If this setting is not large enough, vacuuming runs
> +         additional index scans to vacate the memory and it might cause a
> +         performance problem. That behavior can be monitored
> +         in <xref linkend="pg-stat-all-tables-view">.
> +       </para>
> Why not making that the third paragraph, after autovacuum_work_mem has
> been mentioned for the first time? This could be reworded as well.

Just to place the Note at the last paragrah. The Note is mentioning
multiplication of autovacuum_work_mem, not about the guc
itself. Anyway I swapped them in this version.

> Short idea:
> Vacuum scans all index pages to remove index entries that pointed to
> dead tuples. Finishing vacuum with a minimal number of index scans
> reduces the time it takes to complete it, and a new scan is triggered
> once the in-memory storage for dead tuple pointers gets full, whose
> size is defined by autovacuum_work_mem. So increasing this parameter
> can make the operation finish more quickly. This can be monitored with
> pg_stat_all_tables.

I thought that it *must* be reworded anyway (because of my poor
wording). Thanks for rewording. I find this perfect.


>              pg_stat_get_vacuum_count(C.oid) AS vacuum_count,
> +            pg_stat_get_last_vacuum_index_scans(C.oid) AS
> last_vacuum_index_scans,
>              pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
> Counters with counters, and last vacuum info with last vacuum info, no?

Moved it to above vacuum_count.

By the way I'm uneasy that the 'last_vacuum_index_scans' (and
vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
both VACUUM command and autovacuum, while last_vacuum and
vacuum_count is mentioning only the command. Splitting it into
vacuum/autovaccum seems nonsense but the name is confusing. Do
you have any idea?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From af124a675637c44781ff84a979e6d9d0afb1e8d4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 21 Nov 2017 10:47:52 +0900
Subject: [PATCH 4/4] Add truncation information to pg_stat_all_tables

This patch adds truncated and tried-but-not-truncated pages in the
last vacuum. This is intended to use to find uncertain failure of
truncation or unwanted aggressive trancation.
---doc/src/sgml/monitoring.sgml         | 10 ++++++++++src/backend/catalog/system_views.sql |  2
++src/backend/commands/vacuum.c       | 14 ++++++++------src/backend/commands/vacuumlazy.c    | 13
++++++++++++-src/backend/postmaster/pgstat.c     | 10 +++++++++-src/backend/utils/adt/pgstatfuncs.c  | 32
+++++++++++++++++++++++++++++++-src/include/catalog/pg_proc.h       |  4 ++++src/include/pgstat.h                 |  6
++++++src/test/regress/expected/rules.out |  6 ++++++9 files changed, 88 insertions(+), 9 deletions(-)
 

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a0288cb..fd0507a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2585,6 +2585,16 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<entry>Oldestxmin used by the last vacuum on this table</entry>    </row>    <row>
 
+     <entry><structfield>last_vacuum_truncated</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number actually truncated pages during the last vacuum on this table</entry>
+    </row>
+    <row>
+     <entry><structfield>last_vacuum_untruncated</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Number tried but actually not truncated pages during the last vacuum on this table</entry>
+    </row>
+    <row>     <entry><structfield>last_vacuum_status</structfield></entry>     <entry><type>text</type></entry>
<entry>Statusof the last vacuum or autovacuum.</entry>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c69fea9..528e9c5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -529,6 +529,8 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_last_analyze_time(C.oid) as last_analyze,
         pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
pg_stat_get_last_vacuum_index_scans(C.oid)AS last_vacuum_index_scans,
 
+            pg_stat_get_last_vacuum_truncated(C.oid) AS last_vacuum_truncated,
+            pg_stat_get_last_vacuum_untruncated(C.oid) AS last_vacuum_untruncated,
pg_stat_get_last_vacuum_oldest_xmin(C.oid)AS last_vacuum_oldest_xmin,            pg_stat_get_last_vacuum_status(C.oid)
ASlast_vacuum_status,            pg_stat_get_last_vacuum_status_detail(C.oid) AS last_vacuum_status_detail,
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index cf0bca7..cf754f9 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1467,7 +1467,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if (!onerel)
{
-        pgstat_report_vacuum(relid, false, 0, 0, 0, InvalidTransactionId,
+        pgstat_report_vacuum(relid, false,
+                             0, 0, 0, 0, 0,InvalidTransactionId,                             PGSTAT_VACUUM_SKIPPED, 0,
                           PGSTAT_VACUUM_LOCK_FAILED);        PopActiveSnapshot();
 
@@ -1504,7 +1505,7 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
pgstat_report_vacuum(RelationGetRelid(onerel),                            onerel->rd_rel->relisshared,
 
-                             0, 0, 0, InvalidTransactionId,
+                             0, 0, 0, 0, 0, InvalidTransactionId,                             PGSTAT_VACUUM_FINISHED,
0,                            PGSTAT_VACUUM_NONTARGET);
 
@@ -1528,7 +1529,7 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
pgstat_report_vacuum(RelationGetRelid(onerel),                            onerel->rd_rel->relisshared,
 
-                             0, 0, 0, InvalidTransactionId,
+                             0, 0, 0, 0, 0, InvalidTransactionId,                             PGSTAT_VACUUM_FINISHED,
0,                            PGSTAT_VACUUM_NONTARGET);
 
@@ -1550,7 +1551,7 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
pgstat_report_vacuum(RelationGetRelid(onerel),                            onerel->rd_rel->relisshared,
 
-                             0, 0, 0, InvalidTransactionId,
+                             0, 0, 0, 0, 0, InvalidTransactionId,                             PGSTAT_VACUUM_FINISHED,
0,                            PGSTAT_VACUUM_NONTARGET);
 
@@ -1570,7 +1571,7 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
pgstat_report_vacuum(RelationGetRelid(onerel),                            onerel->rd_rel->relisshared,
 
-                             0, 0, 0, InvalidTransactionId,
+                             0, 0, 0, 0, 0, InvalidTransactionId,                             PGSTAT_VACUUM_FINISHED,
0,                            PGSTAT_VACUUM_NONTARGET);
 
@@ -1628,7 +1629,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)        /* VACUUM
FULLis now a variant of CLUSTER; see cluster.c */        cluster_rel(relid, InvalidOid, false,
(options& VACOPT_VERBOSE) != 0);
 
-        pgstat_report_vacuum(relid, isshared, 0, 0, 0, InvalidTransactionId,
+        pgstat_report_vacuum(relid, isshared, 0, 0, 0, 0, 0,
+                             InvalidTransactionId,                             PGSTAT_VACUUM_FINISHED, 0,
              PGSTAT_VACUUM_FULL);    }
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c53b4fa..53821f3 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -121,6 +121,7 @@ typedef struct LVRelStats    double        new_rel_tuples; /* new estimated total # of tuples */
double       new_dead_tuples;    /* new estimated total # of dead tuples */    BlockNumber pages_removed;
 
+    BlockNumber pages_not_removed;    double        tuples_deleted;    BlockNumber nonempty_pages; /* actually, last
nonemptypage + 1 */    /* List of TIDs of tuples we intend to delete */
 
@@ -248,6 +249,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    vacrelstats->old_rel_tuples
=onerel->rd_rel->reltuples;    vacrelstats->num_index_scans = 0;    vacrelstats->pages_removed = 0;
 
+    vacrelstats->pages_not_removed = 0;    vacrelstats->lock_waiter_detected = false;    vacrelstats->aggressive =
aggressive;
@@ -290,8 +292,13 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    {
lazy_truncate_heap(onerel,vacrelstats);
 
+        /* just paranoia */
+        if (vacrelstats->rel_pages >= vacrelstats->nonempty_pages)
+            vacrelstats->pages_not_removed +=
+                vacrelstats->rel_pages - vacrelstats->nonempty_pages;
+        /* check if all empty pages are truncated */
-        if (vacrelstats->rel_pages > vacrelstats->nonempty_pages)
+        if (vacrelstats->pages_not_removed > 0)            vacuum_status_details |= PGSTAT_VACUUM_PARTIALLY_TRUNCATED;
  }
 
@@ -363,6 +370,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
onerel->rd_rel->relisshared,                        new_live_tuples,
vacrelstats->new_dead_tuples,
+                         vacrelstats->pages_removed,
+                         vacrelstats->pages_not_removed,                         vacrelstats->num_index_scans,
               OldestXmin,                         PGSTAT_VACUUM_FINISHED, 0,
 
@@ -2283,6 +2292,8 @@ lazy_vacuum_cancel_handler(void)                         stats->shared,
stats->new_rel_tuples,                        stats->new_dead_tuples,
 
+                         stats->pages_removed,
+                         stats->pages_not_removed,                         stats->num_index_scans,
   OldestXmin,                         status, phase, details);
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3e1d051..a4a6169 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1404,8 +1404,10 @@ pgstat_report_autovac(Oid dboid)voidpgstat_report_vacuum(Oid tableoid, bool shared,
      PgStat_Counter livetuples, PgStat_Counter deadtuples,
 
+                     PgStat_Counter pages_removed,
+                     PgStat_Counter pages_not_removed,                     PgStat_Counter num_index_scans,
-                     TransactionId oldestxmin,
+                     TransactionId    oldestxmin,                     PgStat_Counter status, PgStat_Counter
last_phase,                    PgStat_Counter details){
 
@@ -1421,6 +1423,8 @@ pgstat_report_vacuum(Oid tableoid, bool shared,    msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples= livetuples;    msg.m_dead_tuples = deadtuples;
 
+    msg.m_pages_removed = pages_removed;
+    msg.m_pages_not_removed = pages_not_removed;    msg.m_num_index_scans = num_index_scans;    msg.m_oldest_xmin =
oldestxmin;   msg.m_vacuum_status = status;
 
@@ -4594,6 +4598,8 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples= 0;        result->n_dead_tuples = 0;        result->changes_since_analyze = 0;
 
+        result->n_pages_removed = 0;
+        result->n_pages_not_removed = 0;        result->n_index_scans = 0;        result->oldest_xmin =
InvalidTransactionId;       result->blocks_fetched = 0;
 
@@ -6009,6 +6015,8 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)    case PGSTAT_VACUUM_FINISHED:
tabentry->n_live_tuples= msg->m_live_tuples;        tabentry->n_dead_tuples = msg->m_dead_tuples;
 
+        tabentry->n_pages_removed = msg->m_pages_removed;
+        tabentry->n_pages_not_removed = msg->m_pages_not_removed;        tabentry->n_index_scans =
msg->m_num_index_scans;       tabentry->oldest_xmin = msg->m_oldest_xmin;        tabentry->vacuum_failcount = 0;
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0fba265..b32bdf5 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -204,6 +204,36 @@ pg_stat_get_vacuum_necessity(PG_FUNCTION_ARGS)}Datum
+pg_stat_get_last_vacuum_truncated(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int64) (tabentry->n_pages_removed);
+
+    PG_RETURN_INT64(result);
+}
+
+Datum
+pg_stat_get_last_vacuum_untruncated(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int64) (tabentry->n_pages_not_removed);
+
+    PG_RETURN_INT64(result);
+}
+
+Datumpg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS){    Oid            relid = PG_GETARG_OID(0);
@@ -349,7 +379,7 @@ pg_stat_get_last_vacuum_status_detail(PG_FUNCTION_ARGS)            break;        default:
-            result = "unknwon error";
+            result = "unknown status";            break;        }    }
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 48e6942..da2e9b4 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2889,6 +2889,10 @@ DATA(insert OID = 6118 (  pg_stat_get_subscription    PGNSP PGUID 12 1 0 0 0 f f
fDESCR("statistics:information about subscription");DATA(insert OID = 2579 (  pg_stat_get_vacuum_necessity    PGNSP
PGUID12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_vacuum_necessity _null_
_null__null_ ));DESCR("statistics: true if needs vacuum");
 
+DATA(insert OID = 3422 (  pg_stat_get_last_vacuum_untruncated    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_untruncated _null_ _null_ _null_ ));
 
+DESCR("statistics: pages left untruncated in the last vacuum");
+DATA(insert OID = 3423 (  pg_stat_get_last_vacuum_truncated    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 20 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_truncated _null_ _null_ _null_ ));
 
+DESCR("statistics: pages truncated in the last vacuum");DATA(insert OID = 3424 (  pg_stat_get_last_vacuum_oldest_xmin
 PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 28 "26" _null_ _null_ _null_ _null_ _null_
pg_stat_get_last_vacuum_oldest_xmin_null_ _null_ _null_ ));DESCR("statistics: The oldest xmin used in the last
vacuum");DATA(insertOID = 3281 (  pg_stat_get_last_vacuum_index_scans    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23
"26"_null_ _null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_index_scans _null_ _null_ _null_ ));
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e18a630..6079661 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -389,6 +389,8 @@ typedef struct PgStat_MsgVacuum    TimestampTz m_vacuumtime;    PgStat_Counter m_live_tuples;
PgStat_Counterm_dead_tuples;
 
+    PgStat_Counter m_pages_removed;
+    PgStat_Counter m_pages_not_removed;    PgStat_Counter m_num_index_scans;    TransactionId  m_oldest_xmin;
PgStat_Counterm_vacuum_status;
 
@@ -654,6 +656,8 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter n_live_tuples;    PgStat_Counter
n_dead_tuples;   PgStat_Counter changes_since_analyze;
 
+    PgStat_Counter n_pages_removed;
+    PgStat_Counter n_pages_not_removed;    PgStat_Counter n_index_scans;    TransactionId  oldest_xmin;
@@ -1197,6 +1201,8 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type textern void
pgstat_report_autovac(Oiddboid);extern void pgstat_report_vacuum(Oid tableoid, bool shared,
PgStat_Counterlivetuples, PgStat_Counter deadtuples,
 
+                     PgStat_Counter pages_removed,
+                     PgStat_Counter pages_not_removed,                     PgStat_Counter num_index_scans,
       TransactionId oldextxmin,                     PgStat_Counter status, PgStat_Counter last_phase,
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 18a122a..111d44f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1765,6 +1765,8 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_last_analyze_time(c.oid) AS
last_analyze,   pg_stat_get_last_autoanalyze_time(c.oid) AS last_autoanalyze,
pg_stat_get_last_vacuum_index_scans(c.oid)AS last_vacuum_index_scans,
 
+    pg_stat_get_last_vacuum_truncated(c.oid) AS last_vacuum_truncated,
+    pg_stat_get_last_vacuum_untruncated(c.oid) AS last_vacuum_untruncated,
pg_stat_get_last_vacuum_oldest_xmin(c.oid)AS last_vacuum_oldest_xmin,    pg_stat_get_last_vacuum_status(c.oid) AS
last_vacuum_status,   pg_stat_get_last_vacuum_status_detail(c.oid) AS last_vacuum_status_detail,
 
@@ -1918,6 +1920,8 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_analyze,
pg_stat_all_tables.last_autoanalyze,   pg_stat_all_tables.last_vacuum_index_scans,
 
+    pg_stat_all_tables.last_vacuum_truncated,
+    pg_stat_all_tables.last_vacuum_untruncated,    pg_stat_all_tables.last_vacuum_oldest_xmin,
pg_stat_all_tables.last_vacuum_status,   pg_stat_all_tables.last_vacuum_status_detail,
 
@@ -1967,6 +1971,8 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_analyze,
pg_stat_all_tables.last_autoanalyze,   pg_stat_all_tables.last_vacuum_index_scans,
 
+    pg_stat_all_tables.last_vacuum_truncated,
+    pg_stat_all_tables.last_vacuum_untruncated,    pg_stat_all_tables.last_vacuum_oldest_xmin,
pg_stat_all_tables.last_vacuum_status,   pg_stat_all_tables.last_vacuum_status_detail,
 
-- 
2.9.2

From 6b83b307b0198c3902fc1e30944e02739c0a19cd Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 21 Nov 2017 09:57:51 +0900
Subject: [PATCH 3/4] Add vacuum execution status in pg_stat_all_tables

The main objective of this patch is showing how vacuuming is
failing. It is sometimes very hard to diagnose since autovacuum stops
silently in most cases. This patch leaves the reason for vacuum
failure in pg_stat_all_tables and how many times it is continuing to
fail.
---doc/src/sgml/monitoring.sgml         |  20 +++++src/backend/catalog/system_views.sql |   4
+src/backend/commands/vacuum.c       |  40 +++++++++src/backend/commands/vacuumlazy.c    |  93
++++++++++++++++++++-src/backend/postmaster/pgstat.c     |  62 +++++++++++---src/backend/utils/adt/pgstatfuncs.c  | 157
+++++++++++++++++++++++++++++++++++src/include/catalog/pg_proc.h       |   8 ++src/include/commands/vacuum.h        |
1+src/include/pgstat.h                 |  34 +++++++-src/test/regress/expected/rules.out  |  12 +++10 files changed,
416insertions(+), 15 deletions(-)
 

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 98c5f41..a0288cb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2580,6 +2580,21 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<entry>Numberof splitted index scans performed during the last vacuum or autovacuum on this table</entry>    </row>
<row>
+     <entry><structfield>last_vacuum_oldest_xmin</structfield></entry>
+     <entry><type>bigint</type></entry>
+     <entry>Oldest xmin used by the last vacuum on this table</entry>
+    </row>
+    <row>
+     <entry><structfield>last_vacuum_status</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Status of the last vacuum or autovacuum.</entry>
+    </row>
+    <row>
+     <entry><structfield>last_vacuum_status_detail</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Details of the status of the last vacuum or autovacuum.</entry>
+    </row>
+    <row>     <entry><structfield>vacuum_count</structfield></entry>     <entry><type>bigint</type></entry>
<entry>Numberof times this table has been manually vacuumed
 
@@ -2592,6 +2607,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
daemon</entry>   </row>    <row>
 
+     <entry><structfield>vacuum_fail_count</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of continuously failed vacuum and autovacuum trials. Cleared to zero on completetion.</entry>
+    </row>
+    <row>     <entry><structfield>analyze_count</structfield></entry>     <entry><type>bigint</type></entry>
<entry>Numberof times this table has been manually analyzed</entry>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b553bf4..c69fea9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -529,8 +529,12 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_last_analyze_time(C.oid) as last_analyze,
          pg_stat_get_last_autoanalyze_time(C.oid) as last_autoanalyze,
pg_stat_get_last_vacuum_index_scans(C.oid)AS last_vacuum_index_scans,
 
+            pg_stat_get_last_vacuum_oldest_xmin(C.oid) AS last_vacuum_oldest_xmin,
+            pg_stat_get_last_vacuum_status(C.oid) AS last_vacuum_status,
+            pg_stat_get_last_vacuum_status_detail(C.oid) AS last_vacuum_status_detail,
pg_stat_get_vacuum_count(C.oid)AS vacuum_count,            pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
 
+            pg_stat_get_vacuum_fail_count(C.oid) AS vacuum_fail_count,            pg_stat_get_analyze_count(C.oid) AS
analyze_count,           pg_stat_get_autoanalyze_count(C.oid) AS autoanalyze_count    FROM pg_class C LEFT JOIN
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f51dcdb..cf0bca7 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -35,6 +35,7 @@#include "catalog/pg_inherits_fn.h"#include "catalog/pg_namespace.h"#include "commands/cluster.h"
+#include "commands/progress.h"#include "commands/vacuum.h"#include "miscadmin.h"#include "nodes/makefuncs.h"
@@ -367,6 +368,9 @@ vacuum(int options, List *relations, VacuumParams *params,    }    PG_CATCH();    {
+        /* report the final status of this vacuum */
+        lazy_vacuum_cancel_handler();
+        in_vacuum = false;        VacuumCostActive = false;        PG_RE_THROW();
@@ -1463,6 +1467,9 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if (!onerel)
{
+        pgstat_report_vacuum(relid, false, 0, 0, 0, InvalidTransactionId,
+                             PGSTAT_VACUUM_SKIPPED, 0,
+                             PGSTAT_VACUUM_LOCK_FAILED);        PopActiveSnapshot();
CommitTransactionCommand();       return false;
 
@@ -1494,6 +1501,13 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
(errmsg("skipping\"%s\" --- only table or database owner can vacuum it",
RelationGetRelationName(onerel))));       relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, InvalidTransactionId,
+                             PGSTAT_VACUUM_FINISHED, 0,
+                             PGSTAT_VACUUM_NONTARGET);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1511,6 +1525,13 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)
(errmsg("skipping\"%s\" --- cannot vacuum non-tables or special system tables",
RelationGetRelationName(onerel))));       relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, InvalidTransactionId,
+                             PGSTAT_VACUUM_FINISHED, 0,
+                             PGSTAT_VACUUM_NONTARGET);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1526,6 +1547,13 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if
(RELATION_IS_OTHER_TEMP(onerel))   {        relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, InvalidTransactionId,
+                             PGSTAT_VACUUM_FINISHED, 0,
+                             PGSTAT_VACUUM_NONTARGET);
+        PopActiveSnapshot();        CommitTransactionCommand();        return false;
@@ -1539,6 +1567,13 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)    if
(onerel->rd_rel->relkind== RELKIND_PARTITIONED_TABLE)    {        relation_close(onerel, lmode);
 
+
+        pgstat_report_vacuum(RelationGetRelid(onerel),
+                             onerel->rd_rel->relisshared,
+                             0, 0, 0, InvalidTransactionId,
+                             PGSTAT_VACUUM_FINISHED, 0,
+                             PGSTAT_VACUUM_NONTARGET);
+        PopActiveSnapshot();        CommitTransactionCommand();        /* It's OK to proceed with ANALYZE on this
table*/
 
@@ -1584,6 +1619,8 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)     */    if
(options& VACOPT_FULL)    {
 
+        bool isshared = onerel->rd_rel->relisshared;
+        /* close relation before vacuuming, but hold lock until commit */        relation_close(onerel, NoLock);
onerel = NULL;
 
@@ -1591,6 +1628,9 @@ vacuum_rel(Oid relid, RangeVar *relation, int options, VacuumParams *params)        /* VACUUM
FULLis now a variant of CLUSTER; see cluster.c */        cluster_rel(relid, InvalidOid, false,
(options& VACOPT_VERBOSE) != 0);
 
+        pgstat_report_vacuum(relid, isshared, 0, 0, 0, InvalidTransactionId,
+                             PGSTAT_VACUUM_FINISHED, 0,
+                             PGSTAT_VACUUM_FULL);    }    else        lazy_vacuum_rel(onerel, options, params,
vac_strategy);
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 4274043..c53b4fa 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -105,6 +105,8 @@typedef struct LVRelStats{
+    Oid            reloid;            /* oid of the target relation */
+    bool        shared;            /* is shared relation? */    /* hasindex = true means two-pass strategy; false
meansone-pass */    bool        hasindex;    /* Overall statistics about rel */
 
@@ -129,6 +131,7 @@ typedef struct LVRelStats    int            num_index_scans;    TransactionId latestRemovedXid;
bool       lock_waiter_detected;
 
+    bool        aggressive;} LVRelStats;
@@ -138,6 +141,7 @@ static int    elevel = -1;static TransactionId OldestXmin;static TransactionId FreezeLimit;static
MultiXactIdMultiXactCutoff;
 
+static LVRelStats *current_lvstats;static BufferAccessStrategy vac_strategy;
@@ -201,6 +205,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    double
new_live_tuples;   TransactionId new_frozen_xid;    MultiXactId new_min_multi;
 
+    int            vacuum_status_details = 0;    Assert(params != NULL);
@@ -216,6 +221,7 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    else        elevel =
DEBUG2;
+    current_lvstats = NULL;    pgstat_progress_start_command(PROGRESS_COMMAND_VACUUM,
RelationGetRelid(onerel));
@@ -236,11 +242,20 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    vacrelstats = (LVRelStats
*)palloc0(sizeof(LVRelStats));
 
+    vacrelstats->reloid = RelationGetRelid(onerel);
+    vacrelstats->shared = onerel->rd_rel->relisshared;    vacrelstats->old_rel_pages = onerel->rd_rel->relpages;
vacrelstats->old_rel_tuples= onerel->rd_rel->reltuples;    vacrelstats->num_index_scans = 0;
vacrelstats->pages_removed= 0;    vacrelstats->lock_waiter_detected = false;
 
+    vacrelstats->aggressive = aggressive;
+
+    /*
+     * Register current vacrelstats so that final status can be reported on
+     * interrupts
+     */
+    current_lvstats = vacrelstats;    /* Open all indexes of the relation */    vac_open_indexes(onerel,
RowExclusiveLock,&nindexes, &Irel);
 
@@ -272,8 +287,14 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,     * Optionally truncate the
relation.    */    if (should_attempt_truncation(vacrelstats))
 
+    {        lazy_truncate_heap(onerel, vacrelstats);
+        /* check if all empty pages are truncated */
+        if (vacrelstats->rel_pages > vacrelstats->nonempty_pages)
+            vacuum_status_details |= PGSTAT_VACUUM_PARTIALLY_TRUNCATED;
+    }
+    /* Report that we are now doing final cleanup */    pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
                    PROGRESS_VACUUM_PHASE_FINAL_CLEANUP);
 
@@ -331,11 +352,22 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,    if (new_live_tuples < 0)
    new_live_tuples = 0;    /* just in case */
 
-    pgstat_report_vacuum(RelationGetRelid(onerel),
+    /* vacuum successfully finished. nothing to do on exit */
+    current_lvstats = NULL;
+
+    if (aggressive)
+        vacuum_status_details |= PGSTAT_VACUUM_AGGRESSIVE;
+
+    vacuum_status_details |= PGSTAT_VACUUM_COMPLETE;
+    pgstat_report_vacuum(vacrelstats->reloid,                         onerel->rd_rel->relisshared,
   new_live_tuples,                         vacrelstats->new_dead_tuples,
 
-                         vacrelstats->num_index_scans);
+                         vacrelstats->num_index_scans,
+                         OldestXmin,
+                         PGSTAT_VACUUM_FINISHED, 0,
+                         vacuum_status_details);
+    pgstat_progress_end_command();    /* and log the action if appropriate */
@@ -2198,3 +2230,60 @@ heap_page_is_all_visible(Relation rel, Buffer buf,    return all_visible;}
+
+/*
+ * lazy_vacuum_cancel_handler - report interrupted vacuum status
+ */
+void
+lazy_vacuum_cancel_handler(void)
+{
+    LVRelStats *stats = current_lvstats;
+    LocalPgBackendStatus *local_beentry;
+    PgBackendStatus *beentry;
+    int                phase;
+    int                status;
+    int                details = 0;
+
+    current_lvstats = NULL;
+
+    /* we have nothing to report */
+    if (!stats)
+        return;
+
+    /* get vacuum progress stored in backend status */
+    local_beentry = pgstat_fetch_stat_local_beentry(MyBackendId);
+    if (!local_beentry)
+        return;
+
+    beentry = &local_beentry->backendStatus;
+
+    Assert (beentry && beentry->st_progress_command == PROGRESS_COMMAND_VACUUM);
+
+    phase = beentry->st_progress_param[PROGRESS_VACUUM_PHASE];
+
+    /* we can reach here both on interrupt and error */
+    if (geterrcode() == ERRCODE_QUERY_CANCELED)
+    {
+        status = PGSTAT_VACUUM_CANCELED;
+        if (stats->aggressive)
+            details |= PGSTAT_VACUUM_AGGRESSIVE;
+    }
+    else
+    {
+        /* special case: details stores an sql error code */
+        status = PGSTAT_VACUUM_ERROR;
+        details = geterrcode();
+    }
+
+    /*
+     * vacuum has been canceled, report stats numbers without normalization
+     * here. (But currently they are not used.)
+     */
+    pgstat_report_vacuum(stats->reloid,
+                         stats->shared,
+                         stats->new_rel_tuples,
+                         stats->new_dead_tuples,
+                         stats->num_index_scans,
+                         OldestXmin,
+                         status, phase, details);
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5f3fdf6..3e1d051 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1404,7 +1404,10 @@ pgstat_report_autovac(Oid dboid)voidpgstat_report_vacuum(Oid tableoid, bool shared,
      PgStat_Counter livetuples, PgStat_Counter deadtuples,
 
-                     PgStat_Counter num_index_scans)
+                     PgStat_Counter num_index_scans,
+                     TransactionId oldestxmin,
+                     PgStat_Counter status, PgStat_Counter last_phase,
+                     PgStat_Counter details){    PgStat_MsgVacuum msg;
@@ -1419,6 +1422,10 @@ pgstat_report_vacuum(Oid tableoid, bool shared,    msg.m_live_tuples = livetuples;
msg.m_dead_tuples= deadtuples;    msg.m_num_index_scans = num_index_scans;
 
+    msg.m_oldest_xmin = oldestxmin;
+    msg.m_vacuum_status = status;
+    msg.m_vacuum_last_phase = last_phase;
+    msg.m_vacuum_details = details;    pgstat_send(&msg, sizeof(msg));}
@@ -4588,6 +4595,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_dead_tuples= 0;        result->changes_since_analyze = 0;        result->n_index_scans = 0;
 
+        result->oldest_xmin = InvalidTransactionId;        result->blocks_fetched = 0;        result->blocks_hit = 0;
     result->vacuum_timestamp = 0;
 
@@ -4598,6 +4606,11 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->analyze_count= 0;        result->autovac_analyze_timestamp = 0;        result->autovac_analyze_count = 0;
 
+
+        result->vacuum_status = 0;
+        result->vacuum_last_phase = 0;
+        result->vacuum_details = 0;
+        result->vacuum_failcount = 0;    }    return result;
@@ -5982,19 +5995,44 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)    tabentry = pgstat_get_tab_entry(dbentry,
msg->m_tableoid,true);
 
-    tabentry->n_live_tuples = msg->m_live_tuples;
-    tabentry->n_dead_tuples = msg->m_dead_tuples;
-    tabentry->n_index_scans = msg->m_num_index_scans;
+    tabentry->vacuum_status = msg->m_vacuum_status;
+    tabentry->vacuum_last_phase = msg->m_vacuum_last_phase;
+    tabentry->vacuum_details = msg->m_vacuum_details;
-    if (msg->m_autovacuum)
-    {
-        tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->autovac_vacuum_count++;
-    }
-    else
+    /*
+     * We store the numbers only when the vacuum has been completed. They
+     * might be usable to find how much the stopped vacuum processed but we
+     * choose not to show them rather than show bogus numbers.
+     */
+    switch ((StatVacuumStatus)msg->m_vacuum_status)    {
-        tabentry->vacuum_timestamp = msg->m_vacuumtime;
-        tabentry->vacuum_count++;
+    case PGSTAT_VACUUM_FINISHED:
+        tabentry->n_live_tuples = msg->m_live_tuples;
+        tabentry->n_dead_tuples = msg->m_dead_tuples;
+        tabentry->n_index_scans = msg->m_num_index_scans;
+        tabentry->oldest_xmin = msg->m_oldest_xmin;
+        tabentry->vacuum_failcount = 0;
+
+        if (msg->m_autovacuum)
+        {
+            tabentry->autovac_vacuum_timestamp = msg->m_vacuumtime;
+            tabentry->autovac_vacuum_count++;
+        }
+        else
+        {
+            tabentry->vacuum_timestamp = msg->m_vacuumtime;
+            tabentry->vacuum_count++;
+        }
+        break;
+
+    case PGSTAT_VACUUM_ERROR:
+    case PGSTAT_VACUUM_CANCELED:
+    case PGSTAT_VACUUM_SKIPPED:
+        tabentry->vacuum_failcount++;
+        break;
+
+    default:
+        break;    }}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index ab80794..0fba265 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -219,6 +219,163 @@ pg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS)}Datum
+pg_stat_get_last_vacuum_oldest_xmin(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    TransactionId    result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = InvalidTransactionId;
+    else
+        result = (int32) (tabentry->oldest_xmin);
+
+    return TransactionIdGetDatum(result);
+}
+
+Datum
+pg_stat_get_last_vacuum_status(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    char        *result = "unknown";
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) != NULL)
+    {
+        StatVacuumStatus    status;
+
+        status = tabentry->vacuum_status;
+        switch (status)
+        {
+        case PGSTAT_VACUUM_FINISHED:
+            result = "completed";
+            break;
+        case PGSTAT_VACUUM_ERROR:
+            result = "error";
+            break;
+        case PGSTAT_VACUUM_CANCELED:
+            result = "canceled";
+            break;
+        case PGSTAT_VACUUM_SKIPPED:
+            result = "skipped";
+            break;
+        default:
+            result = psprintf("unknown status: %d", status);
+            break;
+        }
+    }
+
+    PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+Datum
+pg_stat_get_last_vacuum_status_detail(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    char        *result = "unknown";
+    PgStat_StatTabEntry *tabentry;
+    StringInfoData    str;
+
+    /*
+     * status string. this must be synced with the strings shown by the
+     * statistics view "pg_stat_progress_vacuum"
+     */
+    static char *phasestr[] =
+        {"initialization",
+         "scanning heap",
+         "vacuuming indexes",
+         "vacuuming heap",
+         "cleaning up indexes",
+         "trucating heap",
+         "performing final cleanup"};
+    static char *detailstr[] =
+        {NULL,                        /* PGSTAT_VACUUM_COMPLETE */
+         "aggressive",                /* PGSTAT_VACUUM_AGGRESSIVE */
+         "full",                    /* PGSTAT_VACUUM_FULL */
+         "lock failure",            /* PGSTAT_VACUUM_LOCK_FAILED */
+         "not a target",            /* PGSTAT_VACUUM_NONTARGET */
+         "partially truncated"        /* PGSTAT_VACUUM_PARTIALLY_TRUNCATED */
+        };
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) != NULL)
+    {
+        int                    phase;
+        StatVacuumStatus    status;
+        bool                first = true;
+        int                    i;
+
+        initStringInfo(&str);
+
+        status = tabentry->vacuum_status;
+        switch (status)
+        {
+        case PGSTAT_VACUUM_ERROR:
+            /*  details is storing an sql error code */
+            appendStringInfoString(
+                &str,
+                format_elog_string(
+                    "sqlcode: %s, ",
+                    unpack_sql_state((int)tabentry->vacuum_details)));
+
+            /* FALL THROUGH */
+
+        case PGSTAT_VACUUM_CANCELED:
+            phase = tabentry->vacuum_last_phase;
+            /* number of elements of phasestr above */
+            if (phase >= 0 && phase <= 7)
+                appendStringInfoString(&str, phasestr[phase]);
+
+            result = str.data;
+            break;
+
+        case PGSTAT_VACUUM_FINISHED:
+        case PGSTAT_VACUUM_SKIPPED:
+            for (i = 0 ; i < PGSTAT_VACUUM_NDETAILS ; i++)
+            {
+                if ((tabentry->vacuum_details & (1 << i)) == 0)
+                    continue;
+
+                if (detailstr[i] == NULL)
+                    continue;
+
+                if (first)
+                    first = false;
+                else
+                    appendStringInfoString(&str, ", ");
+
+                appendStringInfoString(&str, detailstr[i]);
+            }
+            result = str.data;
+            break;
+
+        default:
+            result = "unknwon error";
+            break;
+        }
+    }
+
+    if (result == NULL)
+        PG_RETURN_NULL();
+
+    PG_RETURN_TEXT_P(cstring_to_text(result));
+}
+
+Datum
+pg_stat_get_vacuum_fail_count(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int64        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int32) (tabentry->vacuum_failcount);
+
+    PG_RETURN_INT32(result);
+}
+
+Datumpg_stat_get_blocks_fetched(PG_FUNCTION_ARGS){    Oid            relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6b84c9a..48e6942 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2889,8 +2889,16 @@ DATA(insert OID = 6118 (  pg_stat_get_subscription    PGNSP PGUID 12 1 0 0 0 f f
fDESCR("statistics:information about subscription");DATA(insert OID = 2579 (  pg_stat_get_vacuum_necessity    PGNSP
PGUID12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_vacuum_necessity _null_
_null__null_ ));DESCR("statistics: true if needs vacuum");
 
+DATA(insert OID = 3424 (  pg_stat_get_last_vacuum_oldest_xmin    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 28 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_oldest_xmin _null_ _null_ _null_ ));
 
+DESCR("statistics: The oldest xmin used in the last vacuum");DATA(insert OID = 3281 (
pg_stat_get_last_vacuum_index_scans   PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_
_null_pg_stat_get_last_vacuum_index_scans _null_ _null_ _null_ ));DESCR("statistics: number of index scans in the last
vacuum");
+DATA(insert OID = 3419 (  pg_stat_get_last_vacuum_status    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_
_null__null_ _null_ _null_ pg_stat_get_last_vacuum_status _null_ _null_ _null_ ));
 
+DESCR("statistics: ending status of the last vacuum");
+DATA(insert OID = 3420 (  pg_stat_get_last_vacuum_status_detail    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_status_detail _null_ _null_ _null_ ));
 
+DESCR("statistics: ending status of the last vacuum");
+DATA(insert OID = 3421 (  pg_stat_get_vacuum_fail_count    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26" _null_
_null__null_ _null_ _null_ pg_stat_get_vacuum_fail_count _null_ _null_ _null_ ));
 
+DESCR("statistics: number of successively failed vacuum trials");DATA(insert OID = 2026 (  pg_backend_pid
 PGNSP PGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_
_null_));DESCR("statistics: current backend PID");DATA(insert OID = 1937 (  pg_stat_get_backend_pid        PGNSP PGUID
121 0 0 0 f f f f t f s r 1 0 23 "23" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_pid _null_ _null_ _null_
));
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 84bec74..da3107a 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -190,6 +190,7 @@ extern void vacuum_delay_point(void);/* in commands/vacuumlazy.c */extern void
lazy_vacuum_rel(Relationonerel, int options,                VacuumParams *params, BufferAccessStrategy bstrategy);
 
+extern void lazy_vacuum_cancel_handler(void);/* in commands/analyze.c */extern void analyze_rel(Oid relid, RangeVar
*relation,int options,
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3ab5f4a..e18a630 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -67,6 +67,26 @@ typedef enum StatMsgType    PGSTAT_MTYPE_DEADLOCK} StatMsgType;
+/*
+ * The exit status stored in vacuum report.
+ */
+typedef enum StatVacuumStatus
+{
+    PGSTAT_VACUUM_FINISHED,
+    PGSTAT_VACUUM_CANCELED,
+    PGSTAT_VACUUM_ERROR,
+    PGSTAT_VACUUM_SKIPPED
+} StatVacuumStatus;
+
+/* bitmap for vacuum status details, except for PGSTAT_VACUUM_ERROR_WITH_CODE */
+#define PGSTAT_VACUUM_COMPLETE                (1 << 0)
+#define PGSTAT_VACUUM_AGGRESSIVE            (1 << 1)
+#define PGSTAT_VACUUM_FULL                    (1 << 2)
+#define PGSTAT_VACUUM_LOCK_FAILED            (1 << 3)
+#define PGSTAT_VACUUM_NONTARGET                (1 << 4)
+#define PGSTAT_VACUUM_PARTIALLY_TRUNCATED    (1 << 5)
+#define PGSTAT_VACUUM_NDETAILS                6
+/* ---------- * The data type used for counters. * ----------
@@ -370,6 +390,10 @@ typedef struct PgStat_MsgVacuum    PgStat_Counter m_live_tuples;    PgStat_Counter m_dead_tuples;
 PgStat_Counter m_num_index_scans;
 
+    TransactionId  m_oldest_xmin;
+    PgStat_Counter m_vacuum_status;
+    PgStat_Counter m_vacuum_last_phase;
+    PgStat_Counter m_vacuum_details;} PgStat_MsgVacuum;
@@ -631,6 +655,7 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter n_dead_tuples;    PgStat_Counter
changes_since_analyze;   PgStat_Counter n_index_scans;
 
+    TransactionId  oldest_xmin;    PgStat_Counter blocks_fetched;    PgStat_Counter blocks_hit;
@@ -643,6 +668,10 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter analyze_count;    TimestampTz
autovac_analyze_timestamp;   /* autovacuum initiated */    PgStat_Counter autovac_analyze_count;
 
+    PgStat_Counter    vacuum_status;
+    PgStat_Counter    vacuum_last_phase;
+    PgStat_Counter    vacuum_details;
+    PgStat_Counter    vacuum_failcount;} PgStat_StatTabEntry;
@@ -1168,7 +1197,10 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type textern void
pgstat_report_autovac(Oiddboid);extern void pgstat_report_vacuum(Oid tableoid, bool shared,
PgStat_Counterlivetuples, PgStat_Counter deadtuples,
 
-                     PgStat_Counter num_index_scans);
+                     PgStat_Counter num_index_scans,
+                     TransactionId oldextxmin,
+                     PgStat_Counter status, PgStat_Counter last_phase,
+                     PgStat_Counter detail);extern void pgstat_report_analyze(Relation rel,
PgStat_Counterlivetuples, PgStat_Counter deadtuples,                      bool resetcounter);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e827842..18a122a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1765,8 +1765,12 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_last_analyze_time(c.oid) AS
last_analyze,   pg_stat_get_last_autoanalyze_time(c.oid) AS last_autoanalyze,
pg_stat_get_last_vacuum_index_scans(c.oid)AS last_vacuum_index_scans,
 
+    pg_stat_get_last_vacuum_oldest_xmin(c.oid) AS last_vacuum_oldest_xmin,
+    pg_stat_get_last_vacuum_status(c.oid) AS last_vacuum_status,
+    pg_stat_get_last_vacuum_status_detail(c.oid) AS last_vacuum_status_detail,    pg_stat_get_vacuum_count(c.oid) AS
vacuum_count,   pg_stat_get_autovacuum_count(c.oid) AS autovacuum_count,
 
+    pg_stat_get_vacuum_fail_count(c.oid) AS vacuum_fail_count,    pg_stat_get_analyze_count(c.oid) AS analyze_count,
pg_stat_get_autoanalyze_count(c.oid) AS autoanalyze_count   FROM ((pg_class c
 
@@ -1914,8 +1918,12 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_analyze,
pg_stat_all_tables.last_autoanalyze,   pg_stat_all_tables.last_vacuum_index_scans,
 
+    pg_stat_all_tables.last_vacuum_oldest_xmin,
+    pg_stat_all_tables.last_vacuum_status,
+    pg_stat_all_tables.last_vacuum_status_detail,    pg_stat_all_tables.vacuum_count,
pg_stat_all_tables.autovacuum_count,
+    pg_stat_all_tables.vacuum_fail_count,    pg_stat_all_tables.analyze_count,    pg_stat_all_tables.autoanalyze_count
 FROM pg_stat_all_tables
 
@@ -1959,8 +1967,12 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_analyze,
pg_stat_all_tables.last_autoanalyze,   pg_stat_all_tables.last_vacuum_index_scans,
 
+    pg_stat_all_tables.last_vacuum_oldest_xmin,
+    pg_stat_all_tables.last_vacuum_status,
+    pg_stat_all_tables.last_vacuum_status_detail,    pg_stat_all_tables.vacuum_count,
pg_stat_all_tables.autovacuum_count,
+    pg_stat_all_tables.vacuum_fail_count,    pg_stat_all_tables.analyze_count,    pg_stat_all_tables.autoanalyze_count
 FROM pg_stat_all_tables
 
-- 
2.9.2

From 017486cfe6231ed43d8ebb9d397f2699840d27c5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 16 Nov 2017 16:18:54 +0900
Subject: [PATCH 2/4] Add vacuum_required to pg_stat_all_tables

If vacuum of a table has been failed for a long time for some reasons,
it is hard for uses to distinguish between that the server judged
vacuuming of the table is not required and that rquired but failed.
This offers convenient way to check that as the first step of trouble
shooting.
---doc/src/sgml/config.sgml             |   5 +-doc/src/sgml/maintenance.sgml        |   4
+-doc/src/sgml/monitoring.sgml        |   5 ++src/backend/catalog/system_views.sql |   1
+src/backend/commands/cluster.c      |   2 +-src/backend/commands/vacuum.c        |  69
++++++++++++++++++---src/backend/commands/vacuumlazy.c   |  14 +----src/backend/postmaster/autovacuum.c  | 115
+++++++++++++++++++++++++++++++++++src/backend/utils/adt/pgstatfuncs.c |   9 +++src/include/catalog/pg_proc.h        |
2 +src/include/commands/vacuum.h        |   3 +-src/include/postmaster/autovacuum.h  |   1
+src/test/regress/expected/rules.out |   3 +13 files changed, 210 insertions(+), 23 deletions(-)
 

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b51d219..5bf0b33 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6579,7 +6579,10 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;       <para>
<command>VACUUM</command>performs an aggressive scan if the table's
<structname>pg_class</structname>.<structfield>relfrozenxid</structfield>field has reached
 
-        the age specified by this setting.  An aggressive scan differs from
+        the age specified by this setting. It is indicated
+        as <quote>aggressive</quote> in vacuum_required
+        of <xref linkend="pg-stat-all-tables-view">. An aggressive scan
+        differs from        a regular <command>VACUUM</command> in that it visits every page that might        contain
unfrozenXIDs or MXIDs, not just those that might contain dead        tuples.  The default is 150 million transactions.
Althoughusers can
 
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 1a37905..d045b09 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -514,7 +514,9 @@    <varname>autovacuum_freeze_max_age</varname> wouldn't make sense because an    anti-wraparound
autovacuumwould be triggered at that point anyway, and    the 0.95 multiplier leaves some breathing room to run a
manual
-    <command>VACUUM</command> before that happens.  As a rule of thumb,
+    <command>VACUUM</command> before that happens. It is indicated
+    as <quote>close to freeze-limit xid</quote> in vacuum_required
+    of <xref linkend="pg-stat-all-tables-view">. As a rule of thumb,    <command>vacuum_freeze_table_age</command>
shouldbe set to a value somewhat    below <varname>autovacuum_freeze_max_age</varname>, leaving enough gap so that    a
regularlyscheduled <command>VACUUM</command> or an autovacuum triggered by
 
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6a57688..98c5f41 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2547,6 +2547,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
<entry>Estimatednumber of rows modified since this table was last analyzed</entry>    </row>    <row>
 
+     <entry><structfield>vacuum_required</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Vacuum requirement status. "partial", "aggressive", "required", "not requried" or "close to freeze-limit
xid".</entry>
+    </row>
+    <row>     <entry><structfield>last_vacuum</structfield></entry>     <entry><type>timestamp with time
zone</type></entry>    <entry>Last time at which this table was manually vacuumed
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index aeba9d5..b553bf4 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -523,6 +523,7 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_live_tuples(C.oid) AS n_live_tup,
 pg_stat_get_dead_tuples(C.oid) AS n_dead_tup,            pg_stat_get_mod_since_analyze(C.oid) AS n_mod_since_analyze,
 
+            pg_stat_get_vacuum_necessity(C.oid) AS vacuum_required,            pg_stat_get_last_vacuum_time(C.oid) as
last_vacuum,           pg_stat_get_last_autovacuum_time(C.oid) as last_autovacuum,
pg_stat_get_last_analyze_time(C.oid)as last_analyze,
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 48f1e6e..403b76d 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -850,7 +850,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,     */
vacuum_set_xid_limits(OldHeap,0, 0, 0, 0,                          &OldestXmin, &FreezeXid, NULL, &MultiXactCutoff,
 
-                          NULL);
+                          NULL, NULL, NULL);    /*     * FreezeXid will become the table's new relfrozenxid, and that
mustn'tgo
 
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index cbd6e9b..f51dcdb 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -585,6 +585,10 @@ get_all_vacuum_rels(void) *     Xmax. * - mxactFullScanLimit is a value against which a table's
relminmxidvalue is *     compared to produce a full-table vacuum, as with xidFullScanLimit.
 
+ * - aggressive is set if it is not NULL and set true if the table needs
+ *   aggressive scan.
+ * - close_to_wrap_around_limit is set if it is not NULL and set true if it is
+ *   in anti-anti-wraparound window. * * xidFullScanLimit and mxactFullScanLimit can be passed as NULL if caller is *
notinterested.
 
@@ -599,9 +603,11 @@ vacuum_set_xid_limits(Relation rel,                      TransactionId *freezeLimit,
      TransactionId *xidFullScanLimit,                      MultiXactId *multiXactCutoff,
 
-                      MultiXactId *mxactFullScanLimit)
+                      MultiXactId *mxactFullScanLimit,
+                      bool *aggressive, bool *close_to_wrap_around_limit){    int            freezemin;
+    int            freezemax;    int            mxid_freezemin;    int            effective_multixact_freeze_max_age;
 TransactionId limit;
 
@@ -701,11 +707,13 @@ vacuum_set_xid_limits(Relation rel,    *multiXactCutoff = mxactLimit;
-    if (xidFullScanLimit != NULL)
+    if (xidFullScanLimit != NULL || aggressive != NULL)    {        int            freezetable;
+        bool        maybe_anti_wrapround = false;
-        Assert(mxactFullScanLimit != NULL);
+        /* these two output should be requested together  */
+        Assert(xidFullScanLimit == NULL || mxactFullScanLimit != NULL);        /*         * Determine the table freeze
ageto use: as specified by the caller,
 
@@ -717,7 +725,14 @@ vacuum_set_xid_limits(Relation rel,        freezetable = freeze_table_age;        if (freezetable
<0)            freezetable = vacuum_freeze_table_age;
 
-        freezetable = Min(freezetable, autovacuum_freeze_max_age * 0.95);
+
+        freezemax = autovacuum_freeze_max_age * 0.95;
+        if (freezemax < freezetable)
+        {
+            /* We may be in anti-anti-warparound window */
+            freezetable = freezemax;
+            maybe_anti_wrapround = true;
+        }        Assert(freezetable >= 0);        /*
@@ -728,7 +743,8 @@ vacuum_set_xid_limits(Relation rel,        if (!TransactionIdIsNormal(limit))            limit =
FirstNormalTransactionId;
-        *xidFullScanLimit = limit;
+        if (xidFullScanLimit)
+            *xidFullScanLimit = limit;        /*         * Similar to the above, determine the table freeze age to use
for
@@ -741,10 +757,20 @@ vacuum_set_xid_limits(Relation rel,        freezetable = multixact_freeze_table_age;        if
(freezetable< 0)            freezetable = vacuum_multixact_freeze_table_age;
 
-        freezetable = Min(freezetable,
-                          effective_multixact_freeze_max_age * 0.95);
+
+        freezemax = effective_multixact_freeze_max_age * 0.95;
+        if (freezemax < freezetable)
+        {
+            /* We may be in anti-anti-warparound window */
+            freezetable = freezemax;
+            maybe_anti_wrapround = true;
+        }        Assert(freezetable >= 0);
+        /* We may be in anti-anti-warparound window */
+        if (effective_multixact_freeze_max_age * 0.95 < freezetable)
+            maybe_anti_wrapround = true;
+        /*         * Compute MultiXact limit causing a full-table vacuum, being careful         * to generate a valid
MultiXactvalue.
 
@@ -753,11 +779,38 @@ vacuum_set_xid_limits(Relation rel,        if (mxactLimit < FirstMultiXactId)
mxactLimit= FirstMultiXactId;
 
-        *mxactFullScanLimit = mxactLimit;
+        if (mxactFullScanLimit)
+            *mxactFullScanLimit = mxactLimit;
+
+        /*
+         * We request an aggressive scan if the table's frozen Xid is now
+         * older than or equal to the requested Xid full-table scan limit; or
+         * if the table's minimum MultiXactId is older than or equal to the
+         * requested mxid full-table scan limit.
+         */
+        if (aggressive)
+        {
+            *aggressive =
+                TransactionIdPrecedesOrEquals(rel->rd_rel->relfrozenxid,
+                                              limit);
+            *aggressive |=
+                MultiXactIdPrecedesOrEquals(rel->rd_rel->relminmxid,
+                                            mxactLimit);
+
+            /* set close_to_wrap_around_limit if requested */
+            if (close_to_wrap_around_limit)
+                *close_to_wrap_around_limit =
+                    (*aggressive && maybe_anti_wrapround);
+        }
+        else
+        {
+            Assert (!close_to_wrap_around_limit);
+        }    }    else    {        Assert(mxactFullScanLimit == NULL);
+        Assert(aggressive == NULL);    }}
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c482c8e..4274043 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -227,18 +227,10 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
params->multixact_freeze_min_age,                         params->multixact_freeze_table_age,
&OldestXmin,&FreezeLimit, &xidFullScanLimit,
 
-                          &MultiXactCutoff, &mxactFullScanLimit);
+                          &MultiXactCutoff, &mxactFullScanLimit,
+                          &aggressive, NULL);
-    /*
-     * We request an aggressive scan if the table's frozen Xid is now older
-     * than or equal to the requested Xid full-table scan limit; or if the
-     * table's minimum MultiXactId is older than or equal to the requested
-     * mxid full-table scan limit; or if DISABLE_PAGE_SKIPPING was specified.
-     */
-    aggressive = TransactionIdPrecedesOrEquals(onerel->rd_rel->relfrozenxid,
-                                               xidFullScanLimit);
-    aggressive |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid,
-                                              mxactFullScanLimit);
+    /* force aggressive scan if DISABLE_PAGE_SKIPPING was specified */    if (options & VACOPT_DISABLE_PAGE_SKIPPING)
     aggressive = true;
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 48765bb..abbf660 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -831,6 +831,121 @@ shutdown:}/*
+ * Returns status string of auto vacuum on the relation
+ */
+char *
+AutoVacuumRequirement(Oid reloid)
+{
+    Relation classRel;
+    Relation rel;
+    TupleDesc    pg_class_desc;
+    HeapTuple tuple;
+    Form_pg_class classForm;
+    AutoVacOpts *relopts;
+    PgStat_StatTabEntry *tabentry;
+    PgStat_StatDBEntry *shared;
+    PgStat_StatDBEntry *dbentry;
+    int            effective_multixact_freeze_max_age;
+    bool        dovacuum;
+    bool        doanalyze;
+    bool        wraparound;
+    bool        aggressive;
+    bool        xid_calculated = false;
+    bool        in_anti_wa_window = false;
+    char       *ret = "not requried";
+
+    /* Compute the multixact age for which freezing is urgent. */
+    effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+
+    /* Fetch the pgclass entry for this relation */
+    tuple = SearchSysCache1(RELOID, ObjectIdGetDatum(reloid));
+    if (!HeapTupleIsValid(tuple))
+        elog(ERROR, "cache lookup failed for relation %u", reloid);
+    classForm = (Form_pg_class) GETSTRUCT(tuple);
+
+    /* extract relopts for autovacuum */
+    classRel = heap_open(RelationRelationId, AccessShareLock);
+    pg_class_desc = RelationGetDescr(classRel);
+    relopts = extract_autovac_opts(tuple, pg_class_desc);
+    heap_close(classRel, AccessShareLock);
+
+    /* Fetch the pgstat shared entry and entry for this database */
+    shared = pgstat_fetch_stat_dbentry(InvalidOid);
+    dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
+
+    /* Fetch the pgstat entry for this table */
+    tabentry = get_pgstat_tabentry_relid(reloid, classForm->relisshared,
+                                         shared, dbentry);
+
+    /*
+     * Check if the relation needs vacuum. This function is intended to
+     * suggest aggresive vacuum for the last 5% window in
+     * autovacuum_freeze_max_age so the variable wraparound is ignored
+     * here. See vacuum_set_xid_limits for details.
+     */
+    relation_needs_vacanalyze(reloid, relopts, classForm, tabentry,
+                              effective_multixact_freeze_max_age,
+                              &dovacuum, &doanalyze, &wraparound);
+    ReleaseSysCache(tuple);
+
+    /* get further information if needed */
+    rel = NULL;
+
+    /* don't get stuck with lock  */
+    if (ConditionalLockRelationOid(reloid, AccessShareLock))
+        rel = try_relation_open(reloid, NoLock);
+
+    if (rel)
+    {
+        TransactionId OldestXmin, FreezeLimit;
+        MultiXactId MultiXactCutoff;
+
+        vacuum_set_xid_limits(rel,
+                              vacuum_freeze_min_age,
+                              vacuum_freeze_table_age,
+                              vacuum_multixact_freeze_min_age,
+                              vacuum_multixact_freeze_table_age,
+                              &OldestXmin, &FreezeLimit, NULL,
+                              &MultiXactCutoff, NULL,
+                              &aggressive, &in_anti_wa_window);
+
+        xid_calculated = true;
+        relation_close(rel, AccessShareLock);
+    }
+
+    /* choose the proper message according to the calculation above */
+    if (xid_calculated)
+    {
+        if (dovacuum)
+        {
+            /* we don't care anti-wraparound if autovacuum is on */
+            if (aggressive)
+                ret = "aggressive";
+            else
+                ret = "partial";
+        }
+        else if (in_anti_wa_window)
+            ret = "close to freeze-limit xid";
+        /* otherwise just "not requried" */
+    }
+    else
+    {
+        /*
+         * failed to compute xid limits. show less-grained messages. We can
+         * use just "required" in the autovacuum case is enough to distinguish
+         * from full-grained messages, but we require additional words in the
+         * case where autovacuum is turned off.
+         */
+        if (dovacuum)
+            ret = "required";
+        else
+            ret = "not required (lock not acquired)";
+    }
+
+    return ret;
+}
+
+/* * Determine the time to sleep, based on the database list. * * The "canlaunch" parameter indicates whether we can
starta worker right now,
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 2956356..ab80794 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -23,6 +23,7 @@#include "pgstat.h"#include "postmaster/bgworker_internals.h"#include "postmaster/postmaster.h"
+#include "postmaster/autovacuum.h"#include "storage/proc.h"#include "storage/procarray.h"#include "utils/acl.h"
@@ -195,6 +196,14 @@ pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS)}Datum
+pg_stat_get_vacuum_necessity(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+
+    PG_RETURN_TEXT_P(cstring_to_text(AutoVacuumRequirement(relid)));
+}
+
+Datumpg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS){    Oid            relid = PG_GETARG_OID(0);
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f3b606b..6b84c9a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2887,6 +2887,8 @@ DATA(insert OID = 3317 (  pg_stat_get_wal_receiver    PGNSP PGUID 12 1 0 0 0 f f
fDESCR("statistics:information about WAL receiver");DATA(insert OID = 6118 (  pg_stat_get_subscription    PGNSP PGUID
121 0 0 0 f f f f f f s r 1 0 2249 "26" "{26,26,26,23,3220,1184,1184,3220,1184}" "{i,o,o,o,o,o,o,o,o}"
"{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}"_null_
_null_pg_stat_get_subscription _null_ _null_ _null_ ));DESCR("statistics: information about subscription"); 
+DATA(insert OID = 2579 (  pg_stat_get_vacuum_necessity    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 25 "26" _null_
_null__null_ _null_ _null_ pg_stat_get_vacuum_necessity _null_ _null_ _null_ ));
 
+DESCR("statistics: true if needs vacuum");DATA(insert OID = 3281 (  pg_stat_get_last_vacuum_index_scans    PGNSP PGUID
121 0 0 0 f f f f t f s r 1 0 23 "26" _null_ _null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_index_scans _null_
_null__null_ ));DESCR("statistics: number of index scans in the last vacuum");DATA(insert OID = 2026 (  pg_backend_pid
             PGNSP PGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_
_null__null_ ));
 
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 60586b2..84bec74 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -182,7 +182,8 @@ extern void vacuum_set_xid_limits(Relation rel,                      TransactionId *freezeLimit,
                 TransactionId *xidFullScanLimit,                      MultiXactId *multiXactCutoff,
 
-                      MultiXactId *mxactFullScanLimit);
+                      MultiXactId *mxactFullScanLimit,
+                      bool *aggressive, bool *in_wa_window);extern void vac_update_datfrozenxid(void);extern void
vacuum_delay_point(void);
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index 3469915..848a322 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -49,6 +49,7 @@ extern int    Log_autovacuum_min_duration;extern bool AutoVacuumingActive(void);extern bool
IsAutoVacuumLauncherProcess(void);externbool IsAutoVacuumWorkerProcess(void);
 
+extern char *AutoVacuumRequirement(Oid reloid);#define IsAnyAutoVacuumProcess() \    (IsAutoVacuumLauncherProcess() ||
IsAutoVacuumWorkerProcess())
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index d0bb46c..e827842 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1759,6 +1759,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_live_tuples(c.oid) AS n_live_tup,
pg_stat_get_dead_tuples(c.oid)AS n_dead_tup,    pg_stat_get_mod_since_analyze(c.oid) AS n_mod_since_analyze,
 
+    pg_stat_get_vacuum_necessity(c.oid) AS vacuum_required,    pg_stat_get_last_vacuum_time(c.oid) AS last_vacuum,
pg_stat_get_last_autovacuum_time(c.oid)AS last_autovacuum,    pg_stat_get_last_analyze_time(c.oid) AS last_analyze,
 
@@ -1907,6 +1908,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,   pg_stat_all_tables.n_mod_since_analyze,
 
+    pg_stat_all_tables.vacuum_required,    pg_stat_all_tables.last_vacuum,    pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
@@ -1951,6 +1953,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.n_live_tup,
pg_stat_all_tables.n_dead_tup,   pg_stat_all_tables.n_mod_since_analyze,
 
+    pg_stat_all_tables.vacuum_required,    pg_stat_all_tables.last_vacuum,    pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,
-- 
2.9.2

From 176973d844c0965c4c7f89025b968790c886f6c0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 16 Nov 2017 15:27:53 +0900
Subject: [PATCH 1/4] Show index scans of the last vacuum in pg_stat_all_tables

This number is already shown in the autovacuum completion log or the
result of VACUUM VERBOSE, but the number is useful to see whether
maintenance_work_mem is large enough so this patch adds the number in
pg_stat_all_tables view.
---doc/src/sgml/config.sgml             |  9 +++++++++doc/src/sgml/monitoring.sgml         |  5
+++++src/backend/catalog/system_views.sql|  1 +src/backend/commands/vacuumlazy.c    |  3
++-src/backend/postmaster/pgstat.c     |  6 +++++-src/backend/utils/adt/pgstatfuncs.c  | 14
++++++++++++++src/include/catalog/pg_proc.h       |  2 ++src/include/pgstat.h                 |  5
++++-src/test/regress/expected/rules.out |  3 +++9 files changed, 45 insertions(+), 3 deletions(-)
 

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fc1752f..b51d219 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1502,6 +1502,15 @@ include_dir 'conf.d'        too high.  It may be useful to control for this by separately
setting<xref linkend="guc-autovacuum-work-mem">.       </para>
 
+       <para>
+         Vacuum scans all index pages to remove index entries that pointed to
+         dead tuples. Finishing vacuum with a minimal number of index scans
+         reduces the time it takes to complete it, and a new scan is triggered
+         once the in-memory storage for dead tuple pointers gets full, whose
+         size is defined by autovacuum_work_mem. So increasing this parameter
+         can make the operation finish more quickly. This can be monitored with
+         <xref linkend="pg-stat-all-tables-view">.
+       </para>      </listitem>     </varlistentry>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6f82033..6a57688 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2570,6 +2570,11 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
daemon</entry>   </row>    <row>
 
+     <entry><structfield>last_vacuum_index_scans</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Number of splitted index scans performed during the last vacuum or autovacuum on this table</entry>
+    </row>
+    <row>     <entry><structfield>vacuum_count</structfield></entry>     <entry><type>bigint</type></entry>
<entry>Numberof times this table has been manually vacuumed
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 394aea8..aeba9d5 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -527,6 +527,7 @@ CREATE VIEW pg_stat_all_tables AS            pg_stat_get_last_autovacuum_time(C.oid) as
last_autovacuum,           pg_stat_get_last_analyze_time(C.oid) as last_analyze,
pg_stat_get_last_autoanalyze_time(C.oid)as last_autoanalyze,
 
+            pg_stat_get_last_vacuum_index_scans(C.oid) AS last_vacuum_index_scans,
pg_stat_get_vacuum_count(C.oid)AS vacuum_count,            pg_stat_get_autovacuum_count(C.oid) AS autovacuum_count,
      pg_stat_get_analyze_count(C.oid) AS analyze_count,
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 6587db7..c482c8e 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -342,7 +342,8 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
pgstat_report_vacuum(RelationGetRelid(onerel),                        onerel->rd_rel->relisshared,
  new_live_tuples,
 
-                         vacrelstats->new_dead_tuples);
+                         vacrelstats->new_dead_tuples,
+                         vacrelstats->num_index_scans);    pgstat_progress_end_command();    /* and log the action if
appropriate*/
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..5f3fdf6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1403,7 +1403,8 @@ pgstat_report_autovac(Oid dboid) */voidpgstat_report_vacuum(Oid tableoid, bool shared,
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples)
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                     PgStat_Counter num_index_scans){    PgStat_MsgVacuum msg;
@@ -1417,6 +1418,7 @@ pgstat_report_vacuum(Oid tableoid, bool shared,    msg.m_vacuumtime = GetCurrentTimestamp();
msg.m_live_tuples= livetuples;    msg.m_dead_tuples = deadtuples;
 
+    msg.m_num_index_scans = num_index_scans;    pgstat_send(&msg, sizeof(msg));}
@@ -4585,6 +4587,7 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
result->n_live_tuples= 0;        result->n_dead_tuples = 0;        result->changes_since_analyze = 0;
 
+        result->n_index_scans = 0;        result->blocks_fetched = 0;        result->blocks_hit = 0;
result->vacuum_timestamp= 0;
 
@@ -5981,6 +5984,7 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)    tabentry->n_live_tuples =
msg->m_live_tuples;   tabentry->n_dead_tuples = msg->m_dead_tuples;
 
+    tabentry->n_index_scans = msg->m_num_index_scans;    if (msg->m_autovacuum)    {
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 8d9e7c1..2956356 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -194,6 +194,20 @@ pg_stat_get_mod_since_analyze(PG_FUNCTION_ARGS)    PG_RETURN_INT64(result);}
+Datum
+pg_stat_get_last_vacuum_index_scans(PG_FUNCTION_ARGS)
+{
+    Oid            relid = PG_GETARG_OID(0);
+    int32        result;
+    PgStat_StatTabEntry *tabentry;
+
+    if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL)
+        result = 0;
+    else
+        result = (int32) (tabentry->n_index_scans);
+
+    PG_RETURN_INT32(result);
+}Datumpg_stat_get_blocks_fetched(PG_FUNCTION_ARGS)
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 0330c04..f3b606b 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2887,6 +2887,8 @@ DATA(insert OID = 3317 (  pg_stat_get_wal_receiver    PGNSP PGUID 12 1 0 0 0 f f
fDESCR("statistics:information about WAL receiver");DATA(insert OID = 6118 (  pg_stat_get_subscription    PGNSP PGUID
121 0 0 0 f f f f f f s r 1 0 2249 "26" "{26,26,26,23,3220,1184,1184,3220,1184}" "{i,o,o,o,o,o,o,o,o}"
"{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}"_null_
_null_pg_stat_get_subscription _null_ _null_ _null_ ));DESCR("statistics: information about subscription");
 
+DATA(insert OID = 3281 (  pg_stat_get_last_vacuum_index_scans    PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "26"
_null__null_ _null_ _null_ _null_ pg_stat_get_last_vacuum_index_scans _null_ _null_ _null_ ));
 
+DESCR("statistics: number of index scans in the last vacuum");DATA(insert OID = 2026 (  pg_backend_pid
PGNSPPGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_
));DESCR("statistics:current backend PID");DATA(insert OID = 1937 (  pg_stat_get_backend_pid        PGNSP PGUID 12 1 0
00 f f f f t f s r 1 0 23 "23" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_pid _null_ _null_ _null_ ));
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..3ab5f4a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -369,6 +369,7 @@ typedef struct PgStat_MsgVacuum    TimestampTz m_vacuumtime;    PgStat_Counter m_live_tuples;
PgStat_Counterm_dead_tuples;
 
+    PgStat_Counter m_num_index_scans;} PgStat_MsgVacuum;
@@ -629,6 +630,7 @@ typedef struct PgStat_StatTabEntry    PgStat_Counter n_live_tuples;    PgStat_Counter
n_dead_tuples;   PgStat_Counter changes_since_analyze;
 
+    PgStat_Counter n_index_scans;    PgStat_Counter blocks_fetched;    PgStat_Counter blocks_hit;
@@ -1165,7 +1167,8 @@ extern void pgstat_reset_single_counter(Oid objectid, PgStat_Single_Reset_Type textern void
pgstat_report_autovac(Oiddboid);extern void pgstat_report_vacuum(Oid tableoid, bool shared,
 
-                     PgStat_Counter livetuples, PgStat_Counter deadtuples);
+                     PgStat_Counter livetuples, PgStat_Counter deadtuples,
+                     PgStat_Counter num_index_scans);extern void pgstat_report_analyze(Relation rel,
  PgStat_Counter livetuples, PgStat_Counter deadtuples,                      bool resetcounter);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f1c1b44..d0bb46c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1763,6 +1763,7 @@ pg_stat_all_tables| SELECT c.oid AS relid,    pg_stat_get_last_autovacuum_time(c.oid) AS
last_autovacuum,   pg_stat_get_last_analyze_time(c.oid) AS last_analyze,    pg_stat_get_last_autoanalyze_time(c.oid) AS
last_autoanalyze,
+    pg_stat_get_last_vacuum_index_scans(c.oid) AS last_vacuum_index_scans,    pg_stat_get_vacuum_count(c.oid) AS
vacuum_count,   pg_stat_get_autovacuum_count(c.oid) AS autovacuum_count,    pg_stat_get_analyze_count(c.oid) AS
analyze_count,
@@ -1910,6 +1911,7 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,   pg_stat_all_tables.last_autoanalyze,
 
+    pg_stat_all_tables.last_vacuum_index_scans,    pg_stat_all_tables.vacuum_count,
pg_stat_all_tables.autovacuum_count,   pg_stat_all_tables.analyze_count,
 
@@ -1953,6 +1955,7 @@ pg_stat_user_tables| SELECT pg_stat_all_tables.relid,    pg_stat_all_tables.last_autovacuum,
pg_stat_all_tables.last_analyze,   pg_stat_all_tables.last_autoanalyze,
 
+    pg_stat_all_tables.last_vacuum_index_scans,    pg_stat_all_tables.vacuum_count,
pg_stat_all_tables.autovacuum_count,   pg_stat_all_tables.analyze_count,
 
-- 
2.9.2

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

22 November 2017, 05:20:22

On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> By the way I'm uneasy that the 'last_vacuum_index_scans' (and
> vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
> both VACUUM command and autovacuum, while last_vacuum and
> vacuum_count is mentioning only the command. Splitting it into
> vacuum/autovaccum seems nonsense but the name is confusing. Do
> you have any idea?

Hm. I think that you should actually have two fields, one for manual
vacuum and one for autovacuum, because each is tied to respectively
maintenance_work_mem and autovacuum_work_mem. This way admins are able
to tune each one of those parameters depending on a look at
pg_stat_all_tables. So those should be named perhaps
last_vacuum_index_scans and last_autovacuum_index_scans?
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

22 November 2017, 10:08:59

Hello,

At Wed, 22 Nov 2017 08:20:22 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQ03JrEwKqbc0fWJe9Lt1-fAQc961OWw+Upw9QmRXak0A@mail.gmail.com>
> On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > By the way I'm uneasy that the 'last_vacuum_index_scans' (and
> > vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
> > both VACUUM command and autovacuum, while last_vacuum and
> > vacuum_count is mentioning only the command. Splitting it into
> > vacuum/autovaccum seems nonsense but the name is confusing. Do
> > you have any idea?
> 
> Hm. I think that you should actually have two fields, one for manual
> vacuum and one for autovacuum, because each is tied to respectively
> maintenance_work_mem and autovacuum_work_mem. This way admins are able

It's very convincing for me. Thanks for the suggestion.

> to tune each one of those parameters depending on a look at
> pg_stat_all_tables. So those should be named perhaps
> last_vacuum_index_scans and last_autovacuum_index_scans?

Agreed. I'll do so in the next version.

# I forgot to add the version to the patch files...

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

22 November 2017, 10:12:33

On Wed, Nov 22, 2017 at 1:08 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Wed, 22 Nov 2017 08:20:22 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQ03JrEwKqbc0fWJe9Lt1-fAQc961OWw+Upw9QmRXak0A@mail.gmail.com>
>> On Tue, Nov 21, 2017 at 4:09 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > By the way I'm uneasy that the 'last_vacuum_index_scans' (and
>> > vacuum_fail_count in 0002 and others in 0003, 0004) is mentioning
>> > both VACUUM command and autovacuum, while last_vacuum and
>> > vacuum_count is mentioning only the command. Splitting it into
>> > vacuum/autovaccum seems nonsense but the name is confusing. Do
>> > you have any idea?
>>
>> Hm. I think that you should actually have two fields, one for manual
>> vacuum and one for autovacuum, because each is tied to respectively
>> maintenance_work_mem and autovacuum_work_mem. This way admins are able
>
> It's very convincing for me. Thanks for the suggestion.
>
>> to tune each one of those parameters depending on a look at
>> pg_stat_all_tables. So those should be named perhaps
>> last_vacuum_index_scans and last_autovacuum_index_scans?
>
> Agreed. I'll do so in the next version.

Thanks for considering the suggestion.

> # I forgot to add the version to the patch files...

Don't worry about that. That's not a problem for me I'll just keep
track of the last entry. With the room I have I'll keep focused on
0001 by the way. Others are of course welcome to look at 0002 and
onwards.
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Robert Haas

Date:

24 November 2017, 21:55:08

On Tue, Nov 21, 2017 at 2:09 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Yes, my concern here is how many column we can allow in a stats
> view. I think I'm a bit too warried about that.

I think that's a good thing to worry about.   In the past, Tom has
expressed reluctance to make stats tables that have a row per table
any wider at all, IIRC.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

25 November 2017, 16:52:42

On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Nov 21, 2017 at 2:09 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> Yes, my concern here is how many column we can allow in a stats
>> view. I think I'm a bit too warried about that.
>
> I think that's a good thing to worry about.   In the past, Tom has
> expressed reluctance to make stats tables that have a row per table
> any wider at all, IIRC.

Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
columns (this takes two full lines on my terminal with a font size at
13). With the first patch of what's proposed on this thread there
would be 24 columns. Perhaps it would be time to split the vacuum
statistics into a new view like pg_stat_tables_vacuum or similar?
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Tom Lane

Date:

25 November 2017, 21:34:24

Michael Paquier <michael.paquier@gmail.com> writes:
> On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I think that's a good thing to worry about.   In the past, Tom has
>> expressed reluctance to make stats tables that have a row per table
>> any wider at all, IIRC.

> Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
> columns (this takes two full lines on my terminal with a font size at
> 13). With the first patch of what's proposed on this thread there
> would be 24 columns. Perhaps it would be time to split the vacuum
> statistics into a new view like pg_stat_tables_vacuum or similar?

My concern is not with the width of any view --- you can always select a
subset of columns if a view is too wide for your screen.  The issue is the
number of counters in the stats collector's state.  We know, without any
doubt, that widening PgStat_StatTabEntry causes serious pain to people
with large numbers of tables; and they have no recourse when we do it.
So the bar to adding more counters is very high IMO.  I won't say never,
but I do doubt that something like skipped vacuums should make the cut.

If we could get rid of the copy-to-a-temporary-file technology for
transferring the stats collector's data to backends, then this problem
would probably vanish or at least get a lot less severe.  But that seems
like a nontrivial project.  With the infrastructure we have today, we
could probably keep the stats tables in a DSM segment; but how would
a backend get a consistent snapshot of them?
        regards, tom lane

Re: [HACKERS] More stats about skipped vacuums

From

Robert Haas

Date:

25 November 2017, 22:10:30

On Sat, Nov 25, 2017 at 10:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> If we could get rid of the copy-to-a-temporary-file technology for
> transferring the stats collector's data to backends, then this problem
> would probably vanish or at least get a lot less severe.  But that seems
> like a nontrivial project.  With the infrastructure we have today, we
> could probably keep the stats tables in a DSM segment; but how would
> a backend get a consistent snapshot of them?

I suppose the obvious approach is to have a big lock around the
statistics data proper; this could be taken in shared mode to take a
snapshot or in exclusive mode to update statistics.  In addition,
create one or more queues where statistics messages can be enqueued in
lieu of updating the main statistics data directly.  If that doesn't
perform well enough, you could keep two copies of the statistics, A
and B.  At any given time, one copy is quiescent and the other copy is
being updated.  Periodically, at a time when we know that nobody is
taking a snapshot of the statistics, they reverse roles.

Of course, the other obvious question is whether we really need a
consistent snapshot, because that's bound to be pretty expensive even
if you eliminate the I/O cost.  Taking a consistent snapshot across
all 100,000 tables in the database even if we're only ever going to
access 5 of those tables doesn't seem like a good or scalable design.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] More stats about skipped vacuums

From

Tom Lane

Date:

25 November 2017, 23:09:54

Robert Haas <robertmhaas@gmail.com> writes:
> Of course, the other obvious question is whether we really need a
> consistent snapshot, because that's bound to be pretty expensive even
> if you eliminate the I/O cost.  Taking a consistent snapshot across
> all 100,000 tables in the database even if we're only ever going to
> access 5 of those tables doesn't seem like a good or scalable design.

Mumble.  It's a property I'm pretty hesitant to give up, especially
since the stats views have worked like that since day one.  It's
inevitable that weakening that guarantee would break peoples' queries,
probably subtly.

What we need is a way to have a consistent snapshot without implementing
it by copying 100,000 tables' worth of data for every query.  Hmm, I heard
of a technique called MVCC once ...
        regards, tom lane

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

26 November 2017, 05:51:53

On Sun, Nov 26, 2017 at 12:34 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Sat, Nov 25, 2017 at 12:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think that's a good thing to worry about.   In the past, Tom has
>>> expressed reluctance to make stats tables that have a row per table
>>> any wider at all, IIRC.
>
>> Tom, any opinions to offer here? pg_stat_all_tables is currently at 22
>> columns (this takes two full lines on my terminal with a font size at
>> 13). With the first patch of what's proposed on this thread there
>> would be 24 columns. Perhaps it would be time to split the vacuum
>> statistics into a new view like pg_stat_tables_vacuum or similar?
>
> My concern is not with the width of any view --- you can always select a
> subset of columns if a view is too wide for your screen.  The issue is the
> number of counters in the stats collector's state.  We know, without any
> doubt, that widening PgStat_StatTabEntry causes serious pain to people
> with large numbers of tables; and they have no recourse when we do it.
> So the bar to adding more counters is very high IMO.  I won't say never,
> but I do doubt that something like skipped vacuums should make the cut.

I am not arguing about skipped vacuuum data here (don't think much of
it by the way), but of the number of index scans done by the last
vacuum or autovacuum. This helps in tunning autovacuum_work_mem and
maintenance_work_mem. The bar is still too high for that?
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Tom Lane

Date:

26 November 2017, 06:59:19

Michael Paquier <michael.paquier@gmail.com> writes:
> I am not arguing about skipped vacuuum data here (don't think much of
> it by the way), but of the number of index scans done by the last
> vacuum or autovacuum. This helps in tunning autovacuum_work_mem and
> maintenance_work_mem. The bar is still too high for that?

I'd say so ... that's something the average user will never bother with,
and even if they knew to bother, it's far from obvious what to do with
the information.  Besides, I don't think you could just save the number
of scans and nothing else.  For it to be meaningful, you'd at least have
to know the prevailing work_mem setting and the number of dead tuples
removed ... and then you'd need some info about your historical average
and maximum number of dead tuples removed, so that you knew whether the
last vacuum operation was at all representative.  So this is sounding
like quite a lot of new counters, in support of perhaps 0.1% of the
user population.  Most people are just going to set maintenance_work_mem
as high as they can tolerate and call it good.
        regards, tom lane

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

26 November 2017, 16:12:09

On Sun, Nov 26, 2017 at 9:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'd say so ... that's something the average user will never bother with,
> and even if they knew to bother, it's far from obvious what to do with
> the information.  Besides, I don't think you could just save the number
> of scans and nothing else.  For it to be meaningful, you'd at least have
> to know the prevailing work_mem setting and the number of dead tuples
> removed ... and then you'd need some info about your historical average
> and maximum number of dead tuples removed, so that you knew whether the
> last vacuum operation was at all representative.  So this is sounding
> like quite a lot of new counters, in support of perhaps 0.1% of the
> user population.  Most people are just going to set maintenance_work_mem
> as high as they can tolerate and call it good.

In all the PostgreSQL deployments I deal with, the database is
embedded with other things running in parallel and memory is something
that's shared between components, so being able to tune more precisely
any of the *_work_mem parameters has value (a couple of applications
are also doing autovacuum tuning at relation-level). Would you think
that it is acceptable to add the number of index scans that happened
with the verbose output then? Personally I could live with that
information. I recall as well a thread about complains that VACUUM
VERBOSE is showing already too much information, I cannot put my
finger on it specifically now though. With
autovacuum_log_min_duration, it is easy enough to trace a vacuum
pattern. The thing is that for now the tuning is not that evident, and
CPU cycles can be worth saving in some deployments while memory could
be extended more easily.
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Tom Lane

Date:

26 November 2017, 23:49:53

Michael Paquier <michael.paquier@gmail.com> writes:
> ... Would you think
> that it is acceptable to add the number of index scans that happened
> with the verbose output then?

I don't have an objection to it, but can't you tell that from VACUUM
VERBOSE already?  There should be a "INFO:  scanned index" line for
each scan.
        regards, tom lane

Re: [HACKERS] More stats about skipped vacuums

From

Robert Haas

Date:

27 November 2017, 02:05:27

On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Of course, the other obvious question is whether we really need a
>> consistent snapshot, because that's bound to be pretty expensive even
>> if you eliminate the I/O cost.  Taking a consistent snapshot across
>> all 100,000 tables in the database even if we're only ever going to
>> access 5 of those tables doesn't seem like a good or scalable design.
>
> Mumble.  It's a property I'm pretty hesitant to give up, especially
> since the stats views have worked like that since day one.  It's
> inevitable that weakening that guarantee would break peoples' queries,
> probably subtly.

You mean, queries against the stats views, or queries in general?  If
the latter, by what mechanism would the breakage happen?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] More stats about skipped vacuums

From

Tom Lane

Date:

27 November 2017, 02:19:41

Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Mumble.  It's a property I'm pretty hesitant to give up, especially
>> since the stats views have worked like that since day one.  It's
>> inevitable that weakening that guarantee would break peoples' queries,
>> probably subtly.

> You mean, queries against the stats views, or queries in general?  If
> the latter, by what mechanism would the breakage happen?

Queries against the stats views, of course.
        regards, tom lane

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

27 November 2017, 07:02:22

On Mon, Nov 27, 2017 at 2:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> ... Would you think
>> that it is acceptable to add the number of index scans that happened
>> with the verbose output then?
>
> I don't have an objection to it, but can't you tell that from VACUUM
> VERBOSE already?  There should be a "INFO:  scanned index" line for
> each scan.

Of course, sorry for the noise.
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Michael Paquier

Date:

27 November 2017, 07:03:25

On Mon, Nov 27, 2017 at 5:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Mumble.  It's a property I'm pretty hesitant to give up, especially
>>> since the stats views have worked like that since day one.  It's
>>> inevitable that weakening that guarantee would break peoples' queries,
>>> probably subtly.
>
>> You mean, queries against the stats views, or queries in general?  If
>> the latter, by what mechanism would the breakage happen?
>
> Queries against the stats views, of course.

There has been much discussion on this thread, and the set of patches
as presented may hurt performance for users with a large number of
tables, so I am marking it as returned with feedback.
-- 
Michael

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

27 November 2017, 12:49:38

At Mon, 27 Nov 2017 10:03:25 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqTaVWd9vAjRzMOCKHP9k6ge-0u4w_7-YHKZ+gynGN8fpg@mail.gmail.com>
> On Mon, Nov 27, 2017 at 5:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> Mumble.  It's a property I'm pretty hesitant to give up, especially
> >>> since the stats views have worked like that since day one.  It's
> >>> inevitable that weakening that guarantee would break peoples' queries,
> >>> probably subtly.
> >
> >> You mean, queries against the stats views, or queries in general?  If
> >> the latter, by what mechanism would the breakage happen?
> >
> > Queries against the stats views, of course.
> 
> There has been much discussion on this thread, and the set of patches
> as presented may hurt performance for users with a large number of
> tables, so I am marking it as returned with feedback.
> -- 
> Michael

Hmmm. Okay, we must make stats collector more effeicient if we
want to have additional counters with smaller significance in the
table stats. Currently sizeof(PgStat_StatTabEntry) is 168
bytes. The whole of the patchset increases it to 232 bytes. Thus
the size of a stat file for a database with 10000 tables
increases from about 1.7MB to 2.4MB.  DSM and shared dynahash is
not dynamically expandable so placing stats on shared hash
doesn't seem effective. Stats as a regular table could work but
it seems too-much.

Is it acceptable that adding a new section containing this new
counters, which is just loaded as a byte sequence and parsing
(and filling the corresponding hash) is postponed until a counter
in the section is really requested?  The new counters need to be
shown in a separate stats view (maybe named pg_stat_vacuum).

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] More stats about skipped vacuums

From

Robert Haas

Date:

28 November 2017, 00:51:22

On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hmmm. Okay, we must make stats collector more effeicient if we
> want to have additional counters with smaller significance in the
> table stats. Currently sizeof(PgStat_StatTabEntry) is 168
> bytes. The whole of the patchset increases it to 232 bytes. Thus
> the size of a stat file for a database with 10000 tables
> increases from about 1.7MB to 2.4MB.  DSM and shared dynahash is
> not dynamically expandable so placing stats on shared hash
> doesn't seem effective. Stats as a regular table could work but
> it seems too-much.

dshash, which is already committed, is both DSM-based and dynamically
expandable.

> Is it acceptable that adding a new section containing this new
> counters, which is just loaded as a byte sequence and parsing
> (and filling the corresponding hash) is postponed until a counter
> in the section is really requested?  The new counters need to be
> shown in a separate stats view (maybe named pg_stat_vacuum).

Still makes the stats file bigger.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] More stats about skipped vacuums

From

Robert Haas

Date:

28 November 2017, 00:53:53

On Sun, Nov 26, 2017 at 3:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Mumble.  It's a property I'm pretty hesitant to give up, especially
>>> since the stats views have worked like that since day one.  It's
>>> inevitable that weakening that guarantee would break peoples' queries,
>>> probably subtly.
>
>> You mean, queries against the stats views, or queries in general?  If
>> the latter, by what mechanism would the breakage happen?
>
> Queries against the stats views, of course.

Hmm.  Those are probably rare.  If we only took a snapshot of the
statistics for the backends that explicitly access those views, that
probably wouldn't be too crazy.

Sorry if this is a stupid question, but how often and for what purpose
to regular backends need the stats collector data for purposes other
than querying the stats views?  I thought that the data was only used
to decide whether to VACUUM/ANALYZE, and therefore would be accessed
mostly by autovacuum, and for that you'd actually want the most
up-to-date view of the stats for a particular table that is available,
not any older snapshot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] More stats about skipped vacuums

From

Magnus Hagander

Date:

28 November 2017, 03:05:06

On Mon, Nov 27, 2017 at 7:53 PM, Robert Haas wrote: > On Sun, Nov 26, 2017 at 3:19 PM, Tom Lane wrote: > > Robert Haas writes: > >> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane wrote: > >>> Mumble. It's a property I'm pretty hesitant to give up, especially > >>> since the stats views have worked like that since day one. It's > >>> inevitable that weakening that guarantee would break peoples' queries, > >>> probably subtly. > > > >> You mean, queries against the stats views, or queries in general? If > >> the latter, by what mechanism would the breakage happen? > > > > Queries against the stats views, of course. > > Hmm. Those are probably rare. If we only took a snapshot of the > statistics for the backends that explicitly access those views, that > probably wouldn't be too crazy. > > Sorry if this is a stupid question, but how often and for what purpose > to regular backends need the stats collector data for purposes other > than querying the stats views? I thought that the data was only used > to decide whether to VACUUM/ANALYZE, and therefore would be accessed > mostly by autovacuum, and for that you'd actually want the most > up-to-date view of the stats for a particular table that is available, > not any older snapshot. > > Autovacuum resets the stats to make sure. Autovacuum in particular can probably be made a lot more efficient, because it only ever looks at one relation at a time, I think. What I've been thinking about for that one before is if we could just invent a protocol (shmq based maybe) whereby autovacuum can ask the stats collector for a single table or index stat. If autovacuum never needs to see a consistent view between multiple tables, I would think that's going to be a win in a lot of cases. I don't think regular backends use them at all. But anybody looking at the stats do, and it is pretty important there. However, when it comes to the stats system, I'd say that on any busy system (which would be the ones to care about), the stats structures are still going to be *written* a lot more than they are read. We certainly don't read them at the rate of once per transaction. A lot of the reads are also limited to one database of course. I wonder if we want to implement some sort of copy-on-read-snapshot in the stats collector itself. So instead of unconditionally publishing everything, have the backends ask for it. When a backend asks for it it gets a "snapshot counter" or something from the stats collector, and on the next write after that we do a copy-write if the snapshot it still available. (no, i have not thought in detail) Or -- if we keep a per-database hashtable in dynamic shared memory (which we can now). Can we copy it into local memory in the backend fast enough that we can hold a lock and just queue up the stats updates during the copy? If we can copy the complete structure, that would fix one of the bigger bottlenecks with it today which is that we dump and rebuild the hashtables as we go through the tempfiles. -- Magnus HaganderMe: https://www.hagander.net/ Work: https://www.redpill-linpro.com/

Re: [HACKERS] More stats about skipped vacuums

From

Tom Lane

Date:

28 November 2017, 05:16:06

Magnus Hagander <magnus@hagander.net> writes:
> What I've been thinking about for that one before is if we could just
> invent a protocol (shmq based maybe) whereby autovacuum can ask the stats
> collector for a single table or index stat. If autovacuum never needs to
> see a consistent view between multiple tables, I would think that's going
> to be a win in a lot of cases.

Perhaps.  Autovac might run through quite a few tables before it finds
one in need of processing, though, so I'm not quite certain this would
yield such great benefits in isolation.

> However, when it comes to the stats system, I'd say that on any busy system
> (which would be the ones to care about), the stats structures are still
> going to be *written* a lot more than they are read.

Uh, what?  The stats collector doesn't write those files at all except
on-demand.
        regards, tom lane

Re: [HACKERS] More stats about skipped vacuums

From

Magnus Hagander

Date:

28 November 2017, 14:35:50

On Tue, Nov 28, 2017 at 12:16 AM, Tom Lane wrote: > Magnus Hagander writes: > > What I've been thinking about for that one before is if we could just > > invent a protocol (shmq based maybe) whereby autovacuum can ask the stats > > collector for a single table or index stat. If autovacuum never needs to > > see a consistent view between multiple tables, I would think that's going > > to be a win in a lot of cases. > > Perhaps. Autovac might run through quite a few tables before it finds > one in need of processing, though, so I'm not quite certain this would > yield such great benefits in isolation. > Hmm. Good point. > > However, when it comes to the stats system, I'd say that on any busy > system > > (which would be the ones to care about), the stats structures are still > > going to be *written* a lot more than they are read. > > Uh, what? The stats collector doesn't write those files at all except > on-demand. > Oops. Missing one important word. They're going to be *written to* a lot more than they are read. Meaning that each individual value is likely to be updated many times before it's ever read. In memory, in the stats collector. So not talking about the files at all -- just the numbers, independent of implementation. -- Magnus HaganderMe: https://www.hagander.net/ Work: https://www.redpill-linpro.com/

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

11 December 2017, 17:15:23

At Mon, 27 Nov 2017 13:51:22 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+Tgmob2tuqvEZfHV2kLC-xobsZxDWGdc1WmjLg5+iOPLa0NHg@mail.gmail.com>
> On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hmmm. Okay, we must make stats collector more effeicient if we
> > want to have additional counters with smaller significance in the
> > table stats. Currently sizeof(PgStat_StatTabEntry) is 168
> > bytes. The whole of the patchset increases it to 232 bytes. Thus
> > the size of a stat file for a database with 10000 tables
> > increases from about 1.7MB to 2.4MB.  DSM and shared dynahash is
> > not dynamically expandable so placing stats on shared hash
> > doesn't seem effective. Stats as a regular table could work but
> > it seems too-much.
> 
> dshash, which is already committed, is both DSM-based and dynamically
> expandable.

Yes, I forgot about that. We can just copy memory blocks to take
a snapshot of stats.

> > Is it acceptable that adding a new section containing this new
> > counters, which is just loaded as a byte sequence and parsing
> > (and filling the corresponding hash) is postponed until a counter
> > in the section is really requested?  The new counters need to be
> > shown in a separate stats view (maybe named pg_stat_vacuum).
> 
> Still makes the stats file bigger.

I considered dshash for pgstat.c and the attached is a *PoC*
patch, which is not full-fledged and just working under a not so
concurrent situation.

- Made stats collector an auxiliary process. A crash of stats
  collector leads to a whole-server restarting.

- dshash lacks capability of sequential scan so added it.

- Also added snapshot function to dshash. It just copies
  unerlying DSA segments into local memory but currently it
  doesn't aquire dshash-level locks at all. I tried the same
  thing with resize but it leads to very quick exhaustion of
  LWLocks. An LWLock for the whole dshash would be required. (and
  it is also useful to resize() and sequential scan.

- The current dshash doesn't shrink at all. Such a feature also
  will be required. (A server restart causes a shrink of hashes
  in the same way as before but bloat dshash requires copying
  more than necessary size of memory on takeing a snapshot.)

The size of DSA is about 1MB at minimum. Copying entry-by-entry
into (non-ds) hash might be better than copying underlying DSA as
a whole, and DSA/DSHASH snapshot brings a kind of dirty..


Does anyone give me opinions or suggestions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From f5c9c45384ec43734c0890dd875101defe6590bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 1 Dec 2017 14:34:47 +0900
Subject: [PATCH 1/4] Simple implement of local shapshot of dshash.

Add snapshot feature to DSHASH. This makes palloc'ed copy of
underlying DSA and returns unmodifiable DSHASH using the copied DSA.
---
 src/backend/lib/dshash.c     | 74 +++++++++++++++++++++++++++++++++++++++-----
 src/backend/utils/mmgr/dsa.c | 57 +++++++++++++++++++++++++++++++++-
 src/include/lib/dshash.h     |  1 +
 src/include/utils/dsa.h      |  1 +
 4 files changed, 124 insertions(+), 9 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index dd87573..973a826 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -112,6 +112,7 @@ struct dshash_table
     size_t        size_log2;        /* log2(number of buckets) */
     bool        find_locked;    /* Is any partition lock held by 'find'? */
     bool        find_exclusively_locked;    /* ... exclusively? */
+    bool        is_snapshot;    /* Is this hash is a local snapshot?*/
 };
 
 /* Given a pointer to an item, find the entry (user data) it holds. */
@@ -228,6 +229,7 @@ dshash_create(dsa_area *area, const dshash_parameters *params, void *arg)
 
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->is_snapshot = false;
 
     /*
      * Set up the initial array of buckets.  Our initial size is the same as
@@ -279,6 +281,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
     hash_table->control = dsa_get_address(area, control);
     hash_table->find_locked = false;
     hash_table->find_exclusively_locked = false;
+    hash_table->is_snapshot = false;
     Assert(hash_table->control->magic == DSHASH_MAGIC);
 
     /*
@@ -321,6 +324,15 @@ dshash_destroy(dshash_table *hash_table)
     size_t        i;
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
+
+    /* this is a local copy */
+    if (hash_table->is_snapshot)
+    {
+        pfree(hash_table->area);
+        pfree(hash_table);
+        return;
+    }
+
     ensure_valid_bucket_pointers(hash_table);
 
     /* Free all the entries. */
@@ -355,6 +367,29 @@ dshash_destroy(dshash_table *hash_table)
 }
 
 /*
+ * take a local snapshot of a dshash table
+ */
+dshash_table *
+dshash_take_snapshot(dshash_table *org_table, dsa_area *new_area)
+{
+    dshash_table *new_table;
+
+    if (org_table->is_snapshot)
+        elog(ERROR, "cannot make local copy of local dshash");
+
+    new_table = palloc(sizeof(dshash_table));
+
+    new_table->area = new_area;
+    new_table->params = org_table->params;
+    new_table->control = dsa_get_address(new_table->area,
+                                         org_table->control->handle);
+    /* mark this dshash as a local copy */
+    new_table->is_snapshot = true;
+
+    return new_table;
+}
+
+/*
  * Get a handle that can be used by other processes to attach to this hash
  * table.
  */
@@ -392,15 +427,22 @@ dshash_find(dshash_table *hash_table, const void *key, bool exclusive)
     partition = PARTITION_FOR_HASH(hash);
 
     Assert(hash_table->control->magic == DSHASH_MAGIC);
-    Assert(!hash_table->find_locked);
 
-    LWLockAcquire(PARTITION_LOCK(hash_table, partition),
-                  exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    if (!hash_table->is_snapshot)
+    {
+        Assert(!hash_table->find_locked);
+        LWLockAcquire(PARTITION_LOCK(hash_table, partition),
+                      exclusive ? LW_EXCLUSIVE : LW_SHARED);
+    }
     ensure_valid_bucket_pointers(hash_table);
 
     /* Search the active bucket. */
     item = find_in_bucket(hash_table, key, BUCKET_FOR_HASH(hash_table, hash));
 
+    /* don't lock if this is a local copy */
+    if (hash_table->is_snapshot)
+        return item ? ENTRY_FROM_ITEM(item) : NULL;
+
     if (!item)
     {
         /* Not found. */
@@ -436,6 +478,9 @@ dshash_find_or_insert(dshash_table *hash_table,
     dshash_partition *partition;
     dshash_table_item *item;
 
+    if (hash_table->is_snapshot)
+        elog(ERROR, "cannot insert into local dshash");
+
     hash = hash_key(hash_table, key);
     partition_index = PARTITION_FOR_HASH(hash);
     partition = &hash_table->control->partitions[partition_index];
@@ -505,6 +550,9 @@ dshash_delete_key(dshash_table *hash_table, const void *key)
     size_t        partition;
     bool        found;
 
+    if (hash_table->is_snapshot)
+        elog(ERROR, "cannot delete from a snapshot");
+
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
@@ -545,6 +593,7 @@ dshash_delete_entry(dshash_table *hash_table, void *entry)
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(hash_table->find_locked);
     Assert(hash_table->find_exclusively_locked);
+    Assert(!hash_table->is_snapshot);
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition),
                                 LW_EXCLUSIVE));
 
@@ -563,6 +612,9 @@ dshash_release_lock(dshash_table *hash_table, void *entry)
     dshash_table_item *item = ITEM_FROM_ENTRY(entry);
     size_t        partition_index = PARTITION_FOR_HASH(item->hash);
 
+    if (hash_table->is_snapshot)
+        return;
+
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(hash_table->find_locked);
     Assert(LWLockHeldByMeInMode(PARTITION_LOCK(hash_table, partition_index),
@@ -605,10 +657,13 @@ dshash_dump(dshash_table *hash_table)
     Assert(hash_table->control->magic == DSHASH_MAGIC);
     Assert(!hash_table->find_locked);
 
-    for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+    if (!hash_table->is_snapshot)
     {
-        Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
-        LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+        {
+            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
+            LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);
+        }
     }
 
     ensure_valid_bucket_pointers(hash_table);
@@ -643,8 +698,11 @@ dshash_dump(dshash_table *hash_table)
         }
     }
 
-    for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
-        LWLockRelease(PARTITION_LOCK(hash_table, i));
+    if (!hash_table->is_snapshot)
+    {
+        for (i = 0; i < DSHASH_NUM_PARTITIONS; ++i)
+            LWLockRelease(PARTITION_LOCK(hash_table, i));
+    }
 }
 
 /*
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index fe62788..dd02147 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -319,6 +319,7 @@ typedef struct
     bool        pinned;
     /* The number of times that segments have been freed. */
     Size        freed_segment_counter;
+    bool        is_snapshot;
     /* The LWLock tranche ID. */
     int            lwlock_tranche_id;
     /* The general lock (protects everything except object pools). */
@@ -931,7 +932,8 @@ dsa_get_address(dsa_area *area, dsa_pointer dp)
         return NULL;
 
     /* Process any requests to detach from freed segments. */
-    check_for_freed_segments(area);
+    if (!area->control->is_snapshot)
+        check_for_freed_segments(area);
 
     /* Break the dsa_pointer into its components. */
     index = DSA_EXTRACT_SEGMENT_NUMBER(dp);
@@ -1232,6 +1234,7 @@ create_internal(void *place, size_t size,
     control->high_segment_index = 0;
     control->refcnt = 1;
     control->freed_segment_counter = 0;
+    control->is_snapshot = false;
     control->lwlock_tranche_id = tranche_id;
 
     /*
@@ -2239,3 +2242,55 @@ check_for_freed_segments(dsa_area *area)
         area->freed_segment_counter = freed_segment_counter;
     }
 }
+
+/*
+ * Make a static local copy of this dsa area.
+ */
+dsa_area *
+dsa_take_snapshot(dsa_area *source_area)
+{
+    dsa_area   *area;
+    Size        size;
+    int i;
+    char        *mem;
+
+    /* allocate required size of memory */
+    size = sizeof(dsa_area);
+    size += sizeof(dsa_area_control);
+
+    LWLockAcquire(DSA_AREA_LOCK(source_area), LW_SHARED);
+    for (i = 0 ; i <= source_area->high_segment_index ; i++)
+        size += source_area->segment_maps[i].header->size;
+    mem = palloc(size);
+
+    area = (dsa_area *)mem;
+    mem += sizeof(dsa_area);
+    area->control = (dsa_area_control *)mem;
+    mem += sizeof(dsa_area_control);
+    memcpy(area->control, source_area->control, sizeof(dsa_area_control));
+    area->control->is_snapshot = true;
+    area->mapping_pinned = false;
+
+    /* Copy and connect the all segments */
+    for (i = 0 ; i <= source_area->high_segment_index ; i++)
+    {
+        dsa_segment_map *smap = &source_area->segment_maps[i];
+        dsa_segment_map *dmap = &area->segment_maps[i];
+
+        dmap->mapped_address = mem;
+        memcpy(dmap->mapped_address, smap->mapped_address, smap->header->size);
+        mem += smap->header->size;
+        dmap->header = (dsa_segment_header*) dmap->mapped_address;
+        dmap->header->magic = 0;
+        dmap->fpm = NULL;
+        dmap->pagemap = (dsa_pointer *)
+            (dmap->mapped_address + MAXALIGN(sizeof(dsa_area_control)) +
+             MAXALIGN(sizeof(FreePageManager)));
+    }
+
+    area->high_segment_index = source_area->high_segment_index;
+    LWLockRelease(DSA_AREA_LOCK(source_area));
+
+    elog(LOG, "dsa_take_snapshot copied %lu bytes", size);
+    return area;
+}
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index 220553c..d8f48ed 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -70,6 +70,7 @@ extern dshash_table *dshash_attach(dsa_area *area,
 extern void dshash_detach(dshash_table *hash_table);
 extern dshash_table_handle dshash_get_hash_table_handle(dshash_table *hash_table);
 extern void dshash_destroy(dshash_table *hash_table);
+extern dshash_table *dshash_take_snapshot(dshash_table *org_table, dsa_area *new_area);
 
 /* Finding, creating, deleting entries. */
 extern void *dshash_find(dshash_table *hash_table,
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 516ef61..c10fb4a 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,5 +121,6 @@ extern void dsa_free(dsa_area *area, dsa_pointer dp);
 extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
 extern void dsa_trim(dsa_area *area);
 extern void dsa_dump(dsa_area *area);
+extern dsa_area *dsa_take_snapshot(dsa_area *source_area);
 
 #endif                            /* DSA_H */
-- 
2.9.2

From 89f575876812fb73724a5cc64dd605d4fd52a47b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Fri, 8 Dec 2017 21:53:36 +0900
Subject: [PATCH 2/4] Add seqscan on dshash

A WIP implement of seqscan support for dshash.  This doesn't takes any
lock so hash can be currupt from any reason.
---
 src/backend/lib/dshash.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++
 src/include/lib/dshash.h | 14 +++++++++++
 2 files changed, 77 insertions(+)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index 973a826..adc2131 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -644,6 +644,69 @@ dshash_memhash(const void *v, size_t size, void *arg)
     return tag_hash(v, size);
 }
 
+void
+dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table)
+{
+    status->hash_table = hash_table;
+    status->curbucket = 0;
+    status->nbuckets = ((size_t) 1) << hash_table->control->size_log2;
+    status->curitem = NULL;
+
+    ensure_valid_bucket_pointers(hash_table);
+}
+
+void *
+dshash_seq_next(dshash_seq_status *status)
+{
+    dsa_pointer next_item_pointer;
+
+    if (status->curitem == NULL)
+    {
+        Assert (status->curbucket == 0);
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+    else
+        next_item_pointer = status->curitem->next;
+
+    while (!DsaPointerIsValid(next_item_pointer))
+    {
+        if (++status->curbucket >= status->nbuckets)
+        {
+            dshash_seq_release(status);
+            return NULL;
+        }
+        next_item_pointer = status->hash_table->buckets[status->curbucket];
+    }
+
+    status->curitem =
+        dsa_get_address(status->hash_table->area, next_item_pointer);
+    return ENTRY_FROM_ITEM(status->curitem);
+}
+
+int
+dshash_get_num_entries(dshash_table *hash_table)
+{
+    /* a shotcut implement. should be improved  */
+    dshash_seq_status s;
+    void *p;
+    int n = 0;
+
+    dshash_seq_init(&s, hash_table);
+    while ((p = dshash_seq_next(&s)) != NULL)
+    {
+        dshash_release_lock(hash_table, p);
+        n++;
+    }
+
+    return n;
+}
+
+void
+dshash_seq_release(dshash_seq_status *status)
+{
+    /* nothing to do so far..*/
+}
+
 /*
  * Print debugging information about the internal state of the hash table to
  * stderr.  The caller must hold no partition locks.
diff --git a/src/include/lib/dshash.h b/src/include/lib/dshash.h
index d8f48ed..460364c 100644
--- a/src/include/lib/dshash.h
+++ b/src/include/lib/dshash.h
@@ -59,6 +59,15 @@ typedef struct dshash_parameters
 struct dshash_table_item;
 typedef struct dshash_table_item dshash_table_item;
 
+struct dshash_seq_status
+{
+    dshash_table       *hash_table;
+    int                    curbucket;
+    int                    nbuckets;
+    dshash_table_item  *curitem;
+};
+typedef struct dshash_seq_status dshash_seq_status;
+
 /* Creating, sharing and destroying from hash tables. */
 extern dshash_table *dshash_create(dsa_area *area,
               const dshash_parameters *params,
@@ -81,6 +90,11 @@ extern bool dshash_delete_key(dshash_table *hash_table, const void *key);
 extern void dshash_delete_entry(dshash_table *hash_table, void *entry);
 extern void dshash_release_lock(dshash_table *hash_table, void *entry);
 
+/* seq scan support */
+extern void dshash_seq_init(dshash_seq_status *status, dshash_table *hash_table);
+extern void *dshash_seq_next(dshash_seq_status *status);
+extern void dshash_seq_release(dshash_seq_status *status);
+extern int dshash_get_num_entries(dshash_table *hash_table);
 /* Convenience hash and compare functions wrapping memcmp and tag_hash. */
 extern int    dshash_memcmp(const void *a, const void *b, size_t size, void *arg);
 extern dshash_hash dshash_memhash(const void *v, size_t size, void *arg);
-- 
2.9.2

From f8daa8726ed4774a58b3c5d775a1ec89182a5325 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Fri, 8 Dec 2017 22:18:29 +0900
Subject: [PATCH 3/4] Change stats collector to an axiliary process.

Shared memory and LWLocks are required to let stats collector use
dshash. This patch makes stats collector an auxiliary process.
---
 src/backend/bootstrap/bootstrap.c   |  8 +++++
 src/backend/postmaster/pgstat.c     | 58 +++++++++++++++++++++++++------------
 src/backend/postmaster/postmaster.c | 24 +++++++++------
 src/include/miscadmin.h             |  2 +-
 src/include/pgstat.h                |  4 ++-
 5 files changed, 67 insertions(+), 29 deletions(-)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 8287de9..374a917 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -335,6 +335,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
             case WalReceiverProcess:
                 statmsg = pgstat_get_backend_desc(B_WAL_RECEIVER);
                 break;
+            case StatsCollectorProcess:
+                statmsg = pgstat_get_backend_desc(B_STATS_COLLECTOR);
+                break;
             default:
                 statmsg = "??? process";
                 break;
@@ -460,6 +463,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
             WalReceiverMain();
             proc_exit(1);        /* should never return */
 
+        case StatsCollectorProcess:
+            /* don't set signals, stats collector has its own agenda */
+            PgstatCollectorMain();
+            proc_exit(1);        /* should never return */
+
         default:
             elog(PANIC, "unrecognized process type: %d", (int) MyAuxProcType);
             proc_exit(1);
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..4ee9890 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -267,6 +267,7 @@ static List *pending_write_requests = NIL;
 /* Signal handler flags */
 static volatile bool need_exit = false;
 static volatile bool got_SIGHUP = false;
+static volatile bool got_SIGTERM = false;
 
 /*
  * Total time charged to functions so far in the current backend.
@@ -284,8 +285,8 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
-NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
-static void pgstat_exit(SIGNAL_ARGS);
+static void pgstat_shutdown_handler(SIGNAL_ARGS);
+static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
@@ -770,11 +771,7 @@ pgstat_start(void)
             /* Close the postmaster's sockets */
             ClosePostmasterPorts(false);
 
-            /* Drop our connection to postmaster's shared memory, as well */
-            dsm_detach_all();
-            PGSharedMemoryDetach();
-
-            PgstatCollectorMain(0, NULL);
+            PgstatCollectorMain();
             break;
 #endif
 
@@ -2870,6 +2867,9 @@ pgstat_bestart(void)
             case WalReceiverProcess:
                 beentry->st_backendType = B_WAL_RECEIVER;
                 break;
+            case StatsCollectorProcess:
+                beentry->st_backendType = B_STATS_COLLECTOR;
+                break;
             default:
                 elog(FATAL, "unrecognized process type: %d",
                      (int) MyAuxProcType);
@@ -4077,6 +4077,9 @@ pgstat_get_backend_desc(BackendType backendType)
         case B_WAL_WRITER:
             backendDesc = "walwriter";
             break;
+        case B_STATS_COLLECTOR:
+            backendDesc = "stats collector";
+            break;
     }
 
     return backendDesc;
@@ -4194,8 +4197,8 @@ pgstat_send_bgwriter(void)
  *    The argc/argv parameters are valid only in EXEC_BACKEND case.
  * ----------
  */
-NON_EXEC_STATIC void
-PgstatCollectorMain(int argc, char *argv[])
+void
+PgstatCollectorMain(void)
 {
     int            len;
     PgStat_Msg    msg;
@@ -4208,8 +4211,8 @@ PgstatCollectorMain(int argc, char *argv[])
      */
     pqsignal(SIGHUP, pgstat_sighup_handler);
     pqsignal(SIGINT, SIG_IGN);
-    pqsignal(SIGTERM, SIG_IGN);
-    pqsignal(SIGQUIT, pgstat_exit);
+    pqsignal(SIGTERM, pgstat_shutdown_handler);
+    pqsignal(SIGQUIT, pgstat_quickdie_handler);
     pqsignal(SIGALRM, SIG_IGN);
     pqsignal(SIGPIPE, SIG_IGN);
     pqsignal(SIGUSR1, SIG_IGN);
@@ -4254,14 +4257,14 @@ PgstatCollectorMain(int argc, char *argv[])
         /*
          * Quit if we get SIGQUIT from the postmaster.
          */
-        if (need_exit)
+        if (got_SIGTERM)
             break;
 
         /*
          * Inner loop iterates as long as we keep getting messages, or until
          * need_exit becomes set.
          */
-        while (!need_exit)
+        while (!got_SIGTERM)
         {
             /*
              * Reload configuration if we got SIGHUP from the postmaster.
@@ -4449,14 +4452,21 @@ PgstatCollectorMain(int argc, char *argv[])
 
 /* SIGQUIT signal handler for collector process */
 static void
-pgstat_exit(SIGNAL_ARGS)
+pgstat_quickdie_handler(SIGNAL_ARGS)
 {
-    int            save_errno = errno;
+    PG_SETMASK(&BlockSig);
 
-    need_exit = true;
-    SetLatch(MyLatch);
+    /*
+     * We DO NOT want to run proc_exit() callbacks -- we're here because
+     * shared memory may be corrupted, so we don't want to try to clean up our
+     * transaction.  Just nail the windows shut and get out of town.  Now that
+     * there's an atexit callback to prevent third-party code from breaking
+     * things by calling exit() directly, we have to reset the callbacks
+     * explicitly to make this work as intended.
+     */
+    on_exit_reset();
 
-    errno = save_errno;
+    exit(2);
 }
 
 /* SIGHUP handler for collector process */
@@ -4471,6 +4481,18 @@ pgstat_sighup_handler(SIGNAL_ARGS)
     errno = save_errno;
 }
 
+static void
+pgstat_shutdown_handler(SIGNAL_ARGS)
+{
+    int save_errno = errno;
+
+    got_SIGTERM = true;
+
+    SetLatch(MyLatch);
+
+    errno = save_errno;
+}
+
 /*
  * Subroutine to clear stats in a database entry
  *
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 17c7f7e..d5fda5d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -144,7 +144,8 @@
 #define BACKEND_TYPE_AUTOVAC    0x0002    /* autovacuum worker process */
 #define BACKEND_TYPE_WALSND        0x0004    /* walsender process */
 #define BACKEND_TYPE_BGWORKER    0x0008    /* bgworker process */
-#define BACKEND_TYPE_ALL        0x000F    /* OR of all the above */
+#define BACKEND_TYPE_STATS        0x0010    /* bgworker process */
+#define BACKEND_TYPE_ALL        0x001F    /* OR of all the above */
 
 #define BACKEND_TYPE_WORKER        (BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER)
 
@@ -550,6 +551,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 #define StartCheckpointer()        StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()        StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()        StartChildProcess(WalReceiverProcess)
+#define StartStatsCollector()    StartChildProcess(StatsCollectorProcess)
 
 /* Macros to check exit status of a child process */
 #define EXIT_STATUS_0(st)  ((st) == 0)
@@ -1811,7 +1813,7 @@ ServerLoop(void)
         /* If we have lost the stats collector, try to start a new one */
         if (PgStatPID == 0 &&
             (pmState == PM_RUN || pmState == PM_HOT_STANDBY))
-            PgStatPID = pgstat_start();
+            PgStatPID = StartStatsCollector();
 
         /* If we have lost the archiver, try to start a new one. */
         if (PgArchPID == 0 && PgArchStartupAllowed())
@@ -2929,7 +2931,7 @@ reaper(SIGNAL_ARGS)
             if (PgArchStartupAllowed() && PgArchPID == 0)
                 PgArchPID = pgarch_start();
             if (PgStatPID == 0)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
 
             /* workers may be scheduled to start now */
             maybe_start_bgworkers();
@@ -3002,7 +3004,7 @@ reaper(SIGNAL_ARGS)
                  * nothing left for it to do.
                  */
                 if (PgStatPID != 0)
-                    signal_child(PgStatPID, SIGQUIT);
+                    signal_child(PgStatPID, SIGTERM);
             }
             else
             {
@@ -3088,10 +3090,10 @@ reaper(SIGNAL_ARGS)
         {
             PgStatPID = 0;
             if (!EXIT_STATUS_0(exitstatus))
-                LogChildExit(LOG, _("statistics collector process"),
-                             pid, exitstatus);
+                HandleChildCrash(pid, exitstatus,
+                                 _("statistics collector process"));
             if (pmState == PM_RUN || pmState == PM_HOT_STANDBY)
-                PgStatPID = pgstat_start();
+                PgStatPID = StartStatsCollector();
             continue;
         }
 
@@ -3321,7 +3323,7 @@ CleanupBackend(int pid,
 
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
- * walwriter, autovacuum, or background worker.
+ * walwriter, autovacuum, stats collector or background worker.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -5114,7 +5116,7 @@ sigusr1_handler(SIGNAL_ARGS)
          * Likewise, start other special children as needed.
          */
         Assert(PgStatPID == 0);
-        PgStatPID = pgstat_start();
+        PgStatPID = StartStatsCollector();
 
         ereport(LOG,
                 (errmsg("database system is ready to accept read only connections")));
@@ -5404,6 +5406,10 @@ StartChildProcess(AuxProcType type)
                 ereport(LOG,
                         (errmsg("could not fork WAL receiver process: %m")));
                 break;
+            case StatsCollectorProcess:
+                ereport(LOG,
+                        (errmsg("could not fork stats collector process: %m")));
+                break;
             default:
                 ereport(LOG,
                         (errmsg("could not fork process: %m")));
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 59da7a6..b054dab 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -395,7 +395,7 @@ typedef enum
     CheckpointerProcess,
     WalWriterProcess,
     WalReceiverProcess,
-
+    StatsCollectorProcess,
     NUM_AUXPROCTYPES            /* Must be last! */
 } AuxProcType;
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..e2a1e21 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -710,7 +710,8 @@ typedef enum BackendType
     B_STARTUP,
     B_WAL_RECEIVER,
     B_WAL_SENDER,
-    B_WAL_WRITER
+    B_WAL_WRITER,
+    B_STATS_COLLECTOR
 } BackendType;
 
 
@@ -1327,6 +1328,7 @@ extern void pgstat_send_bgwriter(void);
  * generate the pgstat* views.
  * ----------
  */
+extern void PgstatCollectorMain(void);
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
 extern PgBackendStatus *pgstat_fetch_stat_beentry(int beid);
-- 
2.9.2

From 9fe2da9442033ee3409592e108bd1a17a3392909 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@oss.ntt.co.jp>
Date: Sat, 9 Dec 2017 01:40:55 +0900
Subject: [PATCH 4/4] Change stats sharing method

Stats collector no longer uses files to distribute stats numbers. They
are now stored in dynamic shared hash.
---
 src/backend/lib/dshash.c                      |    2 +-
 src/backend/postmaster/autovacuum.c           |    6 +-
 src/backend/postmaster/pgstat.c               | 1250 +++++++++++--------------
 src/backend/replication/basebackup.c          |   36 -
 src/backend/storage/ipc/ipci.c                |    2 +
 src/backend/storage/lmgr/lwlock.c             |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/utils/misc/guc.c                  |   41 -
 src/backend/utils/misc/postgresql.conf.sample |    1 -
 src/bin/initdb/initdb.c                       |    1 -
 src/bin/pg_basebackup/t/010_pg_basebackup.pl  |    2 +-
 src/include/miscadmin.h                       |    1 +
 src/include/pgstat.h                          |   47 +-
 src/include/storage/lwlock.h                  |    3 +
 14 files changed, 562 insertions(+), 834 deletions(-)

diff --git a/src/backend/lib/dshash.c b/src/backend/lib/dshash.c
index adc2131..9d72f28 100644
--- a/src/backend/lib/dshash.c
+++ b/src/backend/lib/dshash.c
@@ -304,7 +304,7 @@ dshash_attach(dsa_area *area, const dshash_parameters *params,
 void
 dshash_detach(dshash_table *hash_table)
 {
-    Assert(!hash_table->find_locked);
+    Assert(!hash_table->find_locked || hash_table->is_snapshot);
 
     /* The hash table may have been destroyed.  Just free local memory. */
     pfree(hash_table);
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 48765bb..770a0ec 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2734,12 +2734,10 @@ get_pgstat_tabentry_relid(Oid relid, bool isshared, PgStat_StatDBEntry *shared,
     if (isshared)
     {
         if (PointerIsValid(shared))
-            tabentry = hash_search(shared->tables, &relid,
-                                   HASH_FIND, NULL);
+            tabentry = backend_get_tab_entry(shared, relid);
     }
     else if (PointerIsValid(dbentry))
-        tabentry = hash_search(dbentry->tables, &relid,
-                               HASH_FIND, NULL);
+        tabentry = backend_get_tab_entry(dbentry, relid);
 
     return tabentry;
 }
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 4ee9890..cccf4b6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -28,6 +28,7 @@
 #include <arpa/inet.h>
 #include <signal.h>
 #include <time.h>
+#include <utils/dsa.h>
 #ifdef HAVE_SYS_SELECT_H
 #include <sys/select.h>
 #endif
@@ -77,22 +78,10 @@
 #define PGSTAT_STAT_INTERVAL    500 /* Minimum time between stats file
                                      * updates; in milliseconds. */
 
-#define PGSTAT_RETRY_DELAY        10    /* How long to wait between checks for a
-                                     * new file; in milliseconds. */
-
-#define PGSTAT_MAX_WAIT_TIME    10000    /* Maximum time to wait for a stats
-                                         * file update; in milliseconds. */
-
-#define PGSTAT_INQ_INTERVAL        640 /* How often to ping the collector for a
-                                     * new file; in milliseconds. */
-
 #define PGSTAT_RESTART_INTERVAL 60    /* How often to attempt to restart a
                                      * failed statistics collector; in
                                      * seconds. */
 
-#define PGSTAT_POLL_LOOP_COUNT    (PGSTAT_MAX_WAIT_TIME / PGSTAT_RETRY_DELAY)
-#define PGSTAT_INQ_LOOP_COUNT    (PGSTAT_INQ_INTERVAL / PGSTAT_RETRY_DELAY)
-
 /* Minimum receive buffer size for the collector's socket. */
 #define PGSTAT_MIN_RCVBUF        (100 * 1024)
 
@@ -101,7 +90,6 @@
  * The initial size hints for the hash tables used in the collector.
  * ----------
  */
-#define PGSTAT_DB_HASH_SIZE        16
 #define PGSTAT_TAB_HASH_SIZE    512
 #define PGSTAT_FUNCTION_HASH_SIZE    512
 
@@ -131,7 +119,6 @@ int            pgstat_track_activity_query_size = 1024;
  * Built from GUC parameter
  * ----------
  */
-char       *pgstat_stat_directory = NULL;
 char       *pgstat_stat_filename = NULL;
 char       *pgstat_stat_tmpname = NULL;
 
@@ -154,6 +141,42 @@ static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
 
+/* Shared stats bootstrap infomation */
+typedef struct StatsShmemStruct {
+    dsa_handle stats_dsa_handle;
+    dshash_table_handle db_stats_handle;
+    dsa_pointer    global_stats;
+    dsa_pointer    archiver_stats;
+} StatsShmemStruct;
+
+static StatsShmemStruct * StatsShmem = NULL;
+static dsa_area *area = NULL;
+static dshash_table *db_stats;
+static dshash_table *local_db_stats;
+
+/* dshash parameter for each type of table */
+static const dshash_parameters dsh_dbparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatDBEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_DB
+};
+static const dshash_parameters dsh_tblparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatTabEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+static const dshash_parameters dsh_funcparams = {
+    sizeof(Oid),
+    sizeof(PgStat_StatFuncEntry),
+    dshash_memcmp,
+    dshash_memhash,
+    LWTRANCHE_STATS_FUNC_TABLE
+};
+
 /*
  * Structures in which backends store per-table info that's waiting to be
  * sent to the collector.
@@ -250,12 +273,16 @@ static LocalPgBackendStatus *localBackendStatusTable = NULL;
 static int    localNumBackends = 0;
 
 /*
- * Cluster wide statistics, kept in the stats collector.
- * Contains statistics that are not collected per database
- * or per table.
+ * Cluster wide statistics.
+ * Contains statistics that are not collected per database or per table.
+ * shared_* are the statistics maintained by pgstats and snapshot_* are the
+ * snapshot only taken on reader-side backends.
  */
-static PgStat_ArchiverStats archiverStats;
-static PgStat_GlobalStats globalStats;
+static PgStat_ArchiverStats *shared_archiverStats;
+static PgStat_ArchiverStats *snapshot_archiverStats;
+static PgStat_GlobalStats *shared_globalStats;
+static PgStat_GlobalStats *snapshot_globalStats;
+
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -285,24 +312,23 @@ static instr_time total_func_time;
 static pid_t pgstat_forkexec(void);
 #endif
 
+/* functions used in stats collector */
 static void pgstat_shutdown_handler(SIGNAL_ARGS);
 static void pgstat_quickdie_handler(SIGNAL_ARGS);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 static void pgstat_sighup_handler(SIGNAL_ARGS);
 
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
-static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
-                     Oid tableoid, bool create);
-static void pgstat_write_statsfiles(bool permanent, bool allDbs);
-static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent);
-static HTAB *pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep);
-static void pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash, bool permanent);
-static void backend_read_statsfile(void);
+static PgStat_StatTabEntry *pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create);
+static void pgstat_write_statsfiles(void);
+static void pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry);
+static void pgstat_read_statsfiles(void);
+static void pgstat_read_db_statsfile(Oid databaseid, dshash_table *tabhash, dshash_table *funchash);
+
+/* functions used in backends */
+static bool backend_take_stats_snapshot(void);
 static void pgstat_read_current_status(void);
 
-static bool pgstat_write_statsfile_needed(void);
-static bool pgstat_db_requested(Oid databaseid);
-
 static void pgstat_send_tabstat(PgStat_MsgTabstat *tsmsg);
 static void pgstat_send_funcstats(void);
 static HTAB *pgstat_collect_oids(Oid catalogid);
@@ -320,7 +346,6 @@ static const char *pgstat_get_wait_io(WaitEventIO w);
 static void pgstat_setheader(PgStat_MsgHdr *hdr, StatMsgType mtype);
 static void pgstat_send(void *msg, int len);
 
-static void pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len);
 static void pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len);
 static void pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len);
 static void pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len);
@@ -685,7 +710,6 @@ pgstat_reset_remove_files(const char *directory)
 void
 pgstat_reset_all(void)
 {
-    pgstat_reset_remove_files(pgstat_stat_directory);
     pgstat_reset_remove_files(PGSTAT_STAT_PERMANENT_DIRECTORY);
 }
 
@@ -1010,6 +1034,95 @@ pgstat_send_funcstats(void)
 
 
 /* ----------
+ * pgstat_attach_shared_stats() -
+ *
+ *    attach existing shared stats memory
+ * ----------
+ */
+static bool
+pgstat_attach_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    if (StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID || area != NULL)
+    {
+        LWLockRelease(StatsLock);
+        return area != NULL;
+    }
+
+    /* this lives till the end of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_attach(StatsShmem->stats_dsa_handle);
+    dsa_pin_mapping(area);
+    db_stats = dshash_attach(area, &dsh_dbparams,
+                             StatsShmem->db_stats_handle, 0);
+    local_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats =    (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+
+    return true;
+}
+
+/* ----------
+ * pgstat_create_shared_stats() -
+ *
+ *    create shared stats memory
+ * ----------
+ */
+static void
+pgstat_create_shared_stats(void)
+{
+    MemoryContext oldcontext;
+
+    LWLockAcquire(StatsLock, LW_EXCLUSIVE);
+    Assert(StatsShmem->stats_dsa_handle == DSM_HANDLE_INVALID);
+
+    /* this lives till the end of the process */
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    area = dsa_create(LWTRANCHE_STATS_DSA);
+    dsa_pin_mapping(area);
+
+    db_stats = dshash_create(area, &dsh_dbparams, 0);
+
+    /* create shared area and write bootstrap information */
+    StatsShmem->stats_dsa_handle = dsa_get_handle(area);
+    StatsShmem->global_stats =
+        dsa_allocate0(area, sizeof(PgStat_GlobalStats));
+    StatsShmem->archiver_stats =
+        dsa_allocate0(area, sizeof(PgStat_ArchiverStats));
+    StatsShmem->db_stats_handle =
+        dshash_get_hash_table_handle(db_stats);
+
+    /* locally connect to the memory */
+    local_db_stats = NULL;
+    shared_globalStats = (PgStat_GlobalStats *)
+        dsa_get_address(area, StatsShmem->global_stats);
+    shared_archiverStats = (PgStat_ArchiverStats *)
+        dsa_get_address(area, StatsShmem->archiver_stats);
+    MemoryContextSwitchTo(oldcontext);
+    LWLockRelease(StatsLock);
+}
+
+/* ----------
+ * backend_get_tab_entry() -
+ *
+ *    Find database stats entry on backends. This assumes that snapshot is
+ *    created.
+ * ----------
+ */
+PgStat_StatTabEntry *
+backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid)
+{
+    Assert(dbent->snapshot_tables);
+    return dshash_find(dbent->snapshot_tables, &relid, false);
+}
+
+/* ----------
  * pgstat_vacuum_stat() -
  *
  *    Will tell the collector about objects he can get rid of.
@@ -1021,7 +1134,7 @@ pgstat_vacuum_stat(void)
     HTAB       *htab;
     PgStat_MsgTabpurge msg;
     PgStat_MsgFuncpurge f_msg;
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
@@ -1030,11 +1143,8 @@ pgstat_vacuum_stat(void)
     if (pgStatSock == PGINVALID_SOCKET)
         return;
 
-    /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
-     */
-    backend_read_statsfile();
+    if (!backend_take_stats_snapshot())
+        return;
 
     /*
      * Read pg_database and make a list of OIDs of all existing databases
@@ -1045,8 +1155,8 @@ pgstat_vacuum_stat(void)
      * Search the database hash table for dead databases and tell the
      * collector to drop them.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, local_db_stats);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         Oid            dbid = dbentry->databaseid;
 
@@ -1064,11 +1174,15 @@ pgstat_vacuum_stat(void)
     /*
      * Lookup our own database entry; if not found, nothing more to do.
      */
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
+    dbentry = (PgStat_StatDBEntry *) dshash_find(local_db_stats,
                                                  (void *) &MyDatabaseId,
-                                                 HASH_FIND, NULL);
-    if (dbentry == NULL || dbentry->tables == NULL)
+                                                 false);
+    if (dbentry == NULL || dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
     /*
      * Similarly to above, make a list of all known relations in this DB.
@@ -1083,8 +1197,8 @@ pgstat_vacuum_stat(void)
     /*
      * Check for all tables listed in stats hashtable if they still exist.
      */
-    hash_seq_init(&hstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, dbentry->snapshot_tables);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         Oid            tabid = tabentry->tableid;
 
@@ -1134,8 +1248,8 @@ pgstat_vacuum_stat(void)
      * Now repeat the above steps for functions.  However, we needn't bother
      * in the common case where no function stats are being collected.
      */
-    if (dbentry->functions != NULL &&
-        hash_get_num_entries(dbentry->functions) > 0)
+    if (dbentry->snapshot_functions != NULL &&
+        dshash_get_num_entries(dbentry->snapshot_functions) > 0)
     {
         htab = pgstat_collect_oids(ProcedureRelationId);
 
@@ -1143,8 +1257,8 @@ pgstat_vacuum_stat(void)
         f_msg.m_databaseid = MyDatabaseId;
         f_msg.m_nentries = 0;
 
-        hash_seq_init(&hstat, dbentry->functions);
-        while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&hstat)) != NULL)
+        dshash_seq_init(&hstat, dbentry->snapshot_functions);
+        while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&hstat)) != NULL)
         {
             Oid            funcid = funcentry->functionid;
 
@@ -1551,24 +1665,6 @@ pgstat_ping(void)
     pgstat_send(&msg, sizeof(msg));
 }
 
-/* ----------
- * pgstat_send_inquiry() -
- *
- *    Notify collector that we need fresh data.
- * ----------
- */
-static void
-pgstat_send_inquiry(TimestampTz clock_time, TimestampTz cutoff_time, Oid databaseid)
-{
-    PgStat_MsgInquiry msg;
-
-    pgstat_setheader(&msg.m_hdr, PGSTAT_MTYPE_INQUIRY);
-    msg.clock_time = clock_time;
-    msg.cutoff_time = cutoff_time;
-    msg.databaseid = databaseid;
-    pgstat_send(&msg, sizeof(msg));
-}
-
 
 /*
  * Initialize function call usage data.
@@ -2384,17 +2480,16 @@ PgStat_StatDBEntry *
 pgstat_fetch_stat_dbentry(Oid dbid)
 {
     /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
+     * If not done for this transaction, take a stats snapshot
      */
-    backend_read_statsfile();
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
     /*
      * Lookup the requested database; return NULL if not found
      */
-    return (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                              (void *) &dbid,
-                                              HASH_FIND, NULL);
+    return (PgStat_StatDBEntry *) dshash_find(local_db_stats,
+                                              (void *) &dbid, false);
 }
 
 
@@ -2415,23 +2510,22 @@ pgstat_fetch_stat_tabentry(Oid relid)
     PgStat_StatTabEntry *tabentry;
 
     /*
-     * If not done for this transaction, read the statistics collector stats
-     * file into some hash tables.
+     * If not done for this transaction, take a stats snapshot
      */
-    backend_read_statsfile();
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
     /*
      * Lookup our database, then look in its table hash table.
      */
     dbid = MyDatabaseId;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
+    dbentry =
+        (PgStat_StatDBEntry *) dshash_find(local_db_stats, (void *)&dbid, false);
+    if (dbentry)
     {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
+        tabentry = (PgStat_StatTabEntry *)
+            dshash_find(dbentry->snapshot_tables, (void *)&relid, false);
+
         if (tabentry)
             return tabentry;
     }
@@ -2440,14 +2534,13 @@ pgstat_fetch_stat_tabentry(Oid relid)
      * If we didn't find it, maybe it's a shared table.
      */
     dbid = InvalidOid;
-    dbentry = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                 (void *) &dbid,
-                                                 HASH_FIND, NULL);
-    if (dbentry != NULL && dbentry->tables != NULL)
+    dbentry = (PgStat_StatDBEntry *) dshash_find(local_db_stats,
+                                                 (void *) &dbid, false);
+    if (dbentry != NULL && dbentry->snapshot_tables != NULL)
     {
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &relid,
-                                                       HASH_FIND, NULL);
+        tabentry = (PgStat_StatTabEntry *)
+            dshash_find(dbentry->snapshot_tables, (void *) &relid, false);
+
         if (tabentry)
             return tabentry;
     }
@@ -2469,18 +2562,19 @@ pgstat_fetch_stat_funcentry(Oid func_id)
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry = NULL;
 
-    /* load the stats file if needed */
-    backend_read_statsfile();
+    /*
+     * If not done for this transaction, take a stats snapshot
+     */
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
-    /* Lookup our database, then find the requested function.  */
+    /*
+     * Lookup our database, then find the requested function
+     */
     dbentry = pgstat_fetch_stat_dbentry(MyDatabaseId);
-    if (dbentry != NULL && dbentry->functions != NULL)
-    {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &func_id,
-                                                         HASH_FIND, NULL);
-    }
-
+    if (dbentry != NULL && !dbentry->snapshot_functions)
+        funcentry = dshash_find(dbentry->snapshot_functions,
+                                  (void *) &func_id, false);
     return funcentry;
 }
 
@@ -2555,9 +2649,10 @@ pgstat_fetch_stat_numbackends(void)
 PgStat_ArchiverStats *
 pgstat_fetch_stat_archiver(void)
 {
-    backend_read_statsfile();
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
-    return &archiverStats;
+    return snapshot_archiverStats;
 }
 
 
@@ -2572,9 +2667,10 @@ pgstat_fetch_stat_archiver(void)
 PgStat_GlobalStats *
 pgstat_fetch_global(void)
 {
-    backend_read_statsfile();
+    if (!backend_take_stats_snapshot())
+        return NULL;
 
-    return &globalStats;
+    return snapshot_globalStats;
 }
 
 
@@ -4222,18 +4318,14 @@ PgstatCollectorMain(void)
     pqsignal(SIGTTOU, SIG_DFL);
     pqsignal(SIGCONT, SIG_DFL);
     pqsignal(SIGWINCH, SIG_DFL);
-    PG_SETMASK(&UnBlockSig);
 
-    /*
-     * Identify myself via ps
-     */
-    init_ps_display("stats collector", "", "", "");
+    PG_SETMASK(&UnBlockSig);
 
     /*
      * Read in existing stats files or initialize the stats to zero.
      */
     pgStatRunningInCollector = true;
-    pgStatDBHash = pgstat_read_statsfiles(InvalidOid, true, true);
+    pgstat_read_statsfiles();
 
     /*
      * Loop to process messages until we get SIGQUIT or detect ungraceful
@@ -4276,13 +4368,6 @@ PgstatCollectorMain(void)
             }
 
             /*
-             * Write the stats file(s) if a new request has arrived that is
-             * not satisfied by existing file(s).
-             */
-            if (pgstat_write_statsfile_needed())
-                pgstat_write_statsfiles(false, false);
-
-            /*
              * Try to receive and process a message.  This will not block,
              * since the socket is set to non-blocking mode.
              *
@@ -4330,10 +4415,6 @@ PgstatCollectorMain(void)
                 case PGSTAT_MTYPE_DUMMY:
                     break;
 
-                case PGSTAT_MTYPE_INQUIRY:
-                    pgstat_recv_inquiry((PgStat_MsgInquiry *) &msg, len);
-                    break;
-
                 case PGSTAT_MTYPE_TABSTAT:
                     pgstat_recv_tabstat((PgStat_MsgTabstat *) &msg, len);
                     break;
@@ -4424,7 +4505,7 @@ PgstatCollectorMain(void)
          * happening there, this is the best we can do.  The two-second
          * timeout matches our pre-9.2 behavior, and needs to be short enough
          * to not provoke "using stale statistics" complaints from
-         * backend_read_statsfile.
+         * backend_take_stats_snapshot.
          */
         wr = WaitLatchOrSocket(MyLatch,
                                WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_SOCKET_READABLE | WL_TIMEOUT,
@@ -4444,7 +4525,7 @@ PgstatCollectorMain(void)
     /*
      * Save the final stats to reuse at next startup.
      */
-    pgstat_write_statsfiles(true, true);
+    pgstat_write_statsfiles();
 
     exit(0);
 }
@@ -4494,14 +4575,14 @@ pgstat_shutdown_handler(SIGNAL_ARGS)
 }
 
 /*
- * Subroutine to clear stats in a database entry
+ * Subroutine to reset stats in a shared database entry
  *
  * Tables and functions hashes are initialized to empty.
  */
 static void
 reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
 {
-    HASHCTL        hash_ctl;
+    dshash_table *tbl;
 
     dbentry->n_xact_commit = 0;
     dbentry->n_xact_rollback = 0;
@@ -4527,20 +4608,14 @@ reset_dbentry_counters(PgStat_StatDBEntry *dbentry)
     dbentry->stat_reset_timestamp = GetCurrentTimestamp();
     dbentry->stats_timestamp = 0;
 
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-    dbentry->tables = hash_create("Per-database table",
-                                  PGSTAT_TAB_HASH_SIZE,
-                                  &hash_ctl,
-                                  HASH_ELEM | HASH_BLOBS);
 
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-    dbentry->functions = hash_create("Per-database function",
-                                     PGSTAT_FUNCTION_HASH_SIZE,
-                                     &hash_ctl,
-                                     HASH_ELEM | HASH_BLOBS);
+    tbl = dshash_create(area, &dsh_tblparams, 0);
+    dbentry->tables = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
+
+    tbl = dshash_create(area, &dsh_funcparams, 0);
+    dbentry->functions = dshash_get_hash_table_handle(tbl);
+    dshash_detach(tbl);
 }
 
 /*
@@ -4553,15 +4628,18 @@ pgstat_get_db_entry(Oid databaseid, bool create)
 {
     PgStat_StatDBEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
+
+    Assert(pgStatRunningInCollector);
 
     /* Lookup or create the hash table entry for this database */
-    result = (PgStat_StatDBEntry *) hash_search(pgStatDBHash,
-                                                &databaseid,
-                                                action, &found);
+    if (create)
+        result = (PgStat_StatDBEntry *)
+            dshash_find_or_insert(db_stats,    &databaseid, &found);
+    else
+        result = (PgStat_StatDBEntry *)    dshash_find(db_stats, &databaseid, true);
 
-    if (!create && !found)
-        return NULL;
+    if (!create)
+        return result;
 
     /*
      * If not found, initialize the new one.  This creates empty hash tables
@@ -4573,23 +4651,23 @@ pgstat_get_db_entry(Oid databaseid, bool create)
     return result;
 }
 
-
 /*
  * Lookup the hash table entry for the specified table. If no hash
  * table entry exists, initialize it, if the create parameter is true.
  * Else, return NULL.
  */
 static PgStat_StatTabEntry *
-pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
+pgstat_get_tab_entry(dshash_table *table, Oid tableoid, bool create)
 {
     PgStat_StatTabEntry *result;
     bool        found;
-    HASHACTION    action = (create ? HASH_ENTER : HASH_FIND);
 
     /* Lookup or create the hash table entry for this table */
-    result = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                 &tableoid,
-                                                 action, &found);
+    if (create)
+        result = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(table, &tableoid, &found);
+    else
+        result = (PgStat_StatTabEntry *) dshash_find(table, &tableoid, false);
 
     if (!create && !found)
         return NULL;
@@ -4638,14 +4716,14 @@ pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry, Oid tableoid, bool create)
  * ----------
  */
 static void
-pgstat_write_statsfiles(bool permanent, bool allDbs)
+pgstat_write_statsfiles(void)
 {
-    HASH_SEQ_STATUS hstat;
+    dshash_seq_status hstat;
     PgStat_StatDBEntry *dbentry;
     FILE       *fpout;
     int32        format_id;
-    const char *tmpfile = permanent ? PGSTAT_STAT_PERMANENT_TMPFILE : pgstat_stat_tmpname;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *tmpfile = PGSTAT_STAT_PERMANENT_TMPFILE;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
     int            rc;
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
@@ -4666,7 +4744,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Set the timestamp of the stats file.
      */
-    globalStats.stats_timestamp = GetCurrentTimestamp();
+    shared_globalStats->stats_timestamp = GetCurrentTimestamp();
 
     /*
      * Write the file header --- currently just a format ID.
@@ -4678,32 +4756,29 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
     /*
      * Write global stats struct
      */
-    rc = fwrite(&globalStats, sizeof(globalStats), 1, fpout);
+    rc = fwrite(shared_globalStats, sizeof(shared_globalStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Write archiver stats struct
      */
-    rc = fwrite(&archiverStats, sizeof(archiverStats), 1, fpout);
+    rc = fwrite(shared_archiverStats, sizeof(shared_archiverStats), 1, fpout);
     (void) rc;                    /* we'll check for error with ferror */
 
     /*
      * Walk through the database table.
      */
-    hash_seq_init(&hstat, pgStatDBHash);
-    while ((dbentry = (PgStat_StatDBEntry *) hash_seq_search(&hstat)) != NULL)
+    dshash_seq_init(&hstat, db_stats);
+    while ((dbentry = (PgStat_StatDBEntry *) dshash_seq_next(&hstat)) != NULL)
     {
         /*
          * Write out the table and function stats for this DB into the
          * appropriate per-DB stat file, if required.
          */
-        if (allDbs || pgstat_db_requested(dbentry->databaseid))
-        {
-            /* Make DB's timestamp consistent with the global stats */
-            dbentry->stats_timestamp = globalStats.stats_timestamp;
+        /* Make DB's timestamp consistent with the global stats */
+        dbentry->stats_timestamp = shared_globalStats->stats_timestamp;
 
-            pgstat_write_db_statsfile(dbentry, permanent);
-        }
+        pgstat_write_db_statsfile(dbentry);
 
         /*
          * Write out the DB entry. We don't write the tables or functions
@@ -4747,8 +4822,7 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
         unlink(tmpfile);
     }
 
-    if (permanent)
-        unlink(pgstat_stat_filename);
+    unlink(pgstat_stat_filename);
 
     /*
      * Now throw away the list of requests.  Note that requests sent after we
@@ -4763,15 +4837,14 @@ pgstat_write_statsfiles(bool permanent, bool allDbs)
  * of length len.
  */
 static void
-get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
+get_dbstat_filename(bool tempname, Oid databaseid,
                     char *filename, int len)
 {
     int            printed;
 
     /* NB -- pgstat_reset_remove_files knows about the pattern this uses */
     printed = snprintf(filename, len, "%s/db_%u.%s",
-                       permanent ? PGSTAT_STAT_PERMANENT_DIRECTORY :
-                       pgstat_stat_directory,
+                       PGSTAT_STAT_PERMANENT_DIRECTORY,
                        databaseid,
                        tempname ? "tmp" : "stat");
     if (printed > len)
@@ -4789,10 +4862,10 @@ get_dbstat_filename(bool permanent, bool tempname, Oid databaseid,
  * ----------
  */
 static void
-pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
+pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry)
 {
-    HASH_SEQ_STATUS tstat;
-    HASH_SEQ_STATUS fstat;
+    dshash_seq_status tstat;
+    dshash_seq_status fstat;
     PgStat_StatTabEntry *tabentry;
     PgStat_StatFuncEntry *funcentry;
     FILE       *fpout;
@@ -4801,9 +4874,10 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     int            rc;
     char        tmpfile[MAXPGPATH];
     char        statfile[MAXPGPATH];
+    dshash_table *tbl;
 
-    get_dbstat_filename(permanent, true, dbid, tmpfile, MAXPGPATH);
-    get_dbstat_filename(permanent, false, dbid, statfile, MAXPGPATH);
+    get_dbstat_filename(true, dbid, tmpfile, MAXPGPATH);
+    get_dbstat_filename(false, dbid, statfile, MAXPGPATH);
 
     elog(DEBUG2, "writing stats file \"%s\"", statfile);
 
@@ -4830,24 +4904,28 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
     /*
      * Walk through the database's access stats per table.
      */
-    hash_seq_init(&tstat, dbentry->tables);
-    while ((tabentry = (PgStat_StatTabEntry *) hash_seq_search(&tstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    dshash_seq_init(&tstat, tbl);
+    while ((tabentry = (PgStat_StatTabEntry *) dshash_seq_next(&tstat)) != NULL)
     {
         fputc('T', fpout);
         rc = fwrite(tabentry, sizeof(PgStat_StatTabEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * Walk through the database's function stats table.
      */
-    hash_seq_init(&fstat, dbentry->functions);
-    while ((funcentry = (PgStat_StatFuncEntry *) hash_seq_search(&fstat)) != NULL)
+    tbl = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+    dshash_seq_init(&fstat, tbl);
+    while ((funcentry = (PgStat_StatFuncEntry *) dshash_seq_next(&fstat)) != NULL)
     {
         fputc('F', fpout);
         rc = fwrite(funcentry, sizeof(PgStat_StatFuncEntry), 1, fpout);
         (void) rc;                /* we'll check for error with ferror */
     }
+    dshash_detach(tbl);
 
     /*
      * No more output to be done. Close the temp file and replace the old
@@ -4881,14 +4959,6 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
                         tmpfile, statfile)));
         unlink(tmpfile);
     }
-
-    if (permanent)
-    {
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing temporary stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
 }
 
 /* ----------
@@ -4911,46 +4981,35 @@ pgstat_write_db_statsfile(PgStat_StatDBEntry *dbentry, bool permanent)
  *    the table/function hash tables remain empty.
  * ----------
  */
-static HTAB *
-pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
+static void
+pgstat_read_statsfiles(void)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatDBEntry dbbuf;
-    HASHCTL        hash_ctl;
-    HTAB       *dbhash;
     FILE       *fpin;
     int32        format_id;
     bool        found;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
+    const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
+    dshash_table *tblstats = NULL;
+    dshash_table *funcstats = NULL;
 
+    Assert(pgStatRunningInCollector);
     /*
      * The tables will live in pgStatLocalContext.
      */
     pgstat_setup_memcxt();
 
     /*
-     * Create the DB hashtable
+     * Create the DB hashtable and global stas area
      */
-    memset(&hash_ctl, 0, sizeof(hash_ctl));
-    hash_ctl.keysize = sizeof(Oid);
-    hash_ctl.entrysize = sizeof(PgStat_StatDBEntry);
-    hash_ctl.hcxt = pgStatLocalContext;
-    dbhash = hash_create("Databases hash", PGSTAT_DB_HASH_SIZE, &hash_ctl,
-                         HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-    /*
-     * Clear out global and archiver statistics so they start from zero in
-     * case we can't load an existing statsfile.
-     */
-    memset(&globalStats, 0, sizeof(globalStats));
-    memset(&archiverStats, 0, sizeof(archiverStats));
+    pgstat_create_shared_stats();
 
     /*
      * Set the current timestamp (will be kept only in case we can't load an
      * existing statsfile).
      */
-    globalStats.stat_reset_timestamp = GetCurrentTimestamp();
-    archiverStats.stat_reset_timestamp = globalStats.stat_reset_timestamp;
+    shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
+    shared_archiverStats->stat_reset_timestamp = shared_globalStats->stat_reset_timestamp;
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -4968,7 +5027,7 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                     (errcode_for_file_access(),
                      errmsg("could not open statistics file \"%s\": %m",
                             statfile)));
-        return dbhash;
+        return;
     }
 
     /*
@@ -4985,11 +5044,11 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
     /*
      * Read global stats struct
      */
-    if (fread(&globalStats, 1, sizeof(globalStats), fpin) != sizeof(globalStats))
+    if (fread(shared_globalStats, 1, sizeof(shared_globalStats), fpin) != sizeof(shared_globalStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&globalStats, 0, sizeof(globalStats));
+        memset(shared_globalStats, 0, sizeof(*shared_globalStats));
         goto done;
     }
 
@@ -5000,17 +5059,16 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
      * file's timestamp is less than PGSTAT_STAT_INTERVAL ago, but that's not
      * an unusual scenario.
      */
-    if (pgStatRunningInCollector)
-        globalStats.stats_timestamp = 0;
+    shared_globalStats->stats_timestamp = 0;
 
     /*
      * Read archiver stats struct
      */
-    if (fread(&archiverStats, 1, sizeof(archiverStats), fpin) != sizeof(archiverStats))
+    if (fread(shared_archiverStats, 1, sizeof(shared_archiverStats), fpin) != sizeof(shared_archiverStats))
     {
         ereport(pgStatRunningInCollector ? LOG : WARNING,
                 (errmsg("corrupted statistics file \"%s\"", statfile)));
-        memset(&archiverStats, 0, sizeof(archiverStats));
+        memset(shared_archiverStats, 0, sizeof(*shared_archiverStats));
         goto done;
     }
 
@@ -5039,12 +5097,12 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 /*
                  * Add to the DB hash
                  */
-                dbentry = (PgStat_StatDBEntry *) hash_search(dbhash,
-                                                             (void *) &dbbuf.databaseid,
-                                                             HASH_ENTER,
-                                                             &found);
+                dbentry = (PgStat_StatDBEntry *)
+                    dshash_find_or_insert(db_stats, (void *) &dbbuf.databaseid,
+                                          &found);
                 if (found)
                 {
+                    dshash_release_lock(db_stats, dbentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5052,8 +5110,8 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                 }
 
                 memcpy(dbentry, &dbbuf, sizeof(PgStat_StatDBEntry));
-                dbentry->tables = NULL;
-                dbentry->functions = NULL;
+                dbentry->tables = DSM_HANDLE_INVALID;
+                dbentry->functions = DSM_HANDLE_INVALID;
 
                 /*
                  * In the collector, disregard the timestamp we read from the
@@ -5061,47 +5119,23 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
                  * stats file immediately upon the first request from any
                  * backend.
                  */
-                if (pgStatRunningInCollector)
-                    dbentry->stats_timestamp = 0;
-
-                /*
-                 * Don't create tables/functions hashtables for uninteresting
-                 * databases.
-                 */
-                if (onlydb != InvalidOid)
-                {
-                    if (dbbuf.databaseid != onlydb &&
-                        dbbuf.databaseid != InvalidOid)
-                        break;
-                }
-
-                memset(&hash_ctl, 0, sizeof(hash_ctl));
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatTabEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->tables = hash_create("Per-database table",
-                                              PGSTAT_TAB_HASH_SIZE,
-                                              &hash_ctl,
-                                              HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
-                hash_ctl.keysize = sizeof(Oid);
-                hash_ctl.entrysize = sizeof(PgStat_StatFuncEntry);
-                hash_ctl.hcxt = pgStatLocalContext;
-                dbentry->functions = hash_create("Per-database function",
-                                                 PGSTAT_FUNCTION_HASH_SIZE,
-                                                 &hash_ctl,
-                                                 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+                Assert(pgStatRunningInCollector);
+                dbentry->stats_timestamp = 0;
 
                 /*
                  * If requested, read the data from the database-specific
                  * file.  Otherwise we just leave the hashtables empty.
                  */
-                if (deep)
-                    pgstat_read_db_statsfile(dbentry->databaseid,
-                                             dbentry->tables,
-                                             dbentry->functions,
-                                             permanent);
-
+                tblstats = dshash_create(area, &dsh_tblparams, 0);
+                dbentry->tables = dshash_get_hash_table_handle(tblstats);
+                funcstats = dshash_create(area, &dsh_funcparams, 0);
+                dbentry->functions =
+                    dshash_get_hash_table_handle(funcstats);
+                dshash_release_lock(db_stats, dbentry);
+                pgstat_read_db_statsfile(dbentry->databaseid,
+                                         tblstats, funcstats);
+                dshash_detach(tblstats);
+                dshash_detach(funcstats);
                 break;
 
             case 'E':
@@ -5118,34 +5152,47 @@ pgstat_read_statsfiles(Oid onlydb, bool permanent, bool deep)
 done:
     FreeFile(fpin);
 
-    /* If requested to read the permanent file, also get rid of it. */
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 
-    return dbhash;
+    return;
 }
 
 
+Size
+StatsShmemSize(void)
+{
+    return sizeof(dsa_handle);
+}
+
+void
+StatsShmemInit(void)
+{
+    bool    found;
+
+    StatsShmem = (StatsShmemStruct *)
+        ShmemInitStruct("Stats area", StatsShmemSize(),
+                        &found);
+    if (!IsUnderPostmaster)
+    {
+        Assert(!found);
+
+        StatsShmem->stats_dsa_handle = DSM_HANDLE_INVALID;
+    }
+    else
+        Assert(found);
+}
+
 /* ----------
  * pgstat_read_db_statsfile() -
  *
- *    Reads in the existing statistics collector file for the given database,
- *    filling the passed-in tables and functions hash tables.
- *
- *    As in pgstat_read_statsfiles, if the permanent file is requested, it is
- *    removed after reading.
- *
- *    Note: this code has the ability to skip storing per-table or per-function
- *    data, if NULL is passed for the corresponding hashtable.  That's not used
- *    at the moment though.
+ *    Reads in the permanent statistics collector file and create shared
+ *    statistics tables. The file is removed afer reading.
  * ----------
  */
 static void
-pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
-                         bool permanent)
+pgstat_read_db_statsfile(Oid databaseid,
+                         dshash_table *tabhash, dshash_table *funchash)
 {
     PgStat_StatTabEntry *tabentry;
     PgStat_StatTabEntry tabbuf;
@@ -5156,7 +5203,8 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
     bool        found;
     char        statfile[MAXPGPATH];
 
-    get_dbstat_filename(permanent, false, databaseid, statfile, MAXPGPATH);
+    Assert(pgStatRunningInCollector);
+    get_dbstat_filename(false, databaseid, statfile, MAXPGPATH);
 
     /*
      * Try to open the stats file. If it doesn't exist, the backends simply
@@ -5215,12 +5263,13 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (tabhash == NULL)
                     break;
 
-                tabentry = (PgStat_StatTabEntry *) hash_search(tabhash,
-                                                               (void *) &tabbuf.tableid,
-                                                               HASH_ENTER, &found);
+                tabentry = (PgStat_StatTabEntry *)
+                    dshash_find_or_insert(tabhash,
+                                          (void *) &tabbuf.tableid, &found);
 
                 if (found)
                 {
+                    dshash_release_lock(tabhash, tabentry);
                     ereport(pgStatRunningInCollector ? LOG : WARNING,
                             (errmsg("corrupted statistics file \"%s\"",
                                     statfile)));
@@ -5228,6 +5277,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(tabentry, &tabbuf, sizeof(tabbuf));
+                dshash_release_lock(tabhash, tabentry);
                 break;
 
                 /*
@@ -5249,9 +5299,9 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 if (funchash == NULL)
                     break;
 
-                funcentry = (PgStat_StatFuncEntry *) hash_search(funchash,
-                                                                 (void *) &funcbuf.functionid,
-                                                                 HASH_ENTER, &found);
+                funcentry = (PgStat_StatFuncEntry *)
+                    dshash_find_or_insert(funchash,
+                                          (void *) &funcbuf.functionid, &found);
 
                 if (found)
                 {
@@ -5262,6 +5312,7 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
                 }
 
                 memcpy(funcentry, &funcbuf, sizeof(funcbuf));
+                dshash_release_lock(funchash, funcentry);
                 break;
 
                 /*
@@ -5281,142 +5332,50 @@ pgstat_read_db_statsfile(Oid databaseid, HTAB *tabhash, HTAB *funchash,
 done:
     FreeFile(fpin);
 
-    if (permanent)
-    {
-        elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-        unlink(statfile);
-    }
+    elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
+    unlink(statfile);
 }
 
 /* ----------
- * pgstat_read_db_statsfile_timestamp() -
- *
- *    Attempt to determine the timestamp of the last db statfile write.
- *    Returns true if successful; the timestamp is stored in *ts.
+ * backend_clean_snapshot_callback() -
  *
- *    This needs to be careful about handling databases for which no stats file
- *    exists, such as databases without a stat entry or those not yet written:
- *
- *    - if there's a database entry in the global file, return the corresponding
- *    stats_timestamp value.
- *
- *    - if there's no db stat entry (e.g. for a new or inactive database),
- *    there's no stats_timestamp value, but also nothing to write so we return
- *    the timestamp of the global statfile.
+ *    This is usually called with arg = NULL when the memory context where the
+ *  current snapshot has been taken. Don't bother to release memory for the
+ *  case.
  * ----------
  */
-static bool
-pgstat_read_db_statsfile_timestamp(Oid databaseid, bool permanent,
-                                   TimestampTz *ts)
+static void
+backend_clean_snapshot_callback(void *arg)
 {
-    PgStat_StatDBEntry dbentry;
-    PgStat_GlobalStats myGlobalStats;
-    PgStat_ArchiverStats myArchiverStats;
-    FILE       *fpin;
-    int32        format_id;
-    const char *statfile = permanent ? PGSTAT_STAT_PERMANENT_FILENAME : pgstat_stat_filename;
-
-    /*
-     * Try to open the stats file.  As above, anything but ENOENT is worthy of
-     * complaining about.
-     */
-    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
-    {
-        if (errno != ENOENT)
-            ereport(pgStatRunningInCollector ? LOG : WARNING,
-                    (errcode_for_file_access(),
-                     errmsg("could not open statistics file \"%s\": %m",
-                            statfile)));
-        return false;
-    }
-
-    /*
-     * Verify it's of the expected format.
-     */
-    if (fread(&format_id, 1, sizeof(format_id), fpin) != sizeof(format_id) ||
-        format_id != PGSTAT_FILE_FORMAT_ID)
+    if (arg != NULL)
     {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        /* explicitly called, so explicitly free resources */
+        if (snapshot_globalStats)
+            pfree(snapshot_globalStats);
 
-    /*
-     * Read global stats struct
-     */
-    if (fread(&myGlobalStats, 1, sizeof(myGlobalStats),
-              fpin) != sizeof(myGlobalStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
+        if (snapshot_archiverStats)
+            pfree(snapshot_archiverStats);
 
-    /*
-     * Read archiver stats struct
-     */
-    if (fread(&myArchiverStats, 1, sizeof(myArchiverStats),
-              fpin) != sizeof(myArchiverStats))
-    {
-        ereport(pgStatRunningInCollector ? LOG : WARNING,
-                (errmsg("corrupted statistics file \"%s\"", statfile)));
-        FreeFile(fpin);
-        return false;
-    }
-
-    /* By default, we're going to return the timestamp of the global file. */
-    *ts = myGlobalStats.stats_timestamp;
-
-    /*
-     * We found an existing collector stats file.  Read it and look for a
-     * record for the requested database.  If found, use its timestamp.
-     */
-    for (;;)
-    {
-        switch (fgetc(fpin))
+        if (local_db_stats)
         {
-                /*
-                 * 'D'    A PgStat_StatDBEntry struct describing a database
-                 * follows.
-                 */
-            case 'D':
-                if (fread(&dbentry, 1, offsetof(PgStat_StatDBEntry, tables),
-                          fpin) != offsetof(PgStat_StatDBEntry, tables))
-                {
-                    ereport(pgStatRunningInCollector ? LOG : WARNING,
-                            (errmsg("corrupted statistics file \"%s\"",
-                                    statfile)));
-                    goto done;
-                }
+            dshash_seq_status seq;
+            PgStat_StatDBEntry *dbent;
 
-                /*
-                 * If this is the DB we're looking for, save its timestamp and
-                 * we're done.
-                 */
-                if (dbentry.databaseid == databaseid)
-                {
-                    *ts = dbentry.stats_timestamp;
-                    goto done;
-                }
-
-                break;
-
-            case 'E':
-                goto done;
-
-            default:
-                ereport(pgStatRunningInCollector ? LOG : WARNING,
-                        (errmsg("corrupted statistics file \"%s\"",
-                                statfile)));
-                goto done;
+            dshash_seq_init(&seq, local_db_stats);
+            while ((dbent = dshash_seq_next(&seq)) != NULL)
+            {
+                if (dbent->snapshot_tables)
+                    dshash_detach(dbent->snapshot_tables);
+                if (dbent->snapshot_functions)
+                    dshash_detach(dbent->snapshot_functions);
+            }
+            dshash_destroy(local_db_stats);
         }
     }
 
-done:
-    FreeFile(fpin);
-    return true;
+    snapshot_globalStats = NULL;
+    snapshot_archiverStats = NULL;
+    local_db_stats = NULL;
 }
 
 /*
@@ -5424,131 +5383,77 @@ done:
  * some hash tables.  The results will be kept until pgstat_clear_snapshot()
  * is called (typically, at end of transaction).
  */
-static void
-backend_read_statsfile(void)
+static bool
+backend_take_stats_snapshot(void)
 {
-    TimestampTz min_ts = 0;
-    TimestampTz ref_ts = 0;
-    Oid            inquiry_db;
-    int            count;
+    PgStat_StatDBEntry  *dbent;
+    dsa_area            *new_area;
+    dshash_seq_status seq;
+    MemoryContext oldcontext;
+    MemoryContextCallback *mcxt_cb;
 
-    /* already read it? */
-    if (pgStatDBHash)
-        return;
     Assert(!pgStatRunningInCollector);
 
-    /*
-     * In a normal backend, we check staleness of the data for our own DB, and
-     * so we send MyDatabaseId in inquiry messages.  In the autovac launcher,
-     * check staleness of the shared-catalog data, and send InvalidOid in
-     * inquiry messages so as not to force writing unnecessary data.
-     */
-    if (IsAutoVacuumLauncherProcess())
-        inquiry_db = InvalidOid;
-    else
-        inquiry_db = MyDatabaseId;
-
-    /*
-     * Loop until fresh enough stats file is available or we ran out of time.
-     * The stats inquiry message is sent repeatedly in case collector drops
-     * it; but not every single time, as that just swamps the collector.
-     */
-    for (count = 0; count < PGSTAT_POLL_LOOP_COUNT; count++)
+    oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+    if (!pgstat_attach_shared_stats())
     {
-        bool        ok;
-        TimestampTz file_ts = 0;
-        TimestampTz cur_ts;
-
-        CHECK_FOR_INTERRUPTS();
-
-        ok = pgstat_read_db_statsfile_timestamp(inquiry_db, false, &file_ts);
+        MemoryContextSwitchTo(oldcontext);
+        return false;
+    }
+    MemoryContextSwitchTo(oldcontext);
 
-        cur_ts = GetCurrentTimestamp();
-        /* Calculate min acceptable timestamp, if we didn't already */
-        if (count == 0 || cur_ts < ref_ts)
-        {
-            /*
-             * We set the minimum acceptable timestamp to PGSTAT_STAT_INTERVAL
-             * msec before now.  This indirectly ensures that the collector
-             * needn't write the file more often than PGSTAT_STAT_INTERVAL. In
-             * an autovacuum worker, however, we want a lower delay to avoid
-             * using stale data, so we use PGSTAT_RETRY_DELAY (since the
-             * number of workers is low, this shouldn't be a problem).
-             *
-             * We don't recompute min_ts after sleeping, except in the
-             * unlikely case that cur_ts went backwards.  So we might end up
-             * accepting a file a bit older than PGSTAT_STAT_INTERVAL.  In
-             * practice that shouldn't happen, though, as long as the sleep
-             * time is less than PGSTAT_STAT_INTERVAL; and we don't want to
-             * tell the collector that our cutoff time is less than what we'd
-             * actually accept.
-             */
-            ref_ts = cur_ts;
-            if (IsAutoVacuumWorkerProcess())
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_RETRY_DELAY);
-            else
-                min_ts = TimestampTzPlusMilliseconds(ref_ts,
-                                                     -PGSTAT_STAT_INTERVAL);
-        }
+    if (snapshot_globalStats)
+        return true;
 
-        /*
-         * If the file timestamp is actually newer than cur_ts, we must have
-         * had a clock glitch (system time went backwards) or there is clock
-         * skew between our processor and the stats collector's processor.
-         * Accept the file, but send an inquiry message anyway to make
-         * pgstat_recv_inquiry do a sanity check on the collector's time.
-         */
-        if (ok && file_ts > cur_ts)
-        {
-            /*
-             * A small amount of clock skew between processors isn't terribly
-             * surprising, but a large difference is worth logging.  We
-             * arbitrarily define "large" as 1000 msec.
-             */
-            if (file_ts >= TimestampTzPlusMilliseconds(cur_ts, 1000))
-            {
-                char       *filetime;
-                char       *mytime;
-
-                /* Copy because timestamptz_to_str returns a static buffer */
-                filetime = pstrdup(timestamptz_to_str(file_ts));
-                mytime = pstrdup(timestamptz_to_str(cur_ts));
-                elog(LOG, "stats collector's time %s is later than backend local time %s",
-                     filetime, mytime);
-                pfree(filetime);
-                pfree(mytime);
-            }
+    Assert(snapshot_archiverStats == NULL);
+    Assert(local_db_stats == NULL);
 
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
-            break;
-        }
+    /*
+     * the snapshot lives within the current transaction if any, the current
+     * memory context liftime otherwise.
+     */
+    if (IsTransactionState())
+        MemoryContextSwitchTo(TopTransactionContext);
 
-        /* Normal acceptance case: file is not older than cutoff time */
-        if (ok && file_ts >= min_ts)
-            break;
+    /* global stats can be just copied  */
+    snapshot_globalStats = palloc(sizeof(PgStat_GlobalStats));
+    memcpy(snapshot_globalStats, shared_globalStats,
+           sizeof(PgStat_GlobalStats));
 
-        /* Not there or too old, so kick the collector and wait a bit */
-        if ((count % PGSTAT_INQ_LOOP_COUNT) == 0)
-            pgstat_send_inquiry(cur_ts, min_ts, inquiry_db);
+    snapshot_archiverStats = palloc(sizeof(PgStat_ArchiverStats));
+    memcpy(snapshot_archiverStats, shared_archiverStats,
+           sizeof(PgStat_ArchiverStats));
 
-        pg_usleep(PGSTAT_RETRY_DELAY * 1000L);
+    /*
+     * take a local snapshot for every dsahsh. It's ok if the snapshots are
+     * not in strictly consistent.
+     */
+    new_area = dsa_take_snapshot(area);
+    local_db_stats = dshash_take_snapshot(db_stats, new_area);
+    dshash_seq_init(&seq, local_db_stats);
+    while ((dbent = (PgStat_StatDBEntry *) dshash_seq_next(&seq)) != NULL)
+    {
+        dshash_table *t;
+
+        t = dshash_attach(area, &dsh_tblparams, dbent->tables, 0);
+        dbent->snapshot_tables = dshash_take_snapshot(t, new_area);
+        dshash_detach(t);
+        t = dshash_attach(area, &dsh_funcparams, dbent->functions, 0);
+        dbent->snapshot_functions = dshash_take_snapshot(t, new_area);
+        dshash_detach(t);
     }
 
-    if (count >= PGSTAT_POLL_LOOP_COUNT)
-        ereport(LOG,
-                (errmsg("using stale statistics instead of current ones "
-                        "because stats collector is not responding")));
+    /* set the timestamp of taking this snapshot */
+    snapshot_globalStats->stats_timestamp = GetCurrentTimestamp();
 
-    /*
-     * Autovacuum launcher wants stats about all databases, but a shallow read
-     * is sufficient.  Regular backends want a deep read for just the tables
-     * they can see (MyDatabaseId + shared catalogs).
-     */
-    if (IsAutoVacuumLauncherProcess())
-        pgStatDBHash = pgstat_read_statsfiles(InvalidOid, false, false);
-    else
-        pgStatDBHash = pgstat_read_statsfiles(MyDatabaseId, false, true);
+    /* register callback to clear snapshot */
+    mcxt_cb = (MemoryContextCallback *)palloc(sizeof(MemoryContextCallback));
+    mcxt_cb->func = backend_clean_snapshot_callback;
+    mcxt_cb->arg = NULL;
+    MemoryContextRegisterResetCallback(CurrentMemoryContext, mcxt_cb);
+    MemoryContextSwitchTo(oldcontext);
+
+    return true;
 }
 
 
@@ -5581,6 +5486,8 @@ pgstat_setup_memcxt(void)
 void
 pgstat_clear_snapshot(void)
 {
+    int param = 0;    /* only the address is significant */
+
     /* Release memory, if any was allocated */
     if (pgStatLocalContext)
         MemoryContextDelete(pgStatLocalContext);
@@ -5590,99 +5497,12 @@ pgstat_clear_snapshot(void)
     pgStatDBHash = NULL;
     localBackendStatusTable = NULL;
     localNumBackends = 0;
-}
-
-
-/* ----------
- * pgstat_recv_inquiry() -
- *
- *    Process stat inquiry requests.
- * ----------
- */
-static void
-pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
-{
-    PgStat_StatDBEntry *dbentry;
-
-    elog(DEBUG2, "received inquiry for database %u", msg->databaseid);
-
-    /*
-     * If there's already a write request for this DB, there's nothing to do.
-     *
-     * Note that if a request is found, we return early and skip the below
-     * check for clock skew.  This is okay, since the only way for a DB
-     * request to be present in the list is that we have been here since the
-     * last write round.  It seems sufficient to check for clock skew once per
-     * write round.
-     */
-    if (list_member_oid(pending_write_requests, msg->databaseid))
-        return;
-
-    /*
-     * Check to see if we last wrote this database at a time >= the requested
-     * cutoff time.  If so, this is a stale request that was generated before
-     * we updated the DB file, and we don't need to do so again.
-     *
-     * If the requestor's local clock time is older than stats_timestamp, we
-     * should suspect a clock glitch, ie system time going backwards; though
-     * the more likely explanation is just delayed message receipt.  It is
-     * worth expending a GetCurrentTimestamp call to be sure, since a large
-     * retreat in the system clock reading could otherwise cause us to neglect
-     * to update the stats file for a long time.
-     */
-    dbentry = pgstat_get_db_entry(msg->databaseid, false);
-    if (dbentry == NULL)
-    {
-        /*
-         * We have no data for this DB.  Enter a write request anyway so that
-         * the global stats will get updated.  This is needed to prevent
-         * backend_read_statsfile from waiting for data that we cannot supply,
-         * in the case of a new DB that nobody has yet reported any stats for.
-         * See the behavior of pgstat_read_db_statsfile_timestamp.
-         */
-    }
-    else if (msg->clock_time < dbentry->stats_timestamp)
-    {
-        TimestampTz cur_ts = GetCurrentTimestamp();
-
-        if (cur_ts < dbentry->stats_timestamp)
-        {
-            /*
-             * Sure enough, time went backwards.  Force a new stats file write
-             * to get back in sync; but first, log a complaint.
-             */
-            char       *writetime;
-            char       *mytime;
-
-            /* Copy because timestamptz_to_str returns a static buffer */
-            writetime = pstrdup(timestamptz_to_str(dbentry->stats_timestamp));
-            mytime = pstrdup(timestamptz_to_str(cur_ts));
-            elog(LOG,
-                 "stats_timestamp %s is later than collector's time %s for database %u",
-                 writetime, mytime, dbentry->databaseid);
-            pfree(writetime);
-            pfree(mytime);
-        }
-        else
-        {
-            /*
-             * Nope, it's just an old request.  Assuming msg's clock_time is
-             * >= its cutoff_time, it must be stale, so we can ignore it.
-             */
-            return;
-        }
-    }
-    else if (msg->cutoff_time <= dbentry->stats_timestamp)
-    {
-        /* Stale request, ignore it */
-        return;
-    }
 
     /*
-     * We need to write this DB, so create a request.
+     * the parameter inform the function that it is not called from
+     * MemoryContextCallback
      */
-    pending_write_requests = lappend_oid(pending_write_requests,
-                                         msg->databaseid);
+    backend_clean_snapshot_callback(¶m);
 }
 
 
@@ -5695,6 +5515,7 @@ pgstat_recv_inquiry(PgStat_MsgInquiry *msg, int len)
 static void
 pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 {
+    dshash_table *tabhash;
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
     int            i;
@@ -5710,6 +5531,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     dbentry->n_block_read_time += msg->m_block_read_time;
     dbentry->n_block_write_time += msg->m_block_write_time;
 
+    tabhash = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
@@ -5717,9 +5539,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
     {
         PgStat_TableEntry *tabmsg = &(msg->m_entry[i]);
 
-        tabentry = (PgStat_StatTabEntry *) hash_search(dbentry->tables,
-                                                       (void *) &(tabmsg->t_id),
-                                                       HASH_ENTER, &found);
+        tabentry = (PgStat_StatTabEntry *)
+            dshash_find_or_insert(tabhash, (void *) &(tabmsg->t_id), &found);
 
         if (!found)
         {
@@ -5778,6 +5599,7 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         tabentry->n_live_tuples = Max(tabentry->n_live_tuples, 0);
         /* Likewise for n_dead_tuples */
         tabentry->n_dead_tuples = Max(tabentry->n_dead_tuples, 0);
+        dshash_release_lock(tabhash, tabentry);
 
         /*
          * Add per-table stats to the per-database entry, too.
@@ -5790,6 +5612,8 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
         dbentry->n_blocks_fetched += tabmsg->t_counts.t_blocks_fetched;
         dbentry->n_blocks_hit += tabmsg->t_counts.t_blocks_hit;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -5802,27 +5626,33 @@ pgstat_recv_tabstat(PgStat_MsgTabstat *msg, int len)
 static void
 pgstat_recv_tabpurge(PgStat_MsgTabpurge *msg, int len)
 {
+    dshash_table *tbl;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, false);
-
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->tables)
+    if (!dbentry || dbentry->tables == DSM_HANDLE_INVALID)
+    {
+        if (dbentry)
+            dshash_release_lock(db_stats, dbentry);
         return;
+    }
 
+    tbl = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
     /*
      * Process all table entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->tables,
-                           (void *) &(msg->m_tableid[i]),
-                           HASH_REMOVE, NULL);
+        (void) dshash_delete_key(tbl, (void *) &(msg->m_tableid[i]));
     }
+
+    dshash_release_lock(db_stats, dbentry);
+
 }
 
 
@@ -5848,23 +5678,20 @@ pgstat_recv_dropdb(PgStat_MsgDropdb *msg, int len)
      */
     if (dbentry)
     {
-        char        statfile[MAXPGPATH];
-
-        get_dbstat_filename(false, false, dbid, statfile, MAXPGPATH);
-
-        elog(DEBUG2, "removing stats file \"%s\"", statfile);
-        unlink(statfile);
-
-        if (dbentry->tables != NULL)
-            hash_destroy(dbentry->tables);
-        if (dbentry->functions != NULL)
-            hash_destroy(dbentry->functions);
+        if (dbentry->tables != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+            dshash_destroy(tbl);
+        }
+        if (dbentry->functions != DSM_HANDLE_INVALID)
+        {
+            dshash_table *tbl =
+                dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+            dshash_destroy(tbl);
+        }
 
-        if (hash_search(pgStatDBHash,
-                        (void *) &dbid,
-                        HASH_REMOVE, NULL) == NULL)
-            ereport(ERROR,
-                    (errmsg("database hash table corrupted during cleanup --- abort")));
+        dshash_delete_entry(db_stats, (void *)dbentry);
     }
 }
 
@@ -5892,19 +5719,28 @@ pgstat_recv_resetcounter(PgStat_MsgResetcounter *msg, int len)
      * We simply throw away all the database's table entries by recreating a
      * new hash table for them.
      */
-    if (dbentry->tables != NULL)
-        hash_destroy(dbentry->tables);
-    if (dbentry->functions != NULL)
-        hash_destroy(dbentry->functions);
-
-    dbentry->tables = NULL;
-    dbentry->functions = NULL;
+    if (dbentry->tables != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_destroy(t);
+        dbentry->tables = DSM_HANDLE_INVALID;
+    }
+    if (dbentry->functions != DSM_HANDLE_INVALID)
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_destroy(t);
+        dbentry->functions = DSM_HANDLE_INVALID;
+    }
 
     /*
      * Reset database-level stats, too.  This creates empty hash tables for
      * tables and functions.
      */
     reset_dbentry_counters(dbentry);
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5919,14 +5755,14 @@ pgstat_recv_resetsharedcounter(PgStat_MsgResetsharedcounter *msg, int len)
     if (msg->m_resettarget == RESET_BGWRITER)
     {
         /* Reset the global background writer statistics for the cluster. */
-        memset(&globalStats, 0, sizeof(globalStats));
-        globalStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(&shared_globalStats, 0, sizeof(shared_globalStats));
+        shared_globalStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
     else if (msg->m_resettarget == RESET_ARCHIVER)
     {
         /* Reset the archiver statistics for the cluster. */
-        memset(&archiverStats, 0, sizeof(archiverStats));
-        archiverStats.stat_reset_timestamp = GetCurrentTimestamp();
+        memset(&shared_archiverStats, 0, sizeof(shared_archiverStats));
+        shared_archiverStats->stat_reset_timestamp = GetCurrentTimestamp();
     }
 
     /*
@@ -5956,11 +5792,19 @@ pgstat_recv_resetsinglecounter(PgStat_MsgResetsinglecounter *msg, int len)
 
     /* Remove object if it exists, ignore it if not */
     if (msg->m_resettype == RESET_TABLE)
-        (void) hash_search(dbentry->tables, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+    }
     else if (msg->m_resettype == RESET_FUNCTION)
-        (void) hash_search(dbentry->functions, (void *) &(msg->m_objectid),
-                           HASH_REMOVE, NULL);
+    {
+        dshash_table *t =
+            dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
+        dshash_delete_key(t, (void *) &(msg->m_objectid));
+    }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5980,6 +5824,8 @@ pgstat_recv_autovac(PgStat_MsgAutovacStart *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->last_autovac_time = msg->m_start_time;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -5993,13 +5839,13 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
-
+    dshash_table *table;
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
-
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6014,6 +5860,9 @@ pgstat_recv_vacuum(PgStat_MsgVacuum *msg, int len)
         tabentry->vacuum_timestamp = msg->m_vacuumtime;
         tabentry->vacuum_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6027,13 +5876,15 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
 {
     PgStat_StatDBEntry *dbentry;
     PgStat_StatTabEntry *tabentry;
+    dshash_table *table;
 
     /*
      * Store the data in the table's hashtable entry.
      */
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
-    tabentry = pgstat_get_tab_entry(dbentry, msg->m_tableoid, true);
+    table = dshash_attach(area, &dsh_tblparams, dbentry->tables, 0);
+    tabentry = pgstat_get_tab_entry(table, msg->m_tableoid, true);
 
     tabentry->n_live_tuples = msg->m_live_tuples;
     tabentry->n_dead_tuples = msg->m_dead_tuples;
@@ -6056,6 +5907,9 @@ pgstat_recv_analyze(PgStat_MsgAnalyze *msg, int len)
         tabentry->analyze_timestamp = msg->m_analyzetime;
         tabentry->analyze_count++;
     }
+    dshash_release_lock(table, tabentry);
+    dshash_detach(table);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 
@@ -6071,18 +5925,18 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
     if (msg->m_failed)
     {
         /* Failed archival attempt */
-        ++archiverStats.failed_count;
-        memcpy(archiverStats.last_failed_wal, msg->m_xlog,
-               sizeof(archiverStats.last_failed_wal));
-        archiverStats.last_failed_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->failed_count;
+        memcpy(shared_archiverStats->last_failed_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_failed_wal));
+        shared_archiverStats->last_failed_timestamp = msg->m_timestamp;
     }
     else
     {
         /* Successful archival operation */
-        ++archiverStats.archived_count;
-        memcpy(archiverStats.last_archived_wal, msg->m_xlog,
-               sizeof(archiverStats.last_archived_wal));
-        archiverStats.last_archived_timestamp = msg->m_timestamp;
+        ++shared_archiverStats->archived_count;
+        memcpy(shared_archiverStats->last_archived_wal, msg->m_xlog,
+               sizeof(shared_archiverStats->last_archived_wal));
+        shared_archiverStats->last_archived_timestamp = msg->m_timestamp;
     }
 }
 
@@ -6095,16 +5949,16 @@ pgstat_recv_archiver(PgStat_MsgArchiver *msg, int len)
 static void
 pgstat_recv_bgwriter(PgStat_MsgBgWriter *msg, int len)
 {
-    globalStats.timed_checkpoints += msg->m_timed_checkpoints;
-    globalStats.requested_checkpoints += msg->m_requested_checkpoints;
-    globalStats.checkpoint_write_time += msg->m_checkpoint_write_time;
-    globalStats.checkpoint_sync_time += msg->m_checkpoint_sync_time;
-    globalStats.buf_written_checkpoints += msg->m_buf_written_checkpoints;
-    globalStats.buf_written_clean += msg->m_buf_written_clean;
-    globalStats.maxwritten_clean += msg->m_maxwritten_clean;
-    globalStats.buf_written_backend += msg->m_buf_written_backend;
-    globalStats.buf_fsync_backend += msg->m_buf_fsync_backend;
-    globalStats.buf_alloc += msg->m_buf_alloc;
+    shared_globalStats->timed_checkpoints += msg->m_timed_checkpoints;
+    shared_globalStats->requested_checkpoints += msg->m_requested_checkpoints;
+    shared_globalStats->checkpoint_write_time += msg->m_checkpoint_write_time;
+    shared_globalStats->checkpoint_sync_time += msg->m_checkpoint_sync_time;
+    shared_globalStats->buf_written_checkpoints += msg->m_buf_written_checkpoints;
+    shared_globalStats->buf_written_clean += msg->m_buf_written_clean;
+    shared_globalStats->maxwritten_clean += msg->m_maxwritten_clean;
+    shared_globalStats->buf_written_backend += msg->m_buf_written_backend;
+    shared_globalStats->buf_fsync_backend += msg->m_buf_fsync_backend;
+    shared_globalStats->buf_alloc += msg->m_buf_alloc;
 }
 
 /* ----------
@@ -6145,6 +5999,8 @@ pgstat_recv_recoveryconflict(PgStat_MsgRecoveryConflict *msg, int len)
             dbentry->n_conflict_startup_deadlock++;
             break;
     }
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6161,6 +6017,8 @@ pgstat_recv_deadlock(PgStat_MsgDeadlock *msg, int len)
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
     dbentry->n_deadlocks++;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6178,6 +6036,8 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 
     dbentry->n_temp_bytes += msg->m_filesize;
     dbentry->n_temp_files += 1;
+
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6189,6 +6049,7 @@ pgstat_recv_tempfile(PgStat_MsgTempFile *msg, int len)
 static void
 pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 {
+    dshash_table *t;
     PgStat_FunctionEntry *funcmsg = &(msg->m_entry[0]);
     PgStat_StatDBEntry *dbentry;
     PgStat_StatFuncEntry *funcentry;
@@ -6197,14 +6058,14 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 
     dbentry = pgstat_get_db_entry(msg->m_databaseid, true);
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++, funcmsg++)
     {
-        funcentry = (PgStat_StatFuncEntry *) hash_search(dbentry->functions,
-                                                         (void *) &(funcmsg->f_id),
-                                                         HASH_ENTER, &found);
+        funcentry = (PgStat_StatFuncEntry *)
+            dshash_find_or_insert(t, (void *) &(funcmsg->f_id), &found);
 
         if (!found)
         {
@@ -6225,7 +6086,11 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
             funcentry->f_total_time += funcmsg->f_total_time;
             funcentry->f_self_time += funcmsg->f_self_time;
         }
+        dshash_release_lock(t, funcentry);
     }
+
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /* ----------
@@ -6237,6 +6102,7 @@ pgstat_recv_funcstat(PgStat_MsgFuncstat *msg, int len)
 static void
 pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
 {
+    dshash_table *t;
     PgStat_StatDBEntry *dbentry;
     int            i;
 
@@ -6245,60 +6111,20 @@ pgstat_recv_funcpurge(PgStat_MsgFuncpurge *msg, int len)
     /*
      * No need to purge if we don't even know the database.
      */
-    if (!dbentry || !dbentry->functions)
+    if (!dbentry || dbentry->functions == DSM_HANDLE_INVALID)
         return;
 
+    t = dshash_attach(area, &dsh_funcparams, dbentry->functions, 0);
     /*
      * Process all function entries in the message.
      */
     for (i = 0; i < msg->m_nentries; i++)
     {
         /* Remove from hashtable if present; we don't care if it's not. */
-        (void) hash_search(dbentry->functions,
-                           (void *) &(msg->m_functionid[i]),
-                           HASH_REMOVE, NULL);
+        dshash_delete_key(t, (void *) &(msg->m_functionid[i]));
     }
-}
-
-/* ----------
- * pgstat_write_statsfile_needed() -
- *
- *    Do we need to write out any stats files?
- * ----------
- */
-static bool
-pgstat_write_statsfile_needed(void)
-{
-    if (pending_write_requests != NIL)
-        return true;
-
-    /* Everything was written recently */
-    return false;
-}
-
-/* ----------
- * pgstat_db_requested() -
- *
- *    Checks whether stats for a particular DB need to be written to a file.
- * ----------
- */
-static bool
-pgstat_db_requested(Oid databaseid)
-{
-    /*
-     * If any requests are outstanding at all, we should write the stats for
-     * shared catalogs (the "database" with OID 0).  This ensures that
-     * backends will see up-to-date stats for shared catalogs, even though
-     * they send inquiry messages mentioning only their own DB.
-     */
-    if (databaseid == InvalidOid && pending_write_requests != NIL)
-        return true;
-
-    /* Search to see if there's an open request to write this database. */
-    if (list_member_oid(pending_write_requests, databaseid))
-        return true;
-
-    return false;
+    dshash_detach(t);
+    dshash_release_lock(db_stats, dbentry);
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index cd7d391..c8a08b7 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -73,9 +73,6 @@ static void throttle(size_t increment);
 /* Was the backup currently in-progress initiated in recovery mode? */
 static bool backup_started_in_recovery = false;
 
-/* Relative path of temporary statistics directory */
-static char *statrelpath = NULL;
-
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -106,13 +103,6 @@ static TimestampTz throttled_last;
 static const char *excludeDirContents[] =
 {
     /*
-     * Skip temporary statistics files. PG_STAT_TMP_DIR must be skipped even
-     * when stats_temp_directory is set because PGSS_TEXT_FILE is always
-     * created there.
-     */
-    PG_STAT_TMP_DIR,
-
-    /*
      * It is generally not useful to backup the contents of this directory
      * even if the intention is to restore to another master. See backup.sgml
      * for a more detailed description.
@@ -196,11 +186,8 @@ perform_base_backup(basebackup_options *opt)
     TimeLineID    endtli;
     StringInfo    labelfile;
     StringInfo    tblspc_map_file = NULL;
-    int            datadirpathlen;
     List       *tablespaces = NIL;
 
-    datadirpathlen = strlen(DataDir);
-
     backup_started_in_recovery = RecoveryInProgress();
 
     labelfile = makeStringInfo();
@@ -225,18 +212,6 @@ perform_base_backup(basebackup_options *opt)
 
         SendXlogRecPtrResult(startptr, starttli);
 
-        /*
-         * Calculate the relative path of temporary statistics directory in
-         * order to skip the files which are located in that directory later.
-         */
-        if (is_absolute_path(pgstat_stat_directory) &&
-            strncmp(pgstat_stat_directory, DataDir, datadirpathlen) == 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory + datadirpathlen + 1);
-        else if (strncmp(pgstat_stat_directory, "./", 2) != 0)
-            statrelpath = psprintf("./%s", pgstat_stat_directory);
-        else
-            statrelpath = pgstat_stat_directory;
-
         /* Add a node for the base directory at the end */
         ti = palloc0(sizeof(tablespaceinfo));
         ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -1042,17 +1017,6 @@ sendDir(const char *path, int basepathlen, bool sizeonly, List *tablespaces,
             continue;
 
         /*
-         * Exclude contents of directory specified by statrelpath if not set
-         * to the default (pg_stat_tmp) which is caught in the loop above.
-         */
-        if (statrelpath != NULL && strcmp(pathbuf, statrelpath) == 0)
-        {
-            elog(DEBUG1, "contents of directory \"%s\" excluded from backup", statrelpath);
-            size += _tarWriteDir(pathbuf, basepathlen, &statbuf, sizeonly);
-            continue;
-        }
-
-        /*
          * We can skip pg_wal, the WAL segments need to be fetched from the
          * WAL archive anyway. But include it as an empty directory anyway, so
          * we get permissions right.
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..16270ff 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -150,6 +150,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
         size = add_size(size, SyncScanShmemSize());
         size = add_size(size, AsyncShmemSize());
         size = add_size(size, BackendRandomShmemSize());
+        size = add_size(size, StatsShmemSize());
 #ifdef EXEC_BACKEND
         size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
     SyncScanShmemInit();
     AsyncShmemInit();
     BackendRandomShmemInit();
+    StatsShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f5c42..8ffab2f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -518,6 +518,9 @@ RegisterLWLockTranches(void)
                           "session_typmod_table");
     LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
     LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DSA, "stats table dsa");
+    LWLockRegisterTranche(LWTRANCHE_STATS_DB, "db stats");
+    LWLockRegisterTranche(LWTRANCHE_STATS_FUNC_TABLE, "table/func stats");
 
     /* Register named tranches. */
     for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..798af9f 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock                42
 BackendRandomLock                    43
 LogicalRepWorkerLock                44
 CLogTruncationLock                    45
+StatsLock                            46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0f7a96d..e2a6fb2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -185,7 +185,6 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_effective_io_concurrency(int newval, void *extra);
-static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
 static void assign_application_name(const char *newval, void *extra);
 static bool check_cluster_name(char **newval, void **extra, GucSource source);
@@ -3562,17 +3561,6 @@ static struct config_string ConfigureNamesString[] =
     },
 
     {
-        {"stats_temp_directory", PGC_SIGHUP, STATS_COLLECTOR,
-            gettext_noop("Writes temporary statistics files to the specified directory."),
-            NULL,
-            GUC_SUPERUSER_ONLY
-        },
-        &pgstat_temp_directory,
-        PG_STAT_TMP_DIR,
-        check_canonical_path, assign_pgstat_temp_directory, NULL
-    },
-
-    {
         {"synchronous_standby_names", PGC_SIGHUP, REPLICATION_MASTER,
             gettext_noop("Number of synchronous standbys and list of names of potential synchronous ones."),
             NULL,
@@ -10438,35 +10426,6 @@ assign_effective_io_concurrency(int newval, void *extra)
 #endif                            /* USE_PREFETCH */
 }
 
-static void
-assign_pgstat_temp_directory(const char *newval, void *extra)
-{
-    /* check_canonical_path already canonicalized newval for us */
-    char       *dname;
-    char       *tname;
-    char       *fname;
-
-    /* directory */
-    dname = guc_malloc(ERROR, strlen(newval) + 1);    /* runtime dir */
-    sprintf(dname, "%s", newval);
-
-    /* global stats */
-    tname = guc_malloc(ERROR, strlen(newval) + 12); /* /global.tmp */
-    sprintf(tname, "%s/global.tmp", newval);
-    fname = guc_malloc(ERROR, strlen(newval) + 13); /* /global.stat */
-    sprintf(fname, "%s/global.stat", newval);
-
-    if (pgstat_stat_directory)
-        free(pgstat_stat_directory);
-    pgstat_stat_directory = dname;
-    if (pgstat_stat_tmpname)
-        free(pgstat_stat_tmpname);
-    pgstat_stat_tmpname = tname;
-    if (pgstat_stat_filename)
-        free(pgstat_stat_filename);
-    pgstat_stat_filename = fname;
-}
-
 static bool
 check_application_name(char **newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 842cf36..529c093 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -492,7 +492,6 @@
 #track_io_timing = off
 #track_functions = none            # none, pl, all
 #track_activity_query_size = 1024    # (change requires restart)
-#stats_temp_directory = 'pg_stat_tmp'
 
 
 # - Monitoring -
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ddc850d..0e0511f 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -216,7 +216,6 @@ static const char *const subdirs[] = {
     "pg_replslot",
     "pg_tblspc",
     "pg_stat",
-    "pg_stat_tmp",
     "pg_xact",
     "pg_logical",
     "pg_logical/snapshots",
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index cdf4f5b..a25de6d 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -78,7 +78,7 @@ is_deeply(
 
 # Contents of these directories should not be copied.
 foreach my $dirname (
-    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_stat_tmp pg_subtrans)
+    qw(pg_dynshmem pg_notify pg_replslot pg_serial pg_snapshots pg_subtrans)
   )
 {
     is_deeply(
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index b054dab..8a197bf 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -407,6 +407,7 @@ extern AuxProcType MyAuxProcType;
 #define AmCheckpointerProcess()        (MyAuxProcType == CheckpointerProcess)
 #define AmWalWriterProcess()        (MyAuxProcType == WalWriterProcess)
 #define AmWalReceiverProcess()        (MyAuxProcType == WalReceiverProcess)
+#define AmStatsCollectorProcess()    (MyAuxProcType == StatsCollectorProcess)
 
 
 /*****************************************************************************
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e2a1e21..b48741f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -13,6 +13,7 @@
 
 #include "datatype/timestamp.h"
 #include "fmgr.h"
+#include "lib/dshash.h"
 #include "libpq/pqcomm.h"
 #include "port/atomics.h"
 #include "portability/instr_time.h"
@@ -30,9 +31,6 @@
 #define PGSTAT_STAT_PERMANENT_FILENAME        "pg_stat/global.stat"
 #define PGSTAT_STAT_PERMANENT_TMPFILE        "pg_stat/global.tmp"
 
-/* Default directory to store temporary statistics data in */
-#define PG_STAT_TMP_DIR        "pg_stat_tmp"
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
@@ -48,7 +46,6 @@ typedef enum TrackFunctionsLevel
 typedef enum StatMsgType
 {
     PGSTAT_MTYPE_DUMMY,
-    PGSTAT_MTYPE_INQUIRY,
     PGSTAT_MTYPE_TABSTAT,
     PGSTAT_MTYPE_TABPURGE,
     PGSTAT_MTYPE_DROPDB,
@@ -216,35 +213,6 @@ typedef struct PgStat_MsgDummy
     PgStat_MsgHdr m_hdr;
 } PgStat_MsgDummy;
 
-
-/* ----------
- * PgStat_MsgInquiry            Sent by a backend to ask the collector
- *                                to write the stats file(s).
- *
- * Ordinarily, an inquiry message prompts writing of the global stats file,
- * the stats file for shared catalogs, and the stats file for the specified
- * database.  If databaseid is InvalidOid, only the first two are written.
- *
- * New file(s) will be written only if the existing file has a timestamp
- * older than the specified cutoff_time; this prevents duplicated effort
- * when multiple requests arrive at nearly the same time, assuming that
- * backends send requests with cutoff_times a little bit in the past.
- *
- * clock_time should be the requestor's current local time; the collector
- * uses this to check for the system clock going backward, but it has no
- * effect unless that occurs.  We assume clock_time >= cutoff_time, though.
- * ----------
- */
-
-typedef struct PgStat_MsgInquiry
-{
-    PgStat_MsgHdr m_hdr;
-    TimestampTz clock_time;        /* observed local clock time */
-    TimestampTz cutoff_time;    /* minimum acceptable file timestamp */
-    Oid            databaseid;        /* requested DB (InvalidOid => shared only) */
-} PgStat_MsgInquiry;
-
-
 /* ----------
  * PgStat_TableEntry            Per-table info in a MsgTabstat
  * ----------
@@ -539,7 +507,6 @@ typedef union PgStat_Msg
 {
     PgStat_MsgHdr msg_hdr;
     PgStat_MsgDummy msg_dummy;
-    PgStat_MsgInquiry msg_inquiry;
     PgStat_MsgTabstat msg_tabstat;
     PgStat_MsgTabpurge msg_tabpurge;
     PgStat_MsgDropdb msg_dropdb;
@@ -601,10 +568,13 @@ typedef struct PgStat_StatDBEntry
 
     /*
      * tables and functions must be last in the struct, because we don't write
-     * the pointers out to the stats file.
+     * the handles and pointers out to the stats file.
      */
-    HTAB       *tables;
-    HTAB       *functions;
+    dshash_table_handle tables;
+    dshash_table_handle functions;
+    /* for snapshot tables */
+    dshash_table *snapshot_tables;
+    dshash_table *snapshot_functions;
 } PgStat_StatDBEntry;
 
 
@@ -1201,6 +1171,7 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 extern void pgstat_initstats(Relation rel);
 
 extern char *pgstat_clip_activity(const char *raw_activity);
+extern PgStat_StatTabEntry *backend_get_tab_entry(PgStat_StatDBEntry *dbent, Oid relid);
 
 /* ----------
  * pgstat_report_wait_start() -
@@ -1328,6 +1299,8 @@ extern void pgstat_send_bgwriter(void);
  * generate the pgstat* views.
  * ----------
  */
+extern Size StatsShmemSize(void);
+extern void StatsShmemInit(void);
 extern void PgstatCollectorMain(void);
 extern PgStat_StatDBEntry *pgstat_fetch_stat_dbentry(Oid dbid);
 extern PgStat_StatTabEntry *pgstat_fetch_stat_tabentry(Oid relid);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 460843d..0b17d3e 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -217,6 +217,9 @@ typedef enum BuiltinTrancheIds
     LWTRANCHE_SESSION_TYPMOD_TABLE,
     LWTRANCHE_TBM,
     LWTRANCHE_PARALLEL_APPEND,
+    LWTRANCHE_STATS_DSA,
+    LWTRANCHE_STATS_DB,
+    LWTRANCHE_STATS_FUNC_TABLE,
     LWTRANCHE_FIRST_USER_DEFINED
 }            BuiltinTrancheIds;
 
-- 
2.9.2

Re: [HACKERS] More stats about skipped vacuums

From

Masahiko Sawada

Date:

06 February 2018, 11:50:01

On Mon, Dec 11, 2017 at 8:15 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Mon, 27 Nov 2017 13:51:22 -0500, Robert Haas <robertmhaas@gmail.com> wrote in
<CA+Tgmob2tuqvEZfHV2kLC-xobsZxDWGdc1WmjLg5+iOPLa0NHg@mail.gmail.com>
>> On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > Hmmm. Okay, we must make stats collector more effeicient if we
>> > want to have additional counters with smaller significance in the
>> > table stats. Currently sizeof(PgStat_StatTabEntry) is 168
>> > bytes. The whole of the patchset increases it to 232 bytes. Thus
>> > the size of a stat file for a database with 10000 tables
>> > increases from about 1.7MB to 2.4MB.  DSM and shared dynahash is
>> > not dynamically expandable so placing stats on shared hash
>> > doesn't seem effective. Stats as a regular table could work but
>> > it seems too-much.
>>
>> dshash, which is already committed, is both DSM-based and dynamically
>> expandable.
>
> Yes, I forgot about that. We can just copy memory blocks to take
> a snapshot of stats.
>
>> > Is it acceptable that adding a new section containing this new
>> > counters, which is just loaded as a byte sequence and parsing
>> > (and filling the corresponding hash) is postponed until a counter
>> > in the section is really requested?  The new counters need to be
>> > shown in a separate stats view (maybe named pg_stat_vacuum).
>>
>> Still makes the stats file bigger.
>
> I considered dshash for pgstat.c and the attached is a *PoC*
> patch, which is not full-fledged and just working under a not so
> concurrent situation.
>
> - Made stats collector an auxiliary process. A crash of stats
>   collector leads to a whole-server restarting.
>
> - dshash lacks capability of sequential scan so added it.
>
> - Also added snapshot function to dshash. It just copies
>   unerlying DSA segments into local memory but currently it
>   doesn't aquire dshash-level locks at all. I tried the same
>   thing with resize but it leads to very quick exhaustion of
>   LWLocks. An LWLock for the whole dshash would be required. (and
>   it is also useful to resize() and sequential scan.
>
> - The current dshash doesn't shrink at all. Such a feature also
>   will be required. (A server restart causes a shrink of hashes
>   in the same way as before but bloat dshash requires copying
>   more than necessary size of memory on takeing a snapshot.)
>
> The size of DSA is about 1MB at minimum. Copying entry-by-entry
> into (non-ds) hash might be better than copying underlying DSA as
> a whole, and DSA/DSHASH snapshot brings a kind of dirty..
>
>
> Does anyone give me opinions or suggestions?
>

The implementation of autovacuum work-item has been changed (by commit
31ae1638) because dynamic shared memory is not portable enough. IIUC
this patch is going to do the similar thing. Since stats collector is
also a critical part of the server, should we consider whether we can
change it? Or the portability problem is not relevant with this patch?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

06 February 2018, 16:24:37

At Tue, 6 Feb 2018 14:50:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCRn6Q0wGG7UwGVsQJZbocNsRaZByJomUy+-GRkVH-i9A@mail.gmail.com>
> On Mon, Dec 11, 2017 at 8:15 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I considered dshash for pgstat.c and the attached is a *PoC*
> > patch, which is not full-fledged and just working under a not so
> > concurrent situation.
> >
> > - Made stats collector an auxiliary process. A crash of stats
> >   collector leads to a whole-server restarting.
> >
> > - dshash lacks capability of sequential scan so added it.
> >
> > - Also added snapshot function to dshash. It just copies
> >   unerlying DSA segments into local memory but currently it
> >   doesn't aquire dshash-level locks at all. I tried the same
> >   thing with resize but it leads to very quick exhaustion of
> >   LWLocks. An LWLock for the whole dshash would be required. (and
> >   it is also useful to resize() and sequential scan.
> >
> > - The current dshash doesn't shrink at all. Such a feature also
> >   will be required. (A server restart causes a shrink of hashes
> >   in the same way as before but bloat dshash requires copying
> >   more than necessary size of memory on takeing a snapshot.)
> >
> > The size of DSA is about 1MB at minimum. Copying entry-by-entry
> > into (non-ds) hash might be better than copying underlying DSA as
> > a whole, and DSA/DSHASH snapshot brings a kind of dirty..
> >
> >
> > Does anyone give me opinions or suggestions?
> >
> 
> The implementation of autovacuum work-item has been changed (by commit
> 31ae1638) because dynamic shared memory is not portable enough. IIUC
> this patch is going to do the similar thing. Since stats collector is
> also a critical part of the server, should we consider whether we can
> change it? Or the portability problem is not relevant with this patch?

Thank you for the pointer. I digged out the following thread from
this and the the patch seems to be the consequence of the
discussion. I'll study it and think what to do on this.

https://www.postgresql.org/message-id/20170814005656.d5tvz464qkmz66tq@alap3.anarazel.de

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

07 February 2018, 07:34:58

At Tue, 06 Feb 2018 19:24:37 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180206.192437.229464841.horiguchi.kyotaro@lab.ntt.co.jp>
> At Tue, 6 Feb 2018 14:50:01 +0900, Masahiko Sawada <sawada.mshk@gmail.com> wrote in
<CAD21AoCRn6Q0wGG7UwGVsQJZbocNsRaZByJomUy+-GRkVH-i9A@mail.gmail.com>
> > The implementation of autovacuum work-item has been changed (by commit
> > 31ae1638) because dynamic shared memory is not portable enough. IIUC
> > this patch is going to do the similar thing. Since stats collector is
> > also a critical part of the server, should we consider whether we can
> > change it? Or the portability problem is not relevant with this patch?
> 
> Thank you for the pointer. I digged out the following thread from
> this and the the patch seems to be the consequence of the
> discussion. I'll study it and think what to do on this.
> 
> https://www.postgresql.org/message-id/20170814005656.d5tvz464qkmz66tq@alap3.anarazel.de

Done. The dominant reason for the ripping-off is that the
work-items array was allocated in a fixed-size DSA segment at
process startup time and wouldn't be resized.

Based on the reason, it fails to run when
dynamic_shared_memory_type = none and it is accompanied by
several cleanup complexities. The decision there is we should go
for just using static shared memory rather than adding complexity
for nothing. If it really needs to be expandable in the future,
it's the time to use DSA. (But would still maintain a fallback
stuff.)


Thinking of this, I think that this patch has a reason to use DSA
but needs some additional work.

- Fallback mechanism when dynamic_shared_memory_type = none

  This means that the old file based stuff lives alongside the
  DSA stuff. This is used when '= none' and on failure of DSA
  mechanism.

- Out-of transactoin clean up stuff

  Something like the following patch.

  https://www.postgresql.org/message-id/20170814231638.x6vgnzlr7eww4bui@alvherre.pgsql

And as known problems:

- Make it use less LWLocks.

- DSHash shrink mechanism. Maybe need to find the way to detect
  the necessity to shrink.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] More stats about skipped vacuums

From

Robert Haas

Date:

07 February 2018, 23:09:26

On Tue, Feb 6, 2018 at 8:34 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Based on the reason, it fails to run when
> dynamic_shared_memory_type = none and it is accompanied by
> several cleanup complexities. The decision there is we should go
> for just using static shared memory rather than adding complexity
> for nothing. If it really needs to be expandable in the future,
> it's the time to use DSA. (But would still maintain a fallback
> stuff.)

It seems to me that there was a thread where Tom proposed removing
support for dynamic_shared_memory_type = none.  The main reason that I
included that option initially was because it seemed silly to risk
causing problems for users whose dynamic shared memory facilities
didn't work for the sake of a feature that, at the time (9.4), had no
in-core users.

But things have shifted a bit since then.  We have had few complaints
about dynamic shared memory causing portability problems (except for
performance: apparently some implementations perform better than
others on some systems, and we need support for huge pages, but
neither of those things are a reason to disable it) and we now have
in-core use that is enabled by default.  I suggest we remove support
for dynamic_shared_memory_type = none first, and see if we get any
complaints.  If we don't, then future patches can rely on it being
present.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: [HACKERS] More stats about skipped vacuums

From

Tom Lane

Date:

08 February 2018, 03:59:20

Robert Haas <robertmhaas@gmail.com> writes:
> It seems to me that there was a thread where Tom proposed removing
> support for dynamic_shared_memory_type = none.

I think you're recalling <32138.1502675970@sss.pgh.pa.us>, wherein
I pointed out that

>>> Whether that's worth the trouble is debatable.  The current code
>>> in initdb believes that every platform has some type of DSM support
>>> (see choose_dsm_implementation).  Nobody's complained about that,
>>> and it certainly works on every buildfarm animal.  So for all we know,
>>> dynamic_shared_memory_type = none is broken already.

(That was in fact in the same thread Kyotaro-san just linked to about
reimplementing the stats collector.)

It's still true that we've no reason to believe there are any supported
platforms that haven't got some sort of DSM.  Performance might be a
different question, of course ... but it's hard to believe that
transferring stats through DSM wouldn't be better than writing them
out to files.

> I suggest we remove support for dynamic_shared_memory_type = none first,
> and see if we get any complaints.  If we don't, then future patches can
> rely on it being present.

If we remove it in v11, it'd still be maybe a year from now before we'd
have much confidence from that alone that nobody cares.  I think the lack
of complaints about it in 9.6 and 10 is a more useful data point.

            regards, tom lane

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

08 February 2018, 15:04:15

Hello,

At Wed, 07 Feb 2018 16:59:20 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in <3246.1518040760@sss.pgh.pa.us>
> Robert Haas <robertmhaas@gmail.com> writes:
> > It seems to me that there was a thread where Tom proposed removing
> > support for dynamic_shared_memory_type = none.
> 
> I think you're recalling <32138.1502675970@sss.pgh.pa.us>, wherein
> I pointed out that
> 
> >>> Whether that's worth the trouble is debatable.  The current code
> >>> in initdb believes that every platform has some type of DSM support
> >>> (see choose_dsm_implementation).  Nobody's complained about that,
> >>> and it certainly works on every buildfarm animal.  So for all we know,
> >>> dynamic_shared_memory_type = none is broken already.
> 
> (That was in fact in the same thread Kyotaro-san just linked to about
> reimplementing the stats collector.)
> 
> It's still true that we've no reason to believe there are any supported
> platforms that haven't got some sort of DSM.  Performance might be a
> different question, of course ... but it's hard to believe that
> transferring stats through DSM wouldn't be better than writing them
> out to files.

Good to hear. Thanks.

> > I suggest we remove support for dynamic_shared_memory_type = none first,
> > and see if we get any complaints.  If we don't, then future patches can
> > rely on it being present.
> 
> If we remove it in v11, it'd still be maybe a year from now before we'd
> have much confidence from that alone that nobody cares.  I think the lack
> of complaints about it in 9.6 and 10 is a more useful data point.

So that means that we are assumed to be able to rely on the
existence of DSM at the present since over a year we had no
complain despite the fact that DSM is silently turned on? And
apart from that we are ready to remove 'none' from the options of
dynamic_shared_memory_type right now?

If I may rely on DSM, fallback stuff would not be required.


>             regards, tom lane

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

08 February 2018, 15:21:56

At Thu, 08 Feb 2018 18:04:15 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180208.180415.112312013.horiguchi.kyotaro@lab.ntt.co.jp>
> > > I suggest we remove support for dynamic_shared_memory_type = none first,
> > > and see if we get any complaints.  If we don't, then future patches can
> > > rely on it being present.
> > 
> > If we remove it in v11, it'd still be maybe a year from now before we'd
> > have much confidence from that alone that nobody cares.  I think the lack
> > of complaints about it in 9.6 and 10 is a more useful data point.
> 
> So that means that we are assumed to be able to rely on the
> existence of DSM at the present since over a year we had no
> complain despite the fact that DSM is silently turned on? And
> apart from that we are ready to remove 'none' from the options of
> dynamic_shared_memory_type right now?

I found the follwoing commit related to this.

| commit d41ab71712a4457ed39d5471b23949872ac91def
| Author: Robert Haas <rhaas@postgresql.org>
| Date:   Wed Oct 16 09:41:03 2013 -0400
| 
|     initdb: Suppress dynamic shared memory when probing for max_connections.
|     
|     This might not be the right long-term solution here, but it will
|     hopefully turn the buildfarm green again.
|     
|     Oversight noted by Andres Freund

The discussion is found here.

https://www.postgresql.org/message-id/CA+TgmoYHiiGrcvSSJhmbSEBMoF2zX_9_9rWd75Cwvu99YrDxew@mail.gmail.com

I suppose that the problem has not been resolved yet..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] More stats about skipped vacuums

From

Kyotaro HORIGUCHI

Date:

09 February 2018, 14:16:35

Hello.

At Thu, 08 Feb 2018 18:21:56 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180208.182156.96551245.horiguchi.kyotaro@lab.ntt.co.jp>
> I suppose that the problem has not been resolved yet..

I found several bugs during studying this but my conclusion here
is that the required decision here is that whether we regard the
unavailability of DSM as a fatal error as we do for out of
memory. Maybe we can go for the direction but just doing it
certainly let some buidfarm animals (at least anole? smew is not
found.) out of their lives.

I've not found the exact cause of the problem that regtest on the
bf animals always suceeded using sysv shmem but "postgres --boot"
by initdb alone can fail using the same mechanism. But regtest
seems to continue working if initdb sets max_connection to 20 or
more.  At least it suceeds for me with the values max_connection
= 20 and shared_buffers=50MB on centos.

Finally, I'd like to propose the followings.

 - kill dynamic_shared_memory_type = nune just now.

 * server stops at startup if DSM is not available.

 - Let initdb set max_connection = 20 as the fallback value in
   the case. (Another porposed patch) And regression should
   succeed with that.

If we are agreed on this, I will be able to go forward.


I want to have opinions on this from the expericed poeple.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center