Thread: [HACKERS] [PATCH] Vacuum: Update FSM more frequently
As discussed in the vacuum ring buffer thread: Make vacuum update the FSM more frequently, to avoid the case where autovacuum fails to reach the point where it updates the FSM in highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids recursing into branches that already contain enough free space, to avoid having to traverse the whole FSM and thus induce quadratic costs. Intermediate FSM vacuums are only supposed to make enough free space visible to avoid extension until the final (non-partial) FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point at which whole ranges of the heap have been thoroughly cleaned, and we can expect no further updates to those ranges of the FSM save for concurrent activity. When there are no indexes, and thus no index passes, run partial FSM vacuums every 8GB of dirtied pages or 1/8th of the relation, whichever is highest. This allows some partial work to be made visible without incurring quadratic cost.

In any case, FSMs are small in relation to the table, so even when quadratic cost is involved, it should not be problematic. Index passes already incur quadratic cost, and the addition of the FSM is unlikely to be measurable.

Run a partial FSM vacuum with a low pruning threshold right at the beginning of the VACUUM to finish up any work left over from prior canceled vacuum runs, something that is common in highly contended tables when running only autovacuum. Autovacuum canceling is thus handled by updating the FSM first thing when autovacuum retries vacuuming a relation.

I attempted to add an autovacuum work item for performing the FSM update shortly after the cancel, but that had some issues, so I abandoned the approach. For one, it would sometimes crash when adding the work item from inside autovacuum itself. I didn't find the cause of the crash, but I suspect AutoVacuumRequestWork was being called in a context where it was not safe.
In any case, the way work items work didn't seem like it would have worked for our purposes anyway: they will only ever be processed after processing all tables in the database, something that could take ages; work items are limited to 256, which would make the approach troublesome for databases with more than 256 tables that trigger this case; and it required de-duplicating work items, which had quadratic cost without major refactoring of the feature.

So, patch attached. I'll add it to the next CF as well.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachment
A few general comments.
+ FreeSpaceMapVacuum(onerel, 64);
Just want to know why '64' is used here? It's better to give a description.
+ else
+ {
+ newslot = fsm_get_avail(page, 0);
+ }
Since there is only one line in the else, the braces are not needed.
And there is one extra space ahead of the else which should be removed.
+ if (nindexes == 0 &&
+ (vacuumed_pages_at_fsm_vac - vacuumed_pages) >
vacuum_fsm_every_pages)
+ {
+ /* Vacuum the Free Space Map */
+ FreeSpaceMapVacuum(onerel, 0);
+ vacuumed_pages_at_fsm_vac = vacuumed_pages;
+ }
vacuumed_pages_at_fsm_vac is initialised with value zero, so it seems there is
no chance of entering this branch.
Regards,
Jing Wang
Fujitsu Australia
On Mon, Nov 27, 2017 at 2:39 PM, Jing Wang <jingwangian@gmail.com> wrote:
> A few general comments.
>
> + FreeSpaceMapVacuum(onerel, 64);
>
> Just want to know why '64' is used here? It's better to give a description.
>
> + else
> + {
> + newslot = fsm_get_avail(page, 0);
> + }
>
> Since there is only one line in the else the bracket will not be needed. And
> there in one more space ahead of else which should be removed.
>
> + if (nindexes == 0 &&
> + (vacuumed_pages_at_fsm_vac - vacuumed_pages) >
> vacuum_fsm_every_pages)
> + {
> + /* Vacuum the Free Space Map */
> + FreeSpaceMapVacuum(onerel, 0);
> + vacuumed_pages_at_fsm_vac = vacuumed_pages;
> + }
>
> vacuumed_pages_at_fsm_vac is initialised with value zero and seems no chance
> to go into the bracket.

The patch presented still applies, and there has been a review two
days ago. This is a short delay to answer for the author, so I am
moving this patch to next CF with the same status of waiting on
author.
--
Michael
Greetings Claudio,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Mon, Nov 27, 2017 at 2:39 PM, Jing Wang <jingwangian@gmail.com> wrote:
> > A few general comments.
> >
> > + FreeSpaceMapVacuum(onerel, 64);
> >
> > Just want to know why '64' is used here? It's better to give a description.
> >
> > + else
> > + {
> > + newslot = fsm_get_avail(page, 0);
> > + }
> >
> > Since there is only one line in the else the bracket will not be needed. And
> > there in one more space ahead of else which should be removed.
> >
> > + if (nindexes == 0 &&
> > + (vacuumed_pages_at_fsm_vac - vacuumed_pages) >
> > vacuum_fsm_every_pages)
> > + {
> > + /* Vacuum the Free Space Map */
> > + FreeSpaceMapVacuum(onerel, 0);
> > + vacuumed_pages_at_fsm_vac = vacuumed_pages;
> > + }
> >
> > vacuumed_pages_at_fsm_vac is initialised with value zero and seems no chance
> > to go into the bracket.
>
> The patch presented still applies, and there has been a review two
> days ago. This is a short delay to answer for the author, so I am
> moving this patch to next CF with the same status of waiting on
> author.

Looks like this patch probably still applies and the changes suggested
above seem pretty small, any chance you can provide an updated patch
soon Claudio, so that it can be reviewed properly during this
commitfest?

Thanks!

Stephen
Attachment
On Sat, Jan 6, 2018 at 7:33 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings Claudio,
* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Mon, Nov 27, 2017 at 2:39 PM, Jing Wang <jingwangian@gmail.com> wrote:
> > A few general comments.
> >
> > + FreeSpaceMapVacuum(onerel, 64);
> >
> > Just want to know why '64' is used here? It's better to give a description.
> >
> > + else
> > + {
> > + newslot = fsm_get_avail(page, 0);
> > + }
> >
> > Since there is only one line in the else the bracket will not be needed. And
> > there in one more space ahead of else which should be removed.
> >
> >
> > + if (nindexes == 0 &&
> > + (vacuumed_pages_at_fsm_vac - vacuumed_pages) >
> > vacuum_fsm_every_pages)
> > + {
> > + /* Vacuum the Free Space Map */
> > + FreeSpaceMapVacuum(onerel, 0);
> > + vacuumed_pages_at_fsm_vac = vacuumed_pages;
> > + }
> >
> > vacuumed_pages_at_fsm_vac is initialised with value zero and seems no chance
> > to go into the bracket.
>
> The patch presented still applies, and there has been a review two
> days ago. This is a short delay to answer for the author, so I am
> moving this patch to next CF with the same status of waiting on
> author.
Looks like this patch probably still applies and the changes suggested
above seem pretty small, any chance you can provide an updated patch
soon Claudio, so that it can be reviewed properly during this
commitfest?
I failed to notice the comments among the list's noise, sorry.
I'll get on to it now.
On Mon, Nov 27, 2017 at 2:39 AM, Jing Wang <jingwangian@gmail.com> wrote:
> A few general comments.
>
> + FreeSpaceMapVacuum(onerel, 64);
>
> Just want to know why '64' is used here? It's better to give a description.
It's just a random cutoff point.
The point of that vacuum run is to mark free space early, to avoid needless relation extension before long vacuum runs finish. This only happens if the FSM is mostly zeroes; if there's free space in there, it will be used.
To make this run fast, since it's all redundant work before the final FSM vacuum pass, branches with more than that amount of free space are skipped. There's little point in vacuuming those early since they already contain free space. This helps avoid the quadratic cost of vacuuming the FSM every N dead tuples, since already-vacuumed branches will have free space, and will thus be skipped. It could still be quadratic in the worst case, but it avoids it often enough.
As for the value itself, it's an arbitrary choice. In-between index passes, we have a rather precise cutoff we can use to avoid this redundant work. But in the first run, we don't, so I made an arbitrary choice there that felt right. I have no empirical performance data about alternative values.
> + else
> + {
> + newslot = fsm_get_avail(page, 0);
> + }
>
> Since there is only one line in the else the bracket will not be needed. And
> there in one more space ahead of else which should be removed.
Ok
> + if (nindexes == 0 &&
> + (vacuumed_pages_at_fsm_vac - vacuumed_pages) > vacuum_fsm_every_pages)
> + {
> + /* Vacuum the Free Space Map */
> + FreeSpaceMapVacuum(onerel, 0);
> + vacuumed_pages_at_fsm_vac = vacuumed_pages;
> + }
>
> vacuumed_pages_at_fsm_vac is initialised with value zero and seems no chance
> to go into the bracket.
You're right, the difference there is backwards.
Attached patch with the proposed changes.
Attachment
On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
> recursing into branches that already contain enough free space, to
> avoid having to traverse the whole FSM and thus induce quadratic
> costs. Intermediate FSM vacuums are only supposed to make enough
> free space visible to avoid extension until the final (non-partial)
> FSM vacuum.

Hmm, I think this resolve a part of the issue. How about calling
AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified
and give the relid that we were vacuuming but could not complete as a
new autovacuum work-item? The new autovacuum work-item makes the
worker vacuum FSMs of the given relation and its indices. That way, we
can ensure that FSM gets vacuumed by the cancelled autovacuum process
or other autovacuum processes. Since a work-item can be handled by
other autovacuum process I think 256 work-item limit would not be a
problem. Or it might be better to make autovacuum launcher launch
worker process if there is pending work-item in a database even if
there is no table needs to be vacuumed/analyzed.

> For one, it would sometimes crash when adding the work item from
> inside autovacuum itself. I didn't find the cause of the crash, but I suspect
> AutoVacuumRequestWork was being called in a context where it
> was not safe.

Perhaps the cause of this might be that autovacuum work-item is
implemented using DSA, which has been changed before by commit
31ae1638ce35c23979f9bcbb92c6bb51744dbccb.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jan 29, 2018 at 4:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
>> recursing into branches that already contain enough free space, to
>> avoid having to traverse the whole FSM and thus induce quadratic
>> costs. Intermediate FSM vacuums are only supposed to make enough
>> free space visible to avoid extension until the final (non-partial)
>> FSM vacuum.
>
> Hmm, I think this resolve a part of the issue. How about calling
> AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified
> and give the relid that we were vacuuming but could not complete as a
> new autovacuum work-item? The new autovacuum work-item makes the
> worker vacuum FSMs of the given relation and its indices.

Well, I tried that in fact, as I mentioned in the OP.

I abandoned it due to the conjunction of the 2 main blockers I found
and mentioned there. In essence, those issues defeat the purpose of
the patch (to get the free space visible ASAP).

Don't forget, this is aimed at cases where autovacuum of a single
relation takes a very long time. That is, very big relations. Maybe
days, like in my case. A whole autovacuum cycle can take weeks, so
delaying FSM vacuum that much is not good, and using work items still
cause those delays, not to mention the segfaults.

> That way, we
> can ensure that FSM gets vacuumed by the cancelled autovacuum process
> or other autovacuum processes. Since a work-item can be handled by
> other autovacuum process I think 256 work-item limit would not be a
> problem.

Why do you think it wouldn't? In particular if you take into account
the above. If you have more than 256 relations in the cluster, it
could very well happen that you've queued the maximum amount and no
autovac worker has had a chance to take a look at them, because
they're all stuck vacuuming huge relations.

Not to mention the need to de-duplicate work items. We wouldn't want
to request repeated FSM vacuums, or worst, queue an FSM vacuum of a
single table 256 times and fill up the queue with redundant items.
With the current structure, de-duplication is O(N), so if we wanted to
raise the limit of 256 work items, we'd need a structure that would
let us de-duplicate in less than O(N). In essence, it's a ton of work
for very little gain. Hence why I abandoned it.

> Or it might be better to make autovacuum launcher launch
> worker process if there is pending work-item in a database even if
> there is no table needs to be vacuumed/analyzed.

Quite a must, in fact.

>> For one, it would sometimes crash when adding the work item from
>> inside autovacuum itself. I didn't find the cause of the crash, but I suspect
>> AutoVacuumRequestWork was being called in a context where it
>> was not safe.
>
> Perhaps the cause of this might be that autovacuum work-item is
> implemented using DSA, which has been changed before by commit
> 31ae1638ce35c23979f9bcbb92c6bb51744dbccb.

I thought about that. Perhaps. But the work-item approach didn't fail
just because of the segfaults. It's also that it doesn't address the
free space visibility issue quickly enough.
On Mon, Jan 29, 2018 at 11:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 4:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>> Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
>>> recursing into branches that already contain enough free space, to
>>> avoid having to traverse the whole FSM and thus induce quadratic
>>> costs. Intermediate FSM vacuums are only supposed to make enough
>>> free space visible to avoid extension until the final (non-partial)
>>> FSM vacuum.
>>
>> Hmm, I think this resolve a part of the issue. How about calling
>> AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified
>> and give the relid that we were vacuuming but could not complete as a
>> new autovacuum work-item? The new autovacuum work-item makes the
>> worker vacuum FSMs of the given relation and its indices.
>
> Well, I tried that in fact, as I mentioned in the OP.
>
> I abandoned it due to the conjunction of the 2 main blockers I found
> and mentioned there. In essence, those issues defeat the purpose of
> the patch (to get the free space visible ASAP).
>
> Don't forget, this is aimed at cases where autovacuum of a single
> relation takes a very long time. That is, very big relations. Maybe
> days, like in my case. A whole autovacuum cycle can take weeks, so
> delaying FSM vacuum that much is not good, and using work items still
> cause those delays, not to mention the segfaults.

Yeah, I agree to vacuum fsm more frequently because it can prevent
table bloating from concurrent modifying. But considering the way to
prevent the table bloating from cancellation of autovacuum, I guess we
need more things. This proposal seems to provide us an ability that is
we "might be" able to prevent table bloating due to cancellation of
autovacuum. Since we can know that autovacuum is cancelled, I'd like
to have a way so that we can surely vacuum fsm even if vacuum is
cancelled. Thoughts?

Also the patch always vacuums fsm at beginning of the vacuum with a
threshold but it is useless if the table has been properly vacuumed. I
don't think it's good idea to add an additional step that "might be"
efficient, because current vacuum is already heavy.

>> That way, we
>> can ensure that FSM gets vacuumed by the cancelled autovacuum process
>> or other autovacuum processes. Since a work-item can be handled by
>> other autovacuum process I think 256 work-item limit would not be a
>> problem.
>
> Why do you think it wouldn't? In particular if you take into account
> the above. If you have more than 256 relations in the cluster, it
> could very well happen that you've queued the maximum amount and no
> autovac worker has had a chance to take a look at them, because
> they're all stuck vacuuming huge relations.
>
> Not to mention the need to de-duplicate work items. We wouldn't want
> to request repeated FSM vacuums, or worst, queue an FSM vacuum of a
> single table 256 times and fill up the queue with redundant items.
> With the current structure, de-duplication is O(N), so if we wanted to
> raise the limit of 256 work items, we'd need a structure that would
> let us de-duplicate in less than O(N). In essence, it's a ton of work
> for very little gain. Hence why I abandoned it.

I'd missed something. Agreed with you.

On detail of the patch,

--- a/src/backend/storage/freespace/indexfsm.c
+++ b/src/backend/storage/freespace/indexfsm.c
@@ -70,5 +70,5 @@ RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
 void
 IndexFreeSpaceMapVacuum(Relation rel)
 {
-	FreeSpaceMapVacuum(rel);
+	FreeSpaceMapVacuum(rel, 0);
 }

@@ -816,11 +820,19 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p)
 {
 	int		child_avail;

+	/* Tree pruning for partial vacuums */
+	if (threshold)
+	{
+		child_avail = fsm_get_avail(page, slot);
+		if (child_avail >= threshold)
+			continue;
+	}

Don't we skip all fsm pages if we set the threshold to 0?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Feb 1, 2018 at 9:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 11:31 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Mon, Jan 29, 2018 at 4:12 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>>>> Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
>>>> recursing into branches that already contain enough free space, to
>>>> avoid having to traverse the whole FSM and thus induce quadratic
>>>> costs. Intermediate FSM vacuums are only supposed to make enough
>>>> free space visible to avoid extension until the final (non-partial)
>>>> FSM vacuum.
>>>
>>> Hmm, I think this resolve a part of the issue. How about calling
>>> AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified
>>> and give the relid that we were vacuuming but could not complete as a
>>> new autovacuum work-item? The new autovacuum work-item makes the
>>> worker vacuum FSMs of the given relation and its indices.
>>
>> Well, I tried that in fact, as I mentioned in the OP.
>>
>> I abandoned it due to the conjunction of the 2 main blockers I found
>> and mentioned there. In essence, those issues defeat the purpose of
>> the patch (to get the free space visible ASAP).
>>
>> Don't forget, this is aimed at cases where autovacuum of a single
>> relation takes a very long time. That is, very big relations. Maybe
>> days, like in my case. A whole autovacuum cycle can take weeks, so
>> delaying FSM vacuum that much is not good, and using work items still
>> cause those delays, not to mention the segfaults.
>
> Yeah, I agree to vacuum fsm more frequently because it can prevent
> table bloating from concurrent modifying. But considering the way to
> prevent the table bloating from cancellation of autovacuum, I guess we
> need more things. This proposal seems to provide us an ability that is
> we "might be" able to prevent table bloating due to cancellation of
> autovacuum. Since we can know that autovacuum is cancelled, I'd like
> to have a way so that we can surely vacuum fsm even if vacuum is
> cancelled. Thoughts?

After autovacuum gets cancelled, the next time it wakes up it will
retry vacuuming the cancelled relation. That's because a cancelled
autovacuum doesn't update the last-vacuumed stats.

So the timing between an autovacuum work item and the next retry for
that relation is more or less an autovacuum nap time, except perhaps
in the case where many vacuums get cancelled, and they have to be
queued.

That's why an initial FSM vacuum makes sense. It has a similar timing
to the autovacuum work item, it has the benefit that it can be
triggered manually (manual vacuum), and it's cheap and efficient.

> Also the patch always vacuums fsm at beginning of the vacuum with a
> threshold but it is useless if the table has been properly vacuumed. I
> don't think it's good idea to add an additional step that "might be"
> efficient, because current vacuum is already heavy.

FSMs are several orders of magnitude smaller than the tables
themselves. A TB-sized table I have here has a 95M FSM. If you add
threshold skipping, that initial FSM vacuum *will* be efficient.

By definition, the FSM will be less than 1/4000th of the table, so
that initial FSM pass takes less than 1/4000th of the whole vacuum.
Considerably less considering the simplicity of the task.

> On detail of the patch,
>
> --- a/src/backend/storage/freespace/indexfsm.c
> +++ b/src/backend/storage/freespace/indexfsm.c
> @@ -70,5 +70,5 @@ RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
>  void
>  IndexFreeSpaceMapVacuum(Relation rel)
>  {
> -	FreeSpaceMapVacuum(rel);
> +	FreeSpaceMapVacuum(rel, 0);
>  }
>
> @@ -816,11 +820,19 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> bool *eof_p)
>  {
>  	int		child_avail;
>
> +	/* Tree pruning for partial vacuums */
> +	if (threshold)
> +	{
> +		child_avail = fsm_get_avail(page, slot);
> +		if (child_avail >= threshold)
> +			continue;
> +	}
>
> Don't we skip all fsm pages if we set the threshold to 0?

No, that doesn't skip anything if threshold is 0, since it doesn't get
to the continue, which is the one skipping nodes.
On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Feb 1, 2018 at 9:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Yeah, I agree to vacuum fsm more frequently because it can prevent
>> table bloating from concurrent modifying. But considering the way to
>> prevent the table bloating from cancellation of autovacuum, I guess we
>> need more things. This proposal seems to provide us an ability that is
>> we "might be" able to prevent table bloating due to cancellation of
>> autovacuum. Since we can know that autovacuum is cancelled, I'd like
>> to have a way so that we can surely vacuum fsm even if vacuum is
>> cancelled. Thoughts?
>
> After autovacuum gets cancelled, the next time it wakes up it will
> retry vacuuming the cancelled relation. That's because a cancelled
> autovacuum doesn't update the last-vacuumed stats.
>
> So the timing between an autovacuum work item and the next retry for
> that relation is more or less an autovacuum nap time, except perhaps
> in the case where many vacuums get cancelled, and they have to be
> queued.

I think that's not true if there are multiple databases.

> That's why an initial FSM vacuum makes sense. It has a similar timing
> to the autovacuum work item, it has the benefit that it can be
> triggered manually (manual vacuum), and it's cheap and efficient.
>
>> Also the patch always vacuums fsm at beginning of the vacuum with a
>> threshold but it is useless if the table has been properly vacuumed. I
>> don't think it's good idea to add an additional step that "might be"
>> efficient, because current vacuum is already heavy.
>
> FSMs are several orders of magnitude smaller than the tables
> themselves. A TB-sized table I have here has a 95M FSM. If you add
> threshold skipping, that initial FSM vacuum *will* be efficient.
>
> By definition, the FSM will be less than 1/4000th of the table, so
> that initial FSM pass takes less than 1/4000th of the whole vacuum.
> Considerably less considering the simplicity of the task.

I agree the fsm is very smaller than heap and vacuum on fsm will not
be comparatively heavy but I'm afraid that the vacuum will get more
heavy in the future if we pile up such improvement that are small but
might not be efficient. For example, a feature for reporting the last
vacuum status has been proposed[1]. I wonder if we can use it to
determine whether we do the fsm vacuum at beginning of vacuum.

>> On detail of the patch,
>>
>> --- a/src/backend/storage/freespace/indexfsm.c
>> +++ b/src/backend/storage/freespace/indexfsm.c
>> @@ -70,5 +70,5 @@ RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
>>  void
>>  IndexFreeSpaceMapVacuum(Relation rel)
>>  {
>> -	FreeSpaceMapVacuum(rel);
>> +	FreeSpaceMapVacuum(rel, 0);
>>  }
>>
>> @@ -816,11 +820,19 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
>> bool *eof_p)
>>  {
>>  	int		child_avail;
>>
>> +	/* Tree pruning for partial vacuums */
>> +	if (threshold)
>> +	{
>> +		child_avail = fsm_get_avail(page, slot);
>> +		if (child_avail >= threshold)
>> +			continue;
>> +	}
>>
>> Don't we skip all fsm pages if we set the threshold to 0?
>
> No, that doesn't skip anything if threshold is 0, since it doesn't get
> to the continue, which is the one skipping nodes.

Oops, yes you're right. Sorry for the noise.

[1] https://www.postgresql.org/message-id/20171010.192616.108347483.horiguchi.kyotaro%40lab.ntt.co.jp

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> After autovacuum gets cancelled, the next time it wakes up it will
>> retry vacuuming the cancelled relation. That's because a cancelled
>> autovacuum doesn't update the last-vacuumed stats.
>>
>> So the timing between an autovacuum work item and the next retry for
>> that relation is more or less an autovacuum nap time, except perhaps
>> in the case where many vacuums get cancelled, and they have to be
>> queued.
>
> I think that's not true if there are multiple databases.

I'd have to re-check.

The loop is basically, IIRC:

while(1) { vacuum db ; work items ; nap }

Now, if that's vacuum one db, not all, and if the decision on the
second run doesn't pick the same db because that big table failed to
be vacuumed, then you're right.

In that case we could add the FSM vacuum as a work item *in addition*
to what this patch does. If the work item queue is full and the FSM
vacuum doesn't happen, it'd be no worse than with the patch as-is.

Is that what you suggest?

With that in mind, I'm noticing WorkItems have an avw_database that
isn't checked by do_autovacuum. Is that right? Shouldn't only work
items that belong to the database being autovacuumed be processed?

>> That's why an initial FSM vacuum makes sense. It has a similar timing
>> to the autovacuum work item, it has the benefit that it can be
>> triggered manually (manual vacuum), and it's cheap and efficient.
>>
>>> Also the patch always vacuums fsm at beginning of the vacuum with a
>>> threshold but it is useless if the table has been properly vacuumed. I
>>> don't think it's good idea to add an additional step that "might be"
>>> efficient, because current vacuum is already heavy.
>>
>> FSMs are several orders of magnitude smaller than the tables
>> themselves. A TB-sized table I have here has a 95M FSM. If you add
>> threshold skipping, that initial FSM vacuum *will* be efficient.
>>
>> By definition, the FSM will be less than 1/4000th of the table, so
>> that initial FSM pass takes less than 1/4000th of the whole vacuum.
>> Considerably less considering the simplicity of the task.
>
> I agree the fsm is very smaller than heap and vacuum on fsm will not
> be comparatively heavy but I'm afraid that the vacuum will get more
> heavy in the future if we pile up such improvement that are small but
> might not be efficient. For example, a feature for reporting the last
> vacuum status has been proposed[1]. I wonder if we can use it to
> determine whether we do the fsm vacuum at beginning of vacuum.

Yes, such a feature would allow skipping that initial FSM vacuum. That
can be improved in a future patch if that proposed feature gets
merged. This patch can be treated independently from that in the
meanwhile, don't you think?
On Mon, Feb 5, 2018 at 2:55 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > With that in mind, I'm noticing WorkItems have a avw_database that > isn't checked by do_autovacuum. Is that right? Shouldn't only work > items that belong to the database being autovacuumed be processed? NVM. I had to git pull, it's fixed in head.
On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> After autovacuum gets cancelled, the next time it wakes up it will >>> retry vacuuming the cancelled relation. That's because a cancelled >>> autovacuum doesn't update the last-vacuumed stats. >>> >>> So the timing between an autovacuum work item and the next retry for >>> that relation is more or less an autovacuum nap time, except perhaps >>> in the case where many vacuums get cancelled, and they have to be >>> queued. >> >> I think that's not true if there are multiple databases. > > I'd have to re-check. > > The loop is basically, IIRC: > > while(1) { vacuum db ; work items ; nap } > > Now, if that's vacuum one db, not all, and if the decision on the > second run doesn't pick the same db because that big table failed to > be vacuumed, then you're right. > > In that case we could add the FSM vacuum as a work item *in addition* > to what this patch does. If the work item queue is full and the FSM > vacuum doesn't happen, it'd be no worse than with the patch as-is. > > Is that what you suggest? That's one of the my suggestion. I might had to consider this issue for each case. To be clear let me summarize for each case. For table with indices, vacuum on fsm of heap is done after lazy_vacuum_heap(). Therefore I think that the case where a table got vacuumed but fsm couldn't get vacuumed doesn't happen unless the autovacuum gets cancelled before or while vacuuming fsm. So it can solve the problem in most cases. Also we can use a vacuum progress information to check whether we should vacuum fsm of table after got cancelled. For vacuuming fsm of index, we might have to consider to vacuum fsm of index after lazy_vacuum_index. 
For a table with no indexes, it is more complicated. As in the indexed case, we can vacuum the table's FSM more frequently, but table bloat can still happen. To make sure the table's FSM gets vacuumed, there are a few approaches: (1) use an autovacuum work item, (2) vacuum the table's FSM in PG_CATCH, (3) remember which tables got cancelled and vacuum them after finishing the loop over table_oids. For (1), there is the issue that the work-item queue will fill up when many tables get cancelled, and it's not a good idea to queue many redundant work items. For (2), vacuuming the table's FSM immediately after cancellation is not a good approach, because right after the cancellation the table is still likely to be locked by others. For (3), that might work fine, but it can happen that another autovacuum worker vacuums a table's FSM before we process the list. Still, it would be better than always vacuuming them at the beginning of vacuum. So I'm in favor of (3). Even when processing tables in the list, we should take the table lock conditionally so that autovacuum doesn't block any foreground work. However, it's quite possible that I'm not seeing the whole picture here. > > With that in mind, I'm noticing WorkItems have a avw_database that > isn't checked by do_autovacuum. Is that right? Shouldn't only work > items that belong to the database being autovacuumed be processed? >>> That's why an initial FSM vacuum makes sense. It has a similar timing >>> to the autovacuum work item, it has the benefit that it can be >>> triggered manually (manual vacuum), and it's cheap and efficient. >>> >>>> Also the patch always vacuums fsm at beginning of the vacuum with a >>>> threshold but it is useless if the table has been properly vacuumed. I >>>> don't think it's good idea to add an additional step that "might be" >>>> efficient, because current vacuum is already heavy. >>> >>> FSMs are several orders of magnitude smaller than the tables >>> themselves. 
A TB-sized table I have here has a 95M FSM. If you add >>> threshold skipping, that initial FSM vacuum *will* be efficient. >>> >>> By definition, the FSM will be less than 1/4000th of the table, so >>> that initial FSM pass takes less than 1/4000th of the whole vacuum. >>> Considerably less considering the simplicity of the task. >> >> I agree the fsm is very smaller than heap and vacuum on fsm will not >> be comparatively heavy but I'm afraid that the vacuum will get more >> heavy in the future if we pile up such improvement that are small but >> might not be efficient. For example, a feature for reporting the last >> vacuum status has been proposed[1]. I wonder if we can use it to >> determine whether we do the fsm vacuum at beginning of vacuum. > > Yes, such a feature would allow skipping that initial FSM vacuum. That > can be improved in a future patch if that proposed feature gets > merged. This patch can be treated independently from that in the > meanwhile, don't you think? IMO, I'd rather add it after we have such an optimization. Otherwise it will add an extra step to every vacuum until we get the optimization. But, as you said, one can also argue it's not a problem, because the FSM vacuum of the table at the beginning of the vacuum has the threshold. I'd like to hear more opinions about this. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Feb 6, 2018 at 4:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>>> After autovacuum gets cancelled, the next time it wakes up it will >>>> retry vacuuming the cancelled relation. That's because a cancelled >>>> autovacuum doesn't update the last-vacuumed stats. >>>> >>>> So the timing between an autovacuum work item and the next retry for >>>> that relation is more or less an autovacuum nap time, except perhaps >>>> in the case where many vacuums get cancelled, and they have to be >>>> queued. >>> >>> I think that's not true if there are multiple databases. >> >> I'd have to re-check. >> >> The loop is basically, IIRC: >> >> while(1) { vacuum db ; work items ; nap } >> >> Now, if that's vacuum one db, not all, and if the decision on the >> second run doesn't pick the same db because that big table failed to >> be vacuumed, then you're right. >> >> In that case we could add the FSM vacuum as a work item *in addition* >> to what this patch does. If the work item queue is full and the FSM >> vacuum doesn't happen, it'd be no worse than with the patch as-is. >> >> Is that what you suggest? > > That's one of the my suggestion. I might had to consider this issue > for each case. To be clear let me summarize for each case. > > For table with indices, vacuum on fsm of heap is done after > lazy_vacuum_heap(). Therefore I think that the case where a table got > vacuumed but fsm couldn't get vacuumed doesn't happen unless the > autovacuum gets cancelled before or while vacuuming fsm. Well, that's the whole point. Autovacuums get cancelled all the time in highly contended tables. I have more than a handful tables in production that never finish autovacuum, so I have to trigger manual vacuums periodically. 
> (1) using autovacuum work-item, (2) vacuuming fsm of table in > PG_CATCH, (3) remembering the tables got cancelled and vacuuming them > after finished a loop of table_oids. ... > So I'm in favor of (3). ... > However, it's quite possible that I'm not seeing the whole picture here. Well, none of that handles the case where the autovacuum of a table doesn't get cancelled, but takes a very long time. No free space becomes visible during long-running vacuums. That means bloat keeps accumulating even though vacuum is freeing space, because the FSM doesn't expose that free space. The extra work incurred in those FSM vacuums isn't useless, it's preventing bloat in long-running vacuums. I can look into doing 3, that *might* get rid of the need to do that initial FSM vacuum, but all other intermediate vacuums are still needed.
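[Editor's note: to put concrete numbers on the cadence of those intermediate FSM vacuums in the no-index case, the rule proposed in the patch (1/8th of the relation or 8GB worth of pages, whichever is larger) can be sketched as a toy calculation. This is an illustrative Python model, not the patch's C code; BLCKSZ = 8192 is the default PostgreSQL block size.]

```python
BLCKSZ = 8192  # default PostgreSQL block size, in bytes

def fsm_vacuum_interval(nblocks):
    """Pages to process between intermediate FSM vacuums when the
    table has no indexes: 1/8th of the relation or 8GB worth of
    pages, whichever is larger (1048576 pages * 8kB = 8GB)."""
    return max(nblocks // 8, (8 * 1024 ** 3) // BLCKSZ)

# A small table never triggers an intermediate FSM vacuum pass:
print(fsm_vacuum_interval(1000))            # 1048576
# A 16M-page (~128GB) table gets one every 2M pages (~16GB):
print(fsm_vacuum_interval(16 * 1024 ** 2))  # 2097152
```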
On Tue, Feb 6, 2018 at 7:51 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > No free space becomes visible during long-running vacuums. That means > bloat keeps accumulating even though vacuum is freeing space, because > the FSM doesn't expose that free space. > > The extra work incurred in those FSM vacuums isn't useless, it's > preventing bloat in long-running vacuums. +1. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Tue, Feb 6, 2018 at 4:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>>>> After autovacuum gets cancelled, the next time it wakes up it will >>>>> retry vacuuming the cancelled relation. That's because a cancelled >>>>> autovacuum doesn't update the last-vacuumed stats. >>>>> >>>>> So the timing between an autovacuum work item and the next retry for >>>>> that relation is more or less an autovacuum nap time, except perhaps >>>>> in the case where many vacuums get cancelled, and they have to be >>>>> queued. >>>> >>>> I think that's not true if there are multiple databases. >>> >>> I'd have to re-check. >>> >>> The loop is basically, IIRC: >>> >>> while(1) { vacuum db ; work items ; nap } >>> >>> Now, if that's vacuum one db, not all, and if the decision on the >>> second run doesn't pick the same db because that big table failed to >>> be vacuumed, then you're right. >>> >>> In that case we could add the FSM vacuum as a work item *in addition* >>> to what this patch does. If the work item queue is full and the FSM >>> vacuum doesn't happen, it'd be no worse than with the patch as-is. >>> >>> Is that what you suggest? >> >> That's one of the my suggestion. I might had to consider this issue >> for each case. To be clear let me summarize for each case. >> >> For table with indices, vacuum on fsm of heap is done after >> lazy_vacuum_heap(). Therefore I think that the case where a table got >> vacuumed but fsm couldn't get vacuumed doesn't happen unless the >> autovacuum gets cancelled before or while vacuuming fsm. > > Well, that's the whole point. Autovacuums get cancelled all the time > in highly contended tables. 
I have more than a handful tables in > production that never finish autovacuum, so I have to trigger manual > vacuums periodically. > >> (1) using autovacuum work-item, (2) vacuuming fsm of table in >> PG_CATCH, (3) remembering the tables got cancelled and vacuuming them >> after finished a loop of table_oids. > ... >> So I'm in favor of (3). > ... >> However, it's quite possible that I'm not seeing the whole picture here. > > Well, none of that handles the case where the autovacuum of a table > doesn't get cancelled, but takes a very long time. > > No free space becomes visible during long-running vacuums. That means > bloat keeps accumulating even though vacuum is freeing space, because > the FSM doesn't expose that free space. > > The extra work incurred in those FSM vacuums isn't useless, it's > preventing bloat in long-running vacuums. > > I can look into doing 3, that *might* get rid of the need to do that > initial FSM vacuum, but all other intermediate vacuums are still > needed. Understood. So how about having this patch focus only on making FSM vacuum more frequent, and leaving the initial FSM vacuum and the cancellation-handling cases for later? The rest can be a separate patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> I can look into doing 3, that *might* get rid of the need to do that >> initial FSM vacuum, but all other intermediate vacuums are still >> needed. > > Understood. So how about that this patch focuses only make FSM vacuum > more frequently and leaves the initial FSM vacuum and the handling > cancellation cases? The rest can be a separate patch. Ok. Attached split patches. All but the initial FSM pass is in 1, the initial FSM pass as in the original patch is in 2. I will post an alternative patch for 2 when/if I have one. In the meanwhile, we can talk about 1.
On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> I can look into doing 3, that *might* get rid of the need to do that >>> initial FSM vacuum, but all other intermediate vacuums are still >>> needed. >> >> Understood. So how about that this patch focuses only make FSM vacuum >> more frequently and leaves the initial FSM vacuum and the handling >> cancellation cases? The rest can be a separate patch. > > Ok. > > Attached split patches. All but the initial FSM pass is in 1, the > initial FSM pass as in the original patch is in 2. > > I will post an alternative patch for 2 when/if I have one. In the > meanwhile, we can talk about 1. Thank you for updating the patch! + /* Tree pruning for partial vacuums */ + if (threshold) + { + child_avail = fsm_get_avail(page, slot); + if (child_avail >= threshold) + continue; + } I think slots in a non-leaf page might not have correct values, because fsm_set_avail doesn't propagate up beyond pages. So do we need to check the root of the child page rather than a slot in the non-leaf page? I might be missing something, though. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>>> I can look into doing 3, that *might* get rid of the need to do that >>>> initial FSM vacuum, but all other intermediate vacuums are still >>>> needed. >>> >>> Understood. So how about that this patch focuses only make FSM vacuum >>> more frequently and leaves the initial FSM vacuum and the handling >>> cancellation cases? The rest can be a separate patch. >> >> Ok. >> >> Attached split patches. All but the initial FSM pass is in 1, the >> initial FSM pass as in the original patch is in 2. >> >> I will post an alternative patch for 2 when/if I have one. In the >> meanwhile, we can talk about 1. > > Thank you for updating the patch! > > + /* Tree pruning for partial vacuums */ > + if (threshold) > + { > + child_avail = fsm_get_avail(page, slot); > + if (child_avail >= threshold) > + continue; > + } > > I think slots in a non-leaf page might not have a correct value > because fsm_set_avail doesn't propagate up beyond pages. So do we need > to check the root of child page rather than a slot in the non-lead > page? I might be missing something though. I'm not sure I understand what you're asking. Yes, a partial FSM vacuum won't correct those inconsistencies. It doesn't matter though, because as long as the root node (or whichever inner node we're looking at the time) reports free space, the lookup for free space will recurse and find the lower nodes, even if they report more space than the inner node reported. The goal of making free space discoverable will have been achieved nonetheless. The final FSM vacuum pass isn't partial, to finally correct all those small inconsistencies.
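[Editor's note: the self-correcting search behavior Claudio describes can be made concrete with a toy model. This is illustrative Python, not the actual C implementation (which stores slots inside binary-tree FSM pages): when an inner node over-reports free space, the search descends, fails, corrects the stale value, and continues until it finds real free space elsewhere.]

```python
class Node:
    def __init__(self, value=0, children=()):
        self.value = value          # FSM category advertised to the parent
        self.children = list(children)

def search(node, needed):
    """Descend toward a leaf advertising >= needed. If an inner node
    over-reported, correct it (as fsm_search() does) and keep looking."""
    if not node.children:
        return node if node.value >= needed else None
    for child in node.children:
        if child.value >= needed:
            found = search(child, needed)
            if found is not None:
                return found
            # The child's value didn't reflect its subtree: update it
            # so we don't fall into the same trap again.
            child.value = max(c.value for c in child.children)
    return None

# An inner node stale at 255 while its leaves are empty; real free
# space lives in the second branch.
stale = Node(255, [Node(0), Node(0)])
good = Node(200, [Node(200), Node(0)])
root = Node(255, [stale, good])
leaf = search(root, 128)
print(leaf.value)   # 200; 'stale' was corrected to 0 along the way
```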
On Fri, Feb 9, 2018 at 11:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>>>> I can look into doing 3, that *might* get rid of the need to do that >>>>> initial FSM vacuum, but all other intermediate vacuums are still >>>>> needed. >>>> >>>> Understood. So how about that this patch focuses only make FSM vacuum >>>> more frequently and leaves the initial FSM vacuum and the handling >>>> cancellation cases? The rest can be a separate patch. >>> >>> Ok. >>> >>> Attached split patches. All but the initial FSM pass is in 1, the >>> initial FSM pass as in the original patch is in 2. >>> >>> I will post an alternative patch for 2 when/if I have one. In the >>> meanwhile, we can talk about 1. >> >> Thank you for updating the patch! >> >> + /* Tree pruning for partial vacuums */ >> + if (threshold) >> + { >> + child_avail = fsm_get_avail(page, slot); >> + if (child_avail >= threshold) >> + continue; >> + } >> >> I think slots in a non-leaf page might not have a correct value >> because fsm_set_avail doesn't propagate up beyond pages. So do we need >> to check the root of child page rather than a slot in the non-lead >> page? I might be missing something though. > > I'm not sure I understand what you're asking. > > Yes, a partial FSM vacuum won't correct those inconsistencies. It > doesn't matter though, because as long as the root node (or whichever > inner node we're looking at the time) reports free space, the lookup > for free space will recurse and find the lower nodes, even if they > report more space than the inner node reported. The goal of making > free space discoverable will have been achieved nonetheless. 
For example, if a slot in an internal node of the FSM tree has a value greater than the threshold while the root of the corresponding leaf page has a value less than the threshold, we want to update the internal node but will skip it with your patch. I wonder if this can happen. > The final FSM vacuum pass isn't partial, to finally correct all those > small inconsistencies. Yes, but isn't the purpose of this patch to prevent table bloat from concurrent modifications? Here are some other review comments. + /* Tree pruning for partial vacuums */ + if (threshold) + { + child_avail = fsm_get_avail(page, slot); + if (child_avail >= threshold) + continue; + } 'child_avail' is a category value while 'threshold' is an actual size of free space. The logic for finding max_freespace doesn't seem to work correctly for a table with indexes, because max_freespace is not updated based on post-compaction free space. I think we need to get the max free space size while vacuuming the heap (i.e. at lazy_vacuum_heap) and use that as the threshold. + vacuum_fsm_every_pages = nblocks / 8; + vacuum_fsm_every_pages = Max(vacuum_fsm_every_pages, 1048576); I think a comment for 1048576 is needed. And perhaps we can define it as a macro. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Feb 9, 2018 at 11:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >>>> On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>>>> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>>>>> I can look into doing 3, that *might* get rid of the need to do that >>>>>> initial FSM vacuum, but all other intermediate vacuums are still >>>>>> needed. >>>>> >>>>> Understood. So how about that this patch focuses only make FSM vacuum >>>>> more frequently and leaves the initial FSM vacuum and the handling >>>>> cancellation cases? The rest can be a separate patch. >>>> >>>> Ok. >>>> >>>> Attached split patches. All but the initial FSM pass is in 1, the >>>> initial FSM pass as in the original patch is in 2. >>>> >>>> I will post an alternative patch for 2 when/if I have one. In the >>>> meanwhile, we can talk about 1. >>> >>> Thank you for updating the patch! >>> >>> + /* Tree pruning for partial vacuums */ >>> + if (threshold) >>> + { >>> + child_avail = fsm_get_avail(page, slot); >>> + if (child_avail >= threshold) >>> + continue; >>> + } >>> >>> I think slots in a non-leaf page might not have a correct value >>> because fsm_set_avail doesn't propagate up beyond pages. So do we need >>> to check the root of child page rather than a slot in the non-lead >>> page? I might be missing something though. >> >> I'm not sure I understand what you're asking. >> >> Yes, a partial FSM vacuum won't correct those inconsistencies. 
It >> doesn't matter though, because as long as the root node (or whichever >> inner node we're looking at the time) reports free space, the lookup >> for free space will recurse and find the lower nodes, even if they >> report more space than the inner node reported. The goal of making >> free space discoverable will have been achieved nonetheless. > > For example, if a slot of internal node of fsm tree has a value > greater than the threshold while the root of leaf node of fsm tree has > a value less than threshold, we want to update the internal node of > fsm tree but we will skip it with your patch. I wonder if this can > happen. If I understand your point correctly, that would mean that space was used since the last vacuum. Partial FSM vacuums don't concern themselves with that case, the normal free space search algorithm will handle that, retrying with the next slot until it succeeds (or until it gives up). IIUC, each time a failure of such kind is found while searching, the FSM gets updated to avoid following that link a second time. See, in fsm_search: /* * At lower level, failure can happen if the value in the upper- * level node didn't reflect the value on the lower page. Update * the upper node, to avoid falling into the same trap again, and * start over. > >> The final FSM vacuum pass isn't partial, to finally correct all those >> small inconsistencies. > > Yes, but the purpose of this patch is to prevent table bloating from > concurrent modifications? Yes, by *marking* unmarked space. If the slot is above the threshold, it means there's already marked free space on that branch, and search will go into it already, so no pressing need to refine the mark. The space already marked can serve while vacuum makes further progress. Ie: it can wait. > Here is other review comments. 
> > + /* Tree pruning for partial vacuums */ > + if (threshold) > + { > + child_avail = fsm_get_avail(page, slot); > + if (child_avail >= threshold) > + continue; > + } > > 'child_avail' is a category value while 'threshold' is an actual size > of freespace. Good catch. It was meant to be a category, but I see the public interface of the FSM doesn't usually expose categories, so fixing that. > The logic of finding the max_freespace seems not work fine if table > with indices because max_freespace is not updated based on > post-compaction freespace. I think we need to get the max freespace > size during vacuuming heap (i.g. at lazy_vacuum_heap) and use it as > the threshold. Good point. > > + vacuum_fsm_every_pages = nblocks / 8; > + vacuum_fsm_every_pages = Max(vacuum_fsm_every_pages, 1048576); > > I think a comment for 1048576 is needed. And perhaps we can define it > as a macro. 1M pages = 8GB with an 8kb page size. I can clarify. Attached are new versions of the patches with these corrections.
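[Editor's note: the category/size mix-up found in review implies a conversion step: a threshold expressed in bytes has to be mapped to an FSM category before it can be compared against a stored category like child_avail. Below is an illustrative Python rendering of the usual FSM encoding (8kB pages divided into 256 categories of 32 bytes each); the real functions live in freespace.c and behave slightly differently at the edges, e.g. rejecting oversized requests rather than clamping.]

```python
BLCKSZ = 8192
FSM_CATEGORIES = 256
FSM_CAT_STEP = BLCKSZ // FSM_CATEGORIES   # 32 bytes per category

def fsm_space_avail_to_cat(avail):
    """Category for available space: rounds down, so a page's
    category never over-promises."""
    return min(avail // FSM_CAT_STEP, FSM_CATEGORIES - 1)

def fsm_space_needed_to_cat(needed):
    """Smallest category guaranteeing at least 'needed' bytes:
    rounds up (clamped here for simplicity)."""
    if needed == 0:
        return 1
    return min((needed + FSM_CAT_STEP - 1) // FSM_CAT_STEP,
               FSM_CATEGORIES - 1)

# A byte threshold must go through needed_to_cat before being
# compared against a stored category:
print(fsm_space_avail_to_cat(95))    # 2
print(fsm_space_needed_to_cat(95))   # 3
```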
On Thu, Feb 15, 2018 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >>> The final FSM vacuum pass isn't partial, to finally correct all those >>> small inconsistencies. >> >> Yes, but the purpose of this patch is to prevent table bloating from >> concurrent modifications? > > Yes, by *marking* unmarked space. If the slot is above the threshold, > it means there's already marked free space on that branch, and search > will go into it already, so no pressing need to refine the mark. The > space already marked can serve while vacuum makes further progress. > Ie: it can wait. Although... I think I see what you mean. If you have this: 255 . 0 . . 0 . . 255 . 0 . . 64 . . 64 . 0 . . 0 . . 64 A partial vacuum won't even recurse beyond the root node, so it won't expose the free space 2 levels down. This could be arrived at by having an almost fully packed table that contains some free space near the end, and it gets deletions near the beginning. Vacuum will mark the leaf levels at the beginning of the table, and we'd like for that free space to be discoverable to avoid packing more rows near the end where they prevent truncation. The patch as it is now will only vacuum that part of the FSM when the root value drops, which usually requires all the free space on that region of the heap to be exhausted. With the current recursive FSM vacuum method, however, that's unavoidable. We can't detect that there's some free space below the current level to be exposed without recursing into the tree, and recursing is exactly the kind of work we want to prevent with tree pruning. In all my trials, however, I did not run into that case. It must not be easy to get into that situation, because the root node already has ~4000 slots in it. 
Each of those 4000 slots should be in that situation to keep the space at the end of the table as the sole discoverable free space, which seems like a rather contorted worst case.
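[Editor's note: the pathological shape described above can be reproduced in a toy model. This is illustrative Python, not the C code: with pruning enabled, a root that already advertises enough space stops the vacuum from fixing the stale zeros one level down, so the leaf-level free space stays hidden until a full (threshold-less) pass.]

```python
class Node:
    def __init__(self, value=0, children=()):
        self.value = value
        self.children = list(children)

def fsm_vacuum(node, threshold=0):
    """Recompute inner values bottom-up. threshold > 0 enables
    pruning: branches already advertising >= threshold are skipped."""
    if not node.children:
        return node.value
    if threshold and node.value >= threshold:
        return node.value          # pruned: stale children left as-is
    node.value = max(fsm_vacuum(c, threshold) for c in node.children)
    return node.value

# The tree from the example: root says 255, the middle level says 0,
# but the leaves do hold free space.
root = Node(255, [
    Node(0, [Node(0), Node(255)]),
    Node(0, [Node(64), Node(64)]),
    Node(0, [Node(0), Node(0), Node(64)]),
])
fsm_vacuum(root, threshold=64)
print([c.value for c in root.children])  # [0, 0, 0] -- pruned at the root
fsm_vacuum(root)                         # the final, non-partial pass
print([c.value for c in root.children])  # [255, 64, 64]
```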
On Fri, Feb 16, 2018 at 5:00 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Thu, Feb 15, 2018 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > >>>> The final FSM vacuum pass isn't partial, to finally correct all those >>>> small inconsistencies. >>> >>> Yes, but the purpose of this patch is to prevent table bloating from >>> concurrent modifications? >> >> Yes, by *marking* unmarked space. If the slot is above the threshold, >> it means there's already marked free space on that branch, and search >> will go into it already, so no pressing need to refine the mark. The >> space already marked can serve while vacuum makes further progress. >> Ie: it can wait. > > Although... I think I see what you mean. > > If you have this: > > 255 > . 0 > . . 0 > . . 255 > . 0 > . . 64 > . . 64 > . 0 > . . 0 > . . 64 > > > A partial vacuum won't even recurse beyond the root node, so it won't > expose the free space 2 levels down. Yeah, that's what I meant. I think this can happen somewhat more easily if a table has fillfactor < 100: for example, when tuples on pages that are packed up to the fillfactor limit get updated with larger values. > > This could be arrived at by having an almost fully packed table that > contains some free space near the end, and it gets deletions near the > beginning. Vacuum will mark the leaf levels at the beginning of the > table, and we'd like for that free space to be discoverable to avoid > packing more rows near the end where they prevent truncation. > > The patch as it is now will only vacuum that part of the FSM when the > root value drops, which usually requires all the free space on that > region of the heap to be exhausted. > > With the current recursive FSM vacuum method, however, that's > unavoidable. 
We can't detect that there's some free space below the > current level to be exposed without recursing into the tree, and > recursing is exactly the kind of work we want to prevent with tree > pruning. > > In all my trials, however, I did not run into that case. It must not > be easy to get into that situation, because the root node already has > ~4000 slots in it. Each of those 4000 slots should be in that > situation to keep the space at the end of the table as the sole > discoverable free space, which seems like a rather contorted worst > case. Okay, I guess this patch cannot help in the case where the free space shown in the FSM for a page gets decreased by vacuum while the upper-level node doesn't reflect the value on the lower page. However, as you said, that doesn't happen often, and even when it does we are no worse off than we are today. So I also think it would not become a problem. I'll review the v4 patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
On Tue, Feb 20, 2018 at 5:04 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Fri, Feb 16, 2018 at 5:00 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Thu, Feb 15, 2018 at 4:47 PM, Claudio Freire <klaussfreire@gmail.com> wrote: >>> On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> >>>>> The final FSM vacuum pass isn't partial, to finally correct all those >>>>> small inconsistencies. >>>> >>>> Yes, but the purpose of this patch is to prevent table bloating from >>>> concurrent modifications? >>> >>> Yes, by *marking* unmarked space. If the slot is above the threshold, >>> it means there's already marked free space on that branch, and search >>> will go into it already, so no pressing need to refine the mark. The >>> space already marked can serve while vacuum makes further progress. >>> Ie: it can wait. >> >> Although... I think I see what you mean. >> >> If you have this: >> >> 255 >> . 0 >> . . 0 >> . . 255 >> . 0 >> . . 64 >> . . 64 >> . 0 >> . . 0 >> . . 64 >> >> >> A partial vacuum won't even recurse beyond the root node, so it won't >> expose the free space 2 levels down. > > Yeah, that's what I meant. I think this might be able to happen > slightly easily if a tables has fillfactor < 100. For example, > updating tuples on pages that are almost packed except for fillfactor > with the more bigger value > >> >> This could be arrived at by having an almost fully packed table that >> contains some free space near the end, and it gets deletions near the >> beginning. Vacuum will mark the leaf levels at the beginning of the >> table, and we'd like for that free space to be discoverable to avoid >> packing more rows near the end where they prevent truncation. >> >> The patch as it is now will only vacuum that part of the FSM when the >> root value drops, which usually requires all the free space on that >> region of the heap to be exhausted. 
>> >> With the current recursive FSM vacuum method, however, that's >> unavoidable. We can't detect that there's some free space below the >> current level to be exposed without recursing into the tree, and >> recursing is exactly the kind of work we want to prevent with tree >> pruning. >> >> In all my trials, however, I did not run into that case. It must not >> be easy to get into that situation, because the root node already has >> ~4000 slots in it. Each of those 4000 slots should be in that >> situation to keep the space at the end of the table as the sole >> discoverable free space, which seems like a rather contorted worst >> case. > > Okay, I guess that this patch cannot help in the case where the > freespace of pages shown on fsm gets decreased by vacuum because the > upper-level node doesn't reflect the value on the lower page. However > it doesn't happen often as you said and it's the same as it is even > though it happens. So I also think it would not become a problem. > > I'll review v4 patch. > Here are review comments for the v4 patch. @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats) * We don't insert a vacuum delay point here, because we have an * exclusive lock on the table which we want to hold for as short a * time as possible. We still need to check for interrupts however. + * We might have to acquire the autovacuum lock, however, but that + * shouldn't pose a deadlock risk. */ CHECK_FOR_INTERRUPTS(); Is this change necessary? ---- + /* + * If there are no indexes then we should periodically vacuum the FSM + * on huge relations to make free space visible early. 
+ */ + if (nindexes == 0 && + (vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages) + { + /* Vacuum the Free Space Map */ + FreeSpaceMapVacuum(onerel, max_freespace); + vacuumed_pages_at_fsm_vac = vacuumed_pages; + max_freespace = 0; + } I think this code block should be executed before we check whether the page is new or empty and then 'continue'. Otherwise we cannot reach this code if the table has a lot of new or empty pages. ---- @@ -785,7 +789,7 @@ fsm_search(Relation rel, uint8 min_cat) * Recursive guts of FreeSpaceMapVacuum */ static uint8 -fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p) +fsm_vacuum_page(Relation rel, FSMAddress addr, uint8 threshold, bool *eof_p) { Buffer buf; Page page; I think a comment for 'threshold' is needed. Maybe for FreeSpaceMapVacuum as well? ---- @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks) * If minValue > 0, the updated page is also searched for a page with at * least minValue of free space. If one is found, its slot number is * returned, -1 otherwise. + * + * If minValue == 0, the value at the root node is returned. */ static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, addr.level == FSM_BOTTOM_LEVEL, true); } + else + newslot = fsm_get_avail(page, 0); I think it's not a good idea to make fsm_set_and_search return either a slot number or a category value depending on the argument. Category values are actually uint8, but this function returns one as int. Also, we can call fsm_set_and_search with minValue = 0 at RecordAndGetPageWithFreeSpace() when search_cat is 0. With your patch, fsm_set_and_search() then returns the root value, but that's not correct. Also, is this change necessary for this patch? ISTM this change is only used for the change in fsm_update_recursive() below, so I guess it could be a separate patch. 
+ new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0); + + /* + * Bubble up, not the value we just set, but the one now in the root + * node of the just-updated page, which is the page's highest value. + */ Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
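[Editor's note: for readers following the thread, the pruning behavior under discussion can be modeled in a few lines of standalone C. This is a toy sketch, not the patch's code: the names are invented, and a binary tree stands in for real FSM pages, which hold roughly 4000 slots each. It shows both the saving (a branch already advertising at least `threshold` free space is never entered) and the caveat discussed above (a pruned branch may hide a larger value below its stale cached one, which only a full, unpruned pass would expose).]

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Node
{
    uint8_t     avail;          /* cached max free-space category of subtree */
    struct Node *left;          /* NULL at the leaf level */
    struct Node *right;
    int         visited;        /* instrumentation: did vacuum enter this node? */
} Node;

/*
 * Recompute stale upper-level values bottom-up, but prune: a child that
 * already advertises at least `threshold` free space is trusted as-is.
 * A pruned subtree may still hide a larger value below its cached one;
 * only an unpruned pass (threshold above the maximum category) finds it.
 */
static uint8_t
vacuum_node(Node *n, int threshold)
{
    n->visited = 1;
    if (n->left == NULL)
        return n->avail;        /* leaf values are authoritative */

    uint8_t     l = (n->left->avail >= threshold)
        ? n->left->avail : vacuum_node(n->left, threshold);
    uint8_t     r = (n->right->avail >= threshold)
        ? n->right->avail : vacuum_node(n->right, threshold);

    n->avail = l > r ? l : r;   /* bubble the subtree maximum up */
    return n->avail;
}
```

In the patch itself the threshold is derived from the largest amount of free space the current VACUUM has produced, which is what makes intermediate FSM passes cheap while still exposing enough space to avoid relation extension.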
On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Here are some review comments for the v4 patch.
>
> @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
>          * We don't insert a vacuum delay point here, because we have an
>          * exclusive lock on the table which we want to hold for as short a
>          * time as possible.  We still need to check for interrupts however.
> +        * We might have to acquire the autovacuum lock, however, but that
> +        * shouldn't pose a deadlock risk.
>          */
>         CHECK_FOR_INTERRUPTS();
>
> Is this change necessary?

I don't recall making that change. Maybe a rebase gone wrong.

> ----
> +       /*
> +        * If there are no indexes then we should periodically vacuum the FSM
> +        * on huge relations to make free space visible early.
> +        */
> +       if (nindexes == 0 &&
> +           (vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages)
> +       {
> +           /* Vacuum the Free Space Map */
> +           FreeSpaceMapVacuum(onerel, max_freespace);
> +           vacuumed_pages_at_fsm_vac = vacuumed_pages;
> +           max_freespace = 0;
> +       }
>
> I think this code block should be executed before we check whether the
> page is new or empty and 'continue'. Otherwise we cannot reach this code
> if the table has a lot of new or empty pages.

In order for the counter (vacuumed_pages) to increase, there have to
be plenty of opportunities for this code to run, and I specifically
wanted to avoid vacuuming the FSM too often in precisely those cases
(when vacuum scans lots of pages but does no writes).

> ----
> @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
>  * If minValue > 0, the updated page is also searched for a page with at
>  * least minValue of free space. If one is found, its slot number is
>  * returned, -1 otherwise.
> + *
> + * If minValue == 0, the value at the root node is returned.
>  */
>  static int
>  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
>                                      addr.level == FSM_BOTTOM_LEVEL,
>                                      true);
>  }
> +   else
> +       newslot = fsm_get_avail(page, 0);
>
> I think it's not a good idea to make fsm_set_and_search return either a
> slot number or a category value depending on the argument. Category
> values are actually uint8, but this function returns them as int.

Should be fine; a uint8 can be contained inside an int on all platforms.

> Also, we can call fsm_set_and_search with minValue = 0 from
> RecordAndGetPageWithFreeSpace() when search_cat is 0. With your patch,
> fsm_set_and_search() then returns the root value, but that's not correct.

I guess I can add another function to do that.

> Also, is this change necessary for this patch? ISTM this change is only
> used for the change in fsm_update_recursive() below, so I guess it could
> be a separate patch.
>
> +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
> +
> +   /*
> +    * Bubble up, not the value we just set, but the one now in the root
> +    * node of the just-updated page, which is the page's highest value.
> +    */

I can try to separate them, I guess.
On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> ----
>> +       /*
>> +        * If there are no indexes then we should periodically vacuum the FSM
>> +        * on huge relations to make free space visible early.
>> +        */
>> +       if (nindexes == 0 &&
>> +           (vacuumed_pages - vacuumed_pages_at_fsm_vac) > vacuum_fsm_every_pages)
>> +       {
>> +           /* Vacuum the Free Space Map */
>> +           FreeSpaceMapVacuum(onerel, max_freespace);
>> +           vacuumed_pages_at_fsm_vac = vacuumed_pages;
>> +           max_freespace = 0;
>> +       }
>>
>> I think this code block should be executed before we check whether the
>> page is new or empty and 'continue'. Otherwise we cannot reach this code
>> if the table has a lot of new or empty pages.
>
> In order for the counter (vacuumed_pages) to increase, there have to
> be plenty of opportunities for this code to run, and I specifically
> wanted to avoid vacuuming the FSM too often in precisely those cases
> (when vacuum scans lots of pages but does no writes).

Wait, I see what you mean. Entirely empty pages. Ok, I can move it.
On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> Here are some review comments for the v4 patch.
>> [...]
>
> [...]

Patch set with the requested changes attached.

2 didn't change AFAIK, it's just rebased on top of the new 1.
Attachment
On Mon, Feb 26, 2018 at 1:32 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>>> [...]
>>
>> [...]
>
> Patch set with the requested changes attached.
>
> 2 didn't change AFAIK, it's just rebased on top of the new 1.

Oops. Forgot about the comment changes requested. Fixed those in v6, attached.
Attachment
On Tue, Feb 27, 2018 at 1:44 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Feb 26, 2018 at 1:32 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> [...]
>>
>> Patch set with the requested changes attached.
>>
>> 2 didn't change AFAIK, it's just rebased on top of the new 1.
>
> Oops. Forgot about the comment changes requested. Fixed those in v6, attached.

Thank you for updating the patches!

The 0001 patch looks good to me except for the following unnecessary empty lines.

+                * If there are no indexes then we should periodically vacuum the FSM
+                * on huge relations to make free space visible early.
+                */
+               else if (nindexes == 0 && fsm_updated_pages > vacuum_fsm_every_pages)
+               {
+                       /* Vacuum the Free Space Map */
+                       FreeSpaceMapVacuum(onerel, max_freespace);
+                       fsm_updated_pages = 0;
+                       max_freespace = 0;
+               }
+
+
+               /*

@@ -913,10 +967,14 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                                  vmbuffer, InvalidTransactionId,
                                  VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
                END_CRIT_SECTION();
+
        }

The 0002 patch looks good to me.

For the 0003 patch, I think the fsm_set() function name could be
misleading, because it not only sets the value but also returns it.
fsm_set_and_get() might be better?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Feb 26, 2018 at 10:20 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patches!
>
> The 0001 patch looks good to me except for the following unnecessary empty lines.
> [...]
>
> The 0002 patch looks good to me.
>
> For the 0003 patch, I think the fsm_set() function name could be
> misleading, because it not only sets the value but also returns it.
> fsm_set_and_get() might be better?

Attached is the patchset modified as requested.
Attachment
On Tue, Feb 6, 2018 at 4:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> For vacuuming the FSM of indexes, we might have to consider vacuuming
> the index FSM after lazy_vacuum_index.

I've been thinking about that, and I think you're right. So here's a
fourth patch that adds that to nbtree's bulkdelete implementation, which
seems to be the best place to add such a thing.

GIN and GiST don't delete pages until vacuumcleanup, so they can't do
the same, sadly.
Attachment
On Wed, Feb 28, 2018 at 12:06 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Mon, Feb 26, 2018 at 10:20 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> [...]
>
> Attached is the patchset modified as requested.

Thank you for updating the patches!

+/*
+ * When a table has no indexes, vacuum the FSM at most every this
+ * many dirty pages. With a default page size of 8kb, this value
+ * basically means 8GB of dirtied pages.
+ */
+#define VACUUM_FSM_EVERY_PAGES 1048576

Is it better to vacuum the FSM every 8GB regardless of the block size?
An installation that uses a block size larger than the default 8kB has
fewer pages for the same amount of data, so with a fixed page count the
current patch would delay FSM vacuuming on such installations. What do
you think? If that's better, we can define it as follows.

#define VACUUM_FSM_EVERY_PAGES ((8 * 1024 * 1024) / BLCKSZ)

----

@@ -470,7 +484,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
     TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
     TransactionId relminmxid = onerel->rd_rel->relminmxid;
     BlockNumber empty_pages,
-                vacuumed_pages;
+                vacuumed_pages,
+                fsm_updated_pages,
+                vacuum_fsm_every_pages;
     double      num_tuples,
                 tups_vacuumed,
                 nkeep,

Regarding the fsm_updated_pages variable name, I think
freespace_updated_pages would be better, because we actually count pages
whose free space was updated, not FSM pages.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Mar 2, 2018 at 7:38 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Thank you for updating the patches!
>
> +/*
> + * When a table has no indexes, vacuum the FSM at most every this
> + * many dirty pages. With a default page size of 8kb, this value
> + * basically means 8GB of dirtied pages.
> + */
> +#define VACUUM_FSM_EVERY_PAGES 1048576
>
> Is it better to vacuum the FSM every 8GB regardless of the block size?
> [...] If that's better, we can define it as follows.
>
> #define VACUUM_FSM_EVERY_PAGES ((8 * 1024 * 1024) / BLCKSZ)

I honestly don't know. I've never tweaked the page size, and know nobody
who did, so I have no experience with those installations.

Let's go with your proposal.

> Regarding the fsm_updated_pages variable name, I think
> freespace_updated_pages would be better, because we actually count pages
> whose free space was updated, not FSM pages.

Good point.

New version attached.
Attachment
On Fri, Mar 2, 2018 at 10:47 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 7:38 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> [...]
>
> [...]
>
> New version attached.

Sorry, forgot to actually squash. Now the real new version is attached.
Attachment
On Fri, Mar 2, 2018 at 10:50 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 10:47 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> [...]
>>
>> New version attached.
>
> Sorry, forgot to actually squash. Now the real new version is attached.

Thank you for updating. I think the patches have had enough discussion
and are of sufficient quality to go to a committer's review. I've marked
them as Ready for Committer.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Mar 5, 2018 at 10:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 10:50 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> [...]
>>
>> Sorry, forgot to actually squash. Now the real new version is attached.
>
> Thank you for updating. I think the patches have had enough discussion
> and are of sufficient quality to go to a committer's review. I've marked
> them as Ready for Committer.

Oops, the following comment needs to be updated. Since it would be better
to defer the decision on this behavior to committers, I'll keep the
status of this patch as Ready for Committer.

+/*
+ * When a table has no indexes, vacuum the FSM at most every this
+ * many dirty pages. With a default page size of 8kb, this value
+ * basically means 8GB of dirtied pages.
+ */
+#define VACUUM_FSM_EVERY_PAGES ((8 * 1024 * 1024) / BLCKSZ)
+

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sun, Mar 4, 2018 at 10:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Mar 5, 2018 at 10:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>> [...]
>
> Oops, the following comment needs to be updated. Since it would be better
> to defer the decision on this behavior to committers, I'll keep the
> status of this patch as Ready for Committer.
>
> +/*
> + * When a table has no indexes, vacuum the FSM at most every this
> + * many dirty pages. With a default page size of 8kb, this value
> + * basically means 8GB of dirtied pages.
> + */
> +#define VACUUM_FSM_EVERY_PAGES ((8 * 1024 * 1024) / BLCKSZ)
> +

I just noticed that it actually computes 8MB (not GB) of pages.
Attached a fix for that and the comment update.
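[Editor's note: the unit slip is easy to check with plain arithmetic. The macro names below are invented for illustration; only BLCKSZ mirrors PostgreSQL's actual symbol. At the default 8kB block size, ((8 * 1024 * 1024) / BLCKSZ) is only 1024 pages, i.e. 8MB worth of pages, while covering 8GB requires the 1048576 of the original constant.]

```c
#include <stdint.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size, 8kB */

/* The expression from the earlier revisions: pages covering 8MB, not 8GB */
#define FSM_EVERY_PAGES_8MB ((8 * 1024 * 1024) / BLCKSZ)

/* Pages actually needed to cover 8GB of dirtied data at this block size */
#define FSM_EVERY_PAGES_8GB ((UINT64_C(8) * 1024 * 1024 * 1024) / BLCKSZ)
```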
Attachment
Claudio Freire <klaussfreire@gmail.com> writes:
> [ 0001-Vacuum-Update-FSM-more-frequently-v9.patch ]

I hadn't paid any attention to this patch previously, so maybe I'm
missing something ... but this sure seems like a very bizarre approach
to the problem.  If the idea is to fix the FSM's upper levels after
vacuuming a known sub-range of the table, why would you not proceed
by teaching FreeSpaceMapVacuum to recurse only within that sub-range
of page numbers?  This setup with a threshold seems entirely Rube
Goldbergian.  It's dependent on a magic threshold number that you can
only select by trial and error, and it's inevitably going to spend time
updating segments of the FSM tree that have nothing to do with the part
that's been vacuumed.

This approach would also offer a less arbitrary way to decide how often
to do the updates: choose a distance that has something to do with the
FSM's structure, so that we don't waste effort reconsidering fragments
of an upper tree page.

			regards, tom lane
On Sun, Mar 25, 2018 at 12:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Claudio Freire <klaussfreire@gmail.com> writes: >> [ 0001-Vacuum-Update-FSM-more-frequently-v9.patch ] > > I hadn't paid any attention to this patch previously, so maybe I'm > missing something ... but this sure seems like a very bizarre approach > to the problem. If the idea is to fix the FSM's upper levels after > vacuuming a known sub-range of the table, why would you not proceed > by teaching FreeSpaceMapVacuum to recurse only within that sub-range > of page numbers? > Yeah, sounds like a better idea and I think we already do something similar during relation extension (when we add blocks in bulk) via UpdateFreeSpaceMap. Is that what you have in mind? It seems like Claudio's method is somewhat more selective when deciding which branches to traverse, but it appears complicated as compared to what you are proposing. This setup with a threshold seems entirely Rube > Goldbergian. It's dependent on a magic threshold number that you can > only select by trial and error, and it's inevitably going to spend time > updating segments of the FSM tree that have nothing to do with the part > that's been vacuumed. > > This approach would also offer a less arbitrary way to decide how often > to do the updates: choose a distance that has something to do with the > FSM's structure, so that we don't waste effort reconsidering fragments > of an upper tree page. > +1. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Sat, Mar 24, 2018 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Claudio Freire <klaussfreire@gmail.com> writes: >> [ 0001-Vacuum-Update-FSM-more-frequently-v9.patch ] > > I hadn't paid any attention to this patch previously, so maybe I'm > missing something ... but this sure seems like a very bizarre approach > to the problem. If the idea is to fix the FSM's upper levels after > vacuuming a known sub-range of the table, why would you not proceed > by teaching FreeSpaceMapVacuum to recurse only within that sub-range > of page numbers? This setup with a threshold seems entirely Rube > Goldbergian. It's dependent on a magic threshold number that you can > only select by trial and error, and it's inevitably going to spend time > updating segments of the FSM tree that have nothing to do with the part > that's been vacuumed. Well, the point is not only to update the range we know we've vacuumed, but also to finish the updates left behind by a potentially cancelled previous autovacuum. But I guess that's only for the first run. Your approach seems like a better idea for the other runs. Just FTR, the number isn't so magic. During vacuum, the patch records the highest amount of free space produced, and will only recurse into branches that don't already record that much free space. So it's quite deterministic in those cases; only the first run has to guess, and your approach doesn't apply to that first run.
Claudio Freire <klaussfreire@gmail.com> writes: > On Sat, Mar 24, 2018 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I hadn't paid any attention to this patch previously, so maybe I'm >> missing something ... but this sure seems like a very bizarre approach >> to the problem. If the idea is to fix the FSM's upper levels after >> vacuuming a known sub-range of the table, why would you not proceed >> by teaching FreeSpaceMapVacuum to recurse only within that sub-range >> of page numbers? This setup with a threshold seems entirely Rube >> Goldbergian. It's dependent on a magic threshold number that you can >> only select by trial and error, and it's inevitably going to spend time >> updating segments of the FSM tree that have nothing to do with the part >> that's been vacuumed. > Well, the point is to not only update the range we know we've > vacuumed, but also to finish the updates done by a potential > previously cancelled autovacuum. I think that's not an important consideration, or at least would stop being one after a patch like this. The reason it's a problem now is precisely that we don't try to vacuum the FSM till the very end; if we did FSM cleanup every X pages --- in particular, before not after the final relation-truncation attempt --- there wouldn't be risk of skipping so much FSM work that we need to worry about forcing the issue just in case there'd been a prior cancellation. (Of course, you'd still need to do something after the truncation step to truncate the FSM, but I'm arguing it should *only* respond to that, not have to clean up all the rest of the FSM state.) regards, tom lane
On Mon, Mar 26, 2018 at 11:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Claudio Freire <klaussfreire@gmail.com> writes: >> On Sat, Mar 24, 2018 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I hadn't paid any attention to this patch previously, so maybe I'm >>> missing something ... but this sure seems like a very bizarre approach >>> to the problem. If the idea is to fix the FSM's upper levels after >>> vacuuming a known sub-range of the table, why would you not proceed >>> by teaching FreeSpaceMapVacuum to recurse only within that sub-range >>> of page numbers? This setup with a threshold seems entirely Rube >>> Goldbergian. It's dependent on a magic threshold number that you can >>> only select by trial and error, and it's inevitably going to spend time >>> updating segments of the FSM tree that have nothing to do with the part >>> that's been vacuumed. > >> Well, the point is to not only update the range we know we've >> vacuumed, but also to finish the updates done by a potential >> previously cancelled autovacuum. > > I think that's not an important consideration, or at least would stop > being one after a patch like this. The reason it's a problem now is > precisely that we don't try to vacuum the FSM till the very end; if > we did FSM cleanup every X pages --- in particular, before not after > the final relation-truncation attempt --- there wouldn't be risk of > skipping so much FSM work that we need to worry about forcing the > issue just in case there'd been a prior cancellation. I'm thinking that, in conjunction with the high MWM patch for vacuum, it could still happen that a considerable amount of vacuuming is left unexposed upon cancellation. The "bizarre" approach provides some relief. I'll see about implementing the FSM range vacuuming operation for non-initial runs; there seems to be consensus that it's a good idea. But I still think this partial run at the very beginning is useful.
On Mon, Mar 26, 2018 at 11:26 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Mon, Mar 26, 2018 at 11:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Claudio Freire <klaussfreire@gmail.com> writes: >>> On Sat, Mar 24, 2018 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> I hadn't paid any attention to this patch previously, so maybe I'm >>>> missing something ... but this sure seems like a very bizarre approach >>>> to the problem. If the idea is to fix the FSM's upper levels after >>>> vacuuming a known sub-range of the table, why would you not proceed >>>> by teaching FreeSpaceMapVacuum to recurse only within that sub-range >>>> of page numbers? This setup with a threshold seems entirely Rube >>>> Goldbergian. It's dependent on a magic threshold number that you can >>>> only select by trial and error, and it's inevitably going to spend time >>>> updating segments of the FSM tree that have nothing to do with the part >>>> that's been vacuumed. >> >>> Well, the point is to not only update the range we know we've >>> vacuumed, but also to finish the updates done by a potential >>> previously cancelled autovacuum. >> >> I think that's not an important consideration, or at least would stop >> being one after a patch like this. The reason it's a problem now is >> precisely that we don't try to vacuum the FSM till the very end; if >> we did FSM cleanup every X pages --- in particular, before not after >> the final relation-truncation attempt --- there wouldn't be risk of >> skipping so much FSM work that we need to worry about forcing the >> issue just in case there'd been a prior cancellation. > > I'm thinking that in conjunction with the high MWM patch for vacuum, > it could still happen that considerable amount of vacuuming is left > unexposed upon cancellation. > > The "bizarre" approach provides some relief. > > I'll see about implementing the FSM range vacuuming operation for > non-initial runs, there seems to be consensus that it's a good idea. 
> > But I still think this partial run at the very beginning is useful still. Attached patches, rebased and modified as discussed: 1 no longer does tree pruning, it just vacuums a range of the FSM 2 reintroduces tree pruning for the initial FSM vacuum 3 and 4 remain as they were, but rebased
On Mon, Mar 26, 2018 at 2:46 PM, Claudio Freire <klaussfreire@gmail.com> wrote: > On Mon, Mar 26, 2018 at 11:26 AM, Claudio Freire <klaussfreire@gmail.com> wrote: >> On Mon, Mar 26, 2018 at 11:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Claudio Freire <klaussfreire@gmail.com> writes: >>>> On Sat, Mar 24, 2018 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>>> I hadn't paid any attention to this patch previously, so maybe I'm >>>>> missing something ... but this sure seems like a very bizarre approach >>>>> to the problem. If the idea is to fix the FSM's upper levels after >>>>> vacuuming a known sub-range of the table, why would you not proceed >>>>> by teaching FreeSpaceMapVacuum to recurse only within that sub-range >>>>> of page numbers? This setup with a threshold seems entirely Rube >>>>> Goldbergian. It's dependent on a magic threshold number that you can >>>>> only select by trial and error, and it's inevitably going to spend time >>>>> updating segments of the FSM tree that have nothing to do with the part >>>>> that's been vacuumed. >>> >>>> Well, the point is to not only update the range we know we've >>>> vacuumed, but also to finish the updates done by a potential >>>> previously cancelled autovacuum. >>> >>> I think that's not an important consideration, or at least would stop >>> being one after a patch like this. The reason it's a problem now is >>> precisely that we don't try to vacuum the FSM till the very end; if >>> we did FSM cleanup every X pages --- in particular, before not after >>> the final relation-truncation attempt --- there wouldn't be risk of >>> skipping so much FSM work that we need to worry about forcing the >>> issue just in case there'd been a prior cancellation. >> >> I'm thinking that in conjunction with the high MWM patch for vacuum, >> it could still happen that considerable amount of vacuuming is left >> unexposed upon cancellation. >> >> The "bizarre" approach provides some relief. 
>> >> I'll see about implementing the FSM range vacuuming operation for >> non-initial runs, there seems to be consensus that it's a good idea. >> >> But I still think this partial run at the very beginning is useful still. > > Attached patches, rebased and modified as discussed: > > 1 no longer does tree pruning, it just vacuums a range of the FSM > > 2 reintroduces tree pruning for the initial FSM vacuum > > 3 and 4 remain as they were, but rebased Sorry, ignore patch number 3 in the earlier mail, I selected the wrong patch to attach. Attached now is the correct number 3
Claudio Freire <klaussfreire@gmail.com> writes: > Attached patches, rebased and modified as discussed: > 1 no longer does tree pruning, it just vacuums a range of the FSM > 2 reintroduces tree pruning for the initial FSM vacuum > 3 and 4 remain as they were, but rebased I reviewed and cleaned up 0001. The API for FreeSpaceMapVacuumRange was underspecified, and the use of it seemed to have off-by-one errors. Also, you still had vacuum doing a full FreeSpaceMapVacuum call at the end; I thought the point was to get rid of that. We then need to make sure we clean up after a truncation, but we can do that by introducing a FreeSpaceMapVacuumRange call into FreeSpaceMapTruncateRel. I think the attached 0001 is committable, but feel free to take another look. I still don't like 0002. It's adding a lot of complexity, and not-negligible overhead, to solve yesterday's problem. After 0001, there's no reason to assume that vacuum is particularly likely to get cancelled between having made cleanups and having updated the upper FSM levels. (Maybe the odds are a bit more for the no-indexes case, but that doesn't seem like it's common enough to justify a special mechanism either.) Not sure what to think about 0003. At this point I'd be inclined to flush UpdateFreeSpaceMap altogether and use FreeSpaceMapVacuumRange in its place. I don't think the design of that function is any better chosen than its name, and possible bugs in its subroutines don't make it more attractive. Not sure about 0004 either. The fact that we can't localize what part of the index needs to be updated means that repeated IndexFreeSpaceMapVacuum calls are a roughly quadratic cost. Maybe in proportion to the other work we have to do, they're not much, but on the other hand how much benefit is there? Should we make the call conditional on how many index pages got updated? Also, I wonder why you only touched nbtree and spgist. 
(For that matter, I wonder why BRIN doesn't go through IndexFreeSpaceMapVacuum like the rest of the index AMs. Or perhaps it has the right idea and we should get rid of IndexFreeSpaceMapVacuum as a useless layer.) regards, tom lane diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index f9da24c..d2a0066 100644 *** a/src/backend/commands/vacuumlazy.c --- b/src/backend/commands/vacuumlazy.c *************** *** 85,90 **** --- 85,99 ---- #define VACUUM_TRUNCATE_LOCK_TIMEOUT 5000 /* ms */ /* + * When a table has no indexes, vacuum the FSM after every 8GB, approximately + * (it won't be exact because we only vacuum FSM after processing a heap page + * that has some removable tuples). When there are indexes, this is ignored, + * and we vacuum FSM after each index/heap cleaning pass. + */ + #define VACUUM_FSM_EVERY_PAGES \ + ((BlockNumber) (((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ)) + + /* * Guesstimation of number of dead tuples per page. This is used to * provide an upper limit to memory allocated when vacuuming small * tables. *************** lazy_vacuum_rel(Relation onerel, int opt *** 285,293 **** pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_FINAL_CLEANUP); - /* Vacuum the Free Space Map */ - FreeSpaceMapVacuum(onerel); - /* * Update statistics in pg_class. * --- 294,299 ---- *************** lazy_scan_heap(Relation onerel, int opti *** 465,471 **** TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid; TransactionId relminmxid = onerel->rd_rel->relminmxid; BlockNumber empty_pages, ! vacuumed_pages; double num_tuples, /* total number of nonremovable tuples */ live_tuples, /* live tuples (reltuples estimate) */ tups_vacuumed, /* tuples cleaned up by vacuum */ --- 471,478 ---- TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid; TransactionId relminmxid = onerel->rd_rel->relminmxid; BlockNumber empty_pages, ! vacuumed_pages, ! 
next_fsm_block_to_vacuum; double num_tuples, /* total number of nonremovable tuples */ live_tuples, /* live tuples (reltuples estimate) */ tups_vacuumed, /* tuples cleaned up by vacuum */ *************** lazy_scan_heap(Relation onerel, int opti *** 501,506 **** --- 508,514 ---- relname))); empty_pages = vacuumed_pages = 0; + next_fsm_block_to_vacuum = (BlockNumber) 0; num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) *************** lazy_scan_heap(Relation onerel, int opti *** 752,757 **** --- 760,772 ---- vacrelstats->num_dead_tuples = 0; vacrelstats->num_index_scans++; + /* + * Vacuum the Free Space Map to make newly-freed space visible on + * upper-level FSM pages. Note we have not yet processed blkno. + */ + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); + next_fsm_block_to_vacuum = blkno; + /* Report that we are once again scanning the heap */ pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_SCAN_HEAP); *************** lazy_scan_heap(Relation onerel, int opti *** 1200,1205 **** --- 1215,1233 ---- */ vacrelstats->num_dead_tuples = 0; vacuumed_pages++; + + /* + * Periodically do incremental FSM vacuuming to make newly-freed + * space visible on upper FSM pages. Note: although we've cleaned + * the current block, we haven't yet updated its FSM entry (that + * happens further down), so passing end == blkno is correct. + */ + if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES) + { + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, + blkno); + next_fsm_block_to_vacuum = blkno; + } } freespace = PageGetHeapFreeSpace(page); *************** lazy_scan_heap(Relation onerel, int opti *** 1368,1373 **** --- 1396,1408 ---- vacrelstats->num_index_scans++; } + /* + * Vacuum the remainder of the Free Space Map. We must do this whether or + * not there were indexes. 
+ */ + if (blkno > next_fsm_block_to_vacuum) + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); + /* report all blocks vacuumed; and that we're cleaning up */ pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno); pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, diff --git a/src/backend/storage/freespace/README b/src/backend/storage/freespace/README index bbd1b93..e7ff23b 100644 *** a/src/backend/storage/freespace/README --- b/src/backend/storage/freespace/README *************** have a corrupted page, with a parent som *** 180,192 **** Secondly, if we detect corrupted pages while we search, traversing down the tree. That check will notice if a parent node is set to too high a value. In both cases, the upper nodes on the page are immediately rebuilt, fixing ! the corruption. ! Vacuum updates all the bottom level pages with correct amount of free space ! on the heap pages, fixing any outdated values there. After the heap and ! index passes are done, FreeSpaceMapVacuum is called, and the FSM tree is ! scanned in depth-first order. This fixes any discrepancies between upper ! and lower level FSM pages. TODO ---- --- 180,192 ---- Secondly, if we detect corrupted pages while we search, traversing down the tree. That check will notice if a parent node is set to too high a value. In both cases, the upper nodes on the page are immediately rebuilt, fixing ! the corruption so far as that page is concerned. ! VACUUM updates all the bottom-level FSM pages with the correct amount of free ! space on corresponding heap pages, as it proceeds through the heap. This ! goes through fsm_set_avail(), so that the upper nodes on those pages are ! immediately updated. Periodically, VACUUM calls FreeSpaceMapVacuum[Range] ! to propagate the new free-space info into the upper pages of the FSM tree. 
TODO ---- diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c index dd8ae52..8ae981f 100644 *** a/src/backend/storage/freespace/freespace.c --- b/src/backend/storage/freespace/freespace.c *************** static Size fsm_space_cat_to_avail(uint8 *** 108,114 **** static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, uint8 newValue, uint8 minValue); static BlockNumber fsm_search(Relation rel, uint8 min_cat); ! static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof); static BlockNumber fsm_get_lastblckno(Relation rel, FSMAddress addr); static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat); --- 108,116 ---- static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, uint8 newValue, uint8 minValue); static BlockNumber fsm_search(Relation rel, uint8 min_cat); ! static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, ! BlockNumber start, BlockNumber end, ! bool *eof); static BlockNumber fsm_get_lastblckno(Relation rel, FSMAddress addr); static void fsm_update_recursive(Relation rel, FSMAddress addr, uint8 new_cat); *************** FreeSpaceMapTruncateRel(Relation rel, Bl *** 370,390 **** */ if (rel->rd_smgr) rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks; } /* ! * FreeSpaceMapVacuum - scan and fix any inconsistencies in the FSM */ void FreeSpaceMapVacuum(Relation rel) { bool dummy; ! /* ! * Traverse the tree in depth-first order. The tree is stored physically ! * in depth-first order, so this should be pretty I/O efficient. ! */ ! fsm_vacuum_page(rel, FSM_ROOT_ADDRESS, &dummy); } /******** Internal routines ********/ --- 372,418 ---- */ if (rel->rd_smgr) rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks; + + /* + * Update upper-level FSM pages to account for the truncation. This is + * important because the just-truncated pages were likely marked as + * all-free, and would be preferentially selected. 
+ */ + FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber); } /* ! * FreeSpaceMapVacuum - update upper-level pages in the rel's FSM ! * ! * We assume that the bottom-level pages have already been updated with ! * new free-space information. */ void FreeSpaceMapVacuum(Relation rel) { bool dummy; ! /* Recursively scan the tree, starting at the root */ ! (void) fsm_vacuum_page(rel, FSM_ROOT_ADDRESS, ! (BlockNumber) 0, InvalidBlockNumber, ! &dummy); ! } ! ! /* ! * FreeSpaceMapVacuumRange - update upper-level pages in the rel's FSM ! * ! * As above, but assume that only heap pages between start and end-1 inclusive ! * have new free-space information, so update only the upper-level slots ! * covering that block range. end == InvalidBlockNumber is equivalent to ! * "all the rest of the relation". ! */ ! void ! FreeSpaceMapVacuumRange(Relation rel, BlockNumber start, BlockNumber end) ! { ! bool dummy; ! ! /* Recursively scan the tree, starting at the root */ ! (void) fsm_vacuum_page(rel, FSM_ROOT_ADDRESS, start, end, &dummy); } /******** Internal routines ********/ *************** fsm_search(Relation rel, uint8 min_cat) *** 783,791 **** /* * Recursive guts of FreeSpaceMapVacuum */ static uint8 ! fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p) { Buffer buf; Page page; --- 811,831 ---- /* * Recursive guts of FreeSpaceMapVacuum + * + * Examine the FSM page indicated by addr, as well as its children, updating + * upper-level nodes that cover the heap block range from start to end-1. + * (It's okay if end is beyond the actual end of the map.) + * Return the maximum freespace value on this page. + * + * If addr is past the end of the FSM, set *eof_p to true and return 0. + * + * This traverses the tree in depth-first order. The tree is stored + * physically in depth-first order, so this should be pretty I/O efficient. */ static uint8 ! fsm_vacuum_page(Relation rel, FSMAddress addr, ! BlockNumber start, BlockNumber end, ! 
bool *eof_p) { Buffer buf; Page page; *************** fsm_vacuum_page(Relation rel, FSMAddress *** 804,818 **** page = BufferGetPage(buf); /* ! * Recurse into children, and fix the information stored about them at ! * this level. */ if (addr.level > FSM_BOTTOM_LEVEL) { ! int slot; bool eof = false; ! for (slot = 0; slot < SlotsPerFSMPage; slot++) { int child_avail; --- 844,894 ---- page = BufferGetPage(buf); /* ! * If we're above the bottom level, recurse into children, and fix the ! * information stored about them at this level. */ if (addr.level > FSM_BOTTOM_LEVEL) { ! FSMAddress fsm_start, ! fsm_end; ! uint16 fsm_start_slot, ! fsm_end_slot; ! int slot, ! start_slot, ! end_slot; bool eof = false; ! /* ! * Compute the range of slots we need to update on this page, given ! * the requested range of heap blocks to consider. (Some of this work ! * will be duplicated in each recursive call, but it's cheap enough to ! * not worry about.) ! */ ! fsm_start = fsm_get_location(start, &fsm_start_slot); ! fsm_end = fsm_get_location(end, &fsm_end_slot); ! ! while (fsm_start.level < addr.level) ! { ! fsm_start = fsm_get_parent(fsm_start, &fsm_start_slot); ! fsm_end = fsm_get_parent(fsm_end, &fsm_end_slot); ! } ! Assert(fsm_start.level == addr.level); ! ! if (fsm_start.logpageno == addr.logpageno) ! start_slot = fsm_start_slot; ! else if (fsm_start.logpageno > addr.logpageno) ! start_slot = SlotsPerFSMPage; ! else ! start_slot = 0; ! ! if (fsm_end.logpageno == addr.logpageno) ! end_slot = fsm_end_slot; ! else if (fsm_end.logpageno > addr.logpageno) ! end_slot = SlotsPerFSMPage; ! else ! end_slot = -1; ! ! for (slot = start_slot; slot < end_slot; slot++) { int child_avail; *************** fsm_vacuum_page(Relation rel, FSMAddress *** 820,826 **** /* After we hit end-of-file, just clear the rest of the slots */ if (!eof) ! 
child_avail = fsm_vacuum_page(rel, fsm_get_child(addr, slot), &eof); else child_avail = 0; --- 896,904 ---- /* After we hit end-of-file, just clear the rest of the slots */ if (!eof) ! child_avail = fsm_vacuum_page(rel, fsm_get_child(addr, slot), ! start, end, ! &eof); else child_avail = 0; *************** fsm_vacuum_page(Relation rel, FSMAddress *** 835,840 **** --- 913,919 ---- } } + /* Now get the maximum value on the page, to return to caller */ max_avail = fsm_get_max_avail(BufferGetPage(buf)); /* diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h index a517d7e..a203251 100644 *** a/src/include/storage/freespace.h --- b/src/include/storage/freespace.h *************** extern void XLogRecordPageWithFreeSpace( *** 32,37 **** --- 32,39 ---- extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks); extern void FreeSpaceMapVacuum(Relation rel); + extern void FreeSpaceMapVacuumRange(Relation rel, BlockNumber start, + BlockNumber end); extern void UpdateFreeSpaceMap(Relation rel, BlockNumber startBlkNum, BlockNumber endBlkNum,
On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Claudio Freire <klaussfreire@gmail.com> writes: >> Attached patches, rebased and modified as discussed: >> 1 no longer does tree pruning, it just vacuums a range of the FSM >> 2 reintroduces tree pruning for the initial FSM vacuum >> 3 and 4 remain as they were, but rebased > > I reviewed and cleaned up 0001. The API for FreeSpaceMapVacuumRange was > underspecified, and the use of it seemed to have off-by-one errors. Also, > you still had vacuum doing a full FreeSpaceMapVacuum call at the end; > I thought the point was to get rid of that. We then need to make sure > we clean up after a truncation, but we can do that by introducing a > FreeSpaceMapVacuumRange call into FreeSpaceMapTruncateRel. I think the > attached 0001 is committable, but feel free to take another look. + + /* + * Periodically do incremental FSM vacuuming to make newly-freed + * space visible on upper FSM pages. Note: although we've cleaned + * the current block, we haven't yet updated its FSM entry (that + * happens further down), so passing end == blkno is correct. + */ + if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES) + { + FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, + blkno); + next_fsm_block_to_vacuum = blkno; + } } Using next_fsm_block_to_vacuum there seems strange. v10 counted the number of blocks with updated free space to vacuum the FSM only after a lot of changes to it were made. This will vacuum the FSM after *scanning* a lot of pages, even if little modifications were made to them. Not sure which is better, but the logic in v10 sounded right to me. ! if (fsm_end.logpageno == addr.logpageno) ! end_slot = fsm_end_slot; ! else if (fsm_end.logpageno > addr.logpageno) ! end_slot = SlotsPerFSMPage; ! else ! end_slot = -1; ! ! for (slot = start_slot; slot < end_slot; slot++) This doesn't seem correct. The +1 in v10 was intentional. 
Suppose a leaf node of the FSM has a parent in slot 3, and that one has a parent at slot 10. This would vacuum slots 0-9 on the upper level, but not 10. That would leave a whole branch (the one where the end block belongs) unvisited. That's why the end_slot has to be inclusive of the end block: we still have work to do when recursing into the end_slot's branch. > I still don't like 0002. It's adding a lot of complexity, and > not-negligible overhead, to solve yesterday's problem. Are you sure it's non-negligible? Most of my benchmarks couldn't measure any change whatsoever after this patch with regard to run/CPU time. The FSM is so much smaller than the table that the cost of vacuuming it is drowned by all the other work done and buried under the noise. Maybe in very special cases where vacuum does nothing, skipping all pages, it might be measurable. A heavily bloated-then-cleaned table with few live rows per page, but all frozen, would be a total worst case. But reading the visibility map to skip pages is comparable work to vacuuming the FSM. There's no reason to think it would worsen by *that* much. I might try to benchmark that a bit after the long weekend. Anyway, it's a separate patch to be independently committed/vetted. > After 0001, > there's no reason to assume that vacuum is particularly likely to get > cancelled between having made cleanups and having updated the upper FSM > levels. (Maybe the odds are a bit more for the no-indexes case, but > that doesn't seem like it's common enough to justify a special mechanism > either.) Why not? Any kind of DDL (even commands that don't rewrite the heap) would cancel autovacuum. You might think DDL isn't common enough to worry about, but I've seen cases where regular reindexes were required to keep index bloat in check (and were cron'd), and those cancel autovacuum all the time. > Not sure about 0004 either. 
The fact that we can't localize what part of > the index needs to be updated means that repeated IndexFreeSpaceMapVacuum > calls are a roughly quadratic cost. Well, the index bulk delete itself already is a huge elephant-sized quadratic cost. The FSM is only the cherry on top. The updated part can't be localized because it isn't local: all blocks could potentially be changed. Even in correlated indexes, upper levels need not be physically correlated and would screw with the "vacuum block range" optimization. I could try to optimize the case where it's possible, by recording the first and last blocks modified, but that would be a hugely invasive patch (it would have to touch a lot of places in btree code). And index bloat is a very real issue, as bad as heap bloat is. > Maybe in proportion to the other work > we have to do, they're not much, but on the other hand how much benefit is > there? A week-long multi-pass vacuum I was forced to do about a month ago accumulated 400GB of *index bloat* because of this. To put that into context, the index was ~800GB, so close to 50% bloat. Granted, with this patch it would still accumulate bloat, just not 400GB. Just 400GB / # of passes. It's still a decent improvement, I would say. And perhaps there was something else wrong in my installation. But the patch would indeed have helped a lot. > Should we make the call conditional on how many index pages got > updated? It could be attempted. > Also, I wonder why you only touched nbtree and spgist. They're the only ones that use the FSM during bulk deletes, at least that I could find grepping a bit. I was a bit surprised to find that hash doesn't use the FSM at all, but perhaps there's a reason for that. GIN does all FSM updates at the end during ginvacuumcleanup, so there's nothing to do between bulkdelete calls.
Or perhaps it has the right idea and we > should get rid of IndexFreeSpaceMapVacuum as a useless layer.) No idea.
Claudio Freire <klaussfreire@gmail.com> writes: > v10 counted the number of blocks with updated free space to vacuum the > FSM only after a lot of changes to it were made. This will vacuum the > FSM after *scanning* a lot of pages, even if little modifications were > made to them. Yes, that's exactly the point. We still need to update the upper pages of the FSM, regardless of exactly how many pages were changed by VACUUM underneath. The pages VACUUM didn't change might still be out of date in the upper pages, since retail updates from other operations don't immediately propagate; so there's a fixed amount of work that needs to be done there and we might as well do it in more-or-less-predictable quanta. I don't see any point in adding counting overhead/complexity for that. In fact, I seriously considered just making it update after every VACUUM_FSM_EVERY_PAGES period, but decided that for the case with indexes, doing it right after each cleanup cycle was sensible, and then we might as well make the no-index case look as much like that as we conveniently can. The no-index case seems vestigial anyway; how often is that really going to apply? So spending a lot of complexity on it doesn't seem warranted, especially when the argument for more complexity is at best dubious. > This doesn't seem correct. [ thinks for a bit... ] Yeah, you're right, we need to round up not down when determining the last slot to scan on the upper level. I wonder how hard it is to avoid rounding up when the stop point actually does fall right at a FSM page boundary. OTOH, that might not happen often enough to be worth sweating over. regards, tom lane
Claudio Freire <klaussfreire@gmail.com> writes:
> On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> After 0001, there's no reason to assume that vacuum is particularly
>> likely to get cancelled between having made cleanups and having updated
>> the upper FSM levels. (Maybe the odds are a bit more for the no-indexes
>> case, but that doesn't seem like it's common enough to justify a
>> special mechanism either.)

> Why not?

> Any kind of DDL (even those that don't rewrite the heap) would cancel
> autovacuum.

> You might think DDL isn't common enough to worry about, but I've seen
> cases where regular reindexes were required to keep index bloat in check
> (and were cron'd), and those cancel autovacuum all the time.

If you've got a situation where every vacuum gets canceled partway through, you've got bloat problems regardless, because the heap tuples are never getting removed in the first place; worrying about whether the FSM is up to date is pretty pointless. The 0001 patch basically updates the FSM as soon as it can after the tuples are actually deleted, so I think we've made the window as small as practical, and I don't really see a need to do extra work (and add substantial extra complexity) to deal with missed cleanup a bit sooner.

People who are dealing with this sort of scenario a lot might be well advised to reduce autovacuum's maintenance_work_mem, so that the cleanup cycles happen more often. That's kind of costly in terms of the number of index scans, but it reduces the amount of cleanup work that can be lost to a cancel.

(I'd also argue that a setup such as you describe is very possibly making things worse, not better. Perhaps the 0001 patch will go some way towards making it less necessary to do that.)

I've pushed 0001 and a replacement for 0003. I have to go do something else right now, but I'll take a closer look at 0004 later.

			regards, tom lane
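The mitigation suggested above is purely a configuration change. A sketch of what that could look like in postgresql.conf; the value is illustrative only, not a recommendation, and autovacuum_work_mem (which overrides maintenance_work_mem for autovacuum workers when set) is the more targeted knob:

```
# Smaller memory budget for dead-tuple TIDs => more frequent index-cleanup
# cycles, so less cleanup work is lost when autovacuum gets canceled,
# at the cost of more index scans per vacuumed table.
autovacuum_work_mem = 64MB      # default -1 means "use maintenance_work_mem"
```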
I wrote:
> I have to go do something else right now, but I'll take a closer look
> at 0004 later.

OK, so after studying 0004, it seems to me that we could do it more simply as attached; that is, move the IndexFreeSpaceMapVacuum calls into btvacuumscan/spgvacuumscan, do them only if we found any recyclable pages, and drop the calls in btvacuumcleanup/spgvacuumcleanup altogether.

The reason I think this is sufficient is that the scans find and record every reusable page in the index, no matter whether they were recorded before or not. Therefore, if we don't find any such pages, there's nothing useful in the FSM and no particular urgency about making its upper pages up-to-date. It's true that if the FSM is actually corrupt, leaving that to be fixed retail by searches is probably less efficient than doing an IndexFreeSpaceMapVacuum call would be --- but *only* if you assume that the problem is just in the upper pages and the leaf pages are all fine. That doesn't seem to be a case we should optimize for.

I realized that the reason BRIN doesn't go through indexfsm.c is that it's actually interested in less-than-page-size free space, so it's using the right API. However, it still looks to me like we could win something there by replacing FreeSpaceMapVacuum calls with FreeSpaceMapVacuumRange calls. I've not wrapped my head around the logic completely, but it looks like there are several places where it calls FreeSpaceMapVacuum to bubble up a free-space update for just a single page. If so, that's much less efficient than it could be.
			regards, tom lane

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8158508..6fca8e3 100644
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
*************** btvacuumcleanup(IndexVacuumInfo *info, I
*** 832,840 ****
  		btvacuumscan(info, stats, NULL, NULL, 0);
  	}
  
- 	/* Finally, vacuum the FSM */
- 	IndexFreeSpaceMapVacuum(info->index);
- 
  	/*
  	 * It's quite possible for us to be fooled by concurrent page splits into
  	 * double-counting some index tuples, so disbelieve any total that exceeds
--- 832,837 ----
*************** btvacuumscan(IndexVacuumInfo *info, Inde
*** 976,981 ****
--- 973,993 ----
  	MemoryContextDelete(vstate.pagedelcontext);
  
+ 	/*
+ 	 * If we found any recyclable pages (and recorded them in the FSM), then
+ 	 * forcibly update the upper-level FSM pages to ensure that searchers can
+ 	 * find them.  It's possible that the pages were also found during
+ 	 * previous scans and so this is a waste of time, but it's cheap enough
+ 	 * relative to scanning the index that it shouldn't matter much, and
+ 	 * making sure that free pages are available sooner not later seems
+ 	 * worthwhile.
+ 	 *
+ 	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+ 	 * FSM at all.
+ 	 */
+ 	if (vstate.totFreePages > 0)
+ 		IndexFreeSpaceMapVacuum(rel);
+ 
  	/* update statistics */
  	stats->num_pages = num_pages;
  	stats->pages_free = vstate.totFreePages;
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 72839cb..a83a4b5 100644
*** a/src/backend/access/spgist/spgvacuum.c
--- b/src/backend/access/spgist/spgvacuum.c
*************** spgvacuumscan(spgBulkDeleteState *bds)
*** 846,851 ****
--- 846,866 ----
  	SpGistUpdateMetaPage(index);
  
  	/*
+ 	 * If we found any empty pages (and recorded them in the FSM), then
+ 	 * forcibly update the upper-level FSM pages to ensure that searchers can
+ 	 * find them.  It's possible that the pages were also found during
+ 	 * previous scans and so this is a waste of time, but it's cheap enough
+ 	 * relative to scanning the index that it shouldn't matter much, and
+ 	 * making sure that free pages are available sooner not later seems
+ 	 * worthwhile.
+ 	 *
+ 	 * Note that if no empty pages exist, we don't bother vacuuming the FSM at
+ 	 * all.
+ 	 */
+ 	if (bds->stats->pages_deleted > 0)
+ 		IndexFreeSpaceMapVacuum(index);
+ 
+ 	/*
  	 * Truncate index if possible
  	 *
  	 * XXX disabled because it's unsafe due to possible concurrent inserts.
*************** dummy_callback(ItemPointer itemptr, void
*** 916,922 ****
  IndexBulkDeleteResult *
  spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  {
- 	Relation	index = info->index;
  	spgBulkDeleteState bds;
  
  	/* No-op in ANALYZE ONLY mode */
--- 931,936 ----
*************** spgvacuumcleanup(IndexVacuumInfo *info,
*** 926,933 ****
  	/*
  	 * We don't need to scan the index if there was a preceding bulkdelete
  	 * pass.  Otherwise, make a pass that won't delete any live tuples, but
! 	 * might still accomplish useful stuff with redirect/placeholder cleanup,
! 	 * and in any case will provide stats.
  	 */
  	if (stats == NULL)
  	{
--- 940,947 ----
  	/*
  	 * We don't need to scan the index if there was a preceding bulkdelete
  	 * pass.  Otherwise, make a pass that won't delete any live tuples, but
! 	 * might still accomplish useful stuff with redirect/placeholder cleanup
! 	 * and/or FSM housekeeping, and in any case will provide stats.
  	 */
  	if (stats == NULL)
  	{
*************** spgvacuumcleanup(IndexVacuumInfo *info,
*** 940,948 ****
  		spgvacuumscan(&bds);
  	}
  
- 	/* Finally, vacuum the FSM */
- 	IndexFreeSpaceMapVacuum(index);
- 
  	/*
  	 * It's quite possible for us to be fooled by concurrent tuple moves into
  	 * double-counting some index tuples, so disbelieve any total that exceeds
--- 954,959 ----
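The BRIN suggestion above, replacing per-page FreeSpaceMapVacuum calls with range-limited ones, can be illustrated with a toy two-level FSM: the upper level stores the max of each leaf group, and a range vacuum re-aggregates only the upper slots the range touches. The layout and names here are a simplified model for illustration, not the real freespace.c structures:

```c
#include <assert.h>

#define LEAF_SLOTS 64           /* toy FSM: 64 leaf slots */
#define FANOUT     8            /* leaf slots per upper-level slot */

static unsigned char leaf[LEAF_SLOTS];
static unsigned char upper[LEAF_SLOTS / FANOUT];

/*
 * Re-aggregate only the upper-level slots covering leaf slots
 * [start, end).  A full FSM vacuum would loop over every upper slot;
 * after updating a single page, a range-limited vacuum touches just one.
 */
static void
fsm_vacuum_range(unsigned int start, unsigned int end)
{
    unsigned int u;

    for (u = start / FANOUT; u <= (end - 1) / FANOUT; u++)
    {
        unsigned char max = 0;
        unsigned int i;

        for (i = u * FANOUT; i < (u + 1) * FANOUT; i++)
            if (leaf[i] > max)
                max = leaf[i];
        upper[u] = max;
    }
}
```

Bubbling up a single-page update is then `fsm_vacuum_range(blk, blk + 1)`, which recomputes one upper slot instead of all of them.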
On Thu, Mar 29, 2018 at 7:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> I have to go do something else right now, but I'll take a closer look
>> at 0004 later.
>
> OK, so after studying 0004, it seems to me that we could do it more
> simply as attached; that is, move the IndexFreeSpaceMapVacuum calls
> into btvacuumscan/spgvacuumscan, do them only if we found any recyclable
> pages, and drop the calls in btvacuumcleanup/spgvacuumcleanup altogether.
>
> The reason I think this is sufficient is that the scans find and record
> every reusable page in the index, no matter whether they were recorded
> before or not. Therefore, if we don't find any such pages, there's
> nothing useful in the FSM and no particular urgency about making its
> upper pages up-to-date. It's true that if the FSM is actually corrupt,
> leaving that to be fixed retail by searches is probably less efficient
> than doing an IndexFreeSpaceMapVacuum call would be --- but *only* if
> you assume that the problem is just in the upper pages and the leaf
> pages are all fine. That doesn't seem to be a case we should optimize
> for.

+1, LGTM.
On Thu, Mar 29, 2018 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> After 0001, there's no reason to assume that vacuum is particularly
>>> likely to get cancelled between having made cleanups and having
>>> updated the upper FSM levels. (Maybe the odds are a bit more for the
>>> no-indexes case, but that doesn't seem like it's common enough to
>>> justify a special mechanism either.)
>
>> Why not?
>
>> Any kind of DDL (even those that don't rewrite the heap) would cancel
>> autovacuum.
>
>> You might think DDL isn't common enough to worry about, but I've seen
>> cases where regular reindexes were required to keep index bloat in
>> check (and were cron'd), and those cancel autovacuum all the time.
>
> If you've got a situation where every vacuum gets canceled partway
> through, you've got bloat problems regardless, because the heap tuples
> are never getting removed in the first place; worrying about whether the
> FSM is up to date is pretty pointless. The 0001 patch basically updates
> the FSM as soon as it can after the tuples are actually deleted, so I
> think we've made the window as small as practical, and I don't really
> see a need to do extra work (and add substantial extra complexity) to
> deal with missed cleanup a bit sooner.
>
> People who are dealing with this sort of scenario a lot might be well
> advised to reduce autovacuum's maintenance_work_mem, so that the cleanup
> cycles happen more often. That's kind of costly in terms of the number
> of index scans, but it reduces the amount of cleanup work that can be
> lost to a cancel.
>
> (I'd also argue that a setup such as you describe is very possibly
> making things worse, not better. Perhaps the 0001 patch will go some way
> towards making it less necessary to do that.)

Alright, so we just drop 2.
On Tue, Apr 3, 2018 at 10:19 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
> On Thu, Mar 29, 2018 at 2:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Claudio Freire <klaussfreire@gmail.com> writes:
>>> On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> After 0001, there's no reason to assume that vacuum is particularly
>>>> likely to get cancelled between having made cleanups and having
>>>> updated the upper FSM levels. (Maybe the odds are a bit more for the
>>>> no-indexes case, but that doesn't seem like it's common enough to
>>>> justify a special mechanism either.)
>>
>>> Why not?
>>
>>> Any kind of DDL (even those that don't rewrite the heap) would cancel
>>> autovacuum.
>>
>>> You might think DDL isn't common enough to worry about, but I've seen
>>> cases where regular reindexes were required to keep index bloat in
>>> check (and were cron'd), and those cancel autovacuum all the time.
>>
>> If you've got a situation where every vacuum gets canceled partway
>> through, you've got bloat problems regardless, because the heap tuples
>> are never getting removed in the first place; worrying about whether
>> the FSM is up to date is pretty pointless. The 0001 patch basically
>> updates the FSM as soon as it can after the tuples are actually
>> deleted, so I think we've made the window as small as practical, and I
>> don't really see a need to do extra work (and add substantial extra
>> complexity) to deal with missed cleanup a bit sooner.
>>
>> People who are dealing with this sort of scenario a lot might be well
>> advised to reduce autovacuum's maintenance_work_mem, so that the
>> cleanup cycles happen more often. That's kind of costly in terms of
>> the number of index scans, but it reduces the amount of cleanup work
>> that can be lost to a cancel.
>>
>> (I'd also argue that a setup such as you describe is very possibly
>> making things worse, not better. Perhaps the 0001 patch will go some
>> way towards making it less necessary to do that.)
>
> Alright, so we just drop 2.

So, that's it then. Thanks.