Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers

From:     Andres Freund
Subject:  Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Date:
Msg-id:   20170408005629.bhjzow3ittre7vqc@alap3.anarazel.de
In response to:  Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem (Claudio Freire <klaussfreire@gmail.com>)
Responses:       Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem (Claudio Freire <klaussfreire@gmail.com>)
List:            pgsql-hackers

Hi,


On 2017-04-07 19:43:39 -0300, Claudio Freire wrote:
> On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
> > Hi,
> >
> > I've *not* read the history of this thread.  So I really might be
> > missing some context.
> >
> >
> >> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> >> From: Claudio Freire <klaussfreire@gmail.com>
> >> Date: Mon, 12 Sep 2016 23:36:42 -0300
> >> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> >>
> >> Turn the dead_tuples array into a structure composed of several
> >> exponentially bigger arrays, to enable usage of more than 1GB
> >> of work mem during vacuum and thus reduce the number of full
> >> index scans necessary to remove all dead tids when the memory is
> >> available.
> >
> >>   * We are willing to use at most maintenance_work_mem (or perhaps
> >>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
> >> - * initially allocate an array of TIDs of that size, with an upper limit that
> >> + * initially allocate an array of TIDs of 128MB, or an upper limit that
> >>   * depends on table size (this limit ensures we don't allocate a huge area
> >> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
> >> - * we suspend the heap scan phase and perform a pass of index cleanup and page
> >> - * compaction, then resume the heap scan with an empty TID array.
> >> + * uselessly for vacuuming small tables). Additional arrays of increasingly
> >> + * large sizes are allocated as they become necessary.
> >> + *
> >> + * The TID array is thus represented as a list of multiple segments of
> >> + * varying size, beginning with the initial size of up to 128MB, and growing
> >> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
> >> + * is used up.
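
To make sure I understand the proposed layout - it's roughly the
following, right?  (The field names are taken from the hunks quoted
below; the struct names and exact layout are guesses on my side, so
they may not match the patch precisely.)

typedef struct DeadTuplesSegment
{
    int              num_dead_tuples;  /* TIDs currently stored */
    int              max_dead_tuples;  /* capacity of dt_tids */
    ItemPointerData  last_dead_tuple;  /* copy of the last stored TID */
    ItemPointerData *dt_tids;          /* sorted array of dead TIDs */
} DeadTuplesSegment;

typedef struct DeadTuplesMultiArray
{
    int                num_segs;       /* number of initialized segments */
    int                last_seg;       /* segment currently being filled */
    DeadTuplesSegment *dt_segments;    /* palloc'd array of segments */
} DeadTuplesMultiArray;
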
> >
> > When the chunk size is 128MB, I'm a bit unconvinced that using
> > exponential growth is worth it. The allocator overhead can't be
> > meaningful in comparison to collecting 128MB of dead tuples, the
> > potential waste is pretty big, and it increases memory fragmentation.
> 
> The exponential strategy is mainly there to improve lookup time (i.e.,
> to avoid long segment lists).

Well, if we were to do binary search on the segment list, that'd not be
necessary.
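
Something like the following (completely untested, using the field
names from above) would keep the segment lookup logarithmic even with
fixed-size segments:

/*
 * Find the first segment whose last_dead_tuple is >= itemptr; with the
 * segment list kept sorted, only that segment can contain itemptr.
 */
static DeadTuplesSegment *
find_segment(DeadTuplesMultiArray *dt, ItemPointer itemptr)
{
    int     low = 0;
    int     high = dt->last_seg;

    while (low < high)
    {
        int     mid = low + (high - low) / 2;

        if (ItemPointerCompare(itemptr,
                               &dt->dt_segments[mid].last_dead_tuple) > 0)
            low = mid + 1;
        else
            high = mid;
    }

    return &dt->dt_segments[low];
}
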

> >> +             if (seg->num_dead_tuples >= seg->max_dead_tuples)
> >> +             {
> >> +                     /*
> >> +                      * The segment is overflowing, so we must allocate a new segment.
> >> +                      * We could have a preallocated segment descriptor already, in
> >> +                      * which case we just reinitialize it, or we may need to repalloc
> >> +                      * the vacrelstats->dead_tuples array. In that case, seg will no
> >> +                      * longer be valid, so we must be careful about that. In any case,
> >> +                      * we must update the last_dead_tuple copy in the overflowing
> >> +                      * segment descriptor.
> >> +                      */
> >> +                     Assert(seg->num_dead_tuples == seg->max_dead_tuples);
> >> +                     seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
> >> +                     if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs)
> >> +                     {
> >> +                             int                     new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
> >> +
> >> +                             vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc(
> >> +                                                        (void *) vacrelstats->dead_tuples.dt_segments,
> >> +                                                                new_num_segs * sizeof(DeadTuplesSegment));
> >
> > Might be worth breaking this into some sub-statements; it's quite hard
> > to read.
> 
> Breaking what precisely? The comment?

No, the three-line statement computing the new value of
dead_tuples.dt_segments.  I'd at least assign dead_tuples to a local
variable, to cut the length of the statement down.
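
I.e. something like (untested):

    DeadTuplesMultiArray *dt = &vacrelstats->dead_tuples;
    int         new_num_segs = dt->num_segs * 2;

    dt->dt_segments = (DeadTuplesSegment *)
        repalloc(dt->dt_segments, new_num_segs * sizeof(DeadTuplesSegment));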


> >> +                             while (vacrelstats->dead_tuples.num_segs < new_num_segs)
> >> +                             {
> >> +                                     /* Initialize as "unallocated" */
> >> +                                     DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[
> >> +                                                                     vacrelstats->dead_tuples.num_segs]);
> >
> > dito.
> 
> I don't really get what you're asking here.

Trying to simplify/shorten the statement.
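
With the same local variable it'd shrink to something like (untested;
which fields actually need resetting is a guess on my part):

    while (dt->num_segs < new_num_segs)
    {
        /* Initialize as "unallocated" */
        DeadTuplesSegment *nseg = &dt->dt_segments[dt->num_segs];

        nseg->num_dead_tuples = 0;
        nseg->max_dead_tuples = 0;
        nseg->dt_tids = NULL;
        dt->num_segs++;
    }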


> >> +/*
> >>   *   lazy_tid_reaped() -- is a particular tid deletable?
> >>   *
> >>   *           This has the right signature to be an IndexBulkDeleteCallback.
> >>   *
> >> - *           Assumes dead_tuples array is in sorted order.
> >> + *           Assumes the dead_tuples multiarray is in sorted order, both
> >> + *           the segment list and each segment itself, and that all segments'
> >> + *           last_dead_tuple fields are up to date
> >>   */
> >>  static bool
> >>  lazy_tid_reaped(ItemPointer itemptr, void *state)
> >
> > Have you done any performance evaluation of potential regressions with
> > big indexes here?  IIRC this can be called quite frequently?
> 
> Yes, the benchmarks are upthread. The earlier runs were done on my
> laptop and made little sense, so I'd ignore them as inaccurate. The
> latest run[1], with a pgbench scale of 4000, gave an improvement in CPU
> time (i.e., faster) of about 20%. Anastasia did another run[2] and saw
> improvements as well, roughly 30%, though measuring elapsed time rather
> than CPU time.

I'd be more concerned about cases that'd already fit into memory, not ones
where we avoid doing another scan - and I think you mostly measured that?
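
To expand on the concern: with the multiarray, each lazy_tid_reaped()
call - and there's one per index tuple considered for deletion - now
amounts to two binary searches rather than one. Schematically
(untested; this reuses the find_segment() sketch from above and the
preexisting vac_cmp_itemptr comparator in vacuumlazy.c):

static bool
lazy_tid_reaped(ItemPointer itemptr, void *state)
{
    LVRelStats *vacrelstats = (LVRelStats *) state;
    DeadTuplesMultiArray *dt = &vacrelstats->dead_tuples;
    DeadTuplesSegment *seg;

    /* first binary search: locate the candidate segment */
    seg = find_segment(dt, itemptr);

    /* second binary search: locate the TID within that segment */
    return bsearch((void *) itemptr,
                   (void *) seg->dt_tids,
                   seg->num_dead_tuples,
                   sizeof(ItemPointerData),
                   vac_cmp_itemptr) != NULL;
}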

- Andres


