Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Date
Msg-id 20170407200509.atssvs32jzrlquxk@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem  (Claudio Freire <klaussfreire@gmail.com>)
List pgsql-hackers
Hi,

I've *not* read the history of this thread.  So I really might be
missing some context.


> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
> From: Claudio Freire <klaussfreire@gmail.com>
> Date: Mon, 12 Sep 2016 23:36:42 -0300
> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
> 
> Turn the dead_tuples array into a structure composed of several
> exponentially bigger arrays, to enable usage of more than 1GB
> of work mem during vacuum and thus reduce the number of full
> index scans necessary to remove all dead tids when the memory is
> available.

>   * We are willing to use at most maintenance_work_mem (or perhaps
>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
> - * initially allocate an array of TIDs of that size, with an upper limit that
> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>   * depends on table size (this limit ensures we don't allocate a huge area
> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
> - * we suspend the heap scan phase and perform a pass of index cleanup and page
> - * compaction, then resume the heap scan with an empty TID array.
> + * uselessly for vacuuming small tables). Additional arrays of increasingly
> + * large sizes are allocated as they become necessary.
> + *
> + * The TID array is thus represented as a list of multiple segments of
> + * varying size, beginning with the initial size of up to 128MB, and growing
> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
> + * is used up.

When the chunk size is 128MB, I'm a bit unconvinced that using
exponential growth is worth it. The allocator overhead can't be
meaningful in comparison to collecting 128MB dead tuples, the potential
waste is pretty big, and it increases memory fragmentation.


> + * Lookup in that structure proceeds sequentially in the list of segments,
> + * and with a binary search within each segment. Since segment's size grows
> + * exponentially, this retains O(N log N) lookup complexity.

N log N is a horrible lookup complexity.  That's the complexity of
*sorting* an entire array.  I think you might be trying to argue that
it's log(N) * log(N)? Once log(n) for the exponentially growing size of
segments, one for the binary search?

Afaics you could quite easily make it O(2 log(N)) by simply also doing
binary search over the segments.  Might not be worth it due to the small
constant involved normally.


> + * If the array threatens to overflow, we suspend the heap scan phase and
> + * perform a pass of index cleanup and page compaction, then resume the heap
> + * scan with an array of logically empty but already preallocated TID segments
> + * to be refilled with more dead tuple TIDs.

Hm, it's not really the array that overflows, it's m_w_m that'd be
exceeded, right?


>  /*
> + * Minimum (starting) size of the dead_tuples array segments. Will allocate
> + * space for 128MB worth of tid pointers in the first segment, further segments
> + * will grow in size exponentially. Don't make it too small or the segment list
> + * will grow bigger than the sweetspot for search efficiency on big vacuums.
> + */
> +#define LAZY_MIN_TUPLES        Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))

That's not really the minimum, no? s/MIN/INIT/?


> +typedef struct DeadTuplesSegment
> +{
> +    int            num_dead_tuples;    /* # of entries in the segment */
> +    int            max_dead_tuples;    /* # of entries allocated in the segment */
> +    ItemPointerData last_dead_tuple;    /* Copy of the last dead tuple (unset
> +                                         * until the segment is fully
> +                                         * populated) */
> +    unsigned short padding;
> +    ItemPointer dt_tids;    /* Array of dead tuples */
> +}    DeadTuplesSegment;

Whenever padding is needed, it should have an explanatory comment.  It's
certainly not obvious to me wh it's neede here.


> @@ -1598,6 +1657,11 @@ lazy_vacuum_index(Relation indrel,
>      ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
>      ivinfo.strategy = vac_strategy;
>  
> +    /* Finalize the current segment by setting its upper bound dead tuple */
> +    seg = DeadTuplesCurrentSegment(vacrelstats);
> +    if (seg->num_dead_tuples > 0)
> +        seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];

Why don't we just maintain this here, for all of the segments?  Seems a
bit easier.


> @@ -1973,7 +2037,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
>  static void
>  lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>  {
> -    long        maxtuples;
> +    long        maxtuples,
> +                mintuples;
>      int            vac_work_mem = IsAutoVacuumWorkerProcess() &&
>      autovacuum_work_mem != -1 ?
>      autovacuum_work_mem : maintenance_work_mem;
> @@ -1982,7 +2047,6 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>      {
>          maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>          maxtuples = Min(maxtuples, INT_MAX);
> -        maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
>  
>          /* curious coding here to ensure the multiplication can't overflow */
>          if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
> @@ -1996,10 +2060,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>          maxtuples = MaxHeapTuplesPerPage;
>      }
>  
> -    vacrelstats->num_dead_tuples = 0;
> -    vacrelstats->max_dead_tuples = (int) maxtuples;
> -    vacrelstats->dead_tuples = (ItemPointer)
> -        palloc(maxtuples * sizeof(ItemPointerData));
> +    mintuples = Min(LAZY_MIN_TUPLES, maxtuples);
> +
> +    vacrelstats->dead_tuples.num_entries = 0;
> +    vacrelstats->dead_tuples.max_entries = (int) maxtuples;
> +    vacrelstats->dead_tuples.num_segs = 1;
> +    vacrelstats->dead_tuples.last_seg = 0;
> +    vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *)
> +        palloc(sizeof(DeadTuplesSegment));
> +    vacrelstats->dead_tuples.dt_segments[0].dt_tids = (ItemPointer)
> +        palloc(mintuples * sizeof(ItemPointerData));
> +    vacrelstats->dead_tuples.dt_segments[0].max_dead_tuples = mintuples;
> +    vacrelstats->dead_tuples.dt_segments[0].num_dead_tuples = 0;
>  }

Hm. Why don't we delay allocating dt_segments[0] till we actually need
it?  It's not uncommon for vacuums not to be able to find any dead
tuples, and it'd not change code in lazy_record_dead_tuple() much.


> @@ -2014,31 +2086,147 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
>       * could if we are given a really small maintenance_work_mem. In that
>       * case, just forget the last few tuples (we'll get 'em next time).
>       */
> -    if (vacrelstats->num_dead_tuples < vacrelstats->max_dead_tuples)
> +    if (vacrelstats->dead_tuples.num_entries < vacrelstats->dead_tuples.max_entries)
>      {
> -        vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr;
> -        vacrelstats->num_dead_tuples++;
> +        DeadTuplesSegment *seg = DeadTuplesCurrentSegment(vacrelstats);
> +
> +        if (seg->num_dead_tuples >= seg->max_dead_tuples)
> +        {
> +            /*
> +             * The segment is overflowing, so we must allocate a new segment.
> +             * We could have a preallocated segment descriptor already, in
> +             * which case we just reinitialize it, or we may need to repalloc
> +             * the vacrelstats->dead_tuples array. In that case, seg will no
> +             * longer be valid, so we must be careful about that. In any case,
> +             * we must update the last_dead_tuple copy in the overflowing
> +             * segment descriptor.
> +             */
> +            Assert(seg->num_dead_tuples == seg->max_dead_tuples);
> +            seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
> +            if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs)
> +            {
> +                int            new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
> +
> +                vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc(
> +                               (void *) vacrelstats->dead_tuples.dt_segments,
> +                                   new_num_segs * sizeof(DeadTuplesSegment));

Might be worth breaking this into some sub-statements, it's quite hard
to read.


> +                while (vacrelstats->dead_tuples.num_segs < new_num_segs)
> +                {
> +                    /* Initialize as "unallocated" */
> +                    DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[
> +                                         vacrelstats->dead_tuples.num_segs]);

dito.


> +/*
>   *    lazy_tid_reaped() -- is a particular tid deletable?
>   *
>   *        This has the right signature to be an IndexBulkDeleteCallback.
>   *
> - *        Assumes dead_tuples array is in sorted order.
> + *        Assumes the dead_tuples multiarray is in sorted order, both
> + *        the segment list and each segment itself, and that all segments'
> + *        last_dead_tuple fields up to date
>   */
>  static bool
>  lazy_tid_reaped(ItemPointer itemptr, void *state)

Have you done performance evaluation about potential performance
regressions in big indexes here?  IIRC this can be quite frequently
called?


I think this is reasonably close to commit, but unfortunately not quite
there yet. I.e. I personally won't polish this up & commit in the next
couple hours, but if somebody else wants to take that on...

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Pavel Stehule
Date:
Subject: Re: [HACKERS] Undefined psql variables
Next
From: Alvaro Herrera
Date:
Subject: Re: [HACKERS] Remaining 2017-03 CF entries