Home > mailing lists
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers

From	Claudio Freire
Subject	Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Date	April 8, 2017 04:43:39
Msg-id	CAGTBQpb68h6XBwfWa054BF7AgkRgnBFJhWQ9o6svC5cHii9afQ@mail.gmail.com Whole thread Raw
In response to	Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem (Andres Freund <andres@anarazel.de>)
Responses	Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
List	pgsql-hackers
Tree view
On Fri, Apr 7, 2017 at 5:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> I've *not* read the history of this thread.  So I really might be
> missing some context.
>
>
>> From e37d29c26210a0f23cd2e9fe18a264312fecd383 Mon Sep 17 00:00:00 2001
>> From: Claudio Freire <klaussfreire@gmail.com>
>> Date: Mon, 12 Sep 2016 23:36:42 -0300
>> Subject: [PATCH] Vacuum: allow using more than 1GB work mem
>>
>> Turn the dead_tuples array into a structure composed of several
>> exponentially bigger arrays, to enable usage of more than 1GB
>> of work mem during vacuum and thus reduce the number of full
>> index scans necessary to remove all dead tids when the memory is
>> available.
>
>>   * We are willing to use at most maintenance_work_mem (or perhaps
>>   * autovacuum_work_mem) memory space to keep track of dead tuples.  We
>> - * initially allocate an array of TIDs of that size, with an upper limit that
>> + * initially allocate an array of TIDs of 128MB, or an upper limit that
>>   * depends on table size (this limit ensures we don't allocate a huge area
>> - * uselessly for vacuuming small tables).  If the array threatens to overflow,
>> - * we suspend the heap scan phase and perform a pass of index cleanup and page
>> - * compaction, then resume the heap scan with an empty TID array.
>> + * uselessly for vacuuming small tables). Additional arrays of increasingly
>> + * large sizes are allocated as they become necessary.
>> + *
>> + * The TID array is thus represented as a list of multiple segments of
>> + * varying size, beginning with the initial size of up to 128MB, and growing
>> + * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
>> + * is used up.
>
> When the chunk size is 128MB, I'm a bit unconvinced that using
> exponential growth is worth it. The allocator overhead can't be
> meaningful in comparison to collecting 128MB dead tuples, the potential
> waste is pretty big, and it increases memory fragmentation.

The exponential strategy is mainly to improve lookup time (ie: to
avoid large segment lists).

>> + * Lookup in that structure proceeds sequentially in the list of segments,
>> + * and with a binary search within each segment. Since segment's size grows
>> + * exponentially, this retains O(N log N) lookup complexity.
>
> N log N is a horrible lookup complexity.  That's the complexity of
> *sorting* an entire array.  I think you might be trying to argue that
> it's log(N) * log(N)? Once log(n) for the exponentially growing size of
> segments, one for the binary search?
>
> Afaics you could quite easily make it O(2 log(N)) by simply also doing
> binary search over the segments.  Might not be worth it due to the small
> constant involved normally.

It's a typo, yes, I meant O(log N) (which is equivalent to O(2 log N))

>> + * If the array threatens to overflow, we suspend the heap scan phase and
>> + * perform a pass of index cleanup and page compaction, then resume the heap
>> + * scan with an array of logically empty but already preallocated TID segments
>> + * to be refilled with more dead tuple TIDs.
>
> Hm, it's not really the array that overflows, it's m_w_m that'd be
> exceeded, right?

Yes, will rephrase. Although that's how the original comment expressed
the same concept.

>>  /*
>> + * Minimum (starting) size of the dead_tuples array segments. Will allocate
>> + * space for 128MB worth of tid pointers in the first segment, further segments
>> + * will grow in size exponentially. Don't make it too small or the segment list
>> + * will grow bigger than the sweetspot for search efficiency on big vacuums.
>> + */
>> +#define LAZY_MIN_TUPLES              Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
>
> That's not really the minimum, no? s/MIN/INIT/?

Ok

>> +typedef struct DeadTuplesSegment
>> +{
>> +     int                     num_dead_tuples;        /* # of entries in the segment */
>> +     int                     max_dead_tuples;        /* # of entries allocated in the segment */
>> +     ItemPointerData last_dead_tuple;        /* Copy of the last dead tuple (unset
>> +                                                                              * until the segment is fully
>> +                                                                              * populated) */
>> +     unsigned short padding;
>> +     ItemPointer dt_tids;    /* Array of dead tuples */
>> +}    DeadTuplesSegment;
>
> Whenever padding is needed, it should have an explanatory comment.  It's
> certainly not obvious to me wh it's neede here.

Ok

>> @@ -1598,6 +1657,11 @@ lazy_vacuum_index(Relation indrel,
>>       ivinfo.num_heap_tuples = vacrelstats->old_rel_tuples;
>>       ivinfo.strategy = vac_strategy;
>>
>> +     /* Finalize the current segment by setting its upper bound dead tuple */
>> +     seg = DeadTuplesCurrentSegment(vacrelstats);
>> +     if (seg->num_dead_tuples > 0)
>> +             seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
>
> Why don't we just maintain this here, for all of the segments?  Seems a
> bit easier.

Originally, I just wanted to maintain the validity of last_dead_tuple
as an invariant at all times. But it may be like you say, that it's
simpler to just maintain the invariant of all segments at finalization
time. I'll explore that possibility.

>> @@ -1973,7 +2037,8 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
>>  static void
>>  lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>>  {
>> -     long            maxtuples;
>> +     long            maxtuples,
>> +                             mintuples;
>>       int                     vac_work_mem = IsAutoVacuumWorkerProcess() &&
>>       autovacuum_work_mem != -1 ?
>>       autovacuum_work_mem : maintenance_work_mem;
>> @@ -1982,7 +2047,6 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>>       {
>>               maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
>>               maxtuples = Min(maxtuples, INT_MAX);
>> -             maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
>>
>>               /* curious coding here to ensure the multiplication can't overflow */
>>               if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
>> @@ -1996,10 +2060,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
>>               maxtuples = MaxHeapTuplesPerPage;
>>       }
>>
>> -     vacrelstats->num_dead_tuples = 0;
>> -     vacrelstats->max_dead_tuples = (int) maxtuples;
>> -     vacrelstats->dead_tuples = (ItemPointer)
>> -             palloc(maxtuples * sizeof(ItemPointerData));
>> +     mintuples = Min(LAZY_MIN_TUPLES, maxtuples);
>> +
>> +     vacrelstats->dead_tuples.num_entries = 0;
>> +     vacrelstats->dead_tuples.max_entries = (int) maxtuples;
>> +     vacrelstats->dead_tuples.num_segs = 1;
>> +     vacrelstats->dead_tuples.last_seg = 0;
>> +     vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *)
>> +             palloc(sizeof(DeadTuplesSegment));
>> +     vacrelstats->dead_tuples.dt_segments[0].dt_tids = (ItemPointer)
>> +             palloc(mintuples * sizeof(ItemPointerData));
>> +     vacrelstats->dead_tuples.dt_segments[0].max_dead_tuples = mintuples;
>> +     vacrelstats->dead_tuples.dt_segments[0].num_dead_tuples = 0;
>>  }
>
> Hm. Why don't we delay allocating dt_segments[0] till we actually need
> it?  It's not uncommon for vacuums not to be able to find any dead
> tuples, and it'd not change code in lazy_record_dead_tuple() much.

I avoided that because that would make dt_segments[last_seg] invalid
for the case of a just-initialized multiarray.

Some places in the code use a macro that references
dt_segments[last_seg] (mostly for indexless tables), and having to
check num_segs and do lazy initialization would have complicated the
code considerably.

Nonetheless, I'll re-check how viable doing that would be.

>> @@ -2014,31 +2086,147 @@ lazy_record_dead_tuple(LVRelStats *vacrelstats,
>>        * could if we are given a really small maintenance_work_mem. In that
>>        * case, just forget the last few tuples (we'll get 'em next time).
>>        */
>> -     if (vacrelstats->num_dead_tuples < vacrelstats->max_dead_tuples)
>> +     if (vacrelstats->dead_tuples.num_entries < vacrelstats->dead_tuples.max_entries)
>>       {
>> -             vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr;
>> -             vacrelstats->num_dead_tuples++;
>> +             DeadTuplesSegment *seg = DeadTuplesCurrentSegment(vacrelstats);
>> +
>> +             if (seg->num_dead_tuples >= seg->max_dead_tuples)
>> +             {
>> +                     /*
>> +                      * The segment is overflowing, so we must allocate a new segment.
>> +                      * We could have a preallocated segment descriptor already, in
>> +                      * which case we just reinitialize it, or we may need to repalloc
>> +                      * the vacrelstats->dead_tuples array. In that case, seg will no
>> +                      * longer be valid, so we must be careful about that. In any case,
>> +                      * we must update the last_dead_tuple copy in the overflowing
>> +                      * segment descriptor.
>> +                      */
>> +                     Assert(seg->num_dead_tuples == seg->max_dead_tuples);
>> +                     seg->last_dead_tuple = seg->dt_tids[seg->num_dead_tuples - 1];
>> +                     if (vacrelstats->dead_tuples.last_seg + 1 >= vacrelstats->dead_tuples.num_segs)
>> +                     {
>> +                             int                     new_num_segs = vacrelstats->dead_tuples.num_segs * 2;
>> +
>> +                             vacrelstats->dead_tuples.dt_segments = (DeadTuplesSegment *) repalloc(
>> +                                                        (void *) vacrelstats->dead_tuples.dt_segments,
>> +                                                                new_num_segs * sizeof(DeadTuplesSegment));
>
> Might be worth breaking this into some sub-statements, it's quite hard
> to read.

Breaking what precisely? The comment?

>> +                             while (vacrelstats->dead_tuples.num_segs < new_num_segs)
>> +                             {
>> +                                     /* Initialize as "unallocated" */
>> +                                     DeadTuplesSegment *nseg = &(vacrelstats->dead_tuples.dt_segments[
>> +                                                                              vacrelstats->dead_tuples.num_segs]);
>
> dito.

I don't really get what you're asking here.

>> +/*
>>   *   lazy_tid_reaped() -- is a particular tid deletable?
>>   *
>>   *           This has the right signature to be an IndexBulkDeleteCallback.
>>   *
>> - *           Assumes dead_tuples array is in sorted order.
>> + *           Assumes the dead_tuples multiarray is in sorted order, both
>> + *           the segment list and each segment itself, and that all segments'
>> + *           last_dead_tuple fields up to date
>>   */
>>  static bool
>>  lazy_tid_reaped(ItemPointer itemptr, void *state)
>
> Have you done performance evaluation about potential performance
> regressions in big indexes here?  IIRC this can be quite frequently
> called?

Yes, the benchmarks are upthread. The earlier runs were run on my
laptop and made little sense, so I'd ignore them as inaccurate. The
latest run[1] with a pgbench scale of 4000 gave an improvement in CPU
time (ie: faster) of about 20%. Anastasia did another one[2] and saw
improvements as well, roughly 30%, though it's not measuring CPU time
but rather elapsed time.

Even small scales (100) saw an improvement as well, although possibly
below the noise floor. Tests are very slow so I haven't run enough to
measure variance and statistical significance.

I blame the improvement not only on better cache locality (the initial
search on the segment list usually fits on L1) but also on less
overall work due to needing less index scans, and the fact that
overall lookup complexity remains O(log N) due to the exponential
segment growth strategy.

[1] https://www.postgresql.org/message-id/CAGTBQpa6NFGO_6g_y_7zQx8L9GcHDSQKYdo1tGuh791z6PYgEg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/13bee467-bdcf-d3b9-c0ee-e2792fd46839%40postgrespro.ru

>
>
> I think this is reasonably close to commit, but unfortunately not quite
> there yet. I.e. I personally won't polish this up & commit in the next
> couple hours, but if somebody else wants to take that on...
>
> Greetings,
>
> Andres Freund

I'll post an updated patch with the requested changes shortly.
pgsql-hackers by date:
From: Peter Eisentraut
Date: 08 April 2017, 04:34:51
Subject: Re: [HACKERS] monitoring.sgml missing tag
From: Claudio Freire
Date: 08 April 2017, 04:46:06
Subject: Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem
Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers

Previous

Next