Re: Vacuum: allow usage of more than 1GB of work mem - Mailing list pgsql-hackers

From Pavan Deolasee
Subject Re: Vacuum: allow usage of more than 1GB of work mem
Date
Msg-id CABOikdNtZJFMr2Cx+oJ=A2gUOPFpEpi++y25+2u0=Nr8r7utig@mail.gmail.com
In response to Re: Vacuum: allow usage of more than 1GB of work mem  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Vacuum: allow usage of more than 1GB of work mem  (Pavan Deolasee <pavan.deolasee@gmail.com>)
List pgsql-hackers


On Wed, Sep 14, 2016 at 12:21 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Attached PoC patch changes the representation of dead tuple locations
> to a hashmap holding a tuple bitmap per block.
> Each hashmap entry consists of a block number and the TID bitmap of
> the corresponding block, with the block number as the hash key.
> The current implementation of this patch is not smart yet, because each
> hashmap entry allocates the tuple bitmap with a fixed size
> (LAZY_ALLOC_TUPLES), so each entry can store up to
> LAZY_ALLOC_TUPLES (291 if the block size is 8kB) tuples.
> In the case where one block stores only a few tens of tuples, most of
> the bits would be wasted.
>
> After improving this patch as you suggested, I will measure the performance benefit.

Now, if I'm reading it correctly, this patch allocates a 132-byte
structure for every page with at least one dead tuple.  In the worst
case where there's just one dead tuple per page, that's a 20x
regression in memory usage.  Actually, it's more like 40x, because
AllocSetAlloc rounds small allocation sizes up to the next-higher
power of two, which really stings for a 132-byte allocation, and then
adds a 16-byte header to each chunk.  But even 20x is clearly not
good.  There are going to be lots of real-world cases where this uses
way more memory to track the same number of dead tuples, and I'm
guessing that benchmarking is going to reveal that it's not faster,
either.
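
(To spell out the arithmetic: the current representation spends one 6-byte ItemPointerData per dead tuple, so 132 bytes for a page with a single dead tuple is ~22x, and with the allocation rounded up to 256 bytes plus the 16-byte chunk header, ~45x.)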


Sawada-san offered to reimplement the patch based on what I proposed upthread. In the new scheme of things, we will allocate a fixed-size bitmap of length 2D bits per page, where D is the average page density of live + dead tuples. (The rationale behind multiplying D by a factor of 2 is to cover the worst-case scenario where every tuple also has an LP_REDIRECT line pointer.) The value of D in most real-world, large tables should not go much beyond, say, 100, assuming 80-byte-wide tuples and an 8kB block size. That translates to about 25 bytes/page. So all TIDs with offsets up to 2D can be represented by a single bit each. We augment this with an overflow area to track tuples whose offsets fall outside that limit. I believe this overflow area will be small, say 10% of the total allocation.
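
To make the layout concrete, here is a minimal, untested sketch of one per-page entry and its accessors. The names and the overflow handling are my own illustration, not code from any patch, and D is hard-coded at 100:

#include "postgres.h"
#include "storage/itemptr.h"

#define LV_BITMAP_OFFSETS   200                     /* 2D, for D = 100 */
#define LV_BITMAP_BYTES     (LV_BITMAP_OFFSETS / 8) /* 25 bytes/page */

typedef struct LVPageBitmap
{
    BlockNumber blkno;                  /* heap block covered by this entry */
    uint8       bits[LV_BITMAP_BYTES];  /* bit (off - 1) set => offset off is dead */
} LVPageBitmap;

/*
 * Record one dead tuple: set its bit if the offset fits within 2D,
 * else append the full TID to a small overflow array.
 */
static void
lv_record_dead(LVPageBitmap *entry,
               ItemPointerData *overflow, int *noverflow,
               OffsetNumber off)
{
    if (off <= LV_BITMAP_OFFSETS)
        entry->bits[(off - 1) >> 3] |= (uint8) (1 << ((off - 1) & 7));
    else
    {
        ItemPointerSet(&overflow[*noverflow], entry->blkno, off);
        (*noverflow)++;
    }
}

/* Probe used during index vacuum: is this offset recorded as dead? */
static bool
lv_is_dead(const LVPageBitmap *entry, OffsetNumber off)
{
    if (off <= LV_BITMAP_OFFSETS)
        return (entry->bits[(off - 1) >> 3] & (1 << ((off - 1) & 7))) != 0;
    return false;                       /* caller must also check overflow */
}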

This representation is at least as good as the current one once at least 4-5% of the tuples are dead: the current representation spends 6 bytes per dead-tuple TID, so at ~25 bytes/page the bitmap breaks even at roughly 4 dead tuples on a 100-tuple page. I don't think very large tables will be vacuumed with a scale factor smaller than that, and assuming 10% dead tuples this representation is actually far more compact.

The idea can fail when (a) there are very few dead tuples in the table, say less than 5%, or (b) a large number of tuples fall outside the 2D limit. While I don't expect either of these to hold for real-world, very large tables, for (a) we can anticipate the situation when the vacuum starts and use the current representation (a sketch of the startup check follows), and for (b) we can detect the condition at run time and do a one-time switch between representations. You may argue that managing two representations is clumsy, and I agree, but the code is completely isolated and probably not more than a few hundred lines.
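
For (a), the choice could be as simple as comparing expected memory use up front. A hypothetical sketch, with the function and its inputs purely illustrative:

#include <stdbool.h>

/*
 * Decide at the start of vacuum, from the stats-based estimate of the
 * dead-tuple fraction, whether the bitmap representation pays off.
 * The flat array costs 6 bytes per dead TID; the bitmap costs 2D bits
 * per page, so break-even is at D/24 dead tuples per page (~4% for
 * D = 100).
 */
static bool
use_bitmap_representation(double tuples_per_page, double dead_fraction)
{
    double  bitmap_bytes = 2.0 * tuples_per_page / 8.0;         /* 2D bits */
    double  array_bytes = 6.0 * dead_fraction * tuples_per_page; /* 6-byte TIDs */

    return bitmap_bytes <= array_bytes;
}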

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
