Re: [PoC] Improve dead tuple storage for lazy vacuum - Mailing list pgsql-hackers
From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:
Msg-id: CAD21AoBdz6XW5bMSN7EVhtyXOLj32YAt6+evJLPMEZvaajKB-Q@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
> > 1. Don't allocate more than 1GB. There was a discussion to eliminate
> > this limitation by using MemoryContextAllocHuge() but there were
> > concerns about point 2[1].
> >
> > 2. Allocate the whole memory space at once.
> >
> > 3. Slow lookup performance (O(logN)).
> >
> > I've done some experiments in this area and would like to share the
> > results and discuss ideas.
>
> Yea, this is a serious issue.
>
> 3) could possibly be addressed to a decent degree without changing the
> fundamental datastructure too much. There's some sizable and trivial
> wins by just changing vac_cmp_itemptr() to compare int64s and by using
> an open coded bsearch().
>
> The big problem with bsearch isn't imo the O(log(n)) complexity - it's
> that it has abominably bad cache locality. And that can be addressed
> https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
>
> Imo 2) isn't really that hard a problem to improve, even if we were to
> stay with the current bsearch approach. Reallocation with an aggressive
> growth factor or such isn't that bad.
>
> That's not to say we ought to stay with binary search...
>
> > Problems Solutions
> > ===============
> >
> > Firstly, I've considered using existing data structures:
> > IntegerSet (src/backend/lib/integerset.c) and
> > TIDBitmap (src/backend/nodes/tidbitmap.c). Those address point 1 but
> > only either point 2 or 3. IntegerSet uses lower memory thanks to
> > simple-8b encoding but is slow at lookup, still O(logN), since it's a
> > tree structure. On the other hand, TIDBitmap has good lookup
> > performance, O(1), but could unnecessarily use larger memory in some
> > cases since it always allocates enough bitmap space to store all
> > possible offsets. With 8kB blocks, the maximum number of line pointers
> > in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the bitmap is 40
> > bytes long and we always need 46 bytes in total per block, including
> > other meta information.
>
> Imo tidbitmap isn't particularly good, even in the current use cases -
> it's constraining in what we can store (a problem for other AMs), not
> actually that dense, the lossy mode doesn't choose what information to
> lose well, etc.
>
> It'd be nice if we came up with a datastructure that could also replace
> the bitmap scan cases.

Agreed.

> > The data structure is somewhat similar to TIDBitmap. It consists of a
> > hash table and a container area; the hash table has an entry per
> > block, and each block entry allocates its memory space, called a
> > container, in the container area to store its offset numbers. The
> > container area is actually an array of bytes and can be enlarged as
> > needed. In the container area, the data representation of offset
> > numbers varies depending on their cardinality. It has three container
> > types: array, bitmap, and run.
>
> Not a huge fan of encoding this much knowledge about the tid layout...
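To make the layout described above easier to follow, here is a minimal sketch, assuming hypothetical names (BlockEntry, ContainerType, etc.); this is not the posted PoC code, just one way the per-block entry and the cheap lookups could look:

/*
 * Sketch only: per-block entry keyed by block number, pointing into a
 * shared, growable "container area" of 2-byte values.
 */
#include <stdbool.h>
#include <stdint.h>

typedef enum ContainerType
{
    CONTAINER_ARRAY,            /* sorted array of 2-byte offset numbers */
    CONTAINER_BITMAP,           /* uncompressed bitmap, one bit per offset */
    CONTAINER_RUN               /* (start offset, length) pairs, 2 bytes each */
} ContainerType;

typedef struct BlockEntry
{
    uint32_t      blkno;          /* hash key: heap block number */
    ContainerType type;           /* representation chosen by cardinality */
    uint16_t      nitems;         /* number of 2-byte values in the container */
    uint32_t      container_off;  /* byte offset into the shared container area */
} BlockEntry;

/*
 * Array container lookup: a plain linear scan. A container never grows
 * past the size of a full bitmap (37 bytes with 8kB blocks), so N stays
 * small here.
 */
static bool
array_container_contains(const uint16_t *values, uint16_t nitems, uint16_t off)
{
    for (uint16_t i = 0; i < nitems; i++)
        if (values[i] == off)
            return true;
    return false;
}

/* Run container lookup: each run is a (start, length) pair. */
static bool
run_container_contains(const uint16_t *runs, uint16_t nruns, uint16_t off)
{
    for (uint16_t i = 0; i < nruns; i++)
    {
        uint16_t start = runs[2 * i];
        uint16_t len   = runs[2 * i + 1];

        if (off >= start && off < start + len)
            return true;
    }
    return false;
}

The example that follows maps directly onto these three container cases.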
> > For example, if there are two dead tuples at offset 1 and 150, it
> > uses the array container that has an array of two 2-byte integers
> > representing 1 and 150, using 4 bytes in total. If we used the bitmap
> > container in this case, we would need 20 bytes instead. On the other
> > hand, if there are 20 consecutive dead tuples from offset 1 to 20, it
> > uses the run container that has an array of 2-byte integers. The
> > first value in each pair represents a starting offset number, whereas
> > the second value represents its length. Therefore, in this case, the
> > run container uses only 4 bytes in total. Finally, if there are dead
> > tuples at every other offset from 1 to 100, it uses the bitmap
> > container that has an uncompressed bitmap, using 13 bytes. We need
> > another 16 bytes per block entry for the hash table entry.
> >
> > The lookup complexity of a bitmap container is O(1), whereas that of
> > an array or a run container is O(N) or O(logN); but since the number
> > of elements in those two containers should not be large, it would not
> > be a problem.
>
> Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
> many tuples have been deleted. In cases where the "run" storage is
> cheaper (e.g. because there's high offset numbers due to HOT pruning),
> we could end up regularly scanning a few hundred entries for a
> match. That's not cheap anymore.

With 8kB blocks, the maximum size of a bitmap container is 37 bytes. IOW,
the other two types of containers are always smaller than 37 bytes. Since
the run container uses 4 bytes per run, the number of runs in a run
container can never be more than 9. Even with 32kB blocks, we don't have
more than 37 runs. So I think N is small enough in this case.

> > Evaluation
> > ========
> >
> > Before implementing this idea and integrating it with the lazy vacuum
> > code, I've implemented a benchmark tool dedicated to evaluating
> > lazy_tid_reaped() performance[4].
>
> Good idea!
>
> > In all test cases, I simulated that the table has 1,000,000 blocks and
> > every block has at least one dead tuple.
>
> That doesn't strike me as a particularly common scenario? I think it's
> quite rare for dead tuples to be so evenly but sparsely distributed. In
> particular it's very common for there to be long runs of dead tuples
> separated by long ranges of no dead tuples at all...

Agreed. I'll test with such scenarios.

> > The benchmark scenario is that for each virtual heap tuple we check
> > whether its TID is in the dead tuple storage. Here are the results of
> > execution time in milliseconds and memory usage in bytes:
>
> In which order are the dead tuples checked? Looks like in sequential
> order? In the case of an index over a column that's not correlated with
> the heap order, the lookups are often much more random - which can
> influence lookup performance drastically, due to differences in cache
> locality. Which will make some structures look worse/better than others.

Good point. It's sequential order, which is not good. I'll test again
after shuffling the virtual index tuples.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
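Returning to the vac_cmp_itemptr() suggestion near the top of the mail, here is a minimal sketch of comparing TIDs as single 64-bit integers. DeadTuple and encode_tid() are made-up names, and the struct is a simplified stand-in for ItemPointerData (which actually stores the block number as two uint16 halves); this is not existing PostgreSQL code:

/*
 * Sketch only: pack (block, offset) into one uint64 so the comparator
 * does a single integer comparison instead of branching on each field.
 */
#include <stdint.h>
#include <stdlib.h>

typedef struct DeadTuple
{
    uint32_t block;             /* heap block number */
    uint16_t offset;            /* line pointer offset within the block */
} DeadTuple;

static inline uint64_t
encode_tid(const DeadTuple *tid)
{
    /* block number in the high bits, offset in the low 16 bits */
    return ((uint64_t) tid->block << 16) | tid->offset;
}

/* qsort()/bsearch()-compatible comparator over encoded TIDs */
static int
cmp_dead_tuple(const void *a, const void *b)
{
    uint64_t la = encode_tid((const DeadTuple *) a);
    uint64_t lb = encode_tid((const DeadTuple *) b);

    if (la < lb)
        return -1;
    if (la > lb)
        return 1;
    return 0;
}

Combined with an open-coded binary search (which avoids the indirect comparator call made by bsearch()), this is the kind of trivial win mentioned above; the cache-locality problem discussed in the linked paper is a separate issue.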