Re: Zedstore - compressed in-core columnar storage

From: Tomas Vondra
Subject: Re: Zedstore - compressed in-core columnar storage
Msg-id: 20190414170853.xgkxanj6v2c34z6b@development
In response to: Zedstore - compressed in-core columnar storage (Ashwin Agrawal <aagrawal@pivotal.io>)
List: pgsql-hackers
On Mon, Apr 08, 2019 at 05:27:05PM -0700, Ashwin Agrawal wrote:
> Heikki and I have been hacking recently for a few weeks to implement
> in-core columnar storage for PostgreSQL. Here's the design and initial
> implementation of Zedstore, compressed in-core columnar storage (table
> access method). Attaching the patch and link to github branch [1] to
> follow along.
>
> The objective is to gather feedback on the design and approach.
> The implementation has the core basic pieces working but is not close
> to complete.
>
> Big thank you to Andres, Haribabu and team for the table access method
> APIs. Leveraged the APIs for implementing zedstore, which proved the
> APIs to be in very good shape. Had to enhance them minimally, but
> in general didn't have to touch the executor much.
>
> Motivations / Objectives
>
> * Performance improvement for queries selecting a subset of columns
>   (reduced IO).
> * Reduced on-disk footprint compared to heap table. Shorter tuple
>   headers and also leveraging compression of similar type data.
> * Be a first-class citizen in the Postgres architecture (table data can
>   just independently live in columnar storage).
> * Fully MVCC compliant.
> * All indexes supported.
> * Hybrid row-column store, where some columns are stored together, and
>   others separately. Provide flexibility of granularity on how to
>   divide the columns. Columns accessed together can be stored
>   together.
> * Provide better control over bloat (similar to zheap).
> * Eliminate the need for separate toast tables.
> * Faster add / drop column or changing the data type of a column by
>   avoiding a full rewrite of the table.
>

Cool. Me gusta.

> High-level Design - B-trees for the win!
> ========================================
>
> To start simple, let's ignore the column store aspect for a moment and
> consider it as a compressed row store. The column store is a natural
> extension of this concept, explained in the next section.
>
> The basic on-disk data structure leveraged is a B-tree, indexed by
> TID.
> The B-tree is a great data structure, fast and versatile. Note this
> is not referring to the existing B-tree indexes, but instead a net new
> separate B-tree for table data storage.
>
> TID - logical row identifier:
> TID is just a 48-bit row identifier. The traditional division into
> block and offset numbers is meaningless. In order to find a tuple with
> a given TID, one must always descend the B-tree. Having a logical TID
> provides the flexibility to move tuples around to different pages when
> page splits or page merges are performed.
>

So if TIDs are redefined this way, how does this affect BRIN indexes? I
mean, that's a lightweight indexing scheme which however assumes TIDs
encode a certain amount of locality - so this probably makes them (and
Bitmap Heap Scans in general) much less efficient. That's a bit
unfortunate, although I don't see a way around it :-(

> The internal pages of the B-tree are super simple and boring. Each
> internal page just stores an array of TID and downlink pairs. Let's
> focus on the leaf level. Leaf blocks have a short uncompressed header,
> followed by btree items. Two kinds of items exist:
>
> - plain item, holds one tuple or one datum, uncompressed payload
> - a "container item", holds multiple plain items, compressed payload
>
> +-----------------------------
> | Fixed-size page header:
> |
> |   LSN
> |   TID low and hi key (for Lehman & Yao B-tree operations)
> |   left and right page pointers
> |
> | Items:
> |
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | uncompressed size | lastTID | payload (container item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   TID | size | flags | undo pointer | payload (plain item)
> |   ...
> |
> +----------------------------
>

So if I understand it correctly, ZSUncompressedBtreeItem is the "plain"
item and ZSCompressedBtreeItem is the container one. Correct?
I find it a bit confusing, and I too ran into the issue with data that
can't be compressed, so I think the "container" should support both
compressed and uncompressed data. Heikki already mentioned that, so I
suppose it's just not implemented yet. That however means the name of
the "compressed" struct gets confusing, so I suggest renaming to:

    ZSUncompressedBtreeItem -> ZSPlainBtreeItem
    ZSCompressedBtreeItem -> ZSContainerBtreeItem

where the container supports both compressed and uncompressed mode.
Also, maybe we don't need to put "Btree" into every damn name ;-)

Looking at the ZSCompressedBtreeItem, I see it stores just the
first/last TID for the compressed data. Won't that be insufficient when
there are some gaps due to deletions or something? Or perhaps I just
don't understand how it works.

Another thing is that with the uncompressed size being stored as a
uint16, won't that be insufficient for highly compressible data / large
pages? I mean, we can have pages up to 32kB, which is not that far.

> Column store
> ------------
>
> A column store uses the same structure but we have *multiple* B-trees,
> one for each column, all indexed by TID. The B-trees for all columns
> are stored in the same physical file.
>
> A metapage at block 0 has links to the roots of the B-trees. Leaf
> pages look the same, but instead of storing the whole tuple, store
> just a single attribute. To reconstruct a row with a given TID, a scan
> descends down the B-trees for all the columns using that TID, and
> fetches all attributes. Likewise, a sequential scan walks all the
> B-trees in lockstep.
>

OK, so data for all the columns are stored in separate btrees, but in
the same physical file. Wouldn't it be more convenient to have one
relfilenode per column? That would also mean the 32TB limit applies to
individual columns, not the whole table. Of course, it'd be more
complicated, and partitioning allows us to work around that limit.
> So, in summary, you can imagine Zedstore as a forest of B-trees, one
> for each column, all indexed by TIDs.
>
> This way of laying out the data also easily allows for a hybrid
> row-column store, where some columns are stored together, and others
> have a dedicated B-tree. Need to have user-facing syntax to allow
> specifying how to group the columns.
>

OK, makes sense. Do you also envision supporting per-column / per-group
compression etc?

> Main reasons for storing data this way
> --------------------------------------
>
> * Lay out the data/tuples in a mapped fashion instead of keeping the
>   logical to physical mapping separate from the actual data. So, keep
>   the meta-data and data logically in a single stream of file,
>   avoiding the need for separate forks/files to store meta-data and
>   data.
>
> * Stick to fixed size physical blocks. Variable size blocks pose a
>   need for increased logical to physical mapping maintenance, plus
>   restrictions on concurrency of writes and reads to files. Hence
>   adopt compression to fit fixed size blocks instead of the other way
>   round.
>
> MVCC
> ----
> MVCC works very similarly to zheap for zedstore. Undo record pointers
> are used to implement MVCC. Transaction information is not directly
> stored with the data. In zheap, there's a small, fixed number of
> "transaction slots" on each page, but zedstore has an undo pointer
> with each item directly; in normal cases, the compression squeezes
> this down to almost nothing.
>
> Implementation
> ==============
>
> Insert:
> Inserting a new row splits the row into datums. Then for the first
> column, decide which block to insert it into, pick a TID for it, and
> write an undo record for it. The rest of the columns are inserted
> using that same TID and point to the same undo position.
>

What about deletes? How do these work?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services