Status of the table access method work - Mailing list pgsql-hackers
From: Andres Freund
Subject: Status of the table access method work
Msg-id: 20190405202538.vu7sffsdqqvytmt2@alap3.anarazel.de
Responses: Re: Status of the table access method work
List: pgsql-hackers
Hi,

In this email I want to give a brief status update on the table access method work - I assume that most of you sensibly haven't followed it into all its nooks and crannies. I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the others that collaborated on making tableam happen. It was/is a huge project.

I think what's in v12 - I don't know of any non-cleanup / bugfix work pending for 12 - is a pretty reasonable initial set of features. It allows reimplementing a heap-like storage engine without any core modifications (except WAL logging, see below); it is not sufficient to implement a good index-oriented table AM. It does not allow storing the catalog in a non-heap table.

The tableam interface itself doesn't care that much about how the AM internally stores data. Most of the API (sequential scans, index lookups, insert/update/delete) doesn't know about blocks, and only indirectly & optionally about buffers (via BulkInsertState). There are a few callbacks / functions that do care about blocks, because it's not clear, or would have been too much work, to remove the dependency. Currently these are:

- ANALYZE integration - currently the sampling logic is tied to blocks.
- index build range scans - the range is defined as blocks.
- planner relation size estimate - but that could trivially just be filled with size-in-bytes / BLCKSZ in the callback.
- the (optional) bitmap heap scan API - that's fairly intrinsically block based. An AM could just internally subdivide TIDs in a different way, but I don't think a bitmap scan like we have would e.g. make a lot of sense for an index-oriented table without any sort of stable TID.
- the sample scan API - tsmapi.h is block based, so the tableam.h API is as well.

I think none of these are limiting in a particularly bad way.

The most constraining factor for storage, I think, is that currently the API relies on ItemPointerData-style TIDs in a number of places (i.e. a 6 byte tuple identifier).
One can implement scans, and inserts into index-less tables, without providing that, but no updates, deletes etc. One reason for that is that it would just have required more changes to the executor etc. to allow for wider identifiers, but the primary reason is that indexes currently simply don't support anything else. I think this is, by far, the biggest limitation of the API. If one e.g. wanted to implement a practical index-organized table, the 6 byte limitation obviously would become a problem very quickly. I suspect that we're going to want to get rid of that limitation in indexes before long for other reasons too, to allow global indexes (which'd need to encode the partition somewhere).

With regards to storing the rows themselves, the second biggest limitation is one that is not actually part of tableam itself: WAL. Many tableams would want to use WAL, but we only have extensible WAL as part of generic_xlog.h. While that's useful to allow prototyping etc., it's imo not efficient enough to build a competitive storage engine for OLTP (for OLAP it's probably much less of a problem). I don't know what the best approach here is - allowing "well known" extensions to register rmgr entries would be the easiest solution, but it's certainly a bit crummy.

Currently there's a, fairly minor, requirement that TIDs are actually unique when not using a snapshot qualifier. That's currently only relevant for GetTupleForTrigger(), AfterTriggerSaveEvent() and EvalPlanQualFetchRowMarks(), which use SnapshotAny. It prevents AMs from implementing in-place updates (and is thus a problem e.g. for zheap). I have a patch that fixes that, but it's too hacky for v12 - there's not always a convenient snapshot to fetch a row (e.g. in GetTupleForTrigger() after EPQ the row isn't visible to es_snapshot).

A second set of limitations is around making more of tableam optional. Right now it e.g. is not possible to have an AM that doesn't implement insert/update/delete.
Obviously an AM can just throw an error in the relevant callbacks, but I think it'd be better if we made those callbacks optional, and threw errors at parse-analysis time (both to make the errors consistent, and to ensure an error is consistently thrown, rather than only when e.g. an UPDATE actually finds a row to update).

Currently foreign keys are allowed between tables of different types of AM. I am wondering whether we ought to allow AMs to forbid being referenced. If e.g. an AM has lower consistency guarantees than the AM of the table referencing it, it might be preferable to forbid that. OTOH, I guess such an AM could just require UNLOGGED to be used.

Another restriction is actually related to UNLOGGED - currently the UNLOGGED processing after crashes works by recognizing init forks by file name. But what if e.g. the storage isn't inside postgres files? Not sure if we actually can do anything good about that.

The last issue I know about is that nodeBitmapHeapscan.c and nodeIndexOnlyscan.c currently directly access the visibilitymap. Which means that if an AM doesn't use the VM, it's never going to use the optimized path. And conversely, if the AM uses the VM, it needs to internally map TIDs in a way compatible with heap. I strongly suspect that we're going to have to fix this quite soon.

It'd be a pretty significant amount of work to allow storing catalogs in a non-heap table. One difficulty is that there's just a lot of direct access to catalogs via heapam.h APIs - while a significant amount of work to "fix" that, it's probably not very hard for each individual site. There are a few places that rely on heap internals (checking xmin for invalidation and the like). I think the biggest issue, however, would be catalog bootstrapping - to be able to read pg_am, we obviously need to go through relcache.c's bootstrapping, and that only works because we hardcode what those tables look like.
I personally don't think supporting non-heap catalogs is a particularly important issue to work on, nor am I convinced that there'd be buy-in for the necessary extensive changes.

Greetings,

Andres Freund