Thread: Status of the table access method work
Hi,

In this email I want to give a brief status update of the table access
method work - I assume that most of you sensibly haven't followed it
into all nooks and crannies.

I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
others that collaborated on making tableam happen. It was/is a huge
project.

I think what's in v12 - I don't know of any non-cleanup / bugfix work
pending for 12 - is a pretty reasonable initial set of features. It
allows reimplementing a heap-like storage without any core modifications
(except WAL logging, see below); it is not sufficient to implement a
good index-oriented table AM. It does not allow storing the catalog in a
non-heap table.

The tableam interface itself doesn't care that much about how the AM
internally stores data. Most of the API (sequential scans, index
lookups, insert/update/delete) doesn't know about blocks, and only
indirectly & optionally about buffers (via BulkInsertState). There are a
few callbacks / functions that do care about blocks, because it's not
clear, or would have been too much work, to remove the dependency.
Currently these are:

- ANALYZE integration - currently the sampling logic is tied to blocks.

- index build range scans - the range is defined as blocks

- planner relation size estimate - but that could trivially just be
  filled with size-in-bytes / BLCKSZ in the callback (see the sketch
  below).

- the (optional) bitmap heap scan API - that's fairly intrinsically
  block based. An AM could just internally subdivide TIDs in a different
  way, but I don't think a bitmap scan like we have would e.g. make a
  lot of sense for an index oriented table without any sort of stable
  tid.

- the sample scan API - tsmapi.h is block based, so the tableam.h API is
  as well.

I think none of these are limiting in a particularly bad way.

The most constraining factor for storage, I think, is that currently the
API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
byte tuple identifier). One can implement scans, and inserts into
index-less tables, without providing that, but no updates, deletes etc.
One reason for that is that it'd just have required more changes to the
executor etc. to allow for wider identifiers, but the primary reason is
that indexes currently simply don't support anything else. I think this
is, by far, the biggest limitation of the API. If one e.g. wanted to
implement a practical index-organized table, the 6 byte limitation
obviously would become a limitation very quickly.

I suspect that we're going to want to get rid of that limitation in
indexes before long for other reasons too, to allow global indexes
(which'd need to encode the partition somewhere).

With regards to storing the rows themselves, the second biggest
limitation is a limitation that is not actually a part of tableam
itself: WAL. Many tableams would want to use WAL, but we only have
extensible WAL as part of generic_xlog.h. While that's useful to allow
prototyping etc., it's imo not efficient enough to build a competitive
storage engine for OLTP (OLAP is probably much less of a problem). I
don't know what the best approach here is - allowing "well known"
extensions to register rmgr entries would be the easiest solution, but
it's certainly a bit crummy.

Currently there's some, fairly minor, requirement that TIDs are actually
unique when not using a snapshot qualifier. That's currently only
relevant for GetTupleForTrigger(), AfterTriggerSaveEvent() and
EvalPlanQualFetchRowMarks(), which use SnapshotAny. That prevents AMs
from implementing in-place updates (thus a problem e.g. for zheap). I've
a patch that fixes that, but it's too hacky for v12 - there's not always
a convenient snapshot to fetch a row (e.g. in GetTupleForTrigger() after
EPQ the row isn't visible to es_snapshot).

A second set of limitations is around making more of tableam optional.
Right now it e.g. is not possible to have an AM that doesn't implement
insert/update/delete. Obviously an AM can just throw an error in the
relevant callbacks, but I think it'd be better if we made those
callbacks optional, and threw errors at parse-analysis time (both to
make the errors consistent, and to ensure the error is consistently
thrown, rather than only when e.g. an UPDATE actually finds a row to
update).

Currently foreign keys are allowed between tables of different types of
AM. I am wondering whether we ought to allow AMs to forbid being
referenced. If e.g. an AM has lower consistency guarantees than the AM
of the table referencing it, it might be preferable to forbid that.
OTOH, I guess such an AM could just require UNLOGGED to be used.

Another restriction is actually related to UNLOGGED - currently the
UNLOGGED processing after crashes works by recognizing init forks by
file name. But what if e.g. the storage isn't inside postgres files? Not
sure if we actually can do anything good about that.

The last issue I know about is that nodeBitmapHeapscan.c and
nodeIndexOnlyscan.c currently directly access the visibilitymap. Which
means that if an AM doesn't use the VM, it is never going to use the
optimized path. And conversely, if the AM uses the VM, it needs to
internally map tids in a way compatible with heap. I strongly suspect
that we're going to have to fix this quite soon.

It'd be a pretty significant amount of work to allow storing catalogs in
a non-heap table. One difficulty is that there's just a lot of direct
access to catalogs via heapam.h APIs - while a significant amount of
work to "fix" that, it's probably not very hard for each individual
site. There are a few places that rely on heap internals (checking xmin
for invalidation and the like). I think the biggest issue however would
be the catalog bootstrapping - to be able to read pg_am, we obviously
need to go through relcache.c's bootstrapping, and that only works
because we hardcode what those tables look like. I personally don't
think it's a particularly important issue to work on, nor am I convinced
that there'd be buy-in to make the necessary extensive changes.

Greetings,

Andres Freund
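To make the size-estimate bullet above concrete: a minimal sketch of how
a non-block AM might fill v12's relation_estimate_size callback, just
reporting size-in-bytes / BLCKSZ as "pages". The my_am_total_bytes() and
my_am_tuple_count() helpers are hypothetical AM internals, not existing
functions.

static void
myam_estimate_size(Relation rel, int32 *attr_widths,
                   BlockNumber *pages, double *tuples, double *allvisfrac)
{
    uint64      bytes = my_am_total_bytes(rel);     /* hypothetical */

    /* fake a block count, as suggested above */
    *pages = (BlockNumber) (bytes / BLCKSZ) + 1;
    *tuples = my_am_tuple_count(rel);               /* hypothetical */
    *allvisfrac = 0;    /* this AM has no visibility map */
}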
On 05/04/2019 23:25, Andres Freund wrote:
> I think what's in v12 - I don't know of any non-cleanup / bugfix work
> pending for 12 - is a pretty reasonable initial set of features.

Hooray!

> - the (optional) bitmap heap scan API - that's fairly intrinsically
> block based. An AM could just internally subdivide TIDs in a different
> way, but I don't think a bitmap scan like we have would e.g. make a
> lot of sense for an index oriented table without any sort of stable
> tid.

If an AM doesn't implement the bitmap heap scan API, what happens?
Bitmap scans are disabled?

Even if an AM isn't block-oriented, the bitmap heap scan API still makes
sense as long as there's some correlation between TIDs and physical
location. The only really broken thing about that currently is the
prefetching: nodeBitmapHeapScan.c calls PrefetchBuffer() directly with
the TIDs' block numbers. It would be pretty straightforward to wrap that
in a callback, so that the AM could do something different. Or move even
more of the logic to the AM, so that the AM would get the whole
TIDBitmap in table_beginscan_bm(). It could then implement the fetching
and prefetching as it sees fit.

I don't think it's urgent, though. We can cross that bridge when we get
there, with the first AM that needs that flexibility.

> The most constraining factor for storage, I think, is that currently the
> API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
> byte tuple identifier).

I think 48 bits would be just about enough, but it's even more limited
than you might think at the moment. There are a few places that assume
that the offsetnumber <= MaxHeapTuplesPerPage. See ginpostinglist.c, and
MAX_TUPLES_PER_PAGE in tidbitmap.c. Also, offsetnumber can't be 0,
because that makes the ItemPointer invalid, which is inconvenient if you
tried to use an ItemPointer as just an arbitrary 48-bit integer.

- Heikki
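To illustrate the constraints Heikki lists: a sketch of using
ItemPointers as (almost) arbitrary row numbers while keeping the offset
number >= 1 and <= MaxHeapTuplesPerPage. rownum_to_tid() and
tid_to_rownum() are hypothetical helpers, not existing PostgreSQL
functions; overflow checking is omitted.

#include "access/htup_details.h"    /* MaxHeapTuplesPerPage */
#include "storage/itemptr.h"

static void
rownum_to_tid(uint64 rownum, ItemPointer tid)
{
    BlockNumber blkno = (BlockNumber) (rownum / MaxHeapTuplesPerPage);
    OffsetNumber offnum = (OffsetNumber) (rownum % MaxHeapTuplesPerPage) + 1;

    ItemPointerSet(tid, blkno, offnum);     /* offset 0 would be invalid */
}

static uint64
tid_to_rownum(ItemPointer tid)
{
    return (uint64) ItemPointerGetBlockNumber(tid) * MaxHeapTuplesPerPage +
        (ItemPointerGetOffsetNumber(tid) - 1);
}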
Hi,

On 2019-04-08 14:53:53 +0300, Heikki Linnakangas wrote:
> On 05/04/2019 23:25, Andres Freund wrote:
> > - the (optional) bitmap heap scan API - that's fairly intrinsically
> > block based. An AM could just internally subdivide TIDs in a different
> > way, but I don't think a bitmap scan like we have would e.g. make a
> > lot of sense for an index oriented table without any sort of stable
> > tid.
>
> If an AM doesn't implement the bitmap heap scan API, what happens?
> Bitmap scans are disabled?

Yea, the planner doesn't consider them. It just masks the index's
amhasgetbitmap. Seems to be the most reasonable thing to do?

> Even if an AM isn't block-oriented, the bitmap heap scan API still makes
> sense as long as there's some correlation between TIDs and physical
> location.

Yea, it could be a non-linear mapping. But I'm honestly not sure how
many non-block-oriented AMs with such a correlation there are - I mean
you're not going to have that in, say, an IOT. And it'd be trivial to
just "fake" a block mapping for an in-memory AM.

> The only really broken thing about that currently is the
> prefetching: nodeBitmapHeapScan.c calls PrefetchBuffer() directly with
> the TIDs' block numbers. It would be pretty straightforward to wrap that
> in a callback, so that the AM could do something different.

That, and the VM_ALL_VISIBLE() checks both in nodeBitmapHeapscan.c and
nodeIndexonlyscan.c.

> Or move even more of the logic to the AM, so that the AM would get the
> whole TIDBitmap in table_beginscan_bm(). It could then implement the
> fetching and prefetching as it sees fit.
>
> I don't think it's urgent, though. We can cross that bridge when we get
> there, with the first AM that needs that flexibility.

Yea, it seemed nontrivial (not really hard, just not obvious), and the
implied code duplication scared me away.

> > The most constraining factor for storage, I think, is that currently the
> > API relies on ItemPointerData style TIDs in a number of places (i.e. a 6
> > byte tuple identifier).
>
> I think 48 bits would be just about enough

I don't think that's really true. Consider e.g. implementing an
index-oriented table - there's no way you can efficiently implement one
with that small a key. You basically need a helper index just to have
efficient and small enough tids. And given that we're also going to need
wider tids for global indexes, I suspect we're just going to have to
bite into the sour apple and make tids variable width.

> , but it's even more limited than
> you might think at the moment. There are a few places that assume that
> the offsetnumber <= MaxHeapTuplesPerPage. See ginpostinglist.c, and
> MAX_TUPLES_PER_PAGE in tidbitmap.c. Also, offsetnumber can't be 0,
> because that makes the ItemPointer invalid, which is inconvenient if you
> tried to use an ItemPointer as just an arbitrary 48-bit integer.

Good point.

Thanks for looking (and playing, in the other thread)!

Greetings,

Andres Freund
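For reference, the prefetch callback being discussed could be as small
as one optional member in TableAmRoutine. This is a purely hypothetical
shape, not part of the committed v12 API:

/*
 * Hypothetical: called by nodeBitmapHeapscan.c instead of invoking
 * PrefetchBuffer() directly; NULL would mean "no prefetching for this
 * AM".
 */
void        (*scan_bitmap_prefetch) (TableScanDesc scan,
                                     BlockNumber blockno);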
On Sat, Apr 6, 2019 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> In this email I want to give a brief status update of the table access
> method work - I assume that most of you sensibly haven't followed it
> into all nooks and crannies.
>
> I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
> others that collaborated on making tableam happen. It was/is a huge
> project.
A big thank you, Andres, for your enormous efforts on this patch.
Without your involvement, this patch couldn't have made it into v12.
> With regards to storing the rows themselves, the second biggest
> limitation is a limitation that is not actually a part of tableam
> itself: WAL. Many tableams would want to use WAL, but we only have
> extensible WAL as part of generic_xlog.h. While that's useful to allow
> prototyping etc., it's imo not efficient enough to build a competitive
> storage engine for OLTP (OLAP is probably much less of a problem). I
> don't know what the best approach here is - allowing "well known"
> extensions to register rmgr entries would be the easiest solution, but
> it's certainly a bit crummy.
I had the same doubt when I looked into some of the UNDO patches,
where they modify the core code to add UNDO-specific WAL types.
Different AMs may need different sets of operations to be WAL-logged,
so it may be better for the AMs to register their own types?
Regards,
Haribabu Kommi
Fujitsu Australia
On 08/04/2019 19:38, Andres Freund wrote:
> On 2019-04-08 14:53:53 +0300, Heikki Linnakangas wrote:
>> On 05/04/2019 23:25, Andres Freund wrote:
>>> - the (optional) bitmap heap scan API - that's fairly intrinsically
>>> block based. An AM could just internally subdivide TIDs in a different
>>> way, but I don't think a bitmap scan like we have would e.g. make a
>>> lot of sense for an index oriented table without any sort of stable
>>> tid.
>>
>> If an AM doesn't implement the bitmap heap scan API, what happens?
>> Bitmap scans are disabled?
>
> Yea, the planner doesn't consider them. It just masks the index's
> amhasgetbitmap. Seems to be the most reasonable thing to do?

Yep.

>> Even if an AM isn't block-oriented, the bitmap heap scan API still makes
>> sense as long as there's some correlation between TIDs and physical
>> location.
>
> Yea, it could be a non-linear mapping. But I'm honestly not sure how
> many non-block-oriented AMs with such a correlation there are - I mean
> you're not going to have that in, say, an IOT. And it'd be trivial to
> just "fake" a block mapping for an in-memory AM.

Now that Ashwin conveniently posted the ZedStore prototype we started to
hack on [1], I'll point to that as an example :-). It stores data in a
B-tree (or rather, multiple B-trees) on TIDs. So there's a very high
correlation between TIDs and physical locality, but it's not
block-oriented.

Another example would be the "LZ4 Compressed Storage Manager" that
Nikolai envisioned recently [2]. Before we came up with the idea of
using B-trees in ZedStore, we were actually thinking of something very
similar to that. Although that one perhaps still counts as
"block-oriented" as far as the bitmap heap scan API is concerned: it
still deals with blocks, they're just mapped to different physical
locations.

I'm not sure how an index-organized table would work, but I think it
would want to just get the whole bitmap, and figure out the best order
to fetch the rows by itself.

PS. Seems that having a table AM API has opened the floodgates for
storage ideas. Nice!

- Heikki

[1] https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net
> On Tue, Apr 9, 2019 at 4:12 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Sat, Apr 6, 2019 at 7:25 AM Andres Freund <andres@anarazel.de> wrote:
>>
>> With regards to storing the rows themselves, the second biggest
>> limitation is a limitation that is not actually a part of tableam
>> itself: WAL. Many tableams would want to use WAL, but we only have
>> extensible WAL as part of generic_xlog.h. While that's useful to allow
>> prototyping etc., it's imo not efficient enough to build a competitive
>> storage engine for OLTP (OLAP is probably much less of a problem). I
>> don't know what the best approach here is - allowing "well known"
>> extensions to register rmgr entries would be the easiest solution, but
>> it's certainly a bit crummy.
>
> I had the same doubt when I looked into some of the UNDO patches,
> where they modify the core code to add UNDO-specific WAL types.
> Different AMs may need different sets of operations to be WAL-logged,
> so it may be better for the AMs to register their own types?

I'm also curious about that. As far as I can see, the main objection
against that was that in this case the recovery process will depend on
an extension, which could violate reliability. But I wonder if this
argument is still valid for AMs, since the whole data kind of depends on
the extension, not only the recovery.

Btw, can someone elaborate why exactly generic_xlog is not efficient
enough? I've gone through the corresponding thread; it looks like
generic WAL records are bigger than normal ones - is that the only
reason?
On Tue, Apr 9, 2019 at 2:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> I had the same doubt when I looked into some of the UNDO patches,
> where they modify the core code to add UNDO-specific WAL types.
> Different AMs may need different sets of operations to be WAL-logged,
> so it may be better for the AMs to register their own types?

In the current undo proposal, the undo subsystem itself needs an rmgr ID
for WAL-logging of some low-level undo log management records (ie its
own record types), but then any undo-aware AM would also need to have
its own rmgr ID for its own universe of WAL records (its own types,
meaningful to it alone), and that same rmgr ID is used also for its undo
records (which themselves have specific types). That is, in rmgrlist.h,
an undo-aware AM would register not only its redo function (called for
each WAL record in recovery) but also its undo function (called when
transactions roll back, if your transaction generated any undo records).
Which raises the question of how a hypothetical undo-aware AM could deal
with undo records if it's using generic WAL records. I haven't thought
about that.

A couple of ideas we've bounced around to allow extensions to work with
specific WAL records: (1) a community-wide registry of rmgr IDs
(basically, just allocate the IDs for all known extensions in a header
in the tree, like IANA), or (2) a per-cluster registry scheme where an
extension, identified by its name, would be permanently allocated an
rmgr number and library + callback functions for the lifetime of that
cluster. Or something like that.

--
Thomas Munro
https://enterprisedb.com
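For idea (1), the in-tree registration could look like one more line in
src/include/access/rmgrlist.h, using its existing PG_RMGR() columns
(symbol, name, redo, desc, identify, startup, cleanup, mask). The
zedstore entry and its callback names here are hypothetical:

/* sketch: a "well known" extension allocated a fixed rmgr ID in-tree */
PG_RMGR(RM_ZEDSTORE_ID, "zedstore", zedstore_redo, zedstore_desc,
        zedstore_identify, NULL, NULL, zedstore_mask)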
Hi,

On 2019-04-09 11:17:29 +0200, Dmitry Dolgov wrote:
> I'm also curious about that. As far as I can see, the main objection
> against that was that in this case the recovery process will depend on
> an extension, which could violate reliability.

I don't think that's a primary concern - although it is one. The mapping
from types of records to the handler function needs to be accessible at
a very early stage, when the cluster isn't yet in a consistent state. So
we can't just go and look into pg_am, look up a handler function, etc. -
crash recovery happens much earlier than that is possible. Nor do we
want the mapping of 'rmgr id' -> 'extension' to be defined in the config
file, that's way too likely to be wrong. So there needs to be a
different type of mapping, accessible outside the catalog. I suspect
we'd have to end up with something very roughly like the relmapper
infrastructure. A tertiary problem is then how to identify extensions in
that mapping - although I suspect just using any library name that can
be passed to load_library() will be OK.

> But I wonder if this argument is still valid for AMs, since the whole
> data kind of depends on the extension, not only the recovery.

I don't buy that argument. If you have an AM that registers, using a new
facility, replay routines, and then it errors out / crashes during
those, there's no way to get the cluster back into a consistent state.
So it's not just the one table in that AM that's gone, it's the entire
cluster that's impacted.

> Btw, can someone elaborate why exactly generic_xlog is not efficient
> enough? I've gone through the corresponding thread; it looks like
> generic WAL records are bigger than normal ones - is that the only
> reason?

That's one big reason. But also, you just can't do much more than "write
this block into that file" during recovery with it. A lot of our replay
routines intentionally do more complicated tasks.

Greetings,

Andres Freund
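For anyone following along, the generic_xlog.h pattern under discussion
looks roughly like this (a sketch with error handling omitted; "rel" and
"buf" are assumed to be a relation and an exclusively-locked buffer the
AM already holds). Replay can only restore the registered pages, which
is why the cleverer replay logic mentioned above isn't possible:

GenericXLogState *state = GenericXLogStart(rel);
Page        page = GenericXLogRegisterBuffer(state, buf, 0);

/* ... modify the working copy "page" however the AM likes ... */

GenericXLogFinish(state);   /* computes page deltas, emits the record */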
On Fri, Apr 5, 2019 at 11:25 PM Andres Freund <andres@anarazel.de> wrote:
> I want to thank Haribabu, Alvaro, Alexander, David, Dmitry and all the
> others that collaborated on making tableam happen. It was/is a huge
> project.

Thank you so much for bringing this project to commit! Excellent work!

Your explanation of the existing limitations looks very good and
convincing. But I think there is one you didn't mention: we require new
table AMs to basically keep the old "contract" between heap and indexes.
We have an "all or nothing" choice during updates. So, a storage can
either insert into none of the indexes or insert into all of them
(HOT-like update). I think any storage which is going to address the
"write amplification" point raised by Uber needs to break this
"contract". For example, zheap is promised to implement delete-marking
indexes. But it's not yet published, and to me it's not clear that this
approach is better than the alternatives.

With the delete-marking approach you need to update the index tuples
corresponding to old values of the updated fields. But beyond that, it's
not trivial to delete an index tuple. In order to do that, you need to
both locate this index tuple and know that this index value isn't
present in the undo chain. So purging the undo chain likely requires
another index lookup - thus, we basically need two random index lookups
for every deleted index tuple. It also becomes more complex to look up
the appropriate heap tuple during an index scan: you need to check not
only visibility, but also that the index value matches (here we'd need
to adjust the index_fetch_tuple interface), because it might happen that
the version visible to you has a different index value. That may lead to
O(N^2) performance when accessing a single row with N versions (MySQL
InnoDB has this problem).

An alternative idea is to have MVCC-aware indexes. This approach looks
more attractive to me. In this approach you basically need xmin and xmax
fields in index tuples. On insertion of an index tuple you fill its
xmin. On update, the previous index tuple is marked with xmax. After
that, outdated index tuples can be deleted lazily when page space is
required. So, only one random access is required per deleted index
tuple. With this approach fetching a single row is O(N). Also,
index-only scans become very easy and don't even need a visibility map.

The only problem here is the extra space requirement for index tuples.
But I think this is a well-isolated problem, which is easy to attack.
For instance, some visibility information could be evicted to an undo
chain (like zheap does for its tuples). Also, we could have a special
bit for "all visible" index tuples; with the "all visible" bit set, a
tuple can get rid of the visibility fields. We can do this for index
tuples because, if an index tuple requires extra space, we can split the
page - in contrast to the heap, where tuples are fixed in pages and xmax
needs to be updated in place.

I understand that delete-marking indexes have some advantages, and some
people find them more appealing. But my point is that we shouldn't build
one of these approaches into the API unless we have concrete proof that
it strongly overcomes the other. It would be better to have our table AM
API flexible enough to implement both. I can imagine we could have
proper encapsulation here, bringing more of the interaction with indexes
to the table AM side.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
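A sketch of the tuple layout implied by the MVCC-aware index idea above
(a hypothetical structure, not anything existing in PostgreSQL):

typedef struct MvccIndexTupleHeader
{
    TransactionId xmin;     /* set when the index tuple is inserted */
    TransactionId xmax;     /* set when a newer version supersedes it */
    uint16        flags;    /* e.g. a hypothetical ITUP_ALL_VISIBLE bit,
                             * allowing the visibility fields to be
                             * dropped on page split/rewrite */
    /* index key data follows */
} MvccIndexTupleHeader;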
Hi,

On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> Your explanation of the existing limitations looks very good and
> convincing. But I think there is one you didn't mention: we require
> new table AMs to basically keep the old "contract" between heap and
> indexes. We have an "all or nothing" choice during updates. So, a
> storage can either insert into none of the indexes or insert into all
> of them (HOT-like update).

I think that's a problem, and yea, I should have mentioned it. I'd
earlier thought about it and then forgot.

I however don't think we should design the interface for this before we
have at least one AM that's actually in decent-ish shape that needs it.
I seriously doubt we'll get the interface right enough.

Note: I'm *extremely* *extremely* doubtful that moving the full executor
invocations for expression indexes etc. into the tableam is a sane thing
to do. It's possible to convince me there's no alternative, but it'll be
really hard.

I suspect the right direction will be more going in a direction of
computing new index tuples for expression indexes before tableam gets
involved. If we do that right, we can also implement the stuff that
1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way.

> I think any storage which is going to address the "write amplification"
> point raised by Uber needs to break this "contract".

FWIW, I don't think it makes much sense to use Uber as a justification
for anything here. Their analysis was so deeply flawed and motivated by
non-technical reasons that it should just be ignored.

Greetings,

Andres Freund
On Wed, Apr 10, 2019 at 8:32 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> > Your explanation of the existing limitations looks very good and
> > convincing. But I think there is one you didn't mention: we require
> > new table AMs to basically keep the old "contract" between heap and
> > indexes. We have an "all or nothing" choice during updates. So, a
> > storage can either insert into none of the indexes or insert into
> > all of them (HOT-like update).
>
> I think that's a problem, and yea, I should have mentioned it. I'd
> earlier thought about it and then forgot.
>
> I however don't think we should design the interface for this before we
> have at least one AM that's actually in decent-ish shape that needs it.
> I seriously doubt we'll get the interface right enough.
>
> Note: I'm *extremely* *extremely* doubtful that moving the full executor
> invocations for expression indexes etc. into the tableam is a sane thing
> to do. It's possible to convince me there's no alternative, but it'll be
> really hard.
>
> I suspect the right direction will be more going in a direction of
> computing new index tuples for expression indexes before tableam gets
> involved. If we do that right, we can also implement the stuff that
> 1c53c4dec3985512f7f2f53c9d76a5295cd0a2dd reverted in a proper way.

Probably we can invent a few modes a table AM might work in: calculation
of all new index tuples; calculation of new and old index tuples for
updated fields; calculation of all new and old index tuples; and so on.
And then index tuples would be calculated either in advance or by
callback.

> > I think any storage which is going to address the "write amplification"
> > point raised by Uber needs to break this "contract".
>
> FWIW, I don't think it makes much sense to use Uber as a justification
> for anything here. Their analysis was so deeply flawed and motivated by
> non-technical reasons that it should just be ignored.

Yeah, Uber is just a buzzword here. But the problem that an update of a
single indexed field leads to insertions into every index is well-known
among PostgreSQL users.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Wed, Apr 10, 2019 at 8:32 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-04-10 20:14:17 +0300, Alexander Korotkov wrote:
> > Your explanation of the existing limitations looks very good and
> > convincing. But I think there is one you didn't mention: we require
> > new table AMs to basically keep the old "contract" between heap and
> > indexes. We have an "all or nothing" choice during updates. So, a
> > storage can either insert into none of the indexes or insert into
> > all of them (HOT-like update).
>
> I think that's a problem, and yea, I should have mentioned it. I'd
> earlier thought about it and then forgot.
>
> I however don't think we should design the interface for this before we
> have at least one AM that's actually in decent-ish shape that needs it.
> I seriously doubt we'll get the interface right enough.

Sure. My point is that once we get the first table AM which needs this,
say zheap, we shouldn't make it like this:

TM_Result (*tuple_update) (Relation rel,
                           ...
                           bool *update_indexes,
                           bool *delete_marking);

but rather try to design a proper encapsulation of the logic inside the
table AM.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
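In other words (a purely hypothetical sketch of the contrast being
drawn): instead of the AM signalling its needs back through output
flags, the executor could hand the AM a callback that the AM invokes
only for the index entries that actually need maintenance. None of these
names exist in PostgreSQL:

/* hypothetical: invoked by the AM for each index needing an update */
typedef void (*IndexMaintenanceCallback) (Relation indexrel,
                                          ItemPointer tid,
                                          void *arg);

TM_Result   (*tuple_update) (Relation rel,
                             /* ... as today ... */
                             IndexMaintenanceCallback index_cb,
                             void *index_cb_arg);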
> On Fri, Apr 5, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
>
> A second set of limitations is around making more of tableam
> optional. Right now it e.g. is not possible to have an AM that doesn't
> implement insert/update/delete. Obviously an AM can just throw an error
> in the relevant callbacks, but I think it'd be better if we made those
> callbacks optional, and threw errors at parse-analysis time (both to
> make the errors consistent, and to ensure the error is consistently
> thrown, rather than only when e.g. an UPDATE actually finds a row to
> update).

Agree, but I guess some of tableam should still be mandatory, and then I
wonder where to put the line between those that are optional and those
that are not. E.g. it looks like it can be relatively straightforward
(ignoring `create table as` and some other stuff) to make
insert/update/delete optional with messages at analysis time, but for
others, like the parallel scan related ones, it's probably not.
Hi!

On 2019-04-17 22:02:24 +0200, Dmitry Dolgov wrote:
> > On Fri, Apr 5, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > A second set of limitations is around making more of tableam
> > optional. Right now it e.g. is not possible to have an AM that doesn't
> > implement insert/update/delete. Obviously an AM can just throw an error
> > in the relevant callbacks, but I think it'd be better if we made those
> > callbacks optional, and threw errors at parse-analysis time (both to
> > make the errors consistent, and to ensure the error is consistently
> > thrown, rather than only when e.g. an UPDATE actually finds a row to
> > update).
>
> Agree, but I guess some of tableam should still be mandatory, and then I
> wonder where to put the line between those that are optional and those
> that are not. E.g. it looks like it can be relatively straightforward
> (ignoring `create table as` and some other stuff) to make
> insert/update/delete optional with messages at analysis time, but for
> others, like the parallel scan related ones, it's probably not.

Thanks for the patch! I assume you're aware, but it's probably not going
to be applied for 12...

I think most of the read-only stuff just needs to be non-optional, and
most of the DML stuff needs to be optional.

On the executor side it'd probably be good to make the sample scan
optional too, but then we also need to check for that during
parse-analysis. In contrast to bitmap scans there's no alternative way
to execute them.

> 	Assert(routine->relation_set_new_filenode != NULL);
> diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
> index c39218f8db..36e2dbf1b8 100644
> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -41,6 +41,7 @@
>  #include "miscadmin.h"
>  #include "optimizer/optimizer.h"
>  #include "nodes/makefuncs.h"
> +#include "nodes/nodeFuncs.h"
>  #include "parser/parse_coerce.h"
>  #include "parser/parse_collate.h"
>  #include "parser/parse_expr.h"
> @@ -901,6 +902,13 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
>  										   NULL, false, false);
>  	rte->requiredPerms = (is_from ? ACL_INSERT : ACL_SELECT);
>
> +	if (is_from && !table_support_multi_insert(rel))
> +		ereport(ERROR,
> +				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +				 errmsg("Table access method doesn't support the operation"),
> +				 parser_errposition(pstate,
> +									exprLocation((Node *) stmt))));

Probably should fall back to plain inserts if multi-insert isn't
supported. And if insert isn't supported either, we should probably talk
about that specifically, i.e. a message like "access method \"%s\" of
table \"%s\" does not support %s"? Without knowing at least that much
about the operation it might sometimes be very hard to figure out what's
not supported.

> +static inline bool
> +table_support_speculative(Relation rel)
> +{
> +	return rel->rd_tableam == NULL ||
> +		(rel->rd_tableam->tuple_insert_speculative != NULL &&
> +		 rel->rd_tableam->tuple_complete_speculative != NULL);
> +}

In GetTableAmRoutine() I'd assert that either both or none are defined.

> +static inline bool
> +table_support_multi_insert(Relation rel)
> +{
> +	return rel->rd_tableam == NULL ||
> +		(rel->rd_tableam->multi_insert != NULL &&
> +		 rel->rd_tableam->finish_bulk_insert != NULL);
> +}

bulk insert already is optional...

I think there's more places that need checks like these - consider e.g.
replication and such that don't go through the full blown executor.

Greetings,

Andres Freund
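For clarity, the "both or none" check suggested above could be a single
assertion in GetTableAmRoutine() (a sketch, using the callback names
from the patch):

Assert((routine->tuple_insert_speculative == NULL) ==
       (routine->tuple_complete_speculative == NULL));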
> On Wed, Apr 17, 2019 at 10:25 PM Andres Freund <andres@anarazel.de> wrote:
>
> I assume you're aware, but it's probably not going to be applied for 12...

Sure, the patch was mostly to express more clearly what I was thinking
about :)

> I think most of the read-only stuff just needs to be non-optional, and
> most of the DML stuff needs to be optional.
>
> On the executor side it'd probably be good to make the sample scan
> optional too, but then we also need to check for that during
> parse-analysis. In contrast to bitmap scans there's no alternative way
> to execute them.

Yeah, makes sense.

> bulk insert already is optional...

Oh, I hadn't noticed.
On Wed, 10 Apr 2019 at 18:14, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> An alternative idea is to have MVCC-aware indexes. This approach looks
> more attractive to me. In this approach you basically need xmin and
> xmax fields in index tuples. On insertion of an index tuple you fill
> its xmin. On update, the previous index tuple is marked with xmax.
+1
xmax can be provided to the index by the indexam when 1) we mark killed
tuples, or 2) we do an index scan of an index entry without xmax set.
xmax can be set as a hint on normal scans, or set as part of an update,
as the index chooses.
> After that, outdated index tuples can be deleted lazily when page
> space is required.
That is already done, so hardly any change there.
> Also, we could have a special bit for "all visible" index tuples; with
> the "all visible" bit set, a tuple can get rid of the visibility
> fields. We can do this for index tuples because, if an index tuple
> requires extra space, we can split the page - in contrast to the heap,
> where tuples are fixed in pages and xmax needs to be updated in place.
Keeping the xmin/xmax would also be useful for historical indexes, i.e. indexes that can be used to search for data with historic snapshots.
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-Apr-10, Alexander Korotkov wrote:
> An alternative idea is to have MVCC-aware indexes. This approach looks
> more attractive to me. In this approach you basically need xmin and
> xmax fields in index tuples.

"We liked freezing xmin so much that we had to introduce freezing for
xmax" -- rhaas dixit. And now we want to introduce freezing for indexes?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 11, 2019 at 11:47 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-Apr-10, Alexander Korotkov wrote:
> > An alternative idea is to have MVCC-aware indexes. This approach looks
> > more attractive to me. In this approach you basically need xmin and
> > xmax fields in index tuples.
>
> "We liked freezing xmin so much that we had to introduce freezing for
> xmax" -- rhaas dixit. And now we want to introduce freezing for
> indexes?

Plus it would add 8 bytes to the size of every index tuple. If you are
indexing long-ish strings it may not hurt too much, but if your primary
key is an integer, I think your index is going to get a lot bigger.

The problem with freezing is perhaps avoidable if you store an epoch in
the page special space as part of all this. But I don't see any way to
avoid having the tuples get wider. Possibly you could include xmin and
xmax only when needed, removing xmin once the tuples are all-visible and
splitting pages if you need to make room to add an xmax. I'm not sure
how much that helps, though, because if you do a bulk insert, you're
going to have to leave room for all of the xmins initially, and removing
them later will produce free space for which you may have little use.

I don't think that including visibility information in indexes is a bad
idea; we've thought about making zheap do this someday. But I think that
we need to use some more sophisticated approach involving, maybe, undo
pointers, or some other kind of magic, rather than just widening the
tuples. I expect that just widening the tuples would be good enough to
win for some use cases, but I think there would be others that lose
heavily. In any case, I agree that we do not want to create more things
that need freezing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
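For illustration, the "epoch in the page special space" idea could look
something like this hypothetical opaque area (not an existing PostgreSQL
structure): with a per-page epoch, the 32-bit xmin/xmax values on the
page can be interpreted as 64-bit transaction IDs, so they never need
freezing:

typedef struct MvccIndexPageOpaqueData
{
    uint32      xid_epoch;  /* high-order 32 bits for all XIDs on page */
    uint16      flags;
} MvccIndexPageOpaqueData;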
On Tue, Jun 11, 2019 at 8:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't think that including visibility information in indexes is a bad
> idea; we've thought about making zheap do this someday. But I think that
> we need to use some more sophisticated approach involving, maybe, undo
> pointers, or some other kind of magic, rather than just widening the
> tuples. I expect that just widening the tuples would be good enough to
> win for some use cases, but I think there would be others that lose
> heavily.

+1. Limited visibility information would make sense (e.g. maybe a
per-tuple all-visible bit), which would have to be backed by something
like UNDO, but storing XIDs in tuples seems like a very bad idea. The
idea that something like this would have to be usable by any possible
table access method seems unworkable in general.

Sometimes it seems like the table access method work could use some
specific non-goals. Perhaps I just haven't been paying enough attention
to have noticed them.

--
Peter Geoghegan
Hi,

On 2019-06-11 11:59:36 -0400, Robert Haas wrote:
> On Tue, Jun 11, 2019 at 11:47 AM Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > On 2019-Apr-10, Alexander Korotkov wrote:
> > > An alternative idea is to have MVCC-aware indexes. This approach looks
> > > more attractive to me. In this approach you basically need xmin and
> > > xmax fields in index tuples.
> >
> > "We liked freezing xmin so much that we had to introduce freezing for
> > xmax" -- rhaas dixit. And now we want to introduce freezing for
> > indexes?
>
> Plus it would add 8 bytes to the size of every index tuple. If you are
> indexing long-ish strings it may not hurt too much, but if your primary
> key is an integer, I think your index is going to get a lot bigger.
>
> The problem with freezing is perhaps avoidable if you store an epoch in
> the page special space as part of all this. But I don't see any way to
> avoid having the tuples get wider. Possibly you could include xmin and
> xmax only when needed, removing xmin once the tuples are all-visible and
> splitting pages if you need to make room to add an xmax. I'm not sure
> how much that helps, though, because if you do a bulk insert, you're
> going to have to leave room for all of the xmins initially, and removing
> them later will produce free space for which you may have little use.
>
> I don't think that including visibility information in indexes is a bad
> idea; we've thought about making zheap do this someday. But I think that
> we need to use some more sophisticated approach involving, maybe, undo
> pointers, or some other kind of magic, rather than just widening the
> tuples. I expect that just widening the tuples would be good enough to
> win for some use cases, but I think there would be others that lose
> heavily.

Yea, I think there are plenty of reasons to want to do something
different than what we're doing. But I'd like to see a concrete proposal
before building an API for it...

Greetings,

Andres Freund
On Tue, Jun 11, 2019 at 12:32 PM Andres Freund <andres@anarazel.de> wrote:
> Yea, I think there are plenty of reasons to want to do something
> different than what we're doing. But I'd like to see a concrete proposal
> before building an API for it...

I wasn't intending to propose that you should. We're just in the
brainstorming stage here, I think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company