Thread: Extensible storage manager API - SMGR hook Redux
Hi hackers, At Neon, we've been working on removing the file system dependency from PostgreSQL and replacing it with a distributed storage layer. For now, we've seen most success in this by replacing the implementation of the smgr API, but it did require some core modifications like those proposed early last year by Anastasia [0]. As mentioned in the previous thread, there are several reasons why you would want to use a non-default storage manager: storage-level compression, encryption, and disk limit quotas [0]; offloading of cold relation data was also mentioned [1]. In the thread on Anastasia's patch, Yura Sokolov mentioned that instead of a hook-based smgr extension, a registration-based smgr would be preferred, with integration into namespaces. Please find attached an as of yet incomplete patch that starts to do that. The patch is yet incomplete (as it isn't derived from Anastasia's patch), but I would like comments on this regardless, as this is a fairly fundamental component of PostgreSQL that is being modified, and it is often better to get comments early in the development cycle. One significant issue that I've seen so far are that catcache is not guaranteed to be available in all backends that need to do smgr operations, and I've not yet found a good solution. Changes compared to HEAD: - smgrsw is now dynamically allocated and grows as new storage managers are loaded (during shared_preload_libraries) - CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...]) - tablespace storage is (planned) fully managed by smgr through some new smgr apis Changes compared to Anastasia's patch: - extensions do not get to hook and replace the api of the smgr code directly - they are hidden behind the smgr registry. Successes: - 0001 passes tests (make check-world) - 0002 builds without warnings (make) TODO: - fix dependency failures when catcache is unavailable - tablespace redo is currently broken with 0002 - fix tests for 0002 - ensure that pg_dump etc. works with the new tablespace storage manager options Looking forward to any comments, suggestions and reviews. Kind regards, Matthias van de Meent Neon (https://neon.tech/) [0] https://www.postgresql.org/message-id/CAP4vRV6JKXyFfEOf%3Dn%2Bv5RGsZywAQ3CTM8ESWvgq%2BS87Tmgx_g%40mail.gmail.com [1] https://www.postgresql.org/message-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru
Attachment
Hi, On 2023-06-30 14:26:44 +0200, Matthias van de Meent wrote: > diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c > index 4c49393fc5..8685b9fde6 100644 > --- a/src/backend/postmaster/postmaster.c > +++ b/src/backend/postmaster/postmaster.c > @@ -1002,6 +1002,11 @@ PostmasterMain(int argc, char *argv[]) > */ > ApplyLauncherRegister(); > > + /* > + * Register built-in managers that are not part of static arrays > + */ > + register_builtin_dynamic_managers(); > + > /* > * process any libraries that should be preloaded at postmaster start > */ That doesn't strike me as a good place to initialize this, we'll need it in multiple places that way. How about putting it into BaseInit()? > -static const f_smgr smgrsw[] = { > +static f_smgr *smgrsw; This adds another level of indirection. I would rather limit the number of registerable smgrs than do that. > +SMgrId > +smgr_register(const f_smgr *smgr, Size smgrrelation_size) > +{ > + MemoryContextSwitchTo(old); > + > + pg_compiler_barrier(); Huh, what's that about? > @@ -59,14 +63,8 @@ typedef struct SMgrRelationData > * Fields below here are intended to be private to smgr.c and its > * submodules. Do not touch them from elsewhere. > */ > - int smgr_which; /* storage manager selector */ > - > - /* > - * for md.c; per-fork arrays of the number of open segments > - * (md_num_open_segs) and the segments themselves (md_seg_fds). > - */ > - int md_num_open_segs[MAX_FORKNUM + 1]; > - struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1]; > + SMgrId smgr_which; /* storage manager selector */ > + int smgrrelation_size; /* size of this struct, incl. smgr-specific data */ It looked to me like you determined this globally - why do we need it in every entry then? Greetings, Andres Freund
> Subject: [PATCH v1 1/2] Expose f_smgr to extensions for manual implementation From what I can see, all the md* APIs that were exposed in md.h can now be made static in md.c. The only other references to those APIs were in smgr.c. > Subject: [PATCH v1 2/2] Prototype: Allow tablespaces to specify which SMGR > they use > -typedef uint8 SMgrId; > +/* > + * volatile ID of the smgr. Across various configurations IDs may vary, > + * true identity is the name of each smgr. > + */ > +typedef int SMgrId; > > -#define MaxSMgrId UINT8_MAX > +#define MaxSMgrId INT_MAX In a future revision of this patch, seems worthwhile to just start as int instead of a uint8 to avoid this song and dance. Maybe int8 instead of int? > +static SMgrId recent_smgrid = -1; You could use InvalidSmgrId here. > +void smgrvalidatetspopts(const char *smgrname, List *opts) > +{ > + SMgrId smgrid = get_smgr_by_name(smgrname, false); > + > + smgrsw[smgrid].smgr_validate_tspopts(opts); > +} > + > +void smgrcreatetsp(const char *smgrname, Oid tsp, List *opts, bool isredo) > +{ > + SMgrId smgrid = get_smgr_by_name(smgrname, false); > + > + smgrsw[smgrid].smgr_create_tsp(tsp, opts, isredo); > +} > + > +void smgrdroptsp(const char *smgrname, Oid tsp, bool isredo) > +{ > + SMgrId smgrid = get_smgr_by_name(smgrname, false); > + > + smgrsw[smgrid].smgr_drop_tsp(tsp, isredo); > +} Do you not need to check if smgrid is the InvalidSmgrId? I didn't see any other validation anywhere. > + char *smgr; > + List *smgropts; /* list of DefElem nodes */ smgrname would probably work better alongside tablespacename in that struct. > @@ -221,7 +229,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum) > if (!InRecovery) > mdclose(reln, forknum); > > - return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL); > + return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL); > } Was this a victim of a bad rebase? Seems like it belongs in the previous patch. > +void mddroptsp(Oid tsp, bool isredo) > +{ > + > +} Some functions in this file have the return type on the previous line. This is a pretty slick patchset. Excited to read more dicussion and how it evolves. -- Tristan Partin Neon (https://neon.tech)
Found these warnings while compiling while only 0001 is applied. [1166/2337] Compiling C object src/backend/postgres_lib.a.p/storage_smgr_md.c.o ../src/backend/storage/smgr/md.c: In function ‘mdexists’: ../src/backend/storage/smgr/md.c:224:28: warning: passing argument 1 of ‘mdopenfork’ from incompatible pointer type [-Wincompatible-pointer-types] 224 | return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL); | ^~~~ | | | SMgrRelation {aka SMgrRelationData *} ../src/backend/storage/smgr/md.c:167:43: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is oftype ‘SMgrRelation’ {aka ‘SMgrRelationData *’} 167 | static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior); | ~~~~~~~~~~~~~~~^~~~ ../src/backend/storage/smgr/md.c: In function ‘mdcreate’: ../src/backend/storage/smgr/md.c:287:40: warning: passing argument 1 of ‘register_dirty_segment’ from incompatible pointertype [-Wincompatible-pointer-types] 287 | register_dirty_segment(reln, forknum, mdfd); | ^~~~ | | | SMgrRelation {aka SMgrRelationData *} ../src/backend/storage/smgr/md.c:168:51: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is oftype ‘SMgrRelation’ {aka ‘SMgrRelationData *’} 168 | static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum, Here is a diff to be applied to 0001 which fixes the warnings that get generated when compiling. I did see that one of the warnings gets fixed 0002 (the mdexists() one). I am assuming that change was just missed while rebasing the patchset or something. I did not see a fix for mdcreate() in 0002 however. -- Tristan Partin Neon (https://neon.tech)
Attachment
Sorry for double-posting, I accidentally replied to Matthias, not the mailing list :(
---------- Forwarded message ---------
From: Kirill Reshke <reshkekirill@gmail.com>
Date: Mon, 4 Dec 2023 at 19:46
Subject: Re: Extensible storage manager API - SMGR hook Redux
To: Matthias van de Meent <boekewurm+postgres@gmail.com>
From: Kirill Reshke <reshkekirill@gmail.com>
Date: Mon, 4 Dec 2023 at 19:46
Subject: Re: Extensible storage manager API - SMGR hook Redux
To: Matthias van de Meent <boekewurm+postgres@gmail.com>
Hi!
On Fri, 30 Jun 2023 at 15:27, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
Hi hackers,
At Neon, we've been working on removing the file system dependency
from PostgreSQL and replacing it with a distributed storage layer. For
now, we've seen most success in this by replacing the implementation
of the smgr API, but it did require some core modifications like those
proposed early last year by Anastasia [0].
As mentioned in the previous thread, there are several reasons why you
would want to use a non-default storage manager: storage-level
compression, encryption, and disk limit quotas [0]; offloading of cold
relation data was also mentioned [1].
In the thread on Anastasia's patch, Yura Sokolov mentioned that
instead of a hook-based smgr extension, a registration-based smgr
would be preferred, with integration into namespaces. Please find
attached an as of yet incomplete patch that starts to do that.
The patch is yet incomplete (as it isn't derived from Anastasia's
patch), but I would like comments on this regardless, as this is a
fairly fundamental component of PostgreSQL that is being modified, and
it is often better to get comments early in the development cycle. One
significant issue that I've seen so far are that catcache is not
guaranteed to be available in all backends that need to do smgr
operations, and I've not yet found a good solution.
Changes compared to HEAD:
- smgrsw is now dynamically allocated and grows as new storage
managers are loaded (during shared_preload_libraries)
- CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
- tablespace storage is (planned) fully managed by smgr through some
new smgr apis
Changes compared to Anastasia's patch:
- extensions do not get to hook and replace the api of the smgr code
directly - they are hidden behind the smgr registry.
Successes:
- 0001 passes tests (make check-world)
- 0002 builds without warnings (make)
TODO:
- fix dependency failures when catcache is unavailable
- tablespace redo is currently broken with 0002
- fix tests for 0002
- ensure that pg_dump etc. works with the new tablespace storage manager options
Looking forward to any comments, suggestions and reviews.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech/)
[0] https://www.postgresql.org/message-id/CAP4vRV6JKXyFfEOf%3Dn%2Bv5RGsZywAQ3CTM8ESWvgq%2BS87Tmgx_g%40mail.gmail.com
[1] https://www.postgresql.org/message-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru
So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
Is it possible to query the system catalog during crash recovery? As far as i understand the answer is "no", correct me if I'm wrong.
Furthermore, why do we only allow tablespace to have its own SMGR implementation, can we have per-relation SMGR? Maybe we can do it in a way similar to custom RMGR (meaning, write SMGR OID into WAL and use it in crash recovery etc.)?
On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote: > > So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo`would work with it? That's a very good point I hadn't considered in detail yet. Quite clearly, the current code is wrong in assuming that the catalog is accessible, and it should probably be stored in a way similar to pg_filenode.map in a file managed outside the buffer pool. > Is it possible to query the system catalog during crash recovery? As far as i understand the answer is "no", correct meif I'm wrong. Yes, you're correct, we can't access buffers like this during recovery. That's going to need some more effort. > Furthermore, why do we only allow tablespace to have its own SMGR implementation, can we have per-relation SMGR? Maybewe can do it in a way similar to custom RMGR (meaning, write SMGR OID into WAL and use it in crash recovery etc.)? AMs (and by extension, their RMGRs) that use Postgres' buffer pool have control over how they want to layout their blocks and files, but generally don't care about where those blocks and files are located, as long as they _can_ be retrieved. Tablespaces, however, describe 'drives' or 'storage pools' in which the tables/relations are stored, which to me seems to be the more logical place to configure the SMGR abstraction of how and where to store the actual data, as SMGRs manage the low-level relation block IO (= file accesses), and tablespaces manage where files are stored. Note that nothing prevents you from using one tablespace (thus different SMGR) per relation, apart from bloated catalogs and the superuser permissions required for creating those tablespaces. It'd be difficult to manage, but not impossible. Kind regards, Matthias van de Meent Neon (https://neon.tech)
On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo` would work with it?
That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.
Hmm, pg_filenode.map is a nice idea. So, simply maintain TableSpaceOId -> smgr id mapping in a separate file and update the whole file on any changes, right?
Looks reasonable to me, but it is clear that this solution can be really slow in some patterns, like if we create many-many tablespaces(the way you suggested it in the per-relation SMGR feature). Maybe we can store data in files somehow separately, and only update one chunk per operation.
Anyway, if we use a `pg_filenode.map` - like solution, we need to reuse its code infrasture, right? For example, it seems that code that calculates checksums can be reused.
So, we need to refactor code here, define something like FileMap API maybe. Or is it not really worth it? We can just write similar code twice.
On Mon, 4 Dec 2023 at 22:03, Kirill Reshke <reshkekirill@gmail.com> wrote: > > On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: >> >> On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote: >> > >> > So, 0002 patch uses the `get_tablespace` function, which searches Catalog to tablespace SMGR id. I wonder how `smgr_redo`would work with it? >> >> That's a very good point I hadn't considered in detail yet. Quite >> clearly, the current code is wrong in assuming that the catalog is >> accessible, and it should probably be stored in a way similar to >> pg_filenode.map in a file managed outside the buffer pool. >> > Hmm, pg_filenode.map is a nice idea. So, simply maintain TableSpaceOId -> smgr id mapping in a separate file and updatethe whole file on any changes, right? > Looks reasonable to me, but it is clear that this solution can be really slow in some patterns, like if we create many-manytablespaces(the way you suggested it in the per-relation SMGR feature). Maybe we can store data in files somehowseparately, and only update one chunk per operation. Yes, but that's a later issue... I'm not sure many-many tablespaces is actually a good thing. There are already very few reasons to store tables in more than just the default tablespace. For temporary relations, there is indeed a guc to automatically put them into one tablespace; and I can see a similar thing being useful for temporary relations, too. Then there I can see high-performant local disks vs lower-performant (but cheaper) local disks also as something reasonable. But that only gets us to ~6 tablespaces, assuming separate tablespaces for each combination of (normal, temp, unlogged) * (fast, cheap). I'm not sure there are many other reasons to add tablespaces, let alone making one for each table. Note that you can select which tablespace a table is stored in, so I see very little reason to actually do something about large numbers of tablespaces being prohibitively expensive performance-wise. Why do you want to have a whole new storage configuration for each of your relations? > Anyway, if we use a `pg_filenode.map` - like solution, we need to reuse its code infrasture, right? For example, it seemsthat code that calculates checksums can be reused. > So, we need to refactor code here, define something like FileMap API maybe. Or is it not really worth it? We can just writesimilar code twice. I'm not sure about that. I really doubt we'll need things that are that similar: right now, the tablespace->smgr mapping could be considered to be implied by the symlinks in /pg_tblspc/. Non-MD tablespaces could add a file <oid>.tblspc that detail their configuration, which would also fix the issue of spcoid->smgr mapping. Kind regards, Matthias van de Meent Neon (https://neon.tech)
Thought I would show off what is possible with this patchset. Heikki, a couple of months ago in our internal Slack, said: > [I would like] a debugging tool that checks that we're not missing any > fsyncs. I bumped into a few missing fsync bugs with unlogged tables > lately and a tool like that would've been very helpful. My task was to create such a tool, and off I went. I started with the storage manager extension patch that Matthias sent to the list last year[0]. Andres, in another thread[1], said: > I've been thinking that we need a validation layer for fsyncs, it's too hard > to get right without testing, and crash testing is not likel enough to catch > problems quickly / resource intensive. > > My thought was that we could keep a shared hash table of all files created / > dirtied at the fd layer, with the filename as key and the value the current > LSN. We'd delete files from it when they're fsynced. At checkpoints we go > through the list and see if there's any files from before the redo that aren't > yet fsynced. All of this in assert builds only, of course. I took this idea and ran with it. I call it the fsync_checker™️. It is an extension that prints relations that haven't been fsynced prior to a CHECKPOINT. Note that this idea doesn't work in practice because relations might not be fsynced, but they might be WAL-logged, like in the case of createdb. See log_smgrcreate(). I can't think of an easy way to solve this problem looking at the codebase as it stands. Here is a description of the patches: 0001: This is essentially just the patch that Matthias posted earlier, but rebased and adjusted a little bit so storage managers can "inherit" from other storage managers. 0002: This is an extension of 0001, which allows for extensions to set a global storage manager. This is pretty hacky, and if it was going to be pulled into mainline, it would need some better protection. For instance, only one extension should be able to set the global storage manager. We wouldn't want extensions stepping over each other, etc. 0003: Adds a hook for extensions to inspect a checkpoint before it actually occurs. The purpose for the fsync_checker is so that I can iterate over all the relations the extension tracks to find files that haven't been synced prior to the completion of the checkpoint. 0004: This is the actual fsync_checker extension itself. It must be preloaded. Hopefully this is a good illustration of how the initial patch could be used, even though it isn't perfect. [0]: https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com [1]: https://www.postgresql.org/message-id/20220127182838.ba3434dp2pe5vcia%40alap3.anarazel.de -- Tristan Partin Neon (https://neon.tech)
Attachment
Hi, > Thought I would show off what is possible with this patchset. > > [...] Just wanted to let you know that cfbot doesn't seem to be entirely happy with the patch [1]. Please consider submitting an updated version. Best regards, Aleksander Alekseev (wearing co-CFM hat) [1]: http://cfbot.cputube.org/
Hi, I reviewed the discussion and took a look at the patch sets. It seems like many things are combined here. Based on the subject, I initially thought it aimed to provide the infrastructure to easily extend storage managers. This would allow anyone to create their own storage managers using this infrastructure. While it addresses this, it also includes additional features like fsync_checker, which I believe should be a separate feature. Even though it might use the same infrastructure, it appears to be a different functionality. I think we should focus solely on providing the infrastructure here. We need to decide on our approach—whether to use a hook-based method or a registration-based method—and I believe this requires further discussion. The hook-based approach is simple and works well for anyone writing their own storage manager. However, it has its limitations as we can either use the default storage manager or a custom-built one for all the work load, but we cannot choose between multiple storage managers. On the other hand, the registration-based approach allows choosing between multiple storage managers based on the workload, though it requires a lot of changes. Are we planning to support other storage managers in PostgreSQL in the near future? If not, it is better to go with the hook-based approach. Otherwise, the registration-based approach is preferable as it offers more flexibility to users and enhances PostgreSQL’s functionality. Could you please share your thoughts on this? Also, let me know if this topic has already been discussed and if any conclusions were reached. Best Regards, Nitin Jadhav Azure Database for PostgreSQL Microsoft On Sat, Jan 13, 2024 at 1:27 AM Tristan Partin <tristan@neon.tech> wrote: > > Thought I would show off what is possible with this patchset. > > Heikki, a couple of months ago in our internal Slack, said: > > > [I would like] a debugging tool that checks that we're not missing any > > fsyncs. I bumped into a few missing fsync bugs with unlogged tables > > lately and a tool like that would've been very helpful. > > My task was to create such a tool, and off I went. I started with the > storage manager extension patch that Matthias sent to the list last > year[0]. > > Andres, in another thread[1], said: > > > I've been thinking that we need a validation layer for fsyncs, it's too hard > > to get right without testing, and crash testing is not likel enough to catch > > problems quickly / resource intensive. > > > > My thought was that we could keep a shared hash table of all files created / > > dirtied at the fd layer, with the filename as key and the value the current > > LSN. We'd delete files from it when they're fsynced. At checkpoints we go > > through the list and see if there's any files from before the redo that aren't > > yet fsynced. All of this in assert builds only, of course. > > I took this idea and ran with it. I call it the fsync_checker™️. It is an > extension that prints relations that haven't been fsynced prior to > a CHECKPOINT. Note that this idea doesn't work in practice because > relations might not be fsynced, but they might be WAL-logged, like in > the case of createdb. See log_smgrcreate(). I can't think of an easy way > to solve this problem looking at the codebase as it stands. > > Here is a description of the patches: > > 0001: > > This is essentially just the patch that Matthias posted earlier, but > rebased and adjusted a little bit so storage managers can "inherit" from > other storage managers. > > 0002: > > This is an extension of 0001, which allows for extensions to set > a global storage manager. This is pretty hacky, and if it was going to > be pulled into mainline, it would need some better protection. For > instance, only one extension should be able to set the global storage > manager. We wouldn't want extensions stepping over each other, etc. > > 0003: > > Adds a hook for extensions to inspect a checkpoint before it actually > occurs. The purpose for the fsync_checker is so that I can iterate over > all the relations the extension tracks to find files that haven't been > synced prior to the completion of the checkpoint. > > 0004: > > This is the actual fsync_checker extension itself. It must be preloaded. > > Hopefully this is a good illustration of how the initial patch could be > used, even though it isn't perfect. > > [0]: https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com > [1]: https://www.postgresql.org/message-id/20220127182838.ba3434dp2pe5vcia%40alap3.anarazel.de > > -- > Tristan Partin > Neon (https://neon.tech)