Thread: Extensible storage manager API - SMGR hook Redux

From: Matthias van de Meent
Hi hackers,

At Neon, we've been working on removing the file system dependency
from PostgreSQL and replacing it with a distributed storage layer. For
now, we've seen most success in this by replacing the implementation
of the smgr API, but it did require some core modifications like those
proposed early last year by Anastasia [0].

As mentioned in the previous thread, there are several reasons why you
would want to use a non-default storage manager: storage-level
compression, encryption, and disk limit quotas [0]; offloading of cold
relation data was also mentioned [1].

In the thread on Anastasia's patch, Yura Sokolov mentioned that
instead of a hook-based smgr extension, a registration-based smgr
would be preferred, with integration into namespaces. Please find
attached an as of yet incomplete patch that starts to do that.

The patch is as yet incomplete (as it isn't derived from Anastasia's
patch), but I would like comments on this regardless, as this is a
fairly fundamental component of PostgreSQL that is being modified, and
it is often better to get comments early in the development cycle. One
significant issue that I've seen so far is that the catcache is not
guaranteed to be available in all backends that need to do smgr
operations, and I've not yet found a good solution.

Changes compared to HEAD:
- smgrsw is now dynamically allocated and grows as new storage
managers are loaded (during shared_preload_libraries)
- CREATE TABLESPACE has new optional syntax USING smgrname (option [, ...])
- tablespace storage is (planned to be) fully managed by smgr through
some new smgr APIs

Changes compared to Anastasia's patch:
- extensions do not get to hook into and replace the API of the smgr
code directly; they are hidden behind the smgr registry.

Successes:
- 0001 passes tests (make check-world)
- 0002 builds without warnings (make)

TODO:
- fix dependency failures when catcache is unavailable
- tablespace redo is currently broken with 0002
- fix tests for 0002
- ensure that pg_dump etc. works with the new tablespace storage manager options

Looking forward to any comments, suggestions and reviews.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech/)


[0] https://www.postgresql.org/message-id/CAP4vRV6JKXyFfEOf%3Dn%2Bv5RGsZywAQ3CTM8ESWvgq%2BS87Tmgx_g%40mail.gmail.com
[1] https://www.postgresql.org/message-id/D365F19F-BC3E-4F96-A91E-8DB13049749E@yandex-team.ru

Attachment

Re: Extensible storage manager API - SMGR hook Redux

From: Andres Freund
Hi,

On 2023-06-30 14:26:44 +0200, Matthias van de Meent wrote:
> diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
> index 4c49393fc5..8685b9fde6 100644
> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -1002,6 +1002,11 @@ PostmasterMain(int argc, char *argv[])
>       */
>      ApplyLauncherRegister();
>  
> +    /*
> +     * Register built-in managers that are not part of static arrays
> +     */
> +    register_builtin_dynamic_managers();
> +
>      /*
>       * process any libraries that should be preloaded at postmaster start
>       */

That doesn't strike me as a good place to initialize this, we'll need it in
multiple places that way.  How about putting it into BaseInit()?


> -static const f_smgr smgrsw[] = {
> +static f_smgr *smgrsw;

This adds another level of indirection. I would rather limit the number of
registerable smgrs than do that.



> +SMgrId
> +smgr_register(const f_smgr *smgr, Size smgrrelation_size)
> +{

> +    MemoryContextSwitchTo(old);
> +
> +    pg_compiler_barrier();

Huh, what's that about?


> @@ -59,14 +63,8 @@ typedef struct SMgrRelationData
>       * Fields below here are intended to be private to smgr.c and its
>       * submodules.  Do not touch them from elsewhere.
>       */
> -    int            smgr_which;        /* storage manager selector */
> -
> -    /*
> -     * for md.c; per-fork arrays of the number of open segments
> -     * (md_num_open_segs) and the segments themselves (md_seg_fds).
> -     */
> -    int            md_num_open_segs[MAX_FORKNUM + 1];
> -    struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
> +    SMgrId        smgr_which;        /* storage manager selector */
> +    int            smgrrelation_size;    /* size of this struct, incl. smgr-specific data */

It looked to me like you determined this globally - why do we need it in every
entry then?

Greetings,

Andres Freund



Re: Extensible storage manager API - SMGR hook Redux

From: "Tristan Partin"
> Subject: [PATCH v1 1/2] Expose f_smgr to extensions for manual implementation

From what I can see, all the md* APIs that were exposed in md.h can now
be made static in md.c. The only other references to those APIs were in
smgr.c.

> Subject: [PATCH v1 2/2] Prototype: Allow tablespaces to specify which SMGR
>  they use

> -typedef uint8 SMgrId;
> +/*
> + * volatile ID of the smgr. Across various configurations IDs may vary,
> + * true identity is the name of each smgr.
> + */
> +typedef int SMgrId;
>
> -#define MaxSMgrId UINT8_MAX
> +#define MaxSMgrId              INT_MAX

In a future revision of this patch, seems worthwhile to just start as
int instead of a uint8 to avoid this song and dance. Maybe int8 instead
of int?

> +static SMgrId recent_smgrid = -1;

You could use InvalidSmgrId here.

> +void smgrvalidatetspopts(const char *smgrname, List *opts)
> +{
> +       SMgrId smgrid = get_smgr_by_name(smgrname, false);
> +
> +       smgrsw[smgrid].smgr_validate_tspopts(opts);
> +}
> +
> +void smgrcreatetsp(const char *smgrname, Oid tsp, List *opts, bool isredo)
> +{
> +       SMgrId smgrid = get_smgr_by_name(smgrname, false);
> +
> +       smgrsw[smgrid].smgr_create_tsp(tsp, opts, isredo);
> +}
> +
> +void smgrdroptsp(const char *smgrname, Oid tsp, bool isredo)
> +{
> +       SMgrId smgrid = get_smgr_by_name(smgrname, false);
> +
> +       smgrsw[smgrid].smgr_drop_tsp(tsp, isredo);
> +}

Do you not need to check if smgrid is the InvalidSmgrId? I didn't see
any other validation anywhere.

> +       char       *smgr;
> +       List       *smgropts; /* list of DefElem nodes */

smgrname would probably work better alongside tablespacename in that
struct.

> @@ -221,7 +229,7 @@ mdexists(SMgrRelation reln, ForkNumber forknum)
>         if (!InRecovery)
>                 mdclose(reln, forknum);
>
> -       return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
> +       return (mdopenfork(mdreln, forknum, EXTENSION_RETURN_NULL) != NULL);
>  }

Was this a victim of a bad rebase? Seems like it belongs in the previous
patch.

> +void mddroptsp(Oid tsp, bool isredo)
> +{
> +
> +}

Some functions in this file have the return type on the previous line.

This is a pretty slick patchset. Excited to read more discussion and
see how it evolves.

--
Tristan Partin
Neon (https://neon.tech)



Re: Extensible storage manager API - SMGR hook Redux

From: "Tristan Partin"
Found these warnings while compiling with only 0001 applied.

[1166/2337] Compiling C object src/backend/postgres_lib.a.p/storage_smgr_md.c.o
../src/backend/storage/smgr/md.c: In function ‘mdexists’:
../src/backend/storage/smgr/md.c:224:28: warning: passing argument 1 of ‘mdopenfork’ from incompatible pointer type
[-Wincompatible-pointer-types]
  224 |         return (mdopenfork(reln, forknum, EXTENSION_RETURN_NULL) != NULL);
      |                            ^~~~
      |                            |
      |                            SMgrRelation {aka SMgrRelationData *}
../src/backend/storage/smgr/md.c:167:43: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is
of type ‘SMgrRelation’ {aka ‘SMgrRelationData *’}
  167 | static MdfdVec *mdopenfork(MdSMgrRelation reln, ForkNumber forknum, int behavior);
      |                            ~~~~~~~~~~~~~~~^~~~
../src/backend/storage/smgr/md.c: In function ‘mdcreate’:
../src/backend/storage/smgr/md.c:287:40: warning: passing argument 1 of ‘register_dirty_segment’ from incompatible
pointer type [-Wincompatible-pointer-types]
  287 |                 register_dirty_segment(reln, forknum, mdfd);
      |                                        ^~~~
      |                                        |
      |                                        SMgrRelation {aka SMgrRelationData *}
../src/backend/storage/smgr/md.c:168:51: note: expected ‘MdSMgrRelation’ {aka ‘MdSMgrRelationData *’} but argument is
of type ‘SMgrRelation’ {aka ‘SMgrRelationData *’}
  168 | static void register_dirty_segment(MdSMgrRelation reln, ForkNumber forknum,

Here is a diff to be applied to 0001 which fixes the warnings that get
generated when compiling. I did see that one of the warnings gets fixed
in 0002 (the mdexists() one). I am assuming that change was just missed
while rebasing the patchset or something. I did not see a fix for
mdcreate() in 0002, however.

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Fwd: Extensible storage manager API - SMGR hook Redux

From: Kirill Reshke
Sorry for double-posting, I accidentally replied to Matthias, not the mailing list :(

---------- Forwarded message ---------
From: Kirill Reshke <reshkekirill@gmail.com>
Date: Mon, 4 Dec 2023 at 19:46
Subject: Re: Extensible storage manager API - SMGR hook Redux
To: Matthias van de Meent <boekewurm+postgres@gmail.com>


Hi!

On Fri, 30 Jun 2023 at 15:27, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
> Hi hackers,
>
> [...]
>
> Looking forward to any comments, suggestions and reviews.

So, the 0002 patch uses the `get_tablespace` function, which searches the catalog for the tablespace's SMGR id. I wonder how `smgr_redo` would work with it?
Is it possible to query the system catalog during crash recovery? As far as I understand the answer is "no"; correct me if I'm wrong.
Furthermore, why do we only allow a tablespace to have its own SMGR implementation; can we have a per-relation SMGR? Maybe we can do it in a way similar to custom RMGRs (meaning, write the SMGR OID into WAL and use it in crash recovery, etc.)?

Re: Extensible storage manager API - SMGR hook Redux

From: Matthias van de Meent
On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> So, the 0002 patch uses the `get_tablespace` function, which searches the catalog for the tablespace's SMGR id. I wonder how `smgr_redo` would work with it?

That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.

> Is it possible to query the system catalog during crash recovery? As far as I understand the answer is "no"; correct me if I'm wrong.

Yes, you're correct, we can't access buffers like this during
recovery. That's going to need some more effort.

> Furthermore, why do we only allow a tablespace to have its own SMGR implementation; can we have a per-relation SMGR? Maybe we can do it in a way similar to custom RMGRs (meaning, write the SMGR OID into WAL and use it in crash recovery, etc.)?

AMs (and by extension, their RMGRs) that use Postgres' buffer pool
have control over how they want to layout their blocks and files, but
generally don't care about where those blocks and files are located,
as long as they _can_ be retrieved.

Tablespaces, however, describe 'drives' or 'storage pools' in which
the tables/relations are stored, which to me seems to be the more
logical place to configure the SMGR abstraction of how and where to
store the actual data, as SMGRs manage the low-level relation block IO
(= file accesses), and tablespaces manage where files are stored.

Note that nothing prevents you from using one tablespace (thus
different SMGR) per relation, apart from bloated catalogs and the
superuser permissions required for creating those tablespaces. It'd be
difficult to manage, but not impossible.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Extensible storage manager API - SMGR hook Redux

From: Kirill Reshke


On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> So, the 0002 patch uses the `get_tablespace` function, which searches the catalog for the tablespace's SMGR id. I wonder how `smgr_redo` would work with it?

That's a very good point I hadn't considered in detail yet. Quite
clearly, the current code is wrong in assuming that the catalog is
accessible, and it should probably be stored in a way similar to
pg_filenode.map in a file managed outside the buffer pool.

Hmm, pg_filenode.map is a nice idea. So, simply maintain a TableSpaceOid -> smgr id mapping in a separate file and update the whole file on any changes, right?
Looks reasonable to me, but it is clear that this solution can be really slow in some patterns, like if we create many, many tablespaces (the way you suggested it in the per-relation SMGR feature). Maybe we can store the data in separate files somehow, and only update one chunk per operation.

Anyway, if we use a `pg_filenode.map`-like solution, we need to reuse its code infrastructure, right? For example, it seems that the code that calculates checksums can be reused.
So, we need to refactor the code here, and define something like a FileMap API maybe. Or is it not really worth it? We could just write similar code twice.

Re: Extensible storage manager API - SMGR hook Redux

From: Matthias van de Meent
On Mon, 4 Dec 2023 at 22:03, Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> On Mon, 4 Dec 2023 at 22:21, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
>>
>> On Mon, 4 Dec 2023 at 17:51, Kirill Reshke <reshkekirill@gmail.com> wrote:
>> >
>> > So, the 0002 patch uses the `get_tablespace` function, which searches the catalog for the tablespace's SMGR id. I wonder how `smgr_redo` would work with it?
>>
>> That's a very good point I hadn't considered in detail yet. Quite
>> clearly, the current code is wrong in assuming that the catalog is
>> accessible, and it should probably be stored in a way similar to
>> pg_filenode.map in a file managed outside the buffer pool.
>>
> Hmm, pg_filenode.map is a nice idea. So, simply maintain a TableSpaceOid -> smgr id mapping in a separate file and update the whole file on any changes, right?
> Looks reasonable to me, but it is clear that this solution can be really slow in some patterns, like if we create many, many tablespaces (the way you suggested it in the per-relation SMGR feature). Maybe we can store the data in separate files somehow, and only update one chunk per operation.

Yes, but that's a later issue... I'm not sure many, many tablespaces is
actually a good thing. There are already very few reasons to store
tables in more than just the default tablespace. For temporary
relations, there is indeed a GUC to automatically put them into one
tablespace, and I can see a similar thing being useful for unlogged
relations, too. Then I can also see high-performance local disks vs.
lower-performance (but cheaper) local disks as something reasonable.
But that only gets us to ~6 tablespaces, assuming separate tablespaces
for each combination of (normal, temp, unlogged) * (fast, cheap). I'm
not sure there are many other reasons to add tablespaces, let alone
making one for each table.

Note that you can select which tablespace a table is stored in, so I
see very little reason to actually do something about large numbers of
tablespaces being prohibitively expensive performance-wise.

Why do you want to have a whole new storage configuration for each of
your relations?

> Anyway, if we use a `pg_filenode.map`-like solution, we need to reuse its code infrastructure, right? For example, it seems that the code that calculates checksums can be reused.
> So, we need to refactor the code here, and define something like a FileMap API maybe. Or is it not really worth it? We could just write similar code twice.

I'm not sure about that. I really doubt we'll need things that are
that similar: right now, the tablespace->smgr mapping could be
considered to be implied by the symlinks in /pg_tblspc/. Non-MD
tablespaces could add a file <oid>.tblspc that details their
configuration, which would also fix the issue of the spcoid->smgr mapping.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)



Re: Extensible storage manager API - SMGR hook Redux

From: "Tristan Partin"
Thought I would show off what is possible with this patchset.

Heikki, a couple of months ago in our internal Slack, said:

> [I would like] a debugging tool that checks that we're not missing any
> fsyncs. I bumped into a few missing fsync bugs with unlogged tables
> lately and a tool like that would've been very helpful.

My task was to create such a tool, and off I went. I started with the
storage manager extension patch that Matthias sent to the list last
year[0].

Andres, in another thread[1], said:

> I've been thinking that we need a validation layer for fsyncs, it's too hard
> to get right without testing, and crash testing is not likel enough to catch
> problems quickly / resource intensive.
>
> My thought was that we could keep a shared hash table of all files created /
> dirtied at the fd layer, with the filename as key and the value the current
> LSN. We'd delete files from it when they're fsynced. At checkpoints we go
> through the list and see if there's any files from before the redo that aren't
> yet fsynced.  All of this in assert builds only, of course.

I took this idea and ran with it. I call it the fsync_checker™️. It is an
extension that prints relations that haven't been fsynced prior to
a CHECKPOINT. Note that this idea doesn't work in practice because
relations might not be fsynced, but they might be WAL-logged, like in
the case of createdb. See log_smgrcreate(). I can't think of an easy way
to solve this problem looking at the codebase as it stands.

Here is a description of the patches:

0001:

This is essentially just the patch that Matthias posted earlier, but
rebased and adjusted a little bit so storage managers can "inherit" from
other storage managers.

0002:

This is an extension of 0001, which allows for extensions to set
a global storage manager. This is pretty hacky, and if it was going to
be pulled into mainline, it would need some better protection. For
instance, only one extension should be able to set the global storage
manager. We wouldn't want extensions stepping over each other, etc.

0003:

Adds a hook for extensions to inspect a checkpoint before it actually
occurs. The purpose for the fsync_checker is so that I can iterate over
all the relations the extension tracks to find files that haven't been
synced prior to the completion of the checkpoint.

0004:

This is the actual fsync_checker extension itself. It must be preloaded.

Hopefully this is a good illustration of how the initial patch could be
used, even though it isn't perfect.

[0]: https://www.postgresql.org/message-id/CAEze2WgMySu2suO_TLvFyGY3URa4mAx22WeoEicnK%3DPCNWEMrA%40mail.gmail.com
[1]: https://www.postgresql.org/message-id/20220127182838.ba3434dp2pe5vcia%40alap3.anarazel.de

--
Tristan Partin
Neon (https://neon.tech)

Attachment

Re: Extensible storage manager API - SMGR hook Redux

From: Aleksander Alekseev
Hi,

> Thought I would show off what is possible with this patchset.
>
> [...]

Just wanted to let you know that cfbot doesn't seem to be entirely
happy with the patch [1]. Please consider submitting an updated
version.

Best regards,
Aleksander Alekseev (wearing co-CFM hat)

[1]: http://cfbot.cputube.org/



Re: Extensible storage manager API - SMGR hook Redux

From: Nitin Jadhav
Hi,

I reviewed the discussion and took a look at the patch sets. It seems
like many things are combined here. Based on the subject, I initially
thought it aimed to provide the infrastructure to easily extend
storage managers. This would allow anyone to create their own storage
managers using this infrastructure. While it addresses this, it also
includes additional features like fsync_checker, which I believe
should be a separate feature. Even though it might use the same
infrastructure, it appears to be a different functionality. I think we
should focus solely on providing the infrastructure here.

We need to decide on our approach—whether to use a hook-based method
or a registration-based method—and I believe this requires further
discussion.

The hook-based approach is simple and works well for anyone writing
their own storage manager. However, it has its limitations: we can
either use the default storage manager or a custom-built one for the
entire workload, but we cannot choose between multiple storage managers.
On the other hand, the registration-based approach allows choosing
between multiple storage managers based on the workload, though it
requires a lot of changes.

Are we planning to support other storage managers in PostgreSQL in the
near future? If not, it is better to go with the hook-based approach.
Otherwise, the registration-based approach is preferable as it offers
more flexibility to users and enhances PostgreSQL’s functionality.

Could you please share your thoughts on this? Also, let me know if
this topic has already been discussed and if any conclusions were
reached.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Sat, Jan 13, 2024 at 1:27 AM Tristan Partin <tristan@neon.tech> wrote:
>
> Thought I would show off what is possible with this patchset.
>
> [...]
>
> Hopefully this is a good illustration of how the initial patch could be
> used, even though it isn't perfect.



Re: Extensible storage manager API - SMGR hook Redux

From: Xun Gong

Thank you for your detailed review and insights. I share your view that a registration-based approach for custom storage managers (smgr) is more versatile. This method allows for the implementation of custom table access methods, facilitating the use of various storage services (such as file services or object storage), different file organization formats (files in one directory or in many subdirectories), and flexible deletion logic (direct deletion or mark-and-sweep).

While I acknowledge that the registration-based approach requires more modifications, I believe the benefits in terms of extensibility and functionality are significant. I have seen similar explorations in the implementation of AO tables in Greenplum, where a dedicated smgr was created due to its distinct file organization compared to heap tables.

I look forward to further discussions on this topic.


On Tue, 10 Dec 2024 at 11:25, Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote:
Hi,

I reviewed the discussion and took a look at the patch sets. [...]

Could you please share your thoughts on this? Also, let me know if
this topic has already been discussed and if any conclusions were
reached.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft