Thread: Drop type "smgr"?

Drop type "smgr"?

From
Thomas Munro
Date:
Hello hackers,

The type smgr has only one value 'magnetic disk'.  ~15 years ago it
also had a value 'main memory', and in Berkeley POSTGRES 4.2 there was
a third value 'sony jukebox'.  Back then, all tables had an associated
block storage manager, and it was recorded as an attribute relsmgr of
pg_class (or pg_relation as it was known further back).  This was the
type of that attribute, removed by Bruce in 3fa2bb31 (1997).

Nothing seems to break if you remove it (except for some tests using
it in an incidental way).  See attached.

Motivation: A couple of projects propose to add new smgr
implementations alongside md.c in order to use bufmgr.c for more kinds
of files, but it seems entirely bogus to extend the unused smgr type
to cover those.

-- 
Thomas Munro
https://enterprisedb.com

Attachment

Re: Drop type "smgr"?

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Motivation: A couple of projects propose to add new smgr
> implementations alongside md.c in order to use bufmgr.c for more kinds
> of files, but it seems entirely bogus to extend the unused smgr type
> to cover those.

I agree that smgrtype as it stands is pretty pointless, but what
will we be using instead to get to those other implementations?

            regards, tom lane


Re: Drop type "smgr"?

From
Thomas Munro
Date:
On Thu, Feb 28, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Motivation: A couple of projects propose to add new smgr
> > implementations alongside md.c in order to use bufmgr.c for more kinds
> > of files, but it seems entirely bogus to extend the unused smgr type
> > to cover those.
>
> I agree that smgrtype as it stands is pretty pointless, but what
> will we be using instead to get to those other implementations?

Our current thinking is that smgropen() should know how to map a small
number of special database OIDs to different smgr implementations
(where currently it hard-codes smgr_which = 0).  For example there
would be a pseudo-database for undo logs, and another for the SLRUs
that Shawn Debnath and others have been proposing to move into shared
buffers.  Another idea would be to widen the buffer tag to include the
selector.  Unlike the ancestral code, it wouldn't need to appear in
catalogs or ever be seen or typed in by users so there still wouldn't
be a use for this type.

-- 
Thomas Munro
https://enterprisedb.com


Re: Drop type "smgr"?

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Thu, Feb 28, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I agree that smgrtype as it stands is pretty pointless, but what
>> will we be using instead to get to those other implementations?

> Our current thinking is that smgropen() should know how to map a small
> number of special database OIDs to different smgr implementations

Hmm.  Maybe mapping based on tablespaces would be a better idea?

> Unlike the ancestral code, it wouldn't need to appear in
> catalogs or ever be seen or typed in by users so there still wouldn't
> be a use for this type.

Yeah, the $64 problem at this level is that you don't really want
to depend on catalog contents, because you cannot do a lookup to
find out what to do.  So I agree that we're pretty unlikely to
resurrect an smgr type per se.  But I'd been expecting an answer
mentioning pg_am OIDs, and was wondering how that'd work exactly.
Probably, it would still be down to some C code having hard-wired
knowledge about some OIDs ...

            regards, tom lane


Re: Drop type "smgr"?

From
Thomas Munro
Date:
On Thu, Feb 28, 2019 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Thu, Feb 28, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I agree that smgrtype as it stands is pretty pointless, but what
> >> will we be using instead to get to those other implementations?
>
> > Our current thinking is that smgropen() should know how to map a small
> > number of special database OIDs to different smgr implementations
>
> Hmm.  Maybe mapping based on tablespaces would be a better idea?

In the undo log proposal (about which more soon) we are using
tablespaces for their real purpose, so we need that OID.  If you SET
undo_tablespaces = foo then future undo data created by your session
will be written there, which might be useful for putting that IO on
different storage.  We also use the relation OID to chop up the undo
address space into separate numbered undo logs, so that different
sessions get their own space to insert data without trampling on each
other's buffer locks.  That leaves only the database OID to mess with
(unless we widen buffer tags and the smgropen() interface).

> > Unlike the ancestral code, it wouldn't need to appear in
> > catalogs or ever be seen or typed in by users so there still wouldn't
> > be a use for this type.
>
> Yeah, the $64 problem at this level is that you don't really want
> to depend on catalog contents, because you cannot do a lookup to
> find out what to do.

Right.  It would be a circular dependency if you had to read a catalog
before you could consult low level SLRUs like (say) the clog* via
shared buffers, since you need the clog to read the catalog, or (to
peer a long way down the road and around several corners) if the
catalogs themselves could optionally be stored in an undo-aware table
access manager like zheap, which might require reading undo.  I think
this block storage stuff exists at a lower level than relations and
transactions and therefore also catalogs, and there will be a small
fixed number of them and it makes sense to hard-code the knowledge of
them.

*I'm at least a little bit aware of the history here: your 2001 commit
2589735d moved clog out of shared buffers!  That enabled you to
develop the system of segment files truncated from the front.  That's
sort of what this new smgr work is about; putting things in shared
buffers, but just mapping the blocks to paths differently than md.c,
as appropriate.

-- 
Thomas Munro
https://enterprisedb.com


Re: Drop type "smgr"?

From
Haribabu Kommi
Date:

On Thu, Feb 28, 2019 at 5:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Thu, Feb 28, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I agree that smgrtype as it stands is pretty pointless, but what
>> will we be using instead to get to those other implementations?

> Our current thinking is that smgropen() should know how to map a small
> number of special database OIDs to different smgr implementations

Hmm.  Maybe mapping based on tablespaces would be a better idea?

Thanks to bringing up this idea of mutliple smgr implementations. I also
thought of implementing our own smgr implementation to support transparent
data encryption on the disk based on tablespace mapping.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Drop type "smgr"?

From
Robert Haas
Date:
On Thu, Feb 28, 2019 at 1:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> The type smgr has only one value 'magnetic disk'.  ~15 years ago it
> also had a value 'main memory', and in Berkeley POSTGRES 4.2 there was
> a third value 'sony jukebox'.  Back then, all tables had an associated
> block storage manager, and it was recorded as an attribute relsmgr of
> pg_class (or pg_relation as it was known further back).  This was the
> type of that attribute, removed by Bruce in 3fa2bb31 (1997).
>
> Nothing seems to break if you remove it (except for some tests using
> it in an incidental way).  See attached.

FWIW, +1 from me.  I thought about arguing to remove this a number of
years ago when I was poking around in this area for some reason, but
it didn't seem important enough to be worth arguing about then.  Now,
because we're actually going to maybe-hopefully get some more smgrs
that do interesting things, it seems worth the arguing...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Drop type "smgr"?

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Thu, Feb 28, 2019 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thomas Munro <thomas.munro@gmail.com> writes:
>>> Our current thinking is that smgropen() should know how to map a small
>>> number of special database OIDs to different smgr implementations

>> Hmm.  Maybe mapping based on tablespaces would be a better idea?

> In the undo log proposal (about which more soon) we are using
> tablespaces for their real purpose, so we need that OID.  If you SET
> undo_tablespaces = foo then future undo data created by your session
> will be written there, which might be useful for putting that IO on
> different storage.

Meh.  That's a point, but it doesn't exactly seem like a killer argument.
Just in the abstract, it seems much more likely to me that people would
want per-database special rels than per-tablespace special rels.  And
I think your notion of a GUC that can control this is probably pie in
the sky anyway: if we can't afford to look into the catalogs to resolve
names at this code level, how are we going to handle a GUC?

The real reason I'm concerned about this, though, is that for either
a database or a tablespace, you can *not* get away with having a magic
OID just hanging in space with no actual catalog row matching it.
If nothing else, you need an entry there to prevent someone from
reusing the OID for another purpose.  And a pg_database row that
doesn't correspond to a real database is going to break all kinds of
code, starting with pg_upgrade and the autovacuum launcher.  Special
rows in pg_tablespace are much less likely to cause issues, because
of the precedent of pg_global and pg_default.

In short, I think you're better off equating smgrs to magic
tablespaces, and if you need some way of letting users control
where those map to storage-wise, control it in some other way.

            regards, tom lane


Re: Drop type "smgr"?

From
Robert Haas
Date:
On Thu, Feb 28, 2019 at 10:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The real reason I'm concerned about this, though, is that for either
> a database or a tablespace, you can *not* get away with having a magic
> OID just hanging in space with no actual catalog row matching it.
> If nothing else, you need an entry there to prevent someone from
> reusing the OID for another purpose.  And a pg_database row that
> doesn't correspond to a real database is going to break all kinds of
> code, starting with pg_upgrade and the autovacuum launcher.  Special
> rows in pg_tablespace are much less likely to cause issues, because
> of the precedent of pg_global and pg_default.

My first intuition was the same as yours -- that we should use the
tablespace to decide which smgr is relevant -- but I now think that
intuition was wrong.  Even if you use the tablespace OID to select the
smgr, it doesn't completely solve the problem you're worried about
here.  You still have to put SOMETHING in the database OID field, and
that's going to be just as fake as it was before.  I guess you could
use the OID for pg_global for things like undo and SLRU data, but then
how is GetRelationPath() going to work?  You don't have enough bits
left anywhere to specify both a tablespace and a location within that
tablespace in a reasonable way, and I think it's far from obvious how
you would build a side channel that could carry that information.

Also, I don't see why we'd need a fake pg_database row in the first
place.  IIUC, the OID counter wraps around to FirstNormalObjectId, so
nobody should have a database with an OID less than that value.

If we were using smgrs to represent other kinds of ways of storing
user tables, e.g. network.c, cloud.c, magtape.c, punchcard.c, etc. -
then I think we'd have to find some way to let each user-defined
tablespace pick an smgr, and I don't really know how that would work
given the fact that we can't really use catalog lookups to figure out
which one to use, but actually the direction here is to store files
internal to the system using special-purpose smgrs so that we can pull
more things into shared_buffers, and for that purpose the database OID
seems cleaner.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Drop type "smgr"?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> My first intuition was the same as yours -- that we should use the
> tablespace to decide which smgr is relevant -- but I now think that
> intuition was wrong.  Even if you use the tablespace OID to select the
> smgr, it doesn't completely solve the problem you're worried about
> here.  You still have to put SOMETHING in the database OID field, and
> that's going to be just as fake as it was before.

My thought was that that could be zero if not relevant ... isn't that
what we do now for buffer tags for shared rels?

> I guess you could
> use the OID for pg_global for things like undo and SLRU data, but then
> how is GetRelationPath() going to work?  You don't have enough bits
> left anywhere to specify both a tablespace and a location within that
> tablespace in a reasonable way, and I think it's far from obvious how
> you would build a side channel that could carry that information.

It's certainly possible/likely that we're going to end up needing to
widen buffer tags to represent the smgr explicitly, because some use
cases are going to need a real database spec, some are going to need
a real tablespace spec, and some might need both.  Maybe we should
just bite that bullet.

            regards, tom lane


Re: Drop type "smgr"?

From
Robert Haas
Date:
On Thu, Feb 28, 2019 at 11:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's certainly possible/likely that we're going to end up needing to
> widen buffer tags to represent the smgr explicitly, because some use
> cases are going to need a real database spec, some are going to need
> a real tablespace spec, and some might need both.  Maybe we should
> just bite that bullet.

Well, Andres will probably complain about that.  He thinks, IIUC, that
the buffer tags are too wide already and that it's significantly
hurting performance on very very common operations - like buffer
lookups.  I haven't verified that myself, but I tend to think he knows
what he's talking about.

Anyway, given that your argument started off from the premise that we
had to have a pg_database row, I think we'd better look a little
harder at whether that premise is correct before getting too excited
here.  As I said in my earlier reply, I think that we probably don't
need to have a pg_database row given that we wrap around to
FirstNormalObjectId; any value we hard-code would be less than that.
If there are other reasons why we'd need that, it might be useful to
hear about them.

However, all we really need to decide on this thread is whether we
need 'smgr' exposed as an SQL type.  And I can't see why we need that
no matter how all of the rest of this turns out.  Nobody is currently
proposing to give users a choice of smgrs, just to use them for
internal stuff.  Even if that changed later, it doesn't necessarily
mean we'd add back an SQL type, or that if we did it would look like
the one we have today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Drop type "smgr"?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Feb 28, 2019 at 1:03 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> Nothing seems to break if you remove it (except for some tests using
>> it in an incidental way).  See attached.

> FWIW, +1 from me.

To be clear, I'm not objecting to the proposed patch either.  I was
just wondering where we plan to go from here, given that smgr.c wasn't
getting removed.

BTW, there is stuff in src/backend/storage/smgr/README that is
already obsoleted by this patch, and more that might be obsoleted
if development proceeds as discussed here.  So that needs a look.

            regards, tom lane


Re: Drop type "smgr"?

From
Shawn Debnath
Date:
On Thu, Feb 28, 2019 at 10:35:50AM -0500, Robert Haas wrote:
> On Thu, Feb 28, 2019 at 10:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The real reason I'm concerned about this, though, is that for either
> > a database or a tablespace, you can *not* get away with having a magic
> > OID just hanging in space with no actual catalog row matching it.
> > If nothing else, you need an entry there to prevent someone from
> > reusing the OID for another purpose.  And a pg_database row that
> > doesn't correspond to a real database is going to break all kinds of
> > code, starting with pg_upgrade and the autovacuum launcher.  Special
> > rows in pg_tablespace are much less likely to cause issues, because
> > of the precedent of pg_global and pg_default.
> 
> Also, I don't see why we'd need a fake pg_database row in the first
> place.  IIUC, the OID counter wraps around to FirstNormalObjectId, so
> nobody should have a database with an OID less than that value.

We have scripts under catalog directory that can check to ensure OIDs 
aren't re-used accidentally. However, we still have to define an entry 
in a catalog somewhere and I was proposing creating a new one, 
pg_storage_managers?, to track these entries. See [1] for previous
discussion on this topic. We wouldn't need to do catalog lookups for
being able to use the smgrs as the OIDs will be hardcoded in C, but the
data will be available for posterity and OID reservation.

> It's certainly possible/likely that we're going to end up needing to
> widen buffer tags to represent the smgr explicitly, because some use
> cases are going to need a real database spec, some are going to need
> a real tablespace spec, and some might need both.  Maybe we should
> just bite that bullet.

For now, the two projects that require the new smgrs, undo and slru, can 
get away with using the database OID as the smgr differentiator. We have 
enough left over bits to get our work done (Thomas please correct me if 
I am mistaken). The only issue is pg_buffercache would present DB OIDs 
that wouldn't map correctly (along with the rest of relfilenode).  We 
can modify the output to include more columns to give caching specific 
information when these special buffers are encountered.

> Well, Andres will probably complain about that.  He thinks, IIUC, that
> the buffer tags are too wide already and that it's significantly
> hurting performance on very very common operations - like buffer
> lookups.  I haven't verified that myself, but I tend to think he knows
> what he's talking about.

I can imagine this would. I am curious about the pef impact, going to at
least create a patch and do some testing.

Another thought: my colleague Anton Shyrabokau suggested potentially
re-using forknumber to differentiate smgrs. We are using 32 bits to
map 5 entries today. This approach would be similar to how we split up
the segment numbers and use the higher bits to identify forget or
unlink requests for checkpointer.

> Anyway, given that your argument started off from the premise that we
> had to have a pg_database row, I think we'd better look a little
> harder at whether that premise is correct before getting too excited
> here.  As I said in my earlier reply, I think that we probably don't
> need to have a pg_database row given that we wrap around to
> FirstNormalObjectId; any value we hard-code would be less than that.
> If there are other reasons why we'd need that, it might be useful to
> hear about them.

See above, I think a new catalog, instead of pg_database, can resolve
the issue while providing the ability to reserve the OIDs.

> However, all we really need to decide on this thread is whether we
> need 'smgr' exposed as an SQL type.  And I can't see why we need that
> no matter how all of the rest of this turns out.  Nobody is currently
> proposing to give users a choice of smgrs, just to use them for
> internal stuff.  Even if that changed later, it doesn't necessarily
> mean we'd add back an SQL type, or that if we did it would look like
> the one we have today.

+1 smgr types can be removed.


-- 
Shawn Debnath
Amazon Web Services (AWS)


Re: Drop type "smgr"?

From
Shawn Debnath
Date:
On Thu, Feb 28, 2019 at 10:02:46AM -0800, Shawn Debnath wrote:
> in a catalog somewhere and I was proposing creating a new one, 
> pg_storage_managers?, to track these entries. See [1] for previous
> discussion on this topic. We wouldn't need to do catalog lookups for

*ahem*, here's the footnote:
 
[1] 
https://www.postgresql.org/message-id/20180821184835.GA1032%4060f81dc409fc.ant.amazon.com

-- 
Shawn Debnath
Amazon Web Services (AWS)


Re: Drop type "smgr"?

From
Andres Freund
Date:
Hi,

On 2019-02-28 12:36:50 -0500, Robert Haas wrote:
> Well, Andres will probably complain about that.  He thinks, IIUC, that
> the buffer tags are too wide already and that it's significantly
> hurting performance on very very common operations - like buffer
> lookups.

Correct. Turns out especially comparing the keys after the hash match is
pretty expensive. It also is a significant factor influencing the size
of the hashtable, which influences how much of it can be in cache.

My plan is still to move to a two tiered system, where we have one
unordered datastructure to map from (db, tablespace, oid) to a secondary
ordered datastructure that then maps from (block number) to an actual
offset. With the first being cached somewhere in RelationData, therefore
not being performance critical. But while I hope to work for that in 13,
I don't think making other large projects depend on it would be smart.

Greetings,

Andres Freund


Re: Drop type "smgr"?

From
Andres Freund
Date:
Hi,

On 2019-02-28 10:02:46 -0800, Shawn Debnath wrote:
> We have scripts under catalog directory that can check to ensure OIDs 
> aren't re-used accidentally. However, we still have to define an entry 
> in a catalog somewhere and I was proposing creating a new one, 
> pg_storage_managers?, to track these entries. See [1] for previous
> discussion on this topic. We wouldn't need to do catalog lookups for
> being able to use the smgrs as the OIDs will be hardcoded in C, but the
> data will be available for posterity and OID reservation.

I'm inclined to just put them in pg_am, with a new type 's' (we already
have amtype = i for indexes, I'm planning to add 't' for tables
soon). While not a perfect fit for storage managers, it seems to fit
well enough.


> Another thought: my colleague Anton Shyrabokau suggested potentially
> re-using forknumber to differentiate smgrs. We are using 32 bits to
> map 5 entries today. This approach would be similar to how we split up
> the segment numbers and use the higher bits to identify forget or
> unlink requests for checkpointer.

That could probably be done, without incurring too much overhead
here. I'm not sure that the added complexity around the tree is worth it
however.

I personally would just go with the DB oid for the near future, where we
don't need per-database storage for those. And then plan for buffer
manager improvements.

Greetings,

Andres Freund


Re: Drop type "smgr"?

From
Tom Lane
Date:
Shawn Debnath <sdn@amazon.com> writes:
> On Thu, Feb 28, 2019 at 10:35:50AM -0500, Robert Haas wrote:
>> Also, I don't see why we'd need a fake pg_database row in the first
>> place.  IIUC, the OID counter wraps around to FirstNormalObjectId, so
>> nobody should have a database with an OID less than that value.

> We have scripts under catalog directory that can check to ensure OIDs 
> aren't re-used accidentally. However, we still have to define an entry 
> in a catalog somewhere and I was proposing creating a new one, 
> pg_storage_managers?, to track these entries.

That would fail to capture the property that the smgr OIDs mustn't
conflict with database OIDs, so the whole thing still seems like an
ugly kluge from here.

> Another thought: my colleague Anton Shyrabokau suggested potentially
> re-using forknumber to differentiate smgrs. We are using 32 bits to
> map 5 entries today.

Yeah, that seems like it might be a workable answer.

I really do think that relying on magic database OIDs is a bad idea;
if you think you aren't going to need a real database ID in there in
the future, you're being short-sighted.  And I guess the same argument
establishes that relying on magic tablespace OIDs would also be a bad
idea, even if there weren't a near-term proposal on the table for
needing real tablespace IDs in an alternate smgr.  So we need to find
some bits somewhere else, and the fork number field is the obvious
candidate.  Since, per this discussion, the smgr IDs need not really
be OIDs at all, we just need a few bits for them.

            regards, tom lane


Re: Drop type "smgr"?

From
Andres Freund
Date:
On 2019-02-28 13:16:02 -0500, Tom Lane wrote:
> Shawn Debnath <sdn@amazon.com> writes:
> > On Thu, Feb 28, 2019 at 10:35:50AM -0500, Robert Haas wrote:
> >> Also, I don't see why we'd need a fake pg_database row in the first
> >> place.  IIUC, the OID counter wraps around to FirstNormalObjectId, so
> >> nobody should have a database with an OID less than that value.
> 
> > We have scripts under catalog directory that can check to ensure OIDs 
> > aren't re-used accidentally. However, we still have to define an entry 
> > in a catalog somewhere and I was proposing creating a new one, 
> > pg_storage_managers?, to track these entries.
> 
> That would fail to capture the property that the smgr OIDs mustn't
> conflict with database OIDs, so the whole thing still seems like an
> ugly kluge from here.

It's definitely a kludge, but it doesn't seem that bad for now. Delaying
a nice answer till we have a more efficient bufmgr representation
doesn't seem crazy to me.

I don't think there's a real conflict risk given that we don't allow for
duplicated oids across catalogs at bootstrap time, and this is
definitely a bootstrap time issue.


> > Another thought: my colleague Anton Shyrabokau suggested potentially
> > re-using forknumber to differentiate smgrs. We are using 32 bits to
> > map 5 entries today.
> 
> Yeah, that seems like it might be a workable answer.

Yea, if we just split that into two 16bit entries, there'd not be much
lost. Some mild mild performance regression due to more granular
memory->register reads/writes, but I can't even remotely see that
matter.


> Since, per this discussion, the smgr IDs need not really be OIDs at
> all, we just need a few bits for them.

Personally I find having them as oids more elegant than the distaste
from misusing the database oid for a wihle, but I think it's fair to
disagree here.

Greetings,

Andres Freund


Re: Drop type "smgr"?

From
Thomas Munro
Date:
On Fri, Mar 1, 2019 at 7:24 AM Andres Freund <andres@anarazel.de> wrote:
> On 2019-02-28 13:16:02 -0500, Tom Lane wrote:
> > Shawn Debnath <sdn@amazon.com> writes:
> > > Another thought: my colleague Anton Shyrabokau suggested potentially
> > > re-using forknumber to differentiate smgrs. We are using 32 bits to
> > > map 5 entries today.
> >
> > Yeah, that seems like it might be a workable answer.
>
> Yea, if we just split that into two 16bit entries, there'd not be much
> lost. Some mild mild performance regression due to more granular
> memory->register reads/writes, but I can't even remotely see that
> matter.

Ok, that's a interesting way to include it in BufferTag without making
it wider.  But then how about the SMGR interface?  I think that value
would need to be added to the smgropen() interface, and all existing
callers would pass in (say) SGMR_RELATION (meaning "use md.c"), and
new code would use SMGR_UNDO, SMGR_SLRU etc.  That seems OK to me.
Ancient POSTGRES had an extra argument like that and would say eg
smgropen(rd->rd_rel->relsmgr, rd), but in this new idea I think it'd
always be a constant or a value from a BufferTag, and the BufferTag
would have been set with a constant, since the code reading these
buffers would always be code that knows which it wants.  We'd be using
this new argument not as a modifier to control storage location as
they did back then, but rather as a whole new namespace for
RelFileNode values, that also happens to be stored differently; that
might be a hint that it could also go into RelFileNode (but I'm not
suggesting that).

> > Since, per this discussion, the smgr IDs need not really be OIDs at
> > all, we just need a few bits for them.
>
> Personally I find having them as oids more elegant than the distaste
> from misusing the database oid for a wihle, but I think it's fair to
> disagree here.

It sounds like your buffer mapping redesign would completely change
the economics and we could reconsider much of this later without too
much drama; that was one of the things that made me feel that the
magic database OID approach was acceptable at least in the short term.

-- 
Thomas Munro
https://enterprisedb.com


Re: Drop type "smgr"?

From
Thomas Munro
Date:
On Fri, Mar 1, 2019 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > On Thu, Feb 28, 2019 at 7:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Thomas Munro <thomas.munro@gmail.com> writes:
> >>> Our current thinking is that smgropen() should know how to map a small
> >>> number of special database OIDs to different smgr implementations
>
> >> Hmm.  Maybe mapping based on tablespaces would be a better idea?
>
> > In the undo log proposal (about which more soon) we are using
> > tablespaces for their real purpose, so we need that OID.  If you SET
> > undo_tablespaces = foo then future undo data created by your session
> > will be written there, which might be useful for putting that IO on
> > different storage.
>
> Meh.  That's a point, but it doesn't exactly seem like a killer argument.
> Just in the abstract, it seems much more likely to me that people would
> want per-database special rels than per-tablespace special rels.  And
> I think your notion of a GUC that can control this is probably pie in
> the sky anyway: if we can't afford to look into the catalogs to resolve
> names at this code level, how are we going to handle a GUC?

I have this working like so:

* undo logs have a small amount of meta-data in shared memory, stored
in a file at checkpoint time, with all changes WAL logged, visible to
users in pg_stat_undo_logs view
* one of the properties of an undo log is its tablespace (the point
here being that it's not in a catalog)
* you don't need access to any catalogs to find the backing files for
a RelFileNode (the path via tablespace symlinks is derivable from
spcNode)
* therefore you can find your way from an UndoLogRecPtr in (say) a
zheap page to the relevant blocks on disk without any catalog access;
this should work even in the apparently (but not actually) circular
case of a pg_tablespace catalog that is stored in zheap (not something
we can do right now, but hypothetically speaking), and has undo data
that is stored in some non-default tablespace that must be consulted
while scanning the catalog (not that I'm suggesting that would
necessarily be a good idea to suppose catalogs in non-default
tablespaces; I'm just addressing your theoretical point)
* the GUC is used to resolve tablespace names to OIDs only by sessions
that are writing, when selecting (or creating) an undo log to attach
to and begin writing into; those sessions have no trouble reading the
catalog to do so without problematic circularities, as above

Seems to work; the main complications so far were coming up with
reasonable behaviour and interlocking when you drop tablespaces that
contain undo logs (short version: if they're not needed for snapshots
or rollback, they are dropped, wasting the rest of their undo address
space; otherwise they prevents the tablespace from being dropped with
a clear message to that effect).

It doesn't make any sense to put things like clog or any other SLRU in
a non-default tablespace though.  It's perfectly OK if not all smgr
implementations know how to deal with tablespaces, and the SLRU
support should just not support that.

> The real reason I'm concerned about this, though, is that for either
> a database or a tablespace, you can *not* get away with having a magic
> OID just hanging in space with no actual catalog row matching it.
> If nothing else, you need an entry there to prevent someone from
> reusing the OID for another purpose.  And a pg_database row that
> doesn't correspond to a real database is going to break all kinds of
> code, starting with pg_upgrade and the autovacuum launcher.  Special
> rows in pg_tablespace are much less likely to cause issues, because
> of the precedent of pg_global and pg_default.

GetNewObjectId() never returns values < FirstNormalObjectId.

I don't think it's impossible for someone to want to put SMGRs in a
catalog of some kind some day.  Even though the ones for clog, undo
etc would still probably need special hard-coded treatment as
discussed, I suppose it's remotely possible that someone might some
day figure out a useful way to allow extensions that provide different
block storage (nvram?  zfs zvols?  encryption? (see Haribabu's reply))
but I don't have any specific ideas about that or feel inclined to
design something for unknown future use.

-- 
Thomas Munro
https://enterprisedb.com


Re: Drop type "smgr"?

From
Shawn Debnath
Date:
On Fri, Mar 01, 2019 at 10:33:06AM +1300, Thomas Munro wrote:

> It doesn't make any sense to put things like clog or any other SLRU in
> a non-default tablespace though.  It's perfectly OK if not all smgr
> implementations know how to deal with tablespaces, and the SLRU
> support should just not support that.

If the generic storage manager, or whatever we end up calling it, ends 
up being generic enough - its possible that tablespace value would have 
to be respected.


-- 
Shawn Debnath
Amazon Web Services (AWS)


Re: Drop type "smgr"?

From
Andres Freund
Date:
On 2019-03-01 09:48:33 +1300, Thomas Munro wrote:
> On Fri, Mar 1, 2019 at 7:24 AM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-02-28 13:16:02 -0500, Tom Lane wrote:
> > > Shawn Debnath <sdn@amazon.com> writes:
> > > > Another thought: my colleague Anton Shyrabokau suggested potentially
> > > > re-using forknumber to differentiate smgrs. We are using 32 bits to
> > > > map 5 entries today.
> > >
> > > Yeah, that seems like it might be a workable answer.
> >
> > Yea, if we just split that into two 16bit entries, there'd not be much
> > lost. Some mild mild performance regression due to more granular
> > memory->register reads/writes, but I can't even remotely see that
> > matter.
> 
> Ok, that's a interesting way to include it in BufferTag without making
> it wider.  But then how about the SMGR interface?  I think that value
> would need to be added to the smgropen() interface, and all existing
> callers would pass in (say) SGMR_RELATION (meaning "use md.c"), and
> new code would use SMGR_UNDO, SMGR_SLRU etc.  That seems OK to me.

Right, seems like we should do that independent of whether we end up
reusing the dboid or not.


> > > Since, per this discussion, the smgr IDs need not really be OIDs at
> > > all, we just need a few bits for them.
> >
> > Personally I find having them as oids more elegant than the distaste
> > from misusing the database oid for a wihle, but I think it's fair to
> > disagree here.
> 
> It sounds like your buffer mapping redesign would completely change
> the economics and we could reconsider much of this later without too
> much drama; that was one of the things that made me feel that the
> magic database OID approach was acceptable at least in the short term.

Right.

FWIW, I think while distasteful, I could see us actually using oids,
just ones that are small enough to fit into 16bit...

Greetings,

Andres Freund


Re: Drop type "smgr"?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> FWIW, I think while distasteful, I could see us actually using oids,
> just ones that are small enough to fit into 16bit...

If we suppose that all smgrs must be built-in, that's not even much
of a restriction...

            regards, tom lane


Re: Drop type "smgr"?

From
Shawn Debnath
Date:
On Thu, Feb 28, 2019 at 02:08:49PM -0800, Andres Freund wrote:
> On 2019-03-01 09:48:33 +1300, Thomas Munro wrote:
> > On Fri, Mar 1, 2019 at 7:24 AM Andres Freund <andres@anarazel.de> wrote:
> > > On 2019-02-28 13:16:02 -0500, Tom Lane wrote:
> > > > Shawn Debnath <sdn@amazon.com> writes:
> > > > > Another thought: my colleague Anton Shyrabokau suggested potentially
> > > > > re-using forknumber to differentiate smgrs. We are using 32 bits to
> > > > > map 5 entries today.
> > > >
> > > > Yeah, that seems like it might be a workable answer.
> > >
> > > Yea, if we just split that into two 16bit entries, there'd not be much
> > > lost. Some mild mild performance regression due to more granular
> > > memory->register reads/writes, but I can't even remotely see that
> > > matter.
> > 
> > Ok, that's a interesting way to include it in BufferTag without making
> > it wider.  But then how about the SMGR interface?  I think that value
> > would need to be added to the smgropen() interface, and all existing
> > callers would pass in (say) SGMR_RELATION (meaning "use md.c"), and
> > new code would use SMGR_UNDO, SMGR_SLRU etc.  That seems OK to me.
> 
> Right, seems like we should do that independent of whether we end up
> reusing the dboid or not.

Food for thought: if we are going to muck with the smgr APIs, would it 
make sense to move away from SMgrRelation to something a bit more 
generic like, say,  SMgrHandle to better organize the internal contents 
of this structure? Internally, we could move things into an union and 
based on type of handle: relation, undo, slru/generic, translate the 
contents correctly. As you can guess, SMgrRelationData today is very 
specific to relations and holding md specific information whose memory 
would be better served re-used for the other storage managers.

Thoughts?

-- 
Shawn Debnath
Amazon Web Services (AWS)


Re: Drop type "smgr"?

From
Thomas Munro
Date:
On Fri, Mar 1, 2019 at 10:41 AM Shawn Debnath <sdn@amazon.com> wrote:
> On Fri, Mar 01, 2019 at 10:33:06AM +1300, Thomas Munro wrote:
> > It doesn't make any sense to put things like clog or any other SLRU in
> > a non-default tablespace though.  It's perfectly OK if not all smgr
> > implementations know how to deal with tablespaces, and the SLRU
> > support should just not support that.
>
> If the generic storage manager, or whatever we end up calling it, ends
> up being generic enough - its possible that tablespace value would have
> to be respected.

Right, you and I have discussed this a bit off-list, but for the
benefit of others, I think what you're getting at with "generic
storage manager" here is something like this: on the one hand, our
proposed revival of SMGR as a configuration point is about is
supporting alternative file layouts for bufmgr data, but at the same
time there is some background noise about direct IO, block encryption,
... and who knows what alternative block storage someone might come up
with ... at the block level.  So although it sounds a bit
contradictory to be saying "let's make all these different SMGRs!" at
the same time as saying "but we'll eventually need a single generic
SMGR that is smart enough to be parameterised for all of these
layouts!", I see why you say it.  In fact, the prime motivation for
putting SLRUs into shared buffers is to get better buffering, because
(anecdotally) slru.c's mini-buffer scheme performs abysmally without
the benefit of an OS page cache.  If we add optional direct IO support
(something I really want), we need it to apply to SLRUs, undo and
relations, ideally without duplicating code, so we'd probably want to
chop things up differently.  At some point I think we'll need to
separate the questions "how to map blocks to filenames and offsets"
and "how to actually perform IO".  I think the first question would be
controlled by the SMGR IDs as discussed, but the second question
probably needs to be controlled by GUCs that control all IO, and/or
special per relation settings (supposing you can encrypt just one
table, as a random example I know nothing about); but that seems way
out of scope for the present projects.  IMHO the best path from here
is to leave md.c totally untouched for now as the SMGR for plain old
relations, while we work on getting these new kinds of bufmgr data
into the tree as a first step, and a later hypothetical direct IO or
whatever project can pay for the refactoring to separate IO from
layout.

-- 
Thomas Munro
https://enterprisedb.com


Re: Drop type "smgr"?

From
Thomas Munro
Date:
On Fri, Mar 1, 2019 at 11:31 AM Shawn Debnath <sdn@amazon.com> wrote:
> On Thu, Feb 28, 2019 at 02:08:49PM -0800, Andres Freund wrote:
> > On 2019-03-01 09:48:33 +1300, Thomas Munro wrote:
> > > On Fri, Mar 1, 2019 at 7:24 AM Andres Freund <andres@anarazel.de> wrote:
> > > > On 2019-02-28 13:16:02 -0500, Tom Lane wrote:
> > > > > Shawn Debnath <sdn@amazon.com> writes:
> > > > > > Another thought: my colleague Anton Shyrabokau suggested potentially
> > > > > > re-using forknumber to differentiate smgrs. We are using 32 bits to
> > > > > > map 5 entries today.
> > > > >
> > > > > Yeah, that seems like it might be a workable answer.
> > > >
> > > > Yea, if we just split that into two 16bit entries, there'd not be much
> > > > lost. Some mild mild performance regression due to more granular
> > > > memory->register reads/writes, but I can't even remotely see that
> > > > matter.
> > >
> > > Ok, that's a interesting way to include it in BufferTag without making
> > > it wider.  But then how about the SMGR interface?  I think that value
> > > would need to be added to the smgropen() interface, and all existing
> > > callers would pass in (say) SGMR_RELATION (meaning "use md.c"), and
> > > new code would use SMGR_UNDO, SMGR_SLRU etc.  That seems OK to me.
> >
> > Right, seems like we should do that independent of whether we end up
> > reusing the dboid or not.
>
> Food for thought: if we are going to muck with the smgr APIs, would it
> make sense to move away from SMgrRelation to something a bit more
> generic like, say,  SMgrHandle to better organize the internal contents
> of this structure? Internally, we could move things into an union and
> based on type of handle: relation, undo, slru/generic, translate the
> contents correctly. As you can guess, SMgrRelationData today is very
> specific to relations and holding md specific information whose memory
> would be better served re-used for the other storage managers.
>
> Thoughts?

Right, it does contain some md-specific stuff without even
apologising.  Also smgropen() was rendered non-virtual at some point
(I mean that implementations don't even get a chance to initialise
anything, which works out only because md-specific code has leaked
into smgr.c).  In one of my undo patches (which I'll post an updated
version of on the appropriate thread soon) I added a void * called
private_data so that undo_file.c can keep track of some stuff, but
yeah I agree that more tidying up could be done.

-- 
Thomas Munro
https://enterprisedb.com


Re: Drop type "smgr"?

From
Shawn Debnath
Date:
On Fri, Mar 01, 2019 at 11:38:49AM +1300, Thomas Munro wrote:

> > Food for thought: if we are going to muck with the smgr APIs, would it
> > make sense to move away from SMgrRelation to something a bit more
> > generic like, say,  SMgrHandle to better organize the internal contents
> > of this structure? Internally, we could move things into an union and
> > based on type of handle: relation, undo, slru/generic, translate the
> > contents correctly. As you can guess, SMgrRelationData today is very
> > specific to relations and holding md specific information whose memory
> > would be better served re-used for the other storage managers.
> >
> > Thoughts?
> 
> Right, it does contain some md-specific stuff without even
> apologising.  Also smgropen() was rendered non-virtual at some point
> (I mean that implementations don't even get a chance to initialise
> anything, which works out only because md-specific code has leaked
> into smgr.c).  In one of my undo patches (which I'll post an updated
> version of on the appropriate thread soon) I added a void * called
> private_data so that undo_file.c can keep track of some stuff, but
> yeah I agree that more tidying up could be done.

I can send out a patch for this (on a separate thread!) to unblock us 
both. Unless you are closer to completion on this.

I prefer the union approach to make it more readable. I was considering 
re-factoring the structure to SMgrHandle and having the relation 
specific structure retain SMgrRelationData. For undo we could have
SMgrUndoData, and similarly for SLRU (I will come up with a better name 
than generic).  Then have these be in the union instead of the 
individual members of the struct.

-- 
Shawn Debnath
Amazon Web Services (AWS)


Re: Drop type "smgr"?

From
Konstantin Knizhnik
Date:

On 01.03.2019 1:32, Thomas Munro wrote:
> On Fri, Mar 1, 2019 at 10:41 AM Shawn Debnath <sdn@amazon.com> wrote:
>> On Fri, Mar 01, 2019 at 10:33:06AM +1300, Thomas Munro wrote:
>>> It doesn't make any sense to put things like clog or any other SLRU in
>>> a non-default tablespace though.  It's perfectly OK if not all smgr
>>> implementations know how to deal with tablespaces, and the SLRU
>>> support should just not support that.
>> If the generic storage manager, or whatever we end up calling it, ends
>> up being generic enough - its possible that tablespace value would have
>> to be respected.
> Right, you and I have discussed this a bit off-list, but for the
> benefit of others, I think what you're getting at with "generic
> storage manager" here is something like this: on the one hand, our
> proposed revival of SMGR as a configuration point is about is
> supporting alternative file layouts for bufmgr data, but at the same
> time there is some background noise about direct IO, block encryption,
> ... and who knows what alternative block storage someone might come up
> with ... at the block level.  So although it sounds a bit
> contradictory to be saying "let's make all these different SMGRs!" at
> the same time as saying "but we'll eventually need a single generic
> SMGR that is smart enough to be parameterised for all of these
> layouts!", I see why you say it.  In fact, the prime motivation for
> putting SLRUs into shared buffers is to get better buffering, because
> (anecdotally) slru.c's mini-buffer scheme performs abysmally without
> the benefit of an OS page cache.  If we add optional direct IO support
> (something I really want), we need it to apply to SLRUs, undo and
> relations, ideally without duplicating code, so we'd probably want to
> chop things up differently.  At some point I think we'll need to
> separate the questions "how to map blocks to filenames and offsets"
> and "how to actually perform IO".  I think the first question would be
> controlled by the SMGR IDs as discussed, but the second question
> probably needs to be controlled by GUCs that control all IO, and/or
> special per relation settings (supposing you can encrypt just one
> table, as a random example I know nothing about); but that seems way
> out of scope for the present projects.  IMHO the best path from here
> is to leave md.c totally untouched for now as the SMGR for plain old
> relations, while we work on getting these new kinds of bufmgr data
> into the tree as a first step, and a later hypothetical direct IO or
> whatever project can pay for the refactoring to separate IO from
> layout.
>

I completely agree with this statement:

At some point I think we'll need to separate the questions "how to map blocks to filenames and offsets" and "how to
actuallyperform IO".
 


There are two subsystems developed in PgPro which are integrated in 
Postgres at file IO level: CFS (compressed file system) and SnapFS (fast 
database snapshots).
First one provides page level encryption and compression, second - 
mechanism for fast restoring of database state.
Both are implemented by patching fd.c. My first idea was to implement 
them as alternative storage devices (alternative to md.c). But it will 
require duplication of all segment mapping logic from md.c + file 
descriptors cache from fd.c. It will be nice if it is possible to 
redefine raw file operations (FileWrite, FileRead,...) without affecting 
segment mapping logic.

One more thing... From my point of view one of the drawbacks of Postgres 
is that it requires underlaying file system and is not able to work with 
raw partitions.
It seems to me that bypassing fle system layer can significantly improve 
performance and give more possibilities for IO performance tuning.
Certainly it will require a log of changes in Postgres storage layer so 
this is not what I suggest to implement or even discuss right now.
But it may be useful to keep it in mind in discussions concerning 
"generic storage manager".


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: Drop type "smgr"?

From
Thomas Munro
Date:
On Fri, Mar 1, 2019 at 9:11 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> One more thing... From my point of view one of the drawbacks of Postgres
> is that it requires underlaying file system and is not able to work with
> raw partitions.
> It seems to me that bypassing fle system layer can significantly improve
> performance and give more possibilities for IO performance tuning.
> Certainly it will require a log of changes in Postgres storage layer so
> this is not what I suggest to implement or even discuss right now.
> But it may be useful to keep it in mind in discussions concerning
> "generic storage manager".

Hmm.  Speculation-around-the-water-cooler-mode:  I think the arguments
for using raw partitions are approximately the same as the arguments
for using a big data file that holds many relations.  The three I can
think of are (1) the space is entirely preallocated, which *might*
have performance and safety advantages, (2) if you encrypt it, no one
can see the structure (database OIDs, relation OIDs and sizes) and (3)
it might allow pages to be moved from one relation to another without
copying or interleaved in interesting ways (think SQL Server parallel
btree creation that "stitches together" btrees produced by parallel
workers, or Oracle "clustered" tables where the pages of two tables
are physically interleaved in an order that works nicely when you join
those two tables, or perhaps schemes for moving data between
relations/partitions quickly).  On the other hand, to make that work
you have to own the problem of space allocation/management that we
currently leave to the authors of XFS et al, and those guys have
worked on that for *years and years* and they work really well.  If
you made all that work for big preallocated data files, then sure, you
could also make it work for raw partitions, but I'm not sure how much
performance advantage there is for that final step.  I suspect that a
major reason for Oracle to support raw block devices many years ago
was because back then there was no other way to escape from the page
cache.  Direct IO hasn't always been available or portable, and hasn't
always worked well.  That said, it does seem plausible that we could
do the separation of (1) block -> pathname/offset mappings and (2)
actual IO operations in a way that you could potentially write your
own pseudo-filesystem that stores a whole PostgreSQL cluster inside
big data files or raw partitions.  Luckily we don't need to tackle
such mountainous terrain to avoid the page cache, today.

-- 
Thomas Munro
https://enterprisedb.com


Re: Drop type "smgr"?

From
Thomas Munro
Date:
On Thu, Feb 28, 2019 at 7:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> The type smgr has only one value 'magnetic disk'.  ~15 years ago it
> also had a value 'main memory', and in Berkeley POSTGRES 4.2 there was
> a third value 'sony jukebox'.  Back then, all tables had an associated
> block storage manager, and it was recorded as an attribute relsmgr of
> pg_class (or pg_relation as it was known further back).  This was the
> type of that attribute, removed by Bruce in 3fa2bb31 (1997).
>
> Nothing seems to break if you remove it (except for some tests using
> it in an incidental way).  See attached.

Pushed.

Thanks all for the interesting discussion.  I'm trying out Anton
Shyrabokau's suggestion of stealing bits from the fork number.  More
on that soon.

-- 
Thomas Munro
https://enterprisedb.com