It's past time to redo the smgr API - Mailing list pgsql-hackers

From Tom Lane
Subject It's past time to redo the smgr API
Date
Msg-id 5416.1076007946@sss.pgh.pa.us
Whole thread Raw
Responses Re: It's past time to redo the smgr API  ("Marc G. Fournier" <scrappy@postgresql.org>)
Re: It's past time to redo the smgr API  (Alvaro Herrera <alvherre@dcc.uchile.cl>)
Re: It's past time to redo the smgr API  (Christopher Kings-Lynne <chriskl@familyhealth.com.au>)
List pgsql-hackers
A long time ago Vadim proposed that we should revise smgr.c's API so
that it does not depend on Relations (relcache entries); rather, only
a RelFileNode value should be needed to access a file in smgr and lower
levels.  This would allow us to get rid of the concept of "blind
writes".  As of CVS tip, with most writes being done by the bgwriter,
most writes are blind.  Try tracing the bgwriter: every write involves
a file open and close, which has got to be hurting its performance.
The same has been true for awhile now of the background checkpointer.

I'm motivated to do something about this today because I'm reviewing the
PITR patch that J.R. Nield and Patrick Macdonald were working on, and
one of the things I don't care for is the kluges it adds to the smgr API
to work around lack of a Relation in some operations.

I propose the following revisions:

* smgr.c should internally keep a hashtable (indexed by RelFileNode)
with "struct SMgrRelation" entries for each relation that has recently
been accessed.  File access information for the lower-level modules
(md.c, fd.c) would appear in this hashtable entry.

* smgropen takes a RelFileNode and returns a pointer to a hashtable
entry, which is used as the argument to all other smgr operations.
smgropen itself does not cause any actual I/O; thus it doesn't matter
if the file doesn't exist yet.  smgropen followed by smgrcreate is the
way to create a new relation on disk.  Without smgrcreate, file
existence will be checked at the first read or write attempt.

* smgrclose closes the md.c-level file and drops the hashtable entry.
Hashtable entries remain valid unless explicitly closed (thus, multiple
smgropens for the same file are legal).

* The rd_fd field of relcache nodes will be replaced by an "SMgrRelation *"
pointer, so that callers having a Relation reference can access the smgr
file easily.  (In particular, this means that most calls to the bufmgr
will still pass Relations, and we don't need to propagate the API change
up to the bufmgr level.)

* The existing places in the bufmgr that do RelationNodeCacheGetRelation
followed by either regular or blind write would be replaced by smgropen
followed by smgrwrite.  Since smgropen is just a hash table lookup, it
is no more expensive than the RelationNodeCacheGetRelation it replaces.
Note these places would *not* do smgrclose afterwards.

* Because we don't smgrclose after a write, it is possible to have
"dangling" smgr entries that aren't useful any more, as well as open
file descriptors underneath them.  This isn't too big a deal on Unix
but it will be fatal for the Windows port, since it would prevent a
DROP TABLE if some other backend happens to have touched the table.
What I propose to do about this is: 1. In the bgwriter, at each checkpoint do "smgrcloseall" to close    all open
files.2. In regular backends, receipt of a relcache flush message will    result in smgrclose(), even if there is not a
relcacheentry    for the given relation ID.  A global cache flush event (sinval    buffer overflow) causes
smgrcloseall.
This ensures that open descriptors for a dead relation will not persist
past the next checkpoint.  We had already agreed, I think, that the
best solution for Windows' problem with deleting open files is to retry
pending deletes at checkpoint.  This smgr rewrite will not contribute
to actually making that happen, but it won't make things worse either.

* I'm going to get rid of the "which" argument that once selected a
particular smgr implementation (md.c vs mm.c).  It's vestigial at
present and is misdefined anyway.  If we were to resurrect the concept
of multiple storage managers, we'd need the selection to be determinable
from the RelFileNode value, rather than being a separate argument.
(When we add tablespaces, we could imagine associating smgrs with
tablespaces, but it would have to be the responsibility of smgr.c to
determine which smgr to use based on the tablespace ID.)

* We can remove the "dummy cache" support in relcache.c, as well as
RelationNodeCacheGetRelation() and the hashtable that supports it.

* The smgr hashtable entries could possibly be used for recording other
status; for instance keeping track of which relations need fsync in the
bgwriter.  I'm also thinking of merging smgr.c's existing
list-of-rels-to-be-deleted into this data structure.

* AFAICS the only downside of not having a Relation available in smgr.c
and md.c is that error messages could only refer to the RelFileNode
numbers and not to the relation name.  I'm not sure this is bad, since
in my experience what you want to know about such errors is the actual
disk filename, which RelFileNode tells you and relation name doesn't.
We could preserve the current behavior by passing the relation name to
smgropen when available, and saving the name in struct SMgrRelation.
But I'm inclined not to.

Comments?
        regards, tom lane


pgsql-hackers by date:

Previous
From: "Dave Page"
Date:
Subject: Re: PITR Dead horse?
Next
From: Tom Lane
Date:
Subject: Re: PITR Dead horse?