Thread: including PID or backend ID in relpath of temp rels

including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
Time for a new thread specific to this subject.  For previous
discussion, see here:

http://archives.postgresql.org/pgsql-hackers/2010-04/msg01140.php
http://archives.postgresql.org/pgsql-hackers/2010-04/msg01152.php

I attempted to implement this by adding an isTemp argument to relpath,
but ran into problems.  It turns out that when we create a temporary
relation and then exit the backend, the relation is merely truncated
it, and it's the background writer which actually removes the file
following the next checkpoint.  Therefore, relpath() for the temprel
must return the same answer in the background writer as it does in the
original backend, so passing isTemp isn't enough - we actually need to
pass whatever identifier we're including in the file name.  As far as
I can see, though I'm not 100% sure of this, it looks like we never
actually ask the background writer to fsync any of these files because
we never fsync them at all; but we do ask it to remove them, which is
enough to create a problem.  So, what to do about this?  Ideas:

1. We could move the responsibility for removing the files associated
with temp rels from the background writer to the owning backend.  I
think the reason why we initially truncate the files and only later
remove them is because somebody else might have 'em open, so it
mightn't be necessary for temp rels.

2. Instead of embedding a PID or backend ID in the filename, we could
just embed a boolean: isTemp or not?  This seems like cutting
ourselves off from quite a bit of useful information but maybe it
would be OK.  We could nuke all the temp stuff on cluster startup, but
we'd have to rely on catalog entries to identify orphaned files that
accumulated during normal running, which isn't ideal since one of our
long-term goals is to eliminate the need for those catalog entries.

3. We could change RelFileNode.relNode from an OID to an unsigned
32-bit integer drive off of a separate counter, and reserve some
portion of the 4 billion available values for temp relations.  I doubt
we'd have enough bits to embed something like a PID though, so this
would end up being basically an embedded boolean, along the lines of
#2.

4. We could add an additional 32-bit value to RelFileNode to identify
the backend (or a sentinel value when not temp) and create a separate
structure XLogRelFileNode or PermRelFileNode or somesuch for use in
contexts where no temp rels are allowed.

Either #3 or #4 has some possible advantages for Hot Standby in terms
of perhaps making it feasible to assign relfilenodes on a standby
server without danger of conflicting with one already assigned on the
master.

5. ???

Thoughts?

...Robert


Re: including PID or backend ID in relpath of temp rels

From
Jaime Casanova
Date:
On Sun, Apr 25, 2010 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> 1. We could move the responsibility for removing the files associated
> with temp rels from the background writer to the owning backend.  I
> think the reason why we initially truncate the files and only later
> remove them is because somebody else might have 'em open, so it
> mightn't be necessary for temp rels.
>

what happens if the backend crash and obviously doesn't remove the
file associated with temp rels?

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157


Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Sun, Apr 25, 2010 at 10:19 PM, Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
> On Sun, Apr 25, 2010 at 8:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> 1. We could move the responsibility for removing the files associated
>> with temp rels from the background writer to the owning backend.  I
>> think the reason why we initially truncate the files and only later
>> remove them is because somebody else might have 'em open, so it
>> mightn't be necessary for temp rels.
>>
>
> what happens if the backend crash and obviously doesn't remove the
> file associated with temp rels?

Currently, they just get orphaned.  As I understand it, if the catalog
entry survives the crash, autovacuum will remove them 2 BILLION
transactions later (and emit warning messages in the meantime);
otherwise we won't even know they're there.

As I further understand it, the main point of this change is that if
temporary tables have a distinctive name of some kind, then when we
can run through the directory and blow away files with those names
without fearing that it's *permanent* table data that somehow got
orphaned.

...Robert


Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Sun, Apr 25, 2010 at 9:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 4. We could add an additional 32-bit value to RelFileNode to identify
> the backend (or a sentinel value when not temp) and create a separate
> structure XLogRelFileNode or PermRelFileNode or somesuch for use in
> contexts where no temp rels are allowed.

I experimented with this approach and created LocalRelFileNode and
GlobalRelFileNode and, for use in the buffer headers,
BufferRelFileNode (same as GlobalRelFileNode, but named differently
for clarity).  LocallRelFileNode = GlobalRelFileNode + the ID of the
owning backend for temp rels; or InvalidBackendId if referencing a
non-temporary rel.  These might not be the greatest names, but I think
the concept is good, because it really breaks the things that need to
be adjusted quite thoroughly.  In the course of repairing the damage I
came across a couple of things I wasn't sure about:

[relcache.c] RelationInitPhysicalAddr can't initialize
relation->rd_node.backend properly for a non-local temporary relation,
because that information isn't available.  But I'm not clear on why we
would need to create a relcache entry for a non-local temporary
relation.  If we do need to, then we'll probably need to store the
backend ID in pg_class.  That seems like something that would be best
avoided, all things being equal, especially since I can't see how to
generalize it to global temporary tables.

[smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
relations?  I think the only backend that can have an smgr reference
to a temprel other than the owning backend is bgwriter, and AFAICS
bgwriter will only have such a reference if it's responding to a
request by the owning backend to unlink the associated files, in which
case (I think) the owning backend will have no reference.

[dbsize.c] As with relcache.c, there's a problem if we're asked for
the size of a temporary relation that is not our own: we can't call
relpath() without knowing the ID of the owning backend, and there's no
way to acquire that information for pg_class.  I guess we could just
refuse to answer the question in that case, but that doesn't seem real
cool.  Or we could physically scan the directory for files that match
a suitably constructed wildcard, I suppose.

[storage.c,xact.c,twophase.c] smgrGetPendingDeletes returns via an out
parameter (its second argument) a list of RelFileNodes pending delete,
which we then write to WAL or to the two-phase state file.  Of course,
if the backend ID (or pid, but I picked backend ID somewhat
arbitrarily) is part of the filename, then we need to write that to
WAL, too.  It seems somewhat unfortunate to have to WAL-log temprels
here; as best I can tell, this is the only case where it's necessary.But if we implement a more general mechanism for
cleaningup temp
 
files, then might the need to do this go away?  Not sure.

[syncscan.c] It seems we pursue this optimization even for temprels; I
can't think of why that would be useful in practice.  If it's useless
overhead, should we skip it?  This is really independent of this
project; just a side thought.

...Robert


Re: including PID or backend ID in relpath of temp rels

From
Alvaro Herrera
Date:
Robert Haas escribió:

> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
> relations?  I think the only backend that can have an smgr reference
> to a temprel other than the owning backend is bgwriter, and AFAICS
> bgwriter will only have such a reference if it's responding to a
> request by the owning backend to unlink the associated files, in which
> case (I think) the owning backend will have no reference.

Hmm, wasn't there a proposal to have the owning backend delete the files
instead of asking the bgwriter to?

> [dbsize.c] As with relcache.c, there's a problem if we're asked for
> the size of a temporary relation that is not our own: we can't call
> relpath() without knowing the ID of the owning backend, and there's no
> way to acquire that information for pg_class.  I guess we could just
> refuse to answer the question in that case, but that doesn't seem real
> cool.  Or we could physically scan the directory for files that match
> a suitably constructed wildcard, I suppose.

I don't very much like the wildcard idea; but I don't think it's
unreasonable to refuse to provide a file size.  If the owning backend
has still got part of the table in local buffers, you'll get a
misleading answer, so perhaps it's best to not give an answer at all.

Maybe this problem could be solved if we could somehow force that
backend to write down its local buffers, in which case it'd be nice to
have a solution to the dbsize problem.

> [syncscan.c] It seems we pursue this optimization even for temprels; I
> can't think of why that would be useful in practice.  If it's useless
> overhead, should we skip it?  This is really independent of this
> project; just a side thought.

Maybe recently used buffers are more likely to be in the OS page cache,
so perhaps it's not good to disable it.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Tue, May 4, 2010 at 2:06 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Robert Haas escribió:

Hey, thanks for writing back!  I just spent the last few hours
thinking about this and beating my head against the wall.

>> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
>> relations?  I think the only backend that can have an smgr reference
>> to a temprel other than the owning backend is bgwriter, and AFAICS
>> bgwriter will only have such a reference if it's responding to a
>> request by the owning backend to unlink the associated files, in which
>> case (I think) the owning backend will have no reference.
>
> Hmm, wasn't there a proposal to have the owning backend delete the files
> instead of asking the bgwriter to?

I did propose that upthread; it may have been proposed previously
also. This might be worth doing independently of the rest of the patch
(which I'm starting to fear is doomed, cue ominous soundtrack) since
it would reduce the chance of orphaning data files and possibly
simplify the logic also.

>> [dbsize.c] As with relcache.c, there's a problem if we're asked for
>> the size of a temporary relation that is not our own: we can't call
>> relpath() without knowing the ID of the owning backend, and there's no
>> way to acquire that information for pg_class.  I guess we could just
>> refuse to answer the question in that case, but that doesn't seem real
>> cool.  Or we could physically scan the directory for files that match
>> a suitably constructed wildcard, I suppose.
>
> I don't very much like the wildcard idea; but I don't think it's
> unreasonable to refuse to provide a file size.  If the owning backend
> has still got part of the table in local buffers, you'll get a
> misleading answer, so perhaps it's best to not give an answer at all.
>
> Maybe this problem could be solved if we could somehow force that
> backend to write down its local buffers, in which case it'd be nice to
> have a solution to the dbsize problem.

I'm sure we could add some kind of signaling mechanism that would tell
all backends to flush their local buffers, but I'm not too sure it
would help this case very much, because you likely wouldn't want to
wait for all the backends to complete that process before reporting
results.

>> [syncscan.c] It seems we pursue this optimization even for temprels; I
>> can't think of why that would be useful in practice.  If it's useless
>> overhead, should we skip it?  This is really independent of this
>> project; just a side thought.
>
> Maybe recently used buffers are more likely to be in the OS page cache,
> so perhaps it's not good to disable it.

I don't get it.  If the whole relation fits in the page cache, it
doesn't much matter where you start a seqscan.  If it doesn't,
starting where the last one ended is anti-optimal.

...Robert


Re: including PID or backend ID in relpath of temp rels

From
Alvaro Herrera
Date:
Robert Haas escribió:
> On Tue, May 4, 2010 at 2:06 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
> > Robert Haas escribió:
> 
> Hey, thanks for writing back!  I just spent the last few hours
> thinking about this and beating my head against the wall.

:-)

> >> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
> >> relations?  I think the only backend that can have an smgr reference
> >> to a temprel other than the owning backend is bgwriter, and AFAICS
> >> bgwriter will only have such a reference if it's responding to a
> >> request by the owning backend to unlink the associated files, in which
> >> case (I think) the owning backend will have no reference.
> >
> > Hmm, wasn't there a proposal to have the owning backend delete the files
> > instead of asking the bgwriter to?
> 
> I did propose that upthread; it may have been proposed previously
> also. This might be worth doing independently of the rest of the patch
> (which I'm starting to fear is doomed, cue ominous soundtrack) since
> it would reduce the chance of orphaning data files and possibly
> simplify the logic also.

+1 for doing it separately, but hopefully that doesn't mean the rest of
this patch is doomed ...

> >> [dbsize.c] As with relcache.c, there's a problem if we're asked for
> >> the size of a temporary relation that is not our own: we can't call
> >> relpath() without knowing the ID of the owning backend, and there's no
> >> way to acquire that information for pg_class.  I guess we could just
> >> refuse to answer the question in that case, but that doesn't seem real
> >> cool.  Or we could physically scan the directory for files that match
> >> a suitably constructed wildcard, I suppose.
> >
> > I don't very much like the wildcard idea; but I don't think it's
> > unreasonable to refuse to provide a file size.  If the owning backend
> > has still got part of the table in local buffers, you'll get a
> > misleading answer, so perhaps it's best to not give an answer at all.
> >
> > Maybe this problem could be solved if we could somehow force that
> > backend to write down its local buffers, in which case it'd be nice to
> > have a solution to the dbsize problem.
> 
> I'm sure we could add some kind of signaling mechanism that would tell
> all backends to flush their local buffers, but I'm not too sure it
> would help this case very much, because you likely wouldn't want to
> wait for all the backends to complete that process before reporting
> results.

Hmm, I was thinking in the pg_relation_size function -- given this new
mechanism you could get an accurate size of temp tables for other
backends.  I wasn't thinking in the pg_database_size function, and
perhaps it's better to *not* include temp tables in that report at all.

> >> [syncscan.c] It seems we pursue this optimization even for temprels; I
> >> can't think of why that would be useful in practice.  If it's useless
> >> overhead, should we skip it?  This is really independent of this
> >> project; just a side thought.
> >
> > Maybe recently used buffers are more likely to be in the OS page cache,
> > so perhaps it's not good to disable it.
> 
> I don't get it.  If the whole relation fits in the page cache, it
> doesn't much matter where you start a seqscan.  If it doesn't,
> starting where the last one ended is anti-optimal.

Err, I was thinking that a syncscan started a bunch of pages earlier
than the point where the previous scan ended, but yeah, that's a bit
silly.  Maybe we should just ignore syncscan in temp tables altogether,
as you propose.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>> > Hmm, wasn't there a proposal to have the owning backend delete the files
>> > instead of asking the bgwriter to?
>>
>> I did propose that upthread; it may have been proposed previously
>> also. This might be worth doing independently of the rest of the patch
>> (which I'm starting to fear is doomed, cue ominous soundtrack) since
>> it would reduce the chance of orphaning data files and possibly
>> simplify the logic also.
>
> +1 for doing it separately, but hopefully that doesn't mean the rest of
> this patch is doomed ...

I wonder if it would be possible to reject access to temporary
relations at a higher level.  Right now, if you create a temporary
relation in one session, you can issue a SELECT statement against it
in another relation, and get back 0 rows.  If you then insert data
into it and select against it again, you'll get an error saying that
you can't access temporary tables of other sessions.  If you try to
truncate somebody else's temporary relation, it fails; but if you try
to drop it, it works.  In fact, you can even run ALTER TABLE ... ADD
COLUMN on somebody else's temp table, as long as you don't do anything
that requires a rewrite.  CLUSTER fails; VACUUM and VACUUM FULL both
appear to work but apparently actually don't do anything under the
hood, so that database-wide vacuums don't barf.  The whole thing seems
pretty leaky.  It would be nice if we could find a small set of
control points where we basically reject ALL access to somebody else's
temp relations, period.

One possible thing we might do (bearing in mind that we might need to
wall off access at multiple levels) would be to forbid creating a
relcache entry for a non-local temprel.  That would, in turn, forbid
doing pretty much anything to such a relation, although I'm not sure
what else would get broken in the process.  But it would eliminate,
for example, all the checks for RELATION_IS_OTHER_TEMP, since that
Just Couldn't Happen.  It would would eliminate the need to install
specific handling for this case in dbsize.c - we'd just automatically
croak.  And it's also probably necessary to do this anyhow if we want
to ever eliminate those CacheInvalidSmgr() calls for temp rels,
because if I can drop your temprel, that implies I can smgropen() it.

...Robert


Re: including PID or backend ID in relpath of temp rels

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> One possible thing we might do (bearing in mind that we might need to
> wall off access at multiple levels) would be to forbid creating a
> relcache entry for a non-local temprel.  That would, in turn, forbid
> doing pretty much anything to such a relation, although I'm not sure
> what else would get broken in the process.

Dropping temprels left behind by a crashed backend would get broken by
that; which is a deal-breaker, because we have to be able to clean those
up.
        regards, tom lane


Re: including PID or backend ID in relpath of temp rels

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I don't very much like the wildcard idea; but I don't think it's
> unreasonable to refuse to provide a file size.  If the owning backend
> has still got part of the table in local buffers, you'll get a
> misleading answer, so perhaps it's best to not give an answer at all.

FWIW, that's not the case, anymore than it is for blocks in shared
buffer cache for regular rels.  smgrextend() results in an observable
extension of the file EOF immediately, whether or not you can see
up-to-date data for those pages.

Now people have often complained about the extra I/O involved in that,
and it'd be nice to have a solution, but it's not clear to me that
fixing it would be harder for temprels than regular rels.
        regards, tom lane


Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Tue, May 4, 2010 at 5:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> One possible thing we might do (bearing in mind that we might need to
>> wall off access at multiple levels) would be to forbid creating a
>> relcache entry for a non-local temprel.  That would, in turn, forbid
>> doing pretty much anything to such a relation, although I'm not sure
>> what else would get broken in the process.
>
> Dropping temprels left behind by a crashed backend would get broken by
> that; which is a deal-breaker, because we have to be able to clean those
> up.

Phooey.  It was such a good idea in my head.

...Robert


Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Tue, Apr 27, 2010 at 9:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> [storage.c,xact.c,twophase.c] smgrGetPendingDeletes returns via an out
> parameter (its second argument) a list of RelFileNodes pending delete,
> which we then write to WAL or to the two-phase state file.

It appears that we are playing a little bit fast and loose with this.
I think that the two-phase code path is solid because we prohibit
PREPARE TRANSACTION if the transaction has referenced any temporary
tables, so when we read the two-phase state file it's safe to assume
that all the tables mentioned are non-temporary.  But the ordinary
one-phase commit writes permanent and temporary relfilenodes to WAL
without distinction, and then, in xl_redo_commit() and
xl_redo_abort(), does this:
               XLogDropRelation(xlrec->xnodes[i], fork);               smgrdounlink(srel, fork, false, true);

The third argument to smgrdounlink() is "isTemp", which we're here
passing as false, but might really be true.  I don't think it
technically matters at present because the only effect of that
parameter right now is that we pass it through to
DropRelFileNodeBuffers(), which will drop shared buffers rather than
local buffers as a result of the incorrect setting.  But that won't
matter because the WAL replay process shouldn't have any local buffers
anyway, since temp relations are not otherwise WAL-logged.  For the
same reason, I don't think the call to XLogDropRelation() is an issue
because its only purpose is to remove entries from invalid_page_tab,
and there won't be any temporary pages in there anyway.

Of course if we're going to do $SUBJECT this will need to be changed
anyway, but assuming the above analysis is correct I think the
existing coding at least deserves a comment...  then again, maybe I'm
all mixed up?

...Robert


Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
>> >> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
>> >> relations?  I think the only backend that can have an smgr reference
>> >> to a temprel other than the owning backend is bgwriter, and AFAICS
>> >> bgwriter will only have such a reference if it's responding to a
>> >> request by the owning backend to unlink the associated files, in which
>> >> case (I think) the owning backend will have no reference.

This turns out to be wrong, I think.  It seems that what we do is
prevent backends other than the opening backend from reading pages
from non-local temp rels into private or shared buffers, but we don't
actually prevent them from having smgr references.  This allows
autovacuum to drop them, for example, in an anti-wraparound situation.
 (Thanks to Tom for helping me get my head around this better.)

>> > Hmm, wasn't there a proposal to have the owning backend delete the files
>> > instead of asking the bgwriter to?
>>
>> I did propose that upthread; it may have been proposed previously
>> also. This might be worth doing independently of the rest of the patch
>> (which I'm starting to fear is doomed, cue ominous soundtrack) since
>> it would reduce the chance of orphaning data files and possibly
>> simplify the logic also.
>
> +1 for doing it separately, but hopefully that doesn't mean the rest of
> this patch is doomed ...

Doom has been averted.  Proposed patch attached.  It passes regression
tests and seems to work, but could use additional testing and, of
course, some code-reading also.

Some notes on this patch:

It seems prett clear that it isn't desirable to simply add backend ID
to RelFileNode, because there are too many places using RelFileNode
already for purposes where the backend ID can be inferred from
context, such as buffer headers and most of xlog.  Instead, I
introduced BackendRelFileNode, which consists of an ordinary
RelFileNode augmented with a backend ID, and use that only where
needed.  In particular, the smgr layer must use BackendRelFileNode
throughout, since it operates on both permanent and temporary
relations. smgr invalidations must also include the backend ID.  xlog
generally happens only for non-temporary relations and can thus
continue to use an ordinary RelFileNode; however, commit/abort records
must use BackendRelFileNode as there may be physical storage
associated with temporary relations that must be unlinked.
Communication with the bgwriter must use BackendRelFileNode for
similar reasons. The relcache now stores rd_backend rather than
rd_islocaltemp so that it remains straightforward to call smgropen()
based on a relcache entry. Some smgr functions no longer require an
isTemp argument, because they can infer the necessary information from
their BackendRelFileNode.  smgrwrite() and smgrextend() now take a
skipFsync argument rather than an isTemp argument.

I'm not totally sure whether it makes sense to do what we were talking
about above, viz, transfer unlink responsibility for temp rels from
the bgwriter to the owning backend.  I haven't done that here.  Nor
have I implemented any kind of improved temporary file cleanup
strategy, though I hope such a thing is possible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment

Re: including PID or backend ID in relpath of temp rels

From
Jim Nasby
Date:
On May 6, 2010, at 10:24 PM, Robert Haas wrote:
> On Tue, May 4, 2010 at 3:03 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>>>>> [smgr.c,inval.c] Do we need to call CacheInvalidSmgr for temporary
>>>>> relations?  I think the only backend that can have an smgr reference
>>>>> to a temprel other than the owning backend is bgwriter, and AFAICS
>>>>> bgwriter will only have such a reference if it's responding to a
>>>>> request by the owning backend to unlink the associated files, in which
>>>>> case (I think) the owning backend will have no reference.
>
> This turns out to be wrong, I think.  It seems that what we do is
> prevent backends other than the opening backend from reading pages
> from non-local temp rels into private or shared buffers, but we don't
> actually prevent them from having smgr references.  This allows
> autovacuum to drop them, for example, in an anti-wraparound situation.
> (Thanks to Tom for helping me get my head around this better.)
>
>>>> Hmm, wasn't there a proposal to have the owning backend delete the files
>>>> instead of asking the bgwriter to?
>>>
>>> I did propose that upthread; it may have been proposed previously
>>> also. This might be worth doing independently of the rest of the patch
>>> (which I'm starting to fear is doomed, cue ominous soundtrack) since
>>> it would reduce the chance of orphaning data files and possibly
>>> simplify the logic also.
>>
>> +1 for doing it separately, but hopefully that doesn't mean the rest of
>> this patch is doomed ...
>
> Doom has been averted.  Proposed patch attached.  It passes regression
> tests and seems to work, but could use additional testing and, of
> course, some code-reading also.
>
> Some notes on this patch:
>
> It seems prett clear that it isn't desirable to simply add backend ID
> to RelFileNode, because there are too many places using RelFileNode
> already for purposes where the backend ID can be inferred from
> context, such as buffer headers and most of xlog.  Instead, I
> introduced BackendRelFileNode, which consists of an ordinary
> RelFileNode augmented with a backend ID, and use that only where
> needed.  In particular, the smgr layer must use BackendRelFileNode
> throughout, since it operates on both permanent and temporary
> relations. smgr invalidations must also include the backend ID.  xlog
> generally happens only for non-temporary relations and can thus
> continue to use an ordinary RelFileNode; however, commit/abort records
> must use BackendRelFileNode as there may be physical storage
> associated with temporary relations that must be unlinked.
> Communication with the bgwriter must use BackendRelFileNode for
> similar reasons. The relcache now stores rd_backend rather than
> rd_islocaltemp so that it remains straightforward to call smgropen()
> based on a relcache entry. Some smgr functions no longer require an
> isTemp argument, because they can infer the necessary information from
> their BackendRelFileNode.  smgrwrite() and smgrextend() now take a
> skipFsync argument rather than an isTemp argument.
>
> I'm not totally sure whether it makes sense to do what we were talking
> about above, viz, transfer unlink responsibility for temp rels from
> the bgwriter to the owning backend.  I haven't done that here.  Nor
> have I implemented any kind of improved temporary file cleanup
> strategy, though I hope such a thing is possible.

Any particular reason not to use directories to help organize things? IE:

base/database_oid/pg_temp_rels/backend_pid/relfilenode

Perhaps relfilenode should be something else.

This seems to have several advantages:

1: It's more organized. If you want to see all the files for a single backend you have just one place to look. Finding
everythingis still easy via filesystem find. 
2: Cleanup becomes easier. When a backend exits, it's entire directory goes away. On server start, everything under
pg_temp_relsgoes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run
it'sown cleanup, though I think that anytime that happens we're going to restart everything anyway. 
3: It separates all the temporary stuff away from real files.

The only downside I see is some extra code to create the backend_pid directory.
--
Jim C. Nasby, Database Architect                   jim@nasby.net
512.569.9461 (cell)                         http://jim.nasby.net




Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel@decibel.org> wrote:
>> It seems prett clear that it isn't desirable to simply add backend ID
>> to RelFileNode, because there are too many places using RelFileNode
>> already for purposes where the backend ID can be inferred from
>> context, such as buffer headers and most of xlog.  Instead, I
>> introduced BackendRelFileNode, which consists of an ordinary
>> RelFileNode augmented with a backend ID, and use that only where
>> needed.  In particular, the smgr layer must use BackendRelFileNode
>> throughout, since it operates on both permanent and temporary
>> relations. smgr invalidations must also include the backend ID.  xlog
>> generally happens only for non-temporary relations and can thus
>> continue to use an ordinary RelFileNode; however, commit/abort records
>> must use BackendRelFileNode as there may be physical storage
>> associated with temporary relations that must be unlinked.
>> Communication with the bgwriter must use BackendRelFileNode for
>> similar reasons. The relcache now stores rd_backend rather than
>> rd_islocaltemp so that it remains straightforward to call smgropen()
>> based on a relcache entry. Some smgr functions no longer require an
>> isTemp argument, because they can infer the necessary information from
>> their BackendRelFileNode.  smgrwrite() and smgrextend() now take a
>> skipFsync argument rather than an isTemp argument.
>>
>> I'm not totally sure whether it makes sense to do what we were talking
>> about above, viz, transfer unlink responsibility for temp rels from
>> the bgwriter to the owning backend.  I haven't done that here.  Nor
>> have I implemented any kind of improved temporary file cleanup
>> strategy, though I hope such a thing is possible.
>
> Any particular reason not to use directories to help organize things? IE:
>
> base/database_oid/pg_temp_rels/backend_pid/relfilenode
>
> Perhaps relfilenode should be something else.
>
> This seems to have several advantages:
>
> 1: It's more organized. If you want to see all the files for a single backend you have just one place to look.
Findingeverything is still easy via filesystem find. 
> 2: Cleanup becomes easier. When a backend exits, it's entire directory goes away. On server start, everything under
pg_temp_relsgoes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run
it'sown cleanup, though I think that anytime that happens we're going to restart everything anyway. 
> 3: It separates all the temporary stuff away from real files.
>
> The only downside I see is some extra code to create the backend_pid directory.

I like the idea of using directories to organize things better and I
completely agree with points #1 and #3.  Point #2 is a little more
complicated, I think, and something I've been struggling with.  We
need to make sure that we clean up not only the temporary files but
also the catalog entries that point to them, if any.  The current code
only blows away temporary tables that are "old" in terms of
transaction IDs, is driven off the catalog entries, and essentially
does DROP TABLE <whatever>.  So it can't clean up orphaned temporary
files that don't have any catalog entries associated with them (which
is what we want to fix) but on the other hand whatever it does clean
up is cleaned up completely.

We might be able to do something like this:

1. Scan pg_temp_rels.  For each subdirectory found (whose name looks
like a backend ID), if the corresponding backend is not currently
running, add the backend ID to a list of backend IDs needing cleanup.
2. For each backend ID derived in step 1:
2A. Scan the subdirectory and add all the files you find (whose names
are in the right format) to a list of files to be deleted.
2B. Check again whether the backend in question is running.  If it is,
don't do anything further for this backend and go on to the next
backend ID (i.e. continue with step 2).
2C. For each file found in step 2A, look for a pg_class entry in the
temp tablespace for that backend ID with a matching relfilenode
number.  If one is found, drop the rel.  If not, unlink the file if it
still exists.
2D. Attempt to remove the directory, ignoring failures.

I think step 2B is sufficient to prevent a race condition where we end
up mistaking a newly created file for an orphaned one.  Am I right?

One possible problem with this is that it would need to be repeated
for every database/tablespace combination, but maybe that wouldn't be
too expensive.  autovacuum already needs to process every database,
but I don't know how you'd decide how often to check for stray temp
files.  Certainly you'd want to check after a restart... after that I
get fuzzy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including PID or backend ID in relpath of temp rels

From
Greg Stark
Date:
On Tue, May 4, 2010 at 5:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> FWIW, that's not the case, anymore than it is for blocks in shared
> buffer cache for regular rels.  smgrextend() results in an observable
> extension of the file EOF immediately, whether or not you can see
> up-to-date data for those pages.
>
> Now people have often complained about the extra I/O involved in that,
> and it'd be nice to have a solution, but it's not clear to me that
> fixing it would be harder for temprels than regular rels.

For systems that have it and filesystems that optimize it I think
posix_fallocate() handles this case. We can extend files without
actually doing any i/o but we get the guarantee from the filesystem
that it has the space available and writing to those blocks won't fail
due to lack of space. Only meta-data i/o is triggered allocating the
blocks and marking them as virtually filled with nulls and it's not
synced unless there's an fsync so there's no extra physical i/o.

This should be the case for ext4 but I'm not sure what other
filesystems implement this.


--
greg


Re: including PID or backend ID in relpath of temp rels

From
Robert Haas
Date:
On Mon, May 17, 2010 at 2:10 PM, Jim Nasby <decibel@decibel.org> wrote:
> Any particular reason not to use directories to help organize things? IE:
>
> base/database_oid/pg_temp_rels/backend_pid/relfilenode
>
> Perhaps relfilenode should be something else.
>
> This seems to have several advantages:
>
> 1: It's more organized. If you want to see all the files for a single backend you have just one place to look.
Findingeverything is still easy via filesystem find. 
> 2: Cleanup becomes easier. When a backend exits, it's entire directory goes away. On server start, everything under
pg_temp_relsgoes away. Unfortunately we still have a race condition with cleaning up if a backend dies and can't run
it'sown cleanup, though I think that anytime that happens we're going to restart everything anyway. 
> 3: It separates all the temporary stuff away from real files.
>
> The only downside I see is some extra code to create the backend_pid directory.

I thought this was a good idea when you first proposed it, but on
further review I've changed my mind.  There are several places in the
code that rely on checking whether the database directory within any
given tablespace is empty to determine whether that database is using
that tablespace.  While I could rewrite all of that logic to do the
right thing, I think it's unnecessary pain.

I talked with Tom Lane about this a little bit at PGcon and opined
that we probably only need to clean out stray temporary files at
startup.  So what I'm tempted to do is just write a function that goes
through all tablespace/database combinations and scans each directory
for files with a name like t<digits>_<digits> and blows them away.
This will leave the catalog entries pointing at nothing, but we
already have working code in autovacuum.c to clean up the catalog
entries, and I believe that will work just fine even if the underlying
files have been removed earlier.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company