Thread: including backend ID in relpath of temp rels - updated patch

including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
For previous discussion of this topic, see:

http://archives.postgresql.org/pgsql-hackers/2010-04/msg01181.php
http://archives.postgresql.org/pgsql-hackers/2010-05/msg00352.php
http://archives.postgresql.org/pgsql-hackers/2010-06/msg00302.php

As in the original version of the patch, I have not simply added
backend ID to RelFileNode, because there are too many places using
RelFileNode in contexts where the backend ID can be determined from
context, such as the shared and local buffer managers and the xlog
code.  Instead, I have introduced BackendRelFileNode for contexts
where we need both the RelFileNode and the backend ID.  The smgr layer
has to use BackendRelFileNode across the board, since it operates on
both permanent and temporary relations, including - potentially -
temporary relations of other backends.  smgr invalidations must also
include the backend ID, as must communication between regular backends
and the bgwriter.  The relcache now stores rd_backend instead of
rd_islocaltemp so that it remains straightforward to call smgropen()
based on a relcache entry. Some smgr functions no longer require an
isTemp argument, because they can infer the necessary information from
their BackendRelFileNode.  smgrwrite() and smgrextend() now take a
skipFsync argument rather than an isTemp argument.
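
For illustration, such a struct might look something like this (a
minimal sketch based on the description above, not the patch text
itself):

    typedef struct BackendRelFileNode
    {
        RelFileNode node;       /* tablespace, database, relation OIDs */
        BackendId   backend;    /* owning backend's ID, or InvalidBackendId
                                 * if the relation is not temporary */
    } BackendRelFileNode;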

In this version of the patch, I've improved the temporary file cleanup
mechanism.  In CVS HEAD, when a transaction commits or aborts, we
write an XLOG record with all relations that must be unlinked,
including temporary relations.  With this patch, we no longer include
temporary relations in the XLOG records (which may be a tiny
performance benefit for some people); instead, on every startup of the
database cluster, we just nuke all temporary relation files (which is
now easy to do, since they are named differently than files for
non-temporary relations) at the same time that we nuke other temp
files.  This produces slightly different behavior.  In CVS HEAD,
temporary files get removed whenever an xlog redo happens, so either
at cluster start or after a backend crash, but only to the extent that
they appear in WAL.  With this patch, we can be sure of removing ALL
stray files, which is maybe ever-so-slightly leaky in CVS HEAD.  But
on the other hand, because it hooks into the existing temporary file
cleanup code, it only happens at cluster startup, NOT after a backend
crash.  The existing coding leaves temporary sort files and similar
around after a backend crash for forensics purposes.  That might or
might not be worth rethinking for non-debug builds, but I don't think
there's any very good reason to think that temporary relation files
need to be handled differently than temporary work files.
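
For illustration, the startup-time scan implied here might look
something like this (a rough sketch with assumed names; the patch
itself hooks into the existing temp-file cleanup code):

    #include <stdio.h>
    #include <dirent.h>
    #include <unistd.h>

    /* Remove every temporary relation file in one database directory,
     * relying on temp rel files being the only ones whose names start
     * with 't'.  MAXPGPATH is the usual PostgreSQL path limit. */
    static void
    RemoveTempRelationFilesInDir(const char *dbpath)
    {
        DIR        *dir;
        struct dirent *de;
        char        path[MAXPGPATH];

        if ((dir = opendir(dbpath)) == NULL)
            return;
        while ((de = readdir(dir)) != NULL)
        {
            if (de->d_name[0] != 't')
                continue;       /* not a temp relation file */
            snprintf(path, sizeof(path), "%s/%s", dbpath, de->d_name);
            unlink(path);
        }
        closedir(dir);
    }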

I believe that this patch will clear away one major obstacle to
implementing global temporary and unlogged tables: it enables us to be
sure of cleaning up properly after a crash without relying on catalog
entries or XLOG.  Based on previous discussions, however, I believe
there is support for making this change independently of what becomes
of that project, just for the benefit of having a more robust cleanup
mechanism.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment

Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Jun 10, 2010 at 4:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> For previous discussion of this topic, see:
>
> http://archives.postgresql.org/pgsql-hackers/2010-04/msg01181.php
> http://archives.postgresql.org/pgsql-hackers/2010-05/msg00352.php
> http://archives.postgresql.org/pgsql-hackers/2010-06/msg00302.php
>
> As in the original version of the patch, I have not simply added
> backend ID to RelFileNode, because there are too many places using
> RelFileNode in contexts where the backend ID can be determined from
> context, such as the shared and local buffer managers and the xlog
> code.  Instead, I have introduced BackendRelFileNode for contexts
> where we need both the RelFileNode and the backend ID.  The smgr layer
> has to use BackendRelFileNode across the board, since it operates on
> both permanent and temporary relations, including - potentially -
> temporary relations of other backends.  smgr invalidations must also
> include the backend ID, as must communication between regular backends
> and the bgwriter.  The relcache now stores rd_backend instead of
> rd_islocaltemp so that it remains straightforward to call smgropen()
> based on a relcache entry. Some smgr functions no longer require an
> isTemp argument, because they can infer the necessary information from
> their BackendRelFileNode.  smgrwrite() and smgrextend() now take a
> skipFsync argument rather than an isTemp argument.
>
> In this version of the patch, I've improved the temporary file cleanup
> mechanism.  In CVS HEAD, when a transaction commits or aborts, we
> write an XLOG record with all relations that must be unlinked,
> including temporary relations.  With this patch, we no longer include
> temporary relations in the XLOG records (which may be a tiny
> performance benefit for some people); instead, on every startup of the
> database cluster, we just nuke all temporary relation files (which is
> now easy to do, since they are named differently than files for
> non-temporary relations) at the same time that we nuke other temp
> files.  This produces slightly different behavior.  In CVS HEAD,
> temporary files get removed whenever an xlog redo happens, so either
> at cluster start or after a backend crash, but only to the extent that
> they appear in WAL.  With this patch, we can be sure of removing ALL
> stray files, which is maybe ever-so-slightly leaky in CVS HEAD.  But
> on the other hand, because it hooks into the existing temporary file
> cleanup code, it only happens at cluster startup, NOT after a backend
> crash.  The existing coding leaves temporary sort files and similar
> around after a backend crash for forensics purposes.  That might or
> might not be worth rethinking for non-debug builds, but I don't think
> there's any very good reason to think that temporary relation files
> need to be handled differently than temporary work files.
>
> I believe that this patch will clear away one major obstacle to
> implementing global temporary and unlogged tables: it enables us to be
> sure of cleaning up properly after a crash without relying on catalog
> entries or XLOG.  Based on previous discussions, however, I believe
> there is support for making this change independently of what becomes
> of that project, just for the benefit of having a more robust cleanup
> mechanism.

Updated patch to remove minor bitrot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment

Re: including backend ID in relpath of temp rels - updated patch

From
Jaime Casanova
Date:
On Thu, Jun 10, 2010 at 3:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> I believe that this patch will clear away one major obstacle to
> implementing global temporary and unlogged tables: it enables us to be
> sure of cleaning up properly after a crash without relying on catalog
> entries or XLOG.  Based on previous discussions, however, I believe
> there is support for making this change independently of what becomes
> of that project, just for the benefit of having a more robust cleanup
> mechanism.
>

Hi,

i was looking at v3 of this patch...

it looks good, works as advertised, passes regression tests, and all
tests i could think of...
haven't looked at the code in much detail but it seems ok, to me at least...

--
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL


Re: including backend ID in relpath of temp rels - updated patch

From
Jaime Casanova
Date:
On Fri, Jul 23, 2010 at 10:05 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
> On Thu, Jun 10, 2010 at 3:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>>
>> I believe that this patch will clear away one major obstacle to
>> implementing global temporary and unlogged tables: it enables us to be
>> sure of cleaning up properly after a crash without relying on catalog
>> entries or XLOG.  Based on previous discussions, however, I believe
>> there is support for making this change independently of what becomes
>> of that project, just for the benefit of having a more robust cleanup
>> mechanism.
>>
>
> Hi,
>
> i was looking at v3 of this patch...
>

Ok, i like what you did in smgrextend, smgrwrite, and others...
changing isTemp to skipFsync is more descriptive

but i have a few questions; maybe what you did is right, i only want
to understand it:
- you added this in include/storage/smgr.h, so why is it safe to assume
that if the backend != InvalidBackendId it must be a temp relation?

+#define SmgrIsTemp(smgr) \
+   ((smgr)->smgr_rnode.backend != InvalidBackendId)


- you added a check like this "if (rel->rd_backend == MyBackendId)"
in a few places... why are you assuming that? couldn't that be a newly
created relation (in the current session, of course)? is that safe?

--
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Sun, Jul 25, 2010 at 2:37 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
> but i have a few questions; maybe what you did is right, i only want
> to understand it:
> - you added this in include/storage/smgr.h, so why is it safe to assume
> that if the backend != InvalidBackendId it must be a temp relation?
>
> +#define SmgrIsTemp(smgr) \
> +   ((smgr)->smgr_rnode.backend != InvalidBackendId)

That's pretty much the whole point of the patch.  Instead of
identifying relations as simply "temporary" or "not temporary", we
identify them as "a temporary relation owned by backend X" or as "not
temporary".

> - you added a check like this "if (rel->rd_backend == MyBackendId)"
> in a few places... why are you assuming that? couldn't that be a newly
> created relation (in the current session, of course)? is that safe?

Again, rd_backend is not the creating backend ID unless the relation
is a temprel.
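
For illustration, the invariant might be established where the relcache
fills in the field, along these lines (a sketch; the branch structure is
an assumption, though GetTempNamespaceBackendId is mentioned later in
this thread):

    if (isTempOrToastNamespace(relation->rd_rel->relnamespace))
        relation->rd_backend = MyBackendId;     /* my own temp rel */
    else if (isAnyTempNamespace(relation->rd_rel->relnamespace))
        relation->rd_backend =
            GetTempNamespaceBackendId(relation->rd_rel->relnamespace);
    else
        relation->rd_backend = InvalidBackendId;    /* permanent rel */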

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Jaime Casanova
Date:
On Sun, Jul 25, 2010 at 4:32 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jul 25, 2010 at 2:37 AM, Jaime Casanova <jaime@2ndquadrant.com> wrote:
>> but i have a few questions; maybe what you did is right, i only want
>> to understand it:
>> - you added this in include/storage/smgr.h, so why is it safe to assume
>> that if the backend != InvalidBackendId it must be a temp relation?
>>
>> +#define SmgrIsTemp(smgr) \
>> +   ((smgr)->smgr_rnode.backend != InvalidBackendId)
>
> That's pretty much the whole point of the patch.  Instead of
> identifying relations as simply "temporary" or "not temporary", we
> identify them as "a temporary relation owned by backend X" or as "not
> temporary".
>

Ok, this one seems good enough... i'm marking it as "ready for committer"

--
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> [ BackendRelFileNode patch ]

One thing that I find rather distressing about this is the 25% bloat
in sizeof(SharedInvalidationMessage).  Couldn't we avoid that?  Is it
really necessary to *ever* send an SI message for a backend-local rel?

I agree that one needs to send relcache inval sometimes for temp rels,
but I don't see why each backend couldn't interpret that as a flush
on its own local version.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Aug 5, 2010 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> [ BackendRelFileNode patch ]
>
> One thing that I find rather distressing about this is the 25% bloat
> in sizeof(SharedInvalidationMessage).  Couldn't we avoid that?  Is it
> really necessary to *ever* send an SI message for a backend-local rel?

It can be dropped or unlinked by another backend, so, yes.

> I agree that one needs to send relcache inval sometimes for temp rels,
> but I don't see why each backend couldn't interpret that as a flush
> on its own local version.

The problem is that receipt of the inval message results in a call to
smgrclosenode(), which has to look up the (Backend)RelFileNode in
SMgrRelationHash.  If the backend ID isn't present, we can't search
the hash table efficiently.  This is not only a problem for
smgrclosenode(), either; that hash is used in a bunch of places, so
changing the hash key isn't really an option.  One possibility would
be to have two separate shared-invalidation message types for smgr.
One would indicate that a non-temporary relation was flushed, so the
backend ID field is known to be -1 and we can search the hash
quickly.  The other would indicate that a temporary relation was
flushed and
you'd have to scan the whole hash table to find and flush all matching
entries.  That still doesn't sound great, though.  Another thought I
had was that if we could count on the backend ID to fit into an int16
we could fit it in to what's currently padding space.  That would
require rather dramatically lowering the maximum number of backends
(currently INT_MAX/4), but it's a little hard to imagine that we can
really support more than 32,767 simultaneous backends anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 5, 2010 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> One thing that I find rather distressing about this is the 25% bloat
>> in sizeof(SharedInvalidationMessage).  Couldn't we avoid that?  Is it
>> really necessary to *ever* send an SI message for a backend-local rel?

> It can be dropped or unlinked by another backend, so, yes.

Really?  Surely that should be illegal during normal operation.  We
might be doing such during crash recovery, but we don't need to
broadcast sinval messages then.

It might be sufficient to consider that there are "local" and "global"
smgr inval messages, where the former never get out of the generating
backend, so a bool is enough in the message struct.

> had was that if we could count on the backend ID to fit into an int16
> we could fit it in to what's currently padding space.  That would
> require rather dramatically lowering the maximum number of backends
> (currently INT_MAX/4), but it's a little hard to imagine that we can
> really support more than 32,767 simultaneous backends anyway.

Yeah, that occurred to me too.  A further thought is that the id field
could probably be reduced to 1 byte, leaving 3 for backendid, which
would certainly be plenty.  However representing that in a portable
struct declaration would be a bit painful I fear.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Fri, Aug 6, 2010 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Aug 5, 2010 at 7:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> One thing that I find rather distressing about this is the 25% bloat
>>> in sizeof(SharedInvalidationMessage).  Couldn't we avoid that?  Is it
>>> really necessary to *ever* send an SI message for a backend-local rel?
>
>> It can be dropped or unlinked by another backend, so, yes.
>
> Really?  Surely that should be illegal during normal operation. We
> might be doing such during crash recovery, but we don't need to
> broadcast sinval messages then.

autovacuum.c does it when we start to worry about XID wraparound, but
you can also do it from any normal backend.  Just "DROP TABLE
pg_temp_2.foo" or whatever and away you go.

> It might be sufficient to consider that there are "local" and "global"
> smgr inval messages, where the former never get out of the generating
> backend, so a bool is enough in the message struct.

It would be nice to be able to do it that way, but I don't believe
it's the case, per the above.

>> had was that if we could count on the backend ID to fit into an int16
>> we could fit it in to what's currently padding space.  That would
>> require rather dramatically lowering the maximum number of backends
>> (currently INT_MAX/4), but it's a little hard to imagine that we can
>> really support more than 32,767 simultaneous backends anyway.
>
> Yeah, that occurred to me too.  A further thought is that the id field
> could probably be reduced to 1 byte, leaving 3 for backendid, which
> would certainly be plenty.  However representing that in a portable
> struct declaration would be a bit painful I fear.

Well, presumably we'd just represent it as a 1-byte field followed by
a 2-byte field, and do a bit of math.  But I don't really see the
point.  The whole architecture of a shared invalidation queue is
fundamentally non-scalable because it's a broadcast medium.  If we
wanted to efficiently support even thousands of backends (let alone
tens or hundreds of thousands) I assume we would need to rearchitect
this completely with more fine-grained queues, and have backends
subscribe to the queues pertaining to the objects they want to access
before touching them.  Or maybe something else entirely. But I don't
think broadcasting to 30,000 backends is going to work for the same
reason that plugging 30,000 machines into an Ethernet *hub* doesn't
work.
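
For concreteness, the "1-byte field followed by a 2-byte field"
representation might look like this (struct and macro names here are
illustrative assumptions):

    typedef struct
    {
        int8        id;          /* message type --- must be first */
        int8        backend_hi;  /* high bits of backend ID, if temprel */
        uint16      backend_lo;  /* low bits of backend ID, if temprel */
        RelFileNode rnode;       /* tablespace, database, relation */
    } SharedInvalSmgrMsg;

    /* the "bit of math" to pack and unpack a 3-byte backend ID */
    #define SET_SMGR_BACKEND(msg, b) \
        ((msg).backend_hi = (b) >> 16, (msg).backend_lo = (b) & 0xffff)
    #define GET_SMGR_BACKEND(msg) \
        ((BackendId) (((msg).backend_hi << 16) | (uint32) (msg).backend_lo))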

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 6, 2010 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Really?  Surely that should be illegal during normal operation.  We
>> might be doing such during crash recovery, but we don't need to
>> broadcast sinval messages then.

> autovacuum.c does it when we start to worry about XID wraparound, but
> you can also do it from any normal backend.  Just "DROP TABLE
> pg_temp_2.foo" or whatever and away you go.

Mph.  I'm still not convinced that the sinval message needs to carry
backendid.  But maybe the first-cut solution should just be to squeeze
the id into the padding area.  You should be able to get up to 65535
allowed backends, not 32k --- or perhaps use different message type IDs
for local and global backendid, so that all 65536 bitpatterns are
allowed for a non-global backendid.

> Well, presumably we'd just represent it as a 1-byte field followed by
> a 2-byte field, and do a bit of math.  But I don't really see the
> point.  The whole architecture of a shared invalidation queue is
> fundamentally non-scalable because it's a broadcast medium.

Sure, it tops out somewhere, but 32K is way too close to configurations
we know work well enough in the field (I've seen multiple reports of
people using a couple thousand backends).  In any case, sinval readers
don't block each other in the current implementation, so I'm really
dubious that there's any inherent scalability limitation there.  I'll
hold still for 64K but I think it might be better to go for 2^24.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Jaime Casanova
Date:
On Fri, Aug 6, 2010 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Just "DROP TABLE
> pg_temp_2.foo" or whatever and away you go.
>

wow! that's true... and certainly a bug...
we shouldn't allow any session to drop another session's temp tables, or
is there a reason for this misbehavior?

--
Jaime Casanova         www.2ndQuadrant.com
Soporte y capacitación de PostgreSQL


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Jaime Casanova <jaime@2ndquadrant.com> writes:
> On Fri, Aug 6, 2010 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Just "DROP TABLE pg_temp_2.foo" or whatever and away you go.

> wow! that's true... and certainly a bug...

No, it's not a bug.  You'll find only superusers can do it.

> we shouldn't allow any session to drop another session's temp tables, or
> is there a reason for this misbehavior?

It is intentional, though I'd be willing to give it up if we had more
bulletproof crash-cleanup of temp tables --- which is one of the things
this patch is supposed to result in.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Fri, Aug 6, 2010 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Aug 6, 2010 at 1:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Really?  Surely that should be illegal during normal operation. We
>>> might be doing such during crash recovery, but we don't need to
>>> broadcast sinval messages then.
>
>> autovacuum.c does it when we start to worry about XID wraparound, but
>> you can also do it from any normal backend.  Just "DROP TABLE
>> pg_temp_2.foo" or whatever and away you go.
>
> Mph.  I'm still not convinced that the sinval message needs to carry
> backendid.

Hey, if you have an idea... I'd love to get rid of it, too, but I
don't see how to do it.

> But maybe the first-cut solution should just be to squeeze
> the id into the padding area.  You should be able to get up to 65535
> allowed backends, not 32k --- or perhaps use different message type IDs
> for local and global backendid, so that all 65536 bitpatterns are
> allowed for a non-global backendid.
>
>> Well, presumably we'd just represent it as a 1-byte field followed by
>> a 2-byte field, and do a bit of math.  But I don't really see the
>> point.  The whole architecture of a shared invalidation queue is
>> fundamentally non-scalable because it's a broadcast medium.
>
> Sure, it tops out somewhere, but 32K is way too close to configurations
> we know work well enough in the field (I've seen multiple reports of
> people using a couple thousand backends).  In any case, sinval readers
> don't block each other in the current implementation, so I'm really
> dubious that there's any inherent scalability limitation there.  I'll
> hold still for 64K but I think it might be better to go for 2^24.

Well, I wouldn't expect anyone to use an exclusive lock for readers
without a good reason, but you still have n backends that each have to
read, presumably, about O(n) messages, so eventually that's going to
start to pinch.

Do you think it's worth worrying about the reduction in the number of
possible SI message types?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Fri, Aug 6, 2010 at 2:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jaime Casanova <jaime@2ndquadrant.com> writes:
>> On Fri, Aug 6, 2010 at 12:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Just "DROP TABLE pg_temp_2.foo" or whatever and away you go.
>
>> wow! that's true... and certainly a bug...
>
> No, it's not a bug.  You'll find only superusers can do it.
>
>> we shouldn't allow any session to drop another session's temp tables, or
>> is there a reason for this misbehavior?
>
> It is intentional, though I'd be willing to give it up if we had more
> bulletproof crash-cleanup of temp tables --- which is one of the things
> this patch is supposed to result in.

This patch only directly addresses the issue of cleaning up the
storage, so there are still the catalog entries to worry about.  But
it doesn't seem impossible to think about building on this approach to
eventually handle that part of the problem better, too.  I haven't
thought too much about what that would look like, but I think it could
be done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 6, 2010 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Sure, it tops out somewhere, but 32K is way too close to configurations
>> we know work well enough in the field (I've seen multiple reports of
>> people using a couple thousand backends).

> Well, I wouldn't expect anyone to use an exclusive lock for readers
> without a good reason, but you still have n backends that each have to
> read, presumably, about O(n) messages, so eventually that's going to
> start to pinch.

Sure, but I don't see much to be gained from multiple queues either.
There are few (order of zero, in fact) cases where sinval messages
are transmitted that aren't of potential interest to all backends.
Maybe you could do something useful with a very large number of
dynamically-defined queues (like one per relation) ... but managing that
would probably swamp any savings.

> Do you think it's worth worrying about the reduction in the number of
> possible SI message types?

IIRC the number of message types is the number of catalog caches plus
half a dozen or so.  We're a long way from exhausting even a 1-byte
ID field; and we could play more games if we had to, since there would
be a padding byte free in the message types that refer to a catalog
cache.  IOW, 1-byte id doesn't bother me.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> This patch only directly addresses the issue of cleaning up the
> storage, so there are still the catalog entries to worry about.  But
> it doesn't seem impossible to think about building on this approach to
> eventually handle that part of the problem better, too.  I haven't
> thought too much about what that would look like, but I think it could
> be done.

Perhaps run through pg_class after restart and flush anything marked
relistemp?  Although the ideal solution, probably, would be for temp
tables to not have persistent catalog entries in the first place.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
David Fetter
Date:
On Fri, Aug 06, 2010 at 03:00:35PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > This patch only directly addresses the issue of cleaning up the
> > storage, so there are still the catalog entries to worry about.
> > But it doesn't seem impossible to think about building on this
> > approach to eventually handle that part of the problem better,
> > too.  I haven't thought too much about what that would look like,
> > but I think it could be done.
> 
> Perhaps run through pg_class after restart and flush anything marked
> relistemp?  Although the ideal solution, probably, would be for temp
> tables to not have persistent catalog entries in the first place.

For the upcoming global temp tables, which are visible in other
sessions, how would this work?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
David Fetter <david@fetter.org> writes:
> On Fri, Aug 06, 2010 at 03:00:35PM -0400, Tom Lane wrote:
>> Perhaps run through pg_class after restart and flush anything marked
>> relistemp?  Although the ideal solution, probably, would be for temp
>> tables to not have persistent catalog entries in the first place.

> For the upcoming global temp tables, which are visible in other
> sessions, how would this work?

Well, that's a totally different matter.  Those would certainly have
persistent entries, at least for the "master" version.  I don't think
anyone's really figured out what the best implementation looks like;
but maybe there would be per-backend "child" versions that could act
just like the current kind of temp table (and again would ideally not
have any persistent catalog entries).
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Fri, Aug 6, 2010 at 3:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> This patch only directly addresses the issue of cleaning up the
>> storage, so there are still the catalog entries to worry about.  But
>> it doesn't seem impossible to think about building on this approach to
>> eventually handle that part of the problem better, too.  I haven't
>> thought too much about what that would look like, but I think it could
>> be done.
>
> Perhaps run through pg_class after restart and flush anything marked
> relistemp?

The trouble is that you have to bind to a database before you can run
through pg_class, and the postmaster doesn't.  Of course, if it could
attach to a database and then detach again, this might be feasible,
although perhaps still a bit overly complex for the postmaster, but in
any event it doesn't.  We seem to already have a mechanism in place
where a backend that creates a temprel nukes any pre-existing temprel
schema for its own backend, but of course that's not fool-proof.

> Although the ideal solution, probably, would be for temp
> tables to not have persistent catalog entries in the first place.

I've been thinking about that, but it's a bit challenging to imagine
how it could work.  It's not just the pg_class entries you have to
think about, but also pg_attrdef, pg_attribute, pg_constraint,
pg_description, pg_index, pg_rewrite, and pg_trigger.  An even
stickier problem is that we have lots of places in the backend code
that refer to objects by OID.  We'd either need to change all of that
code (which seems like a non-starter) or somehow guarantee that the
OIDs assigned to any given backend's private objects would be
different from those assigned to any public object (which I also don't
see how to do).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Fri, Aug 6, 2010 at 2:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Aug 6, 2010 at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Sure, it tops out somewhere, but 32K is way too close to configurations
>>> we know work well enough in the field (I've seen multiple reports of
>>> people using a couple thousand backends).
>
>> Well, I wouldn't expect anyone to use an exclusive lock for readers
>> without a good reason, but you still have n backends that each have to
>> read, presumably, about O(n) messages, so eventually that's going to
>> start to pinch.
>
> Sure, but I don't see much to be gained from multiple queues either.
> There are few (order of zero, in fact) cases where sinval messages
> are transmitted that aren't of potential interest to all backends.
> Maybe you could do something useful with a very large number of
> dynamically-defined queues (like one per relation) ... but managing that
> would probably swamp any savings.

Well, what I was thinking is that if you could guarantee that a
certain backend COULDN'T have a particular relfilenode open at the
smgr level, for example, then it needn't read the invalidation
messages for that relfilenode.  Precisely how to slice that up is
another matter.  For the present case, for instance, you could
creating one queue per backend.  In the normal course of events, each
backend would subscribe only to its own queue, but if one backend
wanted to drop a temporary relation belonging to some other backend,
it would temporarily subscribe to that backend's queue, do whatever it
needed to do, and then flush all the smgr references before
unsubscribing from the queue.  That's a bit silly because we doubtless
wouldn't invent such a complicated mechanism just for this case, but I
think it illustrates the kind of thing that one could do.  Or if you
wanted to optimize for the case of a large number of databases running
in a single cluster, perhaps you'd want one queue per database plus a
shared queue for the shared catalogs.  I don't know.  This is just pie
in the sky.

>> Do you think it's worth worrying about the reduction in the number of
>> possible SI message types?
>
> IIRC the number of message types is the number of catalog caches plus
> half a dozen or so.  We're a long way from exhausting even a 1-byte
> ID field; and we could play more games if we had to, since there would
> be a padding byte free in the message types that refer to a catalog
> cache.  IOW, 1-byte id doesn't bother me.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Fri Aug 06 15:32:21 -0400 2010:

> > Perhaps run through pg_class after restart and flush anything marked
> > relistemp?
> 
> The trouble is that you have to bind to a database before you can run
> through pg_class, and the postmaster doesn't.  Of course, if it could
> attach to a database and then detach again, this might be feasible,
> although perhaps still a bit overly complex for the postmaster, but in
> any event it doesn't.

A simpler idea might be to run a process specifically to connect to the
database to scan pg_class there, and then die.  It sounds a tad
expensive though.

> I've been thinking about that, but it's a bit challenging to imagine
> how it could work.  It's not just the pg_class entries you have to
> think about, but also pg_attrdef, pg_attribute, pg_constraint,
> pg_description, pg_index, pg_rewrite, and pg_trigger.  An even
> stickier problem is that we have lots of places in the backend code
> that refer to objects by OID.  We'd either need to change all of that
> code (which seems like a non-starter) or somehow guarantee that the
> OIDs assigned to any given backend's private objects would be
> different from those assigned to any public object (which I also don't
> see how to do).

Maybe we could reserve one of the 32 bits of OID to indicate private-ness.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Fri, Aug 6, 2010 at 3:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> Do you think it's worth worrying about the reduction in the number of
>>> possible SI message types?
>>
>> IIRC the number of message types is the number of catalog caches plus
>> half a dozen or so.  We're a long way from exhausting even a 1-byte
>> ID field; and we could play more games if we had to, since there would
>> be a padding byte free in the message types that refer to a catalog
>> cache.  IOW, 1-byte id doesn't bother me.
>
> OK.

Here's an updated patch, with the invalidation changes merged in and
hopefully-suitable adjustments elsewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment

Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Here's an updated patch, with the invalidation changes merged in and
> hopefully-suitable adjustments elsewhere.

I haven't tested this patch, but I read through it (and have I mentioned
how unbelievably illegible -u format patches are?).  It seems to be
close to committable, but I found a few issues.  In no particular order:

storage.sgml needs to be updated to describe this file-naming scheme.

BackendRelFileNode isn't a particularly nice struct name; it seems like
it puts the most important aspect of the struct's purpose somewhere in
the middle of the name.  Maybe RelFileNodeBackend would be better, or
RelFileNodeFull, or something like that.

In GetNewRelFileNode, you've changed some code that is commented
/* This logic should match RelationInitPhysicalAddr */, but there
is not any corresponding change in RelationInitPhysicalAddr.  I find
this bothersome.  I think the problem is that you've made it look
like the assignment of rnode.backend is part of the logic that ought
to match RelationInitPhysicalAddr.  Perhaps set that off, at least
by a blank line, if not its own comment.

The relpath() and relpathperm() macros are a tad underdocumented for my
taste; at least put comments noting that one takes a RelFileNode and the
other a BackendRelFileNode.  And I wonder if you couldn't find better
names for relpathperm and relpathbackend.
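
For illustration, the three might relate along these lines (definitions
here are assumptions consistent with the description above):

    /* path for a permanent relation, identified by RelFileNode alone */
    #define relpathperm(rnode, forknum) \
        relpathbackend((rnode), InvalidBackendId, (forknum))

    /* path for any relation, identified by a BackendRelFileNode */
    #define relpath(brnode, forknum) \
        relpathbackend((brnode).node, (brnode).backend, (forknum))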

assign_maxconnections and assign_autovacuum_max_workers need to be fixed
for MAX_BACKENDS; in fact I think all the occurrences of INT_MAX / 4 in
guc.c need to be replaced.  (And if I were you I'd grep to see if
anyplace outside guc.c knows that value...)  Also, as a matter of style,
the comment currently attached to max_connections needs to be attached
to the definition of the constant, not just modified where it is.
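
For reference, the kind of definition in question (the constant name is
from the patch as reviewed above; the value shown is an assumption tied
to the three-byte backend ID limit discussed earlier):

    /* maximum number of backends; must fit in the 3 bytes of
     * backend ID carried by smgr sinval messages */
    #define MAX_BACKENDS    0x7fffff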

As an old bit-shaver this sorta bothers me:

+++ b/src/include/utils/rel.h
@@ -127,7 +127,7 @@ typedef struct RelationData
     struct SMgrRelationData *rd_smgr;   /* cached file handle, or NULL */
     int         rd_refcnt;      /* reference count */
     bool        rd_istemp;      /* rel is a temporary relation */
-    bool        rd_islocaltemp; /* rel is a temp rel of this session */
+    BackendId   rd_backend;     /* owning backend id, if temporary relation */
     bool        rd_isnailed;    /* rel is nailed in cache */
     bool        rd_isvalid;     /* relcache entry is valid */
     char        rd_indexvalid;  /* state of rd_indexlist: 0 = not valid, 1 =
 

The struct is going to be less efficiently packed with that field order.
Maybe swap rd_istemp and rd_backend?

+        Assert(relation->rd_backend != InvalidOid);
ought to be InvalidBackendId, no?  Both new calls of GetTempNamespaceBackendId
have this wrong.  You might also want to change the comment of
GetTempNamespaceBackendId to note that its failure result is not just -1
but InvalidBackendId.

Lastly, it bothers me that you've put in code to delete files belonging
to temp rels during crash restart, without any code to clean up their
catalog entries.  This will therefore lead to dangling pg_class
references, with uncertain but probably not very nice consequences.
I think that probably shouldn't go in until there's code to drop the
catalog entries too.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
David Fetter
Date:
On Thu, Aug 12, 2010 at 12:27:45PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Here's an updated patch, with the invalidation changes merged in and
> > hopefully-suitable adjustments elsewhere.
> 
> I haven't tested this patch, but I read through it (and have I mentioned
> how unbelievably illegible -u format patches are?).

I have every confidence that you, of all people, can arrange to use
'filterdiff --format=context' on attached patches automatically,
saving you some time and us some boredom :)

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: including backend ID in relpath of temp rels - updated patch

From
"Joshua D. Drake"
Date:
On Thu, 2010-08-12 at 09:44 -0700, David Fetter wrote:
> On Thu, Aug 12, 2010 at 12:27:45PM -0400, Tom Lane wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> > > Here's an updated patch, with the invalidation changes merged in and
> > > hopefully-suitable adjustments elsewhere.
> >
> > I haven't tested this patch, but I read through it (and have I mentioned
> > how unbelievably illegible -u format patches are?).
>
> I have every confidence that you, of all people, can arrange to use
> 'filterdiff --format=context' on attached patches automatically,
> saving you some time and us some boredom :)

I was under the impression that the project guideline was that we only
accepted context diffs?

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt

Re: including backend ID in relpath of temp rels - updated patch

From
David Fetter
Date:
On Thu, Aug 12, 2010 at 09:53:54AM -0700, Joshua D. Drake wrote:
> On Thu, 2010-08-12 at 09:44 -0700, David Fetter wrote:
> > On Thu, Aug 12, 2010 at 12:27:45PM -0400, Tom Lane wrote:
> > > Robert Haas <robertmhaas@gmail.com> writes:
> > > > Here's an updated patch, with the invalidation changes merged
> > > > in and hopefully-suitable adjustments elsewhere.
> > > 
> > > I haven't tested this patch, but I read through it (and have I
> > > mentioned how unbelievably illegible -u format patches are?).
> > 
> > I have every confidence that you, of all people, can arrange to
> > use 'filterdiff --format=context' on attached patches
> > automatically, saving you some time and us some boredom :)
> 
> I was under the impression that the project guideline was that we
> only accepted context diffs?

Since they're trivially producible from unified diffs, this is a
pretty silly reason to bounce--or even comment on--patches.  It's less
a guideline than a personal preference, namely Tom's.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: including backend ID in relpath of temp rels - updated patch

From
Alvaro Herrera
Date:
Excerpts from David Fetter's message of Thu Aug 12 12:59:33 -0400 2010:
> On Thu, Aug 12, 2010 at 09:53:54AM -0700, Joshua D. Drake wrote:

> > I was under the impression that the project guideline was that we
> > only accepted context diffs?
> 
> Since they're trivially producible from unified diffs, this is a
> pretty silly reason to bounce--or even comment on--patches.  It's less
> a guideline than a personal preference, namely Tom's.

Not that I review that many patches, but I do dislike unified diffs too.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Aug 12, 2010 at 12:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Here's an updated patch, with the invalidation changes merged in and
>> hopefully-suitable adjustments elsewhere.
>
> I haven't tested this patch, but I read through it (and have I mentioned
> how unbelievably illegible -u format patches are?).

Sorry, I'll run it through filterdiff for you next time.  As you know,
I have the opposite preference, but since I was producing this
primarily for you to read....

I can clean up the rest of this stuff, but not this:

> Lastly, it bothers me that you've put in code to delete files belonging
> to temp rels during crash restart, without any code to clean up their
> catalog entries.  This will therefore lead to dangling pg_class
> references, with uncertain but probably not very nice consequences.
> I think that probably shouldn't go in until there's code to drop the
> catalog entries too.

I thought about this pretty carefully, and I don't believe that there
are any unpleasant consequences.  The code that assigns relfilenode
numbers is pretty careful to check that the newly assigned value is
unused BOTH in pg_class and in the directory where the file will be
created, so there should be no danger of a number getting used over
again while the catalog entries remain.  Also, the drop-object code
doesn't mind that the physical storage doesn't exist; it's perfectly
happy with that situation.  It is true that you will get an ERROR if
you try to SELECT from a table whose physical storage has been
removed, but that seems OK.
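
For illustration, that assignment loop might look roughly like this (a
sketch with assumed helper names, not the exact code):

    /* Loop until we pick a relfilenode that is unused both in
     * pg_class (checked via its OID index) and on disk. */
    bool        collides;
    char       *rpath;

    do
    {
        CHECK_FOR_INTERRUPTS();
        rnode.node.relNode = GetNewOidWithIndex(pg_class_rel,
                                                ClassOidIndexId,
                                                ObjectIdAttributeNumber);
        rpath = relpath(rnode, MAIN_FORKNUM);
        collides = (access(rpath, F_OK) == 0);  /* file already there? */
        pfree(rpath);
    } while (collides);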

We have two existing mechanisms for removing the catalog entries: when
a backend is first asked to access a temporary file, it does a DROP
SCHEMA ... CASCADE on any pre-existing temp schema.  And if a table is in
wraparound trouble and the owning backend is no longer running,
autovacuum will drop it.  Improving on this seems difficult: if you
wanted to *guarantee* that the catalog entries were removed before we
started letting in connections, you'd need to fork a backend per
database and have each one iterate through all the temp schemas and
drop them.  Considering that the existing code seems to have been
pretty careful about how this stuff gets handled, I don't think it's
worth making the whole startup sequence slower for it.  What might be
worth considering is changing the autovacuum policy to eliminate the
wraparound check, and just have it drop temp table catalog entries for
any backend not currently running, period.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 12, 2010 at 12:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Lastly, it bothers me that you've put in code to delete files belonging
>> to temp rels during crash restart, without any code to clean up their
>> catalog entries.  This will therefore lead to dangling pg_class
>> references, with uncertain but probably not very nice consequences.

> I thought about this pretty carefully, and I don't believe that there
> are any unpleasant consequences.  The code that assigns relfilenode
> numbers is pretty careful to check that the newly assigned value is
> unused BOTH in pg_class and in the directory where the file will be
> created, so there should be no danger of a number getting used over
> again while the catalog entries remain.  Also, the drop-object code
> doesn't mind that the physical storage doesn't exist; it's perfectly
> happy with that situation.

Well, okay, but I'd suggest adding comments to the drop-table code
pointing out that it is now NECESSARY for it to not complain if the file
isn't there.  This was never a design goal before, AFAIR --- the fact
that it works like that is kind of accidental.  I am also pretty sure
that there used to be at least warning messages for that case, which we
would now not want.
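
Concretely, the tolerant behavior would look something like this in the
unlink path (a sketch; the message wording is an assumption):

    /* After a crash restart the file may already be gone, so a
     * missing file must not be treated as an error here. */
    if (unlink(path) < 0 && errno != ENOENT)
        ereport(WARNING,
                (errcode_for_file_access(),
                 errmsg("could not remove file \"%s\": %m", path)));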
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Aug 12, 2010 at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Aug 12, 2010 at 12:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Lastly, it bothers me that you've put in code to delete files belonging
>>> to temp rels during crash restart, without any code to clean up their
>>> catalog entries.  This will therefore lead to dangling pg_class
>>> references, with uncertain but probably not very nice consequences.
>
>> I thought about this pretty carefully, and I don't believe that there
>> are any unpleasant consequences.  The code that assigns relfilenode
>> numbers is pretty careful to check that the newly assigned value is
>> unused BOTH in pg_class and in the directory where the file will be
>> created, so there should be no danger of a number getting used over
>> again while the catalog entries remain.  Also, the drop-object code
>> doesn't mind that the physical storage doesn't exist; it's perfectly
>> happy with that situation.
>
> Well, okay, but I'd suggest adding comments to the drop-table code
> pointing out that it is now NECESSARY for it to not complain if the file
> isn't there.  This was never a design goal before, AFAIR --- the fact
> that it works like that is kind of accidental.  I am also pretty sure
> that there used to be at least warning messages for that case, which we
> would now not want.

That seems like a good idea.  I'll post an updated patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Alvaro Herrera
Date:
Excerpts from Robert Haas's message of Thu Aug 12 13:29:57 -0400 2010:

> We have two existing mechanisms for removing the catalog entries: when
> a backend is first asked to access a temporary file, it does a DROP
> SCHEMA ... CASCADE on any pre-existing temp schema.  And if a table is in
> wraparound trouble and the owning backend is no longer running,
> autovacuum will drop it.  Improving on this seems difficult: if you
> wanted to *guarantee* that the catalog entries were removed before we
> started letting in connections, you'd need to fork a backend per
> database and have each one iterate through all the temp schemas and
> drop them.  Considering that the existing code seems to have been
> pretty careful about how this stuff gets handled, I don't think it's
> worth making the whole startup sequence slower for it.  What might be
> worth considering is changing the autovacuum policy to eliminate the
> wraparound check, and just have it drop temp table catalog entries for
> any backend not currently running, period.

What about having autovacuum silently drop the catalog entry if it's a
temp entry for which the underlying file does not exist?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Aug 12, 2010 at 5:29 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
Excerpts from Robert Haas's message of Thu Aug 12 13:29:57 -0400 2010:
>
>> We have two existing mechanisms for removing the catalog entries: when
>> a backend is first asked to access a temporary file, it does a DROP
>> SCHEMA ... CASCADE on any pre-existing temp schema.  And if a table is in
>> wraparound trouble and the owning backend is no longer running,
>> autovacuum will drop it.  Improving on this seems difficult: if you
>> wanted to *guarantee* that the catalog entries were removed before we
>> started letting in connections, you'd need to fork a backend per
>> database and have each one iterate through all the temp schemas and
>> drop them.  Considering that the existing code seems to have been
>> pretty careful about how this stuff gets handled, I don't think it's
>> worth making the whole startup sequence slower for it.  What might be
>> worth considering is changing the autovacuum policy to eliminate the
>> wraparound check, and just have it drop temp table catalog entries for
>> any backend not currently running, period.
>
> What about having autovacuum silently drop the catalog entry if it's a
> temp entry for which the underlying file does not exist?

I think that would be subject to race conditions.  The current
mechanism is actually pretty good, and I think we can build on it if
we want to do more, rather than inventing something new.  We just need
to be specific about what problem we're trying to solve.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 12, 2010 at 5:29 PM, Alvaro Herrera
> <alvherre@commandprompt.com> wrote:
>> What about having autovacuum silently drop the catalog entry if it's a
>> temp entry for which the underlying file does not exist?

> I think that would be subject to race conditions.

Well, autovacuum's willingness to drop sufficiently old temp tables
would already risk any such race conditions.  However ...

> The current
> mechanism is actually pretty good, and I think we can build on it if
> we want to do more, rather than inventing something new.  We just need
> to be specific about what problem we're trying to solve.

... I agree with this point.  This idea wouldn't fix the concern I had
about the existence of pg_class entries with no underlying file: if that
causes any issues we'd have to fix them anyway.  So I'm not sure what
the gain is.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
One other thought about all this: in the past, one objection that's been
raised to deleting files during crash restart is the possible loss of
forensic evidence.  It might be a good idea to provide some fairly
simple way for developers to disable that deletion subroutine.  I'm not
sure that it rises to the level of needing a GUC, though.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Aug 12, 2010 at 6:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Aug 12, 2010 at 5:29 PM, Alvaro Herrera
>> <alvherre@commandprompt.com> wrote:
>>> What about having autovacuum silenty drop the catalog entry if it's a
>>> temp entry for which the underlying file does not exist?
>
>> I think that would be subject to race conditions.
>
> Well, autovacuum's willingness to drop sufficiently old temp tables
> would already risk any such race conditions.  However ...

I don't think we do.  It only drops them if the backend isn't
running.  And even if the backend starts running after we check and
before we
drop it, that's OK.  We're only dropping the OLD table, which by
definition isn't relevant to the new session.  Perhaps we should be
taking a lock on the relation first, but I think that can only really
bite us if the relation were dropped and a new relation were created
with the same OID - in that case, we might drop an unrelated table,
though it's vanishingly unlikely in practice.

>> The current
>> mechanism is actually pretty good, and I think we can build on it if
>> we want to do more, rather than inventing something new.  We just need
>> to be specific about what problem we're trying to solve.
>
> ... I agree with this point.  This idea wouldn't fix the concern I had
> about the existence of pg_class entries with no underlying file: if that
> causes any issues we'd have to fix them anyway.  So I'm not sure what
> the gain is.

Right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Aug 12, 2010 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> One other thought about all this: in the past, one objection that's been
> raised to deleting files during crash restart is the possible loss of
> forensic evidence.  It might be a good idea to provide some fairly
> simple way for developers to disable that deletion subroutine.  I'm not
> sure that it rises to the level of needing a GUC, though.

See http://archives.postgresql.org/pgsql-hackers/2010-06/msg00622.php :

In this version of the patch, I've improved the temporary file cleanup
mechanism.  In CVS HEAD, when a transaction commits or aborts, we
write an XLOG record with all relations that must be unlinked,
including temporary relations.  With this patch, we no longer include
temporary relations in the XLOG records (which may be a tiny
performance benefit for some people); instead, on every startup of the
database cluster, we just nuke all temporary relation files (which is
now easy to do, since they are named differently than files for
non-temporary relations) at the same time that we nuke other temp
files.  This produces slightly different behavior.  In CVS HEAD,
temporary files get removed whenever an xlog redo happens, so either
at cluster start or after a backend crash, but only to the extent that
they appear in WAL.  With this patch, we can be sure of removing ALL
stray files, which is maybe ever-so-slightly leaky in CVS HEAD.  But
on the other hand, because it hooks into the existing temporary file
cleanup code, it only happens at cluster startup, NOT after a backend
crash.  The existing coding leaves temporary sort files and similar
around after a backend crash for forensics purposes.  That might or
might not be worth rethinking for non-debug builds, but I don't think
there's any very good reason to think that temporary relation files
need to be handled differently than temporary work files.
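
As a standalone illustration of that startup cleanup -- this is not the
patch's code, which hooks into the existing temp-file cleanup machinery,
just the idea in miniature using only the C standard library and POSIX:

    #include <dirent.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Scan one database directory and unlink every file whose name matches
     * the temp-rel scheme "t<backendID>_<relfilenode>", e.g. "t3_16384".
     * Permanent rels are named with the bare relfilenode, so they never
     * start with 't' and are left alone. */
    static void
    remove_temp_rel_files(const char *dbdir)
    {
        DIR           *dir = opendir(dbdir);
        struct dirent *de;
        int            backend_id;
        unsigned int   relfilenode;

        if (dir == NULL)
            return;
        while ((de = readdir(dir)) != NULL)
        {
            if (sscanf(de->d_name, "t%d_%u", &backend_id, &relfilenode) == 2)
            {
                char path[4096];

                snprintf(path, sizeof(path), "%s/%s", dbdir, de->d_name);
                unlink(path);
            }
        }
        closedir(dir);
    }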

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Aug 12, 2010 at 7:04 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> One other thought about all this: in the past, one objection that's been
>> raised to deleting files during crash restart is the possible loss of
>> forensic evidence.

> ...  With this patch, we can be sure of removing ALL
> stray files, which is maybe ever-so-slightly leaky in CVS HEAD.  But
> on the other hand, because it hooks into the existing temporary file
> cleanup code, it only happens at cluster startup, NOT after a backend
> crash.

Doh.  Thanks.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Aug 12, 2010 at 2:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> That seems like a good idea.  I'll post an updated patch.

Here is an updated patch.  It's in context-diff format this time, but
that made it go over 100K, so I gzip'd it because I can't remember
what the attachment size limit is these days.  I'm not sure whether
that works out to a win or not.

I think this addresses all previous review comments with the exception
that I've been unable to devise a more clever name for relpathperm()
and relpathbackend().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Attachment

Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Here is an updated patch.  It's in context-diff format this time,

Thanks, I appreciate that ;-)

This looks committable to me, with a couple of tiny suggestions:

In the text added to storage.sgml, s/temporary relation/temporary relations/.
Also, it'd be better if BBB and FFF were marked up as <replaceable>
rather than <literal>, see examples elsewhere in that file.

The comment for local_buffer_write_error_callback() probably meant to
say "during local buffer writes".
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Fri, Aug 13, 2010 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Here is an updated patch.  It's in context-diff format this time,
>
> Thanks, I appreciate that ;-)
>
> This looks committable to me, with a couple of tiny suggestions:

Woo hoo!

> In the text added to storage.sgml, s/temporary relation/temporary relations/.
> Also, it'd be better if BBB and FFF were marked up as <replaceable>
> rather than <literal>, see examples elsewhere in that file.

I see.  How should I mark tBBB_FFF?

I actually didn't like that way of explaining it very much, but I
couldn't think of anything clearer.  Saying "the name will consist of
a lowercase t, followed by the backend ID, followed by an underscore,
followed by the filenode" did not seem better.
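
The rule is easier to see in code than in prose.  A toy analogue of the
relpathperm()/relpathbackend() split -- not the patch's actual
implementation, and ignoring tablespace and fork details:

    #include <stdio.h>

    /* Permanent rel: just the filenode.  Temp rel: t<backendID>_<filenode>. */
    static void
    make_relfile_name(char *buf, size_t len, int backend_id, unsigned filenode)
    {
        if (backend_id < 0)         /* by convention, "no backend" = permanent */
            snprintf(buf, len, "%u", filenode);
        else
            snprintf(buf, len, "t%d_%u", backend_id, filenode);
    }

    int
    main(void)
    {
        char name[64];

        make_relfile_name(name, sizeof(name), -1, 16384);
        printf("%s\n", name);       /* prints "16384" */
        make_relfile_name(name, sizeof(name), 3, 16384);
        printf("%s\n", name);       /* prints "t3_16384" */
        return 0;
    }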

> The comment for local_buffer_write_error_callback() probably meant to
> say "during local buffer writes".

No doubt.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Aug 13, 2010 at 12:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also, it'd be better if BBB and FFF were marked up as <replaceable>
>> rather than <literal>, see examples elsewhere in that file.

> I see.  How should I mark tBBB_FFF?

I'd do
<literal>t<replaceable>BBB</replaceable>_<replaceable>FFF</replaceable></literal>
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Thu, Aug 12, 2010 at 1:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Lastly, it bothers me that you've put in code to delete files belonging
>> to temp rels during crash restart, without any code to clean up their
>> catalog entries.  This will therefore lead to dangling pg_class
>> references, with uncertain but probably not very nice consequences.
>> I think that probably shouldn't go in until there's code to drop the
>> catalog entries too.
>
> I thought about this pretty carefully, and I don't believe that there
> are any unpleasant consequences.  The code that assigns relfilenode
> numbers is pretty careful to check that the newly assigned value is
> unused BOTH in pg_class and in the directory where the file will be
> created, so there should be no danger of a number getting used over
> again while the catalog entries remain.  Also, the drop-object code
> doesn't mind that the physical storage doesn't exist; it's perfectly
> happy with that situation.  It is true that you will get an ERROR if
> you try to SELECT from a table whose physical storage has been
> removed, but that seems OK.

After further reflection, I think that the above reasoning is wrong.
GetNewRelFileNode() guarantees that the OID doesn't collide with the
OID column of pg_class, not the relfilenode column of pg_class; and,
per the comments to that function, it only guarantees this when
creating a new relation (to which we're therefore assigning an OID)
and not when rewriting an existing relation.  So it provides no
protection against the scenario where backend #1 drops a table that
lacks physical storage just as backend #2 assigns that OID to some
other relation and begins creating files, which backend #1 then blows
away.

However, I think we're safe for a different reason.  The only time we
unlink the physical storage of a relation without nuking its catalog
entries is during the startup sequence, when we wipe out the physical
storage for any leftover temp tables.  So if there's a race condition,
it can only apply to temp tables.  But temp tables for a particular
backend ID can only be created by that backend, and before a backend
will create any temp tables it will drop any previously existing temp
schema.  Thus, by the time a temp table can get created, there CAN'T
be any leftover catalog entries from previous sessions for it to
potentially collide with.

I think the reasoning above probably should be added to the comment at
the beginning of GetNewRelFileNode(), and I'll go do that unless
someone thinks it's incorrect.  The first sentence of that comment is
also now technically incorrect: what it now does is generate a
relfilenode such that <tablespace, relfilenode, backendid> is unique.
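
Schematically, the loop under discussion looks like this -- the stubs
stand in for the real catalog and filesystem checks, and this is a
sketch of the argument, not GetNewRelFileNode() itself:

    #include <stdbool.h>

    typedef unsigned int Oid;

    static Oid  next_oid = 16384;
    static Oid  get_new_object_id(void)          { return next_oid++; } /* stub */
    static bool oid_in_pg_class(Oid oid)         { return false; }      /* stub */
    static bool relfile_exists_on_disk(Oid oid)  { return false; }      /* stub */

    /*
     * Keep generating OIDs until one is free both in pg_class and on disk,
     * making <tablespace, relfilenode, backendid> unique.  Note that the
     * pg_class check guards the OID column, not the relfilenode column --
     * which is exactly the gap described above for relation rewrites.
     */
    static Oid
    get_new_relfilenode(void)
    {
        Oid candidate;

        do
            candidate = get_new_object_id();
        while (oid_in_pg_class(candidate) || relfile_exists_on_disk(candidate));

        return candidate;
    }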

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> After further reflection, I think that the above reasoning is wrong.
> GetNewRelFileNode() guarantees that the OID doesn't collide with the
> OID column of pg_class, not the relfilenode column of pg_class; and,
> per the comments to that function, it only guarantees this when
> creating a new relation (to which we're therefore assigning an OID)
> and not when rewriting an existing relation.  So it provides no
> protection against the scenario where backend #1 drops a table that
> lacks physical storage just as backend #2 assigns that OID to some
> other relation and begins creating files, which backend #1 then blows
> away.

> However, I think we're safe for a different reason [ omitted ]

The above scenario is only a risk if you suppose that dropping a
relation that lacks physical storage will nonetheless result in
attempted unlink()s.  I think that that's probably not the case;
but it seems related to a question that was bothering me recently.
Namely: why do we assign relfilenode values to relations that have
no physical storage?  If we did not do so, then relation drop would
see a zero in relfilenode and would certainly not attempt an incorrect
unlink().

So I'd like to look into setting relfilenode to zero for relation
relkinds that lack storage.  I'm not exactly sure why the code doesn't
do that already, though.
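
A sketch of what that would look like at drop time, assuming relfilenode
is zero for storage-less relations -- the relkind letters are the ones
in use as of this writing:

    #include <stdbool.h>

    static bool
    relkind_has_storage(char relkind)
    {
        switch (relkind)
        {
            case 'r':               /* ordinary table */
            case 'i':               /* index */
            case 'S':               /* sequence */
            case 't':               /* TOAST table */
                return true;
            default:                /* 'v' view, 'c' composite type, ... */
                return false;
        }
    }

    /*
     * At drop time, relfilenode == 0 (or equivalently a storage-less
     * relkind) means there is nothing on disk, so never attempt unlink().
     */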

This came to mind after reading a commit from Bruce in which he had to
hack up pg_upgrade to preserve relfilenode numbers for storage-less
relations.  Presumably that hack could get undone.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> After further reflection, I think that the above reasoning is wrong.
>> GetNewRelFileNode() guarantees that the OID doesn't collide with the
>> OID column of pg_class, not the relfilenode column of pg_class; and,
>> per the comments to that function, it only guarantees this when
>> creating a new relation (to which we're therefore assigning an OID)
>> and not when rewriting an existing relation.  So it provides no
>> protection against the scenario where backend #1 drops a table that
>> lacks physical storage just as backend #2 assigns that OID to some
>> other relation and begins creating files, which backend #1 then blows
>> away.
>
>> However, I think we're safe for a different reason [ omitted ]
>
> The above scenario is only a risk if you suppose that dropping a
> relation that lacks physical storage will nonetheless result in
> attempted unlink()s.  I think that that's probably not the case;

Why?  How would we know that it didn't have physical storage prior to
attempting the unlinks?

> but it seems related to a question that was bothering me recently.
> Namely: why do we assign relfilenode values to relations that have
> no physical storage?  If we did not do so, then relation drop would
> see a zero in relfilenode and would certainly not attempt an incorrect
> unlink().

I agree that setting relfilenode to 0 for relkinds that lack physical
storage is a good idea, but that's obviously only going to handle that
specific case.

> This came to mind after reading a commit from Bruce in which he had to
> hack up pg_upgrade to preserve relfilenode numbers for storage-less
> relations.  Presumably that hack could get undone.

Seems like a good thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


Re: including backend ID in relpath of temp rels - updated patch

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Sep 15, 2010 at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The above scenario is only a risk if you suppose that dropping a
>> relation that lacks physical storage will nonetheless result in
>>> attempted unlink()s.  I think that that's probably not the case;

> Why?  How would we know that it didn't have physical storage prior to
> attempting the unlinks?

From the relkind.
        regards, tom lane


Re: including backend ID in relpath of temp rels - updated patch

From
Robert Haas
Date:
On Wed, Sep 15, 2010 at 12:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Sep 15, 2010 at 12:14 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The above scenario is only a risk if you suppose that dropping a
>>> relation that lacks physical storage will nonetheless result in
>>> attempted unlink()s.  I think that that's probably not the case;
>
>> Why?  How would we know that it didn't have physical storage prior to
>> attempting the unlinks?
>
> From the relkind.

Oh, sure, I agree with you in that specific case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company