Thread: update on global temporary and unlogged tables

update on global temporary and unlogged tables

From

Robert Haas

Date:

06 September 2010, 23:55:49

I haven't had a chance to do a great deal of work on this project, but
I'm hoping to get back to it at some point and, in the meantime,
thought that it might be useful to circulate a few thoughts I've had
so far.

1. As common architecture for both features, I think that it might
make sense to replace the existing relistemp (bool) field with a
relpersistence (char) field.  The current tests against rel->rd_istemp
can be replaced with macros testing the exact property we really care
about in that situation.  I'm thinking of RelationNeedsWAL(),
RelationUsesLocalBuffers(), and RelationUsesTempNamespace().

2. With respect to global temporary tables, I've hit a fairly serious
roadblock in the form of relfrozenxid.  For a permanent table, VACUUM
by any backend can advance relfrozenxid; for a backend-local temporary
table, VACUUM by the owning backend can advance relfrozenxid.  But for
a global temporary table, the proper value for relfrozenxid is the
earliest value for any backend that has stored tuples into the table.
I'm not immediately sure what to do about this.  Tom's previous
suggestion of cloning the catalog entries for each backend that tries
to access the table is one possible alternative, but I don't like that
much for reasons previously discussed.  Incidentally, per some
previous discussion, I took a look at what Oracle does with respect to
DDL on global temp tables, and I gather that they allow it if no
session has "bound" (instantiated?) the table.  Maybe some
infrastructure along those lines would be useful in dealing with the
VACUUM problem also; not sure.

3. With respect to unlogged tables, the major obstacle seems to be
figuring out a way for these to get automatically truncated at startup
time.  As with temporary table cleanup in general, the problem here is
that you can't do the obvious thing of iterating through pg_class and
truncating each unlogged table you find without greatly complicating
the startup sequence.  However, I think there's a fairly easy way
around this problem: truncating a table basically means removing all
segments and relation forks other than the first segment of the main
fork, and truncating that one to zero length.  So we could do it the
same way we clean up temporary files - namely, based on the file name
- if we made the filenames for unlogged tables distinguishable from
those for regular and temporary tables.  What I'm thinking about is
reserving a backend ID of -2 for this purpose via some suitable
constant definition, just as -1 (InvalidBackendId) represents a
permanent table in this context.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: update on global temporary and unlogged tables

From

Robert Haas

Date:

12 September 2010, 23:49:44

On Mon, Sep 6, 2010 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 3. With respect to unlogged tables, the major obstacle seems to be
> figuring out a way for these to get automatically truncated at startup
> time.  As with temporary table cleanup in general, the problem here is
> that you can't do the obvious thing of iterating through pg_class and
> truncating each unlogged table you find without greatly complicating
> the startup sequence.  However, I think there's a fairly easy way
> around this problem: truncating a table basically means removing all
> segments and relation forks other than the first segment of the main
> fork, and truncating that one to zero length.  So we could do it the
> same way we clean up temporary files - namely, based on the file name
> - if we made the filenames for unlogged tables distinguishable from
> those for regular and temporary tables.  What I'm thinking about is
> reserving a backend ID of -2 for this purpose via some suitable
> constant definition, just as -1 (InvalidBackendId) represents a
> permanent table in this context.

I tried this approach and got fairly far with it, but ran into a snag
in the buffer manager.  It's fairly obvious that the buffer manager
has to know whether a particular buffer is from an unlogged relation
or not; for example, FlushBuffer() should skip the XLOG flush for an
unlogged buffer, and must pass the correct backend ID to smgropen().
So my first thought was just to define a bit BM_IS_UNLOGGED, with the
obvious interpretation.

That's not quite good enough, though, because GetNewRelFileNode
doesn't guarantee that the OID chosen is absolutely unique; it just
guarantees that it's unique within the space defined by the database
ID and backend ID.  So it's possible that you could have a logged
relation and an unlogged relation with the same value for
pg_class.relfilenode, which means that the buffer manager can't store
the unlogged status as a random bit someplace, but actually needs to
have it as part of the buffer tag (otherwise, a buffer descriptor hash
table lookup might find the wrong buffer).  You could maybe work
around this problem by having GetNewRelFileNode(), when generating an
OID for either a permanent or unlogged relation, check that the OID
isn't in use for either of those things already, but that requires an
extra system call, so it doesn't seem ideal.  I'd be willing to go
that route if people think it's cheap enough and more desirable for
some reason, though.

So I went looking for bit-space in the buffer tag and quickly found
some.  ForkNumber is an enum which I suppose means a 32-bit integer,
but we've only got three forks right now and it's hard to imagine more
than a handful of additional ones, so what I'm tempted to do is change
this from an enum to a 2-byte integer and replace the enum values with
#defines.  That frees up 2 bytes in the buffer tag which is more than
plenty.

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: update on global temporary and unlogged tables

From

Heikki Linnakangas

Date:

13 September 2010, 06:24:55

On 13/09/10 05:49, Robert Haas wrote:
> On Mon, Sep 6, 2010 at 10:55 PM, Robert Haas<robertmhaas@gmail.com>  wrote:
>> 3. With respect to unlogged tables, the major obstacle seems to be
>> figuring out a way for these to get automatically truncated at startup
>> time.  As with temporary table cleanup in general, the problem here is
>> that you can't do the obvious thing of iterating through pg_class and
>> truncating each unlogged table you find without greatly complicating
>> the startup sequence.  However, I think there's a fairly easy way
>> around this problem: truncating a table basically means removing all
>> segments and relation forks other than the first segment of the main
>> fork, and truncating that one to zero length.  So we could do it the
>> same way we clean up temporary files - namely, based on the file name
>> - if we made the filenames for unlogged tables distinguishable from
>> those for regular and temporary tables.  What I'm thinking about is
>> reserving a backend ID of -2 for this purpose via some suitable
>> constant definition, just as -1 (InvalidBackendId) represents a
>> permanent table in this context.
>
> I tried this approach and got fairly far with it, but ran into a snag
> in the buffer manager.  It's fairly obvious that the buffer manager
> has to know whether a particular buffer is from an unlogged relation
> or not; for example, FlushBuffer() should skip the XLOG flush for an
> unlogged buffer, and must pass the correct backend ID to smgropen().
> So my first thought was just to define a bit BM_IS_UNLOGGED, with the
> obvious interpretation.

The LSNs on all pages in an unlogged relation should be zero, and 
XLogFlush() will do nothing. That's what we rely on at the moment for 
pages that are not WAL-logged for some reason, I don't think you need 
any extra flag for that.

> So I went looking for bit-space in the buffer tag and quickly found
> some.  ForkNumber is an enum which I suppose means a 32-bit integer,
> but we've only got three forks right now and it's hard to imagine more
> than a handful of additional ones, so what I'm tempted to do is change
> this from an enum to a 2-byte integer and replace the enum values with
> #defines.  That frees up 2 bytes in the buffer tag which is more than
> plenty.

I haven't been following the discussion so I don't understand why you 
need the extra bits, but no objections to reducing the fork number field.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: update on global temporary and unlogged tables

From

Robert Haas

Date:

13 September 2010, 08:57:52

On Mon, Sep 13, 2010 at 5:24 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> The LSNs on all pages in an unlogged relation should be zero, and
> XLogFlush() will do nothing. That's what we rely on at the moment for pages
> that are not WAL-logged for some reason, I don't think you need any extra
> flag for that.

Ah, interesting.  I wonder if I should add a cross-check for that and
elog(LOG) if it isn't the case, or some such.

>> So I went looking for bit-space in the buffer tag and quickly found
>> some.  ForkNumber is an enum which I suppose means a 32-bit integer,
>> but we've only got three forks right now and it's hard to imagine more
>> than a handful of additional ones, so what I'm tempted to do is change
>> this from an enum to a 2-byte integer and replace the enum values with
>> #defines.  That frees up 2 bytes in the buffer tag which is more than
>> plenty.
>
> I haven't been following the discussion so I don't understand why you need
> the extra bits, but no objections to reducing the fork number field.

Well, the idea is that unlogged table files are named differently than
regular table files (I'm thinking, just insert a "u" at the beginning)
so that we can truncate them at startup time without needing to look
at any catalog entries.  So the point is you could have
data/base/16384/u124141421 block 173 and data/base/16384/124141421
block 173 in the buffer cache at the same time.   The alternative is
to try to make sure that you never create both of those files in the
first place, but that requires an extra system call per
GetNewRelFileNode() - and that could get worse in the future if for
some reason we find it advantageous to have more than two
"namespaces".  (The already-committed RelFileNodeBackend patch already
creates one "namespace" per backend for temporary tables plus one for
permanent tables; but that doesn't run into this problem because
temporary tables use backend-local buffers - i.e. the relevant
BackendId can be inferred strictly from which set of buffers we're
looking at.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

Re: update on global temporary and unlogged tables

From

Rob Wultsch

Date:

13 September 2010, 22:07:29

<div class="gmail_quote">On Mon, Sep 6, 2010 at 7:55 PM, Robert Haas <span dir="ltr"><<a
href="mailto:robertmhaas@gmail.com">robertmhaas@gmail.com</a>></span>wrote:<br /><blockquote class="gmail_quote"
style="margin:0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> 3. With respect to
unloggedtables, the major obstacle seems to be<br /> figuring out a way for these to get automatically truncated at
startup<br/> time.<br /></blockquote></div><br />(please forgive what is probably a stupid question)<br />By truncate
domean reduce the table to a very small number (or zero) number of pages? Is there a case to be made for instead
somehowmarking all pages as available for reuse? Deallocating and reallocating space can be expensive.<br clear="all"
/><br/>-- <br />Rob Wultsch<br /><a href="mailto:wultsch@gmail.com">wultsch@gmail.com</a><br />

Re: update on global temporary and unlogged tables

From

Robert Haas

Date:

13 September 2010, 23:37:22

On Mon, Sep 13, 2010 at 9:06 PM, Rob Wultsch <wultsch@gmail.com> wrote:
> On Mon, Sep 6, 2010 at 7:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> 3. With respect to unlogged tables, the major obstacle seems to be
>> figuring out a way for these to get automatically truncated at startup
>> time.
>
> (please forgive what is probably a stupid question)
> By truncate do mean reduce the table to a very small number (or zero) number
> of pages? Is there a case to be made for instead somehow marking all pages
> as available for reuse? Deallocating and reallocating space can be
> expensive.

I think it's probably actually cheaper to truncate them, but since it
only happens at startup time it's probably not worth worrying
about....

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company