Thread: unlogged tables v5

unlogged tables v5

From
Robert Haas
Date:
Here's an updated patch for unlogged tables, incorporating the
following changes since v4:

1. pg_dump and pg_dumpall now take an argument
--no-unlogged-table-data, which does just about what it sounds like.
In a previous email, I suggested that what we wanted here was
--no-unlogged-tables, but on further thought I don't think that's
right.  There's no reason to think that the schema of an unlogged
table shouldn't be backed up, just the contents.

2. Documentation improvements, including changes to catalogs.sgml and
storage.sgml, per observations from Tom Lane.

3. Unlogged buffers are no longer included in a non-shutdown
checkpoint, which should improve performance for some workloads.  This
is implemented via a new flag BM_PERMANENT, which made the logic a bit
simpler than BM_UNLOGGED with the opposite sense would have done.  It
could be argued that this flag out to be called
BM_SKIP_CHECKPOINTS_EXCEPT_SHUTDOWN_CHECKPOINTS, or something like
that, but I think that's gilding the lily.

4. Unlogged tables now survive a clean shutdown.  (Note that this is
not incompatible with #3: IF a buffer is written, the relation file
will be fsync'd; #3 just affects how often they get written in the
first place.)

5. Support unlogged GIN indexes.

There are a couple of open issues which I'm thinking can be left for
future work.

A. Minimization of fsync traffic.  fsyncs on unlogged relations can
potentially be postponed until shutdown time.  Right now, they'll
happen as part of the next checkpoint.

B. Unlogged GIST indexes aren't supported.  Making this work appears
to require (i) requiring every page in shared buffers to have not only
an LSN but a full page header, (ii) allocating a flag bit in the page
header to designate that the LSN isn't real and no XLogFlush should be
done, (iii) creating a shared-memory counter that will generate fake
LSNs for use by the GIST index code and adjusting the GIST code to use
this (and set the bit described in (ii)) when an unlogged GIST is in
memory, (iv) storing the counter described in (iii) somewhere (perhaps
in the control file) on shutdown so that after a clean shutdown and
restart the value continues to increase monotonically.  I'm not
entirely convinced that the penny is worth the candle here.

I'm thinking this is about ready to commit.  Further review and
testing would be appreciated, especially of these latest changes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: unlogged tables v5

From
Simon Riggs
Date:
On Fri, 2010-12-24 at 23:36 -0500, Robert Haas wrote:
> Here's an updated patch for unlogged tables, incorporating the
> following changes since v4:

Looks good

> 5. Support unlogged GIN indexes.

Not sure from reading the docs whether unlogged indexes are supported on
logged tables? Could you clarify (or clarify more often)? Does this
solve the hash index situation?

"The contents
+      of an unlogged table are also not replicated to standby servers."
would prefer to remove "also"

Unlogged tables aren't replicated, but they would be copied as part of a
base backup. I'd request that we have a different forkname, or some
other indicator that allows a base backup to allow exclusion of such
files, since they are going to get reset to zero very soon afterwards
anyway. Not everyone would wish it, but its a good option.

> There are a couple of open issues which I'm thinking can be left for
> future work.
> 
> A. Minimization of fsync traffic.  fsyncs on unlogged relations can
> potentially be postponed until shutdown time.  Right now, they'll
> happen as part of the next checkpoint.

Half the fun of unlogged tables was for me the ability to skip the fsync
and the checkpoint writes. If we're using PostgreSQL as a cache, it will
be a little hard to explain why it still does writes in a huge storm
every so often. A performance feature that doesn't avoid a performance
hit seems a little strange, to say the least.

Should be easy enough to mark a flag on each buffer, only examined at
checkpoint.

> B. Unlogged GIST indexes aren't supported.

No problem. Understand specific difficulties.

-- Simon Riggs           http://www.2ndQuadrant.com/books/PostgreSQL Development, 24x7 Support, Training and Services



Re: unlogged tables v5

From
Robert Haas
Date:
On Mon, Dec 27, 2010 at 3:22 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Not sure from reading the docs whether unlogged indexes are supported on
> logged tables? Could you clarify (or clarify more often)? Does this
> solve the hash index situation?

They are not.  The only place you'll see that the grammar has been
modified is in the CREATE TABLE documentation.  The table is either
unlogged or not, and if it is unlogged, then its indices are also
unlogged.  As I mentioned on one of the previous threads, I think
unlogged indexes on logged tables would be a useful thing to support,
but it turns out to require somewhat different infrastructure from
what I've built here, and I don't have any imminent plans to work on
it.

> Unlogged tables aren't replicated, but they would be copied as part of a
> base backup. I'd request that we have a different forkname, or some
> other indicator that allows a base backup to allow exclusion of such
> files, since they are going to get reset to zero very soon afterwards
> anyway. Not everyone would wish it, but its a good option.

You can identify them by looking for files that have an _init fork,
and then skipping the corresponding main-fork files.  I initially had
the idea of actually giving the main fork a distinguishing name, but
this turns out to have two serious defects: (1) it makes the patch a
lot more invasive, to the point where if I'd had to keep going that
route I likely would have given up on the feature altogether, and (2)
you still need the init fork anyway.  I agree this is a bit of a drag
to work with for base backups but I don't think there's a reasonable
way to do better.

>> There are a couple of open issues which I'm thinking can be left for
>> future work.
>>
>> A. Minimization of fsync traffic.  fsyncs on unlogged relations can
>> potentially be postponed until shutdown time.  Right now, they'll
>> happen as part of the next checkpoint.
>
> Half the fun of unlogged tables was for me the ability to skip the fsync
> and the checkpoint writes. If we're using PostgreSQL as a cache, it will
> be a little hard to explain why it still does writes in a huge storm
> every so often. A performance feature that doesn't avoid a performance
> hit seems a little strange, to say the least.
>
> Should be easy enough to mark a flag on each buffer, only examined at
> checkpoint.

Well, there is such a flag, but it just skips the checkpoint write.
If the cleaning scan or a backend does a write, an fsync is still
scheduled (and gets performed at the next checkpoint).  The problem
here is that there was a groundswell of support for making unlogged
tables survive a clean shutdown, and you've got to eventually perform
these fsyncs in order for that to work.  Now you could postpone them
until the shutdown checkpoint, but that's only going to help in a
limited set of cases.  Specifically:

1. If the unlogged tables fit entirely within shared_buffers, it won't help.
2. If the operating system writes back the dirty buffers before the
next checkpoint, it won't help.
3. If you're running ext3, then the first fsync on anything is going
to flush the entire buffer cache for that FS out to disk anyway, so
unless pg_xlog is on a separate partition AND you have absolutely no
writes to any normal tables since the last checkpoint, it won't help.

We could probably get a test system set up where we run XFS on Linux
with an unlogged table larger than shared buffers and demonstrate
that, in fact, there is some preventable suckiness there.

But even then, I'm not convinced I should try to fix that as part of
this patch.  First, of course, there is a huge performance gain from
using unlogged tables even without this optimization.  I have tested
it; it's big.  Second, we already know that the background writer's
algorithm for scheduling fsyncs is not optimal.  It may be that if we
end up rejiggering that algorithm, something will fall out to handle
this case, too.  I'd rather not contort the logic too much to handle
this case until we have a better idea of how we want it to work in
general.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company