Re: tackling full page writes

From: Robert Haas
Subject: Re: tackling full page writes
Msg-id: BANLkTikF8ST77ZEipPUp=M7_iYd25VnVTA@mail.gmail.com
In response to: Re: tackling full page writes (Bruce Momjian <bruce@momjian.us>)
List: pgsql-hackers
On Wed, May 25, 2011 at 10:13 AM, Bruce Momjian <bruce@momjian.us> wrote:
>> > Idempotent does seem like the most promising idea.
>>
>> I tend to agree with you, but I'm worried it won't actually work out
>> to a win.  By the time we augment the records with enough additional
>> information we may have eaten up a lot of the benefit we were hoping
>> to get.
>
> This is where I was confused.  Our bad case now is when someone modifies
> one row on a page between checkpoints --- instead of writing 400 bytes,
> we write 8400.  What portion of between-checkpoint activity writes more
> than a few rows to a page?  I didn't think many, except for COPY.
> Ideally we could switch in and out of this mode per page, but that seems
> super-complicated.

Well, an easy-to-understand example would be a page that gets repeated
HOT updates.  We'll do this: add a tuple, add a tuple, add a tuple,
add a tuple, HOT cleanup, add a tuple, add a tuple, add a tuple, add a
tuple, HOT cleanup... and so on.  In the worst case, that could be
done many, many times between checkpoints that might be up to an hour
apart.  The problem can also occur (with a little more difficulty)
even without HOT.  Imagine a small table with lots of inserts and
deletes.  Page fills up, some rows are deleted, vacuum frees up space,
page fills up again, some more rows are deleted, vacuum frees up space
again, and so on.
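
The reason that pattern hurts is that every one of those modifications
would have to carry the extra idempotency payload, whereas today the
full-page image is paid only once per checkpoint.  Putting rough,
purely illustrative numbers on it (the 400-byte record is Bruce's
figure; the 200-byte augmentation overhead is just a guess):

    /* Illustrative only: compare today's FPW scheme against a
     * hypothetical idempotent-record scheme as the number of
     * modifications to one page between checkpoints grows.  The
     * 200-byte augmentation overhead is an assumption, not a
     * measured number. */
    #include <stdio.h>

    int main(void)
    {
        const int record = 400;    /* ordinary WAL record (Bruce's figure) */
        const int fpi = 8400;      /* 8kB page image plus headers */
        const int overhead = 200;  /* assumed idempotency augmentation */

        for (int touches = 1; touches <= 100; touches *= 10)
        {
            /* FPW: one image on first touch, plain records after that */
            int fpw_bytes = fpi + (touches - 1) * record;
            /* idempotent: every record carries the augmentation */
            int idem_bytes = touches * (record + overhead);

            printf("%3d touches: fpw=%6d  idempotent=%6d\n",
                   touches, fpw_bytes, idem_bytes);
        }
        return 0;
    }

At one modification per page per checkpoint the idempotent scheme wins
by better than an order of magnitude; by a hundred modifications, the
per-record overhead has eaten the entire benefit.  That's exactly the
worry with the HOT-churn pattern above.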

But you raise an interesting point, which is that it might also be
possible to reduce the impact of write-ahead logging in other ways.
For example, if we're doing a large COPY into a table, we could buffer
up a full block of tuples and then just emit a single full-page image
(FPI) for the whole page.  This would likely be cheaper than logging
each tuple individually.
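
A self-contained toy model of that (this is not PostgreSQL code, and
the per-tuple overhead figures are assumptions):

    /* Toy model: accumulate COPY'd rows into an 8kB page buffer and
     * "log" one full-page image per filled page, then compare against
     * logging each tuple individually.  Sizes are illustrative. */
    #include <stdio.h>
    #include <string.h>

    #define BLCKSZ 8192

    static char page[BLCKSZ];
    static size_t used = 0;
    static long fpi_wal = 0;        /* WAL bytes under FPI batching */

    static void flush_page(void)
    {
        if (used == 0)
            return;
        fpi_wal += BLCKSZ + 200;    /* one page image plus headers */
        memset(page, 0, sizeof page);
        used = 0;
    }

    static void copy_tuple(const char *tup, size_t len)
    {
        if (used + len > BLCKSZ)
            flush_page();           /* page full: emit a single FPI */
        memcpy(page + used, tup, len);
        used += len;
    }

    int main(void)
    {
        char row[100] = {0};        /* 1000 rows of 100 bytes each */

        for (int i = 0; i < 1000; i++)
            copy_tuple(row, sizeof row);
        flush_page();

        /* per-tuple logging: data plus an assumed 50-byte header */
        long per_tuple_wal = 1000L * (100 + 50);

        printf("FPI batching: %ld bytes, per-tuple: %ld bytes\n",
               fpi_wal, per_tuple_wal);
        return 0;
    }

The saving is essentially a thousand per-record headers traded for
thirteen page images; the narrower the rows, the bigger the win.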

In fact, you could imagine keeping a queue of pending WAL for each
block in shared buffers.  You don't really need that WAL to be
consolidated into a single stream until either (a) you need to write
the block or (b) you need to commit the transaction.  When one of
those things happens, you can decide at that point whether it's
cheaper to emit the individual records or do some sort of
consolidation.  Doing it in exactly that way is probably impractical,
because every backend that wants to commit would have to make a sweep
of every buffer it's dirtied and see if any of them still contain WAL
that needs to be shoved into the main queue, and that would probably
suck, especially for small transactions.  But maybe there's some
variant that could be made to work.
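
Hand-waving that into code, the per-buffer queue might look something
like this (entirely hypothetical; no such structures exist in
PostgreSQL, and a full-page image stands in for "some sort of
consolidation"):

    /* Hypothetical sketch: WAL records for a block are parked on its
     * buffer and only consolidated when the block must be written out
     * or the transaction commits. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct PendingRec
    {
        struct PendingRec *next;
        size_t len;                 /* payload size, for the heuristic */
    } PendingRec;

    typedef struct BufferDesc
    {
        PendingRec *pending;        /* queue of not-yet-emitted WAL */
        size_t pending_bytes;
    } BufferDesc;

    #define FPI_COST (8192 + 200)   /* cost of one full-page image */

    static void queue_record(BufferDesc *buf, size_t len)
    {
        PendingRec *rec = malloc(sizeof(PendingRec));

        rec->len = len;
        rec->next = buf->pending;
        buf->pending = rec;
        buf->pending_bytes += len;
    }

    /* Called at buffer write-out or commit: emit the queued records
     * individually, or one consolidated page image, whichever is
     * cheaper. */
    static void flush_pending(BufferDesc *buf)
    {
        if (buf->pending_bytes > FPI_COST)
            printf("emit one FPI (%d bytes) instead of %zu queued bytes\n",
                   FPI_COST, buf->pending_bytes);
        else
            printf("emit %zu bytes of individual records\n",
                   buf->pending_bytes);

        while (buf->pending)
        {
            PendingRec *next = buf->pending->next;

            free(buf->pending);
            buf->pending = next;
        }
        buf->pending_bytes = 0;
    }

    int main(void)
    {
        BufferDesc buf = {0};

        for (int i = 0; i < 50; i++)   /* fifty 400-byte updates, one page */
            queue_record(&buf, 400);
        flush_pending(&buf);           /* 20000 > 8392: consolidate */
        return 0;
    }

The sketch dodges the hard part described above: at commit time a real
implementation would somehow have to find every dirtied buffer still
holding queued records.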

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

