From: Robert Haas
Subject: Re: Page Checksums + Double Writes
Date:
Msg-id: CA+TgmoauJGnm8aVXTnUD7x4kCZp5oJiVS5FTF50XSOTXY1a2Yw@mail.gmail.com
In response to: Re: Page Checksums + Double Writes  (Florian Pflug <fgp@phlo.org>)
List: pgsql-hackers
On Thu, Jan 5, 2012 at 6:15 AM, Florian Pflug <fgp@phlo.org> wrote:
> On 64-bit machines at least, we could simply mmap() the stable parts of the
> CLOG into the backend address space, and access it without any locking at all.

True.  I think this could be done, but it would take some fairly
careful thought and testing, because (1) we don't currently use mmap()
anywhere else in the backend AFAIK, so we might run into portability
issues (think: Windows) and perhaps unexpected failure modes (e.g.
mmap() fails because there are too many mappings already), and (2)
it's not completely guaranteed to be a win.  Sure, you save on
locking, but now you are doing an mmap() call in every backend instead
of just one read() into shared memory.  If concurrency isn't a
problem, that might be more expensive on net.  Or maybe not, but I'm
kind of inclined to steer clear of this whole area at least for 9.2.
So far, the only test results I have support the notion that we run
into trouble when NUM_CPUS > NUM_CLOG_BUFFERS, and people have to wait
before they can even start their I/Os.  That can be fixed with a
pretty modest amount of reengineering.  I'm sure there is a
second-order effect from the cost of repeated I/Os per se, which a
backend-private cache of one form or another might well help with, but
it may not be very big.  Test results are welcome, of course.
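
Just to make the mmap() idea concrete, here is a minimal sketch of what
a per-backend, read-only mapping of one CLOG segment might look like.
This is not how the backend does it today (slru.c read()s pages into
shared buffers), and the segment path, the function name, and the lack
of real error reporting are all placeholders:

/*
 * Hypothetical sketch only: map a CLOG segment read-only into this
 * backend's address space.  Stable pages never change, so a shared
 * read-only mapping needs no locking.
 */
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CLOG_SEGMENT_PATH "pg_clog/0000"    /* placeholder segment name */

static const char *
map_clog_segment_readonly(size_t *len_out)
{
    int         fd;
    struct stat st;
    void       *p;

    fd = open(CLOG_SEGMENT_PATH, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0)
    {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping survives the close */
    if (p == MAP_FAILED)
        return NULL;            /* e.g. too many mappings already */
    *len_out = (size_t) st.st_size;
    return (const char *) p;
}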

> I believe that we could also compress the stable part by 50% if we use one
> instead of two bits per txid. AFAIK, we need two bits because we
>
>  a) Distinguish between transactions which were ABORTED and those which never
>     completed (due to, e.g., a backend crash) and
>
>  b) Mark transactions as SUBCOMMITTED to achieve atomic commits.
>
> Neither of which is strictly necessary for the stable parts of the clog.

Well, if we're going to do compression at all, I'm inclined to think
that we should compress by more than a factor of two.  Jim Nasby's
numbers (the worst we've seen so far) show that 18% of 1k blocks of
XIDs were all commits.  Presumably if we reduced the chunk size to,
say, 8 transactions, that percentage would go up, and even that would
be enough to get 16x compression rather than 2x.  Of course, then
keeping the uncompressed CLOG files becomes required rather than
optional, but that's OK.  What bothers me about compressing by only 2x
is that the act of compressing is not free.  You have to read all the
chunks and then write out new chunks, and those chunks then compete
with each other for cache space.  Who is to say that we're not better off just
reading the uncompressed data at that point?  At least then we have
only one copy of it.
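
To spell out where that 16x comes from: 8 XIDs at 2 bits each is 16
bits of CLOG, collapsing to a single "all committed" bit whenever the
whole chunk committed; any other chunk would still have to be looked up
in the uncompressed CLOG.  A rough sketch of that test, purely for
illustration (the names, the one-status-per-byte layout, and the status
value are placeholders, not the real clog format):

/*
 * Illustration only: decide whether a chunk of 8 XID statuses could be
 * represented by one "all committed" bit in a hypothetical compressed
 * CLOG.  Any other chunk stays uncompressed.
 */
#include <stdbool.h>
#include <stdint.h>

#define XIDS_PER_CHUNK       8
#define XID_STATUS_COMMITTED 0x01       /* placeholder encoding */

static bool
chunk_is_all_committed(const uint8_t status[XIDS_PER_CHUNK])
{
    for (int i = 0; i < XIDS_PER_CHUNK; i++)
    {
        if (status[i] != XID_STATUS_COMMITTED)
            return false;
    }
    return true;
}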

> Note that
> we could still keep the uncompressed CLOG around for debugging purposes - the
> additional compressed version would require only 2^32/8 bytes = 512 MB in the
> worst case, which people who're serious about performance can very probably
> spare.

I don't think it'd be even that much, because we only ever use half
the XID space at a time, and often probably much less: the default
value of vacuum_freeze_table_age is only 150 million transactions.
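
For reference, the back-of-the-envelope numbers at one bit per XID (my
arithmetic, not measured data):

    2^32 XIDs        -> 2^32 / 8 bytes = 512 MB   (full XID space, the worst case above)
    2^31 XIDs        -> 256 MB                    (half the XID space)
    150 million XIDs -> ~18 MB                    (one vacuum_freeze_table_age window)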

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

