Re: crash-safe visibility map, take three - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: crash-safe visibility map, take three
Date
Msg-id 1291333025.30909.69.camel@jdavis-ux.asterdata.local
Whole thread Raw
In response to Re: crash-safe visibility map, take three  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: crash-safe visibility map, take three
List pgsql-hackers
On Thu, 2010-12-02 at 17:00 -0500, Robert Haas wrote:
> I'm not really convinced that this problem is confined to bulk
> loading.  Every INSERT or UPDATE results in a new tuple that may need
> hit bits set and eventually to be frozen.  A bulk load is just a time
> when you do lots of inserts all at once; it seems to me that a large
> update would cause all the same problems, plus bloat.

A big UPDATE does a lot of work, I don't see any near-term approach for
solving that. Avoiding WAL (and full page writes) for the updates to
PD_ALL_VISIBLE, hint bits, VM bits, etc., would probably be the least of
my concerns in that case.

>   The triple I/O
> problem exists for small transactions as well (and isn't desirable
> there either); it's just less noticeable because the second and third
> writes are, like the first one, small.

Bulk loading poses some unique challenges because there is no
opportunity to set PD_ALL_VISIBLE or hint bits before the loading is
complete; and by that time, many checkpoints will have already happened,
and the pages have already hit disk. That means we need to re-read them,
modify them, and write them again (plus WAL, if we were following the
rules).

Small transactions don't suffer from the same problems. They generally
begin and end without an intervening checkpoint. That means that you
have an opportunity to set PD_ALL_VISIBLE or hint bits before the
checkpoint happens, thus avoiding unnecessary extra writes.

Additionally, small transaction workloads will generally have, to some
approximation, some working set of pages. So, even if you do a read of a
single tuple, write PD_ALL_VISIBLE and hint bits (thus dirtying the
page), there is a reasonable chance that someone will come by later and
do an insert/update/delete (thus forcing WAL anyway).

And if the small transaction workload is completely random and you
aren't touching the same pages between checkpoints, then setting hint
bits one-at-a-time is not a good strategy anyway. It would be much
better to do it in bulk with a VACUUM. And if VACUUM does anything
significant to a page, it's going to WAL anyway.

I'm having trouble seeing a case other than bulk-loading which causes a
real problem. Maybe a small-transaction workload with a few long-running
transactions? Even that doesn't seem so bad.

> It seems to me that a COPY command executed in a transaction with no
> other open snapshots writing to a table created or truncated within
> the same transaction should be able to write frozen tuples from the
> get-go, regardless of anything else we do.

Well, some transaction might pick up a snapshot between the time you
begin the copy and the time it commits. We'd need to prevent such a
transaction from actually reading the table.

> I don't think it would be appropriate to hold off
> making the visibility map crash-safe, on the off chance that our
> design for so doing might complicate something else we want to do
> later.

I'm not suggesting we hold off on it at all. To the contrary, I'm
suggesting that we simply log updates of PD_ALL_VISIBLE as well as VM
bits, at least until a performance problem presents itself. That will
_simplify_ the design.

Then, when a performance problem does present itself for a certain use
case, we can see how to fix it. If many cases are affected, then we
might choose one of these more creative solutions that breaks the rules
in controlled ways, understanding the trade-offs. If only bulk loading
is affected, we might choose to address that case directly.

Regards,Jeff Davis



pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: WIP patch for parallel pg_dump
Next
From: Andrew Dunstan
Date:
Subject: Re: WIP patch for parallel pg_dump