Re: logical changeset generation v6.2 - Mailing list pgsql-hackers

From Andres Freund
Subject Re: logical changeset generation v6.2
Date
Msg-id 20131025115713.GF5332@awork2.anarazel.de
In response to Re: logical changeset generation v6.2  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: logical changeset generation v6.2  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2013-10-24 10:59:21 -0400, Robert Haas wrote:
> On Tue, Oct 22, 2013 at 2:13 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-10-22 13:57:53 -0400, Robert Haas wrote:
> >> On Tue, Oct 22, 2013 at 1:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> >> That strikes me as a flaw in the implementation rather than the idea.
> >> >> You're presupposing a patch where the necessary information is
> >> >> available in WAL yet you don't make use of it at the proper time.
> >> >
> >> > The problem is that the mapping would be somewhere *ahead* from the
> >> > transaction/WAL we're currently decoding. We'd need to read ahead till
> >> > we find the correct one.
> >>
> >> Yes, I think that's what you need to do.
> >
> > My problem with that is that rewrite can be gigabytes into the future.
> >
> > When reading forward we could either just continue reading data into the
> > reorderbuffer, but delay replaying all future commits till we found the
> > currently needed remap. That might have quite the additional
> > storage/memory cost, but runtime complexity should be the same as normal
> > decoding.
> > Or we could individually read ahead for every transaction. But doing so
> > for every transaction will get rather expensive (roughly O(amount_of_wal^2)).
> 
> [ Sorry it's taken me a bit of time to get back to this; other tasks
> intervened, and I also just needed some time to let it settle in my
> brain. ]

No worries. I've had enough things to work on ;)

> If you read ahead looking for a set of ctid translations from
> relfilenode A to relfilenode B, and along the way you happen to
> encounter a set of translations from relfilenode C to relfilenode D,
> you could stash that set of translations away somewhere, so that if
> the next transaction you process needs that set of mappings, it's
> already computed.  With that approach, you'd never have to pre-read
> the same set of WAL files more than once.

> But, as I think about it more, that's not very different from your
> idea of stashing the translations someplace other than WAL in the
> first place.  I mean, if the read-ahead thread generates a series of
> files in pg_somethingorother that contain those maps, you could have
> just written the maps to that directory in the first place.  So on
> further review I think we could adopt that approach.

Yea, that basically was my reasoning, only expressed much more nicely ;)
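
As an illustration of the stash-as-you-read idea quoted above, here is a
minimal Python sketch, not code from the patch. The record shapes ("map",
"rewrite_done"), the MappingCache class, and its fields are all hypothetical;
the point is only that a single forward pass can stash every mapping set it
encounters, so no stretch of WAL has to be pre-read twice.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

Ctid = Tuple[int, int]        # (block number, offset) of a tuple
RewriteKey = Tuple[int, int]  # (old relfilenode, new relfilenode)


class MappingCache:
    """Reads ahead through a (hypothetical) WAL stream at most once.

    Assumed record shapes, invented for this sketch:
      ("map", old_node, new_node, old_ctid, new_ctid)
      ("rewrite_done", old_node, new_node)
    Any other record is skipped.
    """

    def __init__(self, wal: Iterable[tuple]) -> None:
        self._wal: Iterator[tuple] = iter(wal)
        self._maps: Dict[RewriteKey, Dict[Ctid, Ctid]] = {}
        self._complete: Set[RewriteKey] = set()
        self.records_read = 0

    def lookup(self, key: RewriteKey) -> Dict[Ctid, Ctid]:
        # Read forward only until the wanted rewrite is complete, stashing
        # every other mapping set seen along the way.
        while key not in self._complete:
            try:
                rec = next(self._wal)
            except StopIteration:
                raise KeyError(key) from None
            self.records_read += 1
            if rec[0] == "map":
                _, old_node, new_node, old_ctid, new_ctid = rec
                self._maps.setdefault((old_node, new_node), {})[old_ctid] = new_ctid
            elif rec[0] == "rewrite_done":
                self._complete.add((rec[1], rec[2]))
        return self._maps.get(key, {})
```

A later lookup for a rewrite that was already passed during read-ahead is
answered from the stash without touching the stream again.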

> However, I'm leery about the idea of using a relation fork for this.
> I'm not sure whether that's what you had in mind, but it gives me the
> willies.  First, it adds distributed overhead to the system, as
> previously discussed; and second, I think the accounting may be kind
> of tricky, especially in the face of multiple rewrites.  I'd be more
> inclined to find a separate place to store the mappings.  Note that,
> AFAICS, there's no real need for the mapping file to be
> block-structured, and I believe they'll be written first (with no
> readers) and subsequently only read (with no further writes) and
> eventually deleted.

I was thinking of storing it alongside the other data used during logical
decoding and letting decoding's cleanup remove that data as well. All the
information needed for that should be there.
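
To illustrate the write-once, read-many, then delete lifecycle described in
the quoted text, here is a minimal Python sketch of a flat mapping file that
is not block-structured. The entry layout, the function names, and the
temporary-file convention are assumptions made for illustration, not the
format used by the patch.

```python
import os
import struct
from typing import Dict, Iterable, Tuple

Ctid = Tuple[int, int]  # (block number, offset)

# Hypothetical fixed-size entry: old block, old offset, new block, new offset.
_ENTRY = struct.Struct("<IHIH")


def write_mapping(path: str, entries: Iterable[Tuple[Ctid, Ctid]]) -> None:
    """Write the whole mapping sequentially, then publish it atomically.

    No reader can see the file until the rename, which matches the
    "written first (with no readers)" phase.
    """
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        for (old_blk, old_off), (new_blk, new_off) in entries:
            f.write(_ENTRY.pack(old_blk, old_off, new_blk, new_off))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def read_mapping(path: str) -> Dict[Ctid, Ctid]:
    """Load the mapping; the file is never modified after publication."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        (old_blk, old_off): (new_blk, new_off)
        for old_blk, old_off, new_blk, new_off in _ENTRY.iter_unpack(data)
    }


def drop_mapping(path: str) -> None:
    """Delete the file once no decoding session can still need it."""
    os.remove(path)
```

Because the file has exactly one writer before any reader exists, no locking
or page layout is needed in this sketch; deletion would be driven by whatever
cleanup logical decoding already performs.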

There's one snag I can currently see, namely that we actually need to
prevent a previously dropped relfilenode from being reused. I'm not
entirely sure what the best way to do that is.

> One possible objection to this is that it would preclude decoding on a
> standby, which seems like a likely enough thing to want to do.  So
> maybe it's best to WAL-log the changes to the mapping file so that the
> standby can reconstruct it if needed.

The mapping file can probably be one big WAL record, so it should be
easy enough to do.

For a moment I thought there was a problem with decoding on the standby
having to read ahead of the current location to find the newer mapping,
but that's actually not required since we're protected by the AEL
(AccessExclusiveLock) during rewrites on the standby as well.

> > I think that'd be pretty similar to just disallowing VACUUM
> > FREEZE/CLUSTER on catalog relations since effectively it'd be too
> > expensive to use.
> 
> This seems unduly pessimistic to me; unless the catalogs are really
> darn big, this is a mostly theoretical problem.

Well, it's not the size of the relation, but the amount of concurrent
WAL that's being generated that matters. But anyway, if we do it like
you described above that shouldn't be a problem.

Greetings,

Andres Freund

-- 
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


