[RFC][PATCH] wal decoding, attempt #2 - Mailing list pgsql-hackers

From Andres Freund
Subject [RFC][PATCH] wal decoding, attempt #2
Date
Msg-id 201209150233.25616.andres@2ndquadrant.com
Whole thread Raw
Responses [PATCH 2/8] Add minimal binary heap implementation  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 1/8] Add embedded list interface (header only)  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 4/8] add simple xlogdump tool  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 7/8] Make InvalidateSystemCaches public  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 3/8] Add support for a generic wal reading facility dubbed XLogReader  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 6/8] Log enough data into the wal to reconstruct logical changes from it if wal_level=logical  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 5/8] Add a new syscache to fetch a pg_class entry via (reltablespace, relfilenode)  (Andres Freund <andres@2ndquadrant.com>)
[PATCH 8/8] Introduce wal decoding via catalog timetravel  (Andres Freund <andres@2ndquadrant.com>)
git tree  (Andres Freund <andres@2ndquadrant.com>)
Re: [RFC][PATCH] wal decoding, attempt #2 - Design Documents  (Andres Freund <andres@2ndquadrant.com>)
Re: [RFC][PATCH] wal decoding, attempt #2  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
Hi,

It took me far longer than I planned, its not finished, but time is running 
out. I would like some feedback that I am not going astray at this point... 
*I* think the general approach is sound and a good way forward that provides 
the basic infrastructure for many (all?) of the scenarios we talked about 
before.

Anyway, here is my next attempt at $TOPIC.

Lets start with a quick demo (via psql):

/* just so we keep a sensible xmin horizon */
ROLLBACK PREPARED 'f';
BEGIN;
CREATE TABLE keepalive();
PREPARE TRANSACTION 'f';

DROP TABLE IF EXISTS replication_example;

SELECT pg_current_xlog_insert_location();
CHECKPOINT;
CREATE TABLE replication_example(id SERIAL PRIMARY KEY, somedata int, text 
varchar(120));
begin;
INSERT INTO replication_example(somedata, text) VALUES (1, 1);
INSERT INTO replication_example(somedata, text) VALUES (1, 2);
commit;

ALTER TABLE replication_example ADD COLUMN bar int;

INSERT INTO replication_example(somedata, text, bar) VALUES (2, 1, 4);

BEGIN;
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 2, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 3, 4);
INSERT INTO replication_example(somedata, text, bar) VALUES (2, 4, NULL);

commit;
ALTER TABLE replication_example DROP COLUMN bar;
INSERT INTO replication_example(somedata, text) VALUES (3, 1);
BEGIN;
INSERT INTO replication_example(somedata, text) VALUES (3, 2);
INSERT INTO replication_example(somedata, text) VALUES (3, 3);
commit;

ALTER TABLE replication_example RENAME COLUMN text TO somenum;

INSERT INTO replication_example(somedata, somenum) VALUES (4, 1);

ALTER TABLE replication_example ALTER COLUMN somenum TYPE int4 USING 
(somenum::int4);

INSERT INTO replication_example(somedata, somenum) VALUES (5, 1);

SELECT pg_current_xlog_insert_location();

---- Somewhat later ----

SELECT decode_xlog('0/1893D78', '0/18BE398');

WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:1 somedata[int4]:1 text[varchar]:1
WARNING:  tuple is: id[int4]:2 somedata[int4]:1 text[varchar]:2
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:3 somedata[int4]:2 text[varchar]:1 bar[int4]:4
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:4 somedata[int4]:2 text[varchar]:2 bar[int4]:4
WARNING:  tuple is: id[int4]:5 somedata[int4]:2 text[varchar]:3 bar[int4]:4
WARNING:  tuple is: id[int4]:6 somedata[int4]:2 text[varchar]:4 bar[int4]:
(null)
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:7 somedata[int4]:3 text[varchar]:1
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:8 somedata[int4]:3 text[varchar]:2
WARNING:  tuple is: id[int4]:9 somedata[int4]:3 text[varchar]:3
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:10 somedata[int4]:4 somenum[varchar]:1
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  COMMIT
WARNING:  BEGIN
WARNING:  tuple is: id[int4]:11 somedata[int4]:5 somenum[int4]:1
WARNING:  COMMITdecode_xlog 
-------------t
(1 row)


As you can see the patchset can decode several changes made to a table even 
though we used DDL on it. Not everything is handled yet, but its a prototype 
after all ;)

The way this works is:

A new component called SnapshotBuilder analyzes the xlog and build a special 
kind of Snapshot. This works in a somewhat similar way to the 
KnownAssignedXids machinery for Hot Standby.
Whenever the - mostly unchanged - ApplyCache calls a 'apply_change' callback 
for a single change (INSERT|UPDATE|DELETE) it locally overrides the normal 
SnapshotNow semantics used for catalog access with one of the previously built 
snapshots. They should behave just the same as a normal SnapshotNow would have 
behaved when the tuple change was written to the xlog.

This patch doesn't provide anything that uses the new infrastructure for 
anything real, but I think thats good. Lets get this into something 
committable and then add new things using it!

Small overview over the individual patches that will come as separate mails:

old, Alvaro is doing this properly right now, separate thread
[01] Add embedded list interface (header only) 

A new piece of infrastructure (for k-way mergesort), pretty much untested, 
good idea in general I think, not very interesting:
[02] Add minimal binary heap implementation

Boring, old.:
[03] Add support for a generic wal reading facility dubbed XLogReader

Boring, old, borked:
[04] add simple xlogdump tool

Slightly changed to use (tablespace, relfilenode), possibly similar problems 
to earlier, not interesting at this point.
[05] Add a new syscache to fetch a pg_class entry via (reltablespace, 
relfilenode)

Unchanged:
[06] Log enough data into the wal to reconstruct logical changes from it if 
wal_level=logical

I didn't implement proper cache handling, so I need to use the big hammer...:
[07] Make InvalidateSystemCaches public

The major piece:
[08] Introduce wal decoding via catalog timetravel



[08] has loads of defficiencies. To cite the commit:   The snapshot building has the most critical infrastructure but
misses
 
several   important features:   * loads of docs about the internals   * improve snapshot building/distributions     *
don'tbuild them all the time, cache them     * don't increase ->xmax so slowly, its inefficient     * refcount     *
actuallyfree them   * proper cache handling     * we can probably reuse xl_xact_commit->nmsgs     * generate new local
invalmessages from catalog changes?   * handle transactions with both ddl, and changes     * command_id handling     *
combocidloggin/handling   * Add support for declaring tables as catalog tables that are not 
 
pg_catalog.*   * properly distribute new SnapshotNow snapshots after a transaction 
commits   * loads of testing/edge cases   * provision of a consistent snapshot for pg_dump   * spill state to disk at
checkpoints  * xmin handling
 

The decode_xlog() function is *purely* a debugging tool that I do not want to 
keep in the long run. I introduced it so we can concentrate on the topic at 
hand without involving even more moving parts (see the next paragraph)...

Some parts of this I would like to only discuss later, in separate threads, to 
avoid cluttering this one more than neccessary:
* how do we integrate this into walsender et al
* in which format do we transport changes
* how do we always keep enough wal 


I have some work ontop of this, that handles ComboCid's and CommandId's 
correctly (and thus mixed ddl/dml transactions), but its simply not finished 
enough. I am pretty sure by now that it works even with those additional 
complexities.

So, I am unfortunately too tired to write more than this... It will have to 
suffice. I plan to release a newer version with more documentation soon.

Comments about the approach or even the general direction of the 
implementation? Questions?

Greetings,

Andres
-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: embedded list v2
Next
From: Andres Freund
Date:
Subject: [PATCH 2/8] Add minimal binary heap implementation