Hi,
[ CC'ing the postgres-r mailing list ]
Mark Mielke wrote:
> On 08/12/2009 12:04 PM, Suno Ano wrote:
>> can anybody tell me, is there a roadmap with regards to
>> http://www.postgres-r.org ?
I'm glad you're asking.
>> I would love to see it become production-ready asap.
Yes, me too. Do you have some spare cycles to spend? I'd be happy to
help you get started. However, I have a 16-day-old daughter at
home, so please don't expect response times under a few days ;-)
> Even a breakdown of what is left to do might be useful in case any of us
> want to pick at it. :-)
The TODO file from the patch is a good place to start from. For the sake
of simplicity, I've attached it.
I wrote a series of posts covering various aspects of Postgres-R
about a year ago. Here are the links; following the discussions
down-thread might be interesting as well.
Postgres-R: current state of development, Jul 15 2008:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00735.php
Postgres-R: primary key patches, Jul 16 2008:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00777.php
Postgres-R: tuple serialization, Jul 22 2008:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00969.php
Postgres-R: internal messaging, Jul 23 2008:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg01051.php
Some random updates to last year's "current state of development" that
come to mind:
* I've adjusted the signaling to use the signal
multiplexer code that recently landed on HEAD.
* Work on the initialization and recovery stuff is
progressing slowly, but steadily.
* The tuple serialization code is being refactored ATM
to get a lot smaller and easier to understand and
debug.
That should give you an impression of the current state of development, I
think. Please feel free to ask more specific questions.
Regards
Markus Wanner
P.S.: Suno, did you notice the addition of the link to the Postgres-R
mailing list, which you pointed out was hard to find?
URGENT
======
* Implement parsing of the replication_gcs GUC for spread and ensemble
(a rough sketch follows at the end of this section).
* Check more extensively for places where replication_enabled should be
consulted.
Complaint about select() not being interrupted by signals:
http://archives.postgresql.org/pgsql-hackers/2008-12/msg00448.php
Restartable signals 'n all that:
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00003.php
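For illustration, parsing such a GUC value could look roughly like the
sketch below. The "<gcs>:<connection options>" syntax and the gcs_info
struct are assumptions made up for this sketch, not the actual
Postgres-R code:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct gcs_info
{
    char    name[32];       /* "spread" or "ensemble" */
    char    options[128];   /* GCS specific connection options */
} gcs_info;

static bool
parse_replication_gcs(const char *value, gcs_info *out)
{
    const char *colon = strchr(value, ':');
    size_t      namelen = colon ? (size_t) (colon - value) : strlen(value);

    if (namelen == 0 || namelen >= sizeof(out->name))
        return false;

    memcpy(out->name, value, namelen);
    out->name[namelen] = '\0';

    /* accept only the GCS implementations we know about */
    if (strcmp(out->name, "spread") != 0 &&
        strcmp(out->name, "ensemble") != 0)
        return false;

    snprintf(out->options, sizeof(out->options), "%s",
             colon ? colon + 1 : "");
    return true;
}

int
main(void)
{
    gcs_info    info;

    if (parse_replication_gcs("spread:localhost:4803", &info))
        printf("gcs=%s options=%s\n", info.name, info.options);
    return 0;
}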
3.2.1 Internal Message Passing
==============================
* Maybe send IMSGT_READY after some other commands, not only after
IMSGT_CHANGESET. Remember that local transactions also have to send an
IMSGT_READY, so that their proc->coid gets reset.
* Make sure the coordinator copes with killed backends (local as
well as remote ones).
* Check if we can use pselect() to avoid race conditions with the IMessage
stuff within the coordinator's main loop (see the sketch after this list).
* Check error conditions such as out of memory and out of disk space. Those
could prevent a single node from applying a remote transaction. What to
do in such cases? A similar one is "limit of queued remote transactions
reached".
3.2.2 Communication with the Postmaster
=======================================
* Get rid of the SIGHUP signal (was IMSGT_SYSTEM_READY) for the
coordinator and instead only start the coordinator as soon as the
postmaster is ready to fork helper backends. Should simplify things and
make them more similar to the current Postgres code, e.g. the
autovacuum launcher.
* Handle restarts of the coordinator due to a crashed backend. The
postmaster already sends a signal to terminate an existing
coordinator process and tries to restart one. But the coordinator
should then start recovery and only allow other backends after that.
Keep in mind that this recovery process is costly and we should somehow
prevent nodes which fail repeatedly from endlessly consuming resources
of the entire cluster.
* The backends need to report errors from remote *and* local transactions
to the coordinator. Worker backends erroring out while waiting for
changesets are critical. Erroring out due to a serialization failure is
fine, we can simply ignore the changeset when it arrives later on. But
other errors are probably pretty bad at that stage. Upon crashes, the
postmaster restarts all backends and the coordinator anyway, so the
backend process itself can take care of informing the coordinator via
imessages (see the sketch after this list).
* Think about a newly requested helper backend crashing before it
registers with the coordinator. That would currently prevent any further
helper backends from being requested.
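As a rough illustration of the error reporting item above: a backend
could pack its error state into a small imessage payload. The
IMSGT_ERROR type and the ErrorMsg layout below are made up for this
sketch and are not the actual Postgres-R message definitions:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* hypothetical payload for a made-up IMSGT_ERROR message */
typedef struct ErrorMsg
{
    uint32_t    coid;       /* commit order id of the failed transaction */
    uint32_t    sqlstate;   /* error code; a real SQLSTATE is a 5-char
                             * code, simplified to an integer here */
    uint8_t     is_fatal;   /* backend died vs. recoverable error */
} ErrorMsg;

static size_t
pack_error_msg(const ErrorMsg *msg, char *buf, size_t buflen)
{
    if (buflen < sizeof(ErrorMsg))
        return 0;
    memcpy(buf, msg, sizeof(ErrorMsg)); /* same-host IPC: no byte swapping */
    return sizeof(ErrorMsg);
}

int
main(void)
{
    char        buf[64];
    ErrorMsg    msg = { 42, 40001, 0 }; /* 40001: serialization failure */
    size_t      n = pack_error_msg(&msg, buf, sizeof(buf));

    printf("packed %zu bytes for coid %u\n", n, (unsigned) msg.coid);
    return 0;
}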
3.2.3 Group Communication System Issues
=======================================
* Drop the static receive buffers of the GCS interfaces in favor of a
dynamic one, which is much easier to handle (see the sketch after this
list).
* Hot swapping of the underlying GCS of a replicated database is currently
not supported. It would involve waiting for all nodes of the group to
have joined the new group, then swapping.
If we enforce the GCS group name to be equal to the database name, such
swapping is needed for renaming a replicated database. That might be a
good reason against that rule.
* Better error reporting to the client in case of GCS errors. There are
three phases: connecting, initialization and joining the group. All of
those can potentially fail. Currently, an ALTER DATABASE waits for a
DB_STATE_CHANGE, and thus waits forever if something fails.
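A sketch of what such a dynamic receive buffer could look like; the
names are illustrative and not the actual GCS interface:

#include <stdlib.h>
#include <string.h>

typedef struct recv_buffer
{
    char       *data;
    size_t      used;
    size_t      capacity;
} recv_buffer;

static int
recv_buffer_append(recv_buffer *buf, const char *bytes, size_t len)
{
    if (buf->used + len > buf->capacity)
    {
        size_t  newcap = buf->capacity ? buf->capacity : 1024;
        char   *p;

        while (newcap < buf->used + len)
            newcap *= 2;        /* grow geometrically */

        p = realloc(buf->data, newcap);
        if (p == NULL)
            return -1;          /* out of memory: report, don't crash */
        buf->data = p;
        buf->capacity = newcap;
    }
    memcpy(buf->data + buf->used, bytes, len);
    buf->used += len;
    return 0;
}

int
main(void)
{
    recv_buffer buf = { NULL, 0, 0 };

    recv_buffer_append(&buf, "hello", 5);
    free(buf.data);
    return 0;
}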
3.3.1 Group Communication Services
==================================
* Prevent EGCS from sending an initial view which does not include the
local node.
* Complete support for Spread.
* Support for Appia.
3.3.2 Global Object Identifiers
===============================
* Use a naming service translating local OIDs to global ids, so that we
don't have to send the full schema and table name every time (see the
sketch below).
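A minimal sketch of such a naming service; the flat array is purely
illustrative, a real implementation would rather use a hash table:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;

typedef struct oid_mapping
{
    Oid         local_oid;  /* node-local pg_class OID */
    uint64_t    global_id;  /* id agreed upon cluster-wide */
} oid_mapping;

static oid_mapping mappings[1024];
static int  n_mappings = 0;

static int
lookup_global_id(Oid local_oid, uint64_t *global_id)
{
    int     i;

    for (i = 0; i < n_mappings; i++)
    {
        if (mappings[i].local_oid == local_oid)
        {
            *global_id = mappings[i].global_id;
            return 1;
        }
    }
    return 0;   /* unknown: fall back to sending schema + table name */
}

int
main(void)
{
    uint64_t    gid;

    mappings[0].local_oid = 16384;
    mappings[0].global_id = 7;
    n_mappings = 1;

    if (lookup_global_id(16384, &gid))
        printf("local OID 16384 -> global id %llu\n",
               (unsigned long long) gid);
    return 0;
}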
3.3.3 Global Transaction Identifiers
====================================
* Drop COIDs in favor of GIDs
3.4 Collection of Transactional Data Changes
============================================
* Make sure we correctly serialize transactions that modify tuples
referenced by a foreign key. An insert or update of a tuple that
references another one must make sure the referenced tuple didn't
change. (The other way around should be covered automatically by the
changeset, because it also catches changes by the ON UPDATE or ON
DELETE hooks of the affected foreign key.)
Write tests for that behaviour.
* Think about removing these additional members of the EState:
es_allLocksGranted, es_tupleChangeApplied and es_loopCounter. Those can
certainly be simplified.
* Take care of a correct READ COMMITTED mode, which requires changes of a
committed transaction to be visible immediately to all other concurrently
running transactions. This might be very similar to a fully synchronous,
lock based replication mode. This certainly introduces higher commit
latency.
* Add the schema name to the changeset and seq_increment messages to fully
support namespaces.
* Support for savepoints requires communicating additional sub-transaction
states.
3.6 Application of Change Sets
==============================
* Possibly use heap_{insert,update,delete} directly, instead of going
through ExecInsert, ExecUpdate and ExecDelete? That could save us some
conditionals, but we would probably need to re-add other stuff.
* Possibly limit ExecOpenIndices() to open only the primary key index for
CMD_DELETE?
* Check whether ExecInsertIndexTuples() could break with UNIQUE constraint
violations due to an out-of-sync replica.
* Make sure the statement_timeout does not affect helper backends.
* Prevent possible deadlocks which might occur due to re-ordered
(optimistic) application of change sets from remote transactions. Just
make sure the next transaction according to the decided ordering always
has a spare helper backend available to get executed on and is not
blocked by other remote transactions which must wait for it (and would
thus cause a deadlock). See the toy model below.
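A toy model of that rule: always keep one helper backend free for the
next remote transaction in the decided total order, so re-ordered
transactions can never starve it into a deadlock. Illustration only:

#include <stdbool.h>
#include <stdio.h>

/*
 * May a remote transaction start executing? The head of the decided
 * total order may take the last idle helper; all others must leave
 * one helper spare for it.
 */
static bool
may_start(int idle_helpers, bool is_next_in_order)
{
    if (is_next_in_order)
        return idle_helpers > 0;
    return idle_helpers > 1;
}

int
main(void)
{
    printf("%d\n", may_start(1, false));    /* 0: must wait */
    printf("%d\n", may_start(1, true));     /* 1: ordered head may run */
    return 0;
}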
3.8.2 Data Definition Changes
=============================
* Check which messages the coordinator must ignore because
they could originate from backends which were running concurrently
with a STOP REPLICATION command. Such backends could possibly send
changesets and other replication requests.
* Add proper handling of CREATE / ALTER / DROP TABLE and make sure those
don't interfere with normal, parallel changeset application.
3.9 Initialization and Recovery
===============================
* Helper processes connected to template databases should exit immediately
after having performed their job, so that CREATE DATABASE from such a
template database works again.
3.9.1 Initialization and Recovery: Data Transfer
================================================
3.9.2 Initialization and Recovery: Schema Adaptation
====================================================
* Implement schema adaptation.
* Make sure triggers and constraints either only contain functions
which are available on every node _or_ execute the triggers and check the
constraints only on the machines having them (remote execution?).
3.9.5 Initialization and Recovery: Full Cluster Shutdown and Restart
====================================================================
* After a full crash (no majority running, thus stopped cluster-wide
operation), we need to be able to recover from the distributed,
permanent storage into a consistent state. This requires nodes
communicating their recently committed transactions which didn't
make it to the other nodes before the crash.
Cleanup
=======
* Merge repl_database_info::state into the group::nodes->state, and the
main_state into main_group::nodes->state. Add a simpler routine to
retrieve the local node.
* Clean up the "node_id_self_ref" mess. The GCS should not be able to
send the coordinator a viewchange which does not include the local node
itself. In that sense, maybe "nodes" doesn't need to include the local
node?
* Reduce the amount of elog(DEBUG...) calls to a useful level. Currently
mainly DEBUG3 is used, sometimes DEBUG5. Maybe also rethink the
preprocessor flags which enable or disable this verbose debugging.
* At the moment, exec_simple_query is exported to the replication code,
whereas in stock Postgres that function is static.
* The same applies to ExecInsert, which is no longer static, but is also
used in the recovery code. However, that should be folded into
ExecProcessCollection() to reduce code duplication anyway.
* Consistently name the backends 'worker' and 'helper' backends?
* Never call cset_process() from worker backends! Fix the comment above
that function.
* The recovery subscriber currently issues a CREATE DATABASE from within
a transaction block. That's unclean.
* The database encoding is transferred as a number, not as a string. Not
sure if that matters.