Re: postgres-r - Mailing list pgsql-hackers

From: Markus Wanner
Subject: Re: postgres-r
Msg-id: 4A86D7CA.7030007@bluegap.ch
In response to: Re: postgres-r (Mark Mielke <mark@mark.mielke.cc>)
List: pgsql-hackers
Hi,

[ CC'ing to the postgres-r mailing list ]

Mark Mielke wrote:
> On 08/12/2009 12:04 PM, Suno Ano wrote:
>> can anybody tell me, is there a roadmap with regards to
>> http://www.postgres-r.org ?

I'm glad you're asking.

>> I would love to see it become production-ready asap.

Yes, me too. Do you have some spare cycles to spend? I'd be happy to
help you get started. However, I have a 16-day-old daughter at
home, so please don't expect response times under a few days ;-)

> Even a breakdown of what is left to do might be useful in case any of us
> want to pick at it. :-)

The TODO file from the patch is a good place to start. For convenience,
I've attached it.

I wrote a series of posts covering various Postgres-R topics about a
year ago. Here are the links; following the discussions down-thread
might be interesting as well.

Postgres-R: current state of development, Jul 15 2008:
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00735.php

Postgres-R: primary key patches, Jul 16 2008
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00777.php

Postgres-R: tuple serialization, Jul 22 2008
http://archives.postgresql.org/pgsql-hackers/2008-07/msg00969.php

Postgres-R: internal messaging, Jul 23 2008
http://archives.postgresql.org/pgsql-hackers/2008-07/msg01051.php


Some random updates to last year's "current state of development" that
come to mind:

 * I've adjusted the signaling to use the signal
   multiplexer code that recently landed on HEAD.
 * Work on the initialization and recovery stuff is
   progressing slowly, but steadily.
 * The tuple serialization code is being refactored ATM
   to get a lot smaller and easier to understand and
   debug.
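
(In case you want a rough idea of what tuple serialization does, here is a
deliberately simplified sketch, not the actual Postgres-R code: it flattens
a heap tuple into a text buffer via the backend's type output functions.
The framing and the function name are mine, for illustration only.)

/*
 * Simplified sketch of text-based tuple serialization; NOT the actual
 * Postgres-R code.  Meant to be compiled inside the backend.
 */
#include "postgres.h"

#include "access/heapam.h"
#include "access/htup.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "utils/lsyscache.h"

static void
serialize_tuple(StringInfo buf, HeapTuple tuple, TupleDesc tupdesc)
{
    Datum  *values = palloc(tupdesc->natts * sizeof(Datum));
    bool   *nulls = palloc(tupdesc->natts * sizeof(bool));
    int     i;

    /* split the tuple into per-attribute datums */
    heap_deform_tuple(tuple, tupdesc, values, nulls);

    /* number of attributes first, then one text field per attribute */
    appendStringInfo(buf, "%d", tupdesc->natts);

    for (i = 0; i < tupdesc->natts; i++)
    {
        Oid   typoutput;
        bool  typisvarlena;
        char *outstr;

        if (nulls[i])
        {
            appendStringInfoString(buf, "\tNULL");
            continue;
        }

        /* the type's output function gives us a portable text form */
        getTypeOutputInfo(tupdesc->attrs[i]->atttypid,
                          &typoutput, &typisvarlena);
        outstr = OidOutputFunctionCall(typoutput, values[i]);
        appendStringInfo(buf, "\t%s", outstr);
        pfree(outstr);
    }

    pfree(values);
    pfree(nulls);
}

The receiving side would do the reverse using the corresponding type input
functions.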

That should give you an impression of the current state of development, I
think. Please feel free to ask more specific questions.

Regards

Markus Wanner

P.S.: Suno, did you notice the addition of the link to the Postgres-R
mailing list, which you pointed out was hard to find?
URGENT
======

* Implement parsing of the replication_gcs GUC for spread and ensemble
  (see the parsing sketch below).
* Check more extensively for places where replication_enabled should be
  checked.
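
A minimal parsing sketch for the first item above; the "gcs:host:port"
value format and the struct are assumptions, not the actual syntax:

/*
 * Parse a replication_gcs setting of the (assumed) form "<gcs>:<host>:<port>",
 * e.g. "spread:localhost:4803".
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct repl_gcs_info
{
    char gcs[32];               /* "spread" or "ensemble" */
    char host[256];
    int  port;
} repl_gcs_info;

static bool
parse_replication_gcs(const char *value, repl_gcs_info *info)
{
    char  buf[512];
    char *gcs, *host, *port, *saveptr;

    if (strlen(value) >= sizeof(buf))
        return false;
    strcpy(buf, value);

    gcs = strtok_r(buf, ":", &saveptr);
    host = strtok_r(NULL, ":", &saveptr);
    port = strtok_r(NULL, ":", &saveptr);

    if (gcs == NULL || host == NULL || port == NULL)
        return false;

    /* only the supported group communication systems are accepted */
    if (strcmp(gcs, "spread") != 0 && strcmp(gcs, "ensemble") != 0)
        return false;

    snprintf(info->gcs, sizeof(info->gcs), "%s", gcs);
    snprintf(info->host, sizeof(info->host), "%s", host);
    info->port = atoi(port);

    return info->port > 0 && info->port < 65536;
}

int
main(void)
{
    repl_gcs_info info;

    if (parse_replication_gcs("spread:localhost:4803", &info))
        printf("gcs=%s host=%s port=%d\n", info.gcs, info.host, info.port);
    return 0;
}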


complaint about select() not interrupted by signals:
http://archives.postgresql.org/pgsql-hackers/2008-12/msg00448.php

restartable signals 'n all that
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00003.php

3.2.1 Internal Message Passing
==============================

* Maybe send IMSGT_READY after some other commands, not only after
  IMSGT_CHANGESET. Remember that local transactions also have to send an
  IMSGT_READY, so that their proc->coid gets reset.

* Make sure the coordinator copes with killed backends (local as
  well as remote ones).

* Check if we can use pselect to avoid race conditions with IMessage stuff
  within the coordinator's main loop (see the sketch at the end of this
  section).

* Check error conditions such as out of memory and out of disk space. Those
  could prevent a single node from applying a remote transaction. What to
  do in such cases? A similar one is "limit of queued remote transactions
  reached".


3.2.2 Communication with the Postmaster
=======================================

* Get rid of the SIGHUP signal (was IMSGT_SYSTEM_READY) for the
  coordinator and instead only start the coordinator once the postmaster
  is ready to fork helper backends. This should simplify things and make
  them more similar to existing Postgres code, e.g. the autovacuum
  launcher.

* Handle restarts of the coordinator due to a crashed backend. The
  postmaster already sends a signal to terminate an existing
  coordinator process and it tries to restart one. But the coordinator
  should then start recovery and only allow other backends after that.

  Keep in mind that this recovery process is costly and we should somehow
  prevent nodes which fail repeatedly from endlessly consuming resources
  of the complete cluster.

* The backends need to report errors from remote *and* local transactions
  to the coordinator. Worker backends erroring out while waiting for
  changesets are critical. Erroring out due to a serialization failure is
  fine; we can simply ignore the changeset once it arrives later on. But
  other errors are probably pretty bad at that stage. Upon crashes, the
  postmaster restarts all backends and the coordinator anyway, so the
  backend process itself can take care of informing the coordinator via
  imessages.

* Think about a newly requested helper backend crashing before it
  registers with the coordinator. That would prevent requesting any further
  helper backend.


3.2.3 Group Communication System Issues
=======================================

* Drop the static receive buffers of the GCS interfaces in favor of dynamic
  ones, which are much easier to handle (see the sketch at the end of this
  section).

* Hot swapping of the underlying GCS of a replicated database is currently
  not supported. It would involve waiting for all nodes of the group to
  have joined the new group, then swap.

  If we enforce that the GCS group name equals the database name, such a
  swap would be needed for renaming a replicated database. That might be
  a good reason against that rule.

* Better error reporting to the client in case of GCS errors. There are
  three phases: connecting, initialization and joining the group. All of
  those can potentially fail. Currently, an ALTER DATABASE waits for a
  DB_STATE_CHANGE, and it waits forever if something fails.
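
Rough sketch of what a dynamic receive buffer could look like, using the
backend's StringInfo instead of a static array; the 4-byte length-prefix
framing and the function names are assumptions, not the actual GCS
interface:

#include "postgres.h"

#include <unistd.h>

#include "lib/stringinfo.h"

static StringInfoData recv_buf;

void
gcs_recv_init(void)
{
    initStringInfo(&recv_buf);      /* starts small, repallocs on demand */
}

/*
 * Append whatever is currently readable on the socket; returns true as
 * soon as at least one complete length-prefixed message is buffered.
 */
bool
gcs_recv_more(int sock_fd)
{
    char        chunk[8192];
    ssize_t     n;
    uint32      msglen;

    n = read(sock_fd, chunk, sizeof(chunk));
    if (n <= 0)
        return false;

    /* no fixed upper limit: the StringInfo grows as needed */
    appendBinaryStringInfo(&recv_buf, chunk, (int) n);

    if (recv_buf.len < (int) sizeof(uint32))
        return false;

    memcpy(&msglen, recv_buf.data, sizeof(uint32));
    return recv_buf.len >= (int) (sizeof(uint32) + msglen);
}

/* After the message has been consumed, reclaim the space without freeing. */
void
gcs_recv_done(void)
{
    resetStringInfo(&recv_buf);
}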


3.3.1 Group Communication Services
==================================

* Prevent EGCS from sending an initial view which does not include the
  local node.

* Complete support for Spread

* Support for Appia


3.3.2 Global Object Identifiers
===============================

* Use a naming service translating local OIDs to global ids, so that we
  don't have to send the full schema and table name every time.
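
Toy sketch of the naming-service idea, with made-up names and data
structures: the full (schema, relation) name is announced only once, and
later changesets carry just the compact global id.

#include <stdio.h>

#define MAX_RELATIONS 1024

typedef struct rel_name_entry
{
    unsigned int local_oid;     /* node-local pg_class OID */
    unsigned int global_id;     /* compact, cluster-wide identifier */
    char         schema[64];
    char         relname[64];
} rel_name_entry;

static rel_name_entry registry[MAX_RELATIONS];
static int            n_entries = 0;

/*
 * Look up (or lazily assign) the global id for a local relation.  The
 * first call is where the full name would be announced to the other
 * nodes; later changesets only carry the small id.
 */
static unsigned int
global_id_for(unsigned int local_oid, const char *schema, const char *relname)
{
    int i;

    for (i = 0; i < n_entries; i++)
        if (registry[i].local_oid == local_oid)
            return registry[i].global_id;

    if (n_entries >= MAX_RELATIONS)
        return 0;               /* out of slots; a real service would grow */

    registry[n_entries].local_oid = local_oid;
    registry[n_entries].global_id = n_entries + 1;
    snprintf(registry[n_entries].schema, sizeof(registry[0].schema),
             "%s", schema);
    snprintf(registry[n_entries].relname, sizeof(registry[0].relname),
             "%s", relname);

    printf("announce: global id %d stands for %s.%s\n",
           n_entries + 1, schema, relname);
    return registry[n_entries++].global_id;
}

int
main(void)
{
    /* first use sends the name, the second one only needs the id */
    unsigned int id1 = global_id_for(16384, "public", "accounts");
    unsigned int id2 = global_id_for(16384, "public", "accounts");

    printf("changeset carries id %u (same as %u)\n", id2, id1);
    return 0;
}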


3.3.3 Global Transaction Identifiers
====================================

* Drop COIDs in favor of GIDs


3.4 Collection of Transactional Data Changes
============================================

* Make sure we correctly serialize transactions which modify tuples that
  are referenced by a foreign key. An insert or update of a tuple that
  references another tuple must make sure the referenced tuple didn't
  change.

  (The other way around should be covered automatically by the changeset,
  because it also catches changes made by the ON UPDATE or ON DELETE
  actions of the affected foreign key.)

  Write tests for that behaviour.

* Think about removing these additional members of the EState:
  es_allLocksGranted, es_tupleChangeApplied and es_loopCounter. Those can
  certainly be simplified.

* Take care of a correct READ COMMITTED mode, which requires changes of a
  committed transaction to be visible immediately to all other concurrently
  running transactions. This might be very similar to a fully synchronous,
  lock based replication mode. This certainly introduces higher commit
  latency.

* Add the schema name to the changeset and seq_increment messages to fully
  support namespaces.

* Support for savepoints requires communicating additional sub-transaction
  states.


3.6 Application of Change Sets
==============================

* Possibly use heap_{insert,update,delete} directly, instead of going
  through ExecInsert, ExecUpdate and ExecDelete? That could save us some
  conditionals, but we would probably need to re-add other stuff.

* Possibly limit ExecOpenIndices() to open only the primary key index for
  CMD_DELETE?

* Check if ExecInsertIndexTuples() could break due to out of sync replica
  with UNIQUE constraint violations.

* Make sure the statement_timeout does not affect helper backends (see the
  sketch at the end of this section).

* Prevent possible deadlocks which might occur due to re-ordered
  (optimistic) application of change sets from remote transactions. Make
  sure the next transaction according to the decided ordering always has
  a spare helper backend available to execute on and is not blocked by
  other remote transactions which must wait for it (and would thus cause
  a deadlock).
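
Regarding the statement_timeout item: one possible approach, similar in
spirit to what the autovacuum workers do, is to force the setting off when
the helper backend starts. SetConfigOption() is the real backend call;
helper_backend_startup() is just a placeholder name.

#include "postgres.h"

#include "utils/guc.h"

void
helper_backend_startup(void)
{
    /*
     * PGC_S_OVERRIDE keeps a later reload of postgresql.conf from
     * re-enabling the timeout for this process.
     */
    SetConfigOption("statement_timeout", "0",
                    PGC_SUSET, PGC_S_OVERRIDE);
}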


3.8.2 Data Definition Changes
=============================

* Check which messages the coordinator must ignore, because
  they could originate from backends which were running concurrently
  with a STOP REPLICATION command. Such backends could possibly send
  changesets and other replication requests.

* Add proper handling of CREATE / ALTER / DROP TABLE and make sure those
  don't interfere with normal, parallel changeset application.


3.9 Initialization and Recovery
===============================

* Helper processes connected to template databases should exit immediately
  after having performed their job, so that CREATE DATABASE from such a
  template database works again.


3.9.1 Initialization and Recovery: Data Transfer
================================================


3.9.2 Initialization and Recovery: Schema Adaptation
====================================================

* Implement schema adaptation

* Make sure triggers and constraints either contain only functions which
  are available on every node _or_ execute the triggers and check the
  constraints only on the machines that have them (remote execution?).


3.9.5 Initialization and Recovery: Full Cluster Shutdown and Restart
====================================================================

* After a full crash (no majority running, thus cluster-wide operation
  stopped), we need to be able to recover from the distributed,
  permanent storage into a consistent state. This requires nodes
  communicating their recently committed transactions, which didn't
  make it to the other nodes before the crash.




Cleanup
=======

* Merge repl_database_info::state into group::nodes->state, and
  main_state into main_group::nodes->state. Add a simpler routine to
  retrieve the local node.

* Clean up the "node_id_self_ref" mess. The GCS should not be able to send
  the coordinator a viewchange which does not include the local node
  itself. In that sense, maybe "nodes" doesn't need to include the local
  node?

* Reduce the amount of elog(DEBUG...) calls to a useful level. Currently
  mainly DEBUG3 is used, sometimes DEBUG5. Maybe also rethink the
  preprocessor flags which enable or disable this verbose debugging.

* At the moment, exec_simple_query is exported to the replication code,
  whereas in stock Postgres that function is static.

* The same applies to ExecInsert, which is no longer static, but is also
  used in the recovery code. However, that should be folded into
  ExecProcessCollection() to reduce code duplication anyway.

* Consistently name the backends 'worker' and 'helper' backends?

* Never call cset_process() from worker backends! Fix the comment above
  that function.

* The recovery subscriber currently issues a CREATE DATABASE from within
  a transaction block. That's unclean.

* The database encoding is transferred as a number, not a string. Not sure
  if that matters.

