Re: [HACKERS] redolog - for discussion - Mailing list pgsql-hackers

From jwieck@debis.com (Jan Wieck)
Subject Re: [HACKERS] redolog - for discussion
Date
Msg-id m0zqNDo-000EBTC@orion.SAPserv.Hamburg.dsh.de
In response to Re: [HACKERS] redolog - for discussion  (Vadim Mikheev <vadim@krs.ru>)
List pgsql-hackers
Vadim wrote:

>
> Jan Wieck wrote:
> >
> >     RECOVER DATABASE {ALL | UNTIL 'datetime' | RESET};
> >
> ...
> >
> >         For  the  others, the backend starts the recovery program
> >         which  reads  the  redolog  files,  establishes  database
> >         connections  as  required  and reruns all the commands in
>                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^
> >         them. If a required logfile isn't  found,  it  tells  the
>           ^^^^^
>
> I foresee problems with using _commands_ logging for
> recovery/replication -:((
>
> Let's consider two concurrent updates in READ COMMITTED mode:
>
> update test set x = 2 where y = 1;
>
>    and
>
> update test set x = 3 where y = 1;
>
> The result of both committed transaction will be x = 2
> if the 1st transaction updated row _after_ 2nd transaction
> and x = 3 if the 2nd transaction gets row after 1st one.
> Order of updates is not defined by order in which commands
> begun and so order in which commands should be rerun
> will be unknown...

    Yes, the order in which the commands began is of no interest
    at all. Locking can delay the execution of one command until
    another one, started later, has finished and released its
    lock. It's a classic race condition.
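
    As a rough illustration of Vadim's example, here is a toy
    model in Python. The names Txn and replay are made up for this
    sketch and have nothing to do with the backend code. It
    assumes T1's command begins first but waits on a row lock held
    by T2, so T1 commits last.

```python
# Toy model only: Txn and replay are hypothetical names, not PostgreSQL code.
from dataclasses import dataclass


@dataclass
class Txn:
    name: str
    start: int    # when the command began
    commit: int   # when the transaction committed and released its lock
    new_x: int    # value written by "update test set x = ... where y = 1"


def replay(txns, key):
    """Apply the updates in the order given by key and return the final x."""
    x = None
    for t in sorted(txns, key=key):
        x = t.new_x
    return x


# T1 begins first but is delayed by T2's lock, so it commits after T2.
t1 = Txn("T1", start=1, commit=4, new_x=2)
t2 = Txn("T2", start=2, commit=3, new_x=3)

original_result = replay([t1, t2], key=lambda t: t.commit)  # what really happened
by_start_order = replay([t1, t2], key=lambda t: t.start)    # naive rerun order

print(original_result, by_start_order)
```

    Rerunning in start order leaves a different x than the
    original run did, which is why the begin order of commands is
    useless for redo.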

    Thus, my plan was to log the queries just before the call to
    CommitTransactionCommand() in tcop. This has the advantage
    that queries which bail out with errors never get into the
    log and so need not be rerun. I can also set a static flag to
    false before starting a command and have the buffer manager
    set it to true whenever a buffer is written (marked dirty),
    so filtering out queries that do no updates at all is easy.
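
    A minimal sketch of that idea, in Python rather than backend
    C. RedoLogger, run_command and mark_dirty are hypothetical
    names invented here; the real hook points would be in tcop and
    the buffer manager. Queries are held back until commit, and
    only those that dirtied a buffer are kept.

```python
# Hypothetical sketch: none of these names exist in PostgreSQL.
class RedoLogger:
    def __init__(self):
        self.redolog = []    # stands in for the redolog file
        self._pending = []   # queries of the current transaction
        self._dirty = False  # the "static flag" from the text

    def run_command(self, query, execute):
        self._dirty = False          # reset before starting the command
        execute(self)                # may call mark_dirty(), may raise
        if self._dirty:              # read-only queries are filtered out
            self._pending.append(query)

    def mark_dirty(self):
        # Would be called by the buffer manager when a buffer is written.
        self._dirty = True

    def commit(self):
        # Log just before the transaction commits.
        self.redolog.extend(self._pending)
        self._pending.clear()

    def abort(self):
        # Failed transactions never reach the log.
        self._pending.clear()


def failing_update(lg):
    lg.mark_dirty()
    raise RuntimeError("aborted by trigger")


logger = RedoLogger()
logger.run_command("select * from test", lambda lg: None)
logger.run_command("update test set x = 2 where y = 1",
                   lambda lg: lg.mark_dirty())
logger.commit()

try:
    logger.run_command("update test set x = 9", failing_update)
except RuntimeError:
    logger.abort()

print(logger.redolog)
```

    Only the committed update ends up in the log; the select and
    the aborted update are dropped.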

    Unfortunately, query-level logging gets hit by the current
    implementation of sequence numbers. If a query that gets
    aborted somewhere in the middle (maybe by a trigger) has
    called nextval() for rows processed earlier, the sequence
    number isn't advanced at recovery time, because the query is
    suppressed entirely. And sequences aren't locked, so for
    concurrently running queries getting numbers from the same
    sequence, the results aren't reproducible. If some
    application selects a value resulting from a sequence and
    uses it later in another query, how could the redolog know
    that the value has changed? It's a Const in the logged query,
    and all of that corrupts the whole thing.

    All that is painful, and I don't yet see any solution other
    than to hook into nextval(), log the numbers generated in
    normal operation, and hand back the same numbers in redo
    mode.
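
    Roughly what such a hook might look like, again as a Python
    toy. LoggedSequence, enter_redo and the query_id argument are
    assumptions made for this sketch; in particular, tagging each
    number with the query that asked for it is just one possible
    way to cope with interleaved and aborted queries.

```python
# Hypothetical sketch: not the real nextval(), just the record/replay idea.
from collections import defaultdict, deque


class LoggedSequence:
    def __init__(self, start=1):
        self._next = start
        self.seqlog = []      # (query_id, value) pairs in generation order
        self._replay = None   # query_id -> deque of values, set in redo mode

    def nextval(self, query_id):
        if self._replay is not None:
            # Redo mode: hand back the number this query got originally.
            return self._replay[query_id].popleft()
        value = self._next
        self._next += 1
        self.seqlog.append((query_id, value))
        return value

    def enter_redo(self, seqlog):
        self._replay = defaultdict(deque)
        for query_id, value in seqlog:
            self._replay[query_id].append(value)
            # Keep the counter past every number ever handed out,
            # including those consumed by aborted queries.
            self._next = max(self._next, value + 1)


# Normal operation: queries A and B interleave, B is aborted later.
live = LoggedSequence()
a1 = live.nextval("A")
b1 = live.nextval("B")
a2 = live.nextval("A")

# Recovery: only A is rerun, and it must see the same numbers again.
redo = LoggedSequence()
redo.enter_redo(live.seqlog)
replayed = [redo.nextval("A"), redo.nextval("A")]

print(replayed)
```

    The rerun of A gets its original numbers back even though the
    aborted query B is never replayed.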

    The whole thing gets more and more complicated :-(


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#======================================== jwieck@debis.com (Jan Wieck) #
