Re: [CORE] postpone next week's release - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [CORE] postpone next week's release
Date
Msg-id 557011A0.6070604@iki.fi
Whole thread Raw
In response to Re: [CORE] postpone next week's release  (Andres Freund <andres@anarazel.de>)
Responses Re: [CORE] postpone next week's release  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 05/30/2015 11:47 PM, Andres Freund wrote:
> I don't think it's primarily a problem of lack of review; although that
> is a large problem.  I think the biggest systematic problem is that the
> compound complexity of postgres has increased dramatically over the
> years.  Features have added complexity little by little, each not
> incrementally not looking that bad.  But very little has been done to
> manage complexity. Since 8.0 the codesize has roughly doubled, but
> little has been done to manage the increased complexity. Few new
> abstractions have been introduced and the structure of the code is
> largely the same.
>
> As a somewhat extreme example, let's look at StartupXLOG(). In 8.0 it
> was ~500 LOC, in master it's ~1500.  The interactions in 8.0 were
> complex, they have gotten much more complex since.  It fullfills lots of
> different roles, all in one function:
>
> (roughly in the order things happen, but simplified)
> * Read the control file/determine whether we crashed
> * recovery.conf handling
> * backup label handling
> * tablespace map handling (huh, I missed that this was added directly to
>    StartupXLOG. What a bad idea)
> * Determine whether we're doing archive recovery, read the relevant
>    checkpoint if so
> * relcache init file removal
> * timeline switch handling
> * Loading the checkpoint we're starting from
> * Initialization of a lot of subsystems
> * crash recovery/replay
>    * Including pgstat, unlogged table, exported snapshot handling
>    * iff hot standby, some more subsystems are initialized here
>    * hot standby state handling
>    * replay process intialization
>    * crash replay itself, including
>      * progress tracking
>      * recovery pause handling
>      * nextxid tracking
>      * timeline increase handling
>      * hot standby state handling
>    * unlogged relations handling
>    * archive recovery handling
>    * creation/initialization of the end of recovery checkpoint
>    * timeline increment if failover
> * subsystem initialization iff !hot_standby
> * end of recovery actions
>
> Yes. that's one routine. And, to make things even funnier, half of that
> routine isn't exercised by our tests.
>
> You can argue that this is an outlier, but I don't think so. Heapam, the
> planner, etc. have similar cases.
>
> And I think this, to some degree, explains a lot of the multixact
> problems. While there were a few "simple bugs", most of them were
> interactions between the various subsystems that are rather intricate.

I think this explanation is wrong. I agree that there are many places 
that would be good to refactor - like StartupXLOG() - but the multixact 
code was not too bad in that regard. IIRC the patch included some 
refactoring, it added some new helper functions in heapam.c, for 
example. You can argue that it didn't do enough of it, but that was not 
the big issue.

The big issue was at the architecture level. Basically, we liked 
vacuuming of XIDs and clog so much that we decided that it'd be nice if 
you had to vacuum multixids too, in order to not lose data. Many of the 
bugs and issues were not new - we had multixids before - but we upped 
the ante and turned minor locking bugs into data loss. And that had 
nothing to do with the code structure - we'd have similar issues if we 
had rewritten everything java, with the same design.

So, I'm all for refactoring and adding abstractions where it makes 
sense, but it's not going to solve design problems.

- Heikki




pgsql-hackers by date:

Previous
From: Jeevan Chalke
Date:
Subject: Re: [PATCH] two-arg current_setting() with fallback
Next
From: Andres Freund
Date:
Subject: Re: [CORE] postpone next week's release