Allow declaration after statement and reformat code to use it - Mailing list pgsql-hackers

From Jelte Fennema
Subject Allow declaration after statement and reformat code to use it
Date
Msg-id AM5PR83MB0178E68E4FF1BAF9C66DF0D1F7C09@AM5PR83MB0178.EURPRD83.prod.outlook.com
Whole thread Raw
Responses Re: Allow declaration after statement and reformat code to use it  (Ranier Vilela <ranier.vf@gmail.com>)
List pgsql-hackers
## What is this?

It's a draft patch that replaces code like this:
```c
pg_file_unlink(PG_FUNCTION_ARGS)
{
char   *filename;
requireSuperuser();
filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
```

With this shorter version:
```c
pg_file_unlink(PG_FUNCTION_ARGS)
{
requireSuperuser();
char      *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
```

## Why would we want this?
1. It removes 23328 lines of code that don't have any impact on how the code behaves [1]. This is roughly 2.7% of all the lines of code in the codebase.
2. Declarations are closer to the actual usage. This is advised by the "Code Complete" book [2] and has the following advantages:
   a. This limits variable scope to what is necessary. Which in turn makes the mental model you have to keep of a function when reading the code simpler.
   b. In most cases it allows you to see the the type of a variable without needing to go to the top of the function.
3. You can do input checking and assertions at the top of the function, instead of having to put declarations in front of it. This makes it clear which invariants hold for the function.  (as seen in the example above and the changes for pg_file_rename[3])
4. Declaring variables after statements is allowed in C99. Postgres already requires C99, so it might as well use this feature too.

## How was this changeset created?

I created a script that modifies all C files in the following way:

1. Find a non `static` declaration.
2. If it has an initializer and it is not a single variable/constant, don't consider replacing it. (reason: a function call initializer might have sideffects).
3. Otherwise (it's a simple initializer or it has no initializer at all), take the type and variable from that declaration.
4. Find the next use of the variable.
5. If the next use is not an assignment, don't do anything (the value of the original initialization is used).
6. If the next use is an assignment:
    1. Remove the old declaration
    2. Prefix the found assignment with the type
    3. Unless the variable is also used in the same line of the new initialization, e.g:
      ```c
      int a = 1;
      a = a + a;
      ```

## How does this script work?

It uses a Perl regex to search and replace! (obligatory jokes at the bottom of the email) This regex is based on the ones used in this PR to citus[4] and the similar PR to pg_auto_failover[5]. The script is in the 3rd commit of this patchset.

To visualize the regex in the script in a reasonable way, copy paste the search part of it:
```
\n\t(?!(return|static)\b)(?P<type>(\w+[\t ])+[\t *]*)(?>(?P<variable>\w+)( = [\w>\s\n-]*?)?;\n(?P<code_between>(?>(?P<comment_or_string_or_not_preprocessor>\/\*.*?\*\/|"(?>\\"|.)*?"|(?!goto)[^#]))*?)(\t)?(?=\b(?P=variable)\b))(?<=\n\t)(?<!:\n\t)(?P=variable) =(?![^;]*?[^>_]\b(?P=variable)\b[^_])
```

And paste that into https://www.debuggex.com/, then select PCRE from the dropdown. (Sharing seems to be down at this point, so this is the only way to do it at the moment) Try it out! The regex is not as crazy as it looks.

There's two important assumptions this regex makes:
1. Code is indented using tabs, and the intent level determines the scope. (this is true, because of `pgindent`)
2. Declared variables are actually used. (this is true, because we would get compiler warnings otherwise)

There's two cases where this regex has some special behaviour:
1. Stop searching for the first usage of a variable when either a `goto` or a preprocessor command is found (outside a string or comment). These things affect the control flow in a way that the regex does not understand. (any `#` character is assumed to be a preprocessor commands).
2. Don't replace if the assignment is right after a `label:`, by checking if there was a `:` as the last character on the previous line. This works around errors like this:
```
hashovfl.c:865:3: error: a label can only be part of a statement and a declaration is not a statement
   OffsetNumber maxroffnum = PageGetMaxOffsetNumber(rpage);
   ^~~~~~~~~~~~
```
Detecting this case in this way is not perfect, because sometimes there is an extra newline or a comment between the label and the assignment. This is not easily detectable by the regex, because lookbehinds cannot have a variable length in Perl (or most regex engines for that matter). For these few cases (5 in the entire codebase) a manual change was done either before or after the automatic replacement to fix them so the code compiles again. (2nd and 5th commits of this patchset)

After all these changes `make -s world` doesn't show any warnings or errors and `make check-world` succeeds. I configured compilation like this:
```
./configure --enable-cassert --enable-depend --enable-debug --with-openssl --with-libxml --with-libxslt --with-uuid=e2fs --with-icu
```

## What I want to achieve with this email

1. For people to look at a small sample of the changes made by this script. I will try to send patches with the actual changes made to the code as attachments in a followup email. However, I don't know if that will succeed since the patch files are very big and it might run into some mailserver limits. So in case that doesn't work, the full changeset is available on Github[6]. This make viewing this enormous diff quite a bit easier IMHO anyaway. If you see something weird that is not covered in the "Known issues" section below, please share it. That way it can be discussed and/or fixed.
2. A discussion on if this type of code change would be a net positive for Postgres codebase. Please explain clearly why or why not.
3. Some links to resources on what's necessary to get a big refactoring patch like this merged.

## What I don't want to achieve with this email

1. For someone to go over all the changes right now. There's likely to be changes to the script or something else. Doing a full review of the changes would be better saved for later during a final review.

## Known issues with the currently generated code

There's a few issues with the final generated code that I've already spotted. These should all be relatively easy to fix in an automated way. However, I think doing so is probably better done by `pgindent` or some other auto formatting tool, instead of with the regex. Note that I did run `pgindent`, it just didn't address these things:

1. Whitespace between type and variable is kept the same as it was before moving the declaration. If the original declaration had whitespace added in such a way that all variable names of declarations aligned, then this whitespace is preserved. This looks weird in various cases. See [3] for an example.
2. `pgindent` adds a newline after each block of declarations, even if they are not at the start of function. If this is desired is debatable, but to me it seems like it adds excessive newlines. See [7] for an example.
3. If all declarations are moved away from the top of the function, then an empty newline is kept at the top of the function. This seems like an unnecessary newline to me. See [3] for an example.

Sources:
1. `tokei`[8] results of lines of code:
Before:
```
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1361309       866514       330933       163862
===============================================================================
 Total                1389      1361309       866514       330933       163862
===============================================================================
```
After:
```
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1347515       843186       330952       173377
===============================================================================
 Total                1389      1347515       843186       330952       173377
===============================================================================
```
2. https://github.com/mgp/book-notes/blob/master/code-complete.markdown#103-guidelines-for-initializing-variables
3.
```patch
@@ -401,11 +389,10 @@ pg_file_rename_internal(text *file1, text *file2, text *file3)
 Datum
 pg_file_unlink(PG_FUNCTION_ARGS)
 {
- char   *filename;
 
  requireSuperuser();
 
- filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+ char   *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
 
  if (access(filename, W_OK) < 0)
```
4. https://github.com/citusdata/citus/pull/3181
5. https://github.com/citusdata/pg_auto_failover/pull/266
6. https://github.com/JelteF/postgres/pull/2
7.
```patch
@@ -2863,7 +2886,8 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
  * longer than we must.
  */
  Buffer buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
- state->checkstrategy);
+ state->checkstrategy);
+
  LockBuffer(buffer, BT_READ);
 
  /*
```
8. https://github.com/XAMPPRocky/tokei


Obligatory jokes:
1. https://xkcd.com/1171/
2. https://xkcd.com/208/
3. https://stackoverflow.com/a/1732454/2570866 (and serious response to it https://neilmadden.blog/2019/02/24/why-you-really-can-parse-html-and-anything-else-with-regular-expressions/)

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: NAMEDATALEN increase because of non-latin languages
Next
From: Amit Kapila
Date:
Subject: Re: Skipping logical replication transactions on subscriber side