Thread: Allow declaration after statement and reformat code to use it

Allow declaration after statement and reformat code to use it

From
Jelte Fennema
Date:
## What is this?

It's a draft patch that replaces code like this:
```c
pg_file_unlink(PG_FUNCTION_ARGS)
{
char   *filename;
requireSuperuser();
filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
```

With this shorter version:
```c
pg_file_unlink(PG_FUNCTION_ARGS)
{
requireSuperuser();
char      *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
```

## Why would we want this?
1. It removes 23328 lines of code that don't have any impact on how the code behaves [1]. This is roughly 2.7% of all the lines of code in the codebase.
2. Declarations are closer to the actual usage. This is advised by the "Code Complete" book [2] and has the following advantages:
   a. This limits variable scope to what is necessary. Which in turn makes the mental model you have to keep of a function when reading the code simpler.
   b. In most cases it allows you to see the the type of a variable without needing to go to the top of the function.
3. You can do input checking and assertions at the top of the function, instead of having to put declarations in front of it. This makes it clear which invariants hold for the function.  (as seen in the example above and the changes for pg_file_rename[3])
4. Declaring variables after statements is allowed in C99. Postgres already requires C99, so it might as well use this feature too.

## How was this changeset created?

I created a script that modifies all C files in the following way:

1. Find a non `static` declaration.
2. If it has an initializer and it is not a single variable/constant, don't consider replacing it. (reason: a function call initializer might have sideffects).
3. Otherwise (it's a simple initializer or it has no initializer at all), take the type and variable from that declaration.
4. Find the next use of the variable.
5. If the next use is not an assignment, don't do anything (the value of the original initialization is used).
6. If the next use is an assignment:
    1. Remove the old declaration
    2. Prefix the found assignment with the type
    3. Unless the variable is also used in the same line of the new initialization, e.g:
      ```c
      int a = 1;
      a = a + a;
      ```

## How does this script work?

It uses a Perl regex to search and replace! (obligatory jokes at the bottom of the email) This regex is based on the ones used in this PR to citus[4] and the similar PR to pg_auto_failover[5]. The script is in the 3rd commit of this patchset.

To visualize the regex in the script in a reasonable way, copy paste the search part of it:
```
\n\t(?!(return|static)\b)(?P<type>(\w+[\t ])+[\t *]*)(?>(?P<variable>\w+)( = [\w>\s\n-]*?)?;\n(?P<code_between>(?>(?P<comment_or_string_or_not_preprocessor>\/\*.*?\*\/|"(?>\\"|.)*?"|(?!goto)[^#]))*?)(\t)?(?=\b(?P=variable)\b))(?<=\n\t)(?<!:\n\t)(?P=variable) =(?![^;]*?[^>_]\b(?P=variable)\b[^_])
```

And paste that into https://www.debuggex.com/, then select PCRE from the dropdown. (Sharing seems to be down at this point, so this is the only way to do it at the moment) Try it out! The regex is not as crazy as it looks.

There's two important assumptions this regex makes:
1. Code is indented using tabs, and the intent level determines the scope. (this is true, because of `pgindent`)
2. Declared variables are actually used. (this is true, because we would get compiler warnings otherwise)

There's two cases where this regex has some special behaviour:
1. Stop searching for the first usage of a variable when either a `goto` or a preprocessor command is found (outside a string or comment). These things affect the control flow in a way that the regex does not understand. (any `#` character is assumed to be a preprocessor commands).
2. Don't replace if the assignment is right after a `label:`, by checking if there was a `:` as the last character on the previous line. This works around errors like this:
```
hashovfl.c:865:3: error: a label can only be part of a statement and a declaration is not a statement
   OffsetNumber maxroffnum = PageGetMaxOffsetNumber(rpage);
   ^~~~~~~~~~~~
```
Detecting this case in this way is not perfect, because sometimes there is an extra newline or a comment between the label and the assignment. This is not easily detectable by the regex, because lookbehinds cannot have a variable length in Perl (or most regex engines for that matter). For these few cases (5 in the entire codebase) a manual change was done either before or after the automatic replacement to fix them so the code compiles again. (2nd and 5th commits of this patchset)

After all these changes `make -s world` doesn't show any warnings or errors and `make check-world` succeeds. I configured compilation like this:
```
./configure --enable-cassert --enable-depend --enable-debug --with-openssl --with-libxml --with-libxslt --with-uuid=e2fs --with-icu
```

## What I want to achieve with this email

1. For people to look at a small sample of the changes made by this script. I will try to send patches with the actual changes made to the code as attachments in a followup email. However, I don't know if that will succeed since the patch files are very big and it might run into some mailserver limits. So in case that doesn't work, the full changeset is available on Github[6]. This make viewing this enormous diff quite a bit easier IMHO anyaway. If you see something weird that is not covered in the "Known issues" section below, please share it. That way it can be discussed and/or fixed.
2. A discussion on if this type of code change would be a net positive for Postgres codebase. Please explain clearly why or why not.
3. Some links to resources on what's necessary to get a big refactoring patch like this merged.

## What I don't want to achieve with this email

1. For someone to go over all the changes right now. There's likely to be changes to the script or something else. Doing a full review of the changes would be better saved for later during a final review.

## Known issues with the currently generated code

There's a few issues with the final generated code that I've already spotted. These should all be relatively easy to fix in an automated way. However, I think doing so is probably better done by `pgindent` or some other auto formatting tool, instead of with the regex. Note that I did run `pgindent`, it just didn't address these things:

1. Whitespace between type and variable is kept the same as it was before moving the declaration. If the original declaration had whitespace added in such a way that all variable names of declarations aligned, then this whitespace is preserved. This looks weird in various cases. See [3] for an example.
2. `pgindent` adds a newline after each block of declarations, even if they are not at the start of function. If this is desired is debatable, but to me it seems like it adds excessive newlines. See [7] for an example.
3. If all declarations are moved away from the top of the function, then an empty newline is kept at the top of the function. This seems like an unnecessary newline to me. See [3] for an example.

Sources:
1. `tokei`[8] results of lines of code:
Before:
```
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1361309       866514       330933       163862
===============================================================================
 Total                1389      1361309       866514       330933       163862
===============================================================================
```
After:
```
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1347515       843186       330952       173377
===============================================================================
 Total                1389      1347515       843186       330952       173377
===============================================================================
```
2. https://github.com/mgp/book-notes/blob/master/code-complete.markdown#103-guidelines-for-initializing-variables
3.
```patch
@@ -401,11 +389,10 @@ pg_file_rename_internal(text *file1, text *file2, text *file3)
 Datum
 pg_file_unlink(PG_FUNCTION_ARGS)
 {
- char   *filename;
 
  requireSuperuser();
 
- filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+ char   *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
 
  if (access(filename, W_OK) < 0)
```
4. https://github.com/citusdata/citus/pull/3181
5. https://github.com/citusdata/pg_auto_failover/pull/266
6. https://github.com/JelteF/postgres/pull/2
7.
```patch
@@ -2863,7 +2886,8 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
  * longer than we must.
  */
  Buffer buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
- state->checkstrategy);
+ state->checkstrategy);
+
  LockBuffer(buffer, BT_READ);
 
  /*
```
8. https://github.com/XAMPPRocky/tokei


Obligatory jokes:
1. https://xkcd.com/1171/
2. https://xkcd.com/208/
3. https://stackoverflow.com/a/1732454/2570866 (and serious response to it https://neilmadden.blog/2019/02/24/why-you-really-can-parse-html-and-anything-else-with-regular-expressions/)

Attachment

Re: Allow declaration after statement and reformat code to use it

From
Ranier Vilela
Date:
Em qui., 19 de ago. de 2021 às 08:50, Jelte Fennema <Jelte.Fennema@microsoft.com> escreveu:
## What is this?

It's a draft patch that replaces code like this:
```c
pg_file_unlink(PG_FUNCTION_ARGS)
{
char   *filename;
requireSuperuser();
filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
```

With this shorter version:
```c
pg_file_unlink(PG_FUNCTION_ARGS)
{
requireSuperuser();
char      *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
```

## Why would we want this?
1. It removes 23328 lines of code that don't have any impact on how the code behaves [1]. This is roughly 2.7% of all the lines of code in the codebase.
2. Declarations are closer to the actual usage. This is advised by the "Code Complete" book [2] and has the following advantages:
   a. This limits variable scope to what is necessary. Which in turn makes the mental model you have to keep of a function when reading the code simpler.
   b. In most cases it allows you to see the the type of a variable without needing to go to the top of the function.
3. You can do input checking and assertions at the top of the function, instead of having to put declarations in front of it. This makes it clear which invariants hold for the function.  (as seen in the example above and the changes for pg_file_rename[3])
4. Declaring variables after statements is allowed in C99. Postgres already requires C99, so it might as well use this feature too.

## How was this changeset created?

I created a script that modifies all C files in the following way:

1. Find a non `static` declaration.
2. If it has an initializer and it is not a single variable/constant, don't consider replacing it. (reason: a function call initializer might have sideffects).
3. Otherwise (it's a simple initializer or it has no initializer at all), take the type and variable from that declaration.
4. Find the next use of the variable.
5. If the next use is not an assignment, don't do anything (the value of the original initialization is used).
6. If the next use is an assignment:
    1. Remove the old declaration
    2. Prefix the found assignment with the type
    3. Unless the variable is also used in the same line of the new initialization, e.g:
      ```c
      int a = 1;
      a = a + a;
      ```

## How does this script work?

It uses a Perl regex to search and replace! (obligatory jokes at the bottom of the email) This regex is based on the ones used in this PR to citus[4] and the similar PR to pg_auto_failover[5]. The script is in the 3rd commit of this patchset.

To visualize the regex in the script in a reasonable way, copy paste the search part of it:
```
\n\t(?!(return|static)\b)(?P<type>(\w+[\t ])+[\t *]*)(?>(?P<variable>\w+)( = [\w>\s\n-]*?)?;\n(?P<code_between>(?>(?P<comment_or_string_or_not_preprocessor>\/\*.*?\*\/|"(?>\\"|.)*?"|(?!goto)[^#]))*?)(\t)?(?=\b(?P=variable)\b))(?<=\n\t)(?<!:\n\t)(?P=variable) =(?![^;]*?[^>_]\b(?P=variable)\b[^_])
```

And paste that into https://www.debuggex.com/, then select PCRE from the dropdown. (Sharing seems to be down at this point, so this is the only way to do it at the moment) Try it out! The regex is not as crazy as it looks.

There's two important assumptions this regex makes:
1. Code is indented using tabs, and the intent level determines the scope. (this is true, because of `pgindent`)
2. Declared variables are actually used. (this is true, because we would get compiler warnings otherwise)

There's two cases where this regex has some special behaviour:
1. Stop searching for the first usage of a variable when either a `goto` or a preprocessor command is found (outside a string or comment). These things affect the control flow in a way that the regex does not understand. (any `#` character is assumed to be a preprocessor commands).
2. Don't replace if the assignment is right after a `label:`, by checking if there was a `:` as the last character on the previous line. This works around errors like this:
```
hashovfl.c:865:3: error: a label can only be part of a statement and a declaration is not a statement
   OffsetNumber maxroffnum = PageGetMaxOffsetNumber(rpage);
   ^~~~~~~~~~~~
```
Detecting this case in this way is not perfect, because sometimes there is an extra newline or a comment between the label and the assignment. This is not easily detectable by the regex, because lookbehinds cannot have a variable length in Perl (or most regex engines for that matter). For these few cases (5 in the entire codebase) a manual change was done either before or after the automatic replacement to fix them so the code compiles again. (2nd and 5th commits of this patchset)

After all these changes `make -s world` doesn't show any warnings or errors and `make check-world` succeeds. I configured compilation like this:
```
./configure --enable-cassert --enable-depend --enable-debug --with-openssl --with-libxml --with-libxslt --with-uuid=e2fs --with-icu
```

## What I want to achieve with this email

1. For people to look at a small sample of the changes made by this script. I will try to send patches with the actual changes made to the code as attachments in a followup email. However, I don't know if that will succeed since the patch files are very big and it might run into some mailserver limits. So in case that doesn't work, the full changeset is available on Github[6]. This make viewing this enormous diff quite a bit easier IMHO anyaway. If you see something weird that is not covered in the "Known issues" section below, please share it. That way it can be discussed and/or fixed.
2. A discussion on if this type of code change would be a net positive for Postgres codebase. Please explain clearly why or why not.
3. Some links to resources on what's necessary to get a big refactoring patch like this merged.

## What I don't want to achieve with this email

1. For someone to go over all the changes right now. There's likely to be changes to the script or something else. Doing a full review of the changes would be better saved for later during a final review.

## Known issues with the currently generated code

There's a few issues with the final generated code that I've already spotted. These should all be relatively easy to fix in an automated way. However, I think doing so is probably better done by `pgindent` or some other auto formatting tool, instead of with the regex. Note that I did run `pgindent`, it just didn't address these things:

1. Whitespace between type and variable is kept the same as it was before moving the declaration. If the original declaration had whitespace added in such a way that all variable names of declarations aligned, then this whitespace is preserved. This looks weird in various cases. See [3] for an example.
2. `pgindent` adds a newline after each block of declarations, even if they are not at the start of function. If this is desired is debatable, but to me it seems like it adds excessive newlines. See [7] for an example.
3. If all declarations are moved away from the top of the function, then an empty newline is kept at the top of the function. This seems like an unnecessary newline to me. See [3] for an example.

Sources:
1. `tokei`[8] results of lines of code:
Before:
```
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1361309       866514       330933       163862
===============================================================================
 Total                1389      1361309       866514       330933       163862
===============================================================================
```
After:
```
$ tokei --type C
===============================================================================
 Language            Files        Lines         Code     Comments       Blanks
===============================================================================
 C                    1389      1347515       843186       330952       173377
===============================================================================
 Total                1389      1347515       843186       330952       173377
===============================================================================
```
3.
```patch
@@ -401,11 +389,10 @@ pg_file_rename_internal(text *file1, text *file2, text *file3)
 Datum
 pg_file_unlink(PG_FUNCTION_ARGS)
 {
- char   *filename;
 
  requireSuperuser();
 
- filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
+ char   *filename = convert_and_check_filename(PG_GETARG_TEXT_PP(0));
 
  if (access(filename, W_OK) < 0)
```
7.
```patch
@@ -2863,7 +2886,8 @@ palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum)
  * longer than we must.
  */
  Buffer buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL,
- state->checkstrategy);
+ state->checkstrategy);
+
  LockBuffer(buffer, BT_READ);
 
  /*
```


Obligatory jokes:
3. https://stackoverflow.com/a/1732454/2570866 (and serious response to it https://neilmadden.blog/2019/02/24/why-you-really-can-parse-html-and-anything-else-with-regular-expressions/)
-1 in short.

C needs readability, not fewer lines.
Aside from horrible code, it doesn't improve 0.1% on anything.

I think it's a bad idea and I'm strongly against it.

regards,
Ranier Vilela
 

Re: Allow declaration after statement and reformat code to use it

From
Tom Lane
Date:
Ranier Vilela <ranier.vf@gmail.com> writes:
> Em qui., 19 de ago. de 2021 às 08:50, Jelte Fennema <
> Jelte.Fennema@microsoft.com> escreveu:
>> ## Why would we want this?
>> 1. It removes 23328 lines of code that don't have any impact on how the
>> code behaves [1]. This is roughly 2.7% of all the lines of code in the
>> codebase.

> C needs readability, not fewer lines.
> Aside from horrible code, it doesn't improve 0.1% on anything.
> I think it's a bad idea and I'm strongly against it.

Same here.  We have thirty-ish years worth of coding habits developed
around the existing C90-based layout rules.  I really doubt that a
quasi-mechanical transformation is going to result in better code style.
What it certainly will do, though, is create a pile of hazards for
back-patching.

            regards, tom lane



Re: Allow declaration after statement and reformat code to use it

From
Chapman Flack
Date:
On 08/19/21 09:38, Ranier Vilela wrote:
>> 2. Declarations are closer to the actual usage. This is advised by the
>> "Code Complete" book [2] and has the following advantages:
>>    a. This limits variable scope to what is necessary. Which in turn makes
>> the mental model you have to keep of a function when reading the code
>> simpler.
>>    b. In most cases it allows you to see the the type of a variable
>> without needing to go to the top of the function.
>> 3. You can do input checking and assertions at the top of the function,
>> instead of having to put declarations in front of it. This makes it clear
>> which invariants hold for the function.  (as seen in the example above and
>> the changes for pg_file_rename[3])

I'm in sympathy with all of those points. I've never believed that the
arbitrary separation of declaration from use that was forced by C < 99
made anything more readable. If the project were started now from scratch,
I would be all in favor of declaring at first use.

But I think Tom's concern about backpatching hazards may carry a lot
of weight in practice.

>> It uses a Perl regex to search and replace! (obligatory jokes at the
>> bottom of the email)

However, even if such an idea were to get the green light, I think I would
take the obligatory regex jokes seriously, and instead use something like
srcML [0] and do the analysis and modification on proper parse trees.

Regards,
-Chap


[0] https://www.srcml.org/about.html



Re: Allow declaration after statement and reformat code to use it

From
Tom Lane
Date:
Chapman Flack <chap@anastigmatix.net> writes:
> I'm in sympathy with all of those points. I've never believed that the
> arbitrary separation of declaration from use that was forced by C < 99
> made anything more readable. If the project were started now from scratch,
> I would be all in favor of declaring at first use.

Yeah, if we were starting today, I'm sure we'd use C99 or C++ as our
baseline and not arbitrarily exclude some language features.  But ...
we're not.  The rule against declaration-after-statement was kept
after significant discussion, which you can find in the archives if you
look.  We're not going to drop it just because one person shows up with
a patch to do so.

>>> It uses a Perl regex to search and replace! (obligatory jokes at the
>>> bottom of the email)

> However, even if such an idea were to get the green light, I think I would
> take the obligatory regex jokes seriously, and instead use something like
> srcML [0] and do the analysis and modification on proper parse trees.

Agreed.  I think the odds of introducing bugs are near 100% if it's done
without using a tool that actually understands the code semantics.

            regards, tom lane



Re: Allow declaration after statement and reformat code to use it

From
Bruce Momjian
Date:
On Thu, Aug 19, 2021 at 11:40:44AM -0400, Chapman Flack wrote:
> I'm in sympathy with all of those points. I've never believed that the
> arbitrary separation of declaration from use that was forced by C < 99
> made anything more readable. If the project were started now from scratch,
> I would be all in favor of declaring at first use.
> 
> But I think Tom's concern about backpatching hazards may carry a lot
> of weight in practice.

I think effectively it would need to be run on all supported versions,
which means the risk of introducing errors would have to be very low.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.




Re: Allow declaration after statement and reformat code to use it

From
Michael Paquier
Date:
On Thu, Aug 19, 2021 at 10:34:04AM -0400, Tom Lane wrote:
> Ranier Vilela <ranier.vf@gmail.com> writes:
>> C needs readability, not fewer lines.
>> Aside from horrible code, it doesn't improve 0.1% on anything.
>> I think it's a bad idea and I'm strongly against it.
>
> Same here.  We have thirty-ish years worth of coding habits developed
> around the existing C90-based layout rules.  I really doubt that a
> quasi-mechanical transformation is going to result in better code style.
> What it certainly will do, though, is create a pile of hazards for
> back-patching.

The back-patching hazard this could create is an argument enough to
avoid any of that IMO if done only on HEAD.  When a bug is involved,
up to 7 branches may require a fix.  That's already a lot of work.
--
Michael

Attachment
However, even if such an idea were to get the green light, I think I would
take the obligatory regex jokes seriously, and instead use something like
srcML [0] and do the analysis and modification on proper parse trees.
@Chapman I think that's a reasonable ask. This is the first time I heard of srcML. It indeed looks like it would be a much better tool than a regex to do this refactor. The refactor logic would still be roughly the same though because the regex essentially extracts and modifies a parse tree too. Just one that only has the bare minimum requirements for this specific refactor. I would like to state, that we've used this regex to refactor both Citus and pg_auto_failover, and no bugs turned up because of it so far. This is probably, because a regex is such a horrible way of writing a parser in the first place. So, any bugs in the regex will very likely create code that does not compile at all, instead of code that can be compiled but has some obscure bug .
I think effectively it would need to be run on all supported versions,
which means the risk of introducing errors would have to be very low.
@Bruce @Tom @Chapman I agree that doing this refactor only on the current master branch would cause way more pain due to backpatching, than the amount of happiness it provides by having easier to reason about code. I like the idea of avoiding this backpatching problem by running it . Luckily, running the automatic refactor on all branches should be pretty easy. However, it seems like PG11 and below only require a C89 compatible compiler. Am I correct in assuming increasing that requirement to C99 is not really an option? If so, I think it might be best if I revive this thread in ~27 months when PG11 is not supported anymore.
The rule against declaration-after-statement was kept
after significant discussion, which you can find in the archives if you
look. 
@Tom I guess my search skills in the archives are not very good, because I cannot find any big discussion on about declaration-after-statement. The only discussion I can find is this: https://www.postgresql.org/message-id/877efaol77.fsf@news-spur.riddles.org.uk If you're able to find that thread, could you share it?
C needs readability, not fewer lines.
Aside from horrible code, it doesn't improve 0.1% on anything.
@Ranier I tried to explain in my initial email that it not only reduces lines, but more importantly improves readability quite a bit IMHO:
2. Declarations are closer to the actual usage. This is advised by the "Code Complete" book [2] and has the following advantages:
   a. This limits variable scope to what is necessary. Which in turn makes the mental model you have to keep of a function when reading the code simpler.
   b. In most cases it allows you to see the the type of a variable without needing to go to the top of the function.
3. You can do input checking and assertions at the top of the function, instead of having to put declarations in front of it. This makes it clear which invariants hold for the function.  (as seen in the example above and the changes for pg_file_rename[3])
If you disagree with those statements, could you explain why?


From: Bruce Momjian <bruce@momjian.us>
Sent: Thursday, August 19, 2021 20:57
To: Chapman Flack <chap@anastigmatix.net>
Cc: Ranier Vilela <ranier.vf@gmail.com>; Jelte Fennema <Jelte.Fennema@microsoft.com>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>
Subject: [EXTERNAL] Re: Allow declaration after statement and reformat code to use it
 
[You don't often get email from bruce@momjian.us. Learn why this is important at http://aka.ms/LearnAboutSenderIdentification.]

On Thu, Aug 19, 2021 at 11:40:44AM -0400, Chapman Flack wrote:
> I'm in sympathy with all of those points. I've never believed that the
> arbitrary separation of declaration from use that was forced by C < 99
> made anything more readable. If the project were started now from scratch,
> I would be all in favor of declaring at first use.
>
> But I think Tom's concern about backpatching hazards may carry a lot
> of weight in practice.

I think effectively it would need to be run on all supported versions,
which means the risk of introducing errors would have to be very low.

--
  Bruce Momjian  <bruce@momjian.us>        https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmomjian.us%2F&amp;data=04%7C01%7CJelte.Fennema%40microsoft.com%7C9dc5a18c6b2947842ea408d963432ccf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637649963502570152%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=ZVU2bVto4aJkmh5c6Yz1Ey3pJpGHrBhgFzTDBI1Z2jg%3D&amp;reserved=0
  EDB                                      https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fenterprisedb.com%2F&amp;data=04%7C01%7CJelte.Fennema%40microsoft.com%7C9dc5a18c6b2947842ea408d963432ccf%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637649963502570152%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=N25dABUPg3FMRdRMihMcZb4m1oB2rvko%2FXd2rfuCDFY%3D&amp;reserved=0

  If only the physical world exists, free will is an illusion.

Oops. I'm sorry. That previous email looked horrible in plaintext. Please
regard that as unsent. Same email, but second try:

> However, even if such an idea were to get the green light, I think I would
> take the obligatory regex jokes seriously, and instead use something like
> srcML [0] and do the analysis and modification on proper parse trees.

@Chapman I think that's a reasonable ask. This is the first time I heard of
srcML. It indeed looks like it would be a much better tool than a regex to do
this refactor. The refactor logic would still be roughly the same though
because the regex essentially extracts and modifies a parse tree too. Just one
that only has the bare minimum requirements for this specific refactor. I would
like to state, that we've used this regex to refactor both Citus and
pg_auto_failover, and no bugs turned up because of it so far. This is probably,
because a regex is such a horrible way of writing a parser in the first place.
So, any bugs in the regex will very likely create code that does not compile at
all, instead of code that can be compiled but has some obscure bug .

> I think effectively it would need to be run on all supported versions,
> which means the risk of introducing errors would have to be very low.

@Bruce @Tom @Chapman I agree that doing this refactor only on the current
master branch would cause way more pain due to backpatching, than the amount of
happiness it provides by having easier to reason about code. I like the idea of
avoiding this backpatching problem by running it . Luckily, running the
automatic refactor on all branches should be pretty easy. However, it seems
like PG11 and below only require a C89 compatible compiler. Am I correct in
assuming increasing that requirement to C99 is not really an option? If so, I
think it might be best if I revive this thread in ~27 months when PG11 is not
supported anymore.

> The rule against declaration-after-statement was kept
> after significant discussion, which you can find in the archives if you
> look. 

@Tom I guess my search skills in the archives are not very good, because I
cannot find any big discussion on about declaration-after-statement. The only
discussion I can find is this: 
https://www.postgresql.org/message-id/877efaol77.fsf@news-spur.riddles.org.uk 

If you're able to find the thread that you meant, could you please share it?

> C needs readability, not fewer lines.
> Aside from horrible code, it doesn't improve 0.1% on anything.

@Ranier I tried to explain in my initial email that it not only reduces lines,
but more importantly improves readability quite a bit IMHO:

> 2. Declarations are closer to the actual usage. This is advised by the
>    "Code Complete" book [2] and has the following advantages:
>    a. This limits variable scope to what is necessary. Which in turn makes
>       the mental model you have to keep of a function when reading the code
>       simpler.
>    b. In most cases it allows you to see the the type of a variable without
>       needing to go to the top of the function.
> 3. You can do input checking and assertions at the top of the function,
>    instead of having to put declarations in front of it. This makes it clear
>    which invariants hold for the function.  (as seen in the example above and
>    the changes for pg_file_rename[3])

If you disagree with those statements, could you explain why?


Em sex., 20 de ago. de 2021 às 12:29, Jelte Fennema <Jelte.Fennema@microsoft.com> escreveu:
However, even if such an idea were to get the green light, I think I would
take the obligatory regex jokes seriously, and instead use something like
srcML [0] and do the analysis and modification on proper parse trees.
@Chapman I think that's a reasonable ask. This is the first time I heard of srcML. It indeed looks like it would be a much better tool than a regex to do this refactor. The refactor logic would still be roughly the same though because the regex essentially extracts and modifies a parse tree too. Just one that only has the bare minimum requirements for this specific refactor. I would like to state, that we've used this regex to refactor both Citus and pg_auto_failover, and no bugs turned up because of it so far. This is probably, because a regex is such a horrible way of writing a parser in the first place. So, any bugs in the regex will very likely create code that does not compile at all, instead of code that can be compiled but has some obscure bug .
I think effectively it would need to be run on all supported versions,
which means the risk of introducing errors would have to be very low.
@Bruce @Tom @Chapman I agree that doing this refactor only on the current master branch would cause way more pain due to backpatching, than the amount of happiness it provides by having easier to reason about code. I like the idea of avoiding this backpatching problem by running it . Luckily, running the automatic refactor on all branches should be pretty easy. However, it seems like PG11 and below only require a C89 compatible compiler. Am I correct in assuming increasing that requirement to C99 is not really an option? If so, I think it might be best if I revive this thread in ~27 months when PG11 is not supported anymore.
The rule against declaration-after-statement was kept
after significant discussion, which you can find in the archives if you
look. 
@Tom I guess my search skills in the archives are not very good, because I cannot find any big discussion on about declaration-after-statement. The only discussion I can find is this: https://www.postgresql.org/message-id/877efaol77.fsf@news-spur.riddles.org.uk If you're able to find that thread, could you share it?
C needs readability, not fewer lines.
Aside from horrible code, it doesn't improve 0.1% on anything.
@Ranier I tried to explain in my initial email that it not only reduces lines, but more importantly improves readability quite a bit IMHO:
2. Declarations are closer to the actual usage. This is advised by the "Code Complete" book [2] and has the following advantages:
   a. This limits variable scope to what is necessary. Which in turn makes the mental model you have to keep of a function when reading the code simpler.
   b. In most cases it allows you to see the the type of a variable without needing to go to the top of the function.
3. You can do input checking and assertions at the top of the function, instead of having to put declarations in front of it. This makes it clear which invariants hold for the function.  (as seen in the example above and the changes for pg_file_rename[3])
If you disagree with those statements, could you explain why?
2. Declarations are closer to the actual usage. 
a. This limits variable scope to what is necessary
I don't see this as an advantage, even for someone learning C today, which is not the case for 99% of hackers here.
With all declarations first, im my mental model, I can see all variables together and
and all your relationships and dependencies.
Spreading out the declarations, I would have to read and reread the code to
understand the interdependencies of all the variables and their types.

b. In most cases it allows you to see the the type of a variable without needing to go to the top of the function.
For one variable, perhaps, but I can see all of them and because of one variable I have the opportunity to find some fault in the other variables.
I agree that the Postgres code does not help much, because there is no clear organization.
As I do in my codes, where I organize the variables by type and use, aligned, which greatly facilitates the complete view.
What in my opinion would be the ideal path for Postgres, but it involves changing the habits of people who code for years.
What are you proposing to do now, with a high chance of increasing bugs.

3. You can do input checking and assertions at the top of the function, instead of having to put declarations in front of it. This makes it clear which invariants hold for the function.
And you can easily forget to validate some variable you're not seeing is several lines down.

There is a reason why GMs Brian Kernighan and Dennis Ritchie made the C89, less buggy.
IMHO C99 makes it easy to make more mistakes.
One more step and we won't even need to declare a variable.

regards,
Ranier Vilela

Re: [EXTERNAL] Re: Allow declaration after statement and reformat code to use it

From
Andrew Dunstan
Date:
On 8/20/21 12:30 PM, Ranier Vilela wrote:
>
>
> There is a reason why GMs Brian Kernighan and Dennis Ritchie made the
> C89, less buggy.
> IMHO C99 makes it easy to make more mistakes.
> One more step and we won't even need to declare a variable.
>
>

I've used both styles in different languages over the years. I don't
know that one is a clearly better way, notwithstanding what various
theorists might say.

Note that in C89 it's fantastically easy to put the declaration as close
as you like to the first use of a variable. All it takes is a pair of
braces enclosing the variable scope.

Even if we were to relax our rules on this, making wholesale changes
along these lines and even more backpatching them seem out of the question.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Em sáb., 21 de ago. de 2021 às 12:02, Andrew Dunstan <andrew@dunslane.net> escreveu:

On 8/20/21 12:30 PM, Ranier Vilela wrote:
>
>
> There is a reason why GMs Brian Kernighan and Dennis Ritchie made the
> C89, less buggy.
> IMHO C99 makes it easy to make more mistakes.
> One more step and we won't even need to declare a variable.
>
>

I've used both styles in different languages over the years. I don't
know that one is a clearly better way, notwithstanding what various
theorists might say.
IMO declarations first is better, without a shadow of a doubt.
Maybe because my mind is vision.
Seeing the complete view, I can look for bugs and find some way to improve performance.
It's one of the reasons for having small, well-defined functions.
Of course, with extra reward, you can and should reduce the scope.
In this case the compiler does a better job and generates more optimized code.

XLogRecPtr
XLogInsertRecord(XLogRecData *rdata,
                              XLogRecPtr fpw_lsn,
                              uint8 flags,
                              int num_fpi)
{
           XLogCtlInsert *Insert = &XLogCtl->Insert;                   /* What is the purpose of this variable */
           XLogRecord *rechdr = (XLogRecord *) rdata->data;  /* What is the purpose of this variable */
           XLogRecPtr StartPos;                                                 /* What is the purpose of this variable */
           XLogRecPtr EndPos;                                                  /* What is the purpose of this variable */
           pg_crc32c rdata_crc;                                                 /* What is the purpose of this variable */
           uint8 info = rechdr->xl_info & ~XLR_INFO_MASK;
           bool isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID && info == XLOG_SWITCH);
           bool prevDoPageWrites = doPageWrites;
           bool inserted;

           /* we assume that all of the record header is in the first chunk */
          Assert(rdata->len >= SizeOfXLogRecord);

          /* cross-check on whether we should be here or not */
         if (!XLogInsertAllowed())
             elog(ERROR, "cannot make new WAL entries during recovery");

All declarations organized by type and use, almost a structure.
No need to find or think, just see.
We can immediately connect variable info with isLogSwitch.


Note that in C89 it's fantastically easy to put the declaration as close
as you like to the first use of a variable. All it takes is a pair of
braces enclosing the variable scope.
IMO is an exception, which does not follow the C89 pattern.
It's a valid last resource, of course, but not as good practice.


Even if we were to relax our rules on this, making wholesale changes
along these lines and even more backpatching them seem out of the question.
It would be a nightmare.

regards,
Ranier Vilela