Thread: remove duplicated words in comments .. across lines
Resending to -hackers as I realized this isn't a documentation issue, so not
appropriate or apparently interesting to readers of -doc.

Inspired by David's patch [0], find attached a patch fixing words duplicated
across line boundaries.

I should probably just call the algorithm proprietary, but if you really
wanted to know, I've suffered again through sed's black/slashes:

time find . -name '*.c' -o -name '*.h' |xargs sed -srn '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'

Alternately:

time for f in `find . -name '*.c' -o -name '*.h'`; do x=`<"$f" sed -rn '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'`; [ -n "$x" ] && echo "$f:" && echo "$x"; done |less

[0] https://www.postgresql.org/message-id/flat/CAKJS1f8du35u5DprpykWvgNEScxapbWYJdHq%2Bz06Wj3Y2KFPbw%40mail.gmail.com

PS. Not unrelated: http://3.bp.blogspot.com/-qgW9kcbSh-Q/T5olkOrTWVI/AAAAAAAAAB0/BQhmO5AW_QQ/s1600/4de3efb5846e117e579edc91d6dceb9c.jpg
On Fri, Sep 07, 2018 at 08:31:09PM -0500, Justin Pryzby wrote:
> Resending to -hackers as I realized this isn't a documentation issue so not
> appropriate or apparently interesting to readers of -doc.
>
> I should probably just call the algorithm proprietary, but if you
> really wanted to know, I've suffered again through sed's
> black/slashes.
>
> [...]
>
> Alternately:
> time for f in `find . -name '*.c' -o -name '*.h'`; do x=`<"$f" sed -rn
> '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g;
> /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'`; [ -n
> "$x" ] && echo "$f:" && echo "$x"; done |less

This generates a lot of false positives, like "that that", which is
grammatically fine, and it fails to ignore entries separated by multiple
lines, but the concept is cool. Respect for building that.

I looked at what the command above produces, and it seems to me that you
have spotted all the spots which are problematic, so committed after
applying proper indentation, which was incorrect in two places.
--
Michael
Justin Pryzby <pryzby@telsasoft.com> wrote:
> Resending to -hackers as I realized this isn't a documentation issue so not
> appropriate or apparently interesting to readers of -doc.
>
> Inspired by David's patch [0], find attached fixing words duplicated,
> across line boundaries.
>
> I should probably just call the algorithm proprietary, but if you really
> wanted to know, I've suffered again through sed's black/slashes.
>
> time find . -name '*.c' -o -name '*.h' |xargs sed -srn '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'
>
> Alternately:
> time for f in `find . -name '*.c' -o -name '*.h'`; do x=`<"$f" sed -rn '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'`; [ -n "$x" ] && echo "$f:" && echo "$x"; done |less

Alternatively, you could have used awk, as it can maintain variables across
lines. This is a script that I used to find those duplicates in a single
file (just out of fun, I know that your findings have already been
processed):

BEGIN { prev_line_last_token = "" }
{
    if (NF > 1 && $1 == "*" && length(prev_line_last_token) > 0)
    {
        if ($2 == prev_line_last_token &&
            # Characters used in ASCII charts are not duplicate words.
            $2 != "|" && $2 != "}")
            # Found a duplicate.
            printf("%s:%s, duplicate token: %s\n", FILENAME, FNR, $2);
    }

    if (NF > 1 && ($1 == "*" || $1 == "/*"))
        prev_line_last_token = $NF;
    else
    {
        # Empty line or not a comment line. Start a new search.
        prev_line_last_token = "";
    }
}

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com