Thread: remove duplicated words in comments .. across lines
Resending to -hackers as I realized this isn't a documentation issue, so not
appropriate or apparently interesting to readers of -doc.

Inspired by David's patch [0], find attached a patch fixing words duplicated
across line boundaries.

I should probably just call the algorithm proprietary, but if you really
wanted to know, I've suffered again through sed's black/slashes:

time find . -name '*.c' -o -name '*.h' |xargs sed -srn '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'

Alternately:

time for f in `find . -name '*.c' -o -name '*.h'`; do x=`<"$f" sed -rn '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'`; [ -n "$x" ] && echo "$f:" && echo "$x"; done |less

[0] https://www.postgresql.org/message-id/flat/CAKJS1f8du35u5DprpykWvgNEScxapbWYJdHq%2Bz06Wj3Y2KFPbw%40mail.gmail.com

PS. Not unrelated: http://3.bp.blogspot.com/-qgW9kcbSh-Q/T5olkOrTWVI/AAAAAAAAAB0/BQhmO5AW_QQ/s1600/4de3efb5846e117e579edc91d6dceb9c.jpg
On Fri, Sep 07, 2018 at 08:31:09PM -0500, Justin Pryzby wrote:
> Resending to -hackers as I realized this isn't a documentation issue so not
> appropriate or apparently interesting to readers of -doc.
>
> I should probably just call the algorithm proprietary, but if you
> really wanted to know, I've suffered again through sed's
> black/slashes.
>
> [...]
>
> Alternately:
> time for f in `find . -name '*.c' -o -name '*.h'`; do x=`<"$f" sed -rn
> '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g;
> /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'`; [ -n
> "$x" ] && echo "$f:" && echo "$x"; done |less

This generates a lot of false positives, like "that that", which is
grammatically fine, and it fails to ignore entries separated by multiple
lines, but the concept is cool. Respect for building that.

I looked at what the command above produces, and it seems to me that you
have spotted all the spots which are problematic, so committed after
applying proper indentation, which was incorrect in two places.
--
Michael
Justin Pryzby <pryzby@telsasoft.com> wrote:
> Resending to -hackers as I realized this isn't a documentation issue so not
> appropriate or apparently interesting to readers of -doc.
>
> Inspired by David's patch [0], find attached fixing words duplicated,
> across line boundaries.
>
> I should probably just call the algorithm proprietary, but if you really
> wanted to know, I've suffered again through sed's black/slashes.
>
> time find . -name '*.c' -o -name '*.h' |xargs sed -srn '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'
>
> Alternately:
> time for f in `find . -name '*.c' -o -name '*.h'`; do x=`<"$f" sed -rn '/\/\*/!d; :l; /\*\//!{N; b l}; s/\n[[:space:]]*\*/\n/g; /(\<[[:alpha:]]{1,})\>\n[[:space:]]*\<\1\>/!d; s//>>&<</; p'`; [ -n "$x" ] && echo "$f:" && echo "$x"; done |less

Alternatively, you could have used awk, as it can maintain variables across
lines. This is a script that I used to find those duplicates in a single
file (just out of fun, I know that your findings have already been
processed):

BEGIN { prev_line_last_token = "" }
{
    if (NF > 1 && $1 == "*" && length(prev_line_last_token) > 0)
    {
        if ($2 == prev_line_last_token &&
            # Characters used in ASCII charts are not duplicate words.
            $2 != "|" && $2 != "}")
            # Found a duplicate.
            printf("%s:%s, duplicate token: %s\n", FILENAME, FNR, $2);
    }

    if (NF > 1 && ($1 == "*" || $1 == "/*"))
        prev_line_last_token = $NF;
    else
    {
        # Empty line or not a comment line. Start a new search.
        prev_line_last_token = "";
    }
}

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com