Hello hackers,
While working on a separate thread [1], I discovered more cases
where a link in the docs wasn't the canonical link [2].
[1] https://postgr.es/m/CAKFQuwYEX9Pj9G0ZHJeWSmSbnqUyGH+FYcW-66eZjfVG4KOjiQ@mail.gmail.com
[2] https://en.wikipedia.org/wiki/Canonical_link_element
The script below doesn't parse SGML, for example, and is broken in some
other ways too, but it is probably good enough to suggest changes that can
then be carefully verified by hand.
```
#!/bin/bash
# Log of proposed replacements, written as -old/+new pairs.
output_file="changes.log"
: > "$output_file"

# Print the canonical URL for $1, or $1 itself if none is found.
extract_canonical() {
    local url=$1
    local canonical
    canonical=$(curl -s "$url" | sed -n 's/.*<link rel="canonical" href="\([^"]*\)".*/\1/p' | head -n 1)
    if [[ -n "$canonical" && "$canonical" != "$url" ]]; then
        echo "-$url" >> "$output_file"
        echo "+$canonical" >> "$output_file"
        echo "$canonical"
    else
        echo "$url"
    fi
}

find . -type f -name '*.sgml' | while read -r file; do
    # Picks up at most one https URL per line (the last one, since .* is greedy).
    urls=$(sed -n 's/.*\(https:\/\/[^"]*\).*/\1/p' "$file")
    for url in $urls; do
        canonical_url=$(extract_canonical "$url")
        if [[ "$canonical_url" != "$url" ]]; then
            # Replace the original URL with the canonical URL in the file.
            # Note: -i '' is the BSD/macOS sed form; GNU sed wants plain -i.
            sed -i '' "s|$url|$canonical_url|g" "$file"
        fi
    done
done
```
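For illustration, the extraction step can be tried without network access on
an inline sample. This is only a sketch: the canonical_from_html helper and the
example.org page are made up here, and like the script it only matches when
rel="canonical" comes directly before href on the same line.

```shell
#!/bin/bash
# Hypothetical helper (not part of the script above): read HTML on stdin and
# print the first canonical URL found, or nothing if there is none.
canonical_from_html() {
    sed -n 's/.*<link rel="canonical" href="\([^"]*\)".*/\1/p' | head -n 1
}

# Made-up sample page; example.org is a placeholder, not a link from the docs.
sample_html='<html><head>
<link rel="canonical" href="https://example.org/docs/current/" />
</head><body></body></html>'

result=$(printf '%s\n' "$sample_html" | canonical_from_html)
echo "$result"
```

A page that writes the attributes in the other order (href before rel) yields
no output, which is one of the ways the approach can miss a canonical link.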
Most of what it found was indeed correct, but I had to undo some mistakes it made.
All the changes in the attached patch have been manually verified by clicking
the original link and observing the URL shown in the browser.
/Joel