Thread: Copypasta in the PostgreSQL source
Folks, Please find attached a run of a tool that looks for duplicated tokens. I've removed some things that seem like false positives, basically all from the stemmer part of the source, but there's still a lot. - Is it worth trying to track these down and factor them out? - Would it make sense to make some kind of git commit trigger that at least warns when a new one has been introduced? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Attachment
David Fetter <david@fetter.org> writes: > Please find attached a run of a tool that looks for duplicated tokens. > I've removed some things that seem like false positives, basically all > from the stemmer part of the source, but there's still a lot. I thought you were talking about problems like "that that" typos, but on looking at the file, what this is actually complaining about is any duplicated code segments anywhere. I do not find this helpful. Refactoring to the point that dozen-line code stanzas never appear more than once would be incredibly invasive, likely very bad for performance, and I don't think it'd improve readability either. > - Would it make sense to make some kind of git commit trigger that at > least warns when a new one has been introduced? Commit triggers are NOT the place for heuristics about code quality, even if they're well-considered heuristics. regards, tom lane
On 2018-Dec-17, Tom Lane wrote: > David Fetter <david@fetter.org> writes: > > Please find attached a run of a tool that looks for duplicated tokens. > > I've removed some things that seem like false positives, basically all > > from the stemmer part of the source, but there's still a lot. > > I thought you were talking about problems like "that that" typos, > but on looking at the file, what this is actually complaining > about is any duplicated code segments anywhere. I do not find > this helpful. Refactoring to the point that dozen-line code > stanzas never appear more than once would be incredibly invasive, > likely very bad for performance, and I don't think it'd improve > readability either. Agreed. Skimming the report, we get some very silly duplications, such as a function definition duplicating its prototype. I agree that this is mostly unhelpful noise and we shouldn't spend too much time on it. On the other hand, I'm not clear on why do we need four copies of number_of_ones. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On the other hand, I'm not clear on why do we need four copies of > number_of_ones. Yeah, it might be time to move something like that into a common location. Not sure where the threshold of pain is, though. regards, tom lane
On Mon, Dec 17, 2018 at 05:31:22PM -0500, Tom Lane wrote: > David Fetter <david@fetter.org> writes: > > Please find attached a run of a tool that looks for duplicated > > tokens. I've removed some things that seem like false positives, > > basically all from the stemmer part of the source, but there's > > still a lot. > > I thought you were talking about problems like "that that" typos, > but on looking at the file, what this is actually complaining about > is any duplicated code segments anywhere. I do not find this > helpful. Refactoring to the point that dozen-line code stanzas > never appear more than once would be incredibly invasive, likely > very bad for performance, and I don't think it'd improve readability > either. Is there a threshold, possibly much larger than a dozen lines, where such refactoring would actually make sense? > > - Would it make sense to make some kind of git commit trigger that > > at least warns when a new one has been introduced? > > Commit triggers are NOT the place for heuristics about code quality, > even if they're well-considered heuristics. Where's a better place for such heuristics? I'd like to think that there are opportunities for more automation than we currently have, on that score. Maybe a `make` target? Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Hi, On 2018-12-18 00:36:55 +0100, David Fetter wrote: > On Mon, Dec 17, 2018 at 05:31:22PM -0500, Tom Lane wrote: > > David Fetter <david@fetter.org> writes: > > > Please find attached a run of a tool that looks for duplicated > > > tokens. I've removed some things that seem like false positives, > > > basically all from the stemmer part of the source, but there's > > > still a lot. > > > > I thought you were talking about problems like "that that" typos, > > but on looking at the file, what this is actually complaining about > > is any duplicated code segments anywhere. I do not find this > > helpful. Refactoring to the point that dozen-line code stanzas > > never appear more than once would be incredibly invasive, likely > > very bad for performance, and I don't think it'd improve readability > > either. > > Is there a threshold, possibly much larger than a dozen lines, where > such refactoring would actually make sense? Sure, sometimes. But it seems unlikely that a report based on the technology at hand here would be useful. If none of the code has changed in recent times, it's not that usually worthy of a lot of attention to adapt, unless we're talking much larger pieces of code. And a fair bit of what's pointed out in that report is well known (e.g. that ecpg duplicates a lot of code in a horrible manner). I think to be useful it'd need to actually point out very similar code, that's starting to diverge further - but that's obviously a much harder thing to achieve. > > > - Would it make sense to make some kind of git commit trigger that > > > at least warns when a new one has been introduced? > > > > Commit triggers are NOT the place for heuristics about code quality, > > even if they're well-considered heuristics. > > Where's a better place for such heuristics? I'd like to think that > there are opportunities for more automation than we currently have, on > that score. Maybe a `make` target? If we find useful targets, we can integrate them that way. I don't consider this to be a worthy case though. Greetings, Andres Freund
On Tue, Dec 18, 2018 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > On the other hand, I'm not clear on why do we need four copies of > > number_of_ones. > > Yeah, it might be time to move something like that into a common > location. Not sure where the threshold of pain is, though. Yeah. I think we need bitutils.c as discussed here: https://www.postgresql.org/message-id/flat/CAEepm%3D3k%2B%2BYtf2LNQCvpP6m1%3DgY9zZHP_cfnn47%3DWTsoCrLCvA%40mail.gmail.com I sort of foundered on the the difficulty of using popcnt and bitscan/log2 instructions/builtins without putting a runtime test and a honking great function pointer in front of them. +1 for at least deduplicating all these implementations as discussed in that thread. -- Thomas Munro http://www.enterprisedb.com