Thread: Copypasta in the PostgreSQL source

Copypasta in the PostgreSQL source

From

David Fetter

Date:

17 December 2018, 17:35:55

Folks,

Please find attached a run of a tool that looks for duplicated tokens.
I've removed some things that seem like false positives, basically all
from the stemmer part of the source, but there's still a lot.

- Is it worth trying to track these down and factor them out?
- Would it make sense to make some kind of git commit trigger that at
  least warns when a new one has been introduced?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Attachment

copypasta.txt.gz

Re: Copypasta in the PostgreSQL source

From

Tom Lane

Date:

17 December 2018, 22:31:22

David Fetter <david@fetter.org> writes:
> Please find attached a run of a tool that looks for duplicated tokens.
> I've removed some things that seem like false positives, basically all
> from the stemmer part of the source, but there's still a lot.

I thought you were talking about problems like "that that" typos,
but on looking at the file, what this is actually complaining
about is any duplicated code segments anywhere.  I do not find
this helpful.  Refactoring to the point that dozen-line code
stanzas never appear more than once would be incredibly invasive,
likely very bad for performance, and I don't think it'd improve
readability either.

> - Would it make sense to make some kind of git commit trigger that at
>   least warns when a new one has been introduced?

Commit triggers are NOT the place for heuristics about code quality,
even if they're well-considered heuristics.

            regards, tom lane

Re: Copypasta in the PostgreSQL source

From

Alvaro Herrera

Date:

17 December 2018, 22:35:23

On 2018-Dec-17, Tom Lane wrote:

> David Fetter <david@fetter.org> writes:
> > Please find attached a run of a tool that looks for duplicated tokens.
> > I've removed some things that seem like false positives, basically all
> > from the stemmer part of the source, but there's still a lot.
> 
> I thought you were talking about problems like "that that" typos,
> but on looking at the file, what this is actually complaining
> about is any duplicated code segments anywhere.  I do not find
> this helpful.  Refactoring to the point that dozen-line code
> stanzas never appear more than once would be incredibly invasive,
> likely very bad for performance, and I don't think it'd improve
> readability either.

Agreed.  Skimming the report, we get some very silly duplications, such
as a function definition duplicating its prototype.

I agree that this is mostly unhelpful noise and we shouldn't spend too
much time on it.

On the other hand, I'm not clear on why do we need four copies of
number_of_ones.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Copypasta in the PostgreSQL source

From

Tom Lane

Date:

17 December 2018, 22:46:29

Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On the other hand, I'm not clear on why do we need four copies of
> number_of_ones.

Yeah, it might be time to move something like that into a common
location.  Not sure where the threshold of pain is, though.

            regards, tom lane

Re: Copypasta in the PostgreSQL source

From

David Fetter

Date:

17 December 2018, 23:36:55

On Mon, Dec 17, 2018 at 05:31:22PM -0500, Tom Lane wrote:
> David Fetter <david@fetter.org> writes:
> > Please find attached a run of a tool that looks for duplicated
> > tokens.  I've removed some things that seem like false positives,
> > basically all from the stemmer part of the source, but there's
> > still a lot.
> 
> I thought you were talking about problems like "that that" typos,
> but on looking at the file, what this is actually complaining about
> is any duplicated code segments anywhere.  I do not find this
> helpful.  Refactoring to the point that dozen-line code stanzas
> never appear more than once would be incredibly invasive, likely
> very bad for performance, and I don't think it'd improve readability
> either.

Is there a threshold, possibly much larger than a dozen lines, where
such refactoring would actually make sense?

> > - Would it make sense to make some kind of git commit trigger that
> > at least warns when a new one has been introduced?
> 
> Commit triggers are NOT the place for heuristics about code quality,
> even if they're well-considered heuristics.

Where's a better place for such heuristics?  I'd like to think that
there are opportunities for more automation than we currently have, on
that score.  Maybe a `make` target?

Best,
David.
-- 
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

Re: Copypasta in the PostgreSQL source

From

Andres Freund

Date:

17 December 2018, 23:41:23

Hi,

On 2018-12-18 00:36:55 +0100, David Fetter wrote:
> On Mon, Dec 17, 2018 at 05:31:22PM -0500, Tom Lane wrote:
> > David Fetter <david@fetter.org> writes:
> > > Please find attached a run of a tool that looks for duplicated
> > > tokens.  I've removed some things that seem like false positives,
> > > basically all from the stemmer part of the source, but there's
> > > still a lot.
> > 
> > I thought you were talking about problems like "that that" typos,
> > but on looking at the file, what this is actually complaining about
> > is any duplicated code segments anywhere.  I do not find this
> > helpful.  Refactoring to the point that dozen-line code stanzas
> > never appear more than once would be incredibly invasive, likely
> > very bad for performance, and I don't think it'd improve readability
> > either.
> 
> Is there a threshold, possibly much larger than a dozen lines, where
> such refactoring would actually make sense?

Sure, sometimes. But it seems unlikely that a report based on the
technology at hand here would be useful. If none of the code has changed
in recent times, it's not that usually worthy of a lot of attention to
adapt, unless we're talking much larger pieces of code. And a fair bit
of what's pointed out in that report is well known (e.g. that ecpg
duplicates a lot of code in a horrible manner).  I think to be useful
it'd need to actually point out very similar code, that's starting to
diverge further - but that's obviously a much harder thing to achieve.

> > > - Would it make sense to make some kind of git commit trigger that
> > > at least warns when a new one has been introduced?
> > 
> > Commit triggers are NOT the place for heuristics about code quality,
> > even if they're well-considered heuristics.
> 
> Where's a better place for such heuristics?  I'd like to think that
> there are opportunities for more automation than we currently have, on
> that score.  Maybe a `make` target?

If we find useful targets, we can integrate them that way. I don't
consider this to be a worthy case though.

Greetings,

Andres Freund

Re: Copypasta in the PostgreSQL source

From

Thomas Munro

Date:

18 December 2018, 00:13:22

On Tue, Dec 18, 2018 at 9:46 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > On the other hand, I'm not clear on why do we need four copies of
> > number_of_ones.
>
> Yeah, it might be time to move something like that into a common
> location.  Not sure where the threshold of pain is, though.

Yeah.  I think we need bitutils.c as discussed here:

https://www.postgresql.org/message-id/flat/CAEepm%3D3k%2B%2BYtf2LNQCvpP6m1%3DgY9zZHP_cfnn47%3DWTsoCrLCvA%40mail.gmail.com

I sort of foundered on the the difficulty of using popcnt and
bitscan/log2 instructions/builtins without putting a runtime test and
a honking great function pointer in front of them.  +1 for at least
deduplicating all these implementations as discussed in that thread.

-- 
Thomas Munro
http://www.enterprisedb.com