Re: Commitfest 2023-03 starting tomorrow! - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Commitfest 2023-03 starting tomorrow!
Date
Msg-id CA+hUKGK=7mTwheXRfxz=bD47+m7WUa2xWmce0EfoycsfRN98wg@mail.gmail.com
In response to Re: Commitfest 2023-03 starting tomorrow!  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: Commitfest 2023-03 starting tomorrow!  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Tue, Mar 21, 2023 at 10:59 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> I gave a talk on Friday at a private EDB mini-conference about the
> PostgreSQL open source process; and while preparing for that one, I
> ran some 'git log' commands to obtain the number of code contributors
> for each release, going back to 9.4 (when we started using the
> 'Authors:' tag more prominently).  What I saw is a decline in the number
> of unique contributors, from its maximum at version 12, down to the
> numbers we had in 9.5.  We went back 4 years.  That scared me a lot.

Can you share the subtotals?

One immediate thought about commit log-based data is that we're not
using git Author, and the Author footer convention is only used by
some committers.  So I guess it must have been pretty laborious to
read the prose-form data?  We do have machine-readable Discussion
footers though.  By scanning those threads for SMTP From headers on
messages that had patches attached, we can find the set of (distinct)
addresses that contributed to each commit.  (I understand that some
people are co-authors and may not send an email, but if you counted
those and I didn't then you counted more, not fewer, contributors I
guess?  On the other hand if someone posted a patch that wasn't used
in the commit, or posted from two home/work/whatever accounts that's a
false positive for my technique.)
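
To make the first step concrete, here is a minimal Python sketch of
pulling message IDs out of Discussion footers.  It is not the script
behind the numbers in this message; the function name `discussion_ids`
and the two URL shapes it accepts are assumptions for illustration.

```python
import re
from urllib.parse import unquote

# Assumed footer shapes (not an exhaustive list):
#   Discussion: https://postgr.es/m/<message-id>
#   Discussion: https://www.postgresql.org/message-id/<message-id>
DISCUSSION_RE = re.compile(
    r"^Discussion:\s*https?://(?:www\.)?"
    r"(?:postgr\.es/m|postgresql\.org/message-id)/(\S+)[ \t]*$",
    re.MULTILINE,
)


def discussion_ids(commit_message):
    """Return message IDs from Discussion footers, in order, without repeats."""
    seen = []
    for raw in DISCUSSION_RE.findall(commit_message):
        msgid = unquote(raw)  # footers may percent-encode '@' as %40
        if msgid not in seen:
            seen.append(msgid)
    return seen
```

Each ID would then be used to fetch the thread and collect the From
headers of messages carrying attachments, which is the part that needs
archive access and is left out here.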

In a quick and dirty attempt at this made from bits of Python I
already had lying around (which may of course later turn out to be
flawed and need refinement), I extracted, for example:

postgres=# select * from t where commit = '8d578b9b2e37a4d9d6f422ced5126acec62365a7';
                  commit                  |          time          |                   address
------------------------------------------+------------------------+----------------------------------------------
 8d578b9b2e37a4d9d6f422ced5126acec62365a7 | 2023-03-21 14:29:34+13 | Melanie Plageman <melanieplageman@gmail.com>
 8d578b9b2e37a4d9d6f422ced5126acec62365a7 | 2023-03-21 14:29:34+13 | Thomas Munro <thomas.munro@gmail.com>
(2 rows)

You can really only go back about 5-7 years before that technique runs
out of steam, as the links run out.  For what they're worth, these
numbers seem to suggest that around 260 distinct email addresses per
year send patches to threads referenced by commits.  Maybe we're in a
3-year-long plateau, but I don't see a peak back in release 12:

postgres=# select date_trunc('year', time), count(distinct address)
from t group by 1 order by 1;
       date_trunc       | count
------------------------+-------
 2015-01-01 00:00:00+13 |    13
 2016-01-01 00:00:00+13 |    37
 2017-01-01 00:00:00+13 |   144
 2018-01-01 00:00:00+13 |   187
 2019-01-01 00:00:00+13 |   225
 2020-01-01 00:00:00+13 |   260
 2021-01-01 00:00:00+13 |   256
 2022-01-01 00:00:00+13 |   262
 2023-01-01 00:00:00+13 |   119
(9 rows)
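
For anyone who wants to try the same aggregation outside the database,
here is a rough Python sketch.  It assumes rows shaped like the table
above (commit, time, From header); `distinct_per_year` and the sample
data are made up for illustration.  Unlike the SQL, which counts the
whole "Name <addr>" string, this version keeps only the lowercased
address, which is one possible way to reduce double counting when the
same address appears with different display names.  It does nothing
about one person posting from two different accounts.

```python
from collections import defaultdict
from datetime import datetime
from email.utils import parseaddr


def distinct_per_year(rows):
    """rows: iterable of (commit_hash, commit_time, from_header) tuples."""
    by_year = defaultdict(set)
    for _commit, when, from_header in rows:
        _name, addr = parseaddr(from_header)  # drop the display name
        by_year[when.year].add(addr.lower())
    return {year: len(addrs) for year, addrs in sorted(by_year.items())}


# Hypothetical sample data, not taken from the real table.
sample = [
    ("c1", datetime(2022, 3, 1), "A Person <a@example.com>"),
    ("c2", datetime(2022, 5, 1), "A Other <A@Example.com>"),
    ("c3", datetime(2022, 6, 1), "b@example.com"),
    ("c4", datetime(2023, 1, 2), "B Person <b@example.com>"),
]
print(distinct_per_year(sample))
```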

Of course 2023 is only just getting started.  Zooming in closer, the
peak period for this measurement is March/April, as I guess a lot of
little things make it into the final push:

postgres=# select date_trunc('month', time), count(distinct address)
from t where time > '2021-01-01' group by 1 order by 1;
       date_trunc       | count
------------------------+-------
 2021-01-01 00:00:00+13 |    83
 2021-02-01 00:00:00+13 |    70
 2021-03-01 00:00:00+13 |   100
 2021-04-01 00:00:00+13 |   109
 2021-05-01 00:00:00+12 |    54
 2021-06-01 00:00:00+12 |    82
 2021-07-01 00:00:00+12 |    86
 2021-08-01 00:00:00+12 |    83
 2021-09-01 00:00:00+12 |    73
 2021-10-01 00:00:00+13 |    68
 2021-11-01 00:00:00+13 |    66
 2021-12-01 00:00:00+13 |    48
 2022-01-01 00:00:00+13 |    68
 2022-02-01 00:00:00+13 |    73
 2022-03-01 00:00:00+13 |   110
 2022-04-01 00:00:00+13 |    90
 2022-05-01 00:00:00+12 |    47
 2022-06-01 00:00:00+12 |    50
 2022-07-01 00:00:00+12 |    72
 2022-08-01 00:00:00+12 |    81
 2022-09-01 00:00:00+12 |   105
 2022-10-01 00:00:00+13 |    68
 2022-11-01 00:00:00+13 |    74
 2022-12-01 00:00:00+13 |    58
 2023-01-01 00:00:00+13 |    65
 2023-02-01 00:00:00+13 |    61
 2023-03-01 00:00:00+13 |    64
(27 rows)

Perhaps the present March is looking a little light compared to the
usual 100+ number, but actually if you take just the 1st to the 21st
of previous Marches, they were similar sorts of numbers.

postgres=# select date_trunc('month', time), count(distinct address)
           from t
           where (time >= '2022-03-01' and time <= '2022-03-21') or
                 (time >= '2021-03-01' and time <= '2021-03-21') or
                 (time >= '2020-03-01' and time <= '2020-03-21') or
                 (time >= '2019-03-01' and time <= '2019-03-21')
           group by 1 order by 1;
       date_trunc       | count
------------------------+-------
 2019-03-01 00:00:00+13 |    57
 2020-03-01 00:00:00+13 |    57
 2021-03-01 00:00:00+13 |    77
 2022-03-01 00:00:00+13 |    72
(4 rows)
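
One caveat on the query above: if `time` is a timestamp, as the output
suggests, then `time <= '2022-03-21'` compares against midnight at the
start of the 21st, so commits later that day fall outside the window.
A small Python sketch of the same comparison with the 21st included in
full follows; `march_window_counts` and the sample rows are
hypothetical, and the numbers above were not produced this way.

```python
from datetime import date, datetime


def march_window_counts(rows, years, last_day=21):
    """Count distinct addresses for commits dated March 1..last_day, inclusive."""
    rows = list(rows)  # allow a one-shot iterator to be scanned once per year
    counts = {}
    for year in years:
        start, end = date(year, 3, 1), date(year, 3, last_day)
        counts[year] = len(
            {addr for when, addr in rows if start <= when.date() <= end}
        )
    return counts


# Hypothetical sample data: (commit_time, address)
sample = [
    (datetime(2022, 3, 21, 18, 0), "a@example.com"),  # evening of the 21st: counted
    (datetime(2022, 3, 22, 9, 0), "b@example.com"),   # the 22nd: not counted
    (datetime(2021, 3, 5, 12, 0), "a@example.com"),
]
print(march_window_counts(sample, [2021, 2022]))
```

Comparing on the calendar date rather than the timestamp avoids having
to decide on an end-of-day boundary.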

Another thing we could count is distinct names in the Commitfest app.
I count 162 names in Commitfest 42 today.  Unfortunately I don't have
the data to hand to look at earlier Commitfests.  That'd be
interesting.  I've plotted that before back in 2018 for some
conference talk, and it was at ~100 and climbing back then.

> So I started a conversation about that and some people told me that it's
> very easy to be discouraged by our process.  I don't need to mention
> that it's antiquated -- this in itself turns off youngsters.  But in
> addition to that, I think newbies might be discouraged because their
> contributions seem to go nowhere even after following the process.

I don't disagree with your sentiment, though.

> This led me to suggesting that perhaps we need to be more lenient when
> it comes to new contributors.  As I said, for seasoned contributors,
> it's not a problem to keep up with our requirements, however silly they
> are.  But people who spend their evenings a whole week or month trying
> to understand how to patch for one thing that they want, to be received
> by six months of silence followed by a constant influx of "please rebase
> please rebase please rebase", no useful feedback, and termination with
> "eh, you haven't rebased for the 1001th time, your patch has been WoA
> for X days, we're setting it RwF, feel free to return next year" ...
> they are most certainly off-put and will *not* try again next year.

Right, that is pretty discouraging.


