Hi hackers,
In my analytics work, I frequently do extensive correlation discovery:
given a list of columns, run corr(X, Y) over different pairs and see which pairs score high.
Standard Postgres offers the well-known corr(X, Y),
which computes the classic Pearson correlation coefficient.
Its main drawback is that it only detects linear associations.
Over the years, several measures have been proposed that can detect non-linear relationships as well,
including the Kendall rank correlation and the Maximal Information Coefficient.
The latest celebrity in the area is the xi (ξ) correlation coefficient proposed by Chatterjee [0].
It's rank-based, and is very appealing due to its relatively simple implementation.
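
For anyone unfamiliar with the measure, here is a minimal Python sketch of the no-ties formula from [0]: sort the rows by X, take the ranks r_i of Y in that order, and compute 1 - 3 * sum(|r_{i+1} - r_i|) / (n^2 - 1). This is only an illustration, not the code from the patch or from pgxicor; the function name xi_no_ties is made up for this example, and it assumes all X values and all Y values are distinct.

```python
def xi_no_ties(xs, ys):
    """Chatterjee's xi for paired samples, assuming no ties in xs or ys."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two observations")

    # Rank of each y value (1..n); valid only because ys are assumed distinct.
    y_rank = [0] * n
    for rank, i in enumerate(sorted(range(n), key=lambda i: ys[i]), start=1):
        y_rank[i] = rank

    # Walk the rows in increasing x order and collect the y ranks.
    order = sorted(range(n), key=lambda i: xs[i])
    r = [y_rank[i] for i in order]

    # Sum of absolute differences between consecutive y ranks.
    total = sum(abs(r[k + 1] - r[k]) for k in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


# A shifted parabola: non-monotonic in x, with no ties in y.
xs = list(range(-10, 11))
ys = [(x - 0.3) ** 2 for x in xs]
print(xi_no_ties(xs, ys))
```

Note that xi measures how well Y is determined by X as a function, not the direction of the relationship, so an increasing and a decreasing sequence score the same. For small n even a perfect functional relationship stays well below 1 (the no-ties maximum is 1 - 3/(n+1)).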
I've already released pgxicor [1], an extension.
However, since SciPy has already added this to its library [2], I thought I'd propose it for core PG as well.
Here’s a first cut of a patch. At this stage, I’m mainly looking to gauge interest in including this in core.
Future versions will likely refine the implementation details (e.g., use ArrayType instead of a growable buffer of doubles,
revisit the way ties are handled, and decide whether clamping of negative values is appropriate).