Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee - Mailing list pgsql-hackers

From Florents Tselai
Subject Add xicorr(X, Y): support for the xi (ξ) correlation coefficient by Chatterjee
Date
Msg-id CA+v5N43gerwrQMX=Zv7sKF3XDmgef+jsiwGW-cu3X=gNLo8_1A@mail.gmail.com
List pgsql-hackers
Hi hackers, 

In my analytics work, I frequently do extensive correlation discovery:
given a list of columns, I run corr(X, Y) over different pairs and see which pairs score high.
Standard Postgres as-is offers the well-known corr(X, Y),
which is based on the classic Pearson correlation.
Its main drawback is that it only detects linear associations.

Several measures have been proposed that can detect non-linear relationships as well,
including the Kendall rank correlation (for monotonic relationships) and, within the last 20 years, the Maximal Information Coefficient.

The latest celebrity in the area is the xi (ξ) correlation coefficient proposed by Chatterjee [0].
It's rank-based, and is very appealing due to its relatively simple implementation. 
You can view a by-hand computation in this video (https://www.youtube.com/watch?v=2OTHH8wz25c).
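
To make the "relatively simple implementation" concrete, here is a minimal Python sketch of the coefficient as I understand it from [0]. It is not the code in the patch or in pgxicor. The function name xicorr simply mirrors the proposed SQL name. Two choices here are assumptions: ties in X are broken by input order (the paper breaks them at random), and a constant Y returns NaN.

```python
from bisect import bisect_left, bisect_right


def xicorr(xs, ys):
    """Chatterjee's xi for paired samples (illustrative sketch, not the patch)."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < 2:
        raise ValueError("need at least two pairs")

    # Order the pairs by X. sorted() is stable, so ties in X keep input order
    # (assumption: the paper breaks X ties uniformly at random instead).
    order = sorted(range(n), key=lambda i: xs[i])

    sorted_y = sorted(ys)
    # r[k]: number of j with y_j <= y of the k-th pair in X order.
    r = [bisect_right(sorted_y, ys[i]) for i in order]
    # l[i]: number of j with y_j >= y_i (only needed when Y has ties).
    l = [n - bisect_left(sorted_y, y) for y in ys]

    numerator = n * sum(abs(r[k + 1] - r[k]) for k in range(n - 1))
    denominator = 2 * sum(li * (n - li) for li in l)
    if denominator == 0:
        # Y is constant, so xi is undefined (assumption: report NaN).
        return float("nan")
    return 1.0 - numerator / denominator
```

With no ties in Y, the denominator reduces to n(n^2 - 1)/3, so the expression becomes the familiar 1 - 3 * sum|r[k+1] - r[k]| / (n^2 - 1). For y = x with n distinct values this gives (n - 2)/(n + 1), which approaches 1 only as n grows. For y = x^2 on the integers -5..5 it gives 0.5, whereas the Pearson-based corr() is 0 on the same data.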

I've already released pgxicor [1], an extension.
However, since SciPy has already added this to its library [2], I thought I'd propose it for core PG as well.

Here’s a first cut of a patch. At this stage, I’m mainly looking to gauge interest in including this in core.
Future versions will likely refine the implementation details (e.g., use ArrayType instead of a growable buffer of doubles, 
revisit the way ties are handled, and decide whether clamping of negative values is appropriate).


Attachment
