Measure Theoretic Data Types in Postgresql - Mailing list pgsql-hackers

From Aaron Sheldon
Subject Measure Theoretic Data Types in Postgresql
Date
Msg-id CADyahXZBmBS4ceqjPYidpFN8pA7GzE6PaBeCBd3SWM47ZFRbMA@mail.gmail.com
Whole thread Raw
Responses Re: Measure Theoretic Data Types in Postgresql  (Josh Berkus <josh@agliodbs.com>)
Re: Measure Theoretic Data Types in Postgresql  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
Release 9.2 introduced the range data types, which is a much needed extension for a large number applications, particularly in statistics, and measurement. The natural generalization of a range is a finite partition of the underlying abstract data space into disjoint covering ranges; for example two partitions of the real line could be

{(-\infty, -1], (-1, 0.5], (0.5, 0.75), [0.75, 25], (25, \infty)}

and

{(-\infty,  -1), [-1, 1), [1, 1], (1, 2), [2, \infty)}

To be of use we have to assign a value to each range in the partition, this defines a simple measurable function (see Billingsley "Probability and Measure" or Rudin "Principles of Mathematical Analysis" chapter 11, for introductory material on Measure Theory). For the two example partitions we could have two functions mapping to a character datatype:

f : {(-\infty, -1] -> NULL, (-1, 0.5] -> 'A', (0.5, 0.75) -> 'B', [0.75, 25] -> 'A', (25, \infty) -> 'B'}

and

g : {(-\infty,  -1) -> 'A', [-1, 1) -> 'B', [1, 1] -> 'C', (1, 2) -> NULL, [2, \infty) -> 'A'}

The simple functions can be stored using a composite datatype containing three arrays: an n element array of ascending boundary points, an n element array of boundary topologies either left open ')[' or right open '](', and an n+1 element array of values assigned to each range in the partition. With the functions stored we can define point-wise operators for summation, product, variance, etc... and more interestingly array and string concatenation, and count distinct; the last example would result in the new function:

h = count_distinct(f, g) : {(-\infty, -1] -> 1, (-1, 0.5] -> 2, (0.5, 0.75) -> 1, [0.75, 1] -> 2, (1, 25] -> 1, (25, \infty) -> 2}
 
Which can be generalized to point-wise aggregates over finite lists of measurable functions. While the point-wise aggregate functions are powerful in and of themselves, the true power would be realised when dis-aggregating the data type into the original ranges that form the partition. This would allow for a succinct syntax to do calculations such as finding the daily unique patient count given the intervals of their attendance in particular programs; a computation I encounter routinely as a statistician for a health services provider.

I have begun sketching these ideas in off the shelf pgSQL using composite types and constructor functions; but am far from tackling anything like building external C extensions to handle the data types. I can set-up a GitHub repository if anyone is interested in seeing where I am taking this. So far I have defined the composite types (lots of them) and the constructor functions 'indicator(lower topology, lower point, upper point, upper topology)' which creates an indicator (0/1) function of a range, and 'measure(lower topology, lower point, upper point, upper topology, value)' which creates a measure function of the range; after this I will tackle the dis-aggregation functions, the operators, and finally the aggregation functions. 

I realize similar work has been done in this vain using the set of sets idea, but this work is meant to be an implementation of formal measure theoretic concepts, and thus the algorithmic behaviour can be analysed and proven using the tool kit of measure theory.
 
Aaron Sheldon


pgsql-hackers by date:

Previous
From: Cédric Villemain
Date:
Subject: Re: Truncate if exists
Next
From: Magnus Hagander
Date:
Subject: Re: September 2012 commitfest