Re: crosstab/repivot...any interest? - Mailing list pgsql-hackers

From Merlin Moncure
Subject Re: crosstab/repivot...any interest?
Date
Msg-id CAHyXU0zYpYLpUuTJdHvDmPXJQy3TdMdVjZ=dZVXzoMTYtjEfFw@mail.gmail.com
Whole thread Raw
In response to Re: crosstab/repivot...any interest?  (Joe Conway <mail@joeconway.com>)
List pgsql-hackers
On Tue, Feb 26, 2019 at 8:31 AM Joe Conway <mail@joeconway.com> wrote:
> On 2/25/19 8:34 PM, Merlin Moncure wrote:
> > The interface I'm looking at is:
> > SELECT repivot(
> >   query TEXT,
> >   static_attributes INT,  /* number of static attributes that are
> > unchanged around key; we need this in our usages */
> >   attribute_query  TEXT); /* query that draws up the pivoted attribute
> > list */
> >
> > The driving query is expected to return 0+ static attributes which are
> > essentially simply pasted to the output. The next two columns are the
> > key column and the pivoting column.   So if you had three attributes,
> > the input set would be:
> >
> > a1, a2, a3, k1, p, v1...vn
> >
> > Where the coordinates v and p would exchange.  I need to get this done
> > quickly and so am trying to avoid more abstracted designs (maybe multi
> > part keys should be supported through...this is big limitation of
> > crosstab albeit with some obnoxious work arounds).
>
> Perhaps not enough coffee yet, but I am not sure I fully grok the design
> here. A fully fleshed out example would be useful.

Thanks for your interest.

A lot of our application data is organized in columnar type formats.
Typically, the analytical reports on the back of the database are
organized withe output columns being a KPI or a unit of time.

KPI Columns:
Country,Make,Model,TimeSlice,Sales,Units
US,Ford,Fusion,Y2018Q1,165MM$,250k
US,Ford,Fusion,Y2018Q2,172MM$,261k
US,Ford,Fusion,Y2018Q3,183MM$,268k
...

Time Columns:
Country,Make,Mode, KPI,Y2018Q1,Y2018Q2,Y2018Q3
US,Ford,Fusion,Y2018Q1,Sales,165MM$,172MM$,183MM$
US,Ford,Fusion,Y2018Q2,Units,250k,261k,268k

SELECT repivot(
  <query pulling KPI format>,
  1, /* only one static attribute */
  2, /* number of key columns, only needed if multi part keys are supported */
  <query developing list of time columns>);

In this example supporting multi-part, the repivot column is the 4th,
assumed to be attributes first, keys second, pivot column third, data
block last.  Multi column pivots might be considered but I don't need
them and that's a significant expansion in scope, so I'm avoiding that
consideration.

What we need to do is convert from the first format above (which is
how data is materialized on table) to the second format.  Just like as
within crosstab, if the key column(s) are ordered into the function we
can exploit that structure for an efficient implementation.

'Ford' and 'Fusion' are key columns; 'US' is uninteresting attribute
column (it is unambiguously represented by 'Ford', 'Fusion') and so
would be simply be copied from input to output set.

Our data is materialized in KPI style (which is pretty typical) but we
have large analytical reports that want it with columns representing
units of time.  This is farily common in the industry IMO.

There are various pure SQL approaches to do this but they tend to
force you to build out the X,Y columns and then join it all back
together; this can spiral out of control quite quickly from a
performance standpoint.  A crosstab style crawling cursor over ordered
set ought to get the job done very efficiently.  tablefunc doesn't
support multi part keys today, and the workaround is that you have so
stuff a key into a formatted string and than parse it out for
downstream joining, some needs of having to do that joining might
hopefully be eliminated by allowing the attribute columns to be copied
through.

merlin


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Protect syscache from bloating with negative cache entries
Next
From: Fujii Masao
Date:
Subject: Re: Remove Deprecated Exclusive Backup Mode