Thread: Crosstab Confusion

Crosstab Confusion

From
Adam Sherman
Date:
I'm really trying to understand how the tablefunc crosstab function works, to no avail.

I have a table that looks like this:

customer_id integer
date timestamp with time zone
amount numeric(10,4)

There are rows in this table every-time a customer gets charged an amount, which is multiple times per day.

I would like to get a result like this:

customer_id,day1,day2,day3,(…)
1,400.00,500.01,123.00,(…)

So, one row for each customer id and a column for every day in the current month. Anyone used crosstab for something
likethis? 

Thanks for your help,

A.


--
www.sherman.ca / +1-613-797-6819 / +1-646-233-3400

"When the burning husks of your startups warm the last of your bones, remember I told you so." - Zed



Re: Crosstab Confusion

From
Lee Hachadoorian
Date:
I'm flying blind here since I have nothing that looks like the structure you described to experiment on, but based on
somecrosstabs I have set up this should get you started. Explanatory notes follow.<br /><br />SELECT pivot.*<br /> FROM
crosstab('<br/>    --row header, column header, cell value<br />    SELECT customer_id, extract(day FROM date),
sum(amount)<br/>    WHERE extract(month FROM date) = 1 --desired month, 1=Jan, 2=Feb, etc.<br />    FROM your_table<br
/>    GROUP BY 1, 2<br />    ORDER BY 1, 2<br />    ','<br />    --list of column headers<br />    SELECT 1, 2, 3, […]
31<br/>    ') <br />    --list of column names for final result set<br />    --does not have to match column names from
crosstab()section!<br />     pivot(customer_id integer, day1 numeric(10,4), day2 numeric(10,4), day3 numeric(10,4), […]
day31numeric(10,4))<br />;<br /><br />You basically have three parts:<br /><br />1) SELECT query in the form (row
header,column header, cell value). In this case it is an aggregate query so that  you can sum the transactions over a
givenday.<br /><br />2) List of column headers. If you want, this can SELECT from another table, so you can have a
tablewith rows 1, 2, 3, etc and use it to select the days from the month instead of listing them manually.<br /><br
/>3)List of output columns, which follows "pivot" in the text above. Note that "pivot" is an arbitrary name. You can
usefoo, bar, or whatever, but that will be the name of the table which must be used to reference the columns in the top
SELECTlist.<br /><br />Note that my somewhat bizarre indenting / end of line structure is designed so that, when using
pgAdmin,I can very quickly select and execute just the value query or column header query, which lets me experiment as
Ibuild the crosstab.<br /><br />Note, finally, that the list of output columns (following "pivot") should be one more
thanthe number of columns in section 2 (technically, you can circumvent this limitation, but it's complicated and seems
unnecessaryfor your use case). That could be problematic when running this query in the middle of the month and the
numberof value rows per customer doesn't match the number expected in the list of column headers and list of output
columns.I don't know of any way to make the list of output columns vary dynamically, but perhaps someone else here
does?Otherwise, you would have to either (a) make 31 different crosstab queries and choose which one to run based on
theday of the month, or (b) come up with some way to "pad" the value table created in part 1 so that it lists
customer_id,day, 0 for all days which have no customer transactions. (Perhaps create a dummy customer_id that has no
transactionsfor all days in the month, which should be enough to trick the crosstab function into thinking it has
somethingto fill the last columns.<br /><br />--Lee<br /><div class="gmail_quote"><br /></div>-- <br />Lee
Hachadoorian<br/>PhD Student, Geography<br />Program in Earth & Environmental Sciences<br />CUNY Graduate Center<br
/>

Re: Crosstab Confusion

From
Adam Sherman
Date:
On 2010-02-01, at 11:34 , Lee Hachadoorian wrote:
> You basically have three parts:
>
> 1) SELECT query in the form (row header, column header, cell value). In this case it is an aggregate query so that
youcan sum the transactions over a given day. 
>
> 2) List of column headers. If you want, this can SELECT from another table, so you can have a table with rows 1, 2,
3,etc and use it to select the days from the month instead of listing them manually. 
>
> 3) List of output columns, which follows "pivot" in the text above. Note that "pivot" is an arbitrary name. You can
usefoo, bar, or whatever, but that will be the name of the table which must be used to reference the columns in the top
SELECTlist. 

Wow that's an incredibly complete response!

I'm not getting any data in my rows though. This query produces the data:

SELECT
cust_id as customer,
date_trunc('day', date) AS day,
SUM(billed_duration) AS minutes
FROM master_cdr
WHERE date >= '2010-01-01'
GROUP BY 1,2
ORDER BY 1,2;

Which looks like:
customer |         day         | minutes
----------+---------------------+---------       1 | 2010-01-01 00:00:00 |    1110       1 | 2010-01-03 00:00:00 |
60      1 | 2010-01-26 00:00:00 |   23010       1 | 2010-01-27 00:00:00 |   17910       2 | 2010-01-01 00:00:00 |
60      2 | 2010-01-02 00:00:00 |      30       2 | 2010-01-04 00:00:00 |   26310 
etc, etc, etc

But this query:

-- clients by day
SELECT pivot.* FROM crosstab( 'SELECT   cust_id as customer,   date_trunc(''day'', date) AS day,   SUM(billed_duration)
ASminutes FROM master_cdr WHERE date >= ''2010-01-01'' GROUP BY 1,2 ORDER BY 1,2', 'select * from day_of_month' 
) pivot ( customer integer, day1 numeric(10,4), day2 numeric(10,4),
(…) day31 numeric(10,4) )
ORDER BY customer;

Gives me a table that looks right but all values are null for the days.

Something simple maybe?

Thanks,

A.

--
www.sherman.ca / +1-613-797-6819 / +1-646-233-3400

"When the burning husks of your startups warm the last of your bones, remember I told you so." - Zed



Re: Crosstab Confusion

From
Lee Hachadoorian
Date:
The output column data type (day1, day2, etc.) is supposed to match the value data type. I used numeric(10,4) because that's what your original post specified, but the billed_duration column in your most recent post looks like it might be integer? (Or is it defined as numeric(10,4), but you never enter noninteger values?)

What's the output of the category query by itself? I forgot to include ORDER BY 1 at the end of the category query. (The order should match the order of output columns, but I think without it you wouldn't get NULL values, just the values would be in the wrong columns.) I assume day_of_month has only one column, but I would suggest naming it explicitly instead of using *. And is the day_of_month column defined in the same format as date_trunc('day', date)? They must successfully pass an "equals" test to get included in the right crosstab cell. If a category value in the source query doesn't match any value produced by the category query, I think the crosstab function just throws out that row, which could lead to a table with the correct structure but all NULLs.

I didn't think about this before, but you can also pull the category headers from your source table like this:

'SELECT DISTINCT date_trunc(''day'', date) AS day WHERE date >= ''2010-01-01'' ORDER BY 1'

Let me know if this gets you anywhere.

--Lee

On Mon, Feb 1, 2010 at 11:52 AM, Adam Sherman <adam@sherman.ca> wrote:
On 2010-02-01, at 11:34 , Lee Hachadoorian wrote:
> You basically have three parts:
>
> 1) SELECT query in the form (row header, column header, cell value). In this case it is an aggregate query so that  you can sum the transactions over a given day.
>
> 2) List of column headers. If you want, this can SELECT from another table, so you can have a table with rows 1, 2, 3, etc and use it to select the days from the month instead of listing them manually.
>
> 3) List of output columns, which follows "pivot" in the text above. Note that "pivot" is an arbitrary name. You can use foo, bar, or whatever, but that will be the name of the table which must be used to reference the columns in the top SELECT list.

Wow that's an incredibly complete response!

I'm not getting any data in my rows though. This query produces the data:

SELECT
cust_id as customer,
date_trunc('day', date) AS day,
SUM(billed_duration) AS minutes
FROM master_cdr
WHERE date >= '2010-01-01'
GROUP BY 1,2
ORDER BY 1,2;

Which looks like:

 customer |         day         | minutes
----------+---------------------+---------
       1 | 2010-01-01 00:00:00 |    1110
       1 | 2010-01-03 00:00:00 |      60
       1 | 2010-01-26 00:00:00 |   23010
       1 | 2010-01-27 00:00:00 |   17910
       2 | 2010-01-01 00:00:00 |      60
       2 | 2010-01-02 00:00:00 |      30
       2 | 2010-01-04 00:00:00 |   26310
etc, etc, etc

But this query:

-- clients by day
SELECT pivot.* FROM crosstab(
 'SELECT
   cust_id as customer,
   date_trunc(''day'', date) AS day,
   SUM(billed_duration) AS minutes
 FROM master_cdr
 WHERE date >= ''2010-01-01''
 GROUP BY 1,2
 ORDER BY 1,2',
 'select * from day_of_month'
) pivot (
 customer integer,
 day1 numeric(10,4),
 day2 numeric(10,4),
(…)
 day31 numeric(10,4)
 )
ORDER BY customer;

Gives me a table that looks right but all values are null for the days.

Something simple maybe?

Thanks,

A.

--
www.sherman.ca / +1-613-797-6819 / +1-646-233-3400

"When the burning husks of your startups warm the last of your bones, remember I told you so." - Zed




--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center

Re: Crosstab Confusion

From
Adam Sherman
Date:
On 2010-02-01, at 14:22 , Lee Hachadoorian wrote:
> The output column data type (day1, day2, etc.) is supposed to match the value data type. I used numeric(10,4) because
that'swhat your original post specified, but the billed_duration column in your most recent post looks like it might be
integer?(Or is it defined as numeric(10,4), but you never enter noninteger values?) 

Actually, the query I was running is:

SELECT   cust_id as customer,   date_trunc(''day'', date) AS day,   SUM(billed_duration)/60.0::numeric(10,4) AS minutes

billed_duration is an integer. Make sense?

> What's the output of the category query by itself? I forgot to include ORDER BY 1 at the end of the category query.
(Theorder should match the order of output columns, but I think without it you wouldn't get NULL values, just the
valueswould be in the wrong columns.) I assume day_of_month has only one column, but I would suggest naming it
explicitlyinstead of using *. And is the day_of_month column defined in the same format as date_trunc('day', date)?
Theymust successfully pass an "equals" test to get included in the right crosstab cell. If a category value in the
sourcequery doesn't match any value produced by the category query, I think the crosstab function just throws out that
row,which could lead to a table with the correct structure but all NULLs. 

Right, my list of columns weren't equal to the truncated date. Using your suggested query to generate the columns fixed
theproblem! 

Now, is there a way to generate the labels? Otherwise I have to adjust the query for th number of days returned.

Such nice output though! Awesome!

Thanks,

A.


--
www.sherman.ca / +1-613-797-6819 / +1-646-233-3400

"When the burning husks of your startups warm the last of your bones, remember I told you so." - Zed



Re: Crosstab Confusion

From
Lee Hachadoorian
Date:
On Mon, Feb 1, 2010 at 3:11 PM, Adam Sherman <adam@sherman.ca> wrote:
Actually, the query I was running is:

SELECT
   cust_id as customer,
   date_trunc(''day'', date) AS day,
   SUM(billed_duration)/60.0::numeric(10,4) AS minutes

billed_duration is an integer. Make sense?

If billed_duration is an integer, sum(billed_duration) will be int or bigint. I would just define the output columns as bigint (day1 bigint, day2 bigint, etc.). Although, formatting it in the source SQL (I see you are dividing by 60 and casting to numeric) saves you from having to format 31 output columns.

Right, my list of columns weren't equal to the truncated date. Using your suggested query to generate the columns fixed the problem!

Now, is there a way to generate the labels? Otherwise I have to adjust the query for th number of days returned.


Not that I know of, but I have confirmed that as long as the category SQL matches the output column list, you can have output columns with no data in them. In order to have your list of column headers match the source SQL, I would recommend going back to the extract() function I first recommended to extract the day of month as an integer, and then generate a 31 number series for your category headers. It would look like this:

SELECT pivot.* FROM crosstab(
 'SELECT
   cust_id as customer,
   extract(day from date)::integer AS day,
   SUM(billed_duration) AS minutes
 FROM master_cdr
 WHERE extract(month FROM date) = 1
 GROUP BY 1,2
 ORDER BY 1,2',
 'select day from generate_series(1,31) day'
) pivot (
 customer integer,
 day1 bigint,
 day2 bigint,
(…)
 day31 bigint
 )
ORDER BY customer;

For half-over months or months with fewer than 31 days, the final columns of the crosstab should just be blank.

PS: The way I have constructed it, I would avoid using WHERE date >= ''2010-01-01''. If data from February gets into the table, it will aggregate data from, e.g. Jan 2 and Feb 2 as both being part of "Day 2".

--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth & Environmental Sciences
CUNY Graduate Center