Thread: custom crosstab question

custom crosstab question

From
"Michael Swierczek"
Date:
Ladies and Gentlemen,
This will be long, I apologize.   I'm mostly looking for someone to
tell me there's an obvious solution I'm missing.

I have a table 'answer' with medical survey answers.  Periodically the
data must be exported to CSV files in a quasi-crosstab format so they
can be examined by statisticians.

The complication is that many of the questions in the medical surveys
can be skipped entirely (if you report not using drugs at all, we
don't ask how recently you used marijuana, and so forth...)  For
analysis, the exported data needs a 'category' column for each survey
question even if it has no corresponding answer and not just the
survey questions with answers that actually exist in the database.

Here is a simplified example of the tables, showing sample data for one survey:
survey_question: (id |  survey_id  |  question_code  | question_order)
1  |   1  |   'drug'  |   1
2  |   1  |   'marijuana' |  2
3  |   1  |   'sick'  |  3

answer:  (id |  survey_event_id |  question_code  |  answer_order |
answer_value)
1  |  1  |   'drug' |  1  |   1
2  |  1  |   'sick' |  2  |   1

Using tablefunc/crosstab, the resulting export would be
(survey_event_id |  'drug' | 'sick'  ....  )
But I need  (survey_event_id | 'drug' | 'marijuana' | 'sick' .... )

I wrote a program that takes a 'survey_id' input, queries PostgreSQL
for the set of question_codes, and then generates a massive query in
the form:
--- begin huge query
select an0.survey_event_id, an0.answer_value, an1.answer_value,
an2.answer_value,
an3.answer_value .....
from answer an0
left join answer an1 where an0.survey_event_id = an1.survey_event_id
and an1.question_code = 'drug'
left join answer an2 where an0.survey_event_id = an2.survey_event_id and
an2.question_code = 'marijuana'
left join....
--- end huge query

Depending upon the survey, the resulting queries have between 10 and
250 joins on the answer_table.  I run them through psql and pipe the
results to a csv file.   They work, but they're pretty slow.  There
are 14,300 total survey_event entries from 50 different surveys in the
test database I'm using, and exporting all of the data in this way
takes 3 hours.   I'm using PostgreSQL 8.1.9 and 8.2.4 (the latter is
faster, although I don't have exact figures how much).

Thanks,
Mike

Re: custom crosstab question

From
Joe Conway
Date:
Michael Swierczek wrote:
> Ladies and Gentlemen,
> This will be long, I apologize.   I'm mostly looking for someone to
> tell me there's an obvious solution I'm missing.
>

Try something like this:

create table survey_question (id int, survey_id int, question_code text,
question_order int);

insert into survey_question values
(1,1,'drug',1),(2,1,'marijuana',2),(3,1,'sick',3);

create table answer (id int, survey_event_id int, question_code text,
answer_order int, answer_value int);

insert into answer values (1,1,'drug',1,1),(2,1,'sick',2,1);

select * from crosstab(
   'select survey_event_id, question_code, answer_value from answer',
   'select question_code from survey_question order by question_order'
) as (survey_event_id int, cat1 int, cat2 int, cat3 int);
  survey_event_id | cat1 | cat2 | cat3
-----------------+------+------+------
                1 |    1 |      |    1
(1 row)

In 8.2 you could do "select question_code from survey_question order by
question_order" first, and in your application build this query
dynamically and run it:

select * from crosstab(
   'select survey_event_id, question_code, answer_value from answer',
   'values (''drug''), (''marijuana''), (''sick'')'
) as (survey_event_id int, drug int, marijuana int, sick int);
  survey_event_id | drug | marijuana | sick
-----------------+------+-----------+------
                1 |    1 |           |    1
(1 row)

HTH,

Joe

Re: custom crosstab question

From
"Michael Swierczek"
Date:
Joe,
     That's exactly what I needed.   I completely missed that it would
be possible with the tablefunc/crosstab.   Since you're the main
(only?) name I see associated with that code, I'm sure you would know.

-Mike

On 4/25/07, Joe Conway <mail@joeconway.com> wrote:
> Michael Swierczek wrote:
> > Ladies and Gentlemen,
> > This will be long, I apologize.   I'm mostly looking for someone to
> > tell me there's an obvious solution I'm missing.
> >
>
> Try something like this:
>
> create table survey_question (id int, survey_id int, question_code text,
> question_order int);
>
> insert into survey_question values
> (1,1,'drug',1),(2,1,'marijuana',2),(3,1,'sick',3);
>
> create table answer (id int, survey_event_id int, question_code text,
> answer_order int, answer_value int);
>
> insert into answer values (1,1,'drug',1,1),(2,1,'sick',2,1);
>
> select * from crosstab(
>    'select survey_event_id, question_code, answer_value from answer',
>    'select question_code from survey_question order by question_order'
> ) as (survey_event_id int, cat1 int, cat2 int, cat3 int);
>   survey_event_id | cat1 | cat2 | cat3
> -----------------+------+------+------
>                 1 |    1 |      |    1
> (1 row)
>
> In 8.2 you could do "select question_code from survey_question order by
> question_order" first, and in your application build this query
> dynamically and run it:
>
> select * from crosstab(
>    'select survey_event_id, question_code, answer_value from answer',
>    'values (''drug''), (''marijuana''), (''sick'')'
> ) as (survey_event_id int, drug int, marijuana int, sick int);
>   survey_event_id | drug | marijuana | sick
> -----------------+------+-----------+------
>                 1 |    1 |           |    1
> (1 row)
>
> HTH,
>
> Joe
>

Re: custom crosstab question

From
"Michael Swierczek"
Date:
On 4/25/07, Michael Swierczek <mike.swierczek@gmail.com> wrote:
> Joe,
>      That's exactly what I needed.   I completely missed that it would
> be possible with the tablefunc/crosstab.   Since you're the main
> (only?) name I see associated with that code, I'm sure you would know.
>
> -Mike
>

I spoke too soon, there's an additional factor at play that
complicates things.
Most of the questions take a single answer, but the multiple select
questions can have several answers.   So for a regular question with
code 'drug', a given survey_event can have 0 or 1 entries in the
answer table with question_code 'drug'.  For a multiple select
question like, 'health', 0-7 entries are possible in the answer table,
and we want them exported as columns 'health0', 'health1', 'health2'
through 'health7' and each column populated according to whether that
section of the multiple select was chosen.

However, I fear I am crossing from "I have a novice question" into
"This is complicated enough that requesting help is exploiting the
generosity of the community for free consulting services."  I won't
pursue it any further here.

-Mike




> On 4/25/07, Joe Conway <mail@joeconway.com> wrote:
> > Michael Swierczek wrote:
> > > Ladies and Gentlemen,
> > > This will be long, I apologize.   I'm mostly looking for someone to
> > > tell me there's an obvious solution I'm missing.
> > >
> >
> > Try something like this:
> >
> > create table survey_question (id int, survey_id int, question_code text,
> > question_order int);
> >
> > insert into survey_question values
> > (1,1,'drug',1),(2,1,'marijuana',2),(3,1,'sick',3);
> >
> > create table answer (id int, survey_event_id int, question_code text,
> > answer_order int, answer_value int);
> >
> > insert into answer values (1,1,'drug',1,1),(2,1,'sick',2,1);
> >
> > select * from crosstab(
> >    'select survey_event_id, question_code, answer_value from answer',
> >    'select question_code from survey_question order by question_order'
> > ) as (survey_event_id int, cat1 int, cat2 int, cat3 int);
> >   survey_event_id | cat1 | cat2 | cat3
> > -----------------+------+------+------
> >                 1 |    1 |      |    1
> > (1 row)
> >
> > In 8.2 you could do "select question_code from survey_question order by
> > question_order" first, and in your application build this query
> > dynamically and run it:
> >
> > select * from crosstab(
> >    'select survey_event_id, question_code, answer_value from answer',
> >    'values (''drug''), (''marijuana''), (''sick'')'
> > ) as (survey_event_id int, drug int, marijuana int, sick int);
> >   survey_event_id | drug | marijuana | sick
> > -----------------+------+-----------+------
> >                 1 |    1 |           |    1
> > (1 row)
> >
> > HTH,
> >
> > Joe
> >
>

Re: custom crosstab question

From
Sean Davis
Date:
Michael Swierczek wrote:
> On 4/25/07, Michael Swierczek <mike.swierczek@gmail.com> wrote:
>> Joe,
>>      That's exactly what I needed.   I completely missed that it would
>> be possible with the tablefunc/crosstab.   Since you're the main
>> (only?) name I see associated with that code, I'm sure you would know.
>>
>> -Mike
>>
>
> I spoke too soon, there's an additional factor at play that
> complicates things.
> Most of the questions take a single answer, but the multiple select
> questions can have several answers.   So for a regular question with
> code 'drug', a given survey_event can have 0 or 1 entries in the
> answer table with question_code 'drug'.  For a multiple select
> question like, 'health', 0-7 entries are possible in the answer table,
> and we want them exported as columns 'health0', 'health1', 'health2'
> through 'health7' and each column populated according to whether that
> section of the multiple select was chosen.
>
> However, I fear I am crossing from "I have a novice question" into
> "This is complicated enough that requesting help is exploiting the
> generosity of the community for free consulting services."  I won't
> pursue it any further here.
Sometimes, in cases like this, where there is business logic
intermingled with database logic, the best option is to move your
reporting code to the client.  While this may result in a performance
hit, more than a single query, and some data structures slightly more
complex than rows in a table, the flexibility of this route might be
worth the effort.

Of course, you are using postgresql, so you can certainly write
functions within the database that return data structures more complex
than single rows from within the database.  For example, you could write
a function using pl/perl or pl/python (or java, etc.) that returns an
XML chunk that represents the report results and then use XSLT on the
client side to format that result into whatever you like.  The
possibilities are pretty endless.

Sean

Re: custom crosstab question

From
Joe Conway
Date:
Michael Swierczek wrote:
> On 4/25/07, Michael Swierczek <mike.swierczek@gmail.com> wrote:
>> Joe,
>>      That's exactly what I needed.   I completely missed that it would
>> be possible with the tablefunc/crosstab.   Since you're the main
>> (only?) name I see associated with that code, I'm sure you would know.
>>
>> -Mike
>>
>
> I spoke too soon, there's an additional factor at play that
> complicates things.
> Most of the questions take a single answer, but the multiple select
> questions can have several answers.   So for a regular question with
> code 'drug', a given survey_event can have 0 or 1 entries in the
> answer table with question_code 'drug'.  For a multiple select
> question like, 'health', 0-7 entries are possible in the answer table,
> and we want them exported as columns 'health0', 'health1', 'health2'
> through 'health7' and each column populated according to whether that
> section of the multiple select was chosen.

It seems to me that if you really want health0 and health1 tracked
independently, they should each be considered separate questions. Then
the existing code would "just work". Perhaps if it is important to tie
those "healthN" questions together, you could have some other attribute
for questions that allows grouping. So, for example, grp_id = 0 means
this question is "stand alone" and grp_id > 0 is a grouped question.
Something like:

create table survey_question_grps (
  grp_id int,
  grp_name text
);

insert into survey_question_grps values
(0, 'stand alone question'),
(1, 'health');

create table survey_question (
  id int,
  survey_id int,
  question_code text,
  grp_id int,
  question_order int
);

insert into survey_question values
  (1,1,'drug',0,1),
  (2,1,'marijuana',0,2),
  (3,1,'sick',0,3),
  (4,1,'health1',1,4),
  (5,1,'health2',1,5);

Joe

Re: custom crosstab question

From
"Michael Swierczek"
Date:
On 4/26/07, Joe Conway <mail@joeconway.com> wrote:
> Michael Swierczek wrote:
> > On 4/25/07, Michael Swierczek <mike.swierczek@gmail.com> wrote:
> >> Joe,
> >>      That's exactly what I needed.   I completely missed that it would
> >> be possible with the tablefunc/crosstab.   Since you're the main
> >> (only?) name I see associated with that code, I'm sure you would know.
> >>
> >> -Mike
> >>
> >
> > I spoke too soon, there's an additional factor at play that
> > complicates things.
> > Most of the questions take a single answer, but the multiple select
> > questions can have several answers.   So for a regular question with
> > code 'drug', a given survey_event can have 0 or 1 entries in the
> > answer table with question_code 'drug'.  For a multiple select
> > question like, 'health', 0-7 entries are possible in the answer table,
> > and we want them exported as columns 'health0', 'health1', 'health2'
> > through 'health7' and each column populated according to whether that
> > section of the multiple select was chosen.
>
> It seems to me that if you really want health0 and health1 tracked
> independently, they should each be considered separate questions. Then
> the existing code would "just work". Perhaps if it is important to tie
> those "healthN" questions together, you could have some other attribute
> for questions that allows grouping. So, for example, grp_id = 0 means
> this question is "stand alone" and grp_id > 0 is a grouped question.
> Something like:

The software has been in production for a few years, and although I
have the ability and authority to retrofit the existing data, I'm
nervous about doing so.   A number of seemingly innocent tweaks in the
past caused major headaches further along.

But it's definitely worth considering.
-Mike


>
> create table survey_question_grps (
>   grp_id int,
>   grp_name text
> );
>
> insert into survey_question_grps values
> (0, 'stand alone question'),
> (1, 'health');
>
> create table survey_question (
>   id int,
>   survey_id int,
>   question_code text,
>   grp_id int,
>   question_order int
> );
>
> insert into survey_question values
>   (1,1,'drug',0,1),
>   (2,1,'marijuana',0,2),
>   (3,1,'sick',0,3),
>   (4,1,'health1',1,4),
>   (5,1,'health2',1,5);
>
> Joe
>