Thread: big un stacking query - help save me from myself
I hace a fairly large table with two columns that I need to "de-normalize" (235 million rows) There has got to be a better(i.e. faster) approach than what I am doing. I am using a MAX CASE on each of the 24 variables (column names variableand value) that I want to unstack. Any suggestions would be most appreciated. Here is the ANALYZE GroupAggregate (cost=107156950.16..174988710.12 rows=23511876 width=286) -> Sort (cost=107156950.16..107744747.04 rows=235118752 width=286) Sort Key: site, canopy, block, plot, measurement_interval, warming_treatment, treatment_code, treatment_abbr, water_treatment,variable_name, variable_channel, variable_id, rowid, time2, "timestamp", block_name, table_name, batt_volt,program, flag1, flag2, flag3, amb_a_avg, amb_closed_avg, airtc_avg, airtemp_avg, airtemp_max, airtemp_min, all_avgt,am25tref1, ptemp_avg, rh, s_all_avgt_avg, vp_avg, tabove_sdupper, tabove_sdlower, tabove_meantrim, tabove_mean_dc,tsoil_sdupper, tsoil_sdlower, tsoil_meantrim, tsoil_mean_dc -> Seq Scan on derived_15min_joined (cost=0.00..11782048.52 rows=235118752 width=286) (4 rows) And here is the table: b4warmed3=# \d derived_15min_joined Table "public.derived_15min_joined" Column | Type | Modifiers ----------------------+-----------------------------+----------- site | character varying(6) | canopy | character varying(24) | block | character(2) | plot | character(6) | measurement_interval | interval | warming_treatment | character varying(24) | treatment_code | character varying(24) | treatment_abbr | character varying(24) | water_treatment | character varying(24) | variable_name | character varying(24) | variable_channel | character varying(24) | variable_id | character varying(24) | rowid | character varying(48) | time2 | timestamp without time zone | timestamp | timestamp without time zone | block_name | character varying(8) | table_name | character varying(10) | batt_volt | real | program | character varying(48) | flag1 | integer | flag2 | integer | flag3 | integer | amb_a_avg | real | amb_closed_avg | real | airtc_avg | real | airtemp_avg | real | airtemp_max | real | airtemp_min | real | all_avgt | real | am25tref1 | real | ptemp_avg | real | rh | real | s_all_avgt_avg | real | vp_avg | real | tabove_sdupper | real | tabove_sdlower | real | tabove_meantrim | real | tabove_mean_dc | real | tsoil_sdupper | real | tsoil_sdlower | real | tsoil_meantrim | real | tsoil_mean_dc | real | variable | text | value | real | Indexes: "derived_15min_joined_lower_idx" btree (lower(variable_name::text)) "derived_15min_joined_time2_idx" btree (time2) b4warmed3=# And here is my query. SELECT site, canopy, block, plot, measurement_interval, warming_treatment, treatment_code, treatment_abbr, water_treatment, variable_name, variable_channel, variable_id, rowid, time2, timestamp, block_name, table_name, batt_volt, program, flag1, flag2, flag3, amb_a_avg, amb_closed_avg, airtc_avg, airtemp_avg, airtemp_max, airtemp_min, all_avgt, am25tref1, ptemp_avg, rh, s_all_avgt_avg, vp_avg, tabove_sdupper, tabove_sdlower, tabove_meantrim, tabove_mean_dc, tsoil_sdupper, tsoil_sdlower, tsoil_meantrim, tsoil_mean_dc, MAX ( CASE WHEN lower(variable_name) = 'tabove' THEN value END ) AS tabove, MAX ( CASE WHEN lower(variable_name) = 'tabove_sc' THEN value END ) AS tabove_sc, MAX ( CASE WHEN lower(variable_name) = 'tabove_delta' THEN value END ) AS tabove_delta, MAX ( CASE WHEN lower(variable_name) = 'tsoil' THEN value END ) AS tsoil, MAX ( CASE WHEN lower(variable_name) = 'tsoil_sc' THEN value END ) AS tsoil_sc, MAX ( CASE WHEN lower(variable_name) = 'tsoil_delta' THEN value END ) AS tsoil_delta, MAX ( CASE WHEN lower(variable_name) = 's_sdm_out' THEN value END ) AS s_sdm_out, MAX ( CASE WHEN lower(variable_name) = 'sbtemp' THEN value END ) AS sbtemp, MAX ( CASE WHEN lower(variable_name) = 'heat_a_avg' THEN value END ) AS heat_a_avg, MAX ( CASE WHEN lower(variable_name) = 'b_dc_avg' THEN value END ) AS b_dc_avg, MAX ( CASE WHEN lower(variable_name) = 'targettemp' THEN value END ) AS targettemp, MAX ( CASE WHEN lower(variable_name) = 's_scldout' THEN value END ) AS s_scldout, MAX ( CASE WHEN lower(variable_name) = 'tmv' THEN value END ) AS tmv, MAX ( CASE WHEN lower(variable_name) = 'a_dc' THEN value END ) AS a_dc, MAX ( CASE WHEN lower(variable_name) = 'a_targettemp' THEN value END ) AS a_targettemp, MAX ( CASE WHEN lower(variable_name) = 'scldout' THEN value END ) AS scldout, MAX ( CASE WHEN lower(variable_name) = 'pid_lmt' THEN value END ) AS pid_lmt, MAX ( CASE WHEN lower(variable_name) = 'targettemp_adj' THEN value END ) AS targettemp_adj, MAX ( CASE WHEN lower(variable_name) = 'sdm_out' THEN value END ) AS sdm_out, MAX ( CASE WHEN lower(variable_name) = 's_pid_lmt' THEN value END ) AS s_pid_lmt, MAX ( CASE WHEN lower(variable_name) = 'tsoilr' THEN value END ) AS tsoilr, MAX ( CASE WHEN lower(variable_name) = 's_pid_out' THEN value END ) AS s_pid_out, MAX ( CASE WHEN lower(variable_name) = 'sctemp' THEN value END ) AS sctemp, MAX ( CASE WHEN lower(variable_name) = 'amb_avg' THEN value END ) AS amb_avg FROM derived_15min_joined GROUP BY site, canopy, block, plot, measurement_interval, warming_treatment, treatment_code, treatment_abbr, water_treatment, variable_name, variable_channel, variable_id, rowid, time2, timestamp, block_name, table_name, batt_volt, program, flag1, flag2, flag3, amb_a_avg, amb_closed_avg, airtc_avg, airtemp_avg, airtemp_max, airtemp_min, all_avgt, am25tref1, ptemp_avg, rh, s_all_avgt_avg, vp_avg, tabove_sdupper, tabove_sdlower, tabove_meantrim, tabove_mean_dc, tsoil_sdupper, tsoil_sdlower, tsoil_meantrim, tsoil_mean_dc ;
I have a fairly large table with two columns that I need to "de-normalize" (235 million rows) There has got to be a better(i.e. faster) approach than what I am doing. I am using a MAX CASE on each of the 24 variables (column names variableand value) that I want to unstack. Any suggestions would be most appreciated. Here is the ANALYZE GroupAggregate (cost=107156950.16..174988710.12 rows=23511876 width=286) -> Sort (cost=107156950.16..107744747.04 rows=235118752 width=286) Sort Key: site, canopy, block, plot, measurement_interval, warming_treatment, treatment_code, treatment_abbr, water_treatment,variable_name, variable_channel, variable_id, rowid, time2, "timestamp", block_name, table_name, batt_volt,program, flag1, flag2, flag3, amb_a_avg, amb_closed_avg, airtc_avg, airtemp_avg, airtemp_max, airtemp_min, all_avgt,am25tref1, ptemp_avg, rh, s_all_avgt_avg, vp_avg, tabove_sdupper, tabove_sdlower, tabove_meantrim, tabove_mean_dc,tsoil_sdupper, tsoil_sdlower, tsoil_meantrim, tsoil_mean_dc -> Seq Scan on derived_15min_joined (cost=0.00..11782048.52 rows=235118752 width=286) (4 rows) And here is the table: b4warmed3=# \d derived_15min_joined Table "public.derived_15min_joined" Column | Type | Modifiers ----------------------+-----------------------------+----------- site | character varying(6) | canopy | character varying(24) | block | character(2) | plot | character(6) | measurement_interval | interval | warming_treatment | character varying(24) | treatment_code | character varying(24) | treatment_abbr | character varying(24) | water_treatment | character varying(24) | variable_name | character varying(24) | variable_channel | character varying(24) | variable_id | character varying(24) | rowid | character varying(48) | time2 | timestamp without time zone | timestamp | timestamp without time zone | block_name | character varying(8) | table_name | character varying(10) | batt_volt | real | program | character varying(48) | flag1 | integer | flag2 | integer | flag3 | integer | amb_a_avg | real | amb_closed_avg | real | airtc_avg | real | airtemp_avg | real | airtemp_max | real | airtemp_min | real | all_avgt | real | am25tref1 | real | ptemp_avg | real | rh | real | s_all_avgt_avg | real | vp_avg | real | tabove_sdupper | real | tabove_sdlower | real | tabove_meantrim | real | tabove_mean_dc | real | tsoil_sdupper | real | tsoil_sdlower | real | tsoil_meantrim | real | tsoil_mean_dc | real | variable | text | value | real | Indexes: "derived_15min_joined_lower_idx" btree (lower(variable_name::text)) "derived_15min_joined_time2_idx" btree (time2) b4warmed3=# And here is my query. SELECT site, canopy, block, plot, measurement_interval, warming_treatment, treatment_code, treatment_abbr, water_treatment, variable_name, variable_channel, variable_id, rowid, time2, timestamp, block_name, table_name, batt_volt, program, flag1, flag2, flag3, amb_a_avg, amb_closed_avg, airtc_avg, airtemp_avg, airtemp_max, airtemp_min, all_avgt, am25tref1, ptemp_avg, rh, s_all_avgt_avg, vp_avg, tabove_sdupper, tabove_sdlower, tabove_meantrim, tabove_mean_dc, tsoil_sdupper, tsoil_sdlower, tsoil_meantrim, tsoil_mean_dc, MAX ( CASE WHEN lower(variable_name) = 'tabove' THEN value END ) AS tabove, MAX ( CASE WHEN lower(variable_name) = 'tabove_sc' THEN value END ) AS tabove_sc, MAX ( CASE WHEN lower(variable_name) = 'tabove_delta' THEN value END ) AS tabove_delta, MAX ( CASE WHEN lower(variable_name) = 'tsoil' THEN value END ) AS tsoil, MAX ( CASE WHEN lower(variable_name) = 'tsoil_sc' THEN value END ) AS tsoil_sc, MAX ( CASE WHEN lower(variable_name) = 'tsoil_delta' THEN value END ) AS tsoil_delta, MAX ( CASE WHEN lower(variable_name) = 's_sdm_out' THEN value END ) AS s_sdm_out, MAX ( CASE WHEN lower(variable_name) = 'sbtemp' THEN value END ) AS sbtemp, MAX ( CASE WHEN lower(variable_name) = 'heat_a_avg' THEN value END ) AS heat_a_avg, MAX ( CASE WHEN lower(variable_name) = 'b_dc_avg' THEN value END ) AS b_dc_avg, MAX ( CASE WHEN lower(variable_name) = 'targettemp' THEN value END ) AS targettemp, MAX ( CASE WHEN lower(variable_name) = 's_scldout' THEN value END ) AS s_scldout, MAX ( CASE WHEN lower(variable_name) = 'tmv' THEN value END ) AS tmv, MAX ( CASE WHEN lower(variable_name) = 'a_dc' THEN value END ) AS a_dc, MAX ( CASE WHEN lower(variable_name) = 'a_targettemp' THEN value END ) AS a_targettemp, MAX ( CASE WHEN lower(variable_name) = 'scldout' THEN value END ) AS scldout, MAX ( CASE WHEN lower(variable_name) = 'pid_lmt' THEN value END ) AS pid_lmt, MAX ( CASE WHEN lower(variable_name) = 'targettemp_adj' THEN value END ) AS targettemp_adj, MAX ( CASE WHEN lower(variable_name) = 'sdm_out' THEN value END ) AS sdm_out, MAX ( CASE WHEN lower(variable_name) = 's_pid_lmt' THEN value END ) AS s_pid_lmt, MAX ( CASE WHEN lower(variable_name) = 'tsoilr' THEN value END ) AS tsoilr, MAX ( CASE WHEN lower(variable_name) = 's_pid_out' THEN value END ) AS s_pid_out, MAX ( CASE WHEN lower(variable_name) = 'sctemp' THEN value END ) AS sctemp, MAX ( CASE WHEN lower(variable_name) = 'amb_avg' THEN value END ) AS amb_avg FROM derived_15min_joined GROUP BY site, canopy, block, plot, measurement_interval, warming_treatment, treatment_code, treatment_abbr, water_treatment, variable_name, variable_channel, variable_id, rowid, time2, timestamp, block_name, table_name, batt_volt, program, flag1, flag2, flag3, amb_a_avg, amb_closed_avg, airtc_avg, airtemp_avg, airtemp_max, airtemp_min, all_avgt, am25tref1, ptemp_avg, rh, s_all_avgt_avg, vp_avg, tabove_sdupper, tabove_sdlower, tabove_meantrim, tabove_mean_dc, tsoil_sdupper, tsoil_sdlower, tsoil_meantrim, tsoil_mean_dc ;
Kirk Wythers <kirk.wythers@gmail.com> wrote: > I hace a fairly large table with two columns that I need to > "de-normalize" (235 million rows) There has got to be a better > (i.e. faster) approach than what I am doing. I am using a MAX > CASE on each of the 24 variables (column names variable and > value) that I want to unstack. Any suggestions would be most > appreciated. I didn't understand your description of what you are trying to do, and the example has so many columns and cases that it would take a long time to understand it. Can you distill this down to just a few columns and cases so that it is easier to understand what you are trying to accomplish? Even better would be a self-contained test case with just a few rows so people can see "before" and "after" data. What you have already posted will help give context on how it needs to scale, which is important, too; but if you make the issue easier to understand, the odds improve that someone will volunteer the time needed to make a suggestion. -Kevin -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mar 14, 2013, at 10:27 AM, Kevin Grittner <kgrittn@ymail.com> wrote: >> > > I didn't understand your description of what you are trying to do, > and the example has so many columns and cases that it would take a > long time to understand it. Can you distill this down to just a > few columns and cases so that it is easier to understand what you > are trying to accomplish? Even better would be a self-contained > test case with just a few rows so people can see "before" and > "after" data. What you have already posted will help give context > on how it needs to scale, which is important, too; but if you make > the issue easier to understand, the odds improve that someone will > volunteer the time needed to make a suggestion. Here is a simplified version of the query approach I am attempting. First of all, this query works fine with a limited numberof columns. There are some colums that I am leaving alone (those are the var1, var2, var3 variables) and a limitednumber of variables that I am trying to "unstack" (those are unstack1, unstack2, unstack3…. variables). The problem lies in that the real table I am working with is vary large. There are 30 plus var1, var2… columns I am not unstacking,and 30 plus variables (unstack1, unstack2…) that I am unstacking, from a 25 million row table. I have looked at the tablefunc approach, and I am wondering if it is any more efficient than using the CASE approach I amtrying here. I let the full version of the below query run for 2 days before killing it when it threatened to fill theentire hard drive (250 Gigs). CREATE TABLE unstacked_table AS ( SELECT var1, var2, var3, MAX ( CASE WHEN variable_name = 'unstack1' THEN VALUE END ) AS unstack1, MAX ( CASE WHEN variable_name = 'unstack2' THEN VALUE END ) AS unstack2, MAX ( CASE WHEN variable_name = 'unstack3' THEN VALUE END ) AS unstack3, MAX ( CASE WHEN variable_name = 'unstack4' THEN VALUE END ) AS unstack4 FROM stacked_table GROUP BY variable1, variable2, variable3 ) ;
Kirk Wythers <wythe001@umn.edu> wrote: > Here is a simplified version of the query approach I am > attempting. > CREATE TABLE unstacked_table AS ( > SELECT > var1, > var2, > var3, > MAX ( > CASE > WHEN variable_name = 'unstack1' THEN > > VALUE > > END > ) AS unstack1, > MAX ( > CASE > WHEN variable_name = 'unstack2' THEN > > VALUE > > END > ) AS unstack2, > MAX ( > CASE > WHEN variable_name = 'unstack3' THEN > > VALUE > > END > ) AS unstack3, > MAX ( > CASE > WHEN variable_name = 'unstack4' THEN > > VALUE > > END > ) AS unstack4 > FROM > stacked_table > GROUP BY > variable1, > variable2, > variable3 > ) > ; This is still not making sense to me. Is VALUE intended to be a place-holder showing where you have a literal in the code? If so, what is the point of using MAX? Could you create a sample "stacked" table, insert about 10 rows, and show the SELECT which would give the "unstacked" form? -- Kevin Grittner EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company