Thread: Join runs for > 10 hours and then fills up >1.3TB of disk space
Hi List; I have a table with 9,961,914 rows in it (see the describe of bigtab_stats_fact_tmp14 below) I also have a table with 7,785 rows in it (see the describe of xsegment_dim below) I'm running the join shown below and it takes > 10 hours and eventually runs out of disk space on a 1.4TB file system I've included below a describe of both tables, the join and an explain plan, any help / suggestions would be much appreciated ! I need to get this beast to run as quickly as possible (without filling up my file system) Thanks in advance... select f14.xpublisher_dim_id, f14.xtime_dim_id, f14.xlocation_dim_id, f14.xreferrer_dim_id, f14.xsite_dim_id, f14.xsystem_cfg_dim_id, f14.xaffiliate_dim_id, f14.customer_id, pf_dts_id, episode_id, sessionid, bytes_received, bytes_transmitted, total_played_time_sec, segdim.xsegment_dim_id as episode_level_segid from bigtab_stats_fact_tmp14 f14, xsegment_dim segdim where f14.customer_id = segdim.customer_srcid and f14.show_id = segdim.show_srcid and f14.season_id = segdim.season_srcid and f14.episode_id = segdim.episode_srcid and segdim.segment_srcid is NULL; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Merge Join (cost=1757001.74..73569676.49 rows=3191677219 width=118) Merge Cond: ((segdim.episode_srcid = f14.episode_id) AND (segdim.customer_srcid = f14.customer_id) AND (segdim.show_srcid = f14.show_id) AND (segdim.season_srcid = f14.season_id)) -> Sort (cost=1570.35..1579.46 rows=3643 width=40) Sort Key: segdim.episode_srcid, segdim.customer_srcid, segdim.show_srcid, segdim.season_srcid -> Seq Scan on xsegment_dim segdim (cost=0.00..1354.85 rows=3643 width=40) Filter: (segment_srcid IS NULL) -> Sort (cost=1755323.26..1780227.95 rows=9961874 width=126) Sort Key: f14.episode_id, f14.customer_id, f14.show_id, f14.season_id -> Seq Scan on bigtab_stats_fact_tmp14 f14 (cost=0.00..597355.74 rows=9961874 width=126) (9 rows) # \d bigtab_stats_fact_tmp14 Table "public.bigtab_stats_fact_tmp14" Column | Type | Modifiers --------------------------+-----------------------------+----------- pf_dts_id | bigint | pf_device_id | bigint | segment_id | bigint | cdn_id | bigint | collector_id | bigint | digital_envoy_id | bigint | maxmind_id | bigint | quova_id | bigint | website_id | bigint | referrer_id | bigint | affiliate_id | bigint | custom_info_id | bigint | start_dt | timestamp without time zone | total_played_time_sec | numeric(18,5) | bytes_received | bigint | bytes_transmitted | bigint | stall_count | integer | stall_duration_sec | numeric(18,5) | hiccup_count | integer | hiccup_duration_sec | numeric(18,5) | watched_duration_sec | numeric(18,5) | rewatched_duration_sec | numeric(18,5) | requested_start_position | numeric(18,5) | requested_stop_position | numeric(18,5) | post_position | numeric(18,5) | is_vod | numeric(1,0) | sessionid | bigint | create_dt | timestamp without time zone | segment_type_id | bigint | customer_id | bigint | content_publisher_id | bigint | content_owner_id | bigint | episode_id | bigint | duration_sec | numeric(18,5) | device_id | bigint | os_id | bigint | browser_id | bigint | cpu_id | bigint | xsystem_cfg_dim_id | bigint | xreferrer_dim_id | bigint | xaffiliate_dim_id | bigint | xsite_dim_id | bigint | xpublisher_dim_id | bigint | season_id | bigint | show_id | bigint | xsegment_dim_id | bigint | location_id | bigint | zipcode | character varying(20) | xlocation_dim_id | bigint | location_srcid | bigint | timezone | real | xtime_dim_id | bigint | Indexes: "bigtab_stats_fact_tmp14_idx1" btree (customer_id) "bigtab_stats_fact_tmp14_idx2" btree (show_id) "bigtab_stats_fact_tmp14_idx3" btree (season_id) "bigtab_stats_fact_tmp14_idx4" btree (episode_id) # \d xsegment_dim Table "public.xsegment_dim" Column | Type | Modifiers ----------------------+----------------------------- +------------------------------------------------------------- xsegment_dim_id | bigint | not null default nextval('xsegment_dim_seq'::regclass) customer_srcid | bigint | not null show_srcid | bigint | not null show_name | character varying(500) | not null season_srcid | bigint | not null season_name | character varying(500) | not null episode_srcid | bigint | not null episode_name | character varying(500) | not null segment_type_id | integer | segment_type | character varying(500) | segment_srcid | bigint | segment_name | character varying(500) | effective_dt | timestamp without time zone | not null default now() inactive_dt | timestamp without time zone | last_update_dt | timestamp without time zone | not null default now() Indexes: "xsegment_dim_pk" PRIMARY KEY, btree (xsegment_dim_id) "seg1" btree (customer_srcid) "seg2" btree (show_srcid) "seg3" btree (season_srcid) "seg4" btree (episode_srcid) "seg5" btree (segment_srcid) "xsegment_dim_ix1" btree (customer_srcid)
> I have a table with 9,961,914 rows in it (see the describe of > bigtab_stats_fact_tmp14 below) > > I also have a table with 7,785 rows in it (see the describe of xsegment_dim > below) > > I'm running the join shown below and it takes > 10 hours and eventually runs > out of disk space on a 1.4TB file system > > I've included below a describe of both tables, the join and an explain plan, > any help / suggestions would be much appreciated ! > > I need to get this beast to run as quickly as possible (without filling up > my file system) > > > Thanks in advance... What version of postgresql are you using? According to http://www.postgresql.org/docs/8.2/static/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-MEMORY you may benefit from adjusting work_mem. You also index segment_srcid (in table xsegment_dim) but if you search for NULL and you have enough of those it defaults to a seq. scan: Seq Scan on xsegment_dim segdim (cost=0.00..1354.85 rows=3643 width=40) > Filter: (segment_srcid IS NULL) Maby you could insert some default value into segment_srcid (some arbitrary large numbers) instead of NULL and then search for values greater than?? You could also try to lower random_page_cost from default to 2. > select > f14.xpublisher_dim_id, > f14.xtime_dim_id, > f14.xlocation_dim_id, > f14.xreferrer_dim_id, > f14.xsite_dim_id, > f14.xsystem_cfg_dim_id, > f14.xaffiliate_dim_id, > f14.customer_id, > pf_dts_id, > episode_id, > sessionid, > bytes_received, > bytes_transmitted, > total_played_time_sec, > segdim.xsegment_dim_id as episode_level_segid > from > bigtab_stats_fact_tmp14 f14, > xsegment_dim segdim > where > f14.customer_id = segdim.customer_srcid > and f14.show_id = segdim.show_srcid > and f14.season_id = segdim.season_srcid > and f14.episode_id = segdim.episode_srcid > and segdim.segment_srcid is NULL; > > > > > > > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Merge Join (cost=1757001.74..73569676.49 rows=3191677219 width=118) > Merge Cond: ((segdim.episode_srcid = f14.episode_id) AND > (segdim.customer_srcid = f14.customer_id) AND (segdim.show_srcid = > f14.show_id) AND (segdim.season_srcid = f14.season_id)) > -> Sort (cost=1570.35..1579.46 rows=3643 width=40) > Sort Key: segdim.episode_srcid, segdim.customer_srcid, segdim.show_srcid, > segdim.season_srcid > -> Seq Scan on xsegment_dim segdim (cost=0.00..1354.85 rows=3643 width=40) > Filter: (segment_srcid IS NULL) > -> Sort (cost=1755323.26..1780227.95 rows=9961874 width=126) > Sort Key: f14.episode_id, f14.customer_id, f14.show_id, f14.season_id > -> Seq Scan on bigtab_stats_fact_tmp14 f14 (cost=0.00..597355.74 > rows=9961874 width=126) > (9 rows) > > > > > > > > > > # \d bigtab_stats_fact_tmp14 > Table "public.bigtab_stats_fact_tmp14" > Column | Type | Modifiers > --------------------------+-----------------------------+----------- > pf_dts_id | bigint | > pf_device_id | bigint | > segment_id | bigint | > cdn_id | bigint | > collector_id | bigint | > digital_envoy_id | bigint | > maxmind_id | bigint | > quova_id | bigint | > website_id | bigint | > referrer_id | bigint | > affiliate_id | bigint | > custom_info_id | bigint | > start_dt | timestamp without time zone | > total_played_time_sec | numeric(18,5) | > bytes_received | bigint | > bytes_transmitted | bigint | > stall_count | integer | > stall_duration_sec | numeric(18,5) | > hiccup_count | integer | > hiccup_duration_sec | numeric(18,5) | > watched_duration_sec | numeric(18,5) | > rewatched_duration_sec | numeric(18,5) | > requested_start_position | numeric(18,5) | > requested_stop_position | numeric(18,5) | > post_position | numeric(18,5) | > is_vod | numeric(1,0) | > sessionid | bigint | > create_dt | timestamp without time zone | > segment_type_id | bigint | > customer_id | bigint | > content_publisher_id | bigint | > content_owner_id | bigint | > episode_id | bigint | > duration_sec | numeric(18,5) | > device_id | bigint | > os_id | bigint | > browser_id | bigint | > cpu_id | bigint | > xsystem_cfg_dim_id | bigint | > xreferrer_dim_id | bigint | > xaffiliate_dim_id | bigint | > xsite_dim_id | bigint | > xpublisher_dim_id | bigint | > season_id | bigint | > show_id | bigint | > xsegment_dim_id | bigint | > location_id | bigint | > zipcode | character varying(20) | > xlocation_dim_id | bigint | > location_srcid | bigint | > timezone | real | > xtime_dim_id | bigint | > Indexes: > "bigtab_stats_fact_tmp14_idx1" btree (customer_id) > "bigtab_stats_fact_tmp14_idx2" btree (show_id) > "bigtab_stats_fact_tmp14_idx3" btree (season_id) > "bigtab_stats_fact_tmp14_idx4" btree (episode_id) > > > > > > > # \d xsegment_dim > Table "public.xsegment_dim" > Column | Type | > Modifiers > ----------------------+-----------------------------+------------------------------------------------------------- > xsegment_dim_id | bigint | not null default > nextval('xsegment_dim_seq'::regclass) > customer_srcid | bigint | not null > show_srcid | bigint | not null > show_name | character varying(500) | not null > season_srcid | bigint | not null > season_name | character varying(500) | not null > episode_srcid | bigint | not null > episode_name | character varying(500) | not null > segment_type_id | integer | > segment_type | character varying(500) | > segment_srcid | bigint | > segment_name | character varying(500) | > effective_dt | timestamp without time zone | not null default now() > inactive_dt | timestamp without time zone | > last_update_dt | timestamp without time zone | not null default now() > Indexes: > "xsegment_dim_pk" PRIMARY KEY, btree (xsegment_dim_id) > "seg1" btree (customer_srcid) > "seg2" btree (show_srcid) > "seg3" btree (season_srcid) > "seg4" btree (episode_srcid) > "seg5" btree (segment_srcid) > "xsegment_dim_ix1" btree (customer_srcid) > > > > > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- regards Claus When lenity and cruelty play for a kingdom, the gentlest gamester is the soonest winner. Shakespeare
Also, I'm running version 8.3 on a centOS box with 2 dual core CPU's and 32Gig of ram On May 16, 2008, at 12:58 AM, kevin kempter wrote: > Sorry I goofed on the query text Here's the correct query: > > select > f14.xpublisher_dim_id, > f14.xtime_dim_id, > f14.xlocation_dim_id, > f14.xreferrer_dim_id, > f14.xsite_dim_id, > f14.xsystem_cfg_dim_id, > f14.xaffiliate_dim_id, > f14.customer_id, > f14.pf_dts_id, > f14.episode_id, > f14.sessionid, > f14.bytes_received, > f14.bytes_transmitted, > f14.total_played_time_sec, > segdim.xsegment_dim_id as episode_level_segid > from > bigtab_stats_fact_tmp14 f14, > xsegment_dim segdim > where > f14.customer_id = segdim.customer_srcid > and f14.show_id = segdim.show_srcid > and f14.season_id = segdim.season_srcid > and f14.episode_id = segdim.episode_srcid > and segdim.segment_srcid is NULL; > > > > > > > On May 16, 2008, at 12:31 AM, kevin kempter wrote: > >> Hi List; >> >> I have a table with 9,961,914 rows in it (see the describe of >> bigtab_stats_fact_tmp14 below) >> >> I also have a table with 7,785 rows in it (see the describe of >> xsegment_dim below) >> >> I'm running the join shown below and it takes > 10 hours and >> eventually runs out of disk space on a 1.4TB file system >> >> I've included below a describe of both tables, the join and an >> explain plan, any help / suggestions would be much appreciated ! >> >> I need to get this beast to run as quickly as possible (without >> filling up my file system) >> >> >> Thanks in advance... >> >> >> >> >> >> >> >> >> >> >> >> select >> f14.xpublisher_dim_id, >> f14.xtime_dim_id, >> f14.xlocation_dim_id, >> f14.xreferrer_dim_id, >> f14.xsite_dim_id, >> f14.xsystem_cfg_dim_id, >> f14.xaffiliate_dim_id, >> f14.customer_id, >> pf_dts_id, >> episode_id, >> sessionid, >> bytes_received, >> bytes_transmitted, >> total_played_time_sec, >> segdim.xsegment_dim_id as episode_level_segid >> from >> bigtab_stats_fact_tmp14 f14, >> xsegment_dim segdim >> where >> f14.customer_id = segdim.customer_srcid >> and f14.show_id = segdim.show_srcid >> and f14.season_id = segdim.season_srcid >> and f14.episode_id = segdim.episode_srcid >> and segdim.segment_srcid is NULL; >> >> >> >> >> >> >> QUERY PLAN >> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- >> Merge Join (cost=1757001.74..73569676.49 rows=3191677219 width=118) >> Merge Cond: ((segdim.episode_srcid = f14.episode_id) AND >> (segdim.customer_srcid = f14.customer_id) AND (segdim.show_srcid = >> f14.show_id) AND (segdim.season_srcid = f14.season_id)) >> -> Sort (cost=1570.35..1579.46 rows=3643 width=40) >> Sort Key: segdim.episode_srcid, segdim.customer_srcid, >> segdim.show_srcid, segdim.season_srcid >> -> Seq Scan on xsegment_dim segdim (cost=0.00..1354.85 rows=3643 >> width=40) >> Filter: (segment_srcid IS NULL) >> -> Sort (cost=1755323.26..1780227.95 rows=9961874 width=126) >> Sort Key: f14.episode_id, f14.customer_id, f14.show_id, f14.season_id >> -> Seq Scan on bigtab_stats_fact_tmp14 f14 (cost=0.00..597355.74 >> rows=9961874 width=126) >> (9 rows) >> >> >> >> >> >> >> >> >> >> # \d bigtab_stats_fact_tmp14 >> Table "public.bigtab_stats_fact_tmp14" >> Column | Type | Modifiers >> --------------------------+-----------------------------+----------- >> pf_dts_id | bigint | >> pf_device_id | bigint | >> segment_id | bigint | >> cdn_id | bigint | >> collector_id | bigint | >> digital_envoy_id | bigint | >> maxmind_id | bigint | >> quova_id | bigint | >> website_id | bigint | >> referrer_id | bigint | >> affiliate_id | bigint | >> custom_info_id | bigint | >> start_dt | timestamp without time zone | >> total_played_time_sec | numeric(18,5) | >> bytes_received | bigint | >> bytes_transmitted | bigint | >> stall_count | integer | >> stall_duration_sec | numeric(18,5) | >> hiccup_count | integer | >> hiccup_duration_sec | numeric(18,5) | >> watched_duration_sec | numeric(18,5) | >> rewatched_duration_sec | numeric(18,5) | >> requested_start_position | numeric(18,5) | >> requested_stop_position | numeric(18,5) | >> post_position | numeric(18,5) | >> is_vod | numeric(1,0) | >> sessionid | bigint | >> create_dt | timestamp without time zone | >> segment_type_id | bigint | >> customer_id | bigint | >> content_publisher_id | bigint | >> content_owner_id | bigint | >> episode_id | bigint | >> duration_sec | numeric(18,5) | >> device_id | bigint | >> os_id | bigint | >> browser_id | bigint | >> cpu_id | bigint | >> xsystem_cfg_dim_id | bigint | >> xreferrer_dim_id | bigint | >> xaffiliate_dim_id | bigint | >> xsite_dim_id | bigint | >> xpublisher_dim_id | bigint | >> season_id | bigint | >> show_id | bigint | >> xsegment_dim_id | bigint | >> location_id | bigint | >> zipcode | character varying(20) | >> xlocation_dim_id | bigint | >> location_srcid | bigint | >> timezone | real | >> xtime_dim_id | bigint | >> Indexes: >> "bigtab_stats_fact_tmp14_idx1" btree (customer_id) >> "bigtab_stats_fact_tmp14_idx2" btree (show_id) >> "bigtab_stats_fact_tmp14_idx3" btree (season_id) >> "bigtab_stats_fact_tmp14_idx4" btree (episode_id) >> >> >> >> >> >> >> # \d xsegment_dim >> Table "public.xsegment_dim" >> Column | Type >> | Modifiers >> ----------------------+----------------------------- >> +------------------------------------------------------------- >> xsegment_dim_id | bigint | not null default >> nextval('xsegment_dim_seq'::regclass) >> customer_srcid | bigint | not null >> show_srcid | bigint | not null >> show_name | character varying(500) | not null >> season_srcid | bigint | not null >> season_name | character varying(500) | not null >> episode_srcid | bigint | not null >> episode_name | character varying(500) | not null >> segment_type_id | integer | >> segment_type | character varying(500) | >> segment_srcid | bigint | >> segment_name | character varying(500) | >> effective_dt | timestamp without time zone | not null >> default now() >> inactive_dt | timestamp without time zone | >> last_update_dt | timestamp without time zone | not null >> default now() >> Indexes: >> "xsegment_dim_pk" PRIMARY KEY, btree (xsegment_dim_id) >> "seg1" btree (customer_srcid) >> "seg2" btree (show_srcid) >> "seg3" btree (season_srcid) >> "seg4" btree (episode_srcid) >> "seg5" btree (segment_srcid) >> "xsegment_dim_ix1" btree (customer_srcid) >> >> >> >> >> >> >> -- >> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org >> ) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-performance >
Sorry I goofed on the query text Here's the correct query: select f14.xpublisher_dim_id, f14.xtime_dim_id, f14.xlocation_dim_id, f14.xreferrer_dim_id, f14.xsite_dim_id, f14.xsystem_cfg_dim_id, f14.xaffiliate_dim_id, f14.customer_id, f14.pf_dts_id, f14.episode_id, f14.sessionid, f14.bytes_received, f14.bytes_transmitted, f14.total_played_time_sec, segdim.xsegment_dim_id as episode_level_segid from bigtab_stats_fact_tmp14 f14, xsegment_dim segdim where f14.customer_id = segdim.customer_srcid and f14.show_id = segdim.show_srcid and f14.season_id = segdim.season_srcid and f14.episode_id = segdim.episode_srcid and segdim.segment_srcid is NULL; On May 16, 2008, at 12:31 AM, kevin kempter wrote: > Hi List; > > I have a table with 9,961,914 rows in it (see the describe of > bigtab_stats_fact_tmp14 below) > > I also have a table with 7,785 rows in it (see the describe of > xsegment_dim below) > > I'm running the join shown below and it takes > 10 hours and > eventually runs out of disk space on a 1.4TB file system > > I've included below a describe of both tables, the join and an > explain plan, any help / suggestions would be much appreciated ! > > I need to get this beast to run as quickly as possible (without > filling up my file system) > > > Thanks in advance... > > > > > > > > > > > > select > f14.xpublisher_dim_id, > f14.xtime_dim_id, > f14.xlocation_dim_id, > f14.xreferrer_dim_id, > f14.xsite_dim_id, > f14.xsystem_cfg_dim_id, > f14.xaffiliate_dim_id, > f14.customer_id, > pf_dts_id, > episode_id, > sessionid, > bytes_received, > bytes_transmitted, > total_played_time_sec, > segdim.xsegment_dim_id as episode_level_segid > from > bigtab_stats_fact_tmp14 f14, > xsegment_dim segdim > where > f14.customer_id = segdim.customer_srcid > and f14.show_id = segdim.show_srcid > and f14.season_id = segdim.season_srcid > and f14.episode_id = segdim.episode_srcid > and segdim.segment_srcid is NULL; > > > > > > > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > Merge Join (cost=1757001.74..73569676.49 rows=3191677219 width=118) > Merge Cond: ((segdim.episode_srcid = f14.episode_id) AND > (segdim.customer_srcid = f14.customer_id) AND (segdim.show_srcid = > f14.show_id) AND (segdim.season_srcid = f14.season_id)) > -> Sort (cost=1570.35..1579.46 rows=3643 width=40) > Sort Key: segdim.episode_srcid, segdim.customer_srcid, > segdim.show_srcid, segdim.season_srcid > -> Seq Scan on xsegment_dim segdim (cost=0.00..1354.85 rows=3643 > width=40) > Filter: (segment_srcid IS NULL) > -> Sort (cost=1755323.26..1780227.95 rows=9961874 width=126) > Sort Key: f14.episode_id, f14.customer_id, f14.show_id, f14.season_id > -> Seq Scan on bigtab_stats_fact_tmp14 f14 (cost=0.00..597355.74 > rows=9961874 width=126) > (9 rows) > > > > > > > > > > # \d bigtab_stats_fact_tmp14 > Table "public.bigtab_stats_fact_tmp14" > Column | Type | Modifiers > --------------------------+-----------------------------+----------- > pf_dts_id | bigint | > pf_device_id | bigint | > segment_id | bigint | > cdn_id | bigint | > collector_id | bigint | > digital_envoy_id | bigint | > maxmind_id | bigint | > quova_id | bigint | > website_id | bigint | > referrer_id | bigint | > affiliate_id | bigint | > custom_info_id | bigint | > start_dt | timestamp without time zone | > total_played_time_sec | numeric(18,5) | > bytes_received | bigint | > bytes_transmitted | bigint | > stall_count | integer | > stall_duration_sec | numeric(18,5) | > hiccup_count | integer | > hiccup_duration_sec | numeric(18,5) | > watched_duration_sec | numeric(18,5) | > rewatched_duration_sec | numeric(18,5) | > requested_start_position | numeric(18,5) | > requested_stop_position | numeric(18,5) | > post_position | numeric(18,5) | > is_vod | numeric(1,0) | > sessionid | bigint | > create_dt | timestamp without time zone | > segment_type_id | bigint | > customer_id | bigint | > content_publisher_id | bigint | > content_owner_id | bigint | > episode_id | bigint | > duration_sec | numeric(18,5) | > device_id | bigint | > os_id | bigint | > browser_id | bigint | > cpu_id | bigint | > xsystem_cfg_dim_id | bigint | > xreferrer_dim_id | bigint | > xaffiliate_dim_id | bigint | > xsite_dim_id | bigint | > xpublisher_dim_id | bigint | > season_id | bigint | > show_id | bigint | > xsegment_dim_id | bigint | > location_id | bigint | > zipcode | character varying(20) | > xlocation_dim_id | bigint | > location_srcid | bigint | > timezone | real | > xtime_dim_id | bigint | > Indexes: > "bigtab_stats_fact_tmp14_idx1" btree (customer_id) > "bigtab_stats_fact_tmp14_idx2" btree (show_id) > "bigtab_stats_fact_tmp14_idx3" btree (season_id) > "bigtab_stats_fact_tmp14_idx4" btree (episode_id) > > > > > > > # \d xsegment_dim > Table "public.xsegment_dim" > Column | Type > | Modifiers > ----------------------+----------------------------- > +------------------------------------------------------------- > xsegment_dim_id | bigint | not null default > nextval('xsegment_dim_seq'::regclass) > customer_srcid | bigint | not null > show_srcid | bigint | not null > show_name | character varying(500) | not null > season_srcid | bigint | not null > season_name | character varying(500) | not null > episode_srcid | bigint | not null > episode_name | character varying(500) | not null > segment_type_id | integer | > segment_type | character varying(500) | > segment_srcid | bigint | > segment_name | character varying(500) | > effective_dt | timestamp without time zone | not null > default now() > inactive_dt | timestamp without time zone | > last_update_dt | timestamp without time zone | not null > default now() > Indexes: > "xsegment_dim_pk" PRIMARY KEY, btree (xsegment_dim_id) > "seg1" btree (customer_srcid) > "seg2" btree (show_srcid) > "seg3" btree (season_srcid) > "seg4" btree (episode_srcid) > "seg5" btree (segment_srcid) > "xsegment_dim_ix1" btree (customer_srcid) > > > > > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org > ) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance
On Fri, 2008-05-16 at 00:31 -0600, kevin kempter wrote: > I'm running the join shown below and it takes > 10 hours and > eventually runs out of disk space on a 1.4TB file system Well, running in 10 hours doesn't mean there's a software problem, nor does running out of disk space. Please crunch some numbers before you ask, such as how much disk space was used by the query, how big you'd expect it to be etc, plus provide information such as what the primary key of the large table is and what is your release level is etc.. Are you sure you want to retrieve an estimated 3 billion rows? Can you cope if that estimate is wrong and the true figure is much higher? Do you think the estimate is realistic? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
kevin kempter wrote: > Hi List; > > I have a table with 9,961,914 rows in it (see the describe of > bigtab_stats_fact_tmp14 below) > > I also have a table with 7,785 rows in it (see the describe of > xsegment_dim below) > > I'm running the join shown below and it takes > 10 hours and eventually > runs out of disk space on a 1.4TB file system > QUERY PLAN > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Merge Join (cost=1757001.74..73569676.49 rows=3191677219 width=118) Dumb question Kevin, but are you really expecting 3.2 billion rows in the result-set? Because that's approaching 400GB of result-set without any overheads. -- Richard Huxton Archonet Ltd
I'm expecting 9,961,914 rows returned. Each row in the big table should have a corresponding key in the smaller tale, I want to basically "expand" the big table column list by one, via adding the appropriate key from the smaller table for each row in the big table. It's not a cartesion product join. On May 16, 2008, at 1:40 AM, Richard Huxton wrote: > kevin kempter wrote: >> Hi List; >> I have a table with 9,961,914 rows in it (see the describe of >> bigtab_stats_fact_tmp14 below) >> I also have a table with 7,785 rows in it (see the describe of >> xsegment_dim below) >> I'm running the join shown below and it takes > 10 hours and >> eventually runs out of disk space on a 1.4TB file system > >> QUERY PLAN >> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Merge >> Join (cost=1757001.74..73569676.49 rows=3191677219 width=118) > > Dumb question Kevin, but are you really expecting 3.2 billion rows > in the result-set? Because that's approaching 400GB of result-set > without any overheads. > > -- > Richard Huxton > Archonet Ltd
kevin kempter wrote: > I'm expecting 9,961,914 rows returned. Each row in the big table should > have a corresponding key in the smaller tale, I want to basically > "expand" the big table column list by one, via adding the appropriate > key from the smaller table for each row in the big table. It's not a > cartesion product join. Didn't seem likely, to be honest. What happens if you try the query as a cursor, perhaps with an order-by on customer_id or something to encourage index use? Do you ever get a first row back? In fact, what happens if you slap an index over all your join columns on xsegment_dim? With 7,000 rows that should make it a cheap test. -- Richard Huxton Archonet Ltd
kevin kempter wrote: > Hi List; > > I have a table with 9,961,914 rows in it (see the describe of > bigtab_stats_fact_tmp14 below) > > I also have a table with 7,785 rows in it (see the describe of > xsegment_dim below) Something else is puzzling me with this - you're joining over four fields. > from > bigtab_stats_fact_tmp14 f14, > xsegment_dim segdim > where > f14.customer_id = segdim.customer_srcid > and f14.show_id = segdim.show_srcid > and f14.season_id = segdim.season_srcid > and f14.episode_id = segdim.episode_srcid > and segdim.segment_srcid is NULL; > --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Merge Join (cost=1757001.74..73569676.49 rows=3191677219 width=118) > -> Sort (cost=1570.35..1579.46 rows=3643 width=40) > -> Sort (cost=1755323.26..1780227.95 rows=9961874 width=126) Here it's still expecting 320 matches against each row from the large table. That's ~ 10% of the small table (or that fraction of it that PG expects) which seems very high for four clauses ANDed together. -- Richard Huxton Archonet Ltd
Try 'set enable-mergejoin=false' and see if you get a hashjoin.
- Luke
----- Original Message -----
From: pgsql-performance-owner@postgresql.org <pgsql-performance-owner@postgresql.org>
To: Richard Huxton <dev@archonet.com>
Cc: pgsql-performance@postgresql.org <pgsql-performance@postgresql.org>
Sent: Fri May 16 04:00:41 2008
Subject: Re: [PERFORM] Join runs for > 10 hours and then fills up >1.3TB of disk space
I'm expecting 9,961,914 rows returned. Each row in the big table
should have a corresponding key in the smaller tale, I want to
basically "expand" the big table column list by one, via adding the
appropriate key from the smaller table for each row in the big table.
It's not a cartesion product join.
On May 16, 2008, at 1:40 AM, Richard Huxton wrote:
> kevin kempter wrote:
>> Hi List;
>> I have a table with 9,961,914 rows in it (see the describe of
>> bigtab_stats_fact_tmp14 below)
>> I also have a table with 7,785 rows in it (see the describe of
>> xsegment_dim below)
>> I'm running the join shown below and it takes > 10 hours and
>> eventually runs out of disk space on a 1.4TB file system
>
>> QUERY PLAN
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Merge
>> Join (cost=1757001.74..73569676.49 rows=3191677219 width=118)
>
> Dumb question Kevin, but are you really expecting 3.2 billion rows
> in the result-set? Because that's approaching 400GB of result-set
> without any overheads.
>
> --
> Richard Huxton
> Archonet Ltd
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
On further investigation it turns out that I/we have a serious data issue in that my small table is full of 'UNKNOWN' tags so my query cannot associate the data correctly - thus I will end up with 2+ billion rows. Thanks everyone for your help On May 16, 2008, at 1:38 AM, Simon Riggs wrote: > > On Fri, 2008-05-16 at 00:31 -0600, kevin kempter wrote: > >> I'm running the join shown below and it takes > 10 hours and >> eventually runs out of disk space on a 1.4TB file system > > Well, running in 10 hours doesn't mean there's a software problem, nor > does running out of disk space. > > Please crunch some numbers before you ask, such as how much disk space > was used by the query, how big you'd expect it to be etc, plus provide > information such as what the primary key of the large table is and > what > is your release level is etc.. > > Are you sure you want to retrieve an estimated 3 billion rows? Can you > cope if that estimate is wrong and the true figure is much higher? Do > you think the estimate is realistic? > > -- > Simon Riggs www.2ndQuadrant.com > PostgreSQL Training, Services and Support > > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org > ) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance