Re: Looking for ideas on how to speed up warehouse loading - Mailing list pgsql-performance
From: Nicholas Shanny
Subject: Re: Looking for ideas on how to speed up warehouse loading
Msg-id: BE0C810C-94C0-11D8-B96E-000A95770900@tripadvisor.com
In response to: Re: Looking for ideas on how to speed up warehouse loading (Sean Shanny <shannyconsulting@earthlink.net>)
List: pgsql-performance
One other thing: we are running with a block size of 32K.

Nick Shanny
(Brother of above person)

On Apr 22, 2004, at 7:30 PM, Sean Shanny wrote:

> I should have included this as well:
>
> show all;
>
> name | setting
> --------------------------------+----------------
> add_missing_from | on
> australian_timezones | off
> authentication_timeout | 60
> check_function_bodies | on
> checkpoint_segments | 64
> checkpoint_timeout | 30
> checkpoint_warning | 30
> client_encoding | UNICODE
> client_min_messages | notice
> commit_delay | 0
> commit_siblings | 5
> cpu_index_tuple_cost | 0.001
> cpu_operator_cost | 0.0025
> cpu_tuple_cost | 0.01
> DateStyle | ISO, MDY
> db_user_namespace | off
> deadlock_timeout | 1000
> debug_pretty_print | off
> debug_print_parse | off
> debug_print_plan | off
> debug_print_rewritten | off
> default_statistics_target | 1000
> default_transaction_isolation | read committed
> default_transaction_read_only | off
> dynamic_library_path | $libdir
> effective_cache_size | 400000
> enable_hashagg | on
> enable_hashjoin | on
> enable_indexscan | on
> enable_mergejoin | on
> enable_nestloop | on
> enable_seqscan | on
> enable_sort | on
> enable_tidscan | on
> explain_pretty_print | on
> extra_float_digits | 0
> from_collapse_limit | 8
> fsync | on
> geqo | on
> geqo_effort | 1
> geqo_generations | 0
> geqo_pool_size | 0
> geqo_selection_bias | 2
> geqo_threshold | 11
> join_collapse_limit | 8
> krb_server_keyfile | unset
> lc_collate | C
> lc_ctype | C
> lc_messages | C
> lc_monetary | C
> lc_numeric | C
> lc_time | C
> log_connections | off
> log_duration | off
> log_error_verbosity | default
> log_executor_stats | off
> log_hostname | off
> log_min_duration_statement | -1
> log_min_error_statement | panic
> log_min_messages | notice
> log_parser_stats | off
> log_pid | off
> log_planner_stats | off
> log_source_port | off
> log_statement | off
> log_statement_stats | off
> log_timestamp | on
> max_connections | 100
> max_expr_depth | 10000
> max_files_per_process | 1000
> max_fsm_pages | 20000
> max_fsm_relations | 1000
> max_locks_per_transaction | 64
> password_encryption | on
> port | 5432
> pre_auth_delay | 0
> preload_libraries | unset
> random_page_cost | 4
> regex_flavor | advanced
> rendezvous_name | unset
> search_path | $user,public
> server_encoding | UNICODE
> server_version | 7.4.1
> shared_buffers | 4000
> silent_mode | off
> sort_mem | 64000
> sql_inheritance | on
> ssl | off
> statement_timeout | 0
> stats_block_level | on
> stats_command_string | on
> stats_reset_on_server_start | off
> stats_row_level | on
> stats_start_collector | on
> superuser_reserved_connections | 2
> syslog | 0
> syslog_facility | LOCAL0
> syslog_ident | postgres
> tcpip_socket | on
> TimeZone | unknown
> trace_notify | off
> transaction_isolation | read committed
> transaction_read_only | off
> transform_null_equals | off
> unix_socket_directory | unset
> unix_socket_group | unset
> unix_socket_permissions | 511
> vacuum_mem | 64000
> virtual_host | unset
> wal_buffers | 1024
> wal_debug | 0
> wal_sync_method | open_sync
> zero_damaged_pages | off
>
> Sean Shanny wrote:
>
>> To all,
>>
>> Essentials: Running 7.4.1 on OSX on a loaded G5 with dual procs, 8GB memory, direct attached via fibre channel to a fully optioned 3.5TB XRaid (14 spindles, 2 sets of 7 in RAID 5) box running RAID 50.
>>
>> Background: We are loading what are essentially xml based access logs from about 20+ webservers daily, about 6GB of raw data. We have a classic star schema.
>> All the ETL tools are custom Java code or standard *nix tools like sort, uniq, etc.
>>
>> The problem: We have about 46 million rows in a table with the following schema:
>>
>>          Table "public.d_referral"
>>        Column        |  Type   | Modifiers
>> --------------------+---------+-----------
>>  id                 | integer | not null
>>  referral_raw_url   | text    | not null
>>  job_control_number | integer | not null
>> Indexes:
>>     "d_referral_pkey" primary key, btree (id)
>>     "idx_referral_url" btree (referral_raw_url)
>>
>> This is one of our dimension tables. Part of the daily ETL process is to match all the new referral URLs against existing data in the d_referral table. Some of the values in referral_raw_url can be 5000 characters long :-( The avg length is 109.57 characters.
>>
>> I sort and uniq all the incoming referrals and load them into a temp table.
>>
>>  Table "public.referral_temp"
>>  Column | Type | Modifiers
>> --------+------+-----------
>>  url    | text | not null
>> Indexes:
>>     "referral_temp_pkey" primary key, btree (url)
>>
>> I then do a left join:
>>
>> SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1
>> ON t2.url = t1.referral_raw_url ORDER BY t1.id
>>
>> This is the output from an explain analyze (please note that I do a set enable_indexscan = false prior to issuing this because it takes forever using indexes):
>>
>> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
>>
>>                                        QUERY PLAN
>> ------------------------------------------------------------------------------------------
>>  Sort  (cost=4012064.81..4013194.45 rows=451856 width=115) (actual time=1297320.823..1297739.813 rows=476176 loops=1)
>>    Sort Key: t1.id
>>    ->  Hash Left Join  (cost=1052345.95..3969623.10 rows=451856 width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
>>          Hash Cond: ("outer".url = "inner".referral_raw_url)
>>          ->  Seq Scan on referral_temp t2  (cost=0.00..6645.56 rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
>>          ->  Hash  (cost=729338.16..729338.16 rows=46034716 width=124) (actual time=1146440.710..1146440.710 rows=0 loops=1)
>>                ->  Seq Scan on d_referral t1  (cost=0.00..729338.16 rows=46034716 width=124) (actual time=14.502..1064277.123 rows=46034715 loops=1)
>>  Total runtime: 1298153.193 ms
>> (8 rows)
>>
>> What I would like to know is whether there are better ways to do the join. I need to get all the rows back from the referral_temp table, as they are used for assigning FKs for the fact table later in processing. When I iterate over the values that I get back, those with t1.id = null are assigned a new FK and pushed into the d_referral table as new entries, as well as written to a text file for later use. The matching records are written to a text file for later use.
>>
>> If we cannot improve the join performance, my question becomes: are there better tools to match up the 46 million strings (growing at the rate of 1 million every 3 days) outside of PostgreSQL? We don't want to have to invest in zillions of dollars worth of hardware, but if we have to we will. I just want to make sure we have all the non-hardware possibilities for improvement covered before we start investing in large disk arrays.
>>
>> Thanks.
>>
>> --sean
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 9: the planner will ignore your desire to choose an index scan if your
>>        joining column's datatypes do not match
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
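A side note for readers on a similar 7.4 setup: several of the settings in the show all dump above, including the planner toggles and sort_mem, can be overridden per session, so the load connection can be tuned without touching postgresql.conf. A minimal sketch; the values are illustrative, not recommendations from this thread:

    -- Session-local overrides for the bulk-load connection only.
    -- Illustrative values, not from the thread. In 7.4, sort_mem (in KB)
    -- also bounds the memory a hash join may use; it became work_mem in 8.0.
    SET enable_indexscan = off;  -- what the poster does before the big join
    SET sort_mem = 128000;       -- double the global 64000 shown above

    SELECT t1.id, t2.url
    FROM referral_temp t2
    LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url
    ORDER BY t1.id;

Both SETs last until the session ends or a RESET, so other connections keep the postgresql.conf values.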
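On the join itself, one variation sometimes used when the join key is a very long string is to match on a fixed-width digest and recheck the full value only on digest matches. This is a sketch, not something proposed in the thread; the url_md5 column and idx_referral_md5 index are hypothetical names, and md5() is available in 7.4:

    -- Hypothetical one-time backfill: store a 32-character digest per URL.
    ALTER TABLE d_referral ADD COLUMN url_md5 text;
    UPDATE d_referral SET url_md5 = md5(referral_raw_url);
    CREATE INDEX idx_referral_md5 ON d_referral (url_md5);

    -- Join on the digest; recheck the full URL to rule out md5 collisions.
    SELECT t1.id, t2.url
    FROM referral_temp t2
    LEFT OUTER JOIN d_referral t1
           ON t1.url_md5 = md5(t2.url)
          AND t1.referral_raw_url = t2.url
    ORDER BY t1.id;

The digest mainly shrinks the hash keys and the index entries used for the match; whether that helps here depends on how much of the roughly 21-minute runtime is string handling rather than I/O on the 46-million-row scan.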