Re: Looking for ideas on how to speed up warehouse loading - Mailing list pgsql-performance
From:           Sean Shanny <shannyconsulting@earthlink.net>
Subject:        Re: Looking for ideas on how to speed up warehouse loading
Msg-id:         408855AD.4090500@earthlink.net
In response to: Looking for ideas on how to speed up warehouse loading (Sean Shanny <shannyconsulting@earthlink.net>)
Responses:      Re: Looking for ideas on how to speed up warehouse loading (Nicholas Shanny <nshanny@tripadvisor.com>)
                Re: Looking for ideas on how to speed up warehouse loading ("scott.marlowe" <scott.marlowe@ihs.com>)
List:           pgsql-performance
I should have included this as well, the output of "show all":

              name               |    setting
---------------------------------+----------------
 add_missing_from                | on
 australian_timezones            | off
 authentication_timeout          | 60
 check_function_bodies           | on
 checkpoint_segments             | 64
 checkpoint_timeout              | 30
 checkpoint_warning              | 30
 client_encoding                 | UNICODE
 client_min_messages             | notice
 commit_delay                    | 0
 commit_siblings                 | 5
 cpu_index_tuple_cost            | 0.001
 cpu_operator_cost               | 0.0025
 cpu_tuple_cost                  | 0.01
 DateStyle                       | ISO, MDY
 db_user_namespace               | off
 deadlock_timeout                | 1000
 debug_pretty_print              | off
 debug_print_parse               | off
 debug_print_plan                | off
 debug_print_rewritten           | off
 default_statistics_target       | 1000
 default_transaction_isolation   | read committed
 default_transaction_read_only   | off
 dynamic_library_path            | $libdir
 effective_cache_size            | 400000
 enable_hashagg                  | on
 enable_hashjoin                 | on
 enable_indexscan                | on
 enable_mergejoin                | on
 enable_nestloop                 | on
 enable_seqscan                  | on
 enable_sort                     | on
 enable_tidscan                  | on
 explain_pretty_print            | on
 extra_float_digits              | 0
 from_collapse_limit             | 8
 fsync                           | on
 geqo                            | on
 geqo_effort                     | 1
 geqo_generations                | 0
 geqo_pool_size                  | 0
 geqo_selection_bias             | 2
 geqo_threshold                  | 11
 join_collapse_limit             | 8
 krb_server_keyfile              | unset
 lc_collate                      | C
 lc_ctype                        | C
 lc_messages                     | C
 lc_monetary                     | C
 lc_numeric                      | C
 lc_time                         | C
 log_connections                 | off
 log_duration                    | off
 log_error_verbosity             | default
 log_executor_stats              | off
 log_hostname                    | off
 log_min_duration_statement      | -1
 log_min_error_statement         | panic
 log_min_messages                | notice
 log_parser_stats                | off
 log_pid                         | off
 log_planner_stats               | off
 log_source_port                 | off
 log_statement                   | off
 log_statement_stats             | off
 log_timestamp                   | on
 max_connections                 | 100
 max_expr_depth                  | 10000
 max_files_per_process           | 1000
 max_fsm_pages                   | 20000
 max_fsm_relations               | 1000
 max_locks_per_transaction       | 64
 password_encryption             | on
 port                            | 5432
 pre_auth_delay                  | 0
 preload_libraries               | unset
 random_page_cost                | 4
 regex_flavor                    | advanced
 rendezvous_name                 | unset
 search_path                     | $user,public
 server_encoding                 | UNICODE
 server_version                  | 7.4.1
 shared_buffers                  | 4000
 silent_mode                     | off
 sort_mem                        | 64000
 sql_inheritance                 | on
 ssl                             | off
 statement_timeout               | 0
 stats_block_level               | on
 stats_command_string            | on
 stats_reset_on_server_start     | off
 stats_row_level                 | on
 stats_start_collector           | on
 superuser_reserved_connections  | 2
 syslog                          | 0
 syslog_facility                 | LOCAL0
 syslog_ident                    | postgres
 tcpip_socket                    | on
 TimeZone                        | unknown
 trace_notify                    | off
 transaction_isolation           | read committed
 transaction_read_only           | off
 transform_null_equals           | off
 unix_socket_directory           | unset
 unix_socket_group               | unset
 unix_socket_permissions         | 511
 vacuum_mem                      | 64000
 virtual_host                    | unset
 wal_buffers                     | 1024
 wal_debug                       | 0
 wal_sync_method                 | open_sync
 zero_damaged_pages              | off
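One note on the settings above: in 7.4 the hash table built for a hash join is sized under the sort_mem budget (64000 KB here), so hashing a 46-million-row inner relation will spill to disk in batches. A per-session override before the big join is worth trying; a minimal sketch, where 512000 (~500 MB) is an assumed value that is only safe if few such backends run concurrently:

    -- raise the per-sort/per-hash memory for this session only
    -- (7.4's sort_mem is measured in KB; 512000 = ~500 MB is an assumed value)
    SET sort_mem = 512000;
    -- ... run the large join here ...
    RESET sort_mem;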
Sean Shanny wrote:

> To all,
>
> Essentials: Running 7.4.1 on OSX on a loaded G5 with dual procs, 8GB
> memory, direct attached via fibre channel to a fully optioned 3.5TB
> XRaid (14 spindles, 2 sets of 7 in RAID 5) box running RAID 50.
>
> Background: We are loading what are essentially xml-based access logs
> from about 20+ webservers daily, about 6GB of raw data. We have a
> classic star schema. All the ETL tools are custom Java code or
> standard *nix tools like sort, uniq, etc.
>
> The problem: We have about 46 million rows in a table with the
> following schema:
>
>        Table "public.d_referral"
>        Column        |  Type   | Modifiers
> --------------------+---------+-----------
>  id                 | integer | not null
>  referral_raw_url   | text    | not null
>  job_control_number | integer | not null
> Indexes:
>     "d_referral_pkey" primary key, btree (id)
>     "idx_referral_url" btree (referral_raw_url)
>
> This is one of our dimension tables. Part of the daily ETL process is
> to match all the new referral URLs against existing data in the
> d_referral table. Some of the values in referral_raw_url can be 5000
> characters long :-( . The avg length is 109.57 characters.
>
> I sort and uniq all the incoming referrals and load them into a temp
> table:
>
>  Table "public.referral_temp"
>  Column | Type | Modifiers
> --------+------+-----------
>  url    | text | not null
> Indexes:
>     "referral_temp_pkey" primary key, btree (url)
>
> I then do a left join:
>
> SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral
> t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id
>
> This is the output from an explain analyze. (Please note that I do
> set enable_indexscan = false prior to issuing this, because it takes
> forever using indexes.)
>
> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
> JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
>
>                                  QUERY PLAN
> ------------------------------------------------------------------------
>  Sort  (cost=4012064.81..4013194.45 rows=451856 width=115)
>        (actual time=1297320.823..1297739.813 rows=476176 loops=1)
>    Sort Key: t1.id
>    ->  Hash Left Join  (cost=1052345.95..3969623.10 rows=451856 width=115)
>          (actual time=1146650.487..1290230.590 rows=476176 loops=1)
>          Hash Cond: ("outer".url = "inner".referral_raw_url)
>          ->  Seq Scan on referral_temp t2  (cost=0.00..6645.56 rows=451856
>              width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
>          ->  Hash  (cost=729338.16..729338.16 rows=46034716 width=124)
>              (actual time=1146440.710..1146440.710 rows=0 loops=1)
>                ->  Seq Scan on d_referral t1  (cost=0.00..729338.16
>                    rows=46034716 width=124) (actual
>                    time=14.502..-1064277.123 rows=46034715 loops=1)
>  Total runtime: 1298153.193 ms
> (8 rows)
>
> What I would like to know is whether there are better ways to do the
> join. I need to get all the rows back from the referral_temp table, as
> they are used for assigning FKs for the fact table later in processing.
> When I iterate over the values that come back, those with t1.id = null
> get a new FK assigned and are pushed into the d_referral table as new
> entries, as well as into a text file for later use. The matching
> records are written to a text file for later use.
>
> If we cannot improve the join performance, my question becomes: are
> there better tools outside of postgresql to match up the 46 million
> strings, growing at the rate of 1 million every 3 days? We don't want
> to have to invest in zillions of dollars worth of hardware, but if we
> have to we will. I just want to make sure we have all the non-hardware
> possibilities for improvement covered before we start investing in
> large disk arrays.
>
> Thanks.
>
> --sean
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if
>        your joining column's datatypes do not match
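Re the iterate-and-assign step quoted above: the per-row loop could in principle be folded into one set-based statement. A rough sketch only, where d_referral_id_seq is a hypothetical sequence name and 0 stands in for the real job_control_number of the current load:

    -- sketch: insert every previously unseen URL with a freshly assigned id,
    -- instead of iterating over the NULL-id rows in client code
    INSERT INTO d_referral (id, referral_raw_url, job_control_number)
    SELECT nextval('d_referral_id_seq'), t2.url, 0
    FROM referral_temp t2
    LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url
    WHERE t1.id IS NULL;

The matched and new rows would still need a follow-up SELECT to produce the two text files, but the join itself is done once instead of once per row.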