Re: Looking for ideas on how to speed up warehouse loading - Mailing list pgsql-performance

From Nicholas Shanny
Subject Re: Looking for ideas on how to speed up warehouse loading
Date
Msg-id BE0C810C-94C0-11D8-B96E-000A95770900@tripadvisor.com
In response to Re: Looking for ideas on how to speed up warehouse loading  (Sean Shanny <shannyconsulting@earthlink.net>)
List pgsql-performance
One other thing: we are running with a block size of 32K.


Nick Shanny
(Brother of above person)

On Apr 22, 2004, at 7:30 PM, Sean Shanny wrote:

> I should have included this as well:
>
> show all;
>              name              |    setting
> --------------------------------+----------------
> add_missing_from               | on
> australian_timezones           | off
> authentication_timeout         | 60
> check_function_bodies          | on
> checkpoint_segments            | 64
> checkpoint_timeout             | 30
> checkpoint_warning             | 30
> client_encoding                | UNICODE
> client_min_messages            | notice
> commit_delay                   | 0
> commit_siblings                | 5
> cpu_index_tuple_cost           | 0.001
> cpu_operator_cost              | 0.0025
> cpu_tuple_cost                 | 0.01
> DateStyle                      | ISO, MDY
> db_user_namespace              | off
> deadlock_timeout               | 1000
> debug_pretty_print             | off
> debug_print_parse              | off
> debug_print_plan               | off
> debug_print_rewritten          | off
> default_statistics_target      | 1000
> default_transaction_isolation  | read committed
> default_transaction_read_only  | off
> dynamic_library_path           | $libdir
> effective_cache_size           | 400000
> enable_hashagg                 | on
> enable_hashjoin                | on
> enable_indexscan               | on
> enable_mergejoin               | on
> enable_nestloop                | on
> enable_seqscan                 | on
> enable_sort                    | on
> enable_tidscan                 | on
> explain_pretty_print           | on
> extra_float_digits             | 0
> from_collapse_limit            | 8
> fsync                          | on
> geqo                           | on
> geqo_effort                    | 1
> geqo_generations               | 0
> geqo_pool_size                 | 0
> geqo_selection_bias            | 2
> geqo_threshold                 | 11
> join_collapse_limit            | 8
> krb_server_keyfile             | unset
> lc_collate                     | C
> lc_ctype                       | C
> lc_messages                    | C
> lc_monetary                    | C
> lc_numeric                     | C
> lc_time                        | C
> log_connections                | off
> log_duration                   | off
> log_error_verbosity            | default
> log_executor_stats             | off
> log_hostname                   | off
> log_min_duration_statement     | -1
> log_min_error_statement        | panic
> log_min_messages               | notice
> log_parser_stats               | off
> log_pid                        | off
> log_planner_stats              | off
> log_source_port                | off
> log_statement                  | off
> log_statement_stats            | off
> log_timestamp                  | on
> max_connections                | 100
> max_expr_depth                 | 10000
> max_files_per_process          | 1000
> max_fsm_pages                  | 20000
> max_fsm_relations              | 1000
> max_locks_per_transaction      | 64
> password_encryption            | on
> port                           | 5432
> pre_auth_delay                 | 0
> preload_libraries              | unset
> random_page_cost               | 4
> regex_flavor                   | advanced
> rendezvous_name                | unset
> search_path                    | $user,public
> server_encoding                | UNICODE
> server_version                 | 7.4.1
> shared_buffers                 | 4000
> silent_mode                    | off
> sort_mem                       | 64000
> sql_inheritance                | on
> ssl                            | off
> statement_timeout              | 0
> stats_block_level              | on
> stats_command_string           | on
> stats_reset_on_server_start    | off
> stats_row_level                | on
> stats_start_collector          | on
> superuser_reserved_connections | 2
> syslog                         | 0
> syslog_facility                | LOCAL0
> syslog_ident                   | postgres
> tcpip_socket                   | on
> TimeZone                       | unknown
> trace_notify                   | off
> transaction_isolation          | read committed
> transaction_read_only          | off
> transform_null_equals          | off
> unix_socket_directory          | unset
> unix_socket_group              | unset
> unix_socket_permissions        | 511
> vacuum_mem                     | 64000
> virtual_host                   | unset
> wal_buffers                    | 1024
> wal_debug                      | 0
> wal_sync_method                | open_sync
> zero_damaged_pages             | off
>
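(A side note: rather than wading through the whole show all dump, the
handful of settings most relevant to bulk loading can be pulled with a
query along these lines.  This is just a convenience sketch; it assumes
the pg_settings view in 7.4 exposes the name and setting columns, which
I believe it does.)

    SELECT name, setting
      FROM pg_settings
     WHERE name IN ('shared_buffers', 'sort_mem', 'vacuum_mem',
                    'checkpoint_segments', 'wal_buffers', 'fsync',
                    'wal_sync_method', 'effective_cache_size');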
>
> Sean Shanny wrote:
>
>> To all,
>>
>> Essentials:  Running 7.4.1 on OS X on a loaded G5 with dual procs,
>> 8GB of memory, direct-attached via Fibre Channel to a fully optioned
>> 3.5TB XRaid (14 spindles, 2 sets of 7 in RAID 5) box running RAID 50.
>>
>> Background:  We are loading what are essentially XML-based access
>> logs from about 20+ web servers daily, about 6GB of raw data.  We have
>> a classic star schema.  All the ETL tools are custom Java code or
>> standard *nix tools like sort, uniq, etc.
>>
>> The problem:  We have about 46 million rows in a table with the
>> following schema:
>>
>> Table "public.d_referral"
>>       Column       |  Type   | Modifiers
>> --------------------+---------+-----------
>> id                 | integer | not null
>> referral_raw_url   | text    | not null
>> job_control_number | integer | not null
>> Indexes:
>>    "d_referral_pkey" primary key, btree (id)
>>    "idx_referral_url" btree (referral_raw_url)
>>
>> This is one of our dimension tables.  Part of the daily ETL process
>> is to match all the new referral URLs against existing data in the
>> d_referral table.  Some of the values in referral_raw_url can be 5000
>> characters long :-(.  The average length is 109.57 characters.
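(Presumably that figure comes from something like the query below; just
spelling it out in case anyone wants to reproduce it.)

    SELECT avg(length(referral_raw_url)) AS avg_len,
           max(length(referral_raw_url)) AS max_len
      FROM d_referral;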
>>
>> I sort and uniq all the incoming referrals and load them into a temp
>> table.
>>
>> Table "public.referral_temp"
>> Column | Type | Modifiers
>> --------+------+-----------
>> url    | text | not null
>> Indexes:
>>    "referral_temp_pkey" primary key, btree (url)
>>
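(For what it's worth, the load step described above amounts to something
like the statement below.  The file name is made up, and the file is
assumed to have already been run through sort | uniq; psql's \copy works
too if the file lives on the client side.)

    -- hypothetical file name; assumed to be pre-sorted and deduplicated
    COPY referral_temp (url) FROM '/path/to/deduped_referrals.txt';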
>> I then do a left join
>>
>> SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral
>> t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id
>>
>> This is the output from an EXPLAIN ANALYZE (please note that I do a
>> SET enable_indexscan = false prior to issuing this, because it takes
>> forever using indexes):
>>
>> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
>> JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
>>
>> QUERY PLAN
>> --------------------------------------------------------------------------------------------------------------------------------------------
>> Sort  (cost=4012064.81..4013194.45 rows=451856 width=115) (actual time=1297320.823..1297739.813 rows=476176 loops=1)
>>   Sort Key: t1.id
>>   ->  Hash Left Join  (cost=1052345.95..3969623.10 rows=451856 width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
>>         Hash Cond: ("outer".url = "inner".referral_raw_url)
>>         ->  Seq Scan on referral_temp t2  (cost=0.00..6645.56 rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
>>         ->  Hash  (cost=729338.16..729338.16 rows=46034716 width=124) (actual time=1146440.710..1146440.710 rows=0 loops=1)
>>               ->  Seq Scan on d_referral t1  (cost=0.00..729338.16 rows=46034716 width=124) (actual time=14.502..-1064277.123 rows=46034715 loops=1)
>> Total runtime: 1298153.193 ms
>> (8 rows)
>>
>>
>>
>> What I would like to know is whether there are better ways to do the
>> join.  I need to get all the rows back from the referral_temp table,
>> as they are used for assigning FKs for the fact table later in
>> processing.  When I iterate over the values that I get back, for those
>> with t1.id = null I assign a new FK and push the pair into the
>> d_referral table as a new entry, as well as into a text file for later
>> use.  The matching records are written to a text file for later use.
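(To make that step concrete: the unmatched-row handling described above
boils down to something like the statement below.  The sequence name and
the job_control_number value are placeholders, since the post doesn't say
how new ids or job numbers are generated, and this isn't necessarily any
faster than the existing Java iteration; it's just the same logic in SQL.)

    -- d_referral_id_seq and 12345 are hypothetical placeholders
    INSERT INTO d_referral (id, referral_raw_url, job_control_number)
    SELECT nextval('d_referral_id_seq'), t2.url, 12345
      FROM referral_temp t2
      LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url
     WHERE t1.id IS NULL;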
>> If we cannot improve the join performance, my question becomes: are
>> there better tools outside of PostgreSQL for matching up the 46
>> million strings, growing at the rate of 1 million every 3 days?  We
>> don't want to have to invest in zillions of dollars worth of hardware,
>> but if we have to we will.  I just want to make sure we have all the
>> non-hardware possibilities for improvement covered before we start
>> investing in large disk arrays.
>> Thanks.
>>
>> --sean
>>
>

