Re: Looking for ideas on how to speed up warehouse loading - Mailing list pgsql-performance

From: Sean Shanny
Subject: Re: Looking for ideas on how to speed up warehouse loading
Msg-id: 408855AD.4090500@earthlink.net
In response to: Looking for ideas on how to speed up warehouse loading (Sean Shanny <shannyconsulting@earthlink.net>)
Responses:
  Re: Looking for ideas on how to speed up warehouse loading (Nicholas Shanny <nshanny@tripadvisor.com>)
  Re: Looking for ideas on how to speed up warehouse loading ("scott.marlowe" <scott.marlowe@ihs.com>)
List: pgsql-performance
I should have included this as well:

 show all;
              name              |    setting
--------------------------------+----------------
 add_missing_from               | on
 australian_timezones           | off
 authentication_timeout         | 60
 check_function_bodies          | on
 checkpoint_segments            | 64
 checkpoint_timeout             | 30
 checkpoint_warning             | 30
 client_encoding                | UNICODE
 client_min_messages            | notice
 commit_delay                   | 0
 commit_siblings                | 5
 cpu_index_tuple_cost           | 0.001
 cpu_operator_cost              | 0.0025
 cpu_tuple_cost                 | 0.01
 DateStyle                      | ISO, MDY
 db_user_namespace              | off
 deadlock_timeout               | 1000
 debug_pretty_print             | off
 debug_print_parse              | off
 debug_print_plan               | off
 debug_print_rewritten          | off
 default_statistics_target      | 1000
 default_transaction_isolation  | read committed
 default_transaction_read_only  | off
 dynamic_library_path           | $libdir
 effective_cache_size           | 400000
 enable_hashagg                 | on
 enable_hashjoin                | on
 enable_indexscan               | on
 enable_mergejoin               | on
 enable_nestloop                | on
 enable_seqscan                 | on
 enable_sort                    | on
 enable_tidscan                 | on
 explain_pretty_print           | on
 extra_float_digits             | 0
 from_collapse_limit            | 8
 fsync                          | on
 geqo                           | on
 geqo_effort                    | 1
 geqo_generations               | 0
 geqo_pool_size                 | 0
 geqo_selection_bias            | 2
 geqo_threshold                 | 11
 join_collapse_limit            | 8
 krb_server_keyfile             | unset
 lc_collate                     | C
 lc_ctype                       | C
 lc_messages                    | C
 lc_monetary                    | C
 lc_numeric                     | C
 lc_time                        | C
 log_connections                | off
 log_duration                   | off
 log_error_verbosity            | default
 log_executor_stats             | off
 log_hostname                   | off
 log_min_duration_statement     | -1
 log_min_error_statement        | panic
 log_min_messages               | notice
 log_parser_stats               | off
 log_pid                        | off
 log_planner_stats              | off
 log_source_port                | off
 log_statement                  | off
 log_statement_stats            | off
 log_timestamp                  | on
 max_connections                | 100
 max_expr_depth                 | 10000
 max_files_per_process          | 1000
 max_fsm_pages                  | 20000
 max_fsm_relations              | 1000
 max_locks_per_transaction      | 64
 password_encryption            | on
 port                           | 5432
 pre_auth_delay                 | 0
 preload_libraries              | unset
 random_page_cost               | 4
 regex_flavor                   | advanced
 rendezvous_name                | unset
 search_path                    | $user,public
 server_encoding                | UNICODE
 server_version                 | 7.4.1
 shared_buffers                 | 4000
 silent_mode                    | off
 sort_mem                       | 64000
 sql_inheritance                | on
 ssl                            | off
 statement_timeout              | 0
 stats_block_level              | on
 stats_command_string           | on
 stats_reset_on_server_start    | off
 stats_row_level                | on
 stats_start_collector          | on
 superuser_reserved_connections | 2
 syslog                         | 0
 syslog_facility                | LOCAL0
 syslog_ident                   | postgres
 tcpip_socket                   | on
 TimeZone                       | unknown
 trace_notify                   | off
 transaction_isolation          | read committed
 transaction_read_only          | off
 transform_null_equals          | off
 unix_socket_directory          | unset
 unix_socket_group              | unset
 unix_socket_permissions        | 511
 vacuum_mem                     | 64000
 virtual_host                   | unset
 wal_buffers                    | 1024
 wal_debug                      | 0
 wal_sync_method                | open_sync
 zero_damaged_pages             | off


Sean Shanny wrote:

> To all,
>
> Essentials:  Running 7.4.1 on OS X on a loaded G5 with dual procs and
> 8GB of memory, direct-attached via Fibre Channel to a fully optioned
> 3.5TB XRaid (14 spindles, 2 sets of 7 in RAID 5) box running RAID 50.
>
> Background:  We are loading what are essentially XML-based access logs
> from 20+ web servers daily, about 6GB of raw data.  We have a classic
> star schema.  All the ETL tools are custom Java code or standard *nix
> tools like sort, uniq, etc.
>
> The problem:  We have about 46 million rows in a table with the
> following schema:
>
> Table "public.d_referral"
>       Column       |  Type   | Modifiers
> --------------------+---------+-----------
> id                 | integer | not null
> referral_raw_url   | text    | not null
> job_control_number | integer | not null
> Indexes:
>    "d_referral_pkey" primary key, btree (id)
>    "idx_referral_url" btree (referral_raw_url)
>
> This is one of our dimension tables.  Part of the daily ETL process is
> to match all the new referral URLs against existing data in the
> d_referral table.  Some of the values in referral_raw_url can be 5,000
> characters long :-(.  The average length is 109.57 characters.
>
> I sort and uniq all the incoming referrals and load them into a temp
> table (a rough sketch of the load step follows the table definition
> below).
>
> Table "public.referral_temp"
> Column | Type | Modifiers
> --------+------+-----------
> url    | text | not null
> Indexes:
>    "referral_temp_pkey" primary key, btree (url)
>
> I then do a left join:
>
> SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral
> t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id
>
> This is the output from an EXPLAIN ANALYZE.  (Please note that I do a
> SET enable_indexscan = false prior to issuing this, because it takes
> forever using indexes.)
>
> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
> JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;
>
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
> Sort  (cost=4012064.81..4013194.45 rows=451856 width=115) (actual
> time=1297320.823..1297739.813 rows=476176 loops=1)
>   Sort Key: t1.id
>   ->  Hash Left Join  (cost=1052345.95..3969623.10 rows=451856
> width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
>         Hash Cond: ("outer".url = "inner".referral_raw_url)
>         ->  Seq Scan on referral_temp t2  (cost=0.00..6645.56
> rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
>         ->  Hash  (cost=729338.16..729338.16 rows=46034716 width=124)
> (actual time=1146440.710..1146440.710 rows=0 loops=1)
>               ->  Seq Scan on d_referral t1  (cost=0.00..729338.16
> rows=46034716 width=124) (actual time=14.502..-1064277.123
> rows=46034715 loops=1)
> Total runtime: 1298153.193 ms
> (8 rows)
>
>
>
> What I would like to know is whether there are better ways to do the
> join.  I need to get all the rows back from the referral_temp table,
> as they are used for assigning FKs for the fact table later in
> processing.  When I iterate over the values I get back, for those with
> t1.id = null I assign a new FK and push them into the d_referral table
> as new entries, and also write them out to a text file for later use.
> The matching records are written to a text file for later use.  (A
> sketch of the new-entry step follows.)
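>
> In SQL terms, the new-entry step amounts to something like the
> following (a sketch only: the sequence name and job_control_number
> value are made up, and in practice this happens via iteration in the
> ETL code rather than in a single statement):
>
> INSERT INTO d_referral (id, referral_raw_url, job_control_number)
> SELECT nextval('d_referral_id_seq'), t2.url, 12345
> FROM referral_temp t2
>     LEFT OUTER JOIN d_referral t1 ON t2.url = t1.referral_raw_url
> WHERE t1.id IS NULL;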
>
> If we cannot improve the join performance, my question becomes: are
> there better tools outside of PostgreSQL for matching up these 46
> million strings, growing at a rate of 1 million every 3 days?  We
> don't want to invest zillions of dollars in hardware, but if we have
> to we will.  I just want to make sure we have covered all the
> non-hardware possibilities for improvement before we start investing
> in large disk arrays.
> Thanks.
>
> --sean
>
