Thread: Looking for ideas on how to speed up warehouse loading

Looking for ideas on how to speed up warehouse loading

From: Sean Shanny
To all,

Essentials:  Running PostgreSQL 7.4.1 on OS X on a loaded G5 with dual
processors, 8GB memory, direct-attached via Fibre Channel to a fully
optioned 3.5TB XRaid (14 spindles, 2 sets of 7 in RAID 5) box running
RAID 50.

Background:  We are loading what are essentially XML-based access logs
from 20+ webservers daily, about 6GB of raw data.  We have a classic
star schema.  All the ETL tools are custom Java code or standard *nix
tools like sort, uniq, etc.

The problem:  We have about 46 million rows in a table with the
following schema:

Table "public.d_referral"
       Column       |  Type   | Modifiers
--------------------+---------+-----------
 id                 | integer | not null
 referral_raw_url   | text    | not null
 job_control_number | integer | not null
Indexes:
    "d_referral_pkey" primary key, btree (id)
    "idx_referral_url" btree (referral_raw_url)

This is one of our dimension tables.  Part of the daily ETL process is
to match all the new referral URLs against existing data in the
d_referral table.  Some of the values in referral_raw_url can be 5000
characters long :-( ; the average length is 109.57 characters.

I sort and uniq all the incoming referrals and load them into a temp
table (a sketch of this step follows the table definition).

Table "public.referral_temp"
 Column | Type | Modifiers
--------+------+-----------
 url    | text | not null
Indexes:
    "referral_temp_pkey" primary key, btree (url)

I then do a left join

SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1
ON t2.url = t1.referral_raw_url ORDER BY t1.id

This is the output from an explain analyze (please note that I set
enable_indexscan = false prior to issuing this, because it takes
forever using indexes):

explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;

QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=4012064.81..4013194.45 rows=451856 width=115) (actual
time=1297320.823..1297739.813 rows=476176 loops=1)
   Sort Key: t1.id
   ->  Hash Left Join  (cost=1052345.95..3969623.10 rows=451856
width=115) (actual time=1146650.487..1290230.590 rows=476176 loops=1)
         Hash Cond: ("outer".url = "inner".referral_raw_url)
         ->  Seq Scan on referral_temp t2  (cost=0.00..6645.56
rows=451856 width=111) (actual time=20.285..1449.634 rows=476176 loops=1)
         ->  Hash  (cost=729338.16..729338.16 rows=46034716 width=124)
(actual time=1146440.710..1146440.710 rows=0 loops=1)
               ->  Seq Scan on d_referral t1  (cost=0.00..729338.16
rows=46034716 width=124) (actual time=14.502..-1064277.123 rows=46034715
loops=1)
 Total runtime: 1298153.193 ms
(8 rows)



What I would like to know is whether there are better ways to do the
join.  I need all the rows back from the referral_temp table, as they
are used for assigning FKs for the fact table later in processing.
When I iterate over the values that come back, those with t1.id = null
get a new FK and are pushed into the d_referral table as new entries,
as well as into a text file for later use.  The matching records are
written to a text file for later use.
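
For reference, a hedged sketch of folding the new-entry step into SQL
instead of the Java loop; the sequence name d_referral_id_seq and the
job number literal are assumptions, not the actual ETL objects:

-- give every unmatched URL a fresh id straight from a sequence
INSERT INTO d_referral (id, referral_raw_url, job_control_number)
SELECT nextval('d_referral_id_seq'), t2.url, 12345  -- 12345: current job
FROM referral_temp t2
WHERE NOT EXISTS
    (SELECT 1 FROM d_referral t1 WHERE t1.referral_raw_url = t2.url);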

If we cannot improve the join performance, the question becomes: are
there better tools outside of PostgreSQL for matching up 46 million
strings, growing at the rate of 1 million every 3 days?  We don't want
to invest zillions of dollars in hardware, but we will if we have to.
I just want to make sure we have covered all the non-hardware
possibilities for improvement before we start investing in large disk
arrays.

Thanks.

--sean

Re: Looking for ideas on how to speed up warehouse loading

From: Sean Shanny
I should have included this as well:

 show all;
              name              |    setting
--------------------------------+----------------
 add_missing_from               | on
 australian_timezones           | off
 authentication_timeout         | 60
 check_function_bodies          | on
 checkpoint_segments            | 64
 checkpoint_timeout             | 30
 checkpoint_warning             | 30
 client_encoding                | UNICODE
 client_min_messages            | notice
 commit_delay                   | 0
 commit_siblings                | 5
 cpu_index_tuple_cost           | 0.001
 cpu_operator_cost              | 0.0025
 cpu_tuple_cost                 | 0.01
 DateStyle                      | ISO, MDY
 db_user_namespace              | off
 deadlock_timeout               | 1000
 debug_pretty_print             | off
 debug_print_parse              | off
 debug_print_plan               | off
 debug_print_rewritten          | off
 default_statistics_target      | 1000
 default_transaction_isolation  | read committed
 default_transaction_read_only  | off
 dynamic_library_path           | $libdir
 effective_cache_size           | 400000
 enable_hashagg                 | on
 enable_hashjoin                | on
 enable_indexscan               | on
 enable_mergejoin               | on
 enable_nestloop                | on
 enable_seqscan                 | on
 enable_sort                    | on
 enable_tidscan                 | on
 explain_pretty_print           | on
 extra_float_digits             | 0
 from_collapse_limit            | 8
 fsync                          | on
 geqo                           | on
 geqo_effort                    | 1
 geqo_generations               | 0
 geqo_pool_size                 | 0
 geqo_selection_bias            | 2
 geqo_threshold                 | 11
 join_collapse_limit            | 8
 krb_server_keyfile             | unset
 lc_collate                     | C
 lc_ctype                       | C
 lc_messages                    | C
 lc_monetary                    | C
 lc_numeric                     | C
 lc_time                        | C
 log_connections                | off
 log_duration                   | off
 log_error_verbosity            | default
 log_executor_stats             | off
 log_hostname                   | off
 log_min_duration_statement     | -1
 log_min_error_statement        | panic
 log_min_messages               | notice
 log_parser_stats               | off
 log_pid                        | off
 log_planner_stats              | off
 log_source_port                | off
 log_statement                  | off
 log_statement_stats            | off
 log_timestamp                  | on
 max_connections                | 100
 max_expr_depth                 | 10000
 max_files_per_process          | 1000
 max_fsm_pages                  | 20000
 max_fsm_relations              | 1000
 max_locks_per_transaction      | 64
 password_encryption            | on
 port                           | 5432
 pre_auth_delay                 | 0
 preload_libraries              | unset
 random_page_cost               | 4
 regex_flavor                   | advanced
 rendezvous_name                | unset
 search_path                    | $user,public
 server_encoding                | UNICODE
 server_version                 | 7.4.1
 shared_buffers                 | 4000
 silent_mode                    | off
 sort_mem                       | 64000
 sql_inheritance                | on
 ssl                            | off
 statement_timeout              | 0
 stats_block_level              | on
 stats_command_string           | on
 stats_reset_on_server_start    | off
 stats_row_level                | on
 stats_start_collector          | on
 superuser_reserved_connections | 2
 syslog                         | 0
 syslog_facility                | LOCAL0
 syslog_ident                   | postgres
 tcpip_socket                   | on
 TimeZone                       | unknown
 trace_notify                   | off
 transaction_isolation          | read committed
 transaction_read_only          | off
 transform_null_equals          | off
 unix_socket_directory          | unset
 unix_socket_group              | unset
 unix_socket_permissions        | 511
 vacuum_mem                     | 64000
 virtual_host                   | unset
 wal_buffers                    | 1024
 wal_debug                      | 0
 wal_sync_method                | open_sync
 zero_damaged_pages             | off



Re: Looking for ideas on how to speed up warehouse loading

From: Nicholas Shanny
One other thing: we are running with a block size of 32K.


Nick Shanny
(Brother of above person)

On Apr 22, 2004, at 7:30 PM, Sean Shanny wrote:

> I should have included this as well:


Re: Looking for ideas on how to speed up warehouse loading

From: Tom Lane
Sean Shanny <shannyconsulting@earthlink.net> writes:
> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
> JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;

> What I would like to know is if there are better ways to do the join?

What have you got sort_mem set to?  You might try increasing it to a gig
or so, since you seem to have plenty of RAM in that box ...

            regards, tom lane
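
A minimal sketch of that suggestion; sort_mem is in kB on 7.4, so the
value below is roughly 1GB, session-scoped and illustrative only:

SET sort_mem = 1048576;  -- ~1GB per sort/hash operation, this session
-- ... run the big join ...
RESET sort_mem;          -- back to the postgresql.conf value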

Re: Looking for ideas on how to speed up warehouse loading

From: Joe Conway
Sean Shanny wrote:
> explain analyze SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER
> JOIN d_referral t1 ON t2.url = t1.referral_raw_url ORDER BY t1.id;

> What I would like to know is if there are better ways to do the join?  I
> need to get all the rows back from the referral_temp table as they are
> used for assigning FK's for the fact table later in processing.  When I
> iterate over the values that I get back those with t1.id = null I assign
> a new FK and push both into the d_referral table as new entries as well
> as a text file for later use.  The matching records are written to a
> text file for later use.

Would something like this work any better (without disabling index scans):

SELECT t1.id, t2.url
FROM referral_temp t2, d_referral t1
WHERE t1.referral_raw_url = t2.url;

<process rows with a match>

SELECT NULL::integer AS id, t2.url
FROM referral_temp t2
WHERE NOT EXISTS
(select 1 FROM d_referral t1 WHERE t1.referral_raw_url = t2.url);

<process rows without a match>

?

Joe

Re: Looking for ideas on how to speed up warehouse loading

From: Aaron Werman
By definition, the original outer-join query is equivalent to:

SELECT t1.id, t2.url FROM referral_temp t2 LEFT /*OUTER*/ JOIN d_referral t1
ON t2.url = t1.referral_raw_url
union all
SELECT null, url FROM referral_temp WHERE url is null
ORDER BY 1;



/Aaron


Re: Looking for ideas on how to speed up warehouse loading

From: scott.marlowe
On Thu, 22 Apr 2004, Sean Shanny wrote:

> I should have included this as well:
>  fsync                          | on
>  shared_buffers                 | 4000
>  sort_mem                       | 64000

For purposes of loading only, you can try turning off fsync, assuming
this is a virgin load and you can just re-initdb should bad things
happen (OS or PostgreSQL crash, power plug pulled, etc.).

Also, increasing sort_mem and shared_buffers might help, especially
sort_mem.  But turn it back down to something reasonable after the
import.

And turn fsync back on after the import too.  Note you have to restart
PostgreSQL for fsync = off to take effect.
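
A sketch of those changes as temporary postgresql.conf entries; the
values are illustrative rather than recommendations, fsync requires a
server restart to change, and everything should be restored once the
import finishes:

# bulk-load window only: data is at risk if the machine crashes
fsync = false
sort_mem = 256000        # kB per sort/hash operation
shared_buffers = 16000   # in pages (32kB each with this build's block size)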


Re: Looking for ideas on how to speed up warehouse loading

From: CoL
hi,

Sean Shanny wrote, On 4/22/2004 23:56:
>
> SELECT t1.id, t2.url FROM referral_temp t2 LEFT OUTER JOIN d_referral t1
> ON t2.url = t1.referral_raw_url ORDER BY t1.id

An index on url (text) makes little sense at these lengths. Try adding
an md5 char(32) column that contains the MD5 hash of the url field, and
join on that instead; an index on the char(32) field will be much more
compact.
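
A sketch of that idea; the column and index names are made up, and
md5() is built in as of 7.4 (hash collisions are astronomically
unlikely, but strictly speaking possible):

-- add a fixed-width hash key to both tables and index the big one
ALTER TABLE d_referral ADD COLUMN url_md5 char(32);
UPDATE d_referral SET url_md5 = md5(referral_raw_url);
CREATE INDEX idx_referral_md5 ON d_referral (url_md5);

ALTER TABLE referral_temp ADD COLUMN url_md5 char(32);
UPDATE referral_temp SET url_md5 = md5(url);

-- join on the compact 32-character hash instead of the raw text
SELECT t1.id, t2.url
FROM referral_temp t2
LEFT OUTER JOIN d_referral t1 ON t2.url_md5 = t1.url_md5;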

Do not forget to analyze the tables after the data load, and you can
fine-tune your postgresql.conf (default_statistics_target for better
planner statistics, among other settings).
Check these info pages:
http://www.varlena.com/varlena/GeneralBits/Tidbits/annotated_conf_e.html
http://www.varlena.com/varlena/GeneralBits/Tidbits/perf.html

C.