Re: performance issue using DBI - Mailing list pgsql-general

From Joshua b. Jore
Subject Re: performance issue using DBI
Date
Msg-id Pine.BSO.4.44.0206060832210.2703-100000@kitten.greentechnologist.org
Whole thread Raw
In response to performance issue using DBI  ("Nicolas Nolst" <nnolst@hotmail.com>)
Responses Re: performance issue using DBI
List pgsql-general
How much data is 20,000 lines? 20KB? 100MB? You might do well to just
process this in memory and then just COPY the right data to the table.
This gets away from doing the SELECT/INSERT/SELECT/INSERT thing which is
going to be painful for batch jobs.

See, the thing is that on every insert the indexes have to be updated.
They won't be used until after you VACUUM the tables so it does no good
inside your transaction. Drop the indexes.

Also, leave off with your use of currval/nextval. The point of a serial
type is that the column increments each time automagically. For what you
are doing you could turn that serial into an integer, create your sequence
separately, exec nextval('yourseq') and then just use the value you
retrieved. Don't use currval since some other process might alter the
sequence between the time you call nextval and currval. Just store the
value.

I think this problem is better solved on the perl side than on the
PostgreSQL side. Consider using data structures like so. You can probably
create a better structure since you know your data and I don't.

sessions
{ remote_ip => { phone_type => { phone_number => session_id,
                                 phone_number => session_id
                               },
                 phone_type => { phone_number => session_id,
                                 phone_number => session_id
                               }
               },
  remote_ip => { phone_type => { phone_number => session_id,
                                 phone_number => session_id
                               },
                 phone_type => { phone_number => session_id,
                                 phone_number => session_id
                               }
               }
}

actions - now session_id is the array offset.
[ [ url, timestamp ],
  [ url, timestamp ],
  [ url, timestamp ] ],
[ [ url, timestamp ],
  [ url, timestamp ],
  [ url, timestamp ] ],
[ [ url, timestamp ],
  [ url, timestamp ],
  [ url, timestamp ] ],


> If the session already exists I add a line in the table actions with a INSERT
>
> If the session doesn't exist or if the criteria is true, I add a line in the
> table sessions with an INSERT and then add  a line with a INSERT in the table actions (I use nextval and currval).
>
> I have put indexes on sessions(session_id), sessions(msisdn),
> actions(session_id) and actions(timestamp). I process one log file of about 20000 lines every day. All the lines are
> processed in one transaction (autocommit set to 0).
>
> My problem is that populating my database is slower  when the data gets bigger
> and the performance falls dramatically. I thought that is would be improve with
> my indexes but the problem still persists.
>
> Could you please give me some clues that could solve this issue.
>
> Thanks.
>
>
>
> Nicolas Nolst
> [belgium_gs.gif]
>
>
______________________________________________________________________________________________________________________________
> MSN Photos is the easiest way to share and print your photos: Click Here
>
>





pgsql-general by date:

Previous
From: "Kasper (swebase)"
Date:
Subject: pg_shadow
Next
From: Stephan Szabo
Date:
Subject: Re: pg_shadow