Thread: Populating huge tables each day

Populating huge tables each day

From
Ben-Nes Yonatan
Date:
Hi All,

First, I apologize for the length of this email; I'm just afraid that my
problem is quite complicated to explain, and also for my English, which
is not my native language... sorry :)

I'm currently building a site for a client that will need to update its
database on a daily routine with about 3.5 million rows from data files
(the files vary in size and can reach up to 1 million rows per file).
That data will be retrieved from several files (text or XML, my choice).
Part of the rows will need to be INSERTed because they don't have any
previous record, and some will need to UPDATE previous records, or just
INSERT the new row and DELETE the previous one.
Besides that, the new data also carries information that tells how to
populate another table (2 fields, and the result is a lot fewer rows)
and connects the two tables by a foreign key which is written in each
row of the main table (the one with the 3.5 million rows).
Now, the site must work 24/7, and it will probably have 2 servers
running PostgreSQL (working as 1 database); the scripting language I'll
use is PHP, if that changes anything.

I thought about how to accomplish this, and I would like to receive
comments and ideas.
I'm mostly afraid of the stress it will put on the server during the
process, and I need the site to display working information the whole time.

A brief explanation of the definitions I'll use below:
1. New data - the data retrieved from the files (text or XML).
2. Main table - the final table, which needs to hold about 3.5 million
rows. It can be empty before the process, or it can already hold
information, some of which needs to be updated using the "New data"; the
rest of the "New data" needs to be inserted into it, while the previous
rows that didn't get updated need to be deleted.
3. Second table - the table holding the information that the rows in the
"Main table" need to be connected to through foreign keys.
4. Temp table - a temporary table which will hold the "New data" until
it is ready to INSERT/UPDATE the data in the "Main table" (it has
exactly the same columns as the "Main table").
5. Temp table2 - a temporary table which is created by CREATE TEMPORARY
TABLE AS (formerly SELECT INTO).
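
To make the rest easier to follow, here is roughly what I mean, with
made-up column names (my real tables have more columns, of course):

-- "Second table": ~40,000 rows built from 2 fields of the new data
CREATE TABLE second_table (
    id      serial PRIMARY KEY,
    field_a text,
    field_b text,
    UNIQUE (field_a, field_b)
);

-- "Main table": ~3.5 million rows, each one pointing at a second_table row
CREATE TABLE main_table (
    product_id      text PRIMARY KEY,
    description     text,
    second_table_id integer REFERENCES second_table (id)
);

-- "Temp table": the target of COPY FROM; it holds the raw values from
-- the files, from which the foreign key is looked up later
CREATE TABLE temp_table (
    product_id  text,
    description text,
    field_a     text,
    field_b     text
);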

My plan:
1. With COPY FROM, I'll insert the data into the "Temp table" (all of
the files in text format).
2. Run a PL/pgSQL function (sketched in SQL below): {

A. Start a transaction.
B. DELETE the contents of the currently existing "Second table".
C. INSERT data into the "Second table" using the "Temp table" - each
row's values will be checked (with a SELECT) to see if they already
exist in the "Second table", and if not, an INSERT will create the row
in the "Second table" - I'm afraid this process will be extremely heavy
on the server.
D. DELETE the "Main table" rows.
E. With CREATE TEMPORARY TABLE AS, I'll create "Temp table2", which will
contain all of the information of the "Temp table" plus a subquery that
retrieves the ID from the "Second table" for the foreign key - quite a
heavy process, I suspect.
F. INSERT the data from "Temp table2" into the "Main table".
G. End the transaction and exit PL/pgSQL. }

3. Delete all the files.
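
In SQL terms, step 2 would look roughly like this, using the placeholder
column names from the sketch above (I moved the main-table DELETE before
the second-table DELETE, because the foreign key would otherwise block
it, and I folded E+F into a single INSERT ... SELECT):

BEGIN;

-- D. empty the "Main table" first, so the foreign key does not block step B
DELETE FROM main_table;

-- B. empty the "Second table"
DELETE FROM second_table;

-- C. rebuild the "Second table" from the distinct values in the "Temp table"
--    (one set-based INSERT instead of a row-by-row existence check)
INSERT INTO second_table (field_a, field_b)
SELECT DISTINCT field_a, field_b
FROM temp_table;

-- E. + F. attach the foreign key and fill the "Main table"
INSERT INTO main_table (product_id, description, second_table_id)
SELECT t.product_id, t.description, s.id
FROM temp_table t
JOIN second_table s ON s.field_a = t.field_a
                   AND s.field_b = t.field_b;

COMMIT;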

Thanks a lot in advance, and again I'm sorry for the length of the mail :)

Ben-Nes Yonatan
Canaan Surfing ltd.
http://www.canaan.net.il


Re: Populating huge tables each day

From
"Dann Corbit"
Date:
I see a lot of problems with this idea.

You mention that the database is supposed to be available 24x7.
While you are loading, the database table receiving data will not be
available.  Therefore, you will have to have one server online (with
only the old data), while the other one is loading.  Once the load and
all procedures are complete, you could switch the active server.

You do not mention your time constraints and the total volume of data.
If the new data must become available at some critical time, this is a
very important requirement that must be spelled out clearly.

You will need some kind of guarantee of relational integrity.  This is
always difficult to achieve when bulk loading from foreign sources.

I think it is important to spell things out more clearly.

How many tables are to be replicated?
What is the total number of expected rows for each table?
How fast are the tables expected to grow?
When must the new data become available online?
Are all of the tables in the database populated from a foreign source or
just some of them?
Do you also have access to the source data in its database format, or
only as text dumps?

Is the goal to serve as a reporting server?  Is the goal to make the
same data as the original server online for end-users and in so doing to
reduce the load on the original server?  What is the real purpose of the
task to be accomplished?


Re: Populating huge tables each day

From
"Jim C. Nasby"
Date:
On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
> I see a lot of problems with this idea.
>
> You mention that the database is supposed to be available 24x7.
> While you are loading, the database table receiving data will not be
> available.  Therefore, you will have to have one server online (with

Why do you think that's the case?
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Populating huge tables each day

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Jim C. Nasby [mailto:decibel@decibel.org]
> Sent: Monday, June 27, 2005 12:58 PM
> To: Dann Corbit
> Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Populating huge tables each day
>
> On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
> > I see a lot of problems with this idea.
> >
> > You mention that the database is supposed to be available 24x7.
> > While you are loading, the database table receiving data will not be
> > available.  Therefore, you will have to have one server online (with
>
> Why do you think that's the case?

He's doing a bulk load.  I assume he will have to truncate the table and
load it with the copy command.

Is there an alternative I do not know of that is equally fast?

Re: Populating huge tables each day

From
"Jim C. Nasby"
Date:
On Mon, Jun 27, 2005 at 01:05:42PM -0700, Dann Corbit wrote:
>
> > -----Original Message-----
> > From: Jim C. Nasby [mailto:decibel@decibel.org]
> > Sent: Monday, June 27, 2005 12:58 PM
> > To: Dann Corbit
> > Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
> > Subject: Re: [GENERAL] Populating huge tables each day
> >
> > On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
> > > I see a lot of problems with this idea.
> > >
> > > You mention that the database is supposed to be available 24x7.
> > > While you are loading, the database table receiving data will not be
> > > available.  Therefore, you will have to have one server online (with
> >
> > Why do you think that's the case?
>
> He's doing a bulk load.  I assume he will have to truncate the table and
> load it with the copy command.

Don't ass-u-me; he said he'd be deleting from the main table, not
truncating.

> Is there an alternative I do not know of that is equally fast?

Nope, truncate is undoubtedly faster. But it also means you would have
downtime as you mentioned. If it were me, I'd probably make the
trade-off of using a delete inside a transaction.
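
I.e., something along these lines (table names are just placeholders):

BEGIN;
DELETE FROM main_table;                           -- readers keep seeing the old rows until COMMIT
INSERT INTO main_table SELECT * FROM temp_table;  -- or COPY straight into main_table
COMMIT;                                           -- old rows vanish and new rows appear atomically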
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Populating huge tables each day

From
"Dann Corbit"
Date:
> -----Original Message-----
> From: Jim C. Nasby [mailto:decibel@decibel.org]
> Sent: Monday, June 27, 2005 6:55 PM
> To: Dann Corbit
> Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Populating huge tables each day
>
> On Mon, Jun 27, 2005 at 01:05:42PM -0700, Dann Corbit wrote:
> >
> > > -----Original Message-----
> > > From: Jim C. Nasby [mailto:decibel@decibel.org]
> > > Sent: Monday, June 27, 2005 12:58 PM
> > > To: Dann Corbit
> > > Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
> > > Subject: Re: [GENERAL] Populating huge tables each day
> > >
> > > On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
> > > > I see a lot of problems with this idea.
> > > >
> > > > You mention that the database is supposed to be available 24x7.
> > > > While you are loading, the database table receiving data will not be
> > > > available.  Therefore, you will have to have one server online (with
> > >
> > > Why do you think that's the case?
> >
> > He's doing a bulk load.  I assume he will have to truncate the table and
> > load it with the copy command.
>
> Don't ass-u-me; he said he'd be deleting from the main table, not
> truncating.
>
> > Is there an alternative I do not know of that is equally fast?
>
> Nope, truncate is undoubtedly faster. But it also means you would have
> downtime as you mentioned. If it were me, I'd probably make the
> trade-off of using a delete inside a transaction.

For every record in a bulk loaded table?

If it were that important that both servers be available all the time, I
would bulk load into a second table with the same shape and then rename
when completed.
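
For example (the names are only illustrative):

-- load into a shadow table with the same shape as the live one
CREATE TABLE main_table_load (LIKE main_table);
COPY main_table_load FROM '/path/to/new_data.txt';
-- build the indexes and constraints on main_table_load here

-- swap the tables; the rename itself is nearly instantaneous
BEGIN;
ALTER TABLE main_table RENAME TO main_table_old;
ALTER TABLE main_table_load RENAME TO main_table;
COMMIT;

DROP TABLE main_table_old;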

Be that as it may, I don't think that there is enough information yet to
give good advice.

Re: Populating huge tables each day

From
"Jim C. Nasby"
Date:
On Tue, Jun 28, 2005 at 10:36:58AM -0700, Dann Corbit wrote:
> > Nope, truncate is undoubtedly faster. But it also means you would have
> > downtime as you mentioned. If it were me, I'd probably make the
> > trade-off of using a delete inside a transaction.
>
> For every record in a bulk loaded table?
Sure. If the data's only being loaded once a day, it probably doesn't
matter if that delete takes 10 minutes.

> If it were that important that both servers be available all the time, I
> would bulk load into a second table with the same shape and then rename
> when completed.
Interesting idea, though the problem is that AFAIK everything will block
on the rename. If everything didn't block though, this might be a better
way to do it, although it potentially complicates the code greatly
(think about needing to add indexes, rebuild RI, etc.)
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Populating huge tables each day

From
Ben-Nes Yonatan
Date:
> On Tue, Jun 28, 2005 at 10:36:58AM -0700, Dann Corbit wrote:
>
>>>Nope, truncate is undoubtedly faster. But it also means you would have
>>>downtime as you mentioned. If it were me, I'd probably make the
>>>trade-off of using a delete inside a transaction.
>>
>>For every record in a bulk loaded table?
>
> Sure. If the data's only being loaded once a day, it probably doesn't
> matter if that delete takes 10 minutes.
>
>
>>If it were that important that both servers be available all the time, I
>>would bulk load into a second table with the same shape and then rename
>>when completed.
>
> Interesting idea, though the problem is that AFAIK everything will block
> on the rename. If everything didn't block though, this might be a better
> way to do it, although it potentially complicates the code greatly
> (think about needing to add indexes, rebuild RI, etc.)

Sorry for the lack of answers; I was away and unable to reply...

I thought about the idea of loading everything into a temporary table
and then renaming it, but indeed, as Jim mentioned, it will also
complicate everything a lot more... though I'm not sure that everything
will really be blocked during that time if it is all done inside a
transaction - or am I wrong here?

Here are some answers to Dann's questions from the earlier mail:
How many tables are to be replicated?
* 2 tables + another one which will get modified a little.

What is the total number of expected rows for each table?
* One table about 3.5 million, and the second is quite hard to
determine, but it's about 40,000. The third one's modification will
probably be about 10 new rows each day at most, so it's really not a
problem (it will require one query over all of the data at the end,
though...).

How fast are the tables expected to grow?
* I'm not sure I understand the question, but if you're asking about the
time it takes the tables to reach their final size, then it's supposed
to take minutes, I guess, since it will probably load everything in one
go, and I want to cut that time as much as I can (I'll run it under
"nice").

When must the new data become available online?
* Right after the changes complete (it's not really a time in seconds or
a specific hour).

Are all of the tables in the database populated from a foreign source or
just some of them?
* Just those 3.

Do you also have access to the source data in its database format, or
only as text dumps?
* Only as text or XML. I prefer text because I want to use COPY, though
any different thoughts will be accepted gladly.

** The goal of the site is to create a portal of products for end-users.


I'm starting to wonder whether I shouldn't load the "New data" into a
"Temp table" at all, but instead load it directly into the "Main table"
and just add a status field which marks it as "in progress"; when
deleting, I'll delete all of the rows whose status is "active", and
afterwards update all of the rows with the status "in progress" to
"active".
With this solution I'll also be able to protect rows that I want to keep
from being deleted, simply by changing their status to something other
than "active" or "in progress".
I'll also save the need to transfer all of the data from the "Temp
table" to the "Main table".

Two points bother me here...
1. By loading the data into the "Main table" before deleting its
contents, I create a table with 7 million rows, which will stress the
system more for every query I run on it (like the one I need for
deleting the old rows).
2. I'm not sure whether I'll be able to restart the counting of the
index when there is already data in the table (after all, with this
method there won't be any period of time without data in the "Main
table").

I guess that what I really want to know is how much this whole process
will stress the server... and what I can do to let the server work on it
in a way that won't disturb the rest of the processes.

Thanks a lot again,
Ben-Nes Yonatan
Canaan Surfing ltd.
http://www.canaan.net.il


Trigger help

From
David Pratt
Date:
Basically, I want this trigger to fire after a language record is added
to my languages table.

CREATE TRIGGER language_add_trig AFTER INSERT ON languages
    FOR EACH ROW EXECUTE PROCEDURE trigger_language_add();

Here is my function, but it is not working. I want to loop over each
record in my multi_language table, grab the English text from each
record, and append an array to the multidimensional array in the
lang_code_and_text field of the same record. I am new to triggers, so
help with debugging or correcting my syntax would be appreciated.

Thanks,
David

CREATE FUNCTION trigger_language_add() RETURNS opaque AS '
    DECLARE

    r record;                     -- record
    en_key int;                   -- holds english key value
    default_text text;            -- holds english text value
    iso_en text := ''en'';        -- holds value of en


    BEGIN
        -- Sets english language key
        SELECT INTO en_key
        ml_key
        FROM languages
        WHERE iso_id = iso_en;

        FOR r in SELECT * from multi_language LOOP

            -- Sets default_text to english text value for record
            SELECT INTO default_text
            lang_code_and_text[en_key][2]
            FROM multi_language
            WHERE id = r.id;

            RAISE NOTICE ''Current record id is %.'', r.id;
            RAISE NOTICE ''Default english text is %.'', default_text;

            -- Appends new language array to existing multidimensional array
            -- New language array of form {''fr'',''Default english text here''}

            UPDATE multi_language
            SET lang_code_and_text = r.lang_code_and_text || ARRAY[new.iso_id,
default_text]
            WHERE id = r.id;

        END LOOP;
    END;
' LANGUAGE 'plpgsql';

Re: Trigger help Solved

From
David Pratt
Date:
I figured out my trigger trouble:

>             SET lang_code_and_text = r.lang_code_and_text || ARRAY[new.iso_id,
> default_text]

The above was not casting ARRAY[] as text[], so it would not concatenate
with the existing array - so I had to set a variable to cast the type and
then concatenate that to the original, and I also had to return NULL
since this is an AFTER trigger.
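
For anyone hitting the same thing, the working version of the function
ended up looking roughly like this (same tables and columns as before):

CREATE FUNCTION trigger_language_add() RETURNS opaque AS '
    DECLARE
        r record;                     -- current multi_language record
        en_key int;                   -- holds english key value
        default_text text;            -- holds english text value
        iso_en text := ''en'';        -- holds value of en
        new_entry text[];             -- new language pair, cast to text[]
    BEGIN
        -- Sets english language key
        SELECT INTO en_key ml_key
        FROM languages
        WHERE iso_id = iso_en;

        FOR r IN SELECT * FROM multi_language LOOP

            -- Sets default_text to english text value for record
            SELECT INTO default_text lang_code_and_text[en_key][2]
            FROM multi_language
            WHERE id = r.id;

            -- Cast explicitly so || concatenates two text arrays
            new_entry := ARRAY[new.iso_id, default_text]::text[];

            UPDATE multi_language
            SET lang_code_and_text = r.lang_code_and_text || new_entry
            WHERE id = r.id;

        END LOOP;

        -- AFTER triggers must still return something; NULL is fine here
        RETURN NULL;
    END;
' LANGUAGE 'plpgsql';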

Regards,
David