Thread: Populating huge tables each day
Hi All,

First, I apologize for the length of this email — I'm afraid my problem needs quite a bit of explanation — and also for my English, which is not my native language... sorry :)

I'm currently building a site for a client which will need to update its database on a daily routine with about 3.5 million rows from data files (the files vary in size and can reach up to 1 million rows per file). That data will be retrieved from several files (Text or XML, my choice). Part of the rows will need to be INSERTed because they have no previous record, and some will need to UPDATE previous records, or just be INSERTed while the previous records are DELETEd.
Besides that, the new data also carries information that tells how to populate another table (2 fields, and the result is a lot fewer rows) and how to connect the two tables by a foreign key, which is written in each row of the main table (the one with the 3.5 million rows).
Now, the site must work 24/7 and it will probably have 2 servers which will run PostgreSQL (working as 1 database); the scripting language I'll use is PHP, if that changes anything.

I thought about how to accomplish this and I would like to receive comments and ideas. I'm mostly afraid of the stress it will put on the server during the process, and I need the site to display working information all the time.

A brief explanation of the definitions I'll use ahead:
1. New data - The data retrieved from the files (Text or XML).
2. Main table - The final table which needs to hold about 3.5 million rows. It can be empty before the process, or it can already hold information, some of which needs to get updated using the "New data"; the rest of the "New data" needs to get inserted into it, while the previous data which didn't get updated needs to get deleted.
3. Second table - The table which holds information that the data in the "Main table" needs to be connected to using foreign keys.
4. Temp table - A temporary table which will hold the "New data" until it is ready to INSERT into / UPDATE the data in the "Main table" (it has exactly the same columns as the "Main table").
5. Temp table2 - A temporary table which is created by CREATE TEMPORARY TABLE AS (formerly SELECT INTO).

My plan:
1. With COPY FROM I'll insert the data into the "Temp table" (all of the files in Text format).
2. Run a PL/pgSQL function: {
   A. Start transaction.
   B. DELETE the content of the currently existing "Second table".
   C. INSERT data into the "Second table" using the "Temp table" - each row will be checked by its values (SELECT) to see whether it already exists in the "Second table", and if not, an INSERT will create it in the "Second table" - I'm afraid that this process will be extremely heavy on the server.
   D. DELETE the "Main table" rows.
   E. With CREATE TEMPORARY TABLE AS I'll create "Temp table2", which will contain all of the information of the "Temp table" + a subquery that retrieves the ID from the "Second table" for the foreign key - quite a heavy process, I suspect.
   F. INSERT the data from "Temp table2" into the "Main table".
   G. End transaction + exit the PL/pgSQL function. }
3. Delete all the files. (See the SQL sketch after this message.)

Thanks a lot in advance, and again I'm sorry for the length of the mail :)

Ben-Nes Yonatan
Canaan Surfing ltd.
http://www.canaan.net.il
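A rough SQL sketch of the plan above, under assumed table and column names (main_table, second_table, temp_table, item_id, category_name, description - none of these come from the original message). The main table is cleared before the second table so that an existing foreign key does not block the delete, and step C is written as one set-oriented INSERT ... SELECT DISTINCT instead of a per-row SELECT + INSERT:

    -- Step 1: bulk load the new data into the staging table.
    COPY temp_table (item_id, category_name, description) FROM '/path/to/datafile.txt';

    -- Step 2: the body of the PL/pgSQL routine, written here as plain SQL.
    BEGIN;

    -- Clear the main table first so its foreign key does not block
    -- deleting from the second table (reverses steps B and D above).
    DELETE FROM main_table;
    DELETE FROM second_table;

    -- Step C: rebuild the second table from the distinct values in the new data.
    INSERT INTO second_table (category_name)
    SELECT DISTINCT category_name FROM temp_table;

    -- Step E: build "Temp table2" with the foreign key resolved by a join.
    CREATE TEMPORARY TABLE temp_table2 AS
    SELECT t.item_id, t.description, s.id AS category_id
    FROM temp_table t
    JOIN second_table s ON s.category_name = t.category_name;

    -- Step F: repopulate the main table.
    INSERT INTO main_table (item_id, description, category_id)
    SELECT item_id, description, category_id FROM temp_table2;

    COMMIT;

    -- Step 3: remove the input files outside the database.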
I see a lot of problems with this idea.

You mention that the database is supposed to be available 24x7. While
you are loading, the database table receiving data will not be
available. Therefore, you will have to have one server online (with
only the old data) while the other one is loading. Once the load and
all procedures are complete, you could switch the active server.

You do not mention your time constraints and the total volume of data.
If the new data must become available at some critical time, this is a
very important requirement that must be spelled out clearly.

You will need some kind of guarantee of relational integrity. This is
always difficult to achieve when bulk loading from foreign sources.

I think it is important to spell things out more clearly:

How many tables are to be replicated?
What is the total number of expected rows for each table?
How fast are the tables expected to grow?
When must the new data become available online?
Are all of the tables in the database populated from a foreign source, or just some of them?
Do you also have access to the source data in its database format, or only as text dumps?
Is the goal to serve as a reporting server?
Is the goal to make the same data as on the original server available online for end-users, and in so doing reduce the load on the original server?
What is the real purpose of the task to be accomplished?
On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
> I see a lot of problems with this idea.
>
> You mention that the database is supposed to be available 24x7.
> While you are loading, the database table receiving data will not be
> available. Therefore, you will have to have one server online (with

Why do you think that's the case?
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
> -----Original Message-----
> From: Jim C. Nasby [mailto:decibel@decibel.org]
> Sent: Monday, June 27, 2005 12:58 PM
> To: Dann Corbit
> Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Populating huge tables each day
>
> On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
> > I see a lot of problems with this idea.
> >
> > You mention that the database is supposed to be available 24x7.
> > While you are loading, the database table receiving data will not be
> > available. Therefore, you will have to have one server online (with
>
> Why do you think that's the case?

He's doing a bulk load. I assume he will have to truncate the table and
load it with the COPY command.

Is there an alternative I do not know of that is equally fast?
On Mon, Jun 27, 2005 at 01:05:42PM -0700, Dann Corbit wrote:
> > -----Original Message-----
> > From: Jim C. Nasby [mailto:decibel@decibel.org]
> > Sent: Monday, June 27, 2005 12:58 PM
> > To: Dann Corbit
> > Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
> > Subject: Re: [GENERAL] Populating huge tables each day
> >
> > On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
> > > I see a lot of problems with this idea.
> > >
> > > You mention that the database is supposed to be available 24x7.
> > > While you are loading, the database table receiving data will not be
> > > available. Therefore, you will have to have one server online (with
> >
> > Why do you think that's the case?
>
> He's doing a bulk load. I assume he will have to truncate the table and
> load it with the copy command.

Don't ass-u-me; he said he'd be deleting from the main table, not
truncating.

> Is there an alternative I do not know of that is equally fast?

Nope, truncate is undoubtedly faster. But it also means you would have
downtime as you mentioned. If it were me, I'd probably make the
trade-off of using a delete inside a transaction.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
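For reference, a minimal sketch of the delete-inside-a-transaction trade-off being discussed (table names are the illustrative ones from the sketch earlier in the thread): concurrent readers keep seeing the old rows until COMMIT thanks to MVCC, whereas TRUNCATE would be much faster but takes an ACCESS EXCLUSIVE lock on the table and blocks other access until the load finishes.

    BEGIN;
    -- Slower than TRUNCATE, but readers outside this transaction still see
    -- the old rows until COMMIT, so the site stays usable during the load.
    DELETE FROM main_table;
    INSERT INTO main_table (item_id, description, category_id)
    SELECT item_id, description, category_id FROM temp_table2;
    COMMIT;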
> -----Original Message-----
> From: Jim C. Nasby [mailto:decibel@decibel.org]
> Sent: Monday, June 27, 2005 6:55 PM
> To: Dann Corbit
> Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Populating huge tables each day
>
> On Mon, Jun 27, 2005 at 01:05:42PM -0700, Dann Corbit wrote:
> > > -----Original Message-----
> > > From: Jim C. Nasby [mailto:decibel@decibel.org]
> > > Sent: Monday, June 27, 2005 12:58 PM
> > > To: Dann Corbit
> > > Cc: Ben-Nes Yonatan; pgsql-general@postgresql.org
> > > Subject: Re: [GENERAL] Populating huge tables each day
> > >
> > > On Mon, Jun 27, 2005 at 12:43:57PM -0700, Dann Corbit wrote:
> > > > I see a lot of problems with this idea.
> > > >
> > > > You mention that the database is supposed to be available 24x7.
> > > > While you are loading, the database table receiving data will not be
> > > > available. Therefore, you will have to have one server online (with
> > >
> > > Why do you think that's the case?
> >
> > He's doing a bulk load. I assume he will have to truncate the table and
> > load it with the copy command.
>
> Don't ass-u-me; he said he'd be deleting from the main table, not
> truncating.
>
> > Is there an alternative I do not know of that is equally fast?
>
> Nope, truncate is undoubtedly faster. But it also means you would have
> downtime as you mentioned. If it were me, I'd probably make the
> trade-off of using a delete inside a transaction.

For every record in a bulk loaded table?

If it were that important that both servers be available all the time, I
would bulk load into a second table with the same shape and then rename
when completed.

Be that as it may, I don't think that there is enough information yet to
give good advice.
On Tue, Jun 28, 2005 at 10:36:58AM -0700, Dann Corbit wrote:
> > Nope, truncate is undoubtedly faster. But it also means you would have
> > downtime as you mentioned. If it were me, I'd probably make the
> > trade-off of using a delete inside a transaction.
>
> For every record in a bulk loaded table?

Sure. If the data's only being loaded once a day, it probably doesn't
matter if that delete takes 10 minutes.

> If it were that important that both servers be available all the time, I
> would bulk load into a second table with the same shape and then rename
> when completed.

Interesting idea, though the problem is that AFAIK everything will block
on the rename. If everything didn't block though, this might be a better
way to do it, although it potentially complicates the code greatly
(think about needing to add indexes, rebuild RI, etc.)
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"
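A minimal sketch of the load-into-a-second-table-and-rename idea (table and index names are assumptions). The renames need an ACCESS EXCLUSIVE lock, so concurrent queries do block, but only for the instant of the swap; the price is that indexes, constraints, and any referencing foreign keys have to be recreated on the new table before it goes live.

    -- Build the replacement table offline while the live table keeps serving queries.
    CREATE TABLE main_table_new (LIKE main_table INCLUDING DEFAULTS);
    COPY main_table_new FROM '/path/to/datafile.txt';

    -- Recreate indexes (and constraints, foreign keys, etc.) on the new table.
    CREATE INDEX main_table_new_item_idx ON main_table_new (item_id);

    -- Swap the tables; the renames hold an exclusive lock only briefly.
    BEGIN;
    ALTER TABLE main_table RENAME TO main_table_old;
    ALTER TABLE main_table_new RENAME TO main_table;
    COMMIT;

    DROP TABLE main_table_old;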
> On Tue, Jun 28, 2005 at 10:36:58AM -0700, Dann Corbit wrote:
>
>>> Nope, truncate is undoubtedly faster. But it also means you would have
>>> downtime as you mentioned. If it were me, I'd probably make the
>>> trade-off of using a delete inside a transaction.
>>
>> For every record in a bulk loaded table?
>
> Sure. If the data's only being loaded once a day, it probably doesn't
> matter if that delete takes 10 minutes.
>
>> If it were that important that both servers be available all the time, I
>> would bulk load into a second table with the same shape and then rename
>> when completed.
>
> Interesting idea, though the problem is that AFAIK everything will block
> on the rename. If everything didn't block though, this might be a better
> way to do it, although it potentially complicates the code greatly
> (think about needing to add indexes, rebuild RI, etc.)

Sorry for the lack of answers, I was away and unable to reply...

I thought about the idea of loading everything into a temporary table and then renaming it, but indeed, as Jim mentioned, it will also complicate everything a lot more... Though I'm not sure that everything will really be blocked during that time if it is all done inside a transaction - or am I wrong here?

Here are some answers to Dann's questions from the earlier mail:

How many tables are to be replicated?
* 2 tables + another one which will get modified a little.

What is the total number of expected rows for each table?
* One table about 3.5 million; the second is quite hard to determine, but it's about 40,000. The third one's modification will probably be about 10 new rows each day maximum, so it's really not a problem (it will require one query over all of the data at the end, though...).

How fast are the tables expected to grow?
* I'm not sure I understand the question, but if you are asking about the time it takes the tables to reach their final size, it's supposed to take minutes, I guess, because it will probably load everything at once, and I want to cut that time as much as I can (I'll run it under "nice").

When must the new data become available online?
* Right after the changes complete (it's not really a time in seconds or a specific hour).

Are all of the tables in the database populated from a foreign source or just some of them?
* Just those 3.

Do you also have access to the source data in its database format, or only as text dumps?
* Only as text or XML. I prefer text because I want to use COPY, though any different thoughts will be accepted gladly.

** The goal of the site is to create a portal of products for end-users.

I'm starting to wonder whether I shouldn't skip uploading the "New data" to a "Temp table" and instead upload it directly to the "Main table", just adding a status field which says it is "in progress"; when deleting, I'll delete all of the rows whose status is "active", and afterwards update all of the rows with the status "in progress" to "active" (see the sketch after this message).
With this solution I'll also be able to protect rows that I want to keep from deletion by just changing their status to something other than "active" or "in progress". I'll also save the need to transfer all of the data from the "Temp table" to the "Main table".

Two points bother me here...
1. By uploading the data to the "Main table" before deleting its content, I create a table with 7 million rows, which will stress the system more for every query I run on it (like the one I need for the deletion of the old rows).
2. I'm not sure whether I'll be able to restart the counting of the indexing when there is already data in the table (after all, with this method there won't be any period of time without data in the "Main table").

I guess that what I really want to know is how much all of this process will stress the server... and what I can do to let the server work on it in a way that won't disturb the rest of the processes.

Thanks a lot again,

Ben-Nes Yonatan
Canaan Surfing ltd.
http://www.canaan.net.il
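A minimal sketch of the status-column idea described in the message above (the status column, its values, and the other names are illustrative assumptions, not from the original schema):

    -- Assumed one-time schema change:
    --   ALTER TABLE main_table ADD COLUMN status text NOT NULL DEFAULT 'active';

    -- 1. Load the new rows directly into the main table, marked "in progress"
    --    by temporarily switching the column default (COPY fills omitted
    --    columns from their defaults).
    ALTER TABLE main_table ALTER COLUMN status SET DEFAULT 'in progress';
    COPY main_table (item_id, description, category_id) FROM '/path/to/datafile.txt';
    ALTER TABLE main_table ALTER COLUMN status SET DEFAULT 'active';

    -- 2. In one transaction, drop the old generation and promote the new one.
    BEGIN;
    DELETE FROM main_table WHERE status = 'active';
    UPDATE main_table SET status = 'active' WHERE status = 'in progress';
    COMMIT;

An index on the status column would help the DELETE and UPDATE steps, though both still have to touch roughly 7 million rows while the two generations coexist.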
Basically I want this trigger to work after a language record in my languages table is added.

CREATE TRIGGER language_add_trig
    AFTER INSERT ON languages
    FOR EACH ROW EXECUTE PROCEDURE trigger_language_add();

Here is my function, but it is not working. I want to loop over each record in my multi_language table, grab the English text from each record, and append an array to the multidimensional array in the lang_code_and_text field of the same record. I am new to triggers, so help debugging or correcting my syntax is appreciated.

Thanks,
David

CREATE FUNCTION trigger_language_add() RETURNS opaque AS '
DECLARE
    r record;               -- record
    en_key int;             -- holds english key value
    default_text text;      -- holds english text value
    iso_en text := ''en'';  -- holds value of en
BEGIN
    -- Sets english language key
    SELECT INTO en_key ml_key FROM languages WHERE iso_id = iso_en;

    FOR r IN SELECT * FROM multi_language LOOP
        -- Sets default_text to english text value for record
        SELECT INTO default_text lang_code_and_text[en_key][2]
            FROM multi_language WHERE id = r.id;

        RAISE NOTICE ''Current record id is %.'', r.id;
        RAISE NOTICE ''Default english text is %.'', default_text;

        -- Appends new language array to existing multidimensional array
        -- New language array of form {''fr'',''Default english text here''}
        UPDATE multi_language
            SET lang_code_and_text = r.lang_code_and_text || ARRAY[new.iso_id, default_text]
            WHERE id = r.id;
    END LOOP;
END;
' LANGUAGE 'plpgsql';
I figured out my trigger trouble:

> SET lang_code_and_text = r.lang_code_and_text || ARRAY[new.iso_id,
> default_text]

The above was not casting ARRAY[] as text[], so it would not concatenate with the existing array. I had to set a variable to cast the type and then concatenate it to the original, and I also had to return NULL since this is an AFTER trigger.

Regards,
David
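For completeness, a sketch of how the corrected function might look based on that description (the helper variable new_entry is an assumption; everything else follows the function posted earlier in the thread):

    CREATE FUNCTION trigger_language_add() RETURNS opaque AS '
    DECLARE
        r record;
        en_key int;
        default_text text;
        new_entry text[];        -- assumed name for the casting variable
        iso_en text := ''en'';
    BEGIN
        SELECT INTO en_key ml_key FROM languages WHERE iso_id = iso_en;

        FOR r IN SELECT * FROM multi_language LOOP
            SELECT INTO default_text lang_code_and_text[en_key][2]
                FROM multi_language WHERE id = r.id;

            -- Cast the new pair to text[] so it concatenates with the existing array.
            new_entry := ARRAY[new.iso_id, default_text]::text[];

            UPDATE multi_language
                SET lang_code_and_text = r.lang_code_and_text || new_entry
                WHERE id = r.id;
        END LOOP;

        -- An AFTER trigger must still return; NULL is fine since the result is ignored.
        RETURN NULL;
    END;
    ' LANGUAGE 'plpgsql';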