Re: Populating huge tables each day - Mailing list pgsql-general

From Ben-Nes Yonatan
Subject Re: Populating huge tables each day
Date
Msg-id 42C7C956.2080301@canaan.co.il
Whole thread Raw
In response to Re: Populating huge tables each day  ("Jim C. Nasby" <decibel@decibel.org>)
Responses Trigger help
List pgsql-general
> On Tue, Jun 28, 2005 at 10:36:58AM -0700, Dann Corbit wrote:
>
>>>Nope, truncate is undoubtedly faster. But it also means you would have
>>>downtime as you mentioned. If it were me, I'd probably make the
>>>trade-off of using a delete inside a transaction.
>>
>>For every record in a bulk loaded table?
>
> Sure. If the data's only being loaded once a day, it probably doesn't
> matter if that delete takes 10 minutes.
>
>
>>If it were that important that both servers be available all the time, I
>>would bulk load into a second table with the same shape and then rename
>>when completed.
>
> Interesting idea, though the problem is that AFAIK everything will block
> on the rename. If everything didn't block though, this might be a better
> way to do it, although it potentially complicates the code greatly
> (think about needing to add indexes, rebuild RI, etc.)

Sorry for the lack of answers i was away and unable to answer...

I thought about the idea of loading everything into a temporary table
and then renaming it but indeed as Jim mentioned it will also complicate
everything alot more... but im not sure that indeed everything will be
blocked during that time if it will all be under a transaction or am i
wrong here?

Here are some answer to Dann questions from earlier mail:
How many tables are to be replicated?
* 2 tables + another one which will get modified a little.

What is the total number of expected rows for each table?
* 1 table about 3.5 million and the second is quite hard to determine
but its about 40,000, the third one modification will probably be about
10 new rows each day maximum so its really not a problem (it will
require one query on all of the data at the end though...).

How fast are the tables expected to grow?
* im not sure that i understand your question but if you ask it about
the time that it takes the tables to get to their final size then its
supposed to take minutes i guess cause it will probably load everything
and i want to cut that time as much as i can (ill run it under "nice").

When must the new data become available online?
* right after the changes will complete (its not really a time in
seconds or a specific hour).

Are all of the tables in the database populated from a foreign source or
just some of them?
* just those 3.

Do you also have access to the source data in its database format, or
only as text dumps?
* only as text or XML, i prefer text cause i want to use COPY though any
diffrent thoughts will be accepted gladly.

** The goal of the site is to create a portal of products for end-users.


Im startring to wonder maybe i shouldnt upload the "New Data" to a "Temp
table" but instead upload it directly to the "Main table" and just add a
status field which will tell that its "in progress", when deleting ill
delete all of those rows which their status is "active" and afterwards
update all of the rows with the status of "in progress" to "active".
By this solution ill also be able to save from deletion rows which ill
want to keep from deletion by just changing their status to something
else then "active" or "in progress".
Also ill save the need to transfer all of the data from the "Temp table"
to the "Main table".

2 points bother me here...
1. by uploading the data to the "Main table" before deleting its content
i create a table with 7 million rows which will stress the system more
for every query that ill run on it (like the one which i need for the
deletion of the old rows).
2. im not sure if ill be able to restart the counting of the indexing
when there will be data at the table (after all at this method there
wont be any period of time without data at the "Main table").

I guess that what i really want to know is how much all of this process
will stress the server... and what can i do to let the server work on it
in a way that it wont disturb the rest of the processes.

Thanks alot again,
Ben-Nes Yonatan
Canaan Surfing ltd.
http://www.canaan.net.il


pgsql-general by date:

Previous
From: "Sim Zacks"
Date:
Subject: Re: trigger update delete - found it
Next
From: David Pratt
Date:
Subject: Trigger help