Thread: Table partition for very large table
I have a table with more than 4 million records, and when I run a select query against it I get an "out of memory" error. Does Postgres have a feature like table partitioning to handle tables with very large record counts? Just wondering what you guys do to deal with very large tables? Thanks!
On Mon, 2005-03-28 at 11:32, Yudie Gunawan wrote:
> I have a table with more than 4 million records, and when I run a select
> query against it I get an "out of memory" error.
> Does Postgres have a feature like table partitioning to handle tables
> with very large record counts?
> Just wondering what you guys do to deal with very large tables?

Is this a straight "select * from table", or is there more being done to the data? If it's a straight select, you are likely running out of memory to hold the result set, and need to look at using a cursor to grab the result in pieces.
On Mon, 2005-03-28 at 11:32 -0600, Yudie Gunawan wrote:
> I have a table with more than 4 million records, and when I run a select
> query against it I get an "out of memory" error.
> Does Postgres have a feature like table partitioning to handle tables
> with very large record counts?
> Just wondering what you guys do to deal with very large tables?

You can use a cursor to help with this problem and fetch through the records.

http://www.postgresql.org/docs/8.0/interactive/sql-declare.html

Sincerely,

Joshua D. Drake

--
Command Prompt, Inc., Your PostgreSQL solutions company. 503-667-4564
Custom programming, 24x7 support, managed services, and hosting
Open Source Authors: plPHP, pgManage, Co-Authors: plPerlNG
Reliable replication, Mammoth Replicator - http://www.commandprompt.com/
On Mon, Mar 28, 2005 at 11:32:04AM -0600, Yudie Gunawan wrote:
> I have a table with more than 4 million records, and when I run a select
> query against it I get an "out of memory" error.

What's the query and how are you issuing it? Where are you seeing the error? This could be a client problem: the client might be trying to fetch all rows before doing anything with them, thereby exhausting all memory. If that's the case then a cursor might be useful.

> Does Postgres have a feature like table partitioning to handle tables
> with very large record counts?

Let's identify the problem before guessing how to fix it.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/
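[Editor's note: both replies above point at cursors. A minimal sketch of the pattern, against the poster's table; the cursor name and batch size are illustrative:]

```sql
BEGIN;

-- Declare a cursor over the large result set; nothing is fetched yet.
DECLARE big_cur CURSOR FOR
    SELECT groupnum, sku FROM prdt_old;

-- Pull the rows in manageable batches instead of materializing them all
-- client-side. Repeat the FETCH until it returns no rows.
FETCH 1000 FROM big_cur;

CLOSE big_cur;
COMMIT;
```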
I actually need to join 2 tables. Both of them are similar and have more than 4 million records.

CREATE TABLE prdt_old (
    groupnum int4 NOT NULL,
    sku varchar(30) NOT NULL,
    url varchar(150)
);

CREATE TABLE prdt_new (
    groupnum int4 NOT NULL,
    sku varchar(30) NOT NULL,
    url varchar(150) NOT NULL
);

The query returns the group number and sku from the old table that have no url in the prdt_new table:

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
FROM prdt_old po
LEFT OUTER JOIN prdt_new pn
    ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE pn.url IS NULL OR pn.url = '';

(Note the select list reads from po, not pn: on a non-matching left join row every pn column is NULL.)

I already have a resolution for this problem where I separate the query for each group. But when I raised this question, I was hoping that PostgreSQL has some kind of table optimization for large record counts. In my experience it is faster to query against smaller, chopped-up tables than against a single huge table. I heard Oracle has some kind of table partitioning that acts like a single table.
On Mon, 2005-03-28 at 13:02, Yudie Gunawan wrote:
> I actually need to join 2 tables. Both of them are similar and have
> more than 4 million records.
>
> CREATE TABLE prdt_old (
>     groupnum int4 NOT NULL,
>     sku varchar(30) NOT NULL,
>     url varchar(150)
> );
>
> CREATE TABLE prdt_new (
>     groupnum int4 NOT NULL,
>     sku varchar(30) NOT NULL,
>     url varchar(150) NOT NULL
> );
>
> The query returns the group number and sku from the old table that
> have no url in the prdt_new table:
>
> INSERT INTO prdtexpired
> SELECT po.groupnum, po.sku
> FROM prdt_old po
> LEFT OUTER JOIN prdt_new pn
>     ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
> WHERE pn.url IS NULL OR pn.url = '';
>
> I already have a resolution for this problem where I separate the query
> for each group.
>
> But when I raised this question, I was hoping that PostgreSQL has some
> kind of table optimization for large record counts. In my experience it
> is faster to query against smaller, chopped-up tables than against a
> single huge table. I heard Oracle has some kind of table partitioning
> that acts like a single table.

Hold on, let's diagnose the real problem before we look for solutions. What does explain <query> tell you? Have you analyzed the database? What are your postgresql.conf settings?
> Hold on, let's diagnose the real problem before we look for solutions.
> What does explain <query> tell you? Have you analyzed the database?

This is the QUERY PLAN:

Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
  Hash Cond: (("outer".groupnum = "inner".groupnum) AND (("outer".sku)::text = ("inner".sku)::text))
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
  ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
        ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)

> What are your postgresql.conf settings?

Which specific settings do you suspect need to be changed?
On Mon, 2005-03-28 at 13:50, Yudie Gunawan wrote:
> This is the QUERY PLAN:
>
> Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
>   Hash Cond: (("outer".groupnum = "inner".groupnum) AND
>     (("outer".sku)::text = ("inner".sku)::text))
>   Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
>   ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
>   ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
>         ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)
>
> Which specific settings do you suspect need to be changed?

sort_mem, also known as work_mem (in 8.0).

Also, this is important: have you analyzed the table? I'm guessing no, since the estimate for prdt_new is 1,000 rows, but the hash join is clearly getting more than that. :)

Analyze your database and then run the query again.
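[Editor's note: the advice above amounts to two statements. A sketch; the 64 MB figure is an arbitrary illustration, tune it to your RAM:]

```sql
-- Refresh planner statistics so row estimates match reality.
ANALYZE prdt_old;
ANALYZE prdt_new;

-- Give sorts and hashes more room for this session. PostgreSQL 8.0
-- calls this work_mem; earlier releases call it sort_mem. The value
-- is in kilobytes, so this is 64 MB.
SET work_mem = 65536;
```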
Looks like you need to create some indexes, probably on (groupnum) and possibly on (groupnum, sku), on both tables.

Hope this helps,

On Mon, Mar 28, 2005 at 01:50:06PM -0600, Yudie Gunawan wrote:
> This is the QUERY PLAN:
>
> Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
>   Hash Cond: (("outer".groupnum = "inner".groupnum) AND
>     (("outer".sku)::text = ("inner".sku)::text))
>   Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
>   ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
>   ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
>         ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)
>
> Which specific settings do you suspect need to be changed?

--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
> Also, this is important: have you analyzed the table? I'm guessing no,
> since the estimate for prdt_new is 1,000 rows, but the hash join is
> clearly getting more than that. :)
>
> Analyze your database and then run the query again.

I analyzed the table, and it decreased the row counts in the nested loop in the query plan. But then the query got stuck, or possibly timed out, when I executed it. This optimization does not seem to be helping cut the query time.

What about table partitioning? Does anyone know about it?
On Mon, 2005-03-28 at 15:38, Yudie Pg wrote:
> I analyzed the table, and it decreased the row counts in the nested
> loop in the query plan. But then the query got stuck, or possibly
> timed out, when I executed it. This optimization does not seem to be
> helping cut the query time.
>
> What about table partitioning? Does anyone know about it?

Hold your horses there. Calm down. We'll get it running faster. Our first step was to get the analyzer to find out the right count of how many rows you have in your table.

There aren't any built-in table partitions, and they might or might not help even if they did exist. First we had to get the patient's heart beating; now we'll work on the exercise program.

This is a huge amount of data you're running across. What does explain <yourquery> say now? If you can let it run, then you might want to try explain analyze <yourquery> as well, but that has to run the whole query.

Now, are you running the original query you listed?

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
FROM prdt_old po
LEFT OUTER JOIN prdt_new pn
    ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE pn.url IS NULL OR pn.url = '';

Possibly helpful indexes would be:

create index prdt_new_url_dx on prdt_new (url);
create index prdt_new_sku_dx on prdt_new (sku);
create index prdt_old_sku_dx on prdt_old (sku);
create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;

Don't necessarily make them all. It really depends on how many rows match and what not.
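[Editor's note: while there was no built-in declarative partitioning in that era, it could be emulated with table inheritance. A hedged sketch of that approach; the table names and range boundaries are illustrative, and note the planner of that era would not prune children automatically, so queries had to target the right child themselves:]

```sql
-- Parent table; a SELECT against it also scans all children.
CREATE TABLE prdt (
    groupnum int4 NOT NULL,
    sku varchar(30) NOT NULL,
    url varchar(150)
);

-- One child "partition" per range of groupnum, documented by a CHECK
-- constraint on each child.
CREATE TABLE prdt_g0 (CHECK (groupnum < 1000)) INHERITS (prdt);
CREATE TABLE prdt_g1 (CHECK (groupnum >= 1000)) INHERITS (prdt);

-- The application (or a rule/trigger on the parent) must route each
-- insert to the correct child table.
INSERT INTO prdt_g0 VALUES (42, 'SKU-42', 'http://example.com/42');
```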
On Mon, 2005-03-28 at 16:02, Scott Marlowe wrote:
> There aren't any built-in table partitions, and they might or might not
> help even if they did exist.
>
> Possibly helpful indexes would be:
>
> create index prdt_new_url_dx on prdt_new (url);
> create index prdt_new_sku_dx on prdt_new (sku);
> create index prdt_old_sku_dx on prdt_old (sku);
> create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;
>
> Don't necessarily make them all. It really depends on how many rows
> match and what not.

Oh, and look at indexing the two join columns as well:

pn.groupnum = po.groupnum
> create index prdt_new_url_dx on prdt_new (url);
> create index prdt_new_sku_dx on prdt_new (sku);
> create index prdt_old_sku_dx on prdt_old (sku);
> create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;

I added the indexes and redid the analyze. The query plan looks better, but when I execute the query it still can't finish in one pass (I've been waiting more than 30 minutes), and it seems to time out (error: "canceling query due to user request"). :-B

Maybe you can help analyze this query plan (the second one) to see what makes it slow?

QUERY PLAN (BEFORE ANALYZE):

Merge Left Join  (cost=1886617.54..1960855.12 rows=4979571 width=19)
  Merge Cond: (("outer"."?column3?" = "inner"."?column4?") AND ("outer".groupnum = "inner".groupnum))
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Sort  (cost=969258.98..981707.91 rows=4979571 width=19)
        Sort Key: (mc.sku)::text, mc.groupnum
        ->  Seq Scan on prdt_old mc  (cost=0.00..297611.71 rows=4979571 width=19)
  ->  Sort  (cost=917358.56..928785.51 rows=4570779 width=82)
        Sort Key: (mi.sku)::text, mi.groupnum
        ->  Seq Scan on prdt_new mi  (cost=0.00..126438.79 rows=4570779 width=82)

QUERY PLAN (AFTER ANALYZE):

Nested Loop Left Join  (cost=0.00..233277.42 rows=23747 width=17)
  Join Filter: ("inner".groupnum = "outer".groupnum)
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Index Scan using idx_prdtold_sku on prdt_old mc  (cost=0.00..112211.85 rows=23747 width=17)
  ->  Index Scan using idx_prdtnew_sku on prdt_new mi  (cost=0.00..5.08 rows=1 width=82)
        Index Cond: ((mi.sku)::text = ("outer".sku)::text)

I worked with indexes before, but I wasn't sure about creating an index on the url field, because it is a text field and not a key for this query. I've just learned that we can create an index with a condition (in this case, where url is null); I guess that may cut some of the query time.

In the end I separate the query by groupnum instead of querying the whole table, which was causing the bottleneck.
It works, and yes it's slow (less than 1 hour), but it goes through. I wish I could run the groups simultaneously.
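[Editor's note: the per-group workaround described above can be sketched as follows. The join follows the earlier posts; the literal group number is illustrative:]

```sql
-- Run the expired-product check one group at a time instead of in one
-- giant join; each statement touches a much smaller slice of both tables
-- and can use the groupnum/sku indexes effectively.
INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
FROM prdt_old po
LEFT OUTER JOIN prdt_new pn
    ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE po.groupnum = 1            -- repeat for each groupnum
  AND (pn.url IS NULL OR pn.url = '');
```

Several such statements could be issued from separate client connections to process different groups concurrently, which would address the "simultaneously" wish, at the cost of more I/O contention.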