Table partition for very large table

From
Yudie Gunawan
Date:
I have a table with more than 4 million records, and when I do a select
query it gives me an "out of memory" error.
Does Postgres have a feature like table partitioning to handle tables
with that many records?
Just wondering, what do you guys do to deal with very large tables?


Thanks!

Re: Table partition for very large table

From
Scott Marlowe
Date:
On Mon, 2005-03-28 at 11:32, Yudie Gunawan wrote:
> I have a table with more than 4 million records, and when I do a select
> query it gives me an "out of memory" error.
> Does Postgres have a feature like table partitioning to handle tables
> with that many records?
> Just wondering, what do you guys do to deal with very large tables?

Is this a straight "select * from table" or is there more being done to
the data?

If it's a straight select, you are likely running out of memory to hold
the result set, and need to look at using a cursor to grab the result in
pieces.
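
For illustration, a minimal sketch of the cursor approach in plain SQL, assuming the prdt_old table that appears later in this thread and an arbitrary batch size. A cursor declared without WITH HOLD only lives inside a transaction block, hence the BEGIN/COMMIT:

BEGIN;
DECLARE big_cur CURSOR FOR SELECT * FROM prdt_old;
-- Repeat this FETCH, processing each batch, until it returns no rows:
FETCH FORWARD 10000 FROM big_cur;
CLOSE big_cur;
COMMIT;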

Re: Table partition for very large table

From
"Joshua D. Drake"
Date:
On Mon, 2005-03-28 at 11:32 -0600, Yudie Gunawan wrote:
> I have a table with more than 4 million records, and when I do a select
> query it gives me an "out of memory" error.
> Does Postgres have a feature like table partitioning to handle tables
> with that many records?
> Just wondering, what do you guys do to deal with very large tables?

You can use a cursor to help with this problem and fetch through the
records.

http://www.postgresql.org/docs/8.0/interactive/sql-declare.html

Sincerely,

Joshua D. Drake


>
>
> Thanks!
--
Command Prompt, Inc., Your PostgreSQL solutions company. 503-667-4564
Custom programming, 24x7 support, managed services, and hosting
Open Source Authors: plPHP, pgManage, Co-Authors: plPerlNG
Reliable replication, Mammoth Replicator - http://www.commandprompt.com/


Re: Table partition for very large table

From
Michael Fuhr
Date:
On Mon, Mar 28, 2005 at 11:32:04AM -0600, Yudie Gunawan wrote:

> I have a table with more than 4 million records, and when I do a select
> query it gives me an "out of memory" error.

What's the query and how are you issuing it?  Where are you seeing
the error?  This could be a client problem: the client might be
trying to fetch all rows before doing anything with them, thereby
exhausting all memory.  If that's the case then a cursor might be
useful.

> Does Postgres have a feature like table partitioning to handle tables
> with that many records?

Let's identify the problem before guessing how to fix it.

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/

Re: Table partition for very large table

From
Yudie Gunawan
Date:
I actually need to join 2 tables. Both of them are similar and have
more than 4 million records.

CREATE TABLE prdt_old (
 groupnum int4 NOT NULL,
 sku varchar(30) NOT NULL,
 url varchar(150)
);

CREATE TABLE prdt_new (
 groupnum int4 NOT NULL,
 sku varchar(30) NOT NULL,
 url varchar(150) NOT NULL
);

The query returns the group number and sku of rows in the old table
that have no url in the prdt_new table.

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
 FROM prdt_old po
 LEFT OUTER JOIN prdt_new pn
   ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE pn.url IS NULL OR pn.url = '';

I already have a workaround for this problem, where I separate the
query for each group.

But in asking this question, I was hoping that PostgreSQL has some
kind of table optimization for large record counts. In my experience
it is faster to query smaller chopped-up tables than one single huge
table. I heard Oracle has some kind of table partitioning that still
acts like a single table.
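
For reference, PostgreSQL (as of 8.0) has no such built-in feature, but table inheritance is often used to emulate partitioning by hand. A rough sketch under that assumption, with hypothetical per-group child tables:

-- Parent table; it normally holds no rows itself.
CREATE TABLE prdt_all (
 groupnum int4 NOT NULL,
 sku varchar(30) NOT NULL,
 url varchar(150)
);

-- One child per group; the CHECK constraint documents which rows belong where.
CREATE TABLE prdt_g1 (CHECK (groupnum = 1)) INHERITS (prdt_all);
CREATE TABLE prdt_g2 (CHECK (groupnum = 2)) INHERITS (prdt_all);

-- A query against the parent transparently scans all children:
SELECT count(*) FROM prdt_all WHERE groupnum = 1;

Note that in 8.0 the planner still scans every child for a query like the last one (constraint exclusion only arrived in 8.1), so the gain comes mainly from addressing the child tables directly.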

Re: Table partition for very large table

From
Scott Marlowe
Date:
On Mon, 2005-03-28 at 13:02, Yudie Gunawan wrote:
> I actually need to join 2 tables. Both of them are similar and have
> more than 4 million records.
>
> CREATE TABLE prdt_old (
>  groupnum int4 NOT NULL,
>  sku varchar(30) NOT NULL,
>  url varchar(150)
> );
>
> CREATE TABLE prdt_new (
>  groupnum int4 NOT NULL,
>  sku varchar(30) NOT NULL,
>  url varchar(150) NOT NULL
> );
>
> The query returns the group number and sku of rows in the old table
> that have no url in the prdt_new table.
>
> INSERT INTO prdtexpired
> SELECT po.groupnum, po.sku
>  FROM prdt_old po
>  LEFT OUTER JOIN prdt_new pn
>    ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
> WHERE pn.url IS NULL OR pn.url = '';
>
> I already have a workaround for this problem, where I separate the
> query for each group.
>
> But in asking this question, I was hoping that PostgreSQL has some
> kind of table optimization for large record counts. In my experience
> it is faster to query smaller chopped-up tables than one single huge
> table. I heard Oracle has some kind of table partitioning that still
> acts like a single table.

Hold on, let's diagnose the real problem before we look for solutions.
What does explain <query> tell you?  Have you analyzed the database?
What are your postgresql.conf settings?

Re: Table partition for very large table

From
Yudie Gunawan
Date:
> Hold on, let's diagnose the real problem before we look for solutions.
> What does explain <query> tell you?  Have you analyzed the database?


This is the QUERY PLAN
Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
  Hash Cond: (("outer".groupnum = "inner".groupnum) AND (("outer".sku)::text = ("inner".sku)::text))
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
  ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
        ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)


> What are your postgresql.conf settings?

Which specific settings do you suspect need to be changed?

Re: Table partition for very large table

From
Scott Marlowe
Date:
On Mon, 2005-03-28 at 13:50, Yudie Gunawan wrote:
> > Hold on, let's diagnose the real problem before we look for solutions.
> > What does explain <query> tell you?  Have you analyzed the database?
>
>
> This is the QUERY PLAN
> Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
>   Hash Cond: (("outer".groupnum = "inner".groupnum) AND (("outer".sku)::text = ("inner".sku)::text))
>   Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
>   ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
>   ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
>         ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)
>
>
> > What are your postgresql.conf settings?
>
> Which specific settings do you suspect need to be changed?

sort_mem, also known as work_mem (in 8.0).

Also, this is important: have you analyzed the table?  I'm guessing no,
since the estimates are 1,000 rows, but the hash join is getting a little
bit more than that.  :)

Analyze your database and then run the query again.
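
For illustration, on 8.0 that might look like the following; the work_mem value is an arbitrary example and is given in kilobytes (unit suffixes like '64MB' are not accepted until later releases):

-- Raise per-sort memory for this session only (value in KB; example only)
SET work_mem = 65536;

-- Refresh the planner's statistics for both tables
ANALYZE prdt_old;
ANALYZE prdt_new;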

Re: Table partition for very large table

From
Martijn van Oosterhout
Date:
Looks like you need to create some indexes, probably on (groupnum) and
possibly on (groupnum,sku) on both tables.
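
Spelled out, with illustrative index names, those suggestions would be something like:

CREATE INDEX prdt_old_group_sku_idx ON prdt_old (groupnum, sku);
CREATE INDEX prdt_new_group_sku_idx ON prdt_new (groupnum, sku);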

Hope this helps,

On Mon, Mar 28, 2005 at 01:50:06PM -0600, Yudie Gunawan wrote:
> > Hold on, let's diagnose the real problem before we look for solutions.
> > What does explain <query> tell you?  Have you analyzed the database?
>
>
> This is the QUERY PLAN
> Hash Left Join  (cost=25.00..412868.31 rows=4979686 width=17)
>   Hash Cond: (("outer".groupnum = "inner".groupnum) AND (("outer".sku)::text = ("inner".sku)::text))
>   Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
>   ->  Seq Scan on prdt_old mc  (cost=0.00..288349.86 rows=4979686 width=17)
>   ->  Hash  (cost=20.00..20.00 rows=1000 width=78)
>         ->  Seq Scan on prdt_new mi  (cost=0.00..20.00 rows=1000 width=78)
>
>
> > What are your postgresql.conf settings?
>
> Which specific settings do you suspect need to be changed?
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.


Re: Table partition for very large table

From
Yudie Pg
Date:
> Also, this is important: have you analyzed the table?  I'm guessing no,
> since the estimates are 1,000 rows, but the hash join is getting a little
> bit more than that.  :)
>
> Analyze your database and then run the query again.

I analyzed the table, and it decreased the number of rows in the nested
loop in the query plan. But the query still gets stuck, or perhaps times
out, when I execute it. This optimization workaround does not seem to be
cutting the query time.

What about table partitioning? Does anyone know about it?

Re: Table partition for very large table

From
Scott Marlowe
Date:
On Mon, 2005-03-28 at 15:38, Yudie Pg wrote:
> > Also, this is important: have you analyzed the table?  I'm guessing no,
> > since the estimates are 1,000 rows, but the hash join is getting a little
> > bit more than that.  :)
> >
> > Analyze your database and then run the query again.
>
> I analyzed the table, and it decreased the number of rows in the nested
> loop in the query plan. But the query still gets stuck, or perhaps times
> out, when I execute it. This optimization workaround does not seem to be
> cutting the query time.
>
> What about table partitioning? Does anyone know about it?

Hold your horses there.  Calm down.  We'll get it running faster.  Our
first step was to get the analyzer to find out the right count of rows
in your table.

There aren't any built-in table partitions, and they might or might not
help even if they did exist.

First we had to get the patient's heart beating, now we'll work on the
exercise program.

This is a huge amount of data you're running across.  What does explain
<yourquery> say now?  If you can let it run, then you might want to try
explain analyze <yourquery> as well, but that has to run the whole
query.

Now, are you running the original query you listed:

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
 FROM prdt_old po
 LEFT OUTER JOIN prdt_new pn
   ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE pn.url IS NULL OR pn.url = '';

???

Possibly helpful indexes would be:

create index prdt_new_url_dx on prdt_new (url);
create index prdt_new_sku_dx on prdt_new (sku);
create index prdt_old_sku_dx on prdt_old (sku);
create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;

Don't necessarily make them all.  It really depends on how many rows
match and whatnot.

Re: Table partition for very large table

From
Scott Marlowe
Date:
On Mon, 2005-03-28 at 16:02, Scott Marlowe wrote:
> [snip]
>
> Possibly helpful indexes would be:
>
> create index prdt_new_url_dx on prdt_new (url);
> create index prdt_new_sku_dx on prdt_new (sku);
> create index prdt_old_sku_dx on prdt_old (sku);
> create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;
>
> Don't necessarily make them all.  It really depends on how many rows
> match and whatnot.

Oh, and look at indexing these two columns as well:

pn.groupnum = po.groupnum
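
Spelled out, with illustrative names:

create index prdt_old_groupnum_dx on prdt_old (groupnum);
create index prdt_new_groupnum_dx on prdt_new (groupnum);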


Re: Table partition for very large table

From
Yudie Pg
Date:
> > create index prdt_new_url_dx on prdt_new (url);
> > create index prdt_new_sku_dx on prdt_new (sku);
> > create index prdt_old_sku_dx on prdt_old (sku);
> > create index prdt_new_url_null_dx on prdt_new (url) where url IS NULL;

I added the indexes and redid the analyze, and the query plan looks
better. But when I execute the query it still can't finish in one go
(I've been waiting more than 30 minutes), and it seems to time out
(error: "canceling query due to user request") :-B
Maybe you can help analyze this query plan (the second one) to see
what makes it slow?

QUERY PLAN (BEFORE ANALYZE):
Merge Left Join  (cost=1886617.54..1960855.12 rows=4979571 width=19)
  Merge Cond: (("outer"."?column3?" = "inner"."?column4?") AND ("outer".groupnum = "inner".groupnum))
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Sort  (cost=969258.98..981707.91 rows=4979571 width=19)
        Sort Key: (mc.sku)::text, mc.groupnum
        ->  Seq Scan on prdt_old mc  (cost=0.00..297611.71 rows=4979571 width=19)
  ->  Sort  (cost=917358.56..928785.51 rows=4570779 width=82)
        Sort Key: (mi.sku)::text, mi.groupnum
        ->  Seq Scan on prdt_new mi  (cost=0.00..126438.79 rows=4570779 width=82)

QUERY PLAN (AFTER ANALYZE):
Nested Loop Left Join  (cost=0.00..233277.42 rows=23747 width=17)
  Join Filter: ("inner".groupnum = "outer".groupnum)
  Filter: (("inner".url IS NULL) OR (("inner".url)::text = ''::text))
  ->  Index Scan using idx_prdtold_sku on prdt_old mc  (cost=0.00..112211.85 rows=23747 width=17)
  ->  Index Scan using idx_prdtnew_sku on prdt_new mi  (cost=0.00..5.08 rows=1 width=82)
        Index Cond: ((mi.sku)::text = ("outer".sku)::text)

I experimented with indexes before, but I wasn't sure about creating
an index on the url field, because it is a text field and not a main
key for this query. I have just learned that we can create an index
with a condition (in this case, where url is null), so I guess that
may cut some of the query time.

I separated the query by groupnum instead of querying the whole table,
which was causing the bottleneck. It works; it is slow (less than an
hour), but it goes through.
I wish I could do this all at once.
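
For reference, the per-group workaround described here might look like the following sketch; the literal groupnum value is illustrative and would be repeated for each distinct group:

INSERT INTO prdtexpired
SELECT po.groupnum, po.sku
 FROM prdt_old po
 LEFT OUTER JOIN prdt_new pn
   ON (pn.groupnum = po.groupnum AND pn.sku = po.sku)
WHERE po.groupnum = 1   -- run once per group number
  AND (pn.url IS NULL OR pn.url = '');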