Thread: splitting data into multiple tables

splitting data into multiple tables

From
nair rajiv
Date:
Hello,

          I am working on a project that will take out structured content from wikipedia
and put it in our database. Before putting the data into the database I wrote a script to
find out the number of rows every table would be having after the data is in and I found
there is a table which will approximately have 5 crore entries after data harvesting.
Is it advisable to keep so much data in one table ?
          I have read about 'partitioning' a table. An other idea I have is to break the table into
different tables after the no of rows  in a table has reached a certain limit say 10 lacs.
For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b' each having 10 lac entries.
I needed advice on whether I should go for partitioning or the approach I have thought of.
          We have a HP server with 32GB ram,16 processors. The storage has 24TB diskspace (1TB/HD).
We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the
postgres configuration file so that the database makes maximum utilization of the server we have.
For eg parameters that would increase the speed of inserts and selects.


Thank you in advance
Rajiv Nair

Re: splitting data into multiple tables

From
Amitabh Kant
Date:
On Mon, Jan 25, 2010 at 10:53 PM, nair rajiv <nair331@gmail.com> wrote:
Hello,

          I am working on a project that will take out structured content from wikipedia
and put it in our database. Before putting the data into the database I wrote a script to
find out the number of rows every table would be having after the data is in and I found
there is a table which will approximately have 5 crore entries after data harvesting.
Is it advisable to keep so much data in one table ?
          I have read about 'partitioning' a table. An other idea I have is to break the table into
different tables after the no of rows  in a table has reached a certain limit say 10 lacs.
For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b' each having 10 lac entries.
I needed advice on whether I should go for partitioning or the approach I have thought of.
          We have a HP server with 32GB ram,16 processors. The storage has 24TB diskspace (1TB/HD).
We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the
postgres configuration file so that the database makes maximum utilization of the server we have.
For eg parameters that would increase the speed of inserts and selects.


Thank you in advance
Rajiv Nair

We have several servers whose tables regularly exceed 50 million records, on dual quad-core machines with 8 GB RAM and 4 SAS 15K hard disks in RAID 10.  If 50 million is the maximum number of records you are looking at, I would suggest not breaking up the table. Rather, configure the settings in the postgresql.conf file to handle such loads.

You already have a powerful machine (I assume it's 16 cores, not 16 physical processors), and if configured well, it should present no problems in accessing those records. For tuning PostgreSQL, you can take a look at pgtune (http://pgfoundry.org/projects/pgtune/).
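
The settings it mainly adjusts can be checked from psql before and after tuning; a minimal sketch (the list below is indicative, not exhaustive):

    -- Inspect the memory and checkpoint settings most tuning guides start with.
    SHOW shared_buffers;
    SHOW effective_cache_size;
    SHOW work_mem;
    SHOW maintenance_work_mem;
    SHOW wal_buffers;
    SHOW checkpoint_segments;

The new values themselves go into postgresql.conf, followed by a reload or restart depending on the parameter.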

Two changes I can suggest for your hardware: go for 15K SAS disks instead of SATA if you can manage with less capacity, and use RAID 10 instead of RAID 5.


Regards

Amitabh Kant

Re: splitting data into multiple tables

From
Viji V Nair
Date:


On Mon, Jan 25, 2010 at 10:53 PM, nair rajiv <nair331@gmail.com> wrote:
Hello,

          I am working on a project that will take out structured content from wikipedia
and put it in our database. Before putting the data into the database I wrote a script to
find out the number of rows every table would be having after the data is in and I found
there is a table which will approximately have 5 crore entries after data harvesting.
Is it advisable to keep so much data in one table ?

It is not ideal to keep this much data in a single table; then again, it depends on your application and how the database is used.
 
          I have read about 'partitioning' a table. An other idea I have is to break the table into
different tables after the no of rows  in a table has reached a certain limit say 10 lacs.
For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b' each having 10 lac entries.

I think this won't help that much if you have a single machine. Partition the table and keep the data on different nodes. Have a look at tools like pgpool-II.
 
I needed advice on whether I should go for partitioning or the approach I have thought of.
          We have a HP server with 32GB ram,16 processors. The storage has 24TB diskspace (1TB/HD).
We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the
postgres configuration file so that the database makes maximum utilization of the server we have.

What will your total database size be? What IOPS do you need? You should partition the database, keep the data across multiple nodes, and process them in parallel.
 
For eg parameters that would increase the speed of inserts and selects.



pgfoundry.org/projects/pgtune/ - have a look and check the docs


 
Thank you in advance
Rajiv Nair

Re: splitting data into multiple tables

From
"Kevin Grittner"
Date:
nair rajiv <nair331@gmail.com> wrote:

> I found there is a table which will approximately have 5 crore
> entries after data harvesting.
> Is it advisable to keep so much data in one table ?

That's 50,000,000 rows, right?  At this site, you're looking at a
non-partitioned table with more than seven times that if you go to a
case and click the "Court Record Events" button:

http://wcca.wicourts.gov/

> I have read about 'partitioning' a table. An other idea I have is
> to break the table into different tables after the no of rows  in
> a table has reached a certain limit say 10 lacs.
> For example, dividing a table 'datatable' to 'datatable_a',
> 'datatable_b' each having 10 lac entries.
> I needed advice on whether I should go for partitioning or the
> approach I have thought of.

It can help, and it can hurt.  It depends on the nature of the data
and how it is used.  To get a meaningful answer, I think we'd need
to know a bit more about it.

> We have a HP server with 32GB ram,16 processors. The storage has
> 24TB diskspace (1TB/HD).
> We have put them on RAID-5. It will be great if we could know the
> parameters that can be changed in the postgres configuration file
> so that the database makes maximum utilization of the server we
> have.

Again, it depends a bit on the nature of the queries.  For ideas on
where to start, you might want to look here:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

If you get any particular queries which aren't performing as well as
you think they should, you can post here with details.  See this for
information to include:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

-Kevin

Re: splitting data into multiple tables

From
Craig James
Date:
Kevin Grittner wrote:
> nair rajiv <nair331@gmail.com> wrote:
>
>> I found there is a table which will approximately have 5 crore
>> entries after data harvesting.
>> Is it advisable to keep so much data in one table ?
>
> That's 50,000,000 rows, right?

You should remember that words like lac and crore are not English words, and most English speakers around the world
don't know what they mean.  Thousand, million, billion and so forth are the English words that everyone knows.

Craig

Re: splitting data into multiple tables

From
nair rajiv
Date:


On Tue, Jan 26, 2010 at 1:01 AM, Craig James <craig_james@emolecules.com> wrote:
Kevin Grittner wrote:
nair rajiv <nair331@gmail.com> wrote:
 
I found there is a table which will approximately have 5 crore
entries after data harvesting.
Is it advisable to keep so much data in one table ?
 That's 50,000,000 rows, right?

You should remember that words like lac and crore are not English words, and most English speakers around the world don't know what they mean.  Thousand, million, billion and so forth are the English words that everyone knows.



Oh, I am sorry, I wasn't aware of that.
I am reposting my query with the suggested changes.



Hello,

          I am working on a project that will take out structured content from wikipedia
and put it in our database. Before putting the data into the database I wrote a script to
find out the number of rows every table would be having after the data is in and I found
there is a table which will approximately have 50,000,000 rows after data harvesting.
Is it advisable to keep so much data in one table ?
          I have read about 'partitioning' a table. An other idea I have is to break the table into
different tables after the no of rows  in a table has reached a certain limit say 10,00,000.
For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b' each having 10,00,000 rows.
I needed advice on whether I should go for partitioning or the approach I have thought of.
          We have a HP server with 32GB ram,16 processors. The storage has 24TB diskspace (1TB/HD).
We have put them on RAID-5. It will be great if we could know the parameters that can be changed in the
postgres configuration file so that the database makes maximum utilization of the server we have.
For eg parameters that would increase the speed of inserts and selects.


Thank you in advance
Rajiv Nair 

Craig

Re: splitting data into multiple tables

From
Andres Freund
Date:
On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
> On Tue, Jan 26, 2010 at 1:01 AM, Craig James
<craig_james@emolecules.com> wrote:
>           I am working on a project that will take out structured content
> from wikipedia
> and put it in our database. Before putting the data into the database I
> wrote a script to
> find out the number of rows every table would be having after the data is
> in and I found
> there is a table which will approximately have 50,000,000 rows after data
> harvesting.
> Is it advisable to keep so much data in one table ?
Depends on your access patterns, i.e. how many rows you are accessing at the
same time, and whether those rows have some common locality.


>           I have read about 'partitioning' a table. An other idea I have is
> to break the table into
> different tables after the no of rows  in a table has reached a certain
> limit say 10,00,000.
> For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b'
> each having 10,00,000 rows.
> I needed advice on whether I should go for partitioning or the approach I
> have thought of.
Your approach is pretty close to partitioning - except that partitioning makes
that mostly invisible to the outside, so it is IMHO preferable.
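
A rough sketch of what the built-in approach looks like (inheritance plus CHECK
constraints; the table and column names here are made up for illustration, not
taken from your schema):

    -- Parent table that queries are written against.
    CREATE TABLE datatable (
        id      bigint NOT NULL,
        payload text
    );

    -- Child tables carry non-overlapping CHECK constraints.
    CREATE TABLE datatable_p0 (CHECK (id >= 0       AND id < 1000000)) INHERITS (datatable);
    CREATE TABLE datatable_p1 (CHECK (id >= 1000000 AND id < 2000000)) INHERITS (datatable);

    -- With constraint exclusion enabled, the planner skips child tables whose
    -- CHECK constraint rules them out for a given query.
    SET constraint_exclusion = partition;          -- 'partition' needs 8.4; use 'on' on older releases
    SELECT * FROM datatable WHERE id = 1500042;    -- touches only datatable_p1

Inserts are normally routed to the right child table with a trigger or rule on
the parent, which is the part a hand-named datatable_a/datatable_b scheme would
otherwise have to reimplement in application code.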

>           We have a HP server with 32GB ram,16 processors. The storage has
> 24TB diskspace (1TB/HD).
> We have put them on RAID-5. It will be great if we could know the
> parameters that can be changed in the
> postgres configuration file so that the database makes maximum utilization
> of the server we have.
> For eg parameters that would increase the speed of inserts and selects.
Not using RAID-5 would possibly be a good start - many people (me included)
have experienced bad write performance on it. It depends a great deal on the
controller/implementation, though.
RAID-10 is normally considered more advantageous despite its lower
usable space ratio.
Did you create one big RAID-5 out of all the disks? That's not a good idea,
because it's pretty likely that another disk will fail while you are restoring
a previously failed disk. Unfortunately, in that configuration that means you
have lost your complete data (in the most common implementations at least).

Andres

PS: Your lines are strangely wrapped...

Re: splitting data into multiple tables

From
nair rajiv
Date:


On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund <andres@anarazel.de> wrote:
On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
> On Tue, Jan 26, 2010 at 1:01 AM, Craig James
<craig_james@emolecules.com> wrote:
>           I am working on a project that will take out structured content
> from wikipedia
> and put it in our database. Before putting the data into the database I
> wrote a script to
> find out the number of rows every table would be having after the data is
> in and I found
> there is a table which will approximately have 50,000,000 rows after data
> harvesting.
> Is it advisable to keep so much data in one table ?
Depends on your access patterns, i.e. how many rows you are accessing at the
same time, and whether those rows have some common locality.

         I'll give a brief idea of what this table looks like. The important columns are
subject, predicate and object. Given a predicate and an object one should
be able to get all the subjects, and given a subject and a predicate one should
be able to retrieve all the objects. I have created indexes on these three
columns.
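
For illustration, it is roughly like this (the column types and index names
below are simplified, not the exact schema):

    CREATE TABLE datatable (
        subject   text NOT NULL,
        predicate text NOT NULL,
        object    text NOT NULL
    );

    -- Indexes serving the two lookup patterns described above.
    CREATE INDEX datatable_subject_idx   ON datatable (subject);
    CREATE INDEX datatable_predicate_idx ON datatable (predicate);
    CREATE INDEX datatable_object_idx    ON datatable (object);

    -- Given a predicate and an object, fetch all subjects:
    SELECT subject FROM datatable WHERE predicate = 'p' AND object = 'o';
    -- Given a subject and a predicate, fetch all objects:
    SELECT object  FROM datatable WHERE subject = 's' AND predicate = 'p';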


>           I have read about 'partitioning' a table. An other idea I have is
> to break the table into
> different tables after the no of rows  in a table has reached a certain
> limit say 10,00,000.
> For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b'
> each having 10,00,000 rows.
> I needed advice on whether I should go for partitioning or the approach I
> have thought of.
Your approach is pretty close to partitioning - except that partitioning makes
that mostly invisible to the outside, so it is IMHO preferable.

>           We have a HP server with 32GB ram,16 processors. The storage has
> 24TB diskspace (1TB/HD).
> We have put them on RAID-5. It will be great if we could know the
> parameters that can be changed in the
> postgres configuration file so that the database makes maximum utilization
> of the server we have.
> For eg parameters that would increase the speed of inserts and selects.
Not using RAID-5 would possibly be a good start - many people (me included)
have experienced bad write performance on it. It depends a great deal on the
controller/implementation, though.
RAID-10 is normally considered more advantageous despite its lower
usable space ratio.
Did you create one big RAID-5 out of all the disks? That's not a good idea,
because it's pretty likely that another disk will fail while you are restoring
a previously failed disk. Unfortunately, in that configuration that means you
have lost your complete data (in the most common implementations at least).

No, I am using only 12 TB, i.e. 12 HDs, of the 24 TB I have.

Andres

PS: Your lines are strangely wrapped...

Re: splitting data into multiple tables

From
Viji V Nair
Date:


On Tue, Jan 26, 2010 at 9:18 AM, nair rajiv <nair331@gmail.com> wrote:


On Tue, Jan 26, 2010 at 6:19 AM, Andres Freund <andres@anarazel.de> wrote:
On Tuesday 26 January 2010 01:39:48 nair rajiv wrote:
> On Tue, Jan 26, 2010 at 1:01 AM, Craig James
<craig_james@emolecules.com> wrote:
>           I am working on a project that will take out structured content
> from wikipedia
> and put it in our database. Before putting the data into the database I
> wrote a script to
> find out the number of rows every table would be having after the data is
> in and I found
> there is a table which will approximately have 50,000,000 rows after data
> harvesting.
> Is it advisable to keep so much data in one table ?
Depends on your access patterns, i.e. how many rows you are accessing at the
same time, and whether those rows have some common locality.

         I'll give a brief idea of what this table looks like. The important columns are
subject, predicate and object. Given a predicate and an object one should
be able to get all the subjects, and given a subject and a predicate one should
be able to retrieve all the objects. I have created indexes on these three
columns.


>           I have read about 'partitioning' a table. An other idea I have is
> to break the table into
> different tables after the no of rows  in a table has reached a certain
> limit say 10,00,000.
> For example, dividing a table 'datatable' to 'datatable_a', 'datatable_b'
> each having 10,00,000 rows.
> I needed advice on whether I should go for partitioning or the approach I
> have thought of.
Your approach is pretty close to partitioning - except that partitioning makes
that mostly invisible to the outside, so it is IMHO preferable.

>           We have a HP server with 32GB ram,16 processors. The storage has
> 24TB diskspace (1TB/HD).
> We have put them on RAID-5. It will be great if we could know the
> parameters that can be changed in the
> postgres configuration file so that the database makes maximum utilization
> of the server we have.
> For eg parameters that would increase the speed of inserts and selects.
Not using RAID-5 would possibly be a good start - many people (me included)
have experienced bad write performance on it. It depends a great deal on the
controller/implementation, though.
RAID-10 is normally considered more advantageous despite its lower
usable space ratio.
Did you create one big RAID-5 out of all the disks? That's not a good idea,
because it's pretty likely that another disk will fail while you are restoring
a previously failed disk. Unfortunately, in that configuration that means you
have lost your complete data (in the most common implementations at least).

No, I am using only 12 TB, i.e. 12 HDs, of the 24 TB I have.

A 15k rpm SAS drive will give you a throughput of 12MB  and 120 IOPS. Now you can calculate the number of disks, specifically spindles, for getting your desired throughput and IOPs
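
For example, if the workload needs 2,400 random IOPS, at 120 IOPS per spindle
that is on the order of 2400 / 120 = 20 spindles, before any RAID write penalty
(the target figure is just an illustration).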
 

Andres

PS: Your lines are strangely wrapped...


Re: splitting data into multiple tables

From
Matthew Wakeling
Date:
On Mon, 25 Jan 2010, Viji V Nair wrote:
> I think this won't help that much if you have a single machine. Partition
> the table and keep the data on different nodes. Have a look at tools like
> pgpool-II.

So partitioning. You have three choices:

1. Use a single table
2. Partition the table on the same server
3. Partition the data across multiple servers.

This is in increasing order of complexity.

There will probably be no problem at all with option 1. The only problem
arises if you run a query that performs a full sequential scan of the
entire table, which would obviously take a while. If your queries are
indexable, then option 1 is almost certainly the best option.
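
For example, a quick way to check whether a particular query is indexable on
the single big table (using the hypothetical column names mentioned earlier in
the thread):

    EXPLAIN
    SELECT object
    FROM   datatable
    WHERE  subject = 'Some_subject' AND predicate = 'some_predicate';
    -- A plan showing "Index Scan ... on datatable" rather than
    -- "Seq Scan on datatable" means the 50-million-row table is never
    -- scanned in full for this query.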

Option 2 adds complexity in the Postgres server. You will need to
partition your tables in a logical manner - that is, there needs to be
some difference between rows in table a compared to rows in table b. This
means that the partitioning will in effect be a little like indexing. You
do not want to have too many partitions. The advantage is that if a query
requires a full sequential scan, then there is the possibility of skipping
some of the partitions, although there is some complexity involved in
getting this to work correctly. In a lot of cases, partitioning will make
queries slower by confusing the planner.

Option 3 is only useful when you have a real performance problem with
long-running queries (partitioning the data across servers) or with very
large numbers of queries (duplicating the data across servers). It also
adds much complexity. It is fairly simple to run "filter these results
from the table" queries across multiple servers, but if that was all you
were doing, you may as well use an index instead. It becomes impossible to
perform proper cross-referencing queries without some very clever software
(because not all the data is available on the server), which will probably
be hard to manage and slow down the execution anyway.

My recommendation would be to stick with a single table unless you have a
real need to partition.

Matthew

--
Note: some countries impose serious penalties for a conspiracy to overthrow
      the political system. THIS DOES NOT FIX THE VULNERABILITY.
                          -- http://seclists.org/vulnwatch/2003/q2/0002.html

Re: splitting data into multiple tables

From
Matthew Wakeling
Date:
On Mon, 25 Jan 2010, nair rajiv wrote:
> I am working on a project that will take out structured content from
> wikipedia and put it in our database...
> there is a table which will approximately have 5 crore entries after data
> harvesting.

Have you asked the Wikimedia Foundation if they mind you consuming that
much of their bandwidth, or even if there are copyright issues involved in
grabbing that much of their data?

(The other problem with using the word "crore" is that although it may
mean 10000000 in a few countries, it could also mean 500000.)

Matthew

--
 Of course it's your fault. Everything here's your fault - it says so in your
 contract.                                    - Quark

Re: splitting data into multiple tables

From
nair rajiv
Date:


On Tue, Jan 26, 2010 at 5:15 PM, Matthew Wakeling <matthew@flymine.org> wrote:
On Mon, 25 Jan 2010, nair rajiv wrote:
I am working on a project that will take out structured content from wikipedia and put it in our database...

there is a table which will approximately have 5 crore entries after data
harvesting.

Have you asked the Wikimedia Foundation if they mind you consuming that much of their bandwidth, or even if there are copyright issues involved in grabbing that much of their data?

 
We are downloading the NT and OWL files made available for download at
http://wiki.dbpedia.org/Downloads34


(The other problem with using the word "crore" is that although it may mean 10000000 in a few countries, it could also mean 500000.)

Matthew

--
Of course it's your fault. Everything here's your fault - it says so in your
contract.                                    - Quark

Re: splitting data into multiple tables

From
Greg Smith
Date:
Viji V Nair wrote:
> A 15k rpm SAS drive will give you a throughput of 12MB  and 120 IOPS.
> Now you can calculate the number of disks, specifically spindles, for
> getting your desired throughput and IOPs

I think you mean 120MB/s for that first part.  Regardless, presuming you
can provision a database just based on IOPS rarely works.  It's nearly
impossible to estimate what you really need anyway for a database app,
given that much of real-world behavior depends on the cached in memory
vs. uncached footprint of the data you're working with.  By the time you
put a number of disks into an array, throw a controller card cache on
top of it, then add the OS and PostgreSQL caches on top of those, you
are so far disconnected from the underlying drive IOPS that speaking in
those terms doesn't get you very far.  I struggle with this every time I
talk with a SAN vendor.  Their fixation on IOPS without considering
things like how sequential scans mixed into random I/O will get handled
is really disconnected from how databases work in practice.  For
example, I constantly end up needing to detune IOPS in favor of
readahead to make "SELECT x,y,z FROM t" run at an acceptable speed on
big tables.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: splitting data into multiple tables

From
Viji V Nair
Date:


On Tue, Jan 26, 2010 at 11:11 PM, Greg Smith <greg@2ndquadrant.com> wrote:
Viji V Nair wrote:
A 15k rpm SAS drive will give you a throughput of 12MB  and 120 IOPS. Now you can calculate the number of disks, specifically spindles, for getting your desired throughput and IOPs

I think you mean 120MB/s for that first part.  Regardless, presuming you can provision a database just based on IOPS rarely works.  It's nearly impossible to estimate what you really need anyway for a database app, given that much of real-world behavior depends on the cached in memory vs. uncached footprint of the data you're working with.  By the time you put a number of disks into an array, throw a controller card cache on top of it, then add the OS and PostgreSQL caches on top of those, you are so far disconnected from the underlying drive IOPS that speaking in those terms doesn't get you very far.  I struggle with this every time I talk with a SAN vendor.  Their fixation on IOPS without considering things like how sequential scans mixed into random I/O will get handled is really disconnected from how databases work in practice.  For example, I constantly end up needing to detune IOPS in favor of readahead to make "SELECT x,y,z FROM t" run at an acceptable speed on big tables.


Yes, you are right.

There are catches in the SAN controllers also. SAN vendors won't give that much information regarding their internal controller design. They will say they have 4 external 4G ports; you should also check how many internal ports they have and how the controllers operate, in Active-Active or Active-Standby mode.


 
--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: splitting data into multiple tables

From
Greg Smith
Date:
Viji V Nair wrote:
> There are catches in the SAN controllers also. SAN vendors won't give
> that much information regarding their internal controller design. They
> will say they have 4 external 4G ports; you should also check how many
> internal ports they have and how the controllers operate,
> in Active-Active or Active-Standby mode.

Right, the SAN cache serves the same purpose as the controller cache on
direct-attached storage.  I've never seen a Fibre Channel card that had
its own local cache too; I doubt that's even possible.  So I think of them
as basically being the same type of cache, with the primary difference
being that the transfers between the host and the cache have some latency
with FC compared to direct storage.

You're right that people should question the internal design too of
course.  Some days I wonder if I'm in the wrong business--the people who
do SAN tuning seem to have no idea what they're doing and yet are still
expensive to hire.  But this is off-topic for the question being asked here.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com