Thread: large xml database
Hi,
I have a very big XML document, larger than 50GB, which I want to import into a database and transform into a relational schema.
When splitting this document into smaller independent XML documents I get ~11.1 million XML documents.
I have spent a lot of time trying to find the fastest way to transform all this data, but every time I give up because it takes too long. Sometimes it would take more than a month if not stopped.
I have tried to insert each line as varchar into the database and parse it using plperl regexes.
I have also tried to store every document as XML and parse it, but that is also too slow.
I have tried to store every document as varchar, but it is also slow when using regexes to extract the data.
Many attempts have failed because 8GB of RAM and 10GB of swap were not enough. Also, sometimes I get an error that more than 2^32 operations were performed, and the functions stop working.
I just wanted to ask if someone knows how to speed this up.
Thanks in advance
-- ---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me
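
A route that usually comes up for loads like this is to bulk-load the split documents with COPY instead of row-by-row INSERTs, and then pull fields out with xpath() rather than regexes. A minimal sketch, assuming a hypothetical staging table, an <entry>-rooted document with <id> and <name> children, and one document per input line (all names and paths here are illustrative, not taken from the thread):

    -- staging table holding one split document per row
    CREATE TABLE raw_docs (doc xml);

    -- bulk load; far faster than one INSERT per document
    -- (use \copy from psql if the file lives on the client side)
    COPY raw_docs (doc) FROM '/path/to/split_documents.txt';

    -- extract fields with xpath() instead of regexes
    CREATE TABLE entry (id text, name text);

    INSERT INTO entry (id, name)
    SELECT (xpath('/entry/id/text()',   doc))[1]::text,
           (xpath('/entry/name/text()', doc))[1]::text
    FROM   raw_docs;

Whether COPY accepts the input depends on each document really sitting on a single line with no raw tabs or backslashes, so a small preprocessing step may still be needed; the point is that the parsing then happens in one set-based statement instead of one function call per row.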
Geeeeeeeez. Maybe you can lease a bunch of Amazon EC2 high-CPU compute slices and parallelize it? I think throwing ridiculous amounts of hardware at things is always the best approach.

On Sat, Oct 30, 2010 at 2:48 PM, Viktor Bojović <viktor.bojovic@gmail.com> wrote:
> I have a very big XML document, larger than 50GB, which I want to import
> into a database and transform into a relational schema. When splitting this
> document into smaller independent XML documents I get ~11.1 million XML
> documents.
> [...]
> I just wanted to ask if someone knows how to speed this up.
On 10/30/2010 4:48 PM, Viktor Bojović wrote:
> I have a very big XML document, larger than 50GB, which I want to import
> into a database and transform into a relational schema. When splitting this
> document into smaller independent XML documents I get ~11.1 million XML
> documents.
> [...]
> Many attempts have failed because 8GB of RAM and 10GB of swap were not
> enough. Also, sometimes I get an error that more than 2^32 operations were
> performed, and the functions stop working.
>
> I just wanted to ask if someone knows how to speed this up.

It might help a little to know a few more details, like what is in the XML doc.

When you say convert to relational, do you mean multiple tables (and no more XML tags), or do you mean a table with a blob column that contains some XML fragment?

I have imported millions of rows and never run out of memory. The database will take care of itself unless you are doing something really bad. I'd guess it's the XML parser running out of RAM and not the database. Are you using DOM or SAX?

You say it took too much time. What did? The XML parsing? The database inserts? Were you CPU bound or IO bound?

What tools are you using to write this in? What OS are you on? What version of PG?

You know... just a "few" more details :-)

-Andy
Is it possible (and how) to implement a data path on another partition (Linux) for an existing system? And then, if I do not gain anything, merge it into the production /data path?

Scenario of what I want to achieve (/mnt/data is already running):

/mnt/data resides on an NFS share
Contains over 2 billion web sites crawled (yeah, another search-engine site)

I want to add:

/opt/data resides on an internal drive
Will contain the keyword hash system

Then, if I find this does not improve anything - or runs tight (running on an IBM BladeCenter with 76GB internals, so I may be limited) - can I simply shut down postgres, scp /opt/data/folder/ to the NFS, bring up postgres, fix any conf file, and have everything on the multi-terabyte array?

Checking the pulse of all you speed freaks who just hit the floor because this is on NFS, not iSCSI - don't worry, when this site goes live it will be iSCSI. We have been in private development for 3 years... and the NFS still outruns the Internet connections.
On 28/12/2010 14:56, Ozz Nixon wrote:
> Is it possible (and how) to implement a data path on another
> partition (Linux) for an existing system? And then if I do not gain
> anything, merging it to the production /data path?

I think tablespaces will do what you want:

http://www.postgresql.org/docs/9.0/static/manage-ag-tablespaces.html

HTH,

Ray.

-- 
Raymond O'Donnell :: Galway :: Ireland
rod@iol.ie
Ozz Nixon <ozznixon@gmail.com> wrote:
> Is it possible (and how) to implement a data path on another partition
> (Linux) for an existing system? And then if I do not gain anything,
> merging it to the production /data path?

I think you should read our documentation about tablespaces:

http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html

Andreas
-- 
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°
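
For reference, the tablespace route both replies point to comes down to a couple of statements. A minimal sketch, assuming a hypothetical tablespace name, directory, and table (the directory must already exist and be owned by the postgres OS user; none of these names come from the thread):

    -- create a tablespace on the internal drive (name and path are illustrative)
    CREATE TABLESPACE localdisk LOCATION '/opt/data/pg_tblspc';

    -- new objects can be placed on it explicitly...
    CREATE TABLE keyword_hash (
        keyword text PRIMARY KEY,
        hash    bytea
    ) TABLESPACE localdisk;

    -- ...or an existing table can be moved to another tablespace later
    -- (this physically rewrites the table and takes an exclusive lock)
    ALTER TABLE keyword_hash SET TABLESPACE pg_default;

Moving a table back onto the NFS-backed default tablespace later is the same ALTER ... SET TABLESPACE in the other direction, so no file-level scp of the data directory is needed.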
On Tue, Dec 28, 2010 at 7:56 AM, Ozz Nixon <ozznixon@gmail.com> wrote:
> Is it possible (and how) to implement a data path on another partition
> (Linux) for an existing system? And then if I do not gain anything, merging
> it to the production /data path?
> [...]
> Then, if I find this does not improve anything - or runs tight (running on
> an IBM BladeCenter with 76GB internals, so I may be limited) - can I simply
> shut down postgres, scp /opt/data/folder/ to the NFS, bring up postgres,
> fix any conf file, and have everything on the multi-terabyte array?

You can't just merge two pgsql databases into one directory. You have to pick one to run, dump the other, and load it into the one you kept with pg_dump / psql / pg_restore. If you completely replace the db instance in /opt/data/folder with the other one, that will work, but it erases the original db that was there.

> Checking the pulse of all you speed freaks who just hit the floor because
> this is on NFS, not iSCSI - don't worry, when this site goes live it will
> be iSCSI. We have been in private development for 3 years... and the NFS
> still outruns the Internet connections.

It's more an issue of reliability in case of crashes and power losses than anything else. Most iSCSI implementations will honor fsync properly, but many NFS mounts will lose or corrupt your data store if one or the other machine crashes in the wrong way.
On 2010-12-28, Ozz Nixon <ozznixon@gmail.com> wrote:
> Is it possible (and how) to implement a data path on another partition
> (Linux) for an existing system? And then if I do not gain anything, merging
> it to the production /data path?

Tablespaces.

> Scenario of what I want to achieve (/mnt/data is already running)
>
> /mnt/data resides on an NFS share
> Contains over 2 billion web sites crawled (yeah, another search-engine site)

Using postgres on NFS is not recommended.

> want to add:
>
> /opt/data resides on an internal drive
> Will contain the keyword hash system

Local is good.

> Then, if I find this does not improve anything - or runs tight (running on
> an IBM BladeCenter with 76GB internals, so I may be limited) - can I simply
> shut down postgres, scp /opt/data/folder/ to the NFS, bring up postgres,
> fix any conf file, and have everything on the multi-terabyte array?

Or leave postgres running and just copy the table using "select * into" or similar. It's possibly better to use iSCSI instead of NFS; it's not like you can share the data directory.

-- 
⚂⚃ 100% natural
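
A minimal sketch of that "copy the table while postgres stays up" idea, combined with the tablespace from the earlier replies (tablespace and table names are hypothetical, not taken from the thread):

    -- the "select * into" route, pointed at the local-disk tablespace
    SET default_tablespace = localdisk;
    SELECT * INTO keyword_hash_local FROM keyword_hash;
    -- note: indexes and constraints are not copied and must be recreated

    -- or rewrite the existing table onto the other tablespace in one step
    -- (takes an exclusive lock while the table is rewritten)
    ALTER TABLE keyword_hash SET TABLESPACE localdisk;

CREATE TABLE ... TABLESPACE localdisk AS SELECT ... is the more explicit spelling of the SELECT INTO form; the ALTER TABLE variant keeps the indexes, which can then be moved separately with ALTER INDEX ... SET TABLESPACE.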