Thread: large xml database

large xml database

From
Viktor Bojović
Date:
Hi,
I have a very big XML document which is larger than 50GB and want to import it into a database, and transform it to a relational schema.
When splitting this document into smaller independent XML documents I get ~11.1 million XML documents.
I have spent lots of time trying to find the fastest way to transform all this data, but every time I give up because it takes too much time. Sometimes it would take more than a month if not stopped.
I have tried to insert each line as varchar into the database and parse it using plperl regexes.
I have also tried to store every document as XML and parse it, but that is also too slow.
I have tried to store every document as varchar, but that is also slow when using regexes to get the data.

Many tries have failed because 8GB of RAM and 10GB of swap were not enough. Also, sometimes I get the error that more than 2^32 operations were performed and the functions stop working.

I just wanted to ask if someone knows how to speed this up.

Thanks in advance

--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

Re: large xml database

From
Andreas Joseph Krogh
Date:

On 10/30/2010 11:49 PM, Viktor Bojović wrote:
> Hi,
> I have a very big XML document which is larger than 50GB and want to import
> it into a database, and transform it to a relational schema.
> When splitting this document into smaller independent XML documents I get
> ~11.1 million XML documents.
> I have spent lots of time trying to find the fastest way to transform all
> this data, but every time I give up because it takes too much time.
> Sometimes it would take more than a month if not stopped.
> I have tried to insert each line as varchar into the database and parse it
> using plperl regexes.
> I have also tried to store every document as XML and parse it, but that is
> also too slow.
> I have tried to store every document as varchar, but that is also slow when
> using regexes to get the data.
>
> Many tries have failed because 8GB of RAM and 10GB of swap were not enough.
> Also, sometimes I get the error that more than 2^32 operations were
> performed and the functions stop working.
>
> I just wanted to ask if someone knows how to speed this up.
>
> Thanks in advance

Use a SAX-parser and handle the endElement(String name) events to insert
the element's content into your db.
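A minimal sketch of that approach in Python's xml.sax might look like the following; the element, table and column names are invented placeholders, and psycopg2 is assumed on the database side:

# Stream the file with SAX so only the current element's text sits in memory,
# and insert one row per closed <entry>. All names below are illustrative only.
import xml.sax
import psycopg2

class EntryHandler(xml.sax.ContentHandler):
    def __init__(self, conn):
        self.conn = conn
        self.cur = conn.cursor()
        self.text = []
        self.fields = {}
        self.count = 0

    def startElement(self, name, attrs):
        self.text = []                        # fresh text buffer per element

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = ''.join(self.text).strip()
        if name in ('accession', 'name'):     # hypothetical fields of interest
            self.fields[name] = value
        elif name == 'entry':                 # one record is complete
            self.cur.execute(
                "INSERT INTO entry (accession, name) VALUES (%s, %s)",
                (self.fields.get('accession'), self.fields.get('name')))
            self.fields = {}
            self.count += 1
            if self.count % 1000 == 0:        # commit in blocks, not per row
                self.conn.commit()

conn = psycopg2.connect("dbname=mydb")        # placeholder connection string
xml.sax.parse("big.xml", EntryHandler(conn))
conn.commit()

The same handler could just as well write CSV rows for a later COPY instead of executing INSERTs.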

--
Andreas Joseph Krogh <andreak@officenet.no>
Senior Software Developer / CTO
Public key: http://home.officenet.no/~andreak/public_key.asc
------------------------+---------------------------------------------+
OfficeNet AS            | The most difficult thing in the world is to |
Rosenholmveien 25       | know how to do a thing and to watch         |
1414 Trollåsen          | somebody else doing it wrong, without       |
NORWAY                  | comment.                                    |
Tlf:    +47 24 15 38 90 |                                             |
Fax:    +47 24 15 38 91 |                                             |
Mobile: +47 909  56 963 |                                             |
------------------------+---------------------------------------------+


Re: large xml database

From
Rob Sargent
Date:

Andreas Joseph Krogh wrote:
>
> On 10/30/2010 11:49 PM, Viktor Bojović wrote:
>   
>> Hi,
>> I have a very big XML document which is larger than 50GB and want to import
>> it into a database, and transform it to a relational schema.
>> When splitting this document into smaller independent XML documents I get
>> ~11.1 million XML documents.
>> I have spent lots of time trying to find the fastest way to transform all
>> this data, but every time I give up because it takes too much time.
>> Sometimes it would take more than a month if not stopped.
>> I have tried to insert each line as varchar into the database and parse it
>> using plperl regexes.
>> I have also tried to store every document as XML and parse it, but that is
>> also too slow.
>> I have tried to store every document as varchar, but that is also slow when
>> using regexes to get the data.
>>
>> Many tries have failed because 8GB of RAM and 10GB of swap were not enough.
>> Also, sometimes I get the error that more than 2^32 operations were
>> performed and the functions stop working.
>>
>> I just wanted to ask if someone knows how to speed this up.
>>
>> Thanks in advance
>>     
>
> Use a SAX-parser and handle the endElement(String name) events to insert
> the element's content into your db.
>
>   
If you still have the 11 million subfiles, I would start there: SAX-parse
them as above and maybe generate CSV files, then bulk-load those, since a
begin/end transaction on each data item discovered will hurt (a COPY sketch
follows below).

Can the subfiles be segregated into specific data types, or at least holders
of specific data types, such that they relate to a specific subset of your
new db/schema? This will play into what gets loaded first and who depends on
whom w.r.t. foreign keys etc.

You can parallelize marginally with multiple threads (hoping to split file
reading from SAX parsing from element construction from save/network), but
more boxes would be the way to go. Partitioning remains a problem.
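As a rough sketch of the CSV-then-bulk-load step (table, columns and file name are placeholders; psycopg2 assumed on the client), the load itself can be a single COPY:

# Bulk-load a previously generated CSV in one COPY instead of per-row
# transactions. Table, columns and file name are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
with open("entries.csv") as f:
    # copy_expert streams the client-side file into a server-side COPY
    cur.copy_expert("COPY entry (accession, name) FROM STDIN WITH CSV", f)
conn.commit()

Each worker process could produce and load its own CSV, which is one way to get the marginal parallelism mentioned above.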



Re: large xml database

From
James Cloos
Date:
>>>>> "VB" == Viktor Bojović <viktor.bojovic@gmail.com> writes:

VB> I have a very big XML document which is larger than 50GB and want to
VB> import it into a database, and transform it to a relational schema.

Were I doing such a conversion, I'd use perl to convert the xml into
something which COPY can grok.  Any other language, script or compiled,
would work just as well.  The goal is to avoid having to slurp the whole
xml structure into memory.
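James suggests Perl; the same streaming idea sketched in Python with ElementTree's iterparse (element and column names are invented, output is tab-separated text for COPY) might look roughly like this:

# Stream-convert the XML into tab-separated text for COPY, clearing each
# element once handled so memory stays flat. Element names are invented, and
# real data would also need COPY-style escaping of tabs/backslashes/newlines.
import sys
import xml.etree.ElementTree as ET

def child_text(parent, tag):
    child = parent.find(tag)
    return (child.text or '').strip() if child is not None else ''

for event, elem in ET.iterparse(sys.argv[1], events=('end',)):
    if elem.tag == 'entry':
        sys.stdout.write('%s\t%s\n' % (child_text(elem, 'accession'),
                                       child_text(elem, 'name')))
        elem.clear()          # drop the finished subtree to keep memory bounded

Something like

    python convert.py big.xml | psql mydb -c "\copy entry (accession, name) from stdin"

would then feed the output straight into the table.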

-JimC
-- 
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6


Re: large xml database

From
Lutz Steinborn
Date:
On Sat, 30 Oct 2010 23:49:29 +0200
Viktor Bojović <viktor.bojovic@gmail.com> wrote:

> 
> Many tries have failed because 8GB of RAM and 10GB of swap were not enough.
> Also, sometimes I get the error that more than 2^32 operations were performed
> and the functions stop working.
> 
We have a similar problem and we use the Amara XML toolkit for Python. To avoid
the big memory consumption, use pushbind. A 30GB BME catalog file takes at most
about 20 minutes to import. It could be faster still, because we are building
complex objects with an ORM, so the time consumption depends on how complex the
catalog is. If you use Amara only to perform a conversion from XML to CSV, the
final import can be done much faster.
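From memory of Amara 1.x, the pushbind pattern looks roughly like the sketch below; the exact call signature and attribute access may differ between Amara versions, and the element names are placeholders:

# Rough sketch of the Amara 1.x pushbind pattern (API written from memory and
# version-dependent; element names are placeholders).
import amara

for entry in amara.pushbind('catalog.xml', u'entry'):
    # each iteration yields one bound <entry> subtree, keeping memory flat
    name = unicode(entry.name)   # child element text via bindery attribute access
    # ... write the extracted fields to CSV / COPY, or hand them to an ORM ...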

regards

-- 
Lutz

http://www.4c-gmbh.de



Re: large xml database

From
Viktor Bojović
Date:


On Sun, Oct 31, 2010 at 7:08 AM, Lutz Steinborn <l.steinborn@4c-ag.de> wrote:
> On Sat, 30 Oct 2010 23:49:29 +0200
> Viktor Bojović <viktor.bojovic@gmail.com> wrote:
>
> > Many tries have failed because 8GB of RAM and 10GB of swap were not enough.
> > Also, sometimes I get the error that more than 2^32 operations were
> > performed and the functions stop working.
>
> We have a similar problem and we use the Amara XML toolkit for Python. To
> avoid the big memory consumption, use pushbind. A 30GB BME catalog file takes
> at most about 20 minutes to import. It could be faster still, because we are
> building complex objects with an ORM, so the time consumption depends on how
> complex the catalog is. If you use Amara only to perform a conversion from
> XML to CSV, the final import can be done much faster.
>
> regards
>
> --
> Lutz
>
> http://www.4c-gmbh.de


Thanks Lutz, I will try Amara and I will also try to parse it with SAX.
I have tried XML::Twig and some other parsers, but they consumed too much RAM.



--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

Re: large xml database

From
Viktor Bojović
Date:


On Sun, Oct 31, 2010 at 2:26 AM, James Cloos <cloos@jhcloos.com> wrote:
>>>>> "VB" == Viktor Bojović <viktor.bojovic@gmail.com> writes:

VB> i have very big XML documment which is larger than 50GB and want to
VB> import it into databse, and transform it to relational schema.

Were I doing such a conversion, I'd use perl to convert the xml into
something which COPY can grok.  Any other language, script or compiled,
would work just as well.  The goal is to avoid having to slurp the whole
xml structure into memory.

-JimC
--
James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6

The insertion into the database is not a very big problem.
I insert it as XML documents, as varchar lines, or as XML documents in varchar format. Usually I use a transaction and commit after each block of 1000 inserts, and it goes very fast, so the insertion is over after a few hours.
But the problem occurs when I want to transform it inside the database from XML (varchar or XML format) into tables by parsing.
That processing takes too much time in the database, no matter whether it is stored as varchar lines, varchar nodes, or the XML data type.
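Just to make the in-database step concrete, it can at least be done set-based with xpath() rather than row by row; a sketch, where the staging table, target table, columns and element names are all guesses (namespaced documents would need the two-argument form of xpath()):

# Set-based sketch of the in-database transformation: extract fields from a
# staging table of XML documents with xpath() in a single statement.
# All table, column and element names are guesses.
import psycopg2

conn = psycopg2.connect("dbname=mydb")
cur = conn.cursor()
cur.execute("""
    INSERT INTO entry (accession, name)
    SELECT (xpath('/entry/accession/text()', doc))[1]::text,
           (xpath('/entry/name/text()', doc))[1]::text
    FROM raw_entries   -- staging table with one xml document per row in "doc"
""")
conn.commit()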

--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

Re: large xml database

From
Rob Sargent
Date:


Viktor Bojović wrote:
>
> On Sun, Oct 31, 2010 at 2:26 AM, James Cloos <cloos@jhcloos.com> wrote:
>
>     [...]
>     Were I doing such a conversion, I'd use perl to convert the xml into
>     something which COPY can grok. Any other language, script or compiled,
>     would work just as well. The goal is to avoid having to slurp the
>     whole xml structure into memory.
>
>     -JimC
>     --
>     James Cloos <cloos@jhcloos.com>         OpenPGP: 1024D/ED7DAEA6
>
> The insertion into the database is not a very big problem.
> I insert it as XML documents, as varchar lines, or as XML documents in
> varchar format. Usually I use a transaction and commit after each block of
> 1000 inserts, and it goes very fast, so the insertion is over after a few
> hours.
> But the problem occurs when I want to transform it inside the database
> from XML (varchar or XML format) into tables by parsing.
> That processing takes too much time in the database, no matter whether it
> is stored as varchar lines, varchar nodes, or the XML data type.
>
> --
> ---------------------------------------
> Viktor Bojović
> ---------------------------------------
> Wherever I go, Murphy goes with me

Are you saying you first load the XML into the database, then parse that
XML into instances of objects (rows in tables)?



Re: large xml database

From
Viktor Bojović
Date:


On Sun, Oct 31, 2010 at 9:42 PM, Rob Sargent <robjsargent@gmail.com> wrote:
> [...]
>
> Are you saying you first load the XML into the database, then parse that
> XML into instances of objects (rows in tables)?


Yes. That way takes less RAM than using XML::Twig or XML::Simple, so I tried using PostgreSQL XML functions or regexes.



--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

Re: large xml database

From
Rob Sargent
Date:

Viktor Bojović wrote:
>
> On Sun, Oct 31, 2010 at 9:42 PM, Rob Sargent <robjsargent@gmail.com> wrote:
> [...]
>     Are you saying you first load the XML into the database, then
>     parse that XML into instances of objects (rows in tables)?
>
> Yes. That way takes less RAM than using XML::Twig or XML::Simple, so I
> tried using PostgreSQL XML functions or regexes.
>
> --
> ---------------------------------------
> Viktor Bojović
> ---------------------------------------
> Wherever I go, Murphy goes with me
Is the entire load a set of "entry" elements as your example contains?  
This I believe would parse nicely into a tidy but non-trivial schema 
directly without the "middle-man" of having xml in db (unless of course 
you prefer xpath to sql ;) )

The single most significant caveat I would have for you is Beware: 
Biologists involved. Inconsistency (at least overloaded concepts)  
almost assured :).  EMBL too is suspect imho, but I've been out of that 
arena for a while.





Re: large xml database

From
Viktor Bojović
Date:


On Sun, Oct 31, 2010 at 10:26 PM, Rob Sargent <robjsargent@gmail.com> wrote:
 
 
> [...]
>
> Is the entire load a set of "entry" elements as your example contains?
> This I believe would parse nicely into a tidy but non-trivial schema
> directly without the "middle-man" of having xml in db (unless of course
> you prefer xpath to sql ;) )
>
> The single most significant caveat I would have for you is Beware:
> Biologists involved. Inconsistency (at least overloaded concepts) almost
> assured :). EMBL too is suspect imho, but I've been out of that arena for
> a while.


Unfortunately some elements are always missing, so I had to create a script which scanned the whole SwissProt and TrEMBL document and stored the structure in a file, to use as a template for a code generator once I find the best parser for this purpose. To parse everything within one day, the parser would have to handle at least ~128 entry blocks per second at 2.4GHz (roughly 11.1 million entries / 86,400 seconds). You are right about the inconsistency; I constantly have problems with PDB files.

Btw, you mentioned "This I believe would parse nicely into a tidy but
non-trivial schema directly". Does that mean that Postgres has support for
restoring the database schema from XML files?

--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

Re: large xml database

From
Rob Sargent
Date:
Skipping much of the included thread, urgently.
> Btw, you mentioned "This I believe would parse nicely into a tidy but
> non-trivial schema directly". Does that mean that Postgres has support
> for restoring the database schema from XML files?
>
> -- 
> ---------------------------------------
> Viktor Bojović
> ---------------------------------------
> Wherever I go, Murphy goes with me
No. Sorry. I did not mean to imply that. If you had an XSD file from them
you might have a better chance. I have never looked, but I bet someone has
tried that sort of manipulation. Throw out the cardinality constraints,
of course :).