Re: large xml database - Mailing list pgsql-sql

From Viktor Bojović
Subject Re: large xml database
Date
Msg-id AANLkTi=iQkAA3Bc0ua-ZauXwfFJg+eLqvEb4rb+NFaj2@mail.gmail.com
In response to Re: large xml database  (Rob Sargent <robjsargent@gmail.com>)
Responses Re: large xml database  (Rob Sargent <robjsargent@gmail.com>)
List pgsql-sql


On Sun, Oct 31, 2010 at 10:26 PM, Rob Sargent <robjsargent@gmail.com> wrote:
 
 
Viktor Bojović wrote:


On Sun, Oct 31, 2010 at 9:42 PM, Rob Sargent <robjsargent@gmail.com> wrote:




    Viktor Bojović wrote:



        On Sun, Oct 31, 2010 at 2:26 AM, James Cloos <cloos@jhcloos.com> wrote:

           >>>>> "VB" == Viktor Bojović <viktor.bojovic@gmail.com> writes:

           VB> I have a very big XML document, larger than 50GB, and want
           VB> to import it into the database and transform it to a
           VB> relational schema.

           Were I doing such a conversion, I'd use perl to convert the xml
           into something which COPY can grok. Any other language, script
           or compiled, would work just as well. The goal is to avoid
           having to slurp the whole xml structure into memory.

          -JimC
          --
           James Cloos <cloos@jhcloos.com>
           OpenPGP: 1024D/ED7DAEA6
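
(A minimal sketch of the streaming conversion described above, assuming a
UniProt-style file in which each <entry> element carries <accession> and
<name> children; the element names and the two-column output are only
illustrative. XML::Twig's purge() keeps memory use flat no matter how large
the file is:)

    #!/usr/bin/perl
    # Stream the XML and emit tab-separated rows suitable for COPY,
    # holding at most one <entry> in memory at a time.
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            entry => sub {
                my ($t, $entry) = @_;
                my $acc  = $entry->first_child_text('accession') // '';
                my $name = $entry->first_child_text('name')      // '';
                print join("\t", $acc, $name), "\n";
                $t->purge;    # discard everything parsed so far
            },
        },
    );
    $twig->parsefile($ARGV[0]);

(The resulting file can then be loaded in a single pass with COPY, or with
\copy from psql.)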


        Inserting into the database is not a big problem. I insert the
        data as XML documents, as varchar lines, or as XML documents in
        varchar format. I usually use a transaction and commit after each
        block of 1000 inserts, so it goes very fast and the insertion is
        over after a few hours.
        The problem occurs when I want to transform it inside the
        database, from XML (varchar or XML format) into tables, by
        parsing. That processing takes too much time in the database, no
        matter whether it is stored as varchar lines, varchar nodes or
        the XML data type.

        --
        ---------------------------------------
        Viktor Bojović
        ---------------------------------------
        Wherever I go, Murphy goes with me
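
(If the rows are flattened on the client side first, the batched INSERTs can
also be replaced by a single COPY stream from the same script. A rough sketch
using DBD::Pg's COPY interface; the protein(accession, name) table and the
connection parameters are hypothetical, and the tab-separated input is
assumed to come from a converter like the one sketched earlier:)

    use strict;
    use warnings;
    use DBI;

    # Placeholder connection string and credentials.
    my $dbh = DBI->connect('dbi:Pg:dbname=uniprot', 'user', 'password',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('COPY protein (accession, name) FROM STDIN');
    while (my $line = <STDIN>) {       # tab-separated rows, one per entry
        $dbh->pg_putcopydata($line);
    }
    $dbh->pg_putcopyend();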


    Are you saying you first load the xml into the database, then
    parse that xml into instances of objects (rows in tables)?


Yes. That way takes less RAM than using XML::Twig or XML::Simple, so I tried using the PostgreSQL XML functions or regexes.
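
(For that in-database route, the extraction can be done as one set-based
statement rather than row-by-row parsing. A rough sketch, assuming the raw
documents sit in a hypothetical entry_xml(doc xml) table and that the
elements carry no XML namespace; table, column and element names are
illustrative:)

    use strict;
    use warnings;
    use DBI;

    # Placeholder connection string and credentials.
    my $dbh = DBI->connect('dbi:Pg:dbname=uniprot', 'user', 'password',
                           { RaiseError => 1 });

    # One INSERT ... SELECT using PostgreSQL's xpath() function,
    # instead of looping over the rows in client code.
    $dbh->do(q{
        INSERT INTO protein (accession, name)
        SELECT (xpath('/entry/accession/text()', doc))[1]::text,
               (xpath('/entry/name/text()',      doc))[1]::text
        FROM entry_xml
    });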



--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me

Is the entire load a set of "entry" elements as your example contains? This I believe would parse nicely into a tidy but non-trivial schema directly, without the "middle-man" of having xml in db (unless of course you prefer xpath to sql ;) ).

The single most significant caveat I would have for you is Beware: Biologists involved. Inconsistency (at least overloaded concepts) almost assured :). EMBL too is suspect imho, but I've been out of that arena for a while.


Unfortunately some elements are always missing, so I had to create a script which scanned the whole SwissProt and TrEMBL document and stored the result in a file, to use as a template for a code generator once I find the best parser for this purpose. To parse all elements in one day I would need a parser capable of handling at least 128 entry blocks per second at 2.4GHz. You are right about inconsistency; I constantly have problems with PDB files.

Btw, you mentioned "This I believe would parse nicely into a tidy but non-trivial schema directly". Does that mean that Postgres has support for restoring the database schema from XML files?

--
---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me
