Thread: large xml database
Hi,
I have a very big XML document, larger than 50GB, which I want to import into a database and transform into a relational schema.
When splitting this document into smaller independent XML documents I get ~11.1 million XML documents.
I have spent a lot of time trying to find the fastest way to transform all this data, but every time I give up because it takes too long. Sometimes it would take more than a month if not stopped.
I have tried to insert each line as varchar into the database and parse it using plperl regexes.
I have also tried to store every document as XML and parse it, but that is also too slow.
I have tried to store every document as varchar, but it is also slow when using regexes to extract the data.
Many attempts have failed because 8GB of RAM and 10GB of swap were not enough. Also, sometimes I get an error that more than 2^32 operations were performed, and the functions stop working.
I just wanted to ask if someone knows how to speed this up.
Thanks in advance
-- ---------------------------------------
Viktor Bojović
---------------------------------------
Wherever I go, Murphy goes with me
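
A route that usually comes up for loads like this is to bulk-load the split documents with COPY instead of row-by-row INSERTs, and then pull fields out with xpath() rather than regexes. A minimal sketch, assuming a hypothetical staging table, an <entry>-rooted document with <id> and <name> children, and one document per input line (all names and paths here are illustrative, not taken from the thread):

    -- staging table holding one split document per row
    CREATE TABLE raw_docs (doc xml);

    -- bulk load; far faster than one INSERT per document
    -- (use \copy from psql if the file lives on the client side)
    COPY raw_docs (doc) FROM '/path/to/split_documents.txt';

    -- extract fields with xpath() instead of regexes
    CREATE TABLE entry (id text, name text);

    INSERT INTO entry (id, name)
    SELECT (xpath('/entry/id/text()',   doc))[1]::text,
           (xpath('/entry/name/text()', doc))[1]::text
    FROM   raw_docs;

Whether COPY accepts the input depends on each document really sitting on a single line with no raw tabs or backslashes, so a small preprocessing step may still be needed; the point is that the parsing then happens in one set-based statement instead of one function call per row.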
Geeeeeeeez. Maybe you can lease a bunch of Amazon EC2 high-CPU compute slices and parallelize it? I think throwing ridiculous amounts of hardware at things is always the best approach.

On Sat, Oct 30, 2010 at 2:48 PM, Viktor Bojović <viktor.bojovic@gmail.com> wrote:
> I have a very big XML document, larger than 50GB, which I want to import
> into a database and transform into a relational schema. When splitting this
> document into smaller independent XML documents I get ~11.1 million XML
> documents.
> [...]
> I just wanted to ask if someone knows how to speed this up.
On 10/30/2010 4:48 PM, Viktor Bojović wrote:
> I have a very big XML document, larger than 50GB, which I want to import
> into a database and transform into a relational schema. When splitting this
> document into smaller independent XML documents I get ~11.1 million XML
> documents.
> [...]
> Many attempts have failed because 8GB of RAM and 10GB of swap were not
> enough. Also, sometimes I get an error that more than 2^32 operations were
> performed, and the functions stop working.
>
> I just wanted to ask if someone knows how to speed this up.

It might help a little to know a few more details, like what is in the XML doc.

When you say convert to relational, do you mean multiple tables (and no more XML tags), or do you mean a table with a blob column that contains some XML fragment?

I have imported millions of rows and never run out of memory. The database will take care of itself unless you are doing something really bad. I'd guess it's the XML parser running out of RAM and not the database. Are you using DOM or SAX?

You say it took too much time. What did? The XML parsing? The database inserts? Were you CPU bound or IO bound?

What tools are you using to write this in? What OS are you on? What version of PG?

You know... just a "few" more details :-)

-Andy
Is it possible (and how) to implement a data path on another partition (Linux) for an existing system? And then, if I do not gain anything, merge it into the production /data path?

Scenario of what I want to achieve (/mnt/data is already running):

/mnt/data resides on an NFS share
Contains over 2 billion web sites crawled (yeah, another search-engine site)

I want to add:

/opt/data resides on an internal drive
Will contain the keyword hash system

Then, if I find this does not improve anything - or runs tight (running on an IBM BladeCenter with 76GB internals, so I may be limited) - can I simply shut down postgres, scp /opt/data/folder/ to the NFS, bring up postgres, fix any conf file, and have everything on the multi-terabyte array?

Checking the pulse of all you speed freaks who just hit the floor because this is on NFS, not iSCSI - don't worry, when this site goes live it will be iSCSI. We have been in private development for 3 years... and the NFS still outruns the Internet connections.
On 28/12/2010 14:56, Ozz Nixon wrote:
> Is it possible (and how) to implement a data path on another
> partition (Linux) for an existing system? And then if I do not gain
> anything, merging it to the production /data path?

I think tablespaces will do what you want:

http://www.postgresql.org/docs/9.0/static/manage-ag-tablespaces.html

HTH,

Ray.

-- 
Raymond O'Donnell :: Galway :: Ireland
rod@iol.ie
Ozz Nixon <ozznixon@gmail.com> wrote:
> Is it possible (and how) to implement a data path on another partition
> (Linux) for an existing system? And then if I do not gain anything,
> merging it to the production /data path?

I think you should read our documentation about tablespaces:

http://www.postgresql.org/docs/current/static/manage-ag-tablespaces.html

Andreas
-- 
Really, I'm not out to destroy Microsoft. That will just be a completely
unintentional side effect. (Linus Torvalds)
"If I was god, I would recompile penguin with --enable-fly." (unknown)
Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889°
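
For reference, the tablespace route both replies point to comes down to a couple of statements. A minimal sketch, assuming a hypothetical tablespace name, directory, and table (the directory must already exist and be owned by the postgres OS user; none of these names come from the thread):

    -- create a tablespace on the internal drive (name and path are illustrative)
    CREATE TABLESPACE localdisk LOCATION '/opt/data/pg_tblspc';

    -- new objects can be placed on it explicitly...
    CREATE TABLE keyword_hash (
        keyword text PRIMARY KEY,
        hash    bytea
    ) TABLESPACE localdisk;

    -- ...or an existing table can be moved to another tablespace later
    -- (this physically rewrites the table and takes an exclusive lock)
    ALTER TABLE keyword_hash SET TABLESPACE pg_default;

Moving a table back onto the NFS-backed default tablespace later is the same ALTER ... SET TABLESPACE in the other direction, so no file-level scp of the data directory is needed.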
On Tue, Dec 28, 2010 at 7:56 AM, Ozz Nixon <ozznixon@gmail.com> wrote:
> Is it possible (and how) to implement a data path on another partition
> (Linux) for an existing system? And then if I do not gain anything, merging
> it to the production /data path?
> [...]
> Then, if I find this does not improve anything - or runs tight (running on
> an IBM BladeCenter with 76GB internals, so I may be limited) - can I simply
> shut down postgres, scp /opt/data/folder/ to the NFS, bring up postgres,
> fix any conf file, and have everything on the multi-terabyte array?

You can't just merge two pgsql databases into one directory. You have to pick one to run, dump the other, and load it into the one you kept with pg_dump / psql / pg_restore. If you completely replace the db instance in /opt/data/folder with the other one, that will work, but it erases the original db that was there.

> Checking the pulse of all you speed freaks who just hit the floor because
> this is on NFS, not iSCSI - don't worry, when this site goes live it will
> be iSCSI. We have been in private development for 3 years... and the NFS
> still outruns the Internet connections.

It's more an issue of reliability in case of crashes and power losses than anything else. Most iSCSI implementations will honor fsync properly, but many NFS mounts will lose or corrupt your data store if one or the other machine crashes in the wrong way.
On 2010-12-28, Ozz Nixon <ozznixon@gmail.com> wrote:
> Is it possible (and how) to implement a data path on another partition
> (Linux) for an existing system? And then if I do not gain anything, merging
> it to the production /data path?

Tablespaces.

> Scenario of what I want to achieve (/mnt/data is already running)
>
> /mnt/data resides on an NFS share
> Contains over 2 billion web sites crawled (yeah, another search-engine site)

Using postgres on NFS is not recommended.

> want to add:
>
> /opt/data resides on an internal drive
> Will contain the keyword hash system

Local is good.

> Then, if I find this does not improve anything - or runs tight (running on
> an IBM BladeCenter with 76GB internals, so I may be limited) - can I simply
> shut down postgres, scp /opt/data/folder/ to the NFS, bring up postgres,
> fix any conf file, and have everything on the multi-terabyte array?

Or leave postgres running and just copy the table using "select * into" or similar. It's possibly better to use iSCSI instead of NFS; it's not like you can share the data directory.

-- 
⚂⚃ 100% natural
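
A minimal sketch of that "copy the table while postgres stays up" idea, combined with the tablespace from the earlier replies (tablespace and table names are hypothetical, not taken from the thread):

    -- the "select * into" route, pointed at the local-disk tablespace
    SET default_tablespace = localdisk;
    SELECT * INTO keyword_hash_local FROM keyword_hash;
    -- note: indexes and constraints are not copied and must be recreated

    -- or rewrite the existing table onto the other tablespace in one step
    -- (takes an exclusive lock while the table is rewritten)
    ALTER TABLE keyword_hash SET TABLESPACE localdisk;

CREATE TABLE ... TABLESPACE localdisk AS SELECT ... is the more explicit spelling of the SELECT INTO form; the ALTER TABLE variant keeps the indexes, which can then be moved separately with ALTER INDEX ... SET TABLESPACE.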