Thread: [GENERAL] Insertion of large xml files into PostgreSQL 10beta1

[GENERAL] Insertion of large xml files into PostgreSQL 10beta1

From
Alain Toussaint
Date:
Hello,

I am building up a PostgreSQL server which I intend to load the
entirety of the pubmed database data (23GB bzip2 compressed, 216GB
unpacked) which is available in xml form of which, here is an example:

https://www.ncbi.nlm.nih.gov/pubmed/21833294?report=xml&format=text

I looked at the documentation as well as several examples code for
loading the data and the one example who nearly succeeded is this
procedure:

/usr/bin/psql medline

\set :largexmlfile: 'cat /srv/pgsql/pubmed/medline17n0001.xml'

INSERT INTO samples (xmldata) VALUES :largexmlfile:

(from reading the list post here:
https://www.postgresql.org/message-id/20160624083757.GA5459%40msg.df7cb.de)

In which, about 334MB of data from medline17n0001.xml will flood the
monitor. I do notice some error code values flooding at some point
during the load and then it end up throwing an error because it
interpret some accented comment in the pubmed files (abstract data in
a non-us language most likely).

I will work out a way to script these commands into a bash script[1]
when I get back home (I'm at work at the moment, returning home at 5pm
EST) with two log files (stdout & stderr) but I would like to know if
it is possible to turn off validation of the content between the xml
tags of the files.

[1] == there is close to 1200 medline files averaging 350MB each.

platform is Linux From Scratch subversion tracked release running out
of a 1.5GB ramdrive plus dhcpcd, Python 2.7.13, libxml2-2.9.4,
libxslt-1.1.29 and PostgreSQL 10 beta 1 with datafiles being
/srv/pgsql/data on a single partition 931.5GB western digital drive
dedicated to PostgreSQL for the moment.

The goal is to build a server but PostgreSQL is running in the
ramdrive at the moment for partition sizing and calculation purpose.

The server is a core2 based old machine, 4GB of ram and the
aforementioned 931.5GB hard disk. It also has an nvidia card intended
for use on text mining application.

Thanks you very much.

Alain


Re: [GENERAL] Insertion of large xml files into PostgreSQL 10beta1

From
"David G. Johnston"
Date:
On Fri, Jun 23, 2017 at 8:19 AM, Alain Toussaint <atoussaint1976@gmail.com> wrote:
Hello,

I am building up a PostgreSQL server which I intend to load the
entirety of the pubmed database data (23GB bzip2 compressed, 216GB
unpacked) which is available in xml form of which, here is an example:

https://www.ncbi.nlm.nih.gov/pubmed/21833294?report=xml&format=text

I looked at the documentation as well as several examples code for
loading the data and the one example who nearly succeeded is this
procedure:

/usr/bin/psql medline

\set :largexmlfile: 'cat /srv/pgsql/pubmed/medline17n0001.xml'
INSERT INTO samples (xmldata) VALUES :largexmlfile:

​I'll assume you've just mis-keyed this from memory since the syntax of the above doesn't like right.

(from reading the list post here:
https://www.postgresql.org/message-id/20160624083757.GA5459%40msg.df7cb.de)

In which, about 334MB of data from medline17n0001.xml will flood the
monitor.

​If the above general command sequence is done right, and echoing of commands is turned off, you should not see any of the XML file content echoed to the output.​
 

it is possible to turn off validation of the content between the xml
tags of the files.


​You can either turn off validation for the entire file or leave it on - PostgreSQL isn't recognizing tags here (you haven't defined the samples table for us...).​

​Narrowing down the entire file to a small problem region and posting a self-contained example, or at least providing the error messages and content, might help elicit good responses.​  Even if you could load the data without incident using it make end up proving problematic.  That said character encodings and sets are not my strong suit.

David J.

Re: [GENERAL] Insertion of large xml files into PostgreSQL 10beta1

From
Alain Toussaint
Date:
> Narrowing down the entire file to a small problem region and posting a
> self-contained example,

The url here contain the set of xml records from a publication I
worked on many years ago:

https://www.ncbi.nlm.nih.gov/pubmed/21833294?report=xml&format=text

The particularly problematic region of the xml content is this:

        <CommentsCorrectionsList>
            <CommentsCorrections RefType="Cites">
                <RefSource>Neuroreport. 2000 Sep 11;11(13):2969-72</RefSource>
                <PMID Version="1">11006976</PMID>
            </CommentsCorrections>
            <CommentsCorrections RefType="Cites">
                <RefSource>J Neurosci. 2005 May 25;25(21):5148-58</RefSource>
                <PMID Version="1">15917455</PMID>
            </CommentsCorrections>
            <CommentsCorrections RefType="Cites">
                <RefSource>Neuroimage. 2003 Dec;20(4):1944-54</RefSource>
                <PMID Version="1">14683700</PMID>
            </CommentsCorrections>

There is more of these type of comments in an given citation.

> or at least providing the error messages and
> content, might help elicit good responses.

here it is:

ERROR: syntax error at or near "44"
LINE 1: 44(1):37-43</RefSources>

the command I used is this one:

echo "INSERT INTO samples (xmldata) VALUES $(cat
/srv/pgsql/pubmed/medline17n0001.xml)" | /usr/bin/psql medline
1>/dev/null 2>error.log

wc -l error.log
11145 error.log

The error message given is repeated a metric ton of time but I didn't
check the entire log if there were other kind of error messages.

>  Even if you could load the data
> without incident using it make end up proving problematic.

Agreed, the box will definitely need more ram and I could be better
off with a more recent graphic card (nvidia or amd but whatever is
supported by tensorflow 1.2 and up). I'll figure it out as I go.

Many thanks.

Alain


Re: [GENERAL] Insertion of large xml files into PostgreSQL 10beta1

From
Jan de Visser
Date:
On Sunday, June 25, 2017 11:02:41 PM EDT Alain Toussaint wrote:
> > Narrowing down the entire file to a small problem region and posting a
> > self-contained example,
>
> The url here contain the set of xml records from a publication I
> worked on many years ago:
>
> https://www.ncbi.nlm.nih.gov/pubmed/21833294?report=xml&format=text
>
> The particularly problematic region of the xml content is this:
>
>         <CommentsCorrectionsList>
>             <CommentsCorrections RefType="Cites">
>                 <RefSource>Neuroreport. 2000 Sep
> 11;11(13):2969-72</RefSource> <PMID Version="1">11006976</PMID>
>             </CommentsCorrections>
>             <CommentsCorrections RefType="Cites">
>                 <RefSource>J Neurosci. 2005 May
> 25;25(21):5148-58</RefSource> <PMID Version="1">15917455</PMID>
>             </CommentsCorrections>
>             <CommentsCorrections RefType="Cites">
>                 <RefSource>Neuroimage. 2003 Dec;20(4):1944-54</RefSource>
>                 <PMID Version="1">14683700</PMID>
>             </CommentsCorrections>
>
> There is more of these type of comments in an given citation.
>
> > or at least providing the error messages and
> > content, might help elicit good responses.
>
> here it is:
>
> ERROR: syntax error at or near "44"
> LINE 1: 44(1):37-43</RefSources>

This string does not appear in your link above. Please match your example data
with your error message.

>
> the command I used is this one:
>
> echo "INSERT INTO samples (xmldata) VALUES $(cat
> /srv/pgsql/pubmed/medline17n0001.xml)" | /usr/bin/psql medline
> 1>/dev/null 2>error.log

I'm going to go out on a limb and assume the problem is unescaped quote
characters, or some other escape/quote problem. Hard to say though without
having the right data to look at.





Re: [GENERAL] Insertion of large xml files into PostgreSQL 10beta1

From
"David G. Johnston"
Date:
On Sun, Jun 25, 2017 at 8:02 PM, Alain Toussaint <atoussaint1976@gmail.com> wrote:

echo "INSERT INTO samples (xmldata) VALUES $(cat
/srv/pgsql/pubmed/medline17n0001.xml)" | /usr/bin/psql medline
1>/dev/null 2>error.log

Typically I do:

--psql file
\set xmlcontent `cat 'projectblogs-categories.xml'`
INSERT INTO tbl (col) VALUES (:'xmlcontent');

though at the moment I cannot recollect whether the above munges whitespace.  I copied it from some json processing code where I wouldn't care if it did...

If you want to do something with INSERT/VALUES you should refresh yourself as to the syntax for the VALUES SQL Command.  Namely non-optional parentheses.

I'd also suggest you try to get a one or two line file importing correctly first.

David J.

Re: [GENERAL] Insertion of large xml files into PostgreSQL 10beta1

From
Alain Toussaint
Date:
Hello Jan, Hello David,

I'll be able to do further testing next week as I am in the middle of
a move but for next week, I'll be in vacation from work for the whole
month.

Best regards,

Alain

2017-06-29 0:13 GMT-04:00 David G. Johnston <david.g.johnston@gmail.com>:
> On Sun, Jun 25, 2017 at 8:02 PM, Alain Toussaint <atoussaint1976@gmail.com>
> wrote:
>>
>>
>> echo "INSERT INTO samples (xmldata) VALUES $(cat
>> /srv/pgsql/pubmed/medline17n0001.xml)" | /usr/bin/psql medline
>> 1>/dev/null 2>error.log
>
>
> Typically I do:
>
> --psql file
> \set xmlcontent `cat 'projectblogs-categories.xml'`
> INSERT INTO tbl (col) VALUES (:'xmlcontent');
>
> though at the moment I cannot recollect whether the above munges whitespace.
> I copied it from some json processing code where I wouldn't care if it
> did...
>
> If you want to do something with INSERT/VALUES you should refresh yourself
> as to the syntax for the VALUES SQL Command.  Namely non-optional
> parentheses.
>
> I'd also suggest you try to get a one or two line file importing correctly
> first.
>
> David J.
>