Thread: xml-file as foreign table?

xml-file as foreign table?

From: Johann Spies
I have several large (7GB+) xml files to get into an SQL database.

The xml-files can contain up to 500 000 subrecords which I want to be able to query in the database.

They are too large to do something like this:


insert into rawxml (xml) select XMLPARSE (DOCUMENT CONVERT_FROM(PG_READ_BINARY_FILE('FOO.xml' ), 'UTF8'));

If it were possible, each file would be one huge record in the table, which could then be unpacked using XPath.


The options I am considering are:

1. Unpack the individual records (more than 50 million of them) using something like Python with lxml and psycopg2, and insert them after dropping all indexes and triggers on the table

2. Unpack the individual records, write a (very) large tsv-file, and then insert it using 'copy'

It would be convenient if I could use the present xml files as 'foreign tables' and parse them using the XPath capabilities of PostgreSQL.

Is this possible?

If not, what would be the most efficient way of getting the individual xml-records into an xml-type field in a table?
 
The documentation did not give me a clear answer.

Regards
Johann
--
Because experiencing your loyal love is better than life itself,
my lips will praise you.  (Psalm 63:3)

Re: xml-file as foreign table?

From: Arjen Nienhuis


On Apr 28, 2016 14:33, "Johann Spies" <johann.spies@gmail.com> wrote:
>
> I have several large (7GB+) xml files to get into an SQL database.
>
> The xml-files can contain up to 500 000 subrecords which I want to be able to query in the database.
>
> They are too large to do something like this:
>
>
> insert into rawxml (xml) select XMLPARSE (DOCUMENT CONVERT_FROM(PG_READ_BINARY_FILE('FOO.xml' ), 'UTF8'));
>
> If it were possible, each file would be one huge record in the table, which could then be unpacked using XPath.
>
>
> The options I am considering are:
>
> 1. Unpack the individual records (more than 50 million of them) using something like Python with lxml and psycopg2, and insert them after dropping all indexes and triggers on the table
>
> 2. Unpack the individual records, write a (very) large tsv-file, and then insert it using 'copy'
>

The fastest way I found is to combine these two: iterparse from lxml together with load_rows and COPY from py-postgresql:

http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements

That way you can stream the data.
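
Something like this minimal sketch (the table, tag names, and connection details are illustrative, not from your schema; adapt them). It parses one record element at a time and feeds CSV-formatted COPY lines straight to the server:

import postgresql
from lxml import etree

def q(v):
    # CSV-quote one bytes field; "" escapes an embedded quote
    return b'"' + v.replace(b'"', b'""') + b'"'

def rows(path):
    # iterparse fires at each closing </REC>, so only one record
    # is in memory at a time; use '{your-namespace}REC' if the
    # records are namespaced
    for _, elem in etree.iterparse(path, tag='REC'):
        uid = elem.findtext('UID', default='').encode('utf-8')
        yield b','.join((q(uid), q(etree.tostring(elem)))) + b'\n'
        elem.clear()  # release the parsed subtree

db = postgresql.open('pq://user:password@localhost/mydb')
copy = db.prepare("COPY rawxml (uid, xml) FROM STDIN WITH (FORMAT csv)")
copy.load_rows(rows('FOO.xml'))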

> It would be convenient if I could use the present xml files as 'foreign tables' and parse them using the XPath capabilities of PostgreSQL.
>
> Is this possible?
>

There is a multicorn fdw for that:

https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py

But I never tried it. It looks like it loads all rows into a Python list.

Regards, Arjen

Re: xml-file as foreign table?

From: Arjen Nienhuis
On Thu, May 5, 2016 at 2:13 PM, Johann Spies <johann.spies@gmail.com> wrote:
> Thank you, Arjen,
>
> On 29 April 2016 at 07:01, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote:
>
>>
>> > The options I am considering are:
>> >
>> > 1. Unpack the individual records (more than 50 million of them) using
>> > something like Python with lxml and psycopg2, and insert them after
>> > dropping all indexes and triggers on the table
>> >
>> > 2. Unpack the individual records, write a (very) large tsv-file, and
>> > then insert it using 'copy'
>> >
>>
>> The fastest way I found is to combine these two: iterparse from lxml
>> together with load_rows and COPY from py-postgresql:
>>
>> http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements
>>
>> That way you can stream the data.
>
>
> I did not know about py-postgresql as I am a Python-2.7 user.
>
> I am excited about the possibilities py-postgresql is offering. After a
> bit of experimenting with the streaming-copy option I ran into the same
> problem that I had while using Python 2: PostgreSQL rejecting the
> xml-part of the record.
>
> How do you handle the conversion from string to bytes and back, and the
> presence of quotes within the xml?
>
> I have tried this to experiment with just 10 records for a start:
>
> ut = element.xpath('.//t:UID/text()', namespaces=namespaces)[0]
> x = etree.tostring(element)
> y = x.decode(encoding='UTF-8').replace("'", "''")
> s = '\t'.join([gzipf, filename, ut, y])
> t = s.encode(encoding='UTF-8')
> rows.append(t)
> count += 1
> element.clear()
> gc.collect()
> if count == 10:
>     import pdb; pdb.set_trace()
>     xmlin.load_rows(rows)
>     rows = []
>     f.close()
>     exit(0)
>
> Which ends up with an error:
>
> postgresql.exceptions.XMLContentError: invalid XML content
>   CODE: 2200N
>   LOCATION: File 'xml.c', line 1551, in xml_ereport from SERVER
>   CONTEXT: COPY annual, line 1, column xml: "<REC
> xmlns="http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord"
> r_id_disclaimer="R..."
>   DETAIL: line 1: Premature end of data in tag REC line 1

If I had to guess, it's because of \n in the xml. I used "WITH (FORMAT
csv)" and quoted the fields:

def q(v):
    return b'"' + v.replace(b'"', b'""') + b'"'

# inside the function that builds one COPY line per record:
return b','.join(q(f) for f in fields) + b'\n'
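
For instance (illustrative value), a field containing quotes and a newline round-trips like this:

line = q(b'<REC a="1">\nx</REC>')
# -> b'"<REC a=""1"">\nx</REC>"'
# COPY ... WITH (FORMAT csv) undoes the doubled quotes and keeps
# the newline, because both sit inside a quoted field.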

In the end I also had some other problems with the XML (namespaces), so I used:

etree.tostring(element, method='c14n', exclusive=True)
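
A small illustration (made-up document) of why this helps: exclusive c14n emits only the namespace declarations an element actually uses, so a record cut out of a document whose root declares several namespaces stays self-contained without dragging the unused ones along:

from lxml import etree

doc = etree.fromstring(
    b'<root xmlns="urn:a" xmlns:unused="urn:b"><rec>1</rec></root>')
print(etree.tostring(doc[0], method='c14n', exclusive=True))
# -> b'<rec xmlns="urn:a">1</rec>'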

>
> With python2.7 I tried to use the same technique using subprocess and a call
> to psql to pipe the data to Postgresql -- ending with the same error.
>

Maybe you can show a few lines of the output.

> Thanks again for the pointer to py-postgresql.
>
> Have a good day.
>
> Johann
>

Regards, Arjen


Re: xml-file as foreign table?

From: Johann Spies
Thanks, Arjen,

> def q(v):
>     return b'"' + v.replace(b'"', b'""') + b'"'
>
> # inside the function that builds one COPY line per record:
> return b','.join(q(f) for f in fields) + b'\n'

> In the end I also had some other problems with the XML (namespaces), so I used:
>
> etree.tostring(element, method='c14n', exclusive=True)

This helped. My code is now doing its job.

Regards
Johann