Thread: xml-file as foreign table?
my lips will praise you. (Psalm 63:3)
On Apr 28, 2016 14:33, "Johann Spies" <johann.spies@gmail.com> wrote:
>
> I have several large (7GB+) xml files to get into an SQL database.
>
> The xml-files can contain up to 500 000 subrecords which I want to be able to query in the database.
>
> They are too large to do something like this:
>
>
> insert into rawxml (xml) select XMLPARSE (DOCUMENT CONVERT_FROM(PG_READ_BINARY_FILE('FOO.xml' ), 'UTF8'));
>
> If it were possible, each file would be one huge record in the table which can then be unpacked using XPATH.
>
>
> The options I am considering is :
>
> 1. Unpack the individual records (will be more than 50 million) using something like python with lxml and psycopg2 and insert them after dropping all indexes and triggers on the table
>
> 2. Unpack the individual records and write a (very) large tsv-file and then insert it using 'copy'
>
The fastest way I found is to combine these two. Using iterparse from lxml combined with load_rows and COPY from py-postgresql:
http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements
That way you can stream the data.
> It would be convenient If I could use the present xml files as 'foreign tables' and parse them using the xpath-capabilities of PostgreSQL.
>
> Is this possible?
>
There is a multicorn fdw for that:
https://github.com/Kozea/Multicorn/blob/master/python/multicorn/xmlfdw.py
But I never tried it. It looks like it loads all rows in a python list.
Groeten, Arjen
On Thu, May 5, 2016 at 2:13 PM, Johann Spies <johann.spies@gmail.com> wrote: > Dankie Arjen, > > On 29 April 2016 at 07:01, Arjen Nienhuis <a.g.nienhuis@gmail.com> wrote: > >> >> > The options I am considering is : >> > >> > 1. Unpack the individual records (will be more than 50 million) using >> > something like python with lxml and psycopg2 and insert them after dropping >> > all indexes and triggers on the table >> > >> > 2. Unpack the individual records and write a (very) large tsv-file and >> > then insert it using 'copy' >> > >> >> The fastest way I found is to combine these two. Using iterparse from lxml >> combined with load_rows and COPY from py-postgresql: >> >> http://python.projects.pgfoundry.org/docs/1.1/driver.html#copy-statements >> >> That way you can stream the data. > > > I did not know about py-postgresql as I am a Python-2.7 user. > > I am excited with the possiilities python-postgresql is offering. After a > bit of experimenting to use the streaming-copy option I landed up in the > same problem that I had while using python2: Postgresql rejecting the > xml-part of the record. > > How do you handle the conversion from string to bytes and back and the > presence of quotes within the xml? > > I have tried this to experiment with just 10 records for a start: > > ut = element.xpath('.//t:UID/text()',namespaces=namespaces)[0] > x = etree.tostring(element) > y = x.decode(encoding='UTF-8').replace("'", "''") > s = '\t'.join([gzipf,filename, ut,y]) > t = s.encode(encoding='UTF-8') > rows.append(t) > count += 1 > element.clear() > gc.collect() > if count == 10: > import pdb;pdb.set_trace() > xmlin.load_rows(rows) > rows = [] > f.close() > exit(0) > > Which ends up with an error: > > postgresql.exceptions.XMLContentError: invalid XML content > CODE: 2200N > LOCATION: File 'xml.c', line 1551, in xml_ereport from SERVER > CONTEXT: COPY annual, line 1, column xml: "<REC > xmlns="http://scientific.thomsonreuters.com/schema/wok5.4/public/FullRecord" > r_id_disclaimer="R..." > DETAIL: line 1: Premature end of data in tag REC line 1 If I need to guess it's because of \n in the xml. I used "WITH (FORMAT csv)" and quote the fields: def q(v): return b'"' + v.replace(b'"', b'""') + b'"' return b','.join(q(f) for f in fields) + b'\n' In the end I also had some other problems with the XML (namespaces), so I used: etree.tostring(element, method='c14n', exclusive=True) > > With python2.7 I tried to use the same technique using subprocess and a call > to psql to pipe the data to Postgresql -- ending with the same error. > Maybe you can show a few lines of the output. > Dankie nogmaals vir die verwysing na python-postgresql. > > Mooi dag. > > Johann > Groeten, Arjen
def q(v):
return b'"' + v.replace(b'"', b'""') + b'"'
return b','.join(q(f) for f in fields) + b'\n'
In the end I also had some other problems with the XML (namespaces), so I used:
etree.tostring(element, method='c14n', exclusive=True)