Re: From TODO, XML? - Mailing list pgsql-hackers

From mlw
Subject Re: From TODO, XML?
Date
Msg-id 3B66B342.B1839A1D@mohawksoft.com
Whole thread Raw
In response to Re: Re: From TODO, XML?  (Bruce Momjian <pgman@candle.pha.pa.us>)
List pgsql-hackers
Bruce Momjian wrote:
>
> > > > I have managed to get several XML files into PostgreSQL by writing a parser,
> > > > and it is a huge hassle, the public parsers are too picky. I am thinking that a
> > > > fuzzy parser, combined with some intelligence and an XML DTD reader, could make
> > > > a very cool utility, one which I have not been able to find.
> > > >
> > > > Perhaps it is a two stage process? First pass creates a schema which can be
> > > > modified/corrected, the second pass loads the data.
> > >
> > > Can we accept only relational XML.  Does that buy us anything?  Are the
> > > other database vendors outputting heirchical XML?  Are they using
> > > foreign/primary keys to do it?
> >
> > Then what's the point? Almost no one creates a non-hierarchical XML. For the
> > utility to be usefull, beyond just a different format for pg_dump, it has to
> > deal with these issues and do the right thing.
>
> Oh, seems XML will be much more complicated than I thought.

I think an XML "output" for pg_dump would be a pretty good/easy feature. It is
easy to create XML.

<record>
<field1>bla bla</field1>
<field2>foo bar</field2>
</record>

Is very easy to create. Of course a little work would be needed to take
information of the field (column) types into a DTD, but the actual XML is not
much more complicated than a printf, i.e. printf("<%s>%s</%s>", name, data,
name);

The real issues is reading XML. Postgres can make a DTD and XML file which can
be read by a strict parser, but that does not imply the inverse.

Attached is an XML file from MP3.com. For an XML import to be anything but
market/feature list candy, it should be able to import this sort of data,
because when people say XML, this is the sort of data they are thinking about.

If you take the time to examine the file, you will see it represents four or
five distinct tables in a relational database. These tables are Song, Artist,
[Cdlist,] Cd, and Genre.

Song has a number of fields, plus foreign keys: Artist, Cdlist, and Genre.
Cdlist would have to have a synthetic primary key (OID, sequence?), which
tables Cd and Song would reference. Cdlist would probably never be used in a
query.

I think it is doable as a project (which everyone will be glad to have and
complain about), but I think it is far more complicated than a one or two day
modification of pg_dump.

--
5-4-3-2-1 Thunderbirds are GO!
------------------------
http://www.mohawksoft.com<Song>
  <id>1004001</id>
  <Artist>
    <id>121693</id>
    <xmlUrl>http://contentfeeds.mp3.com/standard_2.0/data_files/artists/121/121693.xml</xmlUrl>
  </Artist>
  <status>AVAILABLE</status>
  <name>conscious</name>
  <url>http://artists.mp3s.com/artist_song/1004/1004001.html</url>
  <description>A hard techno tune in the European style.  Deep sleepers regain consciousness.  Are they awake, are they
aliveor is it all a dream..</description> 
  <creditsCopyrightInfo>e n c r y p t i o n </creditsCopyrightInfo>
  <sourceCd>portal</sourceCd>
  <recordLabel>white label</recordLabel>
  <Cdlist>
    <Cd>
      <id>66944</id>
      <xmlUrl>http://contentfeeds.mp3.com/standard_2.0/data_files/dam_cds/66/66944.xml</xmlUrl>
    </Cd>
  </Cdlist>
  <parentalAdvisory>N</parentalAdvisory>
  <pictureUrl>http://images.mp3.com/mp3s/71/e_n_c_r_y_p_t_i_o_n/artificial1.jpg</pictureUrl>
  <Genre>
    <name>Intelligent Techno</name>
    <url>http://genres.mp3.com/music/electronic/techno/intelligent_techno</url>
  </Genre>
  <filesize>7.8</filesize>

<addToMyMp3>http://my.mp3.com/music?Action=sl-1004001,cm-ad,cs-pc&origin=song&origin_id=1004001&location=generic&wqx=AddToMyMP3</addToMyMp3>

<greetingsUrl>http://musicgreetings.mp3.com/email_song?song_id=1004001&origin=song&origin_id=1004001&location=generic&band_id=121693</greetingsUrl>
</Song>

pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: vacuumlo.
Next
From: Bruce Momjian
Date:
Subject: Re: Performance TODO items