Re: From TODO, XML? - Mailing list pgsql-hackers
From | mlw |
---|---|
Subject | Re: From TODO, XML? |
Date | |
Msg-id | 3B66B342.B1839A1D@mohawksoft.com Whole thread Raw |
In response to | Re: Re: From TODO, XML? (Bruce Momjian <pgman@candle.pha.pa.us>) |
List | pgsql-hackers |
Bruce Momjian wrote: > > > > > I have managed to get several XML files into PostgreSQL by writing a parser, > > > > and it is a huge hassle, the public parsers are too picky. I am thinking that a > > > > fuzzy parser, combined with some intelligence and an XML DTD reader, could make > > > > a very cool utility, one which I have not been able to find. > > > > > > > > Perhaps it is a two stage process? First pass creates a schema which can be > > > > modified/corrected, the second pass loads the data. > > > > > > Can we accept only relational XML. Does that buy us anything? Are the > > > other database vendors outputting heirchical XML? Are they using > > > foreign/primary keys to do it? > > > > Then what's the point? Almost no one creates a non-hierarchical XML. For the > > utility to be usefull, beyond just a different format for pg_dump, it has to > > deal with these issues and do the right thing. > > Oh, seems XML will be much more complicated than I thought. I think an XML "output" for pg_dump would be a pretty good/easy feature. It is easy to create XML. <record> <field1>bla bla</field1> <field2>foo bar</field2> </record> Is very easy to create. Of course a little work would be needed to take information of the field (column) types into a DTD, but the actual XML is not much more complicated than a printf, i.e. printf("<%s>%s</%s>", name, data, name); The real issues is reading XML. Postgres can make a DTD and XML file which can be read by a strict parser, but that does not imply the inverse. Attached is an XML file from MP3.com. For an XML import to be anything but market/feature list candy, it should be able to import this sort of data, because when people say XML, this is the sort of data they are thinking about. If you take the time to examine the file, you will see it represents four or five distinct tables in a relational database. These tables are Song, Artist, [Cdlist,] Cd, and Genre. Song has a number of fields, plus foreign keys: Artist, Cdlist, and Genre. Cdlist would have to have a synthetic primary key (OID, sequence?), which tables Cd and Song would reference. Cdlist would probably never be used in a query. I think it is doable as a project (which everyone will be glad to have and complain about), but I think it is far more complicated than a one or two day modification of pg_dump. -- 5-4-3-2-1 Thunderbirds are GO! ------------------------ http://www.mohawksoft.com<Song> <id>1004001</id> <Artist> <id>121693</id> <xmlUrl>http://contentfeeds.mp3.com/standard_2.0/data_files/artists/121/121693.xml</xmlUrl> </Artist> <status>AVAILABLE</status> <name>conscious</name> <url>http://artists.mp3s.com/artist_song/1004/1004001.html</url> <description>A hard techno tune in the European style. Deep sleepers regain consciousness. Are they awake, are they aliveor is it all a dream..</description> <creditsCopyrightInfo>e n c r y p t i o n </creditsCopyrightInfo> <sourceCd>portal</sourceCd> <recordLabel>white label</recordLabel> <Cdlist> <Cd> <id>66944</id> <xmlUrl>http://contentfeeds.mp3.com/standard_2.0/data_files/dam_cds/66/66944.xml</xmlUrl> </Cd> </Cdlist> <parentalAdvisory>N</parentalAdvisory> <pictureUrl>http://images.mp3.com/mp3s/71/e_n_c_r_y_p_t_i_o_n/artificial1.jpg</pictureUrl> <Genre> <name>Intelligent Techno</name> <url>http://genres.mp3.com/music/electronic/techno/intelligent_techno</url> </Genre> <filesize>7.8</filesize> <addToMyMp3>http://my.mp3.com/music?Action=sl-1004001,cm-ad,cs-pc&origin=song&origin_id=1004001&location=generic&wqx=AddToMyMP3</addToMyMp3> <greetingsUrl>http://musicgreetings.mp3.com/email_song?song_id=1004001&origin=song&origin_id=1004001&location=generic&band_id=121693</greetingsUrl> </Song>
pgsql-hackers by date: