Re: Google SoC--Idea Request - Mailing list pgsql-hackers

From Nikolay Samokhvalov
Subject Re: Google SoC--Idea Request
Msg-id e431ff4c0605020134x303014a7r5cea1ed1951f09f6@mail.gmail.com
In response to Google SoC--Idea Request  ("Jonah H. Harris" <jonah.harris@gmail.com>)
Responses Re: Google SoC--Idea Request  ("Jonah H. Harris" <jonah.harris@gmail.com>)
List pgsql-hackers
Proposal: XMLType for PostgreSQL.

*** Minimum: ***
To have special type support for storing XML data and working with it.
This means the following:

- ability to define any column of a table as XMLType; internally,
  all data is stored as VARCHAR;
- automatic validation of documents against an XML schema, if one was
  specified in the column definition or in the XML documents themselves
  (DTD, XSD, or at least one of them) /*contrib/xml2 has such a
  feature, but it uses libxml, which means a DOM interface. Maybe it's
  better to use some SAX parser for this task*/;
- XPath indexes for queries with path expressions in the WHERE clause
  /*I suppose this kind of index would be the most frequently used. I
  propose using a good labeling scheme and GiST and/or GIN here*/;
- some subset of SQL/XML. Actually, part 14 of SQL:200n (SQL/XML) has
  more than 400 pages now and contains some established constructions
  that are used in other DBMSes. There is a patch already written by
  Pavel Stehule: http://www.pgsql.ru/db/mw/msg.html?mid=2096818.
  (BTW, what happened to it? It was held for 8.2, so what is the
  result?) I tested it several months ago, and the basic SQL/XML
  functions worked fine. It changes the grammar, but there is no other
  way... So using this patch as a part of this project means that the
  project cannot be a contrib module, unfortunately. Nevertheless, the
  current draft of the SQL/XML standard seems to be mature, so compared
  with existing implementations it would be a nice 'landmark';
- XML domains support: ability to define a domain based on XMLType and
  an XML schema definition (e.g., an external DTD file or something
  similar). I'd consider the XML schema definition a restriction on the
  entire XMLType (similar to restrictions on plain types, which are
  defined as a CHECK constraint in the domain definition).
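
To make the SAX and path index items above a bit more concrete, here is
a minimal sketch in Python (not PostgreSQL code, and not the interface
proposed here). It streams through a document with the standard
xml.sax module, without building a DOM tree, and collects (path, text)
pairs of the kind a path index could store. The names PathCollector
and path_entries are hypothetical, and text in mixed content is handled
only roughly.

```python
import xml.sax


class PathCollector(xml.sax.ContentHandler):
    """Collect (path, text) pairs while streaming through a document."""

    def __init__(self):
        super().__init__()
        self.stack = []    # names of the currently open elements
        self.text = []     # text chunks seen since the last tag
        self.entries = []  # (path, text) pairs a path index could store

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if value:
            path = "/" + "/".join(self.stack)
            self.entries.append((path, value))
        self.stack.pop()
        self.text = []


def path_entries(xml_text):
    """Parse xml_text in one pass and return its (path, text) pairs.

    Raises xml.sax.SAXParseException if the document is not well-formed.
    """
    handler = PathCollector()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.entries
```

For the document <book><title>PostgreSQL</title></book>, path_entries
returns [("/book/title", "PostgreSQL")]. A document that is not
well-formed is rejected during the same pass, which is the property
that makes a streaming parser attractive for checking on insert.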

*** Maximum: ***

- all things from the 'minimum' list :-)
- rich index system:
  * structure index (labeling scheme; prefix schemes seem to be best
    for this, and I suppose GiST would help here). Actually, it would
    be full shredding, like the primary XML index in MS SQL Server, but
    I'm aware of better labeling algorithms than simple prefix labeling
    (as in SQL Server). Surely, GiST/GIN support would be a great
    foundation for these;
  * flexible support of path indexes, value indexes, and so on
    (something like secondary XML indexes in SQL Server...), as a
    continuation of the work on path indexes from the 'minimum' list;
- full-text search abilities (tsearch2 / GiST);
- different encoding issues (autoconversion to the column's encoding,
  etc.);
- ability to choose the storage type: VARCHAR or 'native' mode (trees,
  as in native XML DBMSes and DB2 Viper [if their articles don't lie
  ;-)]). Actually, this is a very, very huge task (almost like creating
  a DBMS from scratch) and I clearly understand that I won't solve it
  using only my own abilities. But the work on the 'minimum' list
  (especially if it is a part of SoC) would be a good starting point
  and may attract other developers to help implement it. Maybe at the
  initial stage it's worth integrating with some other DBMS and working
  with it using two-phase commit (surely, this is not a solution to all
  problems, as it means two different execution plans, etc.);
- XQuery and its integration with SQL (according to the SQL/XML
  standard). In other words, an implementation of the XQuery Data
  Model; this would be a great target (version 1.0 of the entire
  project);
- XML views / updatable XML views (actually, it's a crazy idea, but
  it's my dream ;-) )
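
For readers not familiar with prefix labeling, here is a minimal
sketch in Python of the simple variant mentioned in the structure index
item. Each node gets the label of its parent plus its position among
its siblings, so an ancestor test becomes a prefix comparison and
document order becomes ordinary ordering of labels. This is the plain
scheme only, not ORDPATH and not one of the better algorithms referred
to above; the function names label_tree and is_ancestor are
hypothetical.

```python
import xml.etree.ElementTree as ET


def label_tree(xml_text):
    """Return (label, tag) pairs in document order.

    A label is a tuple of sibling positions from the root down,
    for example (1, 2, 1).
    """
    root = ET.fromstring(xml_text)
    labels = []

    def visit(elem, label):
        labels.append((label, elem.tag))
        # Children extend the parent's label with their 1-based position.
        for position, child in enumerate(elem, start=1):
            visit(child, label + (position,))

    visit(root, (1,))
    return labels


def is_ancestor(a, b):
    """True if the node labeled a is a proper ancestor of the node labeled b."""
    return len(a) < len(b) and b[:len(a)] == a
```

For <a><b><c/></b><d/></a> the labels are (1,) for a, (1, 1) for b,
(1, 1, 1) for c and (1, 2) for d. No tree traversal is needed to see
that b is an ancestor of c but not of d, which is what makes such
labels usable as index keys.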

As a part of SoC I would concentrate on the tasks from the 'minimum'
list. It would be a good starting point.

Some articles:
Fresh draft of SQL:200n: http://www.wiscorp.com/sql_2003_standard.zip
Other SQL/XML papers: http://www.wiscorp.com/SQLStandards.html#xsqlstandards
XISS system (Li, Moon - advanced interval indexes):
http://www.cs.arizona.edu/xiss/
MASS (prefix indexes):
http://davis.wpi.edu/dsrg/vamana/WebPages/Publication.html
Staircase joins (accelerating XPath Evaluation):
http://www.inf.uni-konstanz.de/dbis/publications/download/injection.pdf
Oleg's TODO list: http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo
XML in DB2 Viper: http://www.vldb2005.org/program/paper/thu/p1164-nicola.pdf
XQuery in SQL Server: http://www.vldb2005.org/program/paper/thu/p1175-pal.pdf
Labeling scheme in SQL Server (ORDPATHs):
http://portal.acm.org/ft_gateway.cfm?id=1007686&type=pdf&coll=GUIDE&dl=GUIDE&CFID=74920272&CFTOKEN=73736781

One more comment: I'm a PhD student at MIPT, Russia. I plan to write
an overview of the XMLType implementations in the latest versions of
the three major commercial DBMSes (ORA, MS, DB2), comparing them to the
standard and to each other. The first article of this comparison is
planned for the end of May. This work will help to understand where the
major commercial DBMS vendors are going and why they are going there
:-) Moreover, I intend to create a technique for testing XMLType
support in (O)RDBMSes. Although SoC assumes that all work is done by
only one person, I expect some support/help from the following people:

- Dr. Sergey Kuznetsov (my scientific mentor)
- Oleg Bartunov and Teodor Sigaev (as major developers of PostgreSQL
  and of GiST and GIN, they definitely can help me to be successful);
- Ivan Zolotukhin (together we plan to write the overview mentioned
  above)
- PostgreSQL community (actually, as I've already mentioned, I intend
  to use code by Pavel Stehule, and I'm pretty sure that I'll need a
  lot of other help from the community)

On 4/15/06, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> Hey everyone,
>
> I know we started a discussion a month or so ago regarding ideas for
> SoC projects.  However, after reading through the thread, I didn't see
> us nail down any actual items.
>
> As such, we need to quickly put together a list of oh, 15-20 midlevel
> project ideas.  I'm sure we can pull some off the TODO list, but we
> should also look at project ideas for porting some of the most used
> third-party OSS software to PostgreSQL too (portals, CMS systems,
> accounting systems, etc.).
>
> All ideas welcome!
>
> --
> Jonah H. Harris, Database Internals Architect
> EnterpriseDB Corporation
> 732.331.1324
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>


--
Best regards,
Nikolay
