Thread: Implementing and Experimenting with a Full Disjunctions Operator.

Implementing and Experimenting with a Full Disjunctions Operator.

From
Tzahi Fadida
Date:
Hi All,
As part of my thesis I need to implement a new algorithm for
Full-Disjunctions and 2 older ones. 
A short explanation of what Full-Disjunction is, is that it comes to
solve what A natural outer join 
usually can't do when you have more than 2 relations. 
The main goal is to retrieve maximal answers from a set of relations. 
The natural representation of the operator that is usually depicted in
the literature is FD(r1,...,rN). 
(r=relation/table) 
Ullman's algorithm uses several natural outer joins so there is no
problem there but our 
algorithm must be run internally at the server since it uses no existing
operators but it is 
also not limited to the gamma-acyclic restriction of Ullman's algorithm.

I have already read most of the development documentations, faqs,
presentations, listened on 
this mailing list, etc... I already compiled a dynamically loaded
library with a function and ran it successfully. 
The research part of implementing the algorithm is theoretical and
experimentation. 
After looking around in the code and seeing how the SPI generally works
I have several concerns 
(the first one is the most crucial to me):
1) As part of the experimentation I need to know exactly how many blocks
have been read when 
the algorithm ran. I need complete control over the process to run my
simulations. 
I see that there are functions like heap heap_getnext() heap_fetch()
SearchSysCache(). 
Our algorithm usually read sequentially the relations and I don't see
how to read complete blocks and 
count these blocks. In addition temporary queues that must be held in
memory will be needed to be dumped to 
disk at various times (because of their size) and fetched. Is there a
way to control this process 
with accuracy and calculate the exact disk writes?
2) As part of the theoretical work and experimentation we want to load
blocks of relation rows to the 
main memory and cache them using our techniques. Is there a way to
control the memory blocks so 
they won't be swapped. In addition, is there a way to get a specific
size of memory so we can 
plan our operator running path. I see that palloc return's to me a
chunck of memory but I don't know 
in advance how much is available to me (aside from polling for it).
3) When outputting the results as a set of records, I cannot know in
advance the type of temporary 
table that will come out just like a subquery like (select * from
relationA,relationB); Is there a problem 
outputting this kind of table?
4) When inputting the various tables to the operator I understand that
the function is limited to a fixed 
number of arguments. Is there a way to circumvent that or would I need
to use an ugly ARRAY construct.

Obviously there are better ways than a dynamically loaded library
function, so after we study the algorithm
I don't think there should be any problem integrating it to postgreSQL,
of course if it will be good enough :)

Thank you.

Regards,tzahi.

* - * - *
Itzhak Fadida
MSc Student
Information System Engineering Area
Faculty of Industrial Engineering & Management
Technion - Israel Institute of Technology
Technion City, Haifa, Israel 32000
Technion Email: Tzahi@TX.Technion.ac.il
Alternative Email: TzahiFadida@MyRealBox.com
* - * - * - * - * - * - * - * - * - * - * - * - * - * - *

WARNING TO SPAMMERS:  see at
http://members.lycos.co.uk/my2nis/spamwarning.html