Thread: [HACKERS] [GSoC] Personal presentation and request for clarification

[HACKERS] [GSoC] Personal presentation and request for clarification

From
João Miguel Afonso
Date:

Dear community member(s),


I am João Afonso, a Portuguese MSc student, and I'm writing to ask for some information about the GSoC projects.

For the reasons explained below, PostgreSQL is the organisation that I most identify with, so I would like to introduce myself to the community. Since I really want to participate, I will describe my most relevant experience and knowledge in the field.
Please feel free to skip the less relevant topics.

The project that most caught my eye was "Implementing push-based query executor".
Although it fits my capabilities and current research well, I have some concerns about "The ability to understand and modify PostgreSQL executor code", as I have not had enough time to understand the scale of the changes involved.

My second choice would be "Sorting algorithms benchmark and implementation", which, although not directly related to my current work, I am more familiar with and which looks considerably easier to accomplish.
As I said, I have not had enough time to explore the whole project documentation or source code, but I read the code of the sorting algorithm and noticed that it is sequential. Would a parallel implementation bring any benefit here?

I will keep reading the documentation and some of the code, but I would appreciate it if someone more familiar with the project could point me to the project that best suits my abilities.


My motivations: 


A group formed by me and four other MSc students is currently working on a linear algebra approach to OLAP. We are simultaneously translating SQL into linear algebra operations, developing methods to automate the process, optimising the previously implemented sparse matrix operations, and benchmarking the resulting work on different Intel x86 and Nvidia architectures (multi-core, many-core, GPU). Future work may even include query/machine-level cost prediction.


It would be really interesting for me to do a continuous analysis of how replacing relational algebra would affect the performance and implementation complexity of each independent module and of the entire system, so I could at least do benchmarking tasks even if I am not accepted for GSoC.


I'm also working on a personal project: a general benchmark script capable of testing all the combinations of N parameters in serial, shared-memory and distributed-memory settings. Its main purpose is to reduce the time spent on the tasks assigned in my MSc. [GIT HERE]
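
As a rough illustration of the idea only (this is not the script linked above), enumerating every combination of N parameters can be sketched in Python as below. The parameter names and values are made up for the example.

```python
from itertools import product

# Hypothetical parameter grid; names and values are illustrative only.
params = {
    "threads": [1, 2, 4],
    "size": [1_000, 1_000_000],
    "memory_model": ["serial", "shared", "distributed"],
}


def combinations(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


runs = list(combinations(params))
print(len(runs))  # 3 * 2 * 3 = 18 configurations
```

Each dict in `runs` would then be handed to whatever launches a single benchmark run.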


Leaving aside the anxiety of joining such an important project (which I think is normal), my major concern for both projects is my limited experience with the micro-optimisations involved.


For that reason, I think it is important to mention my previous work on this topic. It was on the simple dot product algorithm, and the main case studies were cache issues, CPU/memory bounds, ... The work is described [HERE].


Please feel free to contact me with any questions.

Best regards,

João Afonso

Re: [HACKERS] [GSoC] Personal presentation and request for clarification

From
Andrew Borodin
Date:
Hi!

It's great that you are interested in PostgreSQL!

I'll answer the question about the GSoC project that I proposed.
I hope someone else will handle the questions about your primary objective.


2017-03-03 4:15 GMT+05:00 João Miguel Afonso <joao_miguel_afonso@hotmail.com>:
> My second choice would be "Sorting algorithms benchmark and
> implementation", which, although not directly related to my current work, I
> am more familiar with and which looks considerably easier to accomplish.
> As I said, I have not had enough time to explore the whole project
> documentation or source code, but I read the code of the sorting algorithm
> and noticed that it is sequential. Would a parallel implementation bring
> any benefit here?

Parallel - means cool, for about 15 years already.
There is related work on parallelization of sorting by Peter Geoghegan
https://commitfest.postgresql.org/13/690/
Sure, it is viable and important work.

With this project, I was aiming to improve the performance of qsort. In
the context of Peter Geoghegan's work, qsort() is used only in queries
sorting a relatively small amount of data, but it is used in other
places too (and external sort probably involves it).
It is certainly related, and something on the matter of parallelization
can be done here.

Best regards, Andrey Borodin.



Re: [HACKERS] [GSoC] Personal presentation and request for clarification

From
Robert Haas
Date:
On Thu, Mar 2, 2017 at 6:15 PM, João Miguel Afonso
<joao_miguel_afonso@hotmail.com> wrote:
> The project that most caught my eye was "Implementing push-based query
> executor".
> Although it fits my capabilities and current research well, I have
> some concerns about "The ability to understand and modify PostgreSQL executor
> code", as I have not had enough time to understand the scale of the
> changes involved.

They are formidable.

https://www.postgresql.org/message-id/CA%2BTgmoaf_uR_wVMj53MVvyEQ_wRx62MM3QQwR6aPZe0Lbr%2BJew%40mail.gmail.com

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [GSoC] Personal presentation and request for clarification

From
João Miguel Afonso
Date:

> From: Robert Haas <robertmhaas@gmail.com>
> Sent: 09 March 2017 01:09
>
>> The project that most caught my eye was "Implementing push-based query
>> executor".
>> Although it fits my capabilities and current research well, I have
>> some concerns about "The ability to understand and modify PostgreSQL executor
>> code", as I have not had enough time to understand the scale of the
>> changes involved.
> They are formidable.

I want to contribute valuable work, so I will focus on my second
choice: "Sorting algorithms benchmark and implementation". Maybe when
I become more familiar with the PostgreSQL project I will give the
executor project a try.


> From: pgsql-hackers-owner@Postgresql.org <pgsql-hackers-owner@Postgresql.org> on behalf of Kevin Grittner <kgrittn@gmail.com>
> Sent: 17 March 2017 13:57
>
> Some ideas for desirable content:
>   - A resume or CV of the student, including any prior GSoC work
>   - Their reasons for wanting to participate
>   - What else they have planned for the summer, and what their time
>     commitment to the GSoC work will be
>   - A clear statement that there will be no intellectual property
>     problems with the work they will be doing -- that the PostgreSQL
>     community will be able to use their work without encumbrances
>     (e.g., there should be no agreements related to prior or
>     ongoing work which might assign the rights to the work they do
>     to someone else)
>   - A description of what they will do, and how
>   - Milestones with dates
>   - What they consider to be the test that they have successfully
>     completed the project

Using the information posted HERE and Kevin Grittner's suggestions,
I would like to start writing my proposal as well as begin my work on the
project.

In the last two weeks I have been using profiling tools such as
dstat, top and iostat on my university's cluster with the "NAS
Parallel Benchmarks" package from NASA. Next I will start another
academic project using DTrace on a Solaris machine.

I have permanent access to the SeARCH6 cluster, description HERE.
I know it is not that powerful, but it is quite heterogeneous, comprising
many generations of processors, including both Intel many-core
solutions (the KNC and the unlisted KNL), which I think is good
for testing the algorithms in many different scenarios.

I have no permission to install new software, so I guess I can't use
specific benchmarking software, but the cluster can still be used to test
the algorithm alone, using some selected data sets.

The point here is just to mention relevant knowledge and resources
that I may be able to use in the project. Other information about
my motivations and skills can be found HERE.

Anyway, I would like to accomplish some small goals before the
23 April deadline, so I can spot and be prepared for some of the
trickier parts of the project.

As I will have classes and exams in June, and possibly an
internship at the University of Texas in July, I will have to
work on both tasks at the same time, so I made a schedule with
what I think I can do, leaving August almost free to explore the
project (micro-optimisations, ...) or to compensate in case something
doesn't go as expected.

I would appreciate it if you could review it and advise me if I'm
heading in the wrong direction.

Schedule:

Before April 3:

project specific work:
- read all the suggested papers
- implement all the sorting algorithms (functional but
 unoptimised versions)
- validate core ideas with the community
integration work:
- read some of the PostgreSQL documentation and source code
- read the HACKERS mailing list

April 3 - May 30:

project specific work:
- discuss possible benchmarks and optimization possibilities
- run a simple benchmark of the currently used sort
integration work:
- deepen my understanding of the PostgreSQL project
- keep reading the mailing list and ask about anything unclear

May 30 - June 26 (Coding officially begins!):

- set up the final benchmark environment
- correctly benchmark current sort
- macro optimise all the implemented sorts and define performance
 goals
- test the produced code vs the current one 

June 26 - July 24:

micro optimise all the algorithms:
- study cache/memory issues, vectorisation, ...
- first steps on parallelism
do a full profile of the current work:
- CPU and memory usage
- execution time
- number of operations (per second)

July 24 - August 29:

- optimise parallel solutions
- discuss some possible optimisations and test them
- revise and document all the code
- produce a useful report for future reference

After August 29:

- keep in contact and look for a possible project that fits
 my skills
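
The "simple benchmark" step in the schedule above could start from a
harness like the one below. It is only a sketch, written in Python for
brevity; the project itself targets the C qsort in PostgreSQL. The
input shapes, sizes and function names are assumptions chosen for
illustration, not something taken from the project description.

```python
import random
import time


def make_inputs(n, seed=0):
    """Build a few input shapes that commonly stress sort implementations."""
    rng = random.Random(seed)
    data = [rng.randrange(n) for _ in range(n)]
    return {
        "random": data,
        "sorted": sorted(data),
        "reversed": sorted(data, reverse=True),
        "few_unique": [rng.randrange(8) for _ in range(n)],
    }


def benchmark(sort_fn, inputs, repeats=5):
    """Return the best wall-clock time per input shape, checking correctness."""
    results = {}
    for name, data in inputs.items():
        best = float("inf")
        for _ in range(repeats):
            copy = list(data)  # never time a pre-sorted leftover
            start = time.perf_counter()
            out = sort_fn(copy)
            best = min(best, time.perf_counter() - start)
        assert out == sorted(data), f"{name}: wrong result"
        results[name] = best
    return results


timings = benchmark(sorted, make_inputs(10_000))
for name, seconds in timings.items():
    print(f"{name:>10}: {seconds * 1e3:.3f} ms")
```

Taking the best of several repeats reduces noise from other jobs on a
shared cluster; a candidate sort would be passed in place of `sorted`.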


A small aside:

I read this INFO, but I have been struggling with internet-style
quoting in Outlook's browser client and I end up doing it by
hand. I have never used a development mailing list before, so any
tip would be valuable.