Thread: Aggregate C function accumulating a text array

Aggregate C function accumulating a text array

From: Joel Dudley
Hello,
   I am about to write a set of C functions to be used in an aggregate
function in which the final function performs a calculation on an array
of accumulated text data types stored in a text[] array. I need to use
the text type because this function will be used on DNA sequences which
can be very large. My questions are the following. What is the most
memory-efficient way to accumulate a text array? I see construct_array()
used in accumulation functions, but I am worried that I might end up
making a copy of a potentially very large text array each time my
accumulation function is called.

The general flow is

User defined aggregate function
    SELECT pb_distance_k2p(sequence) WHERE family_id = 10;

uses accumulation function

distance_accum(PG_FUNCTION_ARGS);

and uses a final function

calculate_distance_k2p(PG_FUNCTION_ARGS)

which needs to call deconstruct_array() to get the text array, loop
through the array to do some pairwise comparisons of the text, and
return a multidimensional array.

Am I thinking about this correctly? Are there any potential pitfalls in
the proposed strategy? I greatly appreciate your feedback.

- Joel

Re: Aggregate C function accumulating a text array

From: Joe Conway
Joel Dudley wrote:
>   I am about to write a set of C functions to be used in an aggregate
> function in which the final function performs a calculation on an array
> of accumulated text data types stored in a text[] array. I need to use
> the text type because this function will be used on DNA sequences which
> can be very large. My questions are the following. What is the most
> memory-efficient way to accumulate a text array? I see construct_array()
> used in accumulation functions, but I am worried that I might end up
> making a copy of a potentially very large text array each time my
> accumulation function is called.

True, but the intermediate results should be released after each row, I
think. You might try it with some real data before assuming a
performance problem.

If it is a problem, take a look at how contrib/intagg works. It
basically just passes a pointer from call to call. You could do
something similar for the text data type.
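
As a rough illustration of the pointer-passing idea, here is a minimal
sketch in plain C. It is not PostgreSQL code and not taken from
contrib/intagg: SeqAccum, seq_accum_add() and seq_accum_free() are
hypothetical names, and malloc/realloc stand in for the palloc/repalloc
calls a real transition function would make in a memory context that
outlives each row. The point it shows is that the state is handed from
call to call by pointer, so only the table of pointers grows and the
stored sequences are never copied again.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical accumulation state (not a PostgreSQL type). */
typedef struct SeqAccum {
    size_t count;      /* sequences stored so far */
    size_t capacity;   /* slots allocated in items[] */
    char **items;      /* one owned copy of each sequence */
} SeqAccum;

/*
 * Transition step: takes the state returned by the previous call (NULL on
 * the first row), appends one sequence, and returns the same state pointer.
 * Returns NULL if an allocation fails; a state that already existed is left
 * untouched in that case, so the caller still owns it.
 */
static SeqAccum *seq_accum_add(SeqAccum *state, const char *seq)
{
    size_t len;
    char *copy;

    assert(seq != NULL);

    if (state == NULL) {
        state = malloc(sizeof(*state));
        if (state == NULL)
            return NULL;
        state->count = 0;
        state->capacity = 4;
        state->items = malloc(state->capacity * sizeof(char *));
        if (state->items == NULL) {
            free(state);
            return NULL;
        }
    }

    /* Double the pointer table when full; the sequences themselves stay put. */
    if (state->count == state->capacity) {
        size_t new_cap = state->capacity * 2;
        char **grown = realloc(state->items, new_cap * sizeof(char *));
        if (grown == NULL)
            return NULL;
        state->items = grown;
        state->capacity = new_cap;
    }

    /* Each incoming sequence is copied exactly once. */
    len = strlen(seq);
    copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, seq, len + 1);
    state->items[state->count++] = copy;
    return state;
}

/* Releases the state and every stored sequence. */
static void seq_accum_free(SeqAccum *state)
{
    size_t i;

    if (state == NULL)
        return;
    for (i = 0; i < state->count; i++)
        free(state->items[i]);
    free(state->items);
    free(state);
}
```

A caller would start with a NULL state and feed rows one at a time, as
in state = seq_accum_add(state, row). Doubling the capacity keeps the
number of reallocations logarithmic in the row count.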

> The general flow is
>
> User defined aggregate function
>     SELECT pb_distance_k2p(sequence) WHERE family_id = 10;
>
> uses accumulation function
>
> distance_accum(PG_FUNCTION_ARGS);
>
> and uses a final function
>
> calculate_distance_k2p(PG_FUNCTION_ARGS)
>
> which needs to call deconstruct_array() to get the text array, loop
> through the array to do some pairwise comparisons of the text, and
> return a multidimensional array.

Makes sense to me. BTW, take a look at PL/R:
http://www.joeconway.com/plr/

It would allow you to write your final function in R, which has many
extensions related to bioinformatics -- see:
http://www.bioconductor.org/
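
If the final function stays in C, the pairwise loop described above
could be sketched roughly as follows. This is plain C rather than
PostgreSQL code, with hypothetical names (mismatch_fraction,
pairwise_matrix). The metric is only a placeholder, the fraction of
differing positions over the shorter sequence's length, and is not the
K2P calculation from the original post. In a real final function the
strings would come from deconstruct_array() and the result would be
packed into a two-dimensional array before being returned.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Placeholder metric (an assumption, not the poster's calculation):
 * fraction of positions that differ, compared over the shorter length.
 */
static double mismatch_fraction(const char *a, const char *b)
{
    size_t la = strlen(a);
    size_t lb = strlen(b);
    size_t n = la < lb ? la : lb;
    size_t diff = 0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            diff++;
    }
    return (double) diff / (double) n;
}

/*
 * Builds a row-major n x n matrix of pairwise values. Only the upper
 * triangle is computed; each value is mirrored into the lower triangle.
 * Returns NULL when n is 0 or allocation fails. The caller frees the result.
 */
static double *pairwise_matrix(const char *const *seqs, size_t n)
{
    double *m;
    size_t i, j;

    assert(seqs != NULL || n == 0);
    if (n == 0)
        return NULL;

    m = malloc(n * n * sizeof(double));
    if (m == NULL)
        return NULL;

    for (i = 0; i < n; i++) {
        m[i * n + i] = 0.0;                 /* a sequence vs. itself */
        for (j = i + 1; j < n; j++) {
            double d = mismatch_fraction(seqs[i], seqs[j]);
            m[i * n + j] = d;
            m[j * n + i] = d;
        }
    }
    return m;
}
```

Computing only the upper triangle halves the number of comparisons,
which matters when each sequence is large.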

HTH,

Joe