Thread: Varlena with recursive data structures?

Varlena with recursive data structures?

From

Sam Patterson

Date:

16 January 2019, 22:08:35

Hi all,

I've recently started developing an extension for Postgres for which I'll need to create a new variable-length base type. The type will require a tree-like structure in order to parse sufficiently, which of course probably means having some sort of recursive data structure, like a struct that has members which are pointers to itself for child nodes. After doing some research, specifically looking at how other variable-length data types store their data, it seems almost all of them store the data in a binary representation, using bit masks and offsets etc in order to store/access the data whilst having an in-memory representation that's used to manipulate the data.

I presume the purpose for using this approach is because all the data in a varlena type has to be contiguous, and the moment you start using pointers this is no longer possible. So my question is, given a structure that looks something like this,

typedef struct Node

{

char *data;

Node *left;

Node *right;

} Node;

am I right in saying that I wouldn't be able to store that representation on-disk, but instead I'd have to transform it into some binary representation and back again when writing/reading respectively, are there any alternatives?

Regards,

Karl

Re: Varlena with recursive data structures?

From

Tom Lane

Date:

16 January 2019, 22:22:33

Sam Patterson <katoriasdev@gmail.com> writes:
> I presume the purpose for using this approach is because all the data in a
> varlena type has to be contiguous, and the moment you start using pointers
> this is no longer possible.

Yup.

> So my question is, given a structure that looks
> something like this,

> typedef struct Node
> {
>     char *data;
>     Node *left;
>     Node *right;
> } Node;

> am I right in saying that I wouldn't be able to store that representation
> on-disk, but instead I'd have to transform it into some binary
> representation and back again when writing/reading respectively, are there
> any alternatives?

Yes, yes, no.  Any elementary datatype has to be able to be flattened into
a "blob of bytes" to be stored on disk.

However, you might be able to avoid working with the flattened
representation all the time.  There's an API for "expanded datums"
whereby functions can pass around an in-memory data structure that
doesn't have to look like a blob of bytes, and only flatten it out
when it's due to be stored somewhere.  I'm not sure how much that'd
help you, but it's possibly worth looking at.  See

src/include/utils/expandeddatum.h
src/backend/utils/adt/expandeddatum.c

for the basic APIs and

src/backend/utils/adt/array_expanded.c
src/backend/utils/adt/expandedrecord.c

for two examples of use.

            regards, tom lane

Re: Varlena with recursive data structures?

From

George Neuner

Date:

17 January 2019, 03:15:25

On Wed, 16 Jan 2019 22:08:35 +0000, Sam Patterson
<katoriasdev@gmail.com> wrote:

>Hi all,
>
>I've recently started developing an extension for Postgres for which I'll
>need to create a new variable-length base type. The type will require a
>tree-like structure in order to parse sufficiently, which of course
>probably means having some sort of recursive data structure, like a struct
>that has members which are pointers to itself for child nodes. After doing
>some research, specifically looking at how other variable-length data types
>store their data, it seems almost all of them store the data in a binary
>representation, using bit masks and offsets etc in order to store/access
>the data whilst having an in-memory representation that's used to
>manipulate the data.
>
>I presume the purpose for using this approach is because all the data in a
>varlena type has to be contiguous, and the moment you start using pointers
>this is no longer possible. So my question is, given a structure that looks
>something like this,
>
>typedef struct Node
>{
>    char *data;
>    Node *left;
>    Node *right;
>} Node;
>
>am I right in saying that I wouldn't be able to store that representation
>on-disk, but instead I'd have to transform it into some binary
>representation and back again when writing/reading respectively, are there
>any alternatives?
>
>Regards,
>
>Karl

You might want to consider using a linear tree representation that can
be searched directly.  See:
https://en.wikipedia.org/w/index.php?title=Binary_tree§ion=9#Arrays

E.g.,

Make the tree nodes:  { unsigned data, left, right }
Store the nodes contiguously as an array, and represent the left/right
branches using simple array indexing.

Write the node data contiguously into a separate array/stream, keeping
an offset [or index] to each data element in its correponding tree
node.

And then write the node array and the data array/stream contiguously
onto the disk.

For reading/searching this structure can be used directly: read in the
homogenous node array, and suck the variable length data stream into a
separate buffer.

If it isn't important that the tree be optimally ordered, then insert
is simple: just append a new node to the array, append its data to the
data stream, and update branch indexes to link the new node at the
correct spot in the tree.

If you want the data to be as small as possible on disk, then update
or delete of a node requires repacking the data and adjusting the data
offsets of the "unchanged" nodes.
https://en.wikipedia.org/wiki/Mark-compact_algorithm

[Obviously, you can delete without repacking by just unlinking the
tree node and rewriting the node array.  The "deleted" node data is
left in place on disk but it becomes unreachable.]

If you can afford to (temporarily) sort the tree nodes by their data
offset, repacking the data can be done in the same buffer by sliding
each element forward to be contiguous with the previous.  If you need
to take the nodes in whatever order they occur, then to pack the data
you need to copy it to a new buffer.

Hope this helps.
George

Re: Varlena with recursive data structures?

From

Michel Pelletier

Date:

17 January 2019, 20:39:09

Hi Karl,

I'm going down this road myself. In addition to the files Tom Lane pointed out there is also some helpful documentation here:

https://www.postgresql.org/docs/current/storage-toast.html#STORAGE-TOAST-INMEMORY

On Wed, Jan 16, 2019 at 2:09 PM Sam Patterson <katoriasdev@gmail.com> wrote:

Hi all,

I've recently started developing an extension for Postgres for which I'll need to create a new variable-length base type. The type will require a tree-like structure in order to parse sufficiently, which of course probably means having some sort of recursive data structure, like a struct that has members which are pointers to itself for child nodes. After doing some research, specifically looking at how other variable-length data types store their data, it seems almost all of them store the data in a binary representation, using bit masks and offsets etc in order to store/access the data whilst having an in-memory representation that's used to manipulate the data.

I presume the purpose for using this approach is because all the data in a varlena type has to be contiguous, and the moment you start using pointers this is no longer possible. So my question is, given a structure that looks something like this,

typedef struct Node
{
char *data;
Node *left;
Node *right;
} Node;

am I right in saying that I wouldn't be able to store that representation on-disk, but instead I'd have to transform it into some binary representation and back again when writing/reading respectively, are there any alternatives?

Regards,

Karl