Re: Improve compression speeds in pg_lzcompress.c - Mailing list pgsql-hackers

From Benedikt Grundmann
Subject Re: Improve compression speeds in pg_lzcompress.c
Date
Msg-id CADbMkNPrKe2P7Oku=2sNGyLrd8+wQad_YBpvJtmJBtV17Tmf4A@mail.gmail.com
In response to Re: Improve compression speeds in pg_lzcompress.c  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers

Personally, my biggest gripe about the way we do compression is that
it's easy to detoast the same object lots of times.  More generally,
our in-memory representation of user data values is pretty much a
mirror of our on-disk representation, even when that leads to excess
conversions.  Beyond what we do for TOAST, there's stuff like numeric
where we not only detoast but then post-process the results into yet
another internal form before performing any calculations - and then of
course we have to convert back before returning from the calculation
functions.  And for things like XML, JSON, and hstore we have to
repeatedly parse the string, every time someone wants to do anything
with it.  Of course, solving this is a very hard problem, and not
solving it isn't a reason not to have more compression options - but
more compression options will not solve the problems that I personally
have in this area, by and large.

At the risk of saying something totally obvious and stupid, as I haven't looked at the actual representation: this sounds like a memoisation problem.  In OCaml terms:

type 'a rep =
  | On_disk_rep   of bytes
  | In_memory_rep of 'a

type 'a t = 'a rep ref

let get_mem_rep t converter =
  match !t with
  | On_disk_rep seq ->
    let res = converter seq in
    t := In_memory_rep res;
    res
  | In_memory_rep x -> x
;;

... (if you need the other direction, then that's straightforward too)...

Translating this into C is relatively straightforward if you have the luxury of a fresh start
and don't have to be super efficient:

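#include <stddef.h>

typedef enum { ON_DISK_REP, IN_MEMORY_REP } rep_kind_t;

/* Tagged union: rep_kind says which member of rep is currently valid. */
typedef struct {
  rep_kind_t rep_kind;
  union {
    char *on_disk;
    void *in_memory;
  } rep;
} t;

void *get_mem_rep(t *v, void *(*converter)(char *)) {
  void *res;
  switch (v->rep_kind) {
    case ON_DISK_REP:
      /* First access: convert once and cache the result in place. */
      res = converter(v->rep.on_disk);
      v->rep.in_memory = res;
      v->rep_kind = IN_MEMORY_REP;
      return res;
    case IN_MEMORY_REP:
      return v->rep.in_memory;
  }
  return NULL;  /* not reached for a valid rep_kind */
}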

Now of course fitting this into the existing types, and ensuring that memory is neither freed too early nor leaked (nor other bugs introduced), is probably a nightmare, and presumably why you said that this is a hard problem.
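
To make the ownership concern a bit more concrete, here is a minimal self-contained sketch, assuming the wrapper owns whichever pointer is currently active and that both representations are plain malloc'd blocks. All the names in it (memo_value, memo_from_disk, memo_get, memo_release, parse_long, converter_calls) are made up for illustration and have nothing to do with the actual datum/TOAST code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { ON_DISK_REP, IN_MEMORY_REP } rep_kind_t;

/* Hypothetical wrapper: owns whichever pointer the tag marks as active. */
typedef struct {
  rep_kind_t rep_kind;
  union {
    char *on_disk;    /* malloc'd, NUL-terminated bytes */
    void *in_memory;  /* malloc'd converted value */
  } rep;
} memo_value;

int converter_calls = 0;  /* only here to make the memoisation visible */

/* Toy converter (an assumption): parses the on-disk text as a long. */
void *parse_long(char *bytes) {
  long *out = malloc(sizeof *out);
  assert(out != NULL);
  *out = strtol(bytes, NULL, 10);
  converter_calls++;
  return out;
}

/* Build a value in its on-disk form from a private copy of bytes. */
memo_value memo_from_disk(const char *bytes) {
  memo_value v;
  size_t n = strlen(bytes) + 1;
  v.rep_kind = ON_DISK_REP;
  v.rep.on_disk = malloc(n);
  assert(v.rep.on_disk != NULL);
  memcpy(v.rep.on_disk, bytes, n);
  return v;
}

/* Convert on first access, then hand back the cached result. */
void *memo_get(memo_value *v, void *(*converter)(char *)) {
  if (v->rep_kind == ON_DISK_REP) {
    char *disk = v->rep.on_disk;
    void *res = converter(disk);
    free(disk);  /* the union slot is about to be overwritten */
    v->rep.in_memory = res;
    v->rep_kind = IN_MEMORY_REP;
  }
  return v->rep.in_memory;
}

/* Free whichever representation is active; the tag decides which. */
void memo_release(memo_value *v) {
  if (v->rep_kind == ON_DISK_REP)
    free(v->rep.on_disk);
  else
    free(v->rep.in_memory);
  v->rep_kind = IN_MEMORY_REP;
  v->rep.in_memory = NULL;
}
```

Calling memo_get twice on the same value runs the converter once and returns the same pointer both times. The catch is already visible here: any pointer a caller keeps from memo_get dangles after memo_release, and that is with a single owner and no sharing, which real code would not have the luxury of assuming.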

Cheers,

Bene
