json indexing and data types - Mailing list pgsql-general

From Merlin Moncure
Subject json indexing and data types
Date
Msg-id CAHyXU0zHh8sMCeqDzk30O7BGd1b-hz8frKokfMvv_YFPz8NEPA@mail.gmail.com
In response to json indexing and data types  (Kaare Rasmussen <kaare@jasonic.dk>)
Responses Re: json indexing and data types  (Jim Nasby <Jim.Nasby@BlueTreble.com>)
Re: json indexing and data types  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: json indexing and data types  (Kaare Rasmussen <kaare@jasonic.dk>)
List pgsql-general
On Wed, Dec 2, 2015 at 12:03 AM, Kaare Rasmussen <kaare@jasonic.dk> wrote:
> Hi
>
> As json essentially only has three basic data types (string, int, and
> boolean), I wonder how much of this - indexing, searching, and sorting on
> unstructured data - is possible. I guess part of the answer would be
> 'jsquery and vodka', but let me describe the problem first.
>
> The basics are that I have a column with what is essentially json data; a
> number of data structures of different depths. Perhaps 10 - 30 top levels,
> and probably no more than 3, max 4 levels deep. In total there are some
> hundreds of thousands of rows in each table. It would probably be best
> stored as jsonb. Right now it's text, because it's only used by the
> application itself.
>
> It would be incredibly useful to add an index to this column, and to be able
> to search, using the index, on arbitrary elements. This part seems already
> there, with jsquery.
>
> The hard part is that some of the data items really have another type. Dates
> and floating points are the most important ones. And the really hard part is
> that sorting and range searches are important, especially for these two
> types. Keeping dates in ISO format, and left-padding floats with zeros, is a
> low-tech solution, and especially the latter is not very efficient.
>
> The solution might be to add functional indexes for these data items, but
> it's cumbersome and not easily maintainable. If a one-stop solution is in
> the works, or already there, it could save a lot of time.

I feel your pain.  jsquery is superb for searching *specific* subdocuments, but range searching is really limited.  Value searching is there for numerics, but date and text range searching are not.  We also have to understand that you are asking the index to make assumptions about the json that are not clear from the structure itself (such as that subfield 'x' is a date).
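
To make that concrete, the kind of thing jsquery handles well today looks
roughly like this (table and field names are invented for illustration, and
this assumes the jsquery extension is installed):

CREATE TABLE docs (id serial PRIMARY KEY, doc jsonb);

-- GIN index using one of jsquery's operator classes
CREATE INDEX docs_doc_jsquery_idx ON docs USING gin (doc jsonb_path_value_ops);

-- numeric value/range searches on a specific subfield can use the index:
SELECT * FROM docs WHERE doc @@ 'details.price > 100';

-- there is no equivalent indexed range search for dates or general text,
-- which is the limitation described above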

The only workaround I've been able to come up with is to migrate the json to a specially encoded text field, stored side by side with the source json, that is more amenable to pg_trgm-based searching (to give you a taste of that complexity, keys are stored upper case and values are stored lower case).
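
Roughly along these lines, reusing the hypothetical docs table from above (the
flattening function here is just an illustration of the idea, not my actual
code, and only handles top-level keys):

-- keys upper-cased, values lower-cased, one "KEY:value" token per field
CREATE FUNCTION flatten_doc(doc jsonb) RETURNS text AS $$
  SELECT string_agg(upper(key) || ':' || lower(value), ' ')
  FROM jsonb_each_text(doc)
$$ LANGUAGE sql IMMUTABLE;

ALTER TABLE docs ADD COLUMN doc_flat text;
UPDATE docs SET doc_flat = flatten_doc(doc);

-- requires the pg_trgm extension
CREATE INDEX docs_doc_flat_trgm_idx ON docs USING gin (doc_flat gin_trgm_ops);

-- searches then become trigram-assisted LIKE matches on the encoded text
SELECT * FROM docs WHERE doc_flat LIKE '%AUTHOR:rasmussen%';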

Some might say that you're better off using a dedicated json search server like solr, but these systems aren't magic; they will quickly boil down to brute-force search in the face of complex queries, and they have lots of other problems in my experience (starting with lack of proper transactions and painfully slow insertion of large documents).  Other people recommend them; I don't.

One way of looking at this problem is that the "schemaless" check is getting cashed. If you need detailed data-driven queries (as opposed to more 'text searchy' type searches), perhaps it's time to start running your data through a normalized structure.
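
For example, something along these lines (field names are hypothetical, again
reusing the docs table from above):

-- pull the fields that need typed range searches out into ordinary columns
CREATE TABLE doc_fields (
  doc_id     integer REFERENCES docs (id),
  created_at date,
  price      numeric
);

INSERT INTO doc_fields (doc_id, created_at, price)
SELECT id, (doc ->> 'created')::date, (doc ->> 'price')::numeric
FROM docs;

CREATE INDEX ON doc_fields (created_at);
CREATE INDEX ON doc_fields (price);

-- range searches and sorts then run off plain btree indexes
SELECT d.*
FROM docs d
JOIN doc_fields f ON f.doc_id = d.id
WHERE f.created_at >= '2015-01-01' AND f.price < 100
ORDER BY f.created_at;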

merlin
