Re: JSON data type status? - Mailing list pgsql-hackers

From Joseph Adams
Subject Re: JSON data type status?
Date
Msg-id AANLkTi=ti88X2KR035_G28WJR+-EFTgMMQYpTd5YZw1=@mail.gmail.com
Whole thread Raw
In response to Re: JSON data type status?  (Itagaki Takahiro <itagaki.takahiro@gmail.com>)
List pgsql-hackers
On Thu, Jan 20, 2011 at 8:13 PM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
> On Fri, Jan 21, 2011 at 09:11, Bruce Momjian <bruce@momjian.us> wrote:
>> What happened to our work to add a JSON data type for PG 9.1?
>
> Nothing will happen in 9.1.
> I assume we are in "competition" status:
>
> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00481.php
>
> Also, if PGXN will work well, we might not have to include JSON
> in the core. We can download any JSON implementations from the
> site after installing the core server.  Of course, if we will
> use JSON types in the core (EXPLAIN JSON output?), we have to
> include one of them.

Thanks for the update.  However, I should probably take some blame as
well for spending basically zero time on the JSON datatype for lengths
at a time.

After someone emailed me asking how to install the JSON datatype, I
fixed up my module version of it:

http://git.postgresql.org/gitweb?p=json-datatype.git

I also have a version where JSON is a core datatype (the patches I
submitted to CommitFest), but it is obsolete now.  The module version
has all the same features, but also fixes a Unicode bug* and adds
PostgreSQL 8.4 compatibility.  For those who want the JSON datatype
right now, I recommend using the module.

I plan to maintain and improve the JSON module (repository linked
above), albeit at a snail's pace.  Patches are certainly welcome.  See
roadmap.markdown in the repository for some minor and not-so-minor
design decisions.


Joey Adams

* P.S.

The "Unicode bug" involved UTF-16 surrogate pair calculation.  JSON
encodes Unicode characters that aren't in the Basic Multilingual Plane
the same way UTF-16 does: using two 4-digit hex codes.  The correct
formula for determining the Unicode codepoint of a surrogate pair is:
   unicode = 0x10000 + (((uc & 0x3FF) << 10) | (lc & 0x3FF));

where:
   0xD800 <= uc <= 0xDBFF   0xDC00 <= lc <= 0xDFFF

For example, consider the JSON string "\uD835\uDD0D":
   uc      = 0xD835   lc      = 0xDD0D   unicode = 0x1D50D

This pair of hex indices collapses into one Unicode codepoint, 0x1D50D
(or "𝔍").

I originally used this:
   /* WRONG */   unicode = 0x10000 | ((uc & 0x3FF) << 10) | (lc & 0x3FF);

The problem is, ((uc & 0x3FF) << 10) | (lc & 0x3FF) has a range of 0
to 0xFFFFF, and in some of those numbers, `0x10000 |` adds zero.

I am sharing all this because I have made this mistake twice, and I
believe it is an easy mistake to make.


pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: pg_basebackup for streaming base backups
Next
From: Simon Riggs
Date:
Subject: Re: SSI and Hot Standby