Thread: Refactoring the Type System

Refactoring the Type System

From
David Fetter
Date:
Folks,

For the past couple of years, I've been hearing from the PostGIS
people among others that our type system just isn't flexible enough
for their needs.  It's really starting to show its age, or possibly
design compromises that seemed reasonable a decade or more ago, but
are less so now.

To that end, I've put up a page on the wiki that includes a list of
issues to be addressed.  It's intended to be changed, possibly
completely.

http://wiki.postgresql.org/wiki/Refactor_Type_System

What might the next version of the type system look like?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


Re: Refactoring the Type System

From
Andrew Dunstan
Date:

On 11/12/2010 11:34 AM, David Fetter wrote:
> Folks,
>
> For the past couple of years, I've been hearing from the PostGIS
> people among others that our type system just isn't flexible enough
> for their needs.  It's really starting to show its age, or possibly
> design compromises that seemed reasonable a decade or more ago, but
> are less so now.

This is so general as to be quite meaningless to me. What is it that is 
wanted that we don't have? (And don't say "flexibility", that's way too 
general - say something much more concrete and specific. If you want 
flexibility we can store everything as text, but I doubt you'll like the 
result.)

cheers

andrew


Re: Refactoring the Type System

From
Dimitri Fontaine
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> This is so general as to be quite meaningless to me. What is it that is
> wanted that we don't have? (And don't say "flexibility", that's way too
> general - say something much more concrete and specific. If you want
> flexibility we can store everything as text, but I doubt you'll like the
> result.)

The way I understand it is (unsurprisingly) related to user data in
extensions. PostGIS maintains a table of user attributes related to
their types, if I've understood correctly. Things that won't fit in the
typmod, that will be different depending on the columns or some other
environment meta-data, and that will have consequences on the meaning
and running of user queries.

OK, that's not much more precise, but think of it as a typmod which does not
fit in 32 bits and so is saved away in a PostGIS table and referred to from
the main storage.
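
For illustration, a rough sketch of that arrangement, loosely modeled on
PostGIS's geometry_columns catalog (names and columns are simplified here, so
treat them as approximate rather than exact):

-- Per-column type metadata that does not fit in the 32-bit typmod is kept in
-- an ordinary table maintained by the extension, keyed by the column it
-- describes, and consulted when interpreting the stored values.
CREATE TABLE geometry_columns_sketch (
    f_table_schema    name    NOT NULL,
    f_table_name      name    NOT NULL,
    f_geometry_column name    NOT NULL,
    coord_dimension   integer NOT NULL,  -- 2, 3, or 4
    srid              integer NOT NULL,  -- spatial reference system id
    geom_type         text    NOT NULL,  -- 'POINT', 'POLYGON', ...
    PRIMARY KEY (f_table_schema, f_table_name, f_geometry_column)
);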

As for all the other problems with type system flexibility that I've read
about, I think they fall in the "type inference" category: we like the
strong typing of the database system but would like it to stay out of our
way. The high point of this, I think, was the proposal for LAMBDA
expressions around the time the DO utility statement appeared.

I don't know how far into type inference we want to go (but I think we
already have parts of that in our implicit cast rules). Maybe we want to
think about having (user) functions be types, too.

Also, tables already are types, and JOINs and result sets are relations too,
so they are types as well. I don't know how far the current type system goes
in treating tables, joins, and relations as the same thing as types, but
there's something there in the NULL handling and the ROW and record
facilities that keeps surprising people on -bugs and elsewhere.

Well, just my very unorganised 2¢,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Refactoring the Type System

From
Jeff Davis
Date:
On Fri, 2010-11-12 at 12:03 -0500, Andrew Dunstan wrote:
> 
> On 11/12/2010 11:34 AM, David Fetter wrote:
> > Folks,
> >
> > For the past couple of years, I've been hearing from the PostGIS
> > people among others that our type system just isn't flexible enough
> > for their needs.  It's really starting to show its age, or possibly
> > design compromises that seemed reasonable a decade or more ago, but
> > are less so now.
> 
> This is so general as to be quite meaningless to me. What is it that is 
> wanted that we don't have?

Some kind of generics, type generators, or type interfaces (like an
interface in Java or a type class in Haskell). A real subtyping system
might also be nice.

That being said, a few details are left to be decided (an
understatement).

Regards,
    Jeff Davis



Re: Refactoring the Type System

From
Robert Haas
Date:
On Fri, Nov 12, 2010 at 6:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> That being said, a few details are left to be decided (an
> understatement).

Best... comment... ever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Refactoring the Type System

From
Jeff Davis
Date:
On Fri, 2010-11-12 at 08:34 -0800, David Fetter wrote:
> Folks,
> 
> For the past couple of years, I've been hearing from the PostGIS
> people among others that our type system just isn't flexible enough
> for their needs.  It's really starting to show its age, or possibly
> design compromises that seemed reasonable a decade or more ago, but
> are less so now.
> 
> To that end, I've put up a page on the wiki that includes a list of
> issues to be addressed.  It's intended to be changed, possibly
> completely.
> 
> http://wiki.postgresql.org/wiki/Refactor_Type_System
> 
> What might the next version of the type system look like?

This problem (as stated) strikes me as pretty overwhelming.

1. As far as I can tell, we have the best type system of any SQL DBMS.
2. Getting a type system right is a hard problem by itself, and there
isn't any obvious consensus (although I think there is some agreement on
some issues).
3. Type systems are more challenging for a DBMS because you need to
account for things like storage, indexing, and optimization in ways that
programming languages don't (consider: when comparing an int4 and an
int8, you may want to coerce based on what indexes you have available).
4. SQL standard issues. In particular, I think that any modern type
system will run into pretty severe problems with NULLs in one way or
another. I think we'd have to pay very close attention to the standard
when designing a new type system, because I suspect that retrofitting
the standard onto a system we invent independently would be a disaster.
5. Backwards compatibility issues.

I think the best we'll do is be able to hack on some of the things that
we actively want and have clear use cases for, such as type interfaces.
We might have to give up on some of the more ambitious ideas that
involve propagating interesting information through the type inference
system; or having any real type that wasn't declared with CREATE TYPE.
Consider that right now we bundle the element type information along
with the array _value_.
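
As a small (hedged) illustration of that point at the SQL level: the element
type travels with the array value, so the same literal cast to a different
array type reports a different type.

-- integer[] vs bigint[]: the element type is part of the value's identity;
-- internally the element type OID is stored in the array datum itself.
SELECT pg_typeof(ARRAY[1,2,3])          AS default_type,   -- integer[]
       pg_typeof(ARRAY[1,2,3]::int8[])  AS widened_type;   -- bigint[]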

Ideas welcome. Particularly if there are a few clear use cases.

Regards,
    Jeff Davis



Re: Refactoring the Type System

From
Andrew Dunstan
Date:

On 11/12/2010 07:07 PM, Jeff Davis wrote:
> On Fri, 2010-11-12 at 08:34 -0800, David Fetter wrote:
>> Folks,
>>
>> For the past couple of years, I've been hearing from the PostGIS
>> people among others that our type system just isn't flexible enough
>> for their needs.  It's really starting to show its age, or possibly
>> design compromises that seemed reasonable a decade or more ago, but
>> are less so now.
>>
>> To that end, I've put up a page on the wiki that includes a list of
>> issues to be addressed.  It's intended to be changed, possibly
>> completely.
>>
>> http://wiki.postgresql.org/wiki/Refactor_Type_System
>>
>> What might the next version of the type system look like?
> This problem (as stated) strikes me as pretty overwhelming.
>
> 1. As far as I can tell, we have the best type system of any SQL DBMS.
> 2. Getting a type system right is a hard problem by itself, and there
> isn't any obvious consensus (although I think there is some agreement on
> some issues).
> 3. Type systems are more challenging for a DBMS because you need to
> account for things like storage, indexing, and optimization in ways that
> programming languages don't (consider: when comparing an int4 and an
> int8, you may want to coerce based on what indexes you have available).
> 4. SQL standard issues. In particular, I think that any modern type
> system will run into pretty severe problems with NULLs in one way or
> another. I think we'd have to pay very close attention to the standard
> when designing a new type system, because I suspect that retrofitting
> the standard onto a system we invent independently would be a disaster.
> 5. Backwards compatibility issues.
>
> I think the best we'll do is be able to hack on some of the things that
> we actively want and have clear use cases for, such as type interfaces.
> We might have to give up on some of the more ambitious ideas that
> involve propagating interesting information through the type inference
> system; or having any real type that wasn't declared with CREATE TYPE.
> Consider that right now we bundle the element type information along
> with the array _value_.

Yeah, composites too, IIRC. It's a bit sad. But those are really just 
warts that show the difficulties we face in implementing types. I'm still 
waiting for something seriously different yet feasible that we could do. 
(And I agree we do have about the best type system around.)

cheers

andrew


Re: Refactoring the Type System

From
Darren Duncan
Date:
David Fetter wrote:
> For the past couple of years, I've been hearing from the PostGIS
> people among others that our type system just isn't flexible enough
> for their needs.  It's really starting to show its age, or possibly
> design compromises that seemed reasonable a decade or more ago, but
> are less so now.
> 
> To that end, I've put up a page on the wiki that includes a list of
> issues to be addressed.  It's intended to be changed, possibly
> completely.
> 
> http://wiki.postgresql.org/wiki/Refactor_Type_System
> 
> What might the next version of the type system look like?

Are you talking about changes to the type system as users see it, or just changes 
to how the existing behavior is implemented internally?  I'll assume the former, 
which is what the other replies to this thread seem to be addressing, though not 
necessarily the URL you pointed to, which looks more internals-oriented ...

As a statement which may surprise no one who's heard me talk about it before ...

I've mostly completed a type system specification that would be useable by 
Postgres, as the most fundamental part of my Muldis D language.

The type system is arguably the most central piece of any DBMS, around which 
everything else is defined and built.

You have data, which is structured in some way, and has operators for it.

If you look at a DBMS from the perspective of a programming language 
implementation, you find that a database is just a variable that holds a value 
of a structured type.  In the case of a relational database, said database is a 
tuple whose attribute values are relations; or, where namespaces/schemas are 
involved, the database tuple has tuple-valued attributes whose attributes are 
in turn relations.

If a database is a variable, then all database constraints are type constraints 
on the declared type of that variable, and you can make said constraints 
arbitrarily complicated.

From basic structures like nestable tuples and relations, plus a complement of 
basic types like numbers and strings, and arbitrary constraints, you can define 
data types of any shape or form.

A key component of a good type system is that users can define data types, and 
moreover where possible, system-defined types are defined in the same ways as 
users define types.  For example, stuff like temporal types or geospatial types 
are prime candidates for being defined like user-defined types.

If you define all structures using tuples and relations, you can easily flatten 
this out on the implementation end and basically do everything as associated 
flat relation variables as you do now.

So what I propose is both very flexible and easy to implement, scale, and 
optimize, relatively speaking.

You don't have to kludge things by implementing arrays as blobs for example; you 
can implement them as relations instead.  Geospatial types can just be tuples. 
Arrays of structured types can just be relations with an attribute per type 
attribute.  Arrays of simple types can just be unary relations.
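
A hedged sketch of what that could look like at the SQL level today (table
and column names are invented for illustration; this is emphatically not how
Postgres stores arrays now):

-- An array-valued attribute modeled as a child relation, one row per element.
CREATE TABLE item (
    item_id serial PRIMARY KEY
);

CREATE TABLE item_tags (            -- plays the role of a text[] attribute
    item_id integer NOT NULL REFERENCES item,
    ordinal integer NOT NULL,       -- element position, if order matters
    tag     text    NOT NULL,
    PRIMARY KEY (item_id, ordinal)
);

-- Reassembling the "array" is then just a grouped query.
SELECT item_id, array_agg(tag ORDER BY ordinal) AS tags
FROM item_tags
GROUP BY item_id;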

You can also emulate all of the existing Pg features and syntax that you have 
now over the type system I've defined, maintaining compatibility too.

I also want to emphasize that, while I drew inspiration from many sources when 
defining Muldis D, and there was/is a lot I still didn't/don't know about 
Postgres, as I use and learn Postgres I frequently find that how it does things 
is similar and compatible with how I independently designed Muldis D; I'm 
finding more similarities all the time.

-- Darren Duncan


Re: Refactoring the Type System

From
Daniel Farina
Date:
On Sat, Nov 13, 2010 at 7:54 PM, Darren Duncan <darren@darrenduncan.net> wrote:
> A key component of a good type system is that users can define data types,
> and moreover where possible, system-defined types are defined in the same
> ways as users define types.  For example, stuff like temporal types or
> geospatial types are prime candidates for being defined like user-defined
> types.

They are user defined types, and even separately compiled,
distributed, and dynamically linked.

> You don't have to kludge things by implementing arrays as blobs for example;
> you can implement them as relations instead.  Geospatial types can just be
> tuples. Arrays of structured types can just be relations with an attribute
> per type attribute.  Arrays of simple types can just be unary relations.

In practice, my guess is the performance of these approaches would be
terrible for a number of workloads. I don't agree that arrays having
their own storage implementation is a kludge: there are even operators
like unnest that can be used to turn them back into relations.
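
For instance (unnest and array_agg have both been there since 8.4, giving the
two directions):

-- The relation view of an array value: one output row per element.
SELECT unnest(ARRAY['a','b','c']) AS elem;
--  elem
-- ------
--  a
--  b
--  c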

I have long thought (but never really gave voice to the idea) that there is
value in having first-class relation values, but I would also say that's
another kettle of fish. I also want closures, and don't think that's
completely nuts.

> I also want to emphasize that, while I drew inspiration from many sources
> when defining Muldis D, and there was/is a lot I still didn't/don't know
> about Postgres, as I use and learn Postgres I frequently find that how it
> does things is similar and compatible with how I independently designed
> Muldis D; I'm finding more similarities all the time.

You may want to learn more about this, first. Postgres's type system,
while relatively simple, is not as ill-conceived or primitive as you
seem to assume it is. There are also a lot of important system
details, like expanding/replacing the typmod facility that only allows
Postgres 32 bits of type-augmenting data (such as the (1) in the type
numeric(1)).
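
To make the typmod limitation concrete, a quick illustration (the table name
is made up; the packing details are approximate):

-- atttypmod is a single int4 per column; for numeric it packs precision and
-- scale together (roughly (precision << 16) | scale, plus a small header
-- offset), and that is all the per-column type-augmenting data available.
CREATE TABLE typmod_demo (n numeric(7,2));

SELECT attname, atttypmod
FROM pg_attribute
WHERE attrelid = 'typmod_demo'::regclass
  AND attname = 'n';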

fdr


Re: Refactoring the Type System

From
Daniel Farina
Date:
On Fri, Nov 12, 2010 at 4:07 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> I think the best we'll do is be able to hack on some of the things that
> we actively want and have clear use cases for, such as type interfaces.
> We might have to give up on some of the more ambitious ideas that
> involve propagating interesting information through the type inference
> system; or having any real type that wasn't declared with CREATE TYPE.
> Consider that right now we bundle the element type information along
> with the array _value_.

Here are some weaknesses in the SUM aggregate that run up against the
type system. Maybe they'll help crystallize some discussion:

SUM(int2) => int4
SUM(int4) => int8
SUM(int8) => numeric

Some weaknesses:

SUM, of any precision, assumes that the precision being accumulated
into (which is also the return-precision) is enough to avoid overflow.
This is generally the case, but there's no reason why it *must* be
true. I'm especially looking at the int2 to int4 conversion. One could
imagine a more interesting scenario where overflow behavior could
occur much more easily.

SUM always promotes types upwards in precision, and is unable to keep
types of the smallest possible precision should SUM expressions be
composed.

SUM is unable to maintain any supplementary information about
precision, i.e. say anything interesting about the typmod, which
defeats or makes impossible many interesting optimizations.
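
A small sketch of the first weakness, assuming the SUM(int2) => int4
signature listed above (the actual behavior depends on the aggregate
definitions installed):

-- 70,000 smallint values of 32767 sum to about 2.29 billion, which is past
-- the int4 range (about 2.15 billion); with an int4 accumulator this raises
-- an overflow error instead of returning the mathematically correct sum.
SELECT SUM(32767::int2)
FROM generate_series(1, 70000);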

I think that a type-interface system moves towards solving the first
couple of problems, since SUM can return some abstract type such as
"Integer" and use that to promote more aggressively (avoiding
overflow) or keep representations small (avoiding unnecessary
promotion) at run-time. It might require Integer to be an abstract,
non-storable data type, though, so the current CREATE TYPE is not
going to make life easy.

The third problem is slightly different: it might require some
user-pluggable code to be called as part of semantic analysis. I have had
the idea rolling around in my head of making a Postgres type a more robust
first-class value, and of being able to attach to a function/aggregate
another function that gets called during semantic analysis and returns the
proper signature -- but it might be the whispers of Cthulhu, too.
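
Purely as a thought experiment -- none of this syntax exists -- such a hook
might look like an extra, analysis-time callback attached to the aggregate
and consulted to compute the result type for each call site:

-- Hypothetical syntax only (sfunc/stype are real CREATE AGGREGATE options,
-- result_type_hook is not): a callback invoked during semantic analysis
-- with the concrete argument types, returning the result type to use at
-- this call site instead of a single fixed return type.
CREATE AGGREGATE sum (int2) (
    sfunc            = int2_sum_sketch,   -- invented transition function name
    stype            = int8,
    result_type_hook = sum_result_type    -- not real: called at analysis time
);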

fdr


Re: Refactoring the Type System

From
Tom Lane
Date:
Daniel Farina <drfarina@acm.org> writes:
> Here are some weaknesses in the SUM aggregate that run up against the
> type system. Maybe they'll help crystallize some discussion:

> SUM(int2) => int4
> SUM(int4) => int8
> SUM(int8) => numeric

> Some weaknesses:

> SUM, of any precision, assumes that the precision being accumulated
> into (which is also the return-precision) is enough to avoid overflow.

This is not a flaw of the type system, it's just an implementation
choice in the SUM() aggregates.  We could easily have chosen wider
accumulation and/or result types.
        regards, tom lane


Re: Refactoring the Type System

From
Daniel Farina
Date:
On Sun, Nov 14, 2010 at 7:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <drfarina@acm.org> writes:
>> Here are some weaknesses in the SUM aggregate that run up against the
>> type system. Maybe they'll help crystallize some discussion:
>
>> SUM(int2) => int4
>> SUM(int4) => int8
>> SUM(int8) => numeric
>
>> Some weaknesses:
>
>> SUM, of any precision, assumes that the precision being accumulated
>> into (which is also the return-precision) is enough to avoid overflow.
>
> This is not a flaw of the type system, it's just an implementation
> choice in the SUM() aggregates.  We could easily have chosen wider
> accumulation and/or result types.

That's true, but there are downsides to escalating the precision so
aggressively.

The case I was thinking about in particular involves composition of
SUM. If one can assume that a relation has int4s and that will never
overflow an int8 (as is done now), I don't see a great way to optimize
the following case without special exceptions in the optimizer for
particular aggregates known a priori. Here's what would happen now:

SELECT SUM(x::int8)::numeric
FROM (SELECT SUM(x::int4)::int8 AS x
      FROM rel
      GROUP BY y) some_name;

Could be rendered, by this assumption, as:

SELECT SUM(x::int8)::int8
....(same FROM clause)

(Why would anyone write a query like this? Views. Possibly inlined SQL
UDFs, too.)

This can be measurably faster. It also more properly constrains the
result type, as numeric can also handle non-integer quantities.
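
To make the "views" point concrete (schema and names invented for
illustration):

-- The inner SUM is hidden in a view; an outer aggregate over the view then
-- produces exactly the nested shape above, with the outer result promoted
-- to numeric even though the value is known to be integral.
CREATE TABLE rel (y integer, x int4);

CREATE VIEW per_y AS
SELECT y, SUM(x) AS x          -- SUM(int4) yields int8 here
FROM rel
GROUP BY y;

SELECT SUM(x) AS grand_total   -- SUM(int8) yields numeric
FROM per_y;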

I should have underscored that a positive aspect of a type-class-like
facility is that it allows declaring things like this hypothetical Integer,
backed by concrete types that might support a superset of its
functionality.

fdr


Re: Refactoring the Type System

From
Darren Duncan
Date:
Daniel Farina wrote:
> On Sat, Nov 13, 2010 at 7:54 PM, Darren Duncan <darren@darrenduncan.net> wrote:
>> You don't have to kludge things by implementing arrays as blobs for example;
>> you can implement them as relations instead.  Geospatial types can just be
>> tuples. Arrays of structured types can just be relations with an attribute
>> per type attribute.  Arrays of simple types can just be unary relations.
> 
> In practice, my guess is the performance of these approaches would be
> terrible for a number of workloads. I don't agree that arrays having
> their own storage implementation is a kludge: there are even operators
> like unnest that can be used to turn them back into relations.

I'm not discounting this at all.  The data model can formally define types in 
one way and a DBMS can implement various well-known cases in specialized ways 
that are more efficient.  Arrays are a good example.  But other cases still get 
a default implementation.  If arrays are flexible enough, then different arrays 
could be implemented differently, e.g. some as blobs and some as relations; the 
former could be better for small or simple arrays and the latter for large or 
complex arrays.

> I have long thought (but never really gave voice to) there being value
> having first-class relation values, but I would also say that's
> another kettle of fish. I also want closures, and don't think that's
> completely nuts.

I think that first-class functions are important at least, if not full-blown 
closures.  You can define generic relational restrictions or summaries or 
whatever with such; e.g., the general case of a WHERE is a function that takes a 
relation and a function as input, and results in a relation; more restricted 
cases of WHERE can be defined with simpler functions that take two relation 
inputs instead.

> You may want to learn more about this, first. Postgres's type system,
> while relatively simple, is not as ill-conceived or primitive as you
> seem to assume it is. There are also a lot of important system
> details, like expanding/replacing the typmod facility that only allows
> Postgres 32 bits of type-augmenting data (such as the (1) in the type
> numeric(1)).

I'm not saying that Pg's system is primitive or anything like that.  The thread 
starter said it needed an overhaul, which indicates there are deficiencies, and 
I'm suggesting some effective ways to go about that.

-- Darren Duncan


Re: Refactoring the Type System

From
Robert Haas
Date:
On Sun, Nov 14, 2010 at 2:27 PM, Daniel Farina <drfarina@acm.org> wrote:
> On Sun, Nov 14, 2010 at 7:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Daniel Farina <drfarina@acm.org> writes:
>>> Here are some weaknesses in the SUM aggregate that run up against the
>>> type system. Maybe they'll help crystallize some discussion:
>>
>>> SUM(int2) => int4
>>> SUM(int4) => int8
>>> SUM(int8) => numeric
>>
>>> Some weaknesses:
>>
>>> SUM, of any precision, assumes that the precision being accumulated
>>> into (which is also the return-precision) is enough to avoid overflow.
>>
>> This is not a flaw of the type system, it's just an implementation
>> choice in the SUM() aggregates.  We could easily have chosen wider
>> accumulation and/or result types.
>
> That's true, but there are downsides to escalating the precision so
> aggressively.
>
> The case I was thinking about in particular involves composition of
> SUM. If one can assume that a relation has int4s and that will never
> overflow an int8 (as is done now), I don't see a great way to optimize
> the following case without special exceptions in the optimizer for
> particular aggregates known a priori. Here's what would happen now:
>
> SELECT SUM(x::int8)::numeric
> FROM (SELECT SUM(x::int4)::int8 AS x
>      FROM rel
>      GROUP BY y) some_name;
>
> Could be rendered, by this assumption, as:
>
> SELECT SUM(x::int8)::int8
> ....(same FROM clause)
>
> (Why would anyone write a query like this? Views. Possibly inlined SQL
> UDFs, too.)
>
> This can be measurably faster. It also more properly constrains the
> result type, as numeric can also handle non-integer quantities.
>
> I should have underscored that a positive aspect of a type-class-like
> facility is that it allows declaring things like this hypothetical Integer,
> backed by concrete types that might support a superset of its
> functionality.

Like Tom, I'm not sure this is really a type-system problem.  This
sounds like a complaint that operations on "numeric" are much slower
than operations on "int4" and "int8", even for values that could be
represented by either type.  I think that's a valid complaint, but I
don't see how changing the type system would help.  I think what you'd
need to do is optimize the existing numeric type, or provide a new
numeric-ish type with optimizations for dealing with
small-to-medium-sized integers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Refactoring the Type System

From
Daniel Farina
Date:
On Sun, Nov 14, 2010 at 1:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Like Tom, I'm not sure this is really a type-system problem.  This
> sounds like a complaint that operations on "numeric" are much slower
> than operations on "int4" and "int8", even for values that could be
> represented by either type.  I think that's a valid complaint, but I
> don't see how changing the type system would help.  I think what you'd
> need to do is optimize the existing numeric type, or provide a new
> numeric-ish type with optimizations for dealing with
> small-to-medium-sized integers.

There are other ways one might be able to attack the performance part
of the problem, but consider the loss of information about the type
from int(2|4|8) to numeric when composing a series of sums: we know
the value produced fits the abstract notion of an Integer, but we lose
that information. The same could be said of SUM(x::numeric(1000,0))
yielding an unrestricted numeric, rather than one of scale 0. Not only
is there more information available to the user, but the optimizer
should be able to benefit from that information as well. However, for
an arbitrary user-defined operator to take advantage it would seem to
me that there needs to be a hook where some reasoning can take place
over its input types and subsequently determine the return prototype
at that call site.

I am not trying to suggest that there is no way one could attack the
details around SUM I have mentioned without changing the type system, only
that they are limitations I encountered and worked around, and a more
powerful type system came to mind as an avenue for a simpler/more proper
solution.

fdr


Re: Refactoring the Type System

From
Tom Lane
Date:
Daniel Farina <drfarina@acm.org> writes:
> There are other ways one might be able to attack the performance part
> of the problem, but consider the loss of information about the type
> from int(2|4|8) to numeric when composing a series of sums: we know
> the value produced fits the abstract notion of an Integer, but we lose
> that information. The same could be said of SUM(x::numeric(1000,0))
> yielding an unrestricted numeric, rather than one of scale 0. Not only
> is there more information available to the user, but the optimizer
> should be able to benefit from that information as well. However, for
> an arbitrary user-defined operator to take advantage it would seem to
> me that there needs to be a hook where some reasoning can take place
> over its input types and subsequently determine the return prototype
> at that call site.

Yeah, this has been discussed before.  The problem is that it's too much
work for too little reward.  There are too many different possible
behaviors for functions, and nobody is going to go through and write
a subtype-inference helper function for every function in the system,
or even any usefully large fraction of the functions in the system.
What's more, the subtype definitions have largely been predetermined
for us by SQL, and predetermined in such a way that knowing the subtype
isn't really all that exciting.  Is anybody willing to put in
months of work to teach the system that sum(numeric(7,0)) yields
numeric(something,0) rather than numeric(something,something)?
I doubt it.  The return on that knowledge would be too small, and
there are too many cases where you couldn't deduce anything anyway.
        regards, tom lane


Re: Refactoring the Type System

From
Greg Stark
Date:
On Sun, Nov 14, 2010 at 11:15 AM, Daniel Farina <drfarina@acm.org> wrote:
> SUM(int2) => int4
> SUM(int4) => int8
> SUM(int8) => numeric
>
> Some weaknesses:
>
> SUM, of any precision, assumes that the precision being accumulated
> into (which is also the return-precision) is enough to avoid overflow.
> This is generally the case, but there's no reason why it *must* be
> true. I'm especially looking at the int2 to int4 conversion. One could
> imagine a more interesting scenario where overflow behavior could
> occur much more easily.

FWIW I think he's right that sum(int2) should perhaps be redefined to
return int8. As it stands, all it would take is 64k rows to potentially
overflow. It's not super likely, but it is plausible, and the performance
penalty of using int8 wouldn't be super big either.

int4 doesn't seem like as realistic a problem since it would take 4
billion rows and the performance penalty for using numeric would be
much higher.

-- 
greg


Re: Refactoring the Type System

From
Robert Haas
Date:
On Sun, Nov 14, 2010 at 9:16 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Sun, Nov 14, 2010 at 11:15 AM, Daniel Farina <drfarina@acm.org> wrote:
>> SUM(int2) => int4
>> SUM(int4) => int8
>> SUM(int8) => numeric
>>
>> Some weaknesses:
>>
>> SUM, of any precision, assumes that the precision being accumulated
>> into (which is also the return-precision) is enough to avoid overflow.
>> This is generally the case, but there's no reason why it *must* be
>> true. I'm especially looking at the int2 to int4 conversion. One could
>> imagine a more interesting scenario where overflow behavior could
>> occur much more easily.
>
> FWIW I think he's right that sum(int2) should perhaps be redefined to
> return int8. As it stands, all it would take is 64k rows to potentially
> overflow. It's not super likely, but it is plausible, and the performance
> penalty of using int8 wouldn't be super big either.
>
> int4 doesn't seem like as realistic a problem since it would take 4
> billion rows and the performance penalty for using numeric would be
> much higher.

I think you could get the same effect by removing sum(int2)
altogether, which I wouldn't lose any sleep over.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Refactoring the Type System

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> FWIW I think he's right that sum(int2) should perhaps be redefined to
> return int8. As it stands, all it would take is 64k rows to potentially
> overflow. It's not super likely, but it is plausible, and the performance
> penalty of using int8 wouldn't be super big either.

It's not unreasonable.  I think the performance penalty for int8 was
higher when that choice was made than it is now --- and in particular,
on a 64-bit machine it's now pretty much negligible.

On the other hand, you can always execute sum(foo::int4) if you need a
wider sum, just as casting is the escape hatch whenever any of the other
datatype choices aren't working for you.  It's not clear that we should force
a performance loss on people who don't need it (and I can't offhand recall
*ever* hearing a complaint about sum(int2) overflowing...)

I believe what the OP was arguing for was not so much this sort of
marginal tinkering as inventing a more general notion of "integer type"
that would avoid the whole issue.  The problem with that is that we have
to work within the set of types specified by the SQL standard.
        regards, tom lane