Thread: Refactoring the Type System
Folks,

For the past couple of years, I've been hearing from the PostGIS
people among others that our type system just isn't flexible enough
for their needs. It's really starting to show its age, or possibly
design compromises that seemed reasonable a decade or more ago, but
are less so now.

To that end, I've put up a page on the wiki that includes a list of
issues to be addressed. It's intended to be changed, possibly
completely.

http://wiki.postgresql.org/wiki/Refactor_Type_System

What might the next version of the type system look like?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
On 11/12/2010 11:34 AM, David Fetter wrote:
> Folks,
>
> For the past couple of years, I've been hearing from the PostGIS
> people among others that our type system just isn't flexible enough
> for their needs. It's really starting to show its age, or possibly
> design compromises that seemed reasonable a decade or more ago, but
> are less so now.

This is so general as to be quite meaningless to me. What is it that is
wanted that we don't have? (And don't say "flexibility", that's way too
general - say something much more concrete and specific. If you want
flexibility we can store everything as text, but I doubt you'll like
the result.)

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> This is so general as to be quite meaningless to me. What is it that is
> wanted that we don't have. (And don't say "flexibility", that's way too
> general - say something much more concrete and specific. If you want
> flexibility we can store everything as text, but I doubt you'll like the
> result.)

The way I understand it is (unsurprisingly) related to user data in
extensions. PostGIS maintains a table of user attributes related to its
types, if I've understood correctly: things that won't fit in the
typmod, that will differ depending on the column or some other
environmental metadata, and that have consequences for the meaning and
execution of user queries.

Ok, that's not much more precise, but the point is a typmod which does
not fit in 32 bits, so it is stored away in some PostGIS table and
referred to from the main storage.

As for the other problems with the type system's flexibility that I've
read about, I think they fall in the "type inference" category: we like
the strong typing of the database system but would like to be able to
forget about it. The epitome of this, I think, was the proposal for
LAMBDA expressions at the time the DO utility statement appeared.

I don't know how far into type inference we want to go (but I think we
already have parts of it in our implicit cast rules). Maybe we want to
think about having (user) functions be types, too.

Also, tables are already types of a sort, and JOINs and result sets are
relations too, so they are types as well. I don't know how far the
current type system goes in treating tables, joins, and relations the
same as types, but there's something there with NULL handling and some
ROW and record facilities that we see surprise people on -bugs and in
other places.

Well, just my very unorganised 2¢,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
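Dimitri's description of PostGIS stashing per-column metadata that won't fit in the 32-bit typmod, keyed from the main storage, amounts to an interning scheme. A minimal sketch of that pattern (illustrative Python; the names and layout are assumptions, not PostGIS's actual catalogs):

```python
# Sketch: metadata too wide for a 32-bit typmod is stored out of line
# and referred to by a small integer key, along the lines Dimitri
# describes. All names here are illustrative, not PostGIS's.

class TypmodSideTable:
    """Interns arbitrary per-column type metadata to 32-bit ids."""

    def __init__(self):
        self._by_id = {}
        self._by_key = {}

    def intern(self, metadata):
        """Return a small id for `metadata`, reusing an existing entry."""
        key = tuple(sorted(metadata.items()))
        if key in self._by_key:
            return self._by_key[key]
        new_id = len(self._by_id) + 1
        assert new_id < 2**31  # must still fit the 32-bit typmod slot
        self._by_id[new_id] = dict(metadata)
        self._by_key[key] = new_id
        return new_id

    def lookup(self, typmod_id):
        return self._by_id[typmod_id]


side = TypmodSideTable()
tm = side.intern({"srid": 4326, "geom_type": "POINT", "dims": 2})
assert side.lookup(tm)["srid"] == 4326
# Interning the same metadata again yields the same small id:
assert side.intern({"srid": 4326, "geom_type": "POINT", "dims": 2}) == tm
```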
On Fri, 2010-11-12 at 12:03 -0500, Andrew Dunstan wrote:
> On 11/12/2010 11:34 AM, David Fetter wrote:
> > Folks,
> >
> > For the past couple of years, I've been hearing from the PostGIS
> > people among others that our type system just isn't flexible enough
> > for their needs. It's really starting to show its age, or possibly
> > design compromises that seemed reasonable a decade or more ago, but
> > are less so now.
>
> This is so general as to be quite meaningless to me. What is it that is
> wanted that we don't have.

Some kind of generics, type generators, or type interfaces (like an
interface in Java or a type class in Haskell). A real subtyping system
might also be nice.

That being said, a few details are left to be decided (an
understatement).

Regards,
	Jeff Davis
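The "type interface" idea Jeff raises can be shown in miniature: code is written against an interface (compare Haskell's type classes or Java's interfaces), and any concrete type implementing the operations qualifies. An illustrative Python sketch, with all names hypothetical:

```python
# Sketch of a "type interface": bounded_sum is written against an
# interface rather than one hard-wired integer width. Everything here
# is hypothetical, for illustration only.

from typing import Protocol, runtime_checkable

@runtime_checkable
class IntegerLike(Protocol):
    """The operations bounded_sum actually needs."""
    def __add__(self, other): ...
    def __lt__(self, other): ...

def bounded_sum(values, upper_bound):
    """Sum any IntegerLike values, failing loudly instead of wrapping."""
    total = 0
    for v in values:
        assert isinstance(v, IntegerLike)
        total = total + v
        if not (total < upper_bound):
            raise OverflowError("sum exceeded declared bound")
    return total

assert bounded_sum([1, 2, 3], upper_bound=10) == 6
```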
On Fri, Nov 12, 2010 at 6:12 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> That being said, a few details are left to be decided (an
> understatement).

Best... comment... ever.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, 2010-11-12 at 08:34 -0800, David Fetter wrote:
> Folks,
>
> For the past couple of years, I've been hearing from the PostGIS
> people among others that our type system just isn't flexible enough
> for their needs. It's really starting to show its age, or possibly
> design compromises that seemed reasonable a decade or more ago, but
> are less so now.
>
> To that end, I've put up a page on the wiki that includes a list of
> issues to be addressed. It's intended to be changed, possibly
> completely.
>
> http://wiki.postgresql.org/wiki/Refactor_Type_System
>
> What might the next version of the type system look like?

This problem (as stated) strikes me as pretty overwhelming.

1. As far as I can tell, we have the best type system of any SQL DBMS.

2. Getting a type system right is a hard problem by itself, and there
isn't any obvious consensus (although I think there is some agreement
on some issues).

3. Type systems are more challenging for a DBMS because you need to
account for things like storage, indexing, and optimization in ways
that programming languages don't (consider: when comparing an int4 and
an int8, you may want to coerce based on what indexes you have
available).

4. SQL standard issues. In particular, I think that any modern type
system will run into pretty severe problems with NULLs in one way or
another. I think we'd have to pay very close attention to the standard
when designing a new type system, because I suspect that retrofitting
the standard onto a system we invent independently would be a disaster.

5. Backwards compatibility issues.

I think the best we'll do is be able to hack on some of the things
that we actively want and have clear use cases for, such as type
interfaces. We might have to give up on some of the more ambitious
ideas that involve propagating interesting information through the
type inference system, or having any real type that wasn't declared
with CREATE TYPE. Consider that right now we bundle the element type
information along with the array _value_.

Ideas welcome. Particularly if there are a few clear use cases.

Regards,
	Jeff Davis
On 11/12/2010 07:07 PM, Jeff Davis wrote:
> On Fri, 2010-11-12 at 08:34 -0800, David Fetter wrote:
>> Folks,
>>
>> For the past couple of years, I've been hearing from the PostGIS
>> people among others that our type system just isn't flexible enough
>> for their needs. It's really starting to show its age, or possibly
>> design compromises that seemed reasonable a decade or more ago, but
>> are less so now.
>>
>> To that end, I've put up a page on the wiki that includes a list of
>> issues to be addressed. It's intended to be changed, possibly
>> completely.
>>
>> http://wiki.postgresql.org/wiki/Refactor_Type_System
>>
>> What might the next version of the type system look like?
>
> This problem (as stated) strikes me as pretty overwhelming.
>
> 1. As far as I can tell, we have the best type system of any SQL DBMS.
>
> 2. Getting a type system right is a hard problem by itself, and there
> isn't any obvious consensus (although I think there is some agreement
> on some issues).
>
> 3. Type systems are more challenging for a DBMS because you need to
> account for things like storage, indexing, and optimization in ways
> that programming languages don't (consider: when comparing an int4 and
> an int8, you may want to coerce based on what indexes you have
> available).
>
> 4. SQL standard issues. In particular, I think that any modern type
> system will run into pretty severe problems with NULLs in one way or
> another. I think we'd have to pay very close attention to the standard
> when designing a new type system, because I suspect that retrofitting
> the standard onto a system we invent independently would be a disaster.
>
> 5. Backwards compatibility issues.
>
> I think the best we'll do is be able to hack on some of the things
> that we actively want and have clear use cases for, such as type
> interfaces. We might have to give up on some of the more ambitious
> ideas that involve propagating interesting information through the
> type inference system, or having any real type that wasn't declared
> with CREATE TYPE. Consider that right now we bundle the element type
> information along with the array _value_.

Yeah, composites too, IIRC. It's a bit sad. But those are really just
warts showing the difficulties we face in implementing types. I'm
still waiting for some seriously different yet possible thing we could
do. (And I agree we do have about the best type system around.)

cheers

andrew
David Fetter wrote:
> For the past couple of years, I've been hearing from the PostGIS
> people among others that our type system just isn't flexible enough
> for their needs. It's really starting to show its age, or possibly
> design compromises that seemed reasonable a decade or more ago, but
> are less so now.
>
> To that end, I've put up a page on the wiki that includes a list of
> issues to be addressed. It's intended to be changed, possibly
> completely.
>
> http://wiki.postgresql.org/wiki/Refactor_Type_System
>
> What might the next version of the type system look like?

Are you talking about changes to the type system as users see it, or
just changes to how the existing behavior is implemented internally?
The other replies in this thread seem to assume the former, though the
url you pointed to looks more like internals.

If you mean the type system as users see it ... and as a statement
which may surprise no one who's heard me talk about it before ... I've
mostly completed a type system specification that would be usable by
Postgres, as the most fundamental part of my Muldis D language.

The type system is arguably the most central piece of any DBMS, around
which everything else is defined and built. You have data, which is
structured in some way, and has operators for it. If you look at a
DBMS from the perspective of a programming language implementation,
you find that a database is just a variable that holds a value of a
structured type. In the case of a relational database, said database
is a tuple whose attribute values are relations; or, in the case of
namespaces/schemas, the database tuple has tuple attributes which in
turn have relation attributes. If a database is a variable, then all
database constraints are type constraints on the declared type of that
variable, and you can make said constraints arbitrarily complicated.

From basic structures like nestable tuples and relations, plus a
complement of basic types like numbers and strings, plus arbitrary
constraints, you can define data types of any shape or form.

A key component of a good type system is that users can define data
types, and moreover, where possible, system-defined types are defined
in the same way users define types. For example, stuff like temporal
types or geospatial types are prime candidates for being defined like
user-defined types.

If you define all structures using tuples and relations, you can
easily flatten this out on the implementation end and basically do
everything as associated flat relation variables, as you do now. So
what I propose is both very flexible and, relatively speaking, easy to
implement, scale, and optimize. You don't have to kludge things by
implementing arrays as blobs, for example; you can implement them as
relations instead. Geospatial types can just be tuples. Arrays of
structured types can just be relations with an attribute per type
attribute. Arrays of simple types can just be unary relations. You can
also emulate all of the existing Pg features and syntax over the type
system I've defined, maintaining compatibility too.

I also want to emphasize that, while I drew inspiration from many
sources when defining Muldis D, and there was/is a lot I still
didn't/don't know about Postgres, I'm finding frequently, as I use and
learn Postgres, that how Postgres does things is similar and
compatible to how I independently came up with Muldis D's design; I'm
finding more similarities all the time.

-- Darren Duncan
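Darren's "a database is just a variable holding one structured value" model can be rendered as a toy (illustrative Python, not Muldis D syntax): the database is a tuple of relations, a relation is a set of tuples, and a database constraint is just a predicate over that one value.

```python
# Toy rendering of the "database as one value" model. A relation is a
# set of tuples (frozen attribute/value pairs); the database is a tuple
# (here a dict) of relations. All names are made up for illustration.

def relation(*tuples):
    return frozenset(frozenset(t.items()) for t in tuples)

database = {
    "person": relation({"id": 1, "name": "ada"},
                       {"id": 2, "name": "alan"}),
    "likes": relation({"person_id": 1, "topic": "types"}),
}

def fk_constraint(db):
    """A database constraint is a type constraint on the whole value:
    here, every likes.person_id must appear in person.id."""
    ids = {dict(t)["id"] for t in db["person"]}
    return all(dict(t)["person_id"] in ids for t in db["likes"])

assert fk_constraint(database)
```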
On Sat, Nov 13, 2010 at 7:54 PM, Darren Duncan <darren@darrenduncan.net> wrote:
> A key component of a good type system is that users can define data types,
> and moreover where possible, system-defined types are defined in the same
> ways as users define types. For example, stuff like temporal types or
> geospatial types are prime candidates for being defined like user-defined
> types.

They are user-defined types, and even separately compiled,
distributed, and dynamically linked.

> You don't have to kludge things by implementing arrays as blobs for example;
> you can implement them as relations instead. Geospatial types can just be
> tuples. Arrays of structured types can just be relations with an attribute
> per type attribute. Arrays of simple types can just be unary relations.

In practice, my guess is the performance of these approaches would be
terrible for a number of workloads. I don't agree that arrays having
their own storage implementation is a kludge: there are even operators
like unnest that can be used to turn them back into relations.

I have long thought (but never really gave voice to) that there is
value in having first-class relation values, but I would also say
that's another kettle of fish. I also want closures, and don't think
that's completely nuts.

> I also want to emphasize that, while I drew inspiration from many sources
> when defining Muldis D, and there was/is a lot I still didn't/don't know
> about Postgres, I have found that as I use and learn Postgres, I'm finding
> frequently that how Postgres does things is similar and compatible to how I
> independently came up with Muldis D's design; I'm finding more similarities
> all the time.

You may want to learn more about this first. Postgres's type system,
while relatively simple, is not as ill-conceived or primitive as you
seem to assume it is. There are also a lot of important system
details, like expanding/replacing the typmod facility, which allows
Postgres only 32 bits of type-augmenting data (such as the (1) in the
type numeric(1)).

fdr
On Fri, Nov 12, 2010 at 4:07 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> I think the best we'll do is be able to hack on some of the things that
> we actively want and have clear use cases for, such as type interfaces.
> We might have to give up on some of the more ambitious ideas that
> involve propagating interesting information through the type inference
> system; or having any real type that wasn't declared with CREATE TYPE.
> Consider that right now we bundle the element type information along
> with the array _value_.

Here are some weaknesses in the SUM aggregate that run up against the
type system. Maybe they'll help crystallize some discussion:

SUM(int2) => int4
SUM(int4) => int8
SUM(int8) => numeric

Some weaknesses:

SUM, of any precision, assumes that the precision being accumulated
into (which is also the return precision) is enough to avoid overflow.
This is generally the case, but there's no reason why it *must* be
true. I'm especially looking at the int2-to-int4 conversion. One could
imagine a more interesting scenario where overflow could occur much
more easily.

SUM always promotes types upwards in precision, and is unable to keep
types at the smallest possible precision should SUM expressions be
composed.

SUM is unable to maintain any supplementary information about
precision, i.e. to say anything interesting about the typmod, which
defeats or makes impossible many interesting optimizations.

I think that a type-interface system moves toward solving the first
couple of problems, since SUM could return some abstract type such as
"Integer" and use that to promote more aggressively (avoiding
overflow) or keep representations small (avoiding unnecessary
promotion) at run time. It might require Integer to be an abstract,
non-storable data type, though, so the current CREATE TYPE is not
going to make life easy.

The third problem is slightly different... it might require some
user-pluggable code to be called as part of semantic analysis. I have
had the idea rolling around in my head of making a Postgres type a
more robust first-class data type, and of being able to attach to a
function/aggregate another function that is responsible for getting
called during semantic analysis and returning the proper signature;
but it might be the whispers of Cthulhu, too.

fdr
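The overflow margin behind Daniel's first weakness is easy to make concrete (a quick illustrative Python check, not anything in Postgres):

```python
# How few rows of the maximum value it takes for each SUM to overflow
# its accumulator type, given the chains Daniel lists.

INT2_MAX = 2**15 - 1   # 32767
INT4_MAX = 2**31 - 1   # 2147483647
INT8_MAX = 2**63 - 1

def rows_to_overflow(element_max, accumulator_max):
    """Smallest count of `element_max` values whose sum exceeds the
    accumulator's range."""
    return accumulator_max // element_max + 1

# sum(int2) accumulating into int4 overflows after roughly 64k rows:
assert rows_to_overflow(INT2_MAX, INT4_MAX) == 65539
# sum(int4) into int8 needs on the order of 4 billion rows:
assert rows_to_overflow(INT4_MAX, INT8_MAX) == 4294967299
```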
Daniel Farina <drfarina@acm.org> writes:
> Here are some weaknesses in the SUM aggregate that run up against the
> type system. Maybe they'll help crystallize some discussion:
>
> SUM(int2) => int4
> SUM(int4) => int8
> SUM(int8) => numeric
>
> Some weaknesses:
>
> SUM, of any precision, assumes that the precision being accumulated
> into (which is also the return-precision) is enough to avoid overflow.

This is not a flaw of the type system, it's just an implementation
choice in the SUM() aggregates. We could easily have chosen wider
accumulation and/or result types.

			regards, tom lane
On Sun, Nov 14, 2010 at 7:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Farina <drfarina@acm.org> writes:
>> Here are some weaknesses in the SUM aggregate that run up against the
>> type system. Maybe they'll help crystallize some discussion:
>>
>> SUM(int2) => int4
>> SUM(int4) => int8
>> SUM(int8) => numeric
>>
>> Some weaknesses:
>>
>> SUM, of any precision, assumes that the precision being accumulated
>> into (which is also the return-precision) is enough to avoid overflow.
>
> This is not a flaw of the type system, it's just an implementation
> choice in the SUM() aggregates. We could easily have chosen wider
> accumulation and/or result types.

That's true, but there are downsides to escalating the precision so
aggressively.

The case I was thinking about in particular involves composition of
SUM. If one can assume that a relation has int4s and that the sum will
never overflow an int8 (as is done now), I don't see a great way to
optimize the following case without special exceptions in the
optimizer for particular aggregates known a priori. Here's what would
happen now:

SELECT SUM(x::int8)::numeric
FROM (SELECT SUM(x::int4)::int8 AS x
      FROM rel
      GROUP BY y) some_name;

Could be rendered, by this assumption, as:

SELECT SUM(x::int8)::int8
....(same FROM clause)

(Why would anyone write a query like this? Views. Possibly inlined SQL
UDFs, too.)

This can be measurably faster. It also more properly constrains the
result type, as numeric can also handle non-integer quantities.

I should have underscored that a positive aspect of a type-class-like
facility is that it allows declaring things like this hypothetical
Integer, backed by concrete types that might support a superset of its
functionality.

fdr
Daniel Farina wrote:
> On Sat, Nov 13, 2010 at 7:54 PM, Darren Duncan <darren@darrenduncan.net> wrote:
>> You don't have to kludge things by implementing arrays as blobs for example;
>> you can implement them as relations instead. Geospatial types can just be
>> tuples. Arrays of structured types can just be relations with an attribute
>> per type attribute. Arrays of simple types can just be unary relations.
>
> In practice, my guess is the performance of these approaches would be
> terrible for a number of workloads. I don't agree that arrays having
> their own storage implementation is a kludge: there are even operators
> like unnest that can be used to turn them back into relations.

I'm not discounting this at all. The data model can formally define
types in one way, and a DBMS can implement various well-known cases in
specialized ways that are more efficient. Arrays are a good example.
But other cases still get a default implementation. If arrays are
flexible enough, then different arrays could be implemented
differently, e.g. some as blobs and some as relations; the former
could be better for small or simple arrays, and the latter for large
or complex arrays.

> I have long thought (but never really gave voice to) there being value
> having first-class relation values, but I would also say that's
> another kettle of fish. I also want closures, and don't think that's
> completely nuts.

I think that first-class functions are important at least, if not
full-blown closures. You can define generic relational restrictions or
summaries or whatever with them; e.g., the general case of a WHERE is
a function that takes a relation and a function as input and results
in a relation; more restricted cases of WHERE can be defined with
simpler functions that take 2 relation inputs instead.

> You may want to learn more about this, first. Postgres's type system,
> while relatively simple, is not as ill-conceived or primitive as you
> seem to assume it is. There are also a lot of important system
> details, like expanding/replacing the typmod facility that only allows
> Postgres 32 bits of type-augmenting data (such as the (1) in the type
> numeric(1)).

I'm not saying that Pg's system is primitive et al. The thread starter
said it needed an overhaul, implying there are deficiencies, and I'm
suggesting some effective ways to address them.

-- Darren Duncan
On Sun, Nov 14, 2010 at 2:27 PM, Daniel Farina <drfarina@acm.org> wrote:
> On Sun, Nov 14, 2010 at 7:47 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Daniel Farina <drfarina@acm.org> writes:
>>> Here are some weaknesses in the SUM aggregate that run up against
>>> the type system. Maybe they'll help crystallize some discussion:
>>>
>>> SUM(int2) => int4
>>> SUM(int4) => int8
>>> SUM(int8) => numeric
>>>
>>> Some weaknesses:
>>>
>>> SUM, of any precision, assumes that the precision being accumulated
>>> into (which is also the return-precision) is enough to avoid
>>> overflow.
>>
>> This is not a flaw of the type system, it's just an implementation
>> choice in the SUM() aggregates. We could easily have chosen wider
>> accumulation and/or result types.
>
> That's true, but there are downsides to escalating the precision so
> aggressively.
>
> The case I was thinking about in particular involves composition of
> SUM. If one can assume that a relation has int4s and that the sum will
> never overflow an int8 (as is done now), I don't see a great way to
> optimize the following case without special exceptions in the
> optimizer for particular aggregates known a priori. Here's what would
> happen now:
>
> SELECT SUM(x::int8)::numeric
> FROM (SELECT SUM(x::int4)::int8 AS x
>       FROM rel
>       GROUP BY y) some_name;
>
> Could be rendered, by this assumption, as:
>
> SELECT SUM(x::int8)::int8
> ....(same FROM clause)
>
> (Why would anyone write a query like this? Views. Possibly inlined SQL
> UDFs, too.)
>
> This can be measurably faster. It also more properly constrains the
> result type, as numeric can also handle non-integer quantities.
>
> I should have underscored that a positive aspect of a type-class-like
> facility is that it allows declaring things like this hypothetical
> Integer, backed by concrete types that might support a superset of its
> functionality.

Like Tom, I'm not sure this is really a type-system problem. This
sounds like a complaint that operations on "numeric" are much slower
than operations on "int4" and "int8", even for values that could be
represented by either type. I think that's a valid complaint, but I
don't see how changing the type system would help. I think what you'd
need to do is optimize the existing numeric type, or provide a new
numeric-ish type with optimizations for dealing with
small-to-medium-sized integers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Nov 14, 2010 at 1:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Like Tom, I'm not sure this is really a type-system problem. This
> sounds like a complaint that operations on "numeric" are much slower
> than operations on "int4" and "int8", even for values that could be
> represented by either type. I think that's a valid complaint, but I
> don't see how changing the type system would help. I think what you'd
> need to do is optimize the existing numeric type, or provide a new
> numeric-ish type with optimizations for dealing with
> small-to-medium-sized integers.

There are other ways one might attack the performance part of the
problem, but consider the loss of information about the type from
int(2|4|8) to numeric when composing a series of sums: we know the
value produced fits the abstract notion of an Integer, but we lose
that information. The same could be said of SUM(x::numeric(1000,0))
yielding an unrestricted numeric rather than one of scale 0. Not only
is there more information available to the user, but the optimizer
should be able to benefit from that information as well. However, for
an arbitrary user-defined operator to take advantage, it seems to me
that there needs to be a hook where some reasoning can take place over
its input types to determine the return prototype at that call site.

I am not trying to suggest that there is no way to attack the details
around SUM I have mentioned without changing the type system, only
that they are limitations I encountered and worked around, where a
more powerful type system came to mind as an avenue for a simpler/more
proper solution.

fdr
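The call-site hook Daniel describes might look something like this in miniature (illustrative Python; nothing like this exists in Postgres): a function registered alongside the aggregate reasons over the argument's declared type at semantic-analysis time and returns a tighter result type.

```python
# Hypothetical result-type inference hook for SUM. Types are modeled
# as (name, precision, scale) triples; all names are made up.

def sum_result_type(arg_type):
    """Infer SUM's result type from its input's declared type."""
    name, precision, scale = arg_type
    if name == "numeric":
        # Summing scale-0 numerics can never introduce a fraction, so
        # the scale is preserved; only the precision must grow.
        return ("numeric", None, scale)  # None: precision unconstrained
    raise NotImplementedError(name)

inference_hooks = {"sum": sum_result_type}

# At semantic analysis, the call site consults the hook:
result = inference_hooks["sum"](("numeric", 1000, 0))
assert result == ("numeric", None, 0)
```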
Daniel Farina <drfarina@acm.org> writes:
> There are other ways one might be able to attack the performance part
> of the problem, but consider the loss of information about the type
> from int(2|4|8) to numeric when composing a series of sums: we know
> the value produced fits the abstract notion of an Integer, but we lose
> that information. The same could be said of SUM(x::numeric(1000,0))
> yielding an unrestricted numeric, rather than one of scale 0. Not only
> is there more information available to the user, but the optimizer
> should be able to benefit from that information as well. However, for
> an arbitrary user-defined operator to take advantage it would seem to
> me that there needs to be a hook where some reasoning can take place
> over its input types and subsequently determine the return prototype
> at that call site.

Yeah, this has been discussed before. The problem is that it's too
much work for too little reward. There are too many different possible
behaviors for functions, and nobody is going to go through and write a
subtype-inference helper function for every function in the system, or
even any usefully large fraction of the functions in the system.

What's more, the subtype definitions have largely been predetermined
for us by SQL, and predetermined in such a way that knowing the
subtype isn't really all that exciting. Is anybody willing to put in
months of work to teach the system that sum(numeric(7,0)) yields
numeric(something,0) rather than numeric(something,something)? I doubt
it. The return on that knowledge would be too small, and there are too
many cases where you couldn't deduce anything anyway.

			regards, tom lane
On Sun, Nov 14, 2010 at 11:15 AM, Daniel Farina <drfarina@acm.org> wrote:
> SUM(int2) => int4
> SUM(int4) => int8
> SUM(int8) => numeric
>
> Some weaknesses:
>
> SUM, of any precision, assumes that the precision being accumulated
> into (which is also the return-precision) is enough to avoid overflow.
> This is generally the case, but there's no reason why it *must* be
> true. I'm especially looking at the int2 to int4 conversion. One could
> imagine a more interesting scenario where overflow behavior could
> occur much more easily.

Fwiw I think he's right that sum(int2) should perhaps be redefined to
return int8. As it stands, all it would take is 64k rows to
potentially overflow. It's not super likely, but it is plausible, and
the performance penalty of using int8 wouldn't be super big either.

int4 doesn't seem like as realistic a problem, since it would take 4
billion rows, and the performance penalty for using numeric would be
much higher.

-- greg
On Sun, Nov 14, 2010 at 9:16 PM, Greg Stark <gsstark@mit.edu> wrote:
> On Sun, Nov 14, 2010 at 11:15 AM, Daniel Farina <drfarina@acm.org> wrote:
>> SUM(int2) => int4
>> SUM(int4) => int8
>> SUM(int8) => numeric
>>
>> Some weaknesses:
>>
>> SUM, of any precision, assumes that the precision being accumulated
>> into (which is also the return-precision) is enough to avoid overflow.
>> This is generally the case, but there's no reason why it *must* be
>> true. I'm especially looking at the int2 to int4 conversion. One could
>> imagine a more interesting scenario where overflow behavior could
>> occur much more easily.
>
> Fwiw I think he's right that sum(int2) should perhaps be redefined to
> return int8. As it stands, all it would take is 64k rows to
> potentially overflow. It's not super likely, but it is plausible, and
> the performance penalty of using int8 wouldn't be super big either.
>
> int4 doesn't seem like as realistic a problem, since it would take 4
> billion rows, and the performance penalty for using numeric would be
> much higher.

I think you could get the same effect by removing sum(int2)
altogether, which I wouldn't lose any sleep over.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greg Stark <gsstark@mit.edu> writes:
> Fwiw I think he's right that sum(int2) should perhaps be redefined to
> return int8. As it stands, all it would take is 64k rows to
> potentially overflow. It's not super likely, but it is plausible, and
> the performance penalty of using int8 wouldn't be super big either.

It's not unreasonable. I think the performance penalty for int8 was
higher when that choice was made than it is now --- and in particular,
on a 64-bit machine it's now pretty much negligible. On the other
hand, you can always execute sum(foo::int4) if you need a wider sum,
just as with the escape hatches for any of the other datatype choices
that aren't working for you. It's not clear that we should force a
performance loss on people who don't need it (and I can't offhand
recall *ever* hearing a complaint about sum(int2) overflowing...).

I believe what the OP was arguing for was not so much this sort of
marginal tinkering as inventing a more general notion of "integer
type" that would avoid the whole issue. The problem with that is that
we have to work within the set of types specified by the SQL standard.

			regards, tom lane