Implementing Structural Types on the JVM

The current pain points for the Java backend are: _record_ and _union_ types.  I'm going to discuss the former here, since that is particularly awkward.  The basic question is how to implement efficiently the following on the JVM:

```
type point is {i32 x, i32 y}

function transpose(point p) -> {i32 x, i32 y}:
   return {x: p.y, y: p.x}
```
The challenge, of course, is that Java does not give us any concept similar to a structural type.  An `interface` is perhaps the closest thing we have.  There are two main schools of thought here:

#### Uniform Representation 

In this case, we have a single type (e.g. `Struct`) for representing all record types.  Under the scenes, this could use a `HashMap` for example.  Thus, our above program generates this Java:

```
static public Struct transpose(Struct p) {
   return new Struct(p.get("y"),p.get("x"));
}
```

It should be pretty obvious that this is not particularly efficient.  Some points:

* _Field Accesses_.  They require a method call and `HashMap` look up against a `String` key.
* _Data Representation_.  Simple data types like `int` must be boxed as `Integer` when stored in the `Struct`.
* _Coercions_.  Coercions between equivalent (though non-identical) types don't require actual coercions.  For example, between `Point` and the anonymous record `{int x, int y}`.
* _Open Records_.  Open records require no special treatment.  They "just work".
* _Effective Unions_.  Likewise, unions of records also require no special treatment.  Again, they "just work".
* _Recursion_.  Recursive types are supported out-of-the-box without any additional machinery.

**Improvements**. We can try to improve performance by avoiding an underlying `HashMap`.  For example, by using an `Object[]` where each field position determines the array index.  Thus, in `{int x, int y}`, field `x` has index `0` and field `y` has index `1`.  This means a coercion is necessary when flowing into a type `{int y, int x}`, but no coercion is needed when e.g. `Point` flows into `{int x, int y}` (*note*: sorting helps here).  However, open records are now more complicated as, by definition, these do not correspond to an access with a known offset.  We can mitigate this by including a `String[]` reference which identifies each field name.  Then, accessing an open record is a linear search through this array.  We could try to sort this even, to reduce the overall time.

#### Class Representation

Instead of a `struct` type, Java provides the `class` type.  Therefore, it makes sense to try and use this by generating classes on demand.  Under this scheme, our above program becomes:

```
class Point is { int x, int y, ... }

static public Struct0 transpose(Point p) {
   return new Struct0(p.y,p.x);
}

class Struct0 is { int x, int y, ...}
```

Here, `Point` is generated as expected with appropriate fields.  Furthermore, `Struct0` is generated to represent the anonymous record.  It should be clear that this representation offer potential efficiency gains:

* _Field Accesses_.  These correspond to Java field accesses and, hence, are efficient.
* _Data Representation_.  All datatypes can be represented in their native form, i.e. without boxing.
* _Coercions_.  Coercions between equivalent (though non-identical) types require coercions.  For example, between `Point` and the anonymous record `{int x, int y}` we must turn an instance of `Point` into an instance of `Struct0`.  This is particular ominous when identical records are used across module.  For example, `Struct0` in this module is not the same as `Struct0` in another module.
* _Open Records_.  This representation doesn't handle open records as is, and it's unclear how best to do this.  One option is to have `get` / `set` methods which accept `String` arguments.  Then, lookup is performed on an underlying `String[]` which is not too expensive as this can be shared.  Though it means every record has an extra reference.
* _Effective Unions_.  Likewise, unions of records are not supported without additional machinery.
* _Recursion_.  Recursive types are supported out-of-the-box without any additional machinery.

Whilst it seems this approach is potentially the most efficient, there are some genuine challenges.  For example, how does this translate:

```
type Position is Point
```

Do we create a new class for `Position` or just reuse `Point`?  LIkewise, what does this translate into:

```
type LinkedList is null | { i32 data, LinkedList next}
```

It seems the best translation we could do would be this:

```
class Struct1 {
   int data;
   Union next
}
```

Then, the type `LinkedList` just translates to an anonymous `Union`.  

#### Non-Uniform Representation

Finally, we come to what is potentially the ideal solution.  The basic problem with the class-based representation is simply the non-uniformity between named and unnamed types.  That is, we get `class Point` for `Point` but `Union` for `LinkedList`.  Notice that, for most other data types, we get a "structural" representation.  For example, this:

```
type PointArray is Point[]

function f(PointArray ps) -> (i32 r):
   ...
```

Translates to something like this (depending on how records are handled):

```
static int f(Point[] ps) { ... }
```

Therefore, what we want is a "structural" representation for records.  This then means that all datatypes have a corresponding "structural" representation and this is what we see in the Java source.  Whilst Java doesn't give us such a structural representation, we can attempt to mimick one.  In fact, we already did that above by generating the `Struct0` and `Struct1` types.  There are two main parts:

1. To understand how this works, we initially make a **whole program assumption**.  That is, we assume at compile time we have access to the whole program and, in particular, all datatypes used.  The compiler then emits a "runtime" file (e.g. `wy.java` or similar) which _contains the structural representation of all records used within the program_.  This provides a structural representation for every type.  The user can potentially control the name of the runtime file, etc.  Eitherway, we need to compile the program or against it.

2. _We now relax the whole program assumption_.  We assume that dependencies form a tree.  Thus, the runtime file provided with each dependency gives the representations needed for that dependency.  For all those defined in an earlier dependency we can just reuse their implementations (as they must exist by definition).  This generally works well, though in the case of a dependency dag we can encounter some inefficiencies.

We now consider a concrete example for illustration.  Let us consider how the type `{int x, int y}` is implemented.  Each of our structural representations extend a common base class:
```
abstract class Struct {
   public abstract Object get(String field);
   public abstract void put(String field, Object);
}
```

Then, our implementation looks like this:

```
class Struct0 extends Struct {
  public int x;
  public int y;

  public Struct0(int x, int y) { this.x = x; this.y = y; }

  public Object get(String field) {
     switch(field) {
        case "x":
            return x;
        case "y":
            return y;
        default:
            throw new UnsupportedOperationException();
     }
  }

  public void put(String field, Object value) { ... }

  public boolean equals(Object o) { ... }

  public int hashCode() { ... }
}
```

Whilst this is a fairly heavy weight example, it does seem to work.  We might also include all valid coercions betwee types in the runtime, though this could grow exponentially.

Finally, we might want a suitable naming scheme for our `Structs`.  For `{int x, int y}` we could use `Struct$ix$iy` as a uniquely identifying name.  This actually works quite well, to be fair.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implementing Structural Types on the JVM #19

Uniform Representation

Class Representation

Non-Uniform Representation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Implementing Structural Types on the JVM #19

Description

Uniform Representation

Class Representation

Non-Uniform Representation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions