Skip to content

Implementing Structural Types on the JVM #19

Open
@DavePearce

Description

@DavePearce

The current pain points for the Java backend are: record and union types. I'm going to discuss the former here, since that is particularly awkward. The basic question is how to implement efficiently the following on the JVM:

type point is {i32 x, i32 y}

function transpose(point p) -> {i32 x, i32 y}:
   return {x: p.y, y: p.x}

The challenge, of course, is that Java does not give us any concept similar to a structural type. An interface is perhaps the closest thing we have. There are two main schools of thought here:

Uniform Representation

In this case, we have a single type (e.g. Struct) for representing all record types. Under the scenes, this could use a HashMap for example. Thus, our above program generates this Java:

static public Struct transpose(Struct p) {
   return new Struct(p.get("y"),p.get("x"));
}

It should be pretty obvious that this is not particularly efficient. Some points:

  • Field Accesses. They require a method call and HashMap look up against a String key.
  • Data Representation. Simple data types like int must be boxed as Integer when stored in the Struct.
  • Coercions. Coercions between equivalent (though non-identical) types don't require actual coercions. For example, between Point and the anonymous record {int x, int y}.
  • Open Records. Open records require no special treatment. They "just work".
  • Effective Unions. Likewise, unions of records also require no special treatment. Again, they "just work".
  • Recursion. Recursive types are supported out-of-the-box without any additional machinery.

Improvements. We can try to improve performance by avoiding an underlying HashMap. For example, by using an Object[] where each field position determines the array index. Thus, in {int x, int y}, field x has index 0 and field y has index 1. This means a coercion is necessary when flowing into a type {int y, int x}, but no coercion is needed when e.g. Point flows into {int x, int y} (note: sorting helps here). However, open records are now more complicated as, by definition, these do not correspond to an access with a known offset. We can mitigate this by including a String[] reference which identifies each field name. Then, accessing an open record is a linear search through this array. We could try to sort this even, to reduce the overall time.

Class Representation

Instead of a struct type, Java provides the class type. Therefore, it makes sense to try and use this by generating classes on demand. Under this scheme, our above program becomes:

class Point is { int x, int y, ... }

static public Struct0 transpose(Point p) {
   return new Struct0(p.y,p.x);
}

class Struct0 is { int x, int y, ...}

Here, Point is generated as expected with appropriate fields. Furthermore, Struct0 is generated to represent the anonymous record. It should be clear that this representation offer potential efficiency gains:

  • Field Accesses. These correspond to Java field accesses and, hence, are efficient.
  • Data Representation. All datatypes can be represented in their native form, i.e. without boxing.
  • Coercions. Coercions between equivalent (though non-identical) types require coercions. For example, between Point and the anonymous record {int x, int y} we must turn an instance of Point into an instance of Struct0. This is particular ominous when identical records are used across module. For example, Struct0 in this module is not the same as Struct0 in another module.
  • Open Records. This representation doesn't handle open records as is, and it's unclear how best to do this. One option is to have get / set methods which accept String arguments. Then, lookup is performed on an underlying String[] which is not too expensive as this can be shared. Though it means every record has an extra reference.
  • Effective Unions. Likewise, unions of records are not supported without additional machinery.
  • Recursion. Recursive types are supported out-of-the-box without any additional machinery.

Whilst it seems this approach is potentially the most efficient, there are some genuine challenges. For example, how does this translate:

type Position is Point

Do we create a new class for Position or just reuse Point? LIkewise, what does this translate into:

type LinkedList is null | { i32 data, LinkedList next}

It seems the best translation we could do would be this:

class Struct1 {
   int data;
   Union next
}

Then, the type LinkedList just translates to an anonymous Union.

Non-Uniform Representation

Finally, we come to what is potentially the ideal solution. The basic problem with the class-based representation is simply the non-uniformity between named and unnamed types. That is, we get class Point for Point but Union for LinkedList. Notice that, for most other data types, we get a "structural" representation. For example, this:

type PointArray is Point[]

function f(PointArray ps) -> (i32 r):
   ...

Translates to something like this (depending on how records are handled):

static int f(Point[] ps) { ... }

Therefore, what we want is a "structural" representation for records. This then means that all datatypes have a corresponding "structural" representation and this is what we see in the Java source. Whilst Java doesn't give us such a structural representation, we can attempt to mimick one. In fact, we already did that above by generating the Struct0 and Struct1 types. There are two main parts:

  1. To understand how this works, we initially make a whole program assumption. That is, we assume at compile time we have access to the whole program and, in particular, all datatypes used. The compiler then emits a "runtime" file (e.g. wy.java or similar) which contains the structural representation of all records used within the program. This provides a structural representation for every type. The user can potentially control the name of the runtime file, etc. Eitherway, we need to compile the program or against it.

  2. We now relax the whole program assumption. We assume that dependencies form a tree. Thus, the runtime file provided with each dependency gives the representations needed for that dependency. For all those defined in an earlier dependency we can just reuse their implementations (as they must exist by definition). This generally works well, though in the case of a dependency dag we can encounter some inefficiencies.

We now consider a concrete example for illustration. Let us consider how the type {int x, int y} is implemented. Each of our structural representations extend a common base class:

abstract class Struct {
   public abstract Object get(String field);
   public abstract void put(String field, Object);
}

Then, our implementation looks like this:

class Struct0 extends Struct {
  public int x;
  public int y;

  public Struct0(int x, int y) { this.x = x; this.y = y; }

  public Object get(String field) {
     switch(field) {
        case "x":
            return x;
        case "y":
            return y;
        default:
            throw new UnsupportedOperationException();
     }
  }

  public void put(String field, Object value) { ... }

  public boolean equals(Object o) { ... }

  public int hashCode() { ... }
}

Whilst this is a fairly heavy weight example, it does seem to work. We might also include all valid coercions betwee types in the runtime, though this could grow exponentially.

Finally, we might want a suitable naming scheme for our Structs. For {int x, int y} we could use Struct$ix$iy as a uniquely identifying name. This actually works quite well, to be fair.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions