Implementing Structural Types on the JVM #19

DavePearce opened this issue Sep 26, 2017 · 2 comments

DavePearce commented Sep 26, 2017

The current pain points for the Java backend are record and union types. I'm going to discuss the former here, since it is particularly awkward. The basic question is how to efficiently implement the following on the JVM:

type point is {i32 x, i32 y}

function transpose(point p) -> {i32 x, i32 y}:
   return {x: p.y, y: p.x}

The challenge, of course, is that Java does not give us any concept similar to a structural type. An interface is perhaps the closest thing we have. There are two main schools of thought here:

Uniform Representation

In this case, we have a single type (e.g. Struct) for representing all record types. Under the hood, this could use a HashMap, for example. Thus, our above program generates this Java:

public static Struct transpose(Struct p) {
   return new Struct(p.get("y"), p.get("x"));
}
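
For concreteness, here is a minimal sketch of what such a uniform Struct might look like. The constructor shape (alternating name/value pairs) is an assumption, under which the call above would instead read new Struct("x", p.get("y"), "y", p.get("x")):

import java.util.HashMap;

// A single uniform representation for all record types (sketch). Field
// values are stored boxed under their String names.
class Struct {
   private final HashMap<String, Object> fields = new HashMap<>();

   // Construct from alternating name/value pairs, e.g. new Struct("x", 1, "y", 2).
   Struct(Object... pairs) {
      for (int i = 0; i != pairs.length; i += 2) {
         fields.put((String) pairs[i], pairs[i + 1]);
      }
   }

   Object get(String field) { return fields.get(field); }

   void put(String field, Object value) { fields.put(field, value); }
}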

It should be pretty obvious that this is not particularly efficient. Some points:

  • Field Accesses. Each one requires a method call and a HashMap lookup against a String key.
  • Data Representation. Simple data types like int must be boxed as Integer when stored in the Struct.
  • Coercions. No actual work is required to coerce between equivalent (though non-identical) types, such as Point and the anonymous record {int x, int y}.
  • Open Records. Open records require no special treatment. They "just work".
  • Effective Unions. Likewise, unions of records also require no special treatment. Again, they "just work".
  • Recursion. Recursive types are supported out-of-the-box without any additional machinery.

Improvements. We can try to improve performance by avoiding the underlying HashMap, for example by using an Object[] where each field's position determines its array index. Thus, in {int x, int y}, field x has index 0 and field y has index 1. This means a coercion is necessary when such a value flows into the type {int y, int x}, but no coercion is needed when, e.g., Point flows into {int x, int y} (note: sorting fields by name helps here). However, open records are now more complicated since, by definition, they do not correspond to an access at a known offset. We can mitigate this by including a String[] reference which identifies each field name; accessing an open record is then a linear search through this array. We could even keep this array sorted to reduce lookup time, as in the sketch below.
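
A minimal sketch of this array-based layout, assuming the schema array is sorted and shared across all instances of a given record type:

import java.util.Arrays;

// Array-backed record representation (sketch). The schema identifies each
// field by name; values[i] holds the (boxed) value of the field schema[i].
class ArrayStruct {
   final String[] schema; // sorted field names, shared between instances
   final Object[] values;

   ArrayStruct(String[] schema, Object... values) {
      this.schema = schema;
      this.values = values;
   }

   // Closed-record access: the compiler knows the index statically.
   Object get(int index) { return values[index]; }

   // Open-record access: binary search works because the schema is sorted.
   Object get(String field) {
      int i = Arrays.binarySearch(schema, field);
      if (i < 0) throw new IllegalArgumentException("unknown field: " + field);
      return values[i];
   }
}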

Class Representation

Instead of a struct type, Java provides the class type. Therefore, it makes sense to try to use this by generating classes on demand. Under this scheme, our above program becomes:

class Point { int x; int y; ... }

public static Struct0 transpose(Point p) {
   return new Struct0(p.y, p.x);
}

class Struct0 { int x; int y; ... }

Here, Point is generated as expected with appropriate fields. Furthermore, Struct0 is generated to represent the anonymous record. It should be clear that this representation offers potential efficiency gains:

  • Field Accesses. These correspond to Java field accesses and, hence, are efficient.
  • Data Representation. All datatypes can be represented in their native form, i.e. without boxing.
  • Coercions. Coercions between equivalent (though non-identical) types now require actual work. For example, between Point and the anonymous record {int x, int y} we must turn an instance of Point into an instance of Struct0 (see the sketch after this list). This is particularly onerous when identical records are used across modules; for example, Struct0 in this module is not the same class as Struct0 in another module.
  • Open Records. This representation doesn't handle open records as is, and it's unclear how best to do so. One option is to have get / set methods which accept String arguments. Lookup is then performed against an underlying String[], which is not too expensive since this array can be shared, though it does mean every record carries an extra reference.
  • Effective Unions. Likewise, unions of records are not supported without additional machinery.
  • Recursion. Recursive types are supported out-of-the-box without any additional machinery.
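
As a concrete illustration of the coercion point above, a generated coercion between Point and Struct0 might look like this (a sketch only; the backend could generate these differently):

// Coercing a Point into the structurally identical Struct0 requires
// allocating a fresh object and copying every field.
static Struct0 toStruct0(Point p) {
   return new Struct0(p.x, p.y);
}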

Whilst it seems this approach is potentially the most efficient, there are some genuine challenges. For example, how should this translate:

type Position is Point

Do we create a new class for Position or just reuse Point? Likewise, what should this translate into:

type LinkedList is null | { i32 data, LinkedList next}

It seems the best translation we could do would be this:

class Struct1 {
   int data;
   Union next;
}

Then, the type LinkedList just translates to an anonymous Union.
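
For illustration, here is one minimal shape such a Union might take (this representation is an assumption; the exact design remains open):

// Uniform representation for union values (sketch): a tag identifying the
// active case plus a (possibly boxed) payload. For LinkedList, the payload
// is either null or an instance of Struct1.
class Union {
   final int tag;      // index of the active case within the union
   final Object value; // payload for that case

   Union(int tag, Object value) { this.tag = tag; this.value = value; }
}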

Non-Uniform Representation

Finally, we come to what is potentially the ideal solution. The basic problem with the class-based representation is simply the non-uniformity between named and unnamed types. That is, we get class Point for Point but Union for LinkedList. Notice that, for most other data types, we get a "structural" representation. For example, this:

type PointArray is Point[]

function f(PointArray ps) -> (i32 r):
   ...

Translates to something like this (depending on how records are handled):

static int f(Point[] ps) { ... }

Therefore, what we want is a "structural" representation for records. This then means that all datatypes have a corresponding "structural" representation, and this is what we see in the Java source. Whilst Java doesn't give us such a structural representation, we can attempt to mimic one. In fact, we already did that above by generating the Struct0 and Struct1 types. There are two main parts:

  1. To understand how this works, we initially make a whole-program assumption. That is, we assume at compile time we have access to the whole program and, in particular, all datatypes used. The compiler then emits a "runtime" file (e.g. wy.java or similar) which contains the structural representation of all records used within the program, as sketched after this list. This provides a structural representation for every type. The user can potentially control the name of the runtime file, etc. Either way, we need to compile the program against it.

  2. We now relax the whole-program assumption and instead assume that dependencies form a tree. Thus, the runtime file provided with each dependency gives the representations needed for that dependency. For all those defined in an earlier dependency, we can simply reuse their implementations (which must exist by definition). This generally works well, though in the case of a dependency DAG we can encounter some inefficiencies.
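
For example, the emitted runtime file might look roughly as follows (the file name wy.java and the class layouts are illustrative only):

// wy.java -- generated structural representations for every record
// type used within the program (sketch).
class Struct0 { int x; int y; }         // for {i32 x, i32 y}
class Struct1 { int data; Union next; } // for {i32 data, LinkedList next}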

We now turn to a concrete example: how the type {int x, int y} is implemented. Each of our structural representations extends a common base class:

abstract class Struct {
   public abstract Object get(String field);
   public abstract void put(String field, Object value);
}

Then, our implementation looks like this:

class Struct0 extends Struct {
  public int x;
  public int y;

  public Struct0(int x, int y) { this.x = x; this.y = y; }

  public Object get(String field) {
     switch(field) {
        case "x":
            return x;
        case "y":
            return y;
        default:
            throw new UnsupportedOperationException();
     }
  }

  public void put(String field, Object value) { ... }

  public boolean equals(Object o) { ... }

  public int hashCode() { ... }
}

Whilst this is a fairly heavyweight example, it does seem to work. We might also include all valid coercions between types in the runtime, though the number of these could grow exponentially.

Finally, we might want a suitable naming scheme for our Structs. For {int x, int y} we could use Struct$ix$iy as a uniquely identifying name, as sketched below. This actually works quite well, to be fair.
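
A hedged sketch of such a naming scheme (the single-letter type codes and pre-sorted field names are assumptions):

// Build a uniquely identifying class name for a record type. For example,
// {int x, int y} gives "Struct$ix$iy", assuming "i" encodes int and the
// fields arrive sorted by name.
static String mangle(String[] typeCodes, String[] fieldNames) {
   StringBuilder name = new StringBuilder("Struct");
   for (int i = 0; i != fieldNames.length; ++i) {
      name.append('$').append(typeCodes[i]).append(fieldNames[i]);
   }
   return name.toString();
}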


DavePearce commented Sep 26, 2017

Byte Representations

For the record, I also wanted to clarify some other ideas I've had. Another option would be to compact records into byte[] arrays. This would also provide a true "structural" representation. However, this is not without its problems. In particular:

  • Recursive Types. There's no clear way to implement a recursive type. One option is to literally embed each level into the array. This makes adding new links very expensive. Potentially, we could employ a global heap or similar where recursive links become "pointers" into that heap. But, then we need garbage collection!
  • Open Records. These are problematic as we need some way to identify which field is which. I don't know how you do this.

Right, ok, there is one possibility. We implement e.g. LinkedList like so:

class Struct {
   Field[] schema;
   byte[] bytes;
   Struct[] chunks;
   //
   int getInt(int offset) { ... }
   int getInt(String name) { ... }
   //
   long getLong(int offset) { ... }
   long getLong(String name) { ... }
   //
   Struct getStruct(int chunk) { ... }
}

class Field {
   int offset; // in bits or bytes? negative for chunks
   String name;
}

There are two main advantages of this: it's more compact than the uniform representation above; and it's still a generic uniform representation (i.e. we don't end up with boundless Struct classes). Field accesses can still be made relatively efficient. For example, reading field y from {i32 x, i32 y} would correspond to getInt(32) (assuming offsets are bit counts), as sketched below. It remains unclear how to handle unions, though.
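
For example, the getInt(int) access above might be implemented along these lines (a sketch only, assuming byte-aligned fields, a big-endian layout, and offsets measured in bits):

import java.nio.ByteBuffer;

// Hypothetical body for Struct.getInt(int): reads a 32-bit field starting
// at the given bit offset within this record's backing byte[] array.
int getInt(int offset) {
   return ByteBuffer.wrap(bytes).getInt(offset / 8);
}

Reading field y of {i32 x, i32 y} is then getInt(32), i.e. the four bytes starting at byte offset 4.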

Wow, this actually could work!!!

DavePearce added a commit that referenced this issue Sep 27, 2017
The old runtime classes are no longer needed.  At the moment, we just
have the uniform type "Struct".
DavePearce commented

Improved Non-Uniform Representation

Another idea has occurred to me. One of the problems with the non-uniform representation is that there are so many potential combinations that we cannot pregenerate them all; in particular, the space of field names is gigantic. But we can attack this problem by employing uniform field names.

For example, the record {i32 x, i32 y} can boil down to class Struct { int f1; int f2; }; in fact, any record involving two i32 fields can. Since the type system knows the underlying types, it also knows the mapping from source field names to f1/f2. This allows direct field access rather than going via an array. As before, we can use a schema to handle open record lookup.
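
As a hedged illustration (assuming x maps to f1 and y to f2), the earlier transpose example would then compile against the shared Struct$ii class rather than a bespoke per-record class:

// Shared implementation for any record of exactly two int-compatible fields.
class Struct$ii {
   int f1;
   int f2;

   Struct$ii(int f1, int f2) { this.f1 = f1; this.f2 = f2; }
}

// transpose under the assumed mapping x -> f1, y -> f2.
public static Struct$ii transpose(Struct$ii p) {
   return new Struct$ii(p.f2, p.f1);
}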

We can further reduce the space of Struct implementations in two ways:

  1. All primitives such as bool, byte, i16, i32, etc. can compile to Java's int.
  2. For non-primitives we use general Object fields. We can sort fields so that all primitives come first and non-primitives after.

So, the collection of Struct implementations would be, e.g.:

class Struct$i { int f1; }
class Struct$o { Object f1; }
class Struct$ii { int f1, f2; }
class Struct$io { int f1; Object f2; }
class Struct$iii { int f1, f2, f3; }
class Struct$iio { int f1, f2; Object f3; }
class Struct$ioo { int f1; Object f2, f3; }

This really reduces the space of possible implementations to the point where we could pre-generate all implementations of interest. Some issues:

  • Type Testing. It's not completely clear how this would work with type testing. Also, it's not clear whether it needs to, since type testing really applies more to open records.
  • Schemas. These are now somewhat more complicated: multiple schemas can map to the same Struct implementation. For example, {int x, int[] y} and {int[] x, int y} both map to Struct$io. Therefore, schema lookup needs to account for this (see the sketch after this list).
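
A hypothetical Schema class makes this concrete: both record types share Struct$io but carry different Schema instances (the class shape here is an assumption):

// Maps source-level field names to the f1/f2/... slots of a shared Struct
// class. For example, {int x, int[] y} carries new Schema("x", "y") while
// {int[] x, int y} carries new Schema("y", "x"), since primitives sort first.
class Schema {
   final String[] names; // names[i] is the source field stored in slot i+1

   Schema(String... names) { this.names = names; }

   int slotOf(String field) {
      for (int i = 0; i != names.length; ++i) {
         if (names[i].equals(field)) {
            return i;
         }
      }
      throw new IllegalArgumentException("unknown field: " + field);
   }
}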

But, overall, this approach has clear merit.
