[Enhancement] Matlab-like type system #21

Merged
merged 58 commits into from
Mar 12, 2025

Conversation

balbasty
Contributor

@balbasty balbasty commented Feb 3, 2025

Implementation of #20

I am leaving this as a draft PR for now.

Todo:

  • Unit tests for matlab-like indexing.
  • Unit tests for (nested) conversion to/from runtime objects.

The MatlabClassWrapper test fails on my side, but I don't think I've modified that class. (fixed)

@johmedr johmedr marked this pull request as ready for review February 4, 2025 19:50
@johmedr johmedr self-requested a review February 4, 2025 19:50
Collaborator

@johmedr johmedr left a comment

Thank you for implementing all these changes @balbasty -- I think this is definitely a big step forward for our type system!

I left a few comments in the review, in particular about attaching helper functions as static methods to the class they belong to, to keep a clean namespace for people doing from spm import *. I think the main implementations (e.g., for converting a cell to a num array) should be attached to the class they belong to (e.g., Array.from_cell(...) or Cell.as_array()), with the possibility of adding helper wrappers for users (e.g., num2cell) that just rely on this implementation and that can be imported selectively (e.g., from spm.helpers import *). This would keep things easy for Matlab users while allowing more Python-like syntax, e.g.

my_cell.as_array().as_type(dtype=np.uint8).reshape((4,3))

The second point is that we need to test that this type system actually interfaces to and from Matlab without issues. The tests do not cover that yet. A simple check is to construct an identity function and test for equality:

idt = Runtime.call('eval',  '@(x) x')
assert Runtime.call(idt, s) == s

I think we need to get this into the tests before approving the PR.
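
The round-trip check above could be wrapped in a small helper so that it runs over many values. This is only a sketch: `assert_roundtrip` and `FakeRuntime` are hypothetical names, and `FakeRuntime` is a pure-Python stand-in so that the snippet runs without a Matlab Runtime. In the real tests, `Runtime.call` would be passed instead.

```python
def assert_roundtrip(call, value):
    """Send `value` through an identity function and compare the result."""
    idt = call('eval', '@(x) x')   # function handle for the identity
    result = call(idt, value)
    assert result == value, f"round trip changed {value!r} into {result!r}"
    return result


class FakeRuntime:
    """Hypothetical stand-in that mimics only what this sketch needs."""

    @staticmethod
    def call(func, *args):
        if func == 'eval' and args == ('@(x) x',):
            return lambda x: x          # pretend function handle
        if callable(func):
            return func(*args)
        raise NotImplementedError(func)


for sample in (1.0, "text", [1, 2, 3], {"a": [1, 2]}):
    assert_roundtrip(FakeRuntime.call, sample)
```

For array-like values, `==` is elementwise in NumPy, so the comparison would probably need something like `np.array_equal` instead.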

Other than this, everything looks good, thank you for this :)

# ----------------------------------------------------------------------


def cell(*iterable):
Collaborator

I would suggest overloading object constructors instead of implementing new type helpers. One reason is that it links constructors to the objects they belong to. As most spm functions have the format spm_xxx, one would do from spm import * for convenience, and should be able to use most of these helper functions without having a cluttered namespace.
Alternatives would be:

  • several possible syntaxes in __init__, e.g., __init__(self, shape_or_iterable, ...)
  • static methods to implement specific constructors, we could have Cell.from_iterable(iterable)

Contributor Author

Current constructors do indeed follow the __init__(self, shape_or_iterable, ...) syntax. However, it leaves room for ill-defined cases:

  • Cell([1, 2, 3]) generates a cell-array of size 1x2x3 so, how to create a cell that contains 1, 2, 3 -- i.e. {1, 2, 3} in matlab?
  • Cell([[3, 4], [5, 6]]) generates a cell-array of size 2x2 that contains the values 3, 4, 5, 6, so how to create a cell of cells -- i.e. {{3, 4}, {5, 6}} in matlab?

So the cell(...) helper (which resolves these issues) should probably become the Cell.from_iterable(iterable) class method that you propose.

Alternatively, we could have specific methods to construct cell/struct/num arrays of a given shape, like Cell.from_shape(shape) or Cell.empty(shape), and always trigger a copy-construction when using the Cell(array_like) syntax. But it deviates a bit from the matlab syntax, so I am not a fan.

Do we agree that we should implement the Cell.from_iterable(iterable) syntax?
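
To make the ambiguity concrete, here is a minimal sketch of the two construction paths. The `Cell` below is a toy wrapper around a NumPy object array, not the class from this PR, and it only handles the "shape" form of the constructor.

```python
import numpy as np


class Cell:
    """Toy cell: a thin wrapper around a NumPy object array."""

    def __init__(self, *shape):
        # Cell(1, 2, 3) and Cell([1, 2, 3]) both mean "shape 1x2x3".
        if len(shape) == 1 and isinstance(shape[0], (list, tuple)):
            shape = tuple(shape[0])
        self.data = np.empty(shape, dtype=object)

    @classmethod
    def from_iterable(cls, iterable):
        """Build a 1D cell whose elements are the items themselves."""
        items = list(iterable)
        obj = cls(len(items))
        for i, item in enumerate(items):
            if isinstance(item, (list, tuple)):
                # Nested lists become nested cells, not extra dimensions.
                item = cls.from_iterable(item)
            obj.data[i] = item
        return obj

    @property
    def shape(self):
        return self.data.shape


print(Cell([1, 2, 3]).shape)                          # (1, 2, 3)
print(Cell.from_iterable([1, 2, 3]).shape)            # (3,)
print(Cell.from_iterable([[3, 4], [5, 6]]).shape)     # (2,)
```

With a dedicated class method, the constructor keeps its Matlab-like "shape" meaning, and the "contents" meaning has an unambiguous entry point.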

Collaborator

I think this is a really good point!

Do we agree that we should implement the Cell.from_iterable(iterable) syntax?

Yes, that's my preferred option, if there are no strong arguments against it.

return obj


def num2cell(array):
Collaborator

Similar to above, I think Array.from_cell could be preferred. Maybe we could move these explicit translations of Matlab functions to a different subpackage, e.g., from spm.cheats import *.

Contributor Author

Sounds good!

return asmatlab(Struct(other))


def asmatlab(other):
Collaborator

Thank you for clearing that up :)
I think I would still prefer to link it to some class (e.g., Runtime), to keep things ordered. In particular, this method should be hidden (e.g., _as_matlab), as it could be used by the advanced user to debug their code but should not appear in the scope by default. What do you think?

Contributor Author

My thinking was that some users may want to convert python/json structures to our types, e.g.

  • asmatlab([{"a": 1}, [0, 1, 2]]) returns a Cell that contains a Struct and an Array.

But I agree that may not be needed (especially as users can directly pass such a structure to an spm_* function without needing the explicit conversion).
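
As an illustration of the recursive conversion described here, a minimal sketch with stand-in types could look as follows. `to_types`, and the toy `Struct` and `Cell` classes, are hypothetical and much simpler than the PR's classes; a plain `np.ndarray` plays the role of `Array`.

```python
import numbers
import numpy as np


class Struct(dict):
    """Toy stand-in for the PR's Struct."""


class Cell(list):
    """Toy stand-in for the PR's Cell."""


def to_types(obj):
    """Recursively convert JSON-like Python data (hypothetical helper)."""
    if isinstance(obj, dict):
        return Struct({key: to_types(val) for key, val in obj.items()})
    if isinstance(obj, (list, tuple)):
        if obj and all(isinstance(x, numbers.Number) for x in obj):
            return np.asarray(obj)             # list of numbers -> numeric array
        return Cell(to_types(x) for x in obj)  # anything else -> cell
    if isinstance(obj, numbers.Number):
        return np.asarray(obj)                 # scalar -> 0-d numeric array
    return obj                                 # unknown objects pass through


out = to_types([{"a": 1}, [0, 1, 2]])
print(type(out).__name__, [type(x).__name__ for x in out])
# Cell ['Struct', 'ndarray']
```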

)


def _as_matlab_object(obj):
Collaborator

Maybe we can rename this to _from_matlab to relate to _as_matlab.

Collaborator

Also, this makes me think we could just create an abstract base class with these two abstract methods, in order to unify MatlabClassWrapper and the Matlab types you proposed.
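
A rough sketch of such a base class, using the standard `abc` module. `MatlabType` and the toy `Text` subclass are hypothetical names; `type__` and `data__` are the field names mentioned in this review, but the `"text"` tag is made up for the example.

```python
from abc import ABC, abstractmethod


class MatlabType(ABC):
    """Hypothetical common base for wrapped classes and the new types."""

    @abstractmethod
    def _as_matlab(self):
        """Return a representation that can be passed to the runtime."""

    @classmethod
    @abstractmethod
    def _from_matlab(cls, obj):
        """Build an instance from what the runtime returned."""


class Text(MatlabType):
    """Toy subclass used only to exercise the interface."""

    def __init__(self, value):
        self.value = value

    def _as_matlab(self):
        return {"type__": "text", "data__": self.value}  # made-up tag

    @classmethod
    def _from_matlab(cls, obj):
        return cls(obj["data__"])


print(Text._from_matlab(Text("hi")._as_matlab()).value)   # hi
```

A subclass that forgets either method cannot be instantiated, which would catch incomplete types early.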

obj = obj.reshape(shape)
return obj.tolist()
return obj
# TODO: what about sparse numpy arrays? Does matlab understand them?
Collaborator

TODO: what about sparse numpy arrays? Does matlab understand them?

No, it does not. Matlab does not understand all types, and the trick I have used is to label data with the Matlab type they should have.

Contributor Author

OK so if the user passes a sparse numpy array as input, we should convert it to a dict with type__="sparse"?
Do you have documentation on the expected layout/content of the data__ field in this case?

Collaborator

Currently, sparse arrays are converted to dense arrays and passed in the data__ field; we could find a more efficient way to handle this in the future.

if isinstance(other, Array):
return other

if isinstance(other, Cell):
Collaborator

Here, I am surprised the tests work: Matlab only supports cell vectors of shape (M, 1) or (1, N), not cell arrays of shape (M, N). For this, I have used the "type__" field of a struct to specify their type. All of these types of translations are made here. This makes me think the tests need to be updated as well.

Contributor Author

I think there's a misunderstanding about the use of the asmatlab function. It does not convert objects to representations that we can pass to the runtime -- I use the _as_matlab_object for this purpose.

Instead, asmatlab converts python objects to our type system (recursively).
Hence why this function should be renamed.

Essentially there are two functions

  • One that converts python objects to "Our types"
  • One that converts python objects to "Matlab-compatible types"

| Matlab | Matlab-compatible types | Our types |
|---|---|---|
| Nx... {single, double, ...} | {ndarray, matlab.single, matlab.double, ...} | Array(N, ...) |
| 1x1 {single, double, ...} | {double, int, complex, bool} | Array() |
| 1xN cell | {list, tuple, set} | Cell(N) |
| NxMx... cell | dict(type__="cellarray") | Cell(N, M, ...) |
| 1x1 struct | dict | Struct() |
| Nx... struct | dict(type__="structarray") | Struct(N, ...) |

Is that it?

* `a.as_cell[x,y] = any` indicates that `a` is a cell array;
* `a.as_struct[x,y].f = any` indicates that `a` is a struct array;
* `a.as_cell[x,y].f = any` indicates that `a` is a cell array that contains a struct;
* `a.as_num[x,y] = num` indicates that `a` is a numeric array.
Collaborator

Amazing, thank you for putting in examples :)

# * we should hide all math functions.
# * `__add__`, `__iadd__`, `__radd__` should fallback to `extend()`,
# as in tuples, rather than `np.add` as in arrays.
# * "scalar" cells should be forbidden (already the case at
Collaborator

I don't agree with this: in Matlab, both {} and {'a'} would create cells. Users should have a way to specify scalar cells, as they might appear in a matlabbatch specification.

Contributor Author

I think we agree. A Cell with zero or one element can exist, but it must have at least one dimension -- its shape must be (0,) or (1,). What I call a "scalar cell" (and think should be forbidden) is a Cell whose shape is the empty tuple (), which could be created by doing something like Cell(1).reshape([]).

In contrast, "scalar" Array and Struct are allowed:

  • a "scalar array" is the same as e.g. np.asarray(1.0), which is identical to the python scalar 1.0 for all purposes.
  • a "scalar struct" is the same as a dictionary (with dot access to fields)
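
The distinction between shapes (), (0,) and (1,) can be checked directly in NumPy. The snippet below is a sketch: `check_not_scalar` is a hypothetical guard, and plain object arrays stand in for `Cell`.

```python
import numpy as np

scalar = np.asarray(1.0)
print(scalar.shape, scalar.ndim, float(scalar) == 1.0)   # () 0 True


def check_not_scalar(cell_data):
    """Reject 0-d cells; shapes such as (0,) or (1,) are fine."""
    if np.ndim(cell_data) == 0:
        raise ValueError("cells must have at least one dimension")
    return cell_data


one = np.empty(1, dtype=object)       # shape (1,): allowed
check_not_scalar(one)
zero_d = one.reshape(())              # shape (): the forbidden "scalar cell"
```

Calling `check_not_scalar(zero_d)` raises `ValueError`, which is the kind of check that `reshape` and `view` would need.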

Collaborator

I see, I had the wrong idea of what a scalar cell is. This sounds good!


# TODO:
# I am thinking that cells should behave more like
# "multidimensional tuples/lists" than "ndarray".
Collaborator

I totally agree with this, they should behave like multidimensional lists (mutability is necessary).

# * "scalar" cells should be forbidden (already the case at
# construction, but we should also check that reshape/view do
# not return scalar cells).
# * maybe we should inherit from `tuple` (or `list`?).
Collaborator

That could be a nice idea. As you mentioned below, we would have to change the insert/pop/remove methods.

@johmedr
Collaborator

johmedr commented Feb 5, 2025

Following up on the asmatlab function, doing things like Runtime.call('eval', '@(x) x') does not work, as it returns a function handle, which is not handled by asmatlab. I would suggest that we handle unknown objects by simply returning them to the user. After all, if the Matlab Runtime did not complain about the conversion, neither should we!

@balbasty
Contributor Author

balbasty commented Feb 5, 2025

Thanks for the thorough review @johmedr! I've replied to a few comments. I should have some time next week to implement your suggested changes.

It would be nice to have the identity function in the bindings, indeed.
Also, maybe implement a few functions that either return known objects, or expect known objects and check their value?

Or, can we call pure matlab functions through the binding? For example, can I call Runtime.instance().mpython_endpoint("cell", (2, 1))?

@johmedr
Collaborator

johmedr commented Feb 7, 2025

Thanks for the thorough review @johmedr! I've replied to a few comments. I should have some time next week to implement your suggested changes.

Great, thank you! I'll try to update the tests asap so that they pass data to the Matlab Runtime and back.

It would be nice to have the identity function in the bindings, indeed. Also, maybe implement a few functions that either return known objects, or expect known objects and check their value?

Or, can we call pure matlab functions through the binding? For example, can I call Runtime.instance().mpython_endpoint("cell", (2, 1))?

We can call Runtime.call, which is basically feval with conversion of argument types. Calling Runtime.instance().mpython_endpoint bypasses the type conversion on the Python side but not on the Matlab side (that's what mpython_endpoint is essentially doing).

@balbasty
Contributor Author

The type system is now in close-to-final form, but needs many more unit tests.

One final important point to deal with: currently, Cell(list[list]) returns a Cell array, but lists are used by matlab to indicate 1D cells, so a list[list] should really be a Cell[Cell]. It means I must not use asanyarray in the cell constructor. We also probably need a method to transform a Cell[Cell] into a deep cell array. Maybe deepcat()?
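
For discussion, here is one way the proposed deepcat() could behave. This is a hypothetical sketch: plain nested lists stand in for Cell[Cell], the result is a NumPy object array rather than a Cell, and only two levels of nesting are handled.

```python
import numpy as np


def deepcat(nested):
    """Stack a sequence of equal-length sequences into a 2D object array."""
    rows = [list(row) for row in nested]
    width = len(rows[0]) if rows else 0
    if any(len(row) != width for row in rows):
        raise ValueError("all inner cells must have the same length")
    out = np.empty((len(rows), width), dtype=object)
    for i, row in enumerate(rows):
        for j, item in enumerate(row):
            out[i, j] = item          # element-wise, so items are never unpacked
    return out


grid = deepcat([[3, "a"], [None, [5, 6]]])
print(grid.shape)   # (2, 2)
```

Filling the array element by element avoids the automatic unpacking that `np.asanyarray` would perform on nested lists.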

I have renamed quite a few methods. @johmedr can you please go through the code and check that the API suits you?

The final thing that is not renamed is MatlabClassWrapper, which I would vote to rename MatlabClass.

@balbasty
Contributor Author

balbasty commented Feb 24, 2025

One of the tests fails (but was not failing locally). It seems the np.ndarray.data field of a struct array is accessed, but it is hidden, so the Struct thinks it is a new key that must be defined and adds it to its dictionary. Not sure where that call to .data happens.

Fixed by 5a5dbaa

@balbasty
Contributor Author

balbasty commented Feb 24, 2025

Also, I've added check_finalized in delayed arrays, so that if a delayed array that is already finalized is accessed, an error is raised. This should cover the "bad" use cases you mentioned during the meeting.

@balbasty
Contributor Author

It's starting to look good. I've put a couple of questions/comments in the code, but it would be good if someone other than me tried it now :)

@johmedr
Collaborator

johmedr commented Mar 7, 2025

Also, it seems that flat iterators on structs are infinite, and I don't really know why. Any idea what's happening?

@balbasty
Contributor Author

balbasty commented Mar 7, 2025

I had the same issue with __iter__, which I had to overload. I think it's because numpy relies on IndexError to stop its iteration. But we don't raise them anymore thanks to implicit resizing 😅
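
The mechanism can be reproduced in plain Python, where an object that only defines `__getitem__` is iterated by calling it with 0, 1, 2, ... until `IndexError` is raised. The classes below are toy illustrations with made-up names, not the Struct code.

```python
from itertools import islice


class Growing:
    """Toy container whose __getitem__ never raises IndexError."""

    def __init__(self):
        self.items = []

    def __getitem__(self, index):
        while len(self.items) <= index:   # implicit resizing
            self.items.append(None)
        return self.items[index]


class GrowingFixed(Growing):
    def __iter__(self):
        # Snapshot the current contents so that iteration is bounded.
        return iter(list(self.items))


g = Growing()
g[2]                                  # grows to three elements
print(len(list(islice(g, 10))))       # 10: iteration would never stop on its own
f = GrowingFixed()
f[2]
print(len(list(f)))                   # 3
```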

@johmedr
Collaborator

johmedr commented Mar 7, 2025

Here is what I get:

>>> Struct(1).flat # stuck
>>> [*Struct(1).flat] # stuck
>>> s = Struct()
>>> s.foo = 'bar' 
>>> s.flat
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/johan/Documents/Python/spm-python/spm/__wrapper__.py", line 551, in _error_is_not_finalized
    raise IndexOrKeyOrAttributeError(
spm.__wrapper__.IndexOrKeyOrAttributeError: 'This DelayedArray has not been finalized, and you are attempting to use it in a way that may break its finalization cycle.\nIt most likely means that you are indexing out-of-bounds without *setting* the out-of-bound value.\n* Correct usage:   a.b(i).c = x\n* Incorrect usage: x = a.b(i).c\n'
>>> [*s.flat] # stuck

@balbasty
Contributor Author

balbasty commented Mar 8, 2025

Ha that's different. I did not include "flat" as one of the authorised methods or properties in a Struct. I tried to keep them at a minimum so that they can be used as keys instead.

If you want to add flat to the list, you can add it to the class variable defined at the beginning of Struct.

@johmedr
Collaborator

johmedr commented Mar 10, 2025

@balbasty, here are the remaining failing tests:

Cells have one thing left to fix:

>>> Runtime.call('cell', 1,3)
Cell([Cell([]), Cell([]), Cell([])]) 

should be

>>> Runtime.call('cell', 1,3)
Cell([Array([]), Array([]), Array([])]) 

and same for cell(3,1)

Structs have one thing to fix

Doing

s = Struct()
s.foo = "bar"
s.bar = "baz"
s[1].baz = 42 

does not initialize the missing fields in subslices:

>>> list(s[1].keys())
['baz']

Any fixes in mind?
In addition, sparse arrays are now returned as indices + values!

@balbasty
Contributor Author

@johmedr I believe I have fixed everything.

I had to change some tests (mostly replaced 1D indexing of column vectors with 2D indexing). Please check that you're fine with their behaviour.

One remaining discrepancy is that Runtime.call("cell", 1, 3) returns a cell of empty arrays, where each empty array has shape [0, 0], whereas I use empty arrays of shape [0] when defining default elements on the python side. Should I use [0, 0] to better mimic matlab?

@johmedr
Collaborator

johmedr commented Mar 12, 2025

Amazing, thank you so much! Happy with the changes you made.

One remaining discrepancy is that Runtime.call("cell", 1, 3) returns a cell of empty arrays, where each empty array has shape [0, 0], whereas I use empty arrays of shape [0] when defining default elements on the python side. Should I use [0, 0] to better mimic matlab?

Yes, I think that makes sense (for consistency).
Are we ready to merge after this? 😄

@balbasty
Contributor Author

balbasty commented Mar 12, 2025

  • Fixed empty array shape (0,) -> (0, 0)
  • Used new way of (de)serializing sparse arrays
  • Changed sparse layout from COO to CRC + fixed (de)serialization

One (last?) inconsistency:

>>> Runtime.call('cell', 1, 3).shape
(3,)

>>> Runtime.call('zeros', 1, 3).shape
(1, 3)

>>> Runtime.call('struct', 'a', [1, 2, 3], 'b', [4, 5, 6]).shape
(1, 3)

So in essence, a row cell is converted to 1D Cell with shape (N,), whereas a row array is converted to a 2D Array with shape (1, N) and a row struct array is converted to a 2D Struct with shape (1, N).

Should we always convert row arrays to 1D things python side?


And let me paste comments from the code so that they are not forgotten:

  • MATLAB does not have 1D arrays (arrays always have at least two dimensions).
    A 1D python array is interpreted as a row array, so the round trip
    goes [1, 2, 3] -> [[1, 2, 3]].
    Are we happy with that? I think it works in the sense that
    when a numpy operation takes a matrix and a vector, the vector's shape is
    padded on the left during broadcasting, so it is interpreted as a row vector.

    !! We should clearly document this behaviour.

  • The creation of numeric vectors on the python side is currently
    quite verbose (Array.from_any([0, 0]), because Array([0, 0])
    is interpreted as "create an empty array with shape [0, 0]").
    We could either

    • introduce a concise helper (e.g., num) to make this less verbose:
      Array.from_any([0, 0]) -> num([0, 0])
    • Interpret lists of numbers as Arrays rather than Cells. But this is
      problematic when parsing the output of a mpython_endpoint call, since
      lists of numbers do mean "cell" in this context.
  • I've added support for "object arrays" (such as nifti or cfg_dep)
    in DelayedArray:

    a.b[0] = nifti("path") means that a.b contains a 1x1 nifti object.

    However, I only support 1x1 objects, and the index must be 0 or -1.
    There might be a way to make this more generic, but it needs more thinking.
    The 1x1 case is all we need for batch jobs (it's required when building
    jobs with dependencies).

    It might be useful to have an ObjectArray type (with MatlabClass
    as a base class?) for such objects -- It'll help with the logic in
    delayed arrays. It should be detectable by looking for class(struct(...))
    in the constructor when parsing the matlab code, although there are
    cases where the struct is created beforehand, e.g.:
    https://github.com/spm/spm/blob/main/%40nifti/nifti.m#L12

    Maybe there's a programmatic way in matlab to detect if a class is
    a pure object or an object array? It seems that old-school classes
    that use the class(struct) constructor are always object arrays.
    With new-style classes, object arrays can be constructed after the
    fact:
    https://uk.mathworks.com/help/matlab/matlab_oop/creating-object-arrays.html

    After more thinking, it also means that we again have a difference in behaviour
    between x{1} = object and x(1) = object. In the former case, x is
    a cell that contains an object, whereas in the latter x is a 1x1 object
    array.

  • We should probably implement a helper to convert matlab batches into
    python batches.
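
To illustrate the first point in the list above (1D arrays becoming rows), the round trip and the broadcasting argument can be checked with plain NumPy. This sketch does not involve the runtime; `np.atleast_2d` only mimics what the conversion is described as doing.

```python
import numpy as np

vector = np.array([1, 2, 3])          # shape (3,)
row = np.atleast_2d(vector)           # shape (1, 3): what a 1D array becomes
matrix = np.zeros((2, 3))

# Broadcasting pads shapes on the left, so (3,) behaves like (1, 3).
assert (matrix + vector).shape == (2, 3)
assert np.array_equal(matrix + vector, matrix + row)

# A column has to be requested explicitly.
column = vector[:, None]              # shape (3, 1)
print(row.tolist(), column.shape)     # [[1, 2, 3]] (3, 1)
```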

@johmedr
Collaborator

johmedr commented Mar 12, 2025

  • Fixed empty array shape (0,) -> (0, 0)
  • Used new way of (de)serializing sparse arrays
  • Changed sparse layout from COO to CRC + fixed (de)serialization

Amazing!

One (last?) inconsistency:

>>> Runtime.call('cell', 1, 3).shape
(3,)

>>> Runtime.call('zeros', 1, 3).shape
(1, 3)

>>> Runtime.call('struct', 'a', [1, 2, 3], 'b', [4, 5, 6]).shape
(1, 3)

So in essence, a row cell is converted to 1D Cell with shape (N,), whereas a row array is converted to a 2D Array with shape (1, N) and a row struct array is converted to a 2D Struct with shape (1, N).

Should we always convert row arrays to 1D things python side?

It could be nice to squeeze the first axis of 2D Arrays and Structs so that indexing is consistent, e.g.,

>>>  s = Runtime.call('struct', 'a', [1, 2, 3], 'b', [4, 5, 6])
>>>  s[1].b
5

instead of

>>>  s[0, 1].b
5

And let me paste comments from the code so that they are not forgotten:

  • MATLAB does not have 1D arrays (arrays always have at least two dimensions).
    A 1D python array is interpreted as a row array, so the round trip
    goes [1, 2, 3] -> [[1, 2, 3]].
    Are we happy with that? I think it works in the sense that
    when a numpy operation takes a matrix and a vector, the vector's shape is
    padded on the left during broadcasting, so it is interpreted as a row vector.
    !! We should clearly document this behaviour.

  • The creation of numeric vectors on the python side is currently
    quite verbose (Array.from_any([0, 0]), because Array([0, 0])
    is interpreted as "create an empty array with shape [0, 0]").
    We could either

    • introduce a concise helper (e.g., num) to make this less verbose:
      Array.from_any([0, 0]) -> num([0, 0])

Makes sense.

  • Interpret lists of numbers as Arrays rather than Cells. But this is
    problematic when parsing the output of a mpython_endpoint call, since
    lists of numbers do mean "cell" in this context.

The problem is that we don't have much freedom with this, see Matlab doc here and here.

  • I've added support for "object arrays" (such as nifti or cfg_dep)
    in DelayedArray:
    a.b[0] = nifti("path") means that a.b contains a 1x1 nifti object.
    However, I only support 1x1 objects, and the index must be 0 or -1.
    There might be a way to make this more generic, but it needs more thinking.
    The 1x1 case is all we need for batch jobs (it's required when building
    jobs with dependencies).
    It might be useful to have an ObjectArray type (with MatlabClass
    as a base class?) for such objects -- It'll help with the logic in
    delayed arrays. It should be detectable by looking for class(struct(...))
    in the constructor when parsing the matlab code, although there are
    cases where the struct is created beforehand, e.g.:
    https://github.com/spm/spm/blob/main/%40nifti/nifti.m#L12
    Maybe there's a programmatic way in matlab to detect if a class is
    a pure object or an object array? It seems that old-school classes
    that use the class(struct) constructor are always object arrays.
    With new-style classes, object arrays can be constructed after the
    fact:
    https://uk.mathworks.com/help/matlab/matlab_oop/creating-object-arrays.html
    After more thinking, it also means that we again have a difference in behaviour
    between x{1} = object and x(1) = object. In the former case, x is
    a cell that contains an object, whereas in the latter x is a 1x1 object
    array.

We could use isscalar to detect if an object is an array, and pass its shape in that case. I'll add this as an issue.

  • We should probably implement a helper to convert matlab batches into
    python batches.

Agreed. As a first step, calling Runtime.call('load', 'batch.mat') should give you a Python version of your Matlab batch.

@balbasty
Contributor Author

OK so the only thing that's blocking the merge is squeezing the first axis of 2d arrays and structs.

Do you want me to do it python-side (in _from_runtime()), or should it be done matlab-side, in mpython_endpoint()?
The former is probably easier.

@balbasty
Contributor Author

I've done it python side. It's actually quite nice that call("size", a) now returns [3, 4, 5] instead of [[3, 4, 5]].
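
For reference, the squeeze rule can be sketched in a few lines of NumPy. `squeeze_row` is a hypothetical name and the actual logic in `_from_runtime()` may differ; this only shows the idea of dropping the leading axis of row-shaped results.

```python
import numpy as np


def squeeze_row(array):
    """Drop the first axis of a (1, N) array; leave everything else alone."""
    array = np.asarray(array)
    if array.ndim == 2 and array.shape[0] == 1:
        return array[0]
    return array


print(squeeze_row([[3, 4, 5]]).tolist())        # [3, 4, 5]
print(squeeze_row(np.zeros((3, 1))).shape)      # (3, 1): columns stay 2D here
```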

@johmedr
Collaborator

johmedr commented Mar 12, 2025

Fantastic! I also think it's better and easier to squeeze on the Python side.

@johmedr
Collaborator

johmedr commented Mar 12, 2025

All looks good, merging!

Thanks for this huge development @balbasty!

@johmedr johmedr merged commit 77b4c7e into main Mar 12, 2025
32 checks passed
@johmedr johmedr deleted the enh-type-system branch March 12, 2025 18:20