Develop Python type system to better emulate MATLAB-like syntax and ease code adaptation #20

Closed
johmedr opened this issue Jan 30, 2025 · 8 comments
Assignees
Labels
enhancement New feature or request

Comments

@johmedr
Collaborator

johmedr commented Jan 30, 2025

Enabling MATLAB-like syntax on Python data structures is important to reduce the complexity of adapting existing MATLAB code (such as examples or snippets generated by the SPM matlabbatch system) to Python. The type system should enable syntax as close as possible to MATLAB's, while behaving in a simple and predictable way.

The current type implementation only mimics the basic syntax of MATLAB objects (e.g., dot indexing for structures), but does not allow other existing MATLAB syntactic features (such as indexing an undeclared field in a structure).

In #19, @balbasty proposed a new type system enabling several of these features. This issue is about reviewing and integrating this type system in the code base.

@johmedr
Collaborator Author

johmedr commented Jan 30, 2025

Originally posted by @balbasty in #19:

@johmedr I've played with alternative Struct/Cell/Array classes that could make it easier to build batches, using a syntax closer to the one we use in matlab.

Here's a notebook with the prototype. There's a batch example at the end of the notebook:

matlabbatch = CellArray()

matlabbatch[0].spm.util["import"].dicom.data = 'dir';
matlabbatch[0].spm.util["import"].dicom.root = 'flat';
matlabbatch[0].spm.util["import"].dicom.outdir = cell('output');
matlabbatch[0].spm.util["import"].dicom.protfilter = '.*';
matlabbatch[0].spm.util["import"].dicom.convopts.format = 'nii';
matlabbatch[0].spm.util["import"].dicom.convopts.meta = 0;
matlabbatch[0].spm.util["import"].dicom.convopts.icedims = 0;

The StructArray/CellArray/NumArray classes are numpy arrays that automatically resize themselves if out-of-bound elements are queried (similar to matlab's behaviour). Uninitialized elements are DelayedArrays that transform themselves into StructArray/CellArray/NumArray based on the type of indexing that's applied. There's then a hacky logic so that delayed arrays warn their parents that they have determined their type and can be "finalised".
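The auto-resizing behaviour can be illustrated without numpy. This is only a minimal sketch and not the prototype's code: `GrowingList` and its `fill` parameter are hypothetical names, and only non-negative integer indices are handled.

```python
class GrowingList(list):
    """Hypothetical list that grows on out-of-bounds assignment, like a MATLAB array."""

    def __init__(self, *args, fill=None):
        super().__init__(*args)
        self.fill = fill  # value used to pad newly created slots

    def __setitem__(self, index, value):
        # Pad with the fill value when assigning past the current end.
        if isinstance(index, int) and index >= len(self):
            self.extend([self.fill] * (index + 1 - len(self)))
        super().__setitem__(index, value)


cells = GrowingList()
cells[2] = "c"   # MATLAB: cells{3} = 'c' on an empty cell
print(cells)     # [None, None, 'c']
```

The prototype additionally replaces the padding with delayed placeholders, which this sketch does not attempt.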

The main issue is that we don't have two different types of bracket to differentiate "cells of struct" from "struct array". I've tried to hijack __call__ to implement matlab's {} but it's very flimsy. In the meantime I've added as_cell/as_struct/as_array properties to more robustly provide type hints, so we can do something like a[1].b.as_cell[2].c = x for matlab's a(2).b{3}.c = x.

It's a prototype, I am sure there are lots of corner cases I haven't found.

It's somewhat related to this issue, but we can also discuss this in a new issue, if it sounds interesting.

Cheers
Yael

@johmedr
Collaborator Author

johmedr commented Jan 30, 2025

Regarding cell indexing:

Curly brackets don't exist in Python, so there will always be changes required when adapting code. MATLAB allows item and slice indexing, with the possibility of unpacking the items using the a{:} notation. Python has very nice and well-established ways of implementing all of the operations you can typically do on a cell, and I would suggest we adopt them.

A possible option is to enable these types of indexing:

1. Single item indexing: a{1} to a[0]

We map the curly bracket, single item indexing to the single item indexing in Python.

2. Slice indexing: a(3:12) to a[2:12]

This is quite straightforward.

3. Multiple item extraction: a{3:12} to *a[2:12]

We use the star operator in Python to extract the items from an array slice. This would translate, e.g., spm_jobman('run', job, inputs{:}) to the natural Python syntax spm_jobman('run', job, *inputs)

4. Single item slice: a(3) to a[[2]] or a[None, 2]

This is based on numpy array indexing which your CellArray is already based on, and should feel natural to people coming from Python. I don't think it is worth the effort and clutter of handling this differently in Python, as this syntax is not particularly used anyway.

This should keep us away from overriding the __call__ method, and I think it is worth sticking to the Pythonic way of indexing a collection type (i.e., through __getitem__).
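As a rough illustration of the mapping above, here is a sketch that uses a plain Python list in place of a 1-D cell. `spm_jobman_stub` is a hypothetical stand-in that only echoes its arguments. Note that MATLAB ranges include the end index while Python slices exclude it, so a 1-based `m:n` becomes `[m-1:n]`.

```python
# A plain list stands in for a 1-D MATLAB cell: inputs = {'a', 'b', 'c', 'd'}
inputs = ["a", "b", "c", "d"]

first = inputs[0]        # MATLAB: inputs{1}
sub_cell = inputs[1:3]   # MATLAB: inputs(2:3), still a container


def spm_jobman_stub(command, job, *args):
    """Hypothetical stand-in for spm_jobman that just echoes its arguments."""
    return command, job, args


# MATLAB: spm_jobman('run', job, inputs{2:3})
result = spm_jobman_stub("run", "job", *inputs[1:3])
print(result)  # ('run', 'job', ('b', 'c'))
```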

Let me know what you think :)

@johmedr johmedr added the enhancement New feature or request label Jan 30, 2025
@balbasty
Contributor

I agree that bracket indexing does the job in 99% of use cases.

Currently, the behaviour is that of numpy arrays, so your examples all work (the only caveat is that *a[2:12] or *a[2:12, :] will return a list of 1D cells when a is a 2D cell, whereas a{3:12,:} returns a flattened list of values).

The only remaining issue with the absence of curly-like braces occurs when implicitly declaring nested structures. Specifically:

clear;
a.b(1).c = 'd';   % b is a struct array

clear; 
a.b{1}.c = 'd';   % b is a cell that contains a struct

In the prototype, I currently do

a = StructArray()
a.b[0].c = 'd'           # b is a struct array

a = StructArray()
a.b.as_cell[0].c = 'd'   # b is a cell that contains a struct

Alternatively, I was hoping to get something like this to work (but I am not there yet)

a = StructArray()
a.b(0).c = 'd'           # b is a cell that contains a struct

@johmedr
Collaborator Author

johmedr commented Jan 30, 2025

I see, so for multidimensional cell arrays we then need to unpack as *a[2:12, :].flat to get a similar behavior to a{3:12, :}. Though, I think an additional transpose might be required (as MATLAB arrays are column-major). This might be worth overriding flat, or explicitly constructing all of the arrays in column-major order (e.g. using order='F').

Well done for spotting this confusing case -- we do need a way for the user to tell us what type they want. I do prefer the explicit syntax you introduced (with as_cell) as it seems nice and easy to remember, a bit like df.at[i] in pandas.
Is there an incentive having normal indexing defaulting to struct array initialization? Is it more frequent than cell array initialization?

@balbasty
Contributor

  1. I think that np.ndarray.flat (and np.ndarray.flatten by default) always returns elements in "C" order, even if the underlying layout is "F".
In [1]: import numpy as np

In [2]: x = np.arange(16).reshape([4, 4])

In [3]: x
Out[3]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [4]: x.strides
Out[4]: (32, 8)

In [5]: y = np.array(x, order="F")

In [6]: y.strides
Out[6]: (8, 32)

In [7]: list(x.flat)
Out[7]: 
[np.int64(0),
 np.int64(1),
 np.int64(2),
 np.int64(3),
 np.int64(4),
 np.int64(5),
 np.int64(6),
 np.int64(7),
 np.int64(8),
 np.int64(9),
 np.int64(10),
 np.int64(11),
 np.int64(12),
 np.int64(13),
 np.int64(14),
 np.int64(15)]

In [8]: list(y.flat)
Out[8]: 
[np.int64(0),
 np.int64(1),
 np.int64(2),
 np.int64(3),
 np.int64(4),
 np.int64(5),
 np.int64(6),
 np.int64(7),
 np.int64(8),
 np.int64(9),
 np.int64(10),
 np.int64(11),
 np.int64(12),
 np.int64(13),
 np.int64(14),
 np.int64(15)]

We might have to transpose if we want the exact same behaviour as matlab (or just not care -- I don't have use cases in mind where this would be a problem).
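For reference, a small sketch of two ways to get MATLAB-style column-major element order from a numpy array, using only standard numpy calls (`flatten(order="F")` and iterating over the transpose). The array contents are arbitrary illustration values.

```python
import numpy as np

x = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]

# Default flattening is row-major ("C" order), whatever the memory layout.
c_order = x.flatten().tolist()

# Column-major ("F" order) matches the order MATLAB uses for x(:).
f_order = x.flatten(order="F").tolist()

# Iterating over the transpose in C order gives the same sequence.
t_order = [int(v) for v in x.T.flat]

print(c_order)  # [0, 1, 2, 3, 4, 5]
print(f_order)  # [0, 3, 1, 4, 2, 5]
print(t_order)  # [0, 3, 1, 4, 2, 5]
```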

  2. I tried to stay consistent with matlab, when it comes to implicit structures, where () is used for structures and numeric arrays, and {} for cells. So I used [] for StructArray and NumArray and the more verbose syntax for cells. In SPM world, though, I think that cells of structures are more common than structure arrays.

@johmedr
Collaborator Author

johmedr commented Jan 30, 2025

  1. True, from the docs: "Iteration is done in row-major, C-style order (the last index varying the fastest)". flatten lets us specify the order; ravel does too, but does not systematically copy; np.nditer constructs an iterator with the desired order, so maybe that's the best option.

  2. One option could be to use only the square bracket indexing for the "natural" indexing associated with an object (e.g., round brackets for Num/StructArrays and curly brackets for CellArray), and ask the user to specify through as_struct and as_cell what they want in that particular ambiguous case. It would make the code more readable for people coming from both Python and Matlab, and when Matlab throws an error, we will be able to check that everything has been converted to the right type (without needing to check what round and square brackets correspond to). On the other hand, it clutters the syntax and requires changing a lot of things when adapting code.

Another option is to override the __call__ method for DelayedArrays and StructArrays (as opposed to CellArrays). One thing against using the __call__ override for cell is that it does not support assignment, i.e., we cannot transform

a{3} = 'c'; 

to

a(3) = 'c'

With struct arrays, we would never really do assignment, but rather assign fields, e.g.

a.b(3).c = 4

But again, that prevents us from doing, e.g.,

a.b(:).c = d[:]

to initialize the fields of a StructArray, which we can do if we use square brackets.
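The syntax restrictions discussed above can be checked directly with the standard library parser. This sketch only tests whether each statement parses; it says nothing about how the prototype classes behave at runtime. `is_valid_python` is a helper name introduced here for illustration.

```python
import ast


def is_valid_python(source):
    """Return True if the statement parses as Python, False on SyntaxError."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


candidates = [
    "a(3) = 'c'",        # assignment to a call: rejected by the parser
    "a[3] = 'c'",        # item assignment: fine
    "a.b(3).c = 4",      # attribute assignment on a call result: fine
    "a.b(:).c = d[:]",   # bare colon inside a call: rejected
    "a.b[:].c = d[:]",   # colon inside square brackets: fine
]

validity = {source: is_valid_python(source) for source in candidates}
for source, ok in validity.items():
    print(f"{source!r}: {'valid' if ok else 'invalid'}")
```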

Last option is to have __call__ only for transforming a DelayedArray in a cell whose fields are to be indexed, as we'll never need the column/slice indexing on these. We can still have as_struct, as_cell, and as_num for explicit casting. Indexing a CellArray uses square brackets as we discussed before. That would give for the easy cases:

a = StructArray()
a.b[0].c = 'd'     
# OK! b is a struct array

a = StructArray()
a.b.as_struct[0] = StructArray(c='d')   
# OK! b is a struct array

a = StructArray()
a.b(0).c = 'd'     
# OK! b is a cell that contains a struct

a = StructArray()
a.b.as_cell[0] = StructArray(c='d')   
# OK! b is a cell that contains a struct

And two edge cases:

a = StructArray()
a.b[0] = StructArray(c='d') 
# OK, b is a cell that contains a StructArray???

a = StructArray()
a.b(0) = StructArray(c='d') 
# ERROR!

@balbasty
Contributor

OK, so I've now made rounded brackets () fall back to square brackets []. (see updates to the notebook)

The only differences are:

  • a(i) = x is invalid python syntax (as you've stated)
  • a(:) is also invalid python syntax (as you've stated), but can be replaced with a(slice(None))
    • we could also parse the type slice as equivalent to the value slice(None), so that the syntax a(slice) is valid.
  • In DelayedArrays, x = a(...) triggers as_cell whereas x = a[...] triggers as_struct
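The idea of accepting the bare `slice` type as a stand-in for `:` could look roughly like the sketch below. `CallIndexed` and `normalise_index` are hypothetical names, and a plain list is used instead of the prototype's numpy-based classes.

```python
def normalise_index(index):
    """Treat the bare `slice` type as slice(None), i.e. MATLAB's ':'."""
    return slice(None) if index is slice else index


class CallIndexed(list):
    """Hypothetical container where a(i) falls back to a[i]."""

    def __call__(self, index):
        return self[normalise_index(index)]


values = CallIndexed([1, 2, 3])
print(values(0))             # 1
print(values(slice(None)))   # [1, 2, 3]
print(values(slice))         # [1, 2, 3]
```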

I've implemented () in DelayedArrays and CellArrays, because it allows the following syntax:

a.b(0).c = 'd'  # Instructs that b is a CellArray 
a.b(1).c = 'e'  # At this point b is already a CellArray, but we can still use `()`

If we implement () only in DelayedArrays, we'd have to instead write

a.b(0).c = 'd'  # Instructs that b is a CellArray 
a.b[1].c = 'e'  # At this point b is already a CellArray, so we cannot use `()`

I believe this covers most matlab use cases, except a.b{1} = x which must be written a.b.as_cell[0] = x or a.b = cell(x).
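To make the delayed-type logic easier to follow, here is a heavily simplified sketch of the behaviour described above. It is not the prototype's implementation: `Struct`, `Delayed`, `kind` and `items` are hypothetical names, storage is a plain dict, and finalisation into real array classes is left out.

```python
class Delayed:
    """Hypothetical placeholder whose container type is fixed by its first use."""

    def __init__(self):
        self.kind = None   # becomes "cell" or "struct"
        self.items = {}    # sparse storage: index -> element

    def _element(self, index, kind):
        if self.kind is None:
            self.kind = kind   # the first access decides the type
        return self.items.setdefault(index, Struct())

    def __call__(self, index):      # a.b(0) plays the role of MATLAB's a.b{1}
        return self._element(index, "cell")

    def __getitem__(self, index):   # a.b[0] plays the role of MATLAB's a.b(1)
        return self._element(index, "struct")


class Struct:
    """Hypothetical struct: undeclared fields are created on first access."""

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. the field does not exist.
        if name.startswith("_"):
            raise AttributeError(name)
        child = Delayed()
        object.__setattr__(self, name, child)
        return child


a = Struct()
a.b(0).c = "d"   # b becomes a cell
a.b(1).c = "e"   # () still works once b is a cell
a.x[0].c = "d"   # x becomes a struct array
print(a.b.kind, a.x.kind)  # cell struct
```

In this sketch the type is fixed by the first access only, so a later `a.b[2]` on a cell keeps `b` a cell.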


Here are matlab's behaviours for a bunch of cases:

%% x.field = array  -> struct
clear;
a.b = 1; 

%% x(scalar).field = array  -> struct
clear;
a(2).b = 1; 

%% x(slice).field = array  -> ERROR
clear;
a(2:3).b = 1;

%% [x(slice).field] = function() -> struct
clear;
[a(2:3).b] = foo();  

%% x(scalar) = struct -> struct
clear;
a.b(2) = struct('c', 1);

%% x(slice) = struct -> struct
clear;
a.b(2:3) = struct('c', 1);

%% x(slice) = struct array -> struct
clear;
a.b(2:3) = struct('c', {1 2});

%% x{scalar} = array -> cell
clear;
a.b{2} = 1;

%% x{scalar} = struct -> cell of struct
clear;
a.b{2} = struct('c', 1);

%% x{slice} = array -> ERROR
clear;
a.b{2:3} = 1;

%% [x{slice}] = function() -> ERROR
clear;
[a.b{2:3}] = foo();

%% x(scalar) = array -> array
clear;
a.b(2) = 1;

%% x(scalar) = cell -> cell
clear;
a.b(2) = {1};

%% x(slice) = cell -> cell
clear;
a.b(2:3) = {1};

%% x(slice) = cell array -> cell
clear;
a.b(2:3) = {1 2};

%% x(slice) = array -> array
clear;
a.b(2:3) = [1 2];

%% x{scalar}.field = array -> cell of struct
clear
a{2}.b = 1;

%% x{slice}.field = array -> ERROR
clear
a{2:3}.b = 1;

%% [x{slice}.field] = function() -> ERROR
clear
[a{2:3}.b] = foo();

%%
function [x,y] = foo(), x = 1; y = 2; end

For your non-erroring edge case, I think I would interpret it as b is a StructArray

a = StructArray()
a.b[0] = StructArray(c='d') 
# OK, b is a StructArray

so that it matches

%% x(scalar) = struct -> struct
a.b(1) = struct('c', 'd');

but I agree that this is the most confusing case.


Some other issues:

  • Since StructArray inherits np.ndarray, its namespace is quite busy already. Assigning or accessing fields that have the same name as existing methods is rather ill-behaved:

    a = StructArray()
    a.min = 1
    # OK
    
    a = StructArray()
    a.b.min = 1
    # OK
    
    a = StructArray()
    a.b.min.c = 1
    # ERROR

    Should we delete all ndarray methods except a select few from the StructArray class?
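The name clash can be reproduced in pure Python. In this sketch `ArrayLike` is a hypothetical stand-in for np.ndarray (it simply defines a `min` method), and `Struct` creates undeclared fields through `__getattr__`, which Python only calls when normal lookup fails.

```python
class ArrayLike:
    """Stand-in for np.ndarray: it already defines a method called `min`."""

    def min(self):
        return 0


class Struct(ArrayLike):
    """Hypothetical struct that creates undeclared fields on first access."""

    def __getattr__(self, name):
        # Reached only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        child = Struct()
        object.__setattr__(self, name, child)
        return child


a = Struct()
a.b.c = 1   # works: `b` is not an inherited name, so it is created on the fly
a.min = 1   # works: the instance attribute shadows the inherited method

clash = Struct()
try:
    clash.min.c = 1   # `min` resolves to the inherited bound method
    outcome = "ok"
except AttributeError:
    outcome = "error"
print(outcome)  # error
```

Removing or hiding inherited names, as suggested above, would let `__getattr__` handle them like any other field.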

  • Should we keep these class names? Or rename them to Struct, Cell, Array or struct, cell, array?

  • Are we happy with the current constructors?

    • NumArray(), NumArray([]) -> 0 (scalar)
    • NumArray(0), NumArray([0]) -> zero-size vector
    • NumArray(1), NumArray([1]) -> [0]
    • CellArray(), CellArray([]), CellArray(0), CellArray([0]) -> zero-size cell
      (this differs from NumArray but is needed otherwise the delayed logic breaks down)
    • CellArray(1), CellArray([1]) -> ([],)
    • StructArray(), StructArray([]) -> {} (scalar dict)
    • StructArray(0), StructArray([0]) -> zero-size array of dict
    • StructArray(1), StructArray([1]) -> [{}]
    • StructArray({...}) -> {...}(scalar dict)
    • StructArray(a=1) -> {"a": 1} (scalar dict)

Going forward:

  • Do you want me to start a branch?
  • Do you have guidance regarding _to_matlab() conversion?

I think we're converging towards something that makes sense! :)

@balbasty
Contributor

Implemented in #21
