-
Notifications
You must be signed in to change notification settings - Fork 250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create IntanBinaryRawIO
#1318
Create IntanBinaryRawIO
#1318
Conversation
|
Thanks for this. Please let us know when it's ready for review. @samuelgarcia or @JuliaSprenger can you add those test files to Gin, I'm not sure of the procedure? |
Thanks @zm711 Just sleeping this! I can add the files to GIN :) |
Hi Zach. The file you provided are a bit too large. (40Mo each) |
Sweet. I'm back. So running the data for only 1 sec gives me ~10-15 Mo, so if that's still too much I'll turn down the sampling rate. Thanks @samuelgarcia and @alejoe91! I need to push one change since I changed the file names, but I'll do that later this morning. Could you guys ping me if these will work and then I'll push a change (or you can) to point the tests to these folders instead of the old ones? intan_fpc_test_231117_052630.zip |
Hello @zm711! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2023-12-01 20:37:14 UTC |
I think 10-15 might be ok, but the last word is on @samuelgarcia :) |
@alejoe91, cool thanks. We will wait for @samuelgarcia then. I really wish we had black instead of pep8 speaks..... |
Let's open an issue about that! I think it's time for Neo to embrace it too ;) |
Done. |
I will need this for a conversation project so I can help with this. |
Thanks Heberto! It'll be great to have your eyes on this too! If you find something egregious we can back propagate to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good so far. I did a first reading of the code. Next week I will test it with some files that I have. Two general comments:
- Are you aware that Intan provides some code to read the headers:
https://intantech.com/downloads.html?tabSelect=Software&yPos=0
we should double check our parameter extraction with theirs. Last time I checked I remember some small inconsistencies between the code here in neo, theirs and pyintan.
- I am not sure
IntanBinaryRawIO
is very informative about the use case that this class is covering. All the formats are Binary. The documentation of Intan does refer to what is covered byIntanRawIO
as the "Traditional" Intan format in opposition to the "One File Per Signal" and "One File Per Channel" formats. To be honest, my intuition right now is that we probably should fold everything into the old format or otherwise make one class per functionality (e.g. one for the traditional ,one for the signal and one for the channel). The current approach sits in an uncomfortable middle for me. But to keep on this path, what do you think ofIntanFolderRawIO
then we describe that this covers these two formats in the docstring?
Yes, I've used Intan's code. It is is super memory hungry (which is actually why I got involved in Neo in the first place :) ). But I think you make a good point that we can double check this. I will say I have tested files run with Intan's code in Matlab, in python, and with Neo and the values I get for my amplifier channels are all the same within the margin of floating point differences (by my eye, not carefully assessed to a % error). Have you seen huge swings in the numbers when testing?
A couple points here. The IntanRawIO deals with the .rhd format which is not a headerless binary format but their own format which we need to parse whereas the IntanRawBinaryIO is all based on a series of binary (.dat) files with a separate header. So I'm not sure if I agree with that particular part of your point. That being said I agree the name as it relates to Intan terminology is pretty unhelpful. I only chose the name because that is how a bunch of other Neo IOs deal with the distinction between proprietary As far as put together vs middle ground (this PR) vs all separate I lean more toward middle ground or all separate just because the Neo code structure is that something is either one file or one directory. So we could have one per format or keep it such that we have the one file vs one folder (with two formats together). Thanks so much for taking the time to look over this code. I really appreciate the feedback @h-mayorquin! I'm looking forward to (more of) your thoughts. |
No, I haven't paid attention to that degree either. I was thinking on the translation table between bytes and types depending on the version.
This is semantics probably and I am less confident about my previous claim. Specially after your comment that neo tends to make this distinction beteween Binary vs Proprietary. If that is neo style it is better to stay consistent to that. That said, I would like to clarify where I am coming from: binary to me just means non-human readable. It is in bytes. It has nothing to do with it being proprietary. Propietary to me just means that the translation table between bytes and human relevant structures is closed or not available to the general public. From that perspective, all of the intan formats are binary and none of them is propietary so the distinction seems incorrect to me. I can probably just google a couple of results in a browser to see what is common usage. Maybe I am way off. |
Now with your comment I like your semantics better. I agree binary is any format that is machine readable whereas non-binary would be human readable given the correct reader. So maybe to be more semantically precise it would be better to say the distinction is more based on headerless and headered binary files and they are all the same. Typically the header is the proprietary part of the file whether there is an open source reading tool or not but the format is still "binary". So something like a headerless .bin or .dat would be a binaryIO but something like the .rhd which requires the header to be parsed directly is a rawio. Sorry edited to say rawio since in the neo code-base io and rawio are very different things, so we should be really careful about that terminology. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uploading the files now to gin but I am getting errors when I try them with your code.
I added here some more comments.
class IntanBinaryRawIO(BaseRawIO): | ||
""" | ||
Class for processing Intan Data when saved in binary format. Requires an | ||
`info.rhd` as well one file per signal stream or one file per channel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, What about rhs
format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was my understanding that with the newer software it was just doing rhd
. I could be wrong, but when I was reading the documentation this headerless-binary-file only mode seemed to have a bunch of binary files and one info.rhd
header file. If you see something else definitely post it here and we can fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I am actually not familair with rhs at all. I was wondering if you were.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rhs is for use with their older stimulator setup. I have never actually seen someone use it... So I'm not sure. If you look in the IntanRawIO we stopped officially supporting it after v1 so I'm not sure if anything has changed. Sorry...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context.
timestamp = self._raw_data[6]["timestamp"].flatten() | ||
else: | ||
timestamp = self._raw_data[6][0]["timestamp"].flatten() | ||
assert np.all(np.diff(timestamp) == 1), "timestamp have gaps" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we stoping the user from working here?
Is there anything the user can do in these cases? If so, maybe we can include in the assertion message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a holdover from the original RawIO. There are two reasons this can occur:
File corruption (start-stops, run out of storage). In this case I don't know if we can really salvage the files?
Attempting to merge files that shouldn't be merged (ie spread out in time). This is something SI does all the time for recordings, so I don't think this is wrong to do.
Now that I think about this, this may be a better check when all data is in the same .rhd
file rather than these headerless binary files. I'll think about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Sorry, I did not check the original. I should have done that. Let's get this working and then we discuss this type of issues after this PR as they will require modification -if they do require modification- of the two classess. Right now, better to keep them in synch.
signal_channels = [] | ||
for c, chan_info in enumerate(self._ordered_channels): | ||
name = chan_info["native_channel_name"] | ||
chan_id = str(c) # the chan_id have no meaning in intan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would be wrong with using native_channel_name
as a channel_id if the latter has no meaning? or what do you mean by channel id not having a meaning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again this is a holdover from Sam's RawIO implementation. I kept it the same. Neo has three ways to access channels. channel_indexes
which are just the columns of the data matrix, channel_names
which are the names given to the channels (so in this case the 'native_channel_name'), and the channel_ids
which are something I don't fully understand, but they are typically numeric, but encoded as strings. @samuelgarcia would have to explain when we would use the channel_ids
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's get the two in synch here and discuss improvements in another PR.
else: | ||
self._max_sigs_length = max( | ||
[ | ||
len(raw_data) * raw_data[0].size * self._block_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was testing your files on gin and this is failing today. The reason being is that some of your raw_data items are still list at this stage (and therefore don't have the size attribute):
I am still looking at your code so I am not sure about the best suggestion here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, when I ran the test on a different dataset this worked. So I'll try to figure that out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--Okay, so I tested this locally and it didn't fail? I'm not sure what exactly is failing--. Now it's broken. I'll try to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured it out. So the problem was I originally included a test file with all possible streams. But people do not always provide all possible streams. So my second dataset was missing the supply voltage stream. It turns out that the code required all streams. So in the line of code you flagged the issue was the supply voltage raw data was an empty list causing it to fail. I've fixed that issue by cleaning up the dtype but it wasn't very elegant. I'll flag it for you to look over and hopefully you'll have a better idea how to fix :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is good. I feel the code should be robust to the absence of a stream. The files are in gin now. I will make an attempt to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you really want to stick to Sam implementation you can just filter out the stuff in the dictionary that are empty lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just skip adding the streams whose ifles are not present. As in, skip blocks like this:
If the corresponding files are not available.
sig_dtype = "int16" | ||
else: | ||
sig_dtype = "uint16" | ||
stream_id = str(chan_info["signal_type"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now, signal_type
is just numbers but we should use an enum or a simple mapping to make this meaningful as those numbers do have a meaning (from the documentation of Intan):
I also think you can use those more meaningful string values as keys to your dictionaries as right now you have integers as keys. They could be better and more specific that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but in Neo you have to give stream_index
which is an integer value mapping to signal_type for Intan. We have this in the IntanRawIO:
https://github.com/NeuralEnsemble/python-neo/blob/c96ec936049a125d6bedfa26b67514f927bb1ab3/neo/rawio/intanrawio.py#L436C1-L443C2.
But maybe you have a grander idea? We could do a deeper rewrite, but the user facing api still needs to accept an integer value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python-neo/neo/rawio/intanrawio.py
Lines 110 to 114 in c96ec93
stream_ids = np.unique(signal_channels["stream_id"]) | |
signal_streams = np.zeros(stream_ids.size, dtype=_signal_stream_dtype) | |
signal_streams["id"] = stream_ids | |
for stream_index, stream_id in enumerate(stream_ids): | |
signal_streams["name"][stream_index] = stream_type_to_name.get(int(stream_id), "") |
Each signal_stream has an id
and a name
. But for the most part we use the stream_index
. We could rewrite this, but I was trying to maintain a similar feel to IntanRawIO in general.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a good argument. Lets keep it similar to the other class and change later.
rhd_global_header_final, | ||
rhd_signal_group_header, | ||
rhd_signal_channel_header, | ||
stream_type_to_name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@h-mayorquin, taken from intanrawio here for the mapping. But I think you suggest be clearer and just retyping this mapping out in the file?
# because data_dtype has all possible dtypes we need to delete the dtypes which aren't contained in this dataset | ||
# so if the data_dtype value is still an empty list we delete to ignore. | ||
dtype_cleanup = [] | ||
for key, value in data_dtype.items(): | ||
if len(value)==0: | ||
dtype_cleanup.append(key) | ||
for key in dtype_cleanup: | ||
_ = data_dtype.pop(key) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@h-mayorquin, this is required otherwise downstream functions will expect every stream to be present. But this was just a quick fix. If you have a better fix happy to do it.
sl1 = sl0 + (i_stop - i_start) | ||
|
||
sigs_chunk = np.zeros((i_stop - i_start, len(channel_names)), dtype="uint16") | ||
for i, chan_name in enumerate(channel_names): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid this i
, you should use channel_index
. There is already an i
in the i_start
so I think is abit confusing.
@h-mayorquin, after all your wonderful comments and edits you're starting to convince me that we should try to fold this just into the IntanRawIO. Some ios use a multiple IO strategy to deal with header and headerless, but other ios just try to fold everything into one file. Right now we are both trying to stay in synch with IntanRawIO while also improve this IO which is difficult since the IOs are spread out between two files. Give me a couple days to try to draft up fusing the two IOs in a new PR and then I'll ping you and then we will have all the info in front of us and then we can decide whether splitting them because they have different roles is the right way to go or fusing them together would be better. |
@ChauhanT, we think we've actually finished this but in #1402. Instead of doing a separate IO we decided to put it into our old intan. It will now detect automatically which format you are using and correctly map everything. That being said we could use a beta tester before we do final review on it (our plan is to include the updated intanRawIO in the next release at the latest). If you're willing you could install from that PR and test it out. Let me if you want help doing that. |
@zm711 Looks great. In a bit of a deadline crunch atm, but hopefully I can squeeze out an hour or two to test the PR next week. I suppose the longer our recordings get, the more relevant the one-file-per-channel format becomes. Great work, and a big thank you to all involved ! |
@zm711 Hi, I am testing the pull request. Here are some observations. Not sure where to report them so I am writing here:
But other than that, the spikeinterface module seems to be able to read single-file-per-channel format quite fast and very well ! |
@ChauhanT, if you're testing #1402 could you post over there? That way we can keep track in the current PR. We will close this one as soon as we finish that one. If you were testing this PR, it has some mistakes that we have not fixed since we've moved to #1402. But huge thank you for testing. For channels, the issue is that we should return the native channels for the chan_ids which need to be unchangeable. The custom ids can be changed by the user so if we return those rather than a native if someone renames their channels (seems rare but is possible). They wouldn't be able to map their renamed channels. You raise a good point though that this could break old code, but breaks it in the right direction. So I think we should go forward, but we will discuss that as a group. The channel numbers as you list are actually the channel_indexes which should just be numbers.
What are the second to last elements?
There is no convenience function to do that. But you can still get it. You just have to grab it yourself. I can help you over there if you need to do this regularly. It is just stored as a nested attribute (attribute of an attribute).
That is likely a PR we will have to do at the spikeinterface level. |
Just to prevent future confusion, I'll close this one now since I think we are pretty set on #1402. |
I was just drafting a possible binary file reader for Intan. Although most users use the full
*.rhd
file Intan gives the option to save its data into two different binary formats. This issues comes up occasionally, SpikeInterface/spikeinterface#194 where users save in binary format and then have no way to easily load the data. I would still say the vast majority of users do not utilize this optional save mode, but it might be nice to have some underlyingNeo
machinery to deal with it in the rare case the issue comes up.Format
In general Intan offers two possible binary modes.
One File Per Signal Type
is roughly equivalent to saving one binary file/Neo stream. And thenOne File Per Channel
is each channel for each stream is saved as a separate file. Both formats utilize headerless binary files, but instead include aninfo.rhd
header file which provides all information needed for reading and processing the binary files. It is also located all in one directory (the user gets to name the directory but a timestamp is added to the user supplied name).Implementation
I initially thought about folding this into the current
IntanRawIO
, but decided against this for a couple reasons. First code already relies on the current implementation ofIntanRawIO
includingSpikeInterface
andNeuroConv
so changing a parameter to something likefile_or_dirname
would break other code. And although it would be possible to require the user to specify theinfo.rhd
as the root file and then search for the other files, theone-dir
mode forNeo
made more sense to me. This leads to the issue of some code duplication. TheIntanBinaryRawIO
is completely based on Sam'sIntanRawIO
, but requires a few changes to how the complex datatype is formed, requires a different setup for the_raw_data
memmap files, and requires a slightly different way of accessing these values when obtaining ananalogsignal_chunk
. So I created a new class, but since the header structure and some auxiliary functions are the same, I just import those from theintanrawio.py
. This has the benefit that if the header organization changes with a new version it only needs to be updated in one location rather than two. Additionally, since the binary format for Intan is so rarely used, I liked the idea of hiving it off for just those users that absolutely need it, so that if a large change is need in the baseIntanRawIO
it is not slowed down by trying to make all the binary formats work. It could be argued that this is unnecessary code duplication during the header reading, but for the most part Intan hasn't changed that part of their files (gain offset, etc), so upkeep should be minimal.To-do
I haven't figured out how temperature works for the binary file format yet. So I just commented that out for now, but once I figure that out I would add temperature as well. This wouldn't hold up the implementation of this IO.Note below on temp.Testing
simulated_intan_230816_182040.zip
simulated_intan_230816_095232.zip
I don't have a gin account, but these files can be added to the intan folder. They are simulated datasets saved in the two possible binary formats. I haven't quite got the full Neo suite working on my computer locally with these new files, but I've tested parts of the
IntanBinaryRawIO
by running these data sets and trying different channel and stream indices, but I would love to have the option to test these in the CI for this PR, to make sure I'm not making any stupid mistakes.Anyway, would love feedback on this whenever you have time.