
Refactor serialize_data with better param names and logic #222

Closed
wants to merge 2 commits

Conversation

bobleesj
Contributor

No description provided.


Warning! No news item is found for this PR. If this is a user-facing change/feature/fix,
please add a news item by copying the format from news/TEMPLATE.rst.

for headerfile in tlm_list:
    # gather data using loadData
    hdata = loadData(headerfile, headers=True)
    data_table = loadData(headerfile)

    # check path extraction
-   generated_data = serialize_data(headerfile, hdata, data_table, dt_colnames=["r", "gr"], show_path=True)
+   generated_data = serialize_data(headerfile, hdata, data_table, dt_col_names=["r", "gr"], show_path=True)
Contributor Author

df_col_names and df_colnames are used across the codebase, so sticking with dt_col_names.


if dt_col_names is not None:
    dt_col_names_count = len(dt_col_names)
    if dt_max_col_count < dt_col_names_count:  # assume numpy.loadtxt gives non-irregular array
Contributor Author

Compare the number of col_names values provided as input to this function against the number of columns determined from the data table.
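
For illustration, a minimal sketch of the check being discussed; the helper name check_col_names and the ValueError are assumptions made for the sketch, not necessarily what the real function uses:

def check_col_names(dt_col_names, data_table):
    # numpy.loadtxt is assumed to give a rectangular (non-irregular) array,
    # so every row of the data table has the same number of columns.
    dt_max_col_count = max(len(row) for row in data_table)
    if dt_col_names is not None:
        dt_col_names_count = len(dt_col_names)
        if dt_max_col_count < dt_col_names_count:
            # More column names were supplied than columns found in the data table.
            raise ValueError(
                f"Got {dt_col_names_count} column names but the data table "
                f"only has {dt_max_col_count} columns."
            )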

Contributor Author

@bobleesj bobleesj left a comment

@sbillinge ready for review. I was looking into this issue: #27

Before addressing this issue, I had to understand the code and thought I could help refactor this.


codecov bot commented Dec 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.69%. Comparing base (5f67f6c) to head (d940b64).
Report is 49 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #222      +/-   ##
==========================================
- Coverage   99.69%   99.69%   -0.01%     
==========================================
  Files           8        8              
  Lines         325      324       -1     
==========================================
- Hits          324      323       -1     
  Misses          1        1              
Files with missing lines | Coverage Δ
tests/test_serialization.py | 100.00% <100.00%> (ø)

@bobleesj bobleesj marked this pull request as ready for review December 13, 2024 05:43
@bobleesj
Contributor Author

Ready for review

Contributor

@sbillinge sbillinge left a comment

I started to review this, but @bobleesj like you I am a bit confused about what is going on here right from the get-go. The function is called serialize_data but it seems to be reading a data file. This seems closer to deserializing data, not serializing it. We may want to get @Sparks29032 in on the convo here, I think he may have written it.

I think we probably want code that will dump to different file formats and read from different formats. These are dumpers and loaders. If we serialize and deserialize things I think of it as a kind of (safe) round trip. The unsafe thing is just to pickle and then unpickle, but we need to be more careful now. But since we want people using diffraction objects, the serializers that are most interesting to me would be ones that serialize and deserialize complete diffraction objects.
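
For context, a minimal sketch of the kind of (safe) round trip described above, using a plain dict and the standard json module as stand-ins; the eventual DiffractionObject serializer would define its own schema and API:

import json

# A plain dict stands in for a diffraction object in this sketch.
do_dict = {"xtype": "q", "x": [0.1, 0.2, 0.3], "y": [1.0, 2.0, 3.0], "metadata": {"wavelength": 0.71}}

# Safe round trip: dump to a text format and read it back.
with open("do.json", "w") as f:
    json.dump(do_dict, f)
with open("do.json") as f:
    restored = json.load(f)
assert restored == do_dict

# The unsafe alternative (pickle then unpickle) can execute arbitrary code on
# load, so it should not be used for untrusted files.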

hdata: dict
-    File metadata (generally related to data table).
+    The file metadata (generally related to data table).
Contributor

Is there a reason we are not calling this variable metadata, or header_data? Calling it hdata means I am having to carry around too many pointers in my brain that I have to cognitively resolve as I read.

Collaborator

The hdata name came from loadData, which uses hdata for the header data. Good to change to metadata.

data_table: list or ndarray
-    Data table.
+    The data table.
dt_colnames: list
Contributor

What is a data table? A table with data on it? I think this is something like the column data read from the file.

Collaborator

These serialization functions were coded in tandem with loadData, so the inputs of these match the outputs of loadData. This assumes files structured like .chi/.gr/etc. files.
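
A usage sketch following the call shown in the test excerpt above; the import path, the example.gr filename, and the json-writing step are assumptions for illustration:

import json

from diffpy.utils.parsers import loadData, serialize_data  # import path assumed

# The inputs mirror loadData's outputs ("example.gr" is a placeholder filename).
hdata = loadData("example.gr", headers=True)  # header/metadata dictionary
data_table = loadData("example.gr")           # numerical data table

# Pack everything into a single dict keyed by the source file.
serialized = serialize_data("example.gr", hdata, data_table, dt_colnames=["r", "gr"], show_path=False)

# Writing the result out as .json is shown with the standard library here;
# default=... guards against any numpy values left in the returned dict.
with open("example.json", "w") as f:
    json.dump(serialized, f, indent=2, default=lambda o: o.tolist())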

@bobleesj
Contributor Author

@Sparks29032 when you have time this Sat, is the serialize_data method used in any of the diffpy packages? Wondering about the motivation and how it's used in practice.

@Sparks29032
Collaborator

I think this was originally added for the pdfitc rest-api project. I don't believe it is implemented in any packages.

These are dumpers and loaders, but called serialization/deserialization as we are moving from serial files to python-readable information.

This was coded a while back, so I don't remember all the details, but I believe these were coded specifically to dump and load PDF data (hence a data table [list type] + header information [dictionary type]) into serial file formats like .json.

Moving forward, it might be good to standardize so only dictionaries can be passed into these functions, and the dumpers and loaders can use specific file-format libraries given the extension of the file.
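
A minimal sketch of the extension-based dispatch idea; the dump_dict name and the supported formats are hypothetical:

import json
from pathlib import Path


def dump_dict(data: dict, filepath):
    # Hypothetical dumper: accept only a dictionary and pick the backend
    # from the file extension, as suggested above.
    path = Path(filepath)
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open("w") as f:
            json.dump(data, f, indent=2)
    elif suffix in (".yml", ".yaml"):
        import yaml  # PyYAML, assumed to be an available optional dependency

        with path.open("w") as f:
            yaml.safe_dump(data, f)
    else:
        raise ValueError(f"Unsupported file extension: {suffix}")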

@bobleesj
Contributor Author

Thanks @Sparks29032

> I think this was originally added for the pdfitc rest-api project. I don't believe it is implemented in any packages.

To confirm, this function is used by pdfitc? Wasn't sure if the source code for pdfitc is publicly available.

> load PDF data (hence a data table [list type] + header information

Yes, there are test cases for loading .gr files and turning these into a single .json file.

@sbillinge As the next step, since we have an issue to write a DiffractionObject serializer/deserializer (#195) in 3.7.0, should we close the following issue #27 that this PR is trying to slowly address?

@Sparks29032
Collaborator

@bobleesj It is not used by pdfitc, don't worry. It was experimental.

@sbillinge
Contributor

We definitely want the dumpers and loaders and a pure serializer. We can put it off to 3.7 though to keep momentum. The serializer will change the API though, because I would like to give, as an option, a file path, or probably better a stream, so we can have different backends, and have it load the DO from there. So the serializer would be nice now I guess?
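
A sketch of the path-or-stream option being proposed; the load_do_dict name and the JSON backend are hypothetical:

import io
import json
from pathlib import Path


def load_do_dict(source):
    # Hypothetical loader that accepts either a file path or an open stream,
    # so different backends (disk, network, in-memory) can be plugged in.
    if isinstance(source, (str, Path)):
        with open(source) as f:
            return json.load(f)
    # Otherwise assume a readable text stream.
    return json.load(source)

# Usage with an in-memory stream instead of a file on disk:
buffer = io.StringIO('{"xtype": "q", "x": [0.1], "y": [1.0]}')
do_dict = load_do_dict(buffer)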

@sbillinge
Contributor

Dumpers and loaders for group-standard file formats like the ones we have are a bit specific, but probably worth having since a large community is using these file formats.

@bobleesj
Contributor Author

Since this is a PR, I will copy and paste the comments by @sbillinge into the existing issue, #195.

@bobleesj bobleesj closed this Dec 15, 2024
@bobleesj
Contributor Author

bobleesj commented Dec 15, 2024

Closing - we will probably make a new PR when we are dealing with 3.7.0. For now, this PR isn't needed for the 3.6.0 release.
