
[JOSS] Comments on the functionality #175

Closed

rcannood opened this issue Jan 15, 2024 · 2 comments
rcannood commented Jan 15, 2024

This issue pertains to my review of simChef's functionality as part of my review of the JOSS submission of this tool at openjournals/joss-reviews#6156.

I apologize for being quite critical about certain statements in the paper (discussed below), but I do think that explicitly showcasing each of the topics below would be of immense value to an inexperienced user wishing to perform the kinds of studies allowed by simChef.

Functionality

Have the functional claims of the software been confirmed?

The paper states:

While simChef’s core functionality focuses on computability (C) – encompassing efficient usage of computational resources, ease of user interaction, reproducibility, and documentation – we emphasize the importance of predictability (P) and stability (S) in data science simulations.

The documentation also states:

Flexible and seamless integration with distributed computation, caching, checkpointing, and debugging tools

There is too little information on how to enable distributed computation, reproducibility, caching, checkpointing and debugging when using simChef.

I found some information in Setting Up Your Simulation Study that suggests you can use hpc = TRUE and init_renv = TRUE to enable distributed computations and reproducibility. However, at this stage I find it hard to argue that the package seamlessly integrates with any of these concepts, since they are not discussed enough in the documentation.
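For reference, here is roughly the setup call I pieced together from that vignette. Note that the function name create_sim() and the exact placement of these arguments are my reading of the docs and may not be exact:

library(simChef)

# sketch based on the "Setting Up Your Simulation Study" vignette;
# create_sim() and the pairing of these arguments are my assumptions
create_sim(
  path = "path/to/my-simulation",
  init_renv = TRUE,  # initialize an renv library to pin package versions
  hpc = TRUE         # set up directories for HPC runs
)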

Would it be possible to create separate articles in the documentation to showcase how to set up distributed computation, set up reproducibility, how to use caching & checkpointing, and how to debug runs? Or would you have an alternative solution?

I'm certain that you as developers know exactly how to do this with simChef, but currently the documentation does little to help novice simChef users learn how to do any of these things from scratch.

In the context of reproducibility I would expect the rendered report to contain information on software versions of the used environments.
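For example, something as lightweight as printing session information at the end of the report source would already help (the renv::snapshot() line assumes the project uses renv):

# e.g., in the final chunk of the report's R Markdown source:
sessionInfo()

# or, if the project uses renv, record exact package versions in a lockfile:
# renv::snapshot()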

W.r.t. caching and debugging tools, I'd expect to be able to see the execution time, CPU usage, memory usage, and error messages when using an HPC as a backend. What happens when one of the executions fails? How can I debug what went wrong during a failed run?

Performance

If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

This ties in with the previous comment. The paper mentions efficient usage of computational resources, but some simulation studies will require using an HPC to be able to run the analysis in a decent time frame.

Checklist

  • Functionality - Showcase distributed computation
  • Functionality - Showcase reproducibility with renv
  • Functionality - Showcase debugging tools when something goes wrong
jpdunc23 commented Jan 15, 2024

@rcannood first off, thank you so much for your careful review and very helpful comments. I'll try to address some of your points below.

TL;DR: Please let us know what you think would be best in terms of making the documentation more accessible: (a) split the "comprehensive guide" into separate parts; (b) create new, self-contained vignettes for the following 3 topics: caching, checkpointing, and debugging; (c) something else. I lean toward (a) as part of this JOSS review with (b) as a longer-term goal.

I found some information in Setting Up Your Simulation Study that suggests you can use hpc = TRUE and init_renv = TRUE to enable distributed computations and reproducibility. However, at this stage I find it hard to argue that the package seamlessly integrates with any of these concepts, since they are not discussed enough in the documentation.

I've added a couple of issues related to this part of the API: #176 and #177. I agree this is a bit unclear as it stands but I think that a few updates to the docs and a small bit of API refactoring could go a long way to addressing your concerns.

One point of confusion is the hpc argument which I think should be renamed / removed. Setting hpc = TRUE does nothing more than create directories called scripts and logs. We can definitely make this clearer for users.

Would it be possible to create separate articles in the documentation to showcase how to set up distributed computation, set up reproducibility, how to use caching & checkpointing, and how to debug runs? Or would you have an alternative solution?

We've definitely found it challenging to document simChef in a way that is both approachable and comprehensive, so thank you for taking a close look at the docs and giving us your impressions as a new user. As you've seen, we have a dedicated vignette on using distributed computation. You've probably also seen the "comprehensive guide" vignette which is quite long. That vignette includes sections on caching, checkpointing, and debugging.

To clarify, do you think it would be better to create separate, self-contained vignettes for each concept? Another option is to split the comprehensive guide into smaller chunks (e.g., Part 1, Part 2, etc.) focused on specific parts of the API. Would that achieve the same goal?

In the context of reproducibility I would expect the rendered report to contain information on software versions of the used environments.

I think this is a great idea. I've opened #178.

W.r.t. caching and debugging tools, I'd expect to be able to see the execution time, CPU usage, memory usage, and error messages when using an HPC as a backend. What happens when one of the executions fails? How can I debug what went wrong during a failed run?

We print the total execution time at the end of Experiment$fit / Experiment$run() (e.g., see this example) and optionally collect some basic memory usage stats via gc() when the user sets options(simChef.debug = TRUE) (documented here). As for displaying CPU usage, I agree this would be a nice feature but I'm not sure it's within scope right now and users can use external tools like top or htop to check CPU usage as the sim is running. But do you maybe want to expand upon what specifically you'd like to see from simChef along these lines?
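To sketch the opt-in (here experiment stands in for an Experiment object you've already assembled):

# enable simChef's debugging output, including gc()-based memory stats,
# before running the experiment
options(simChef.debug = TRUE)
results <- experiment$run(n_reps = 10)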

As I mentioned above, debugging capabilities are documented here. When an error occurs, we capture the error object itself, process ID, output from gc(), and the specific DGP instance (and Method instance, if the replicate got that far) and parameters that were in play when the error happened. This way, users can directly inspect and interact with those objects after the fact. During batch execution (e.g., when running an experiment as part of a Slurm sbatch job), the pattern is to catch and save the simChef_error, e.g. as in the above vignette:

results <- tryCatch(
  run_experiment(buggy_experiment, n_reps = 2),
  error = identity
)
# check whether the result is an error condition and, if so,
# save it to disk for later interactive inspection
if (inherits(results, "error")) {
  saveRDS(results, "simchef_error.rds")
}

I don't blame you if you missed that example in the "comprehensive guide" vignette given that vignette is a monolith, or maybe that example just isn't clear enough. If the problem is the former, then my inclination is to split that vignette into separate parts so that users can quickly get to the information they need via the docs site header navigation.

Please let us know if you think a different approach would be better or if the existing example needs to be reworked.

This ties in with the previous comment. The paper mentions efficient usage of computational resources, but some simulation studies will require using an HPC to be able to run the analysis in a decent time frame.

This is possible using the appropriate future backend. For example, plan(multicore) would enable running a small simulation in parallel on a laptop with 8 cores, or a modestly-sized simulation using 64 cores on a single cluster node. For larger simulations requiring multi-node computation, the backends provided by future.batchtools should work. Your comments make me think it would be worthwhile to document such a workflow, so I've added issue #179.
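To sketch the workflow I have in mind (experiment is a placeholder for a fully-specified Experiment, and the Slurm lines assume future.batchtools and a batchtools template file are configured):

library(future)
library(simChef)

# single-machine parallelism, e.g. a laptop or one cluster node
plan(multicore, workers = 8)

# multi-node parallelism via a scheduler such as Slurm:
# library(future.batchtools)
# plan(batchtools_slurm, template = "slurm.tmpl")

results <- run_experiment(experiment, n_reps = 100)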

@jpdunc23 jpdunc23 self-assigned this Jan 15, 2024
rcannood commented

One point of confusion is the hpc argument which I think should be renamed / removed. Setting hpc = TRUE does nothing more than create directories called scripts and logs. We can definitely make this clearer for users.

I see! 👍 Sounds good!


Would it be possible to create separate articles in the documentation to showcase how to set up distributed computation, set up reproducibility, how to use caching & checkpointing, and how to debug runs? Or would you have an alternative solution?

To clarify, do you think it would be better to create separate, self-contained vignettes for each concept? Another option is to split the comprehensive guide into smaller chunks (e.g., Part 1, Part 2, etc.) focused on specific parts of the API. Would that achieve the same goal?

Indeed, I found the simChef: a comprehensive guide vignette quite long, which makes it hard to find information on a specific aspect, e.g. caching or debugging. However, please consider this a non-blocking issue, as the information is definitely there.


For all of the other items I see you've created issues. I will also consider them optional improvements to the software tool, so they are no longer blocking publication of the manuscript on my part :)
