pymultifit presubmission #221

Closed
3 of 16 tasks
syedalimohsinbukhari opened this issue Dec 14, 2024 · 14 comments

Comments

@syedalimohsinbukhari

syedalimohsinbukhari commented Dec 14, 2024

Submitting Author: Syed Ali Mohsin Bukhari (@syedalimohsinbukhari)
Package Name: pymultifit
One-Line Description of Package: A python library for fitting data with multiple models.
Repository Link (if existing): https://github.com/syedalimohsinbukhari/pyMultiFit
EiC: Szymon Moliński (@SimonMolinsky)


Code of Conduct & Commitment to Maintain Package

Description

  • Include a brief paragraph describing what your package does:

pymultifit is built primarily to solve one problem: fitting multiple models (and mixture models) to a given dataset. Be it multiple Gaussians, multiple Laplacians, or a mixture of such models, this package aims to handle multi-model data fitting. The package also provides easy-to-use BaseDistribution and BaseFitter classes for user-defined distributions and fitters, respectively.

Community Partnerships

We partner with communities to support peer review with an additional layer of
checks that satisfy community requirements. If your package fits into an
existing community please check below:

Scope

  • Please indicate which category or categories this package falls under:

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific

  • Geospatial
  • Education

  • Explain how and why the package falls under these categories (briefly, 1-2 sentences). For community partnerships, check also their specific guidelines as documented in the links above. Please note any areas you are unsure of:

This library falls under the "data processing/munging" category as it takes the given data and tries to fit the given model(s) to the data via minimization processes. It also allows the user to extract the parameters for further analysis of the data fitters via helpful functions. Visualization is done internally for the fitted model with options of separable views on total data fitting and individual fits via the fitter module. On the other hand, the distribution module provides pdf, cdf, and stats functionality for any user-defined or pre-built distribution selected.

  • Who is the target audience and what are the scientific applications of this package?

Researchers, data scientists, and statisticians who work with datasets requiring multi-model fitting for robust analysis and modeling.

  • Are there other Python packages that accomplish similar things? If so, how does yours differ?

Apart from the general-purpose scientific packages scipy, lmfit, and scikit-learn, there is PyAutoFit, a Python-based probabilistic programming language built on Bayesian inference. Another notable library is Mixture-Models, which specializes in advanced optimization techniques for fitting various families of mixture models, including Gaussian mixture models and their variants. Both libraries are powerful tools for specific use cases; I came across them recently while searching for existing options.

While these libraries offer robust solutions for hierarchical modeling (PyAutoFit) or a diverse array of pre-defined mixture models (Mixture-Models), pyMultiFit distinguishes itself through its focus on simplicity of use. Specifically, it is designed to provide a lightweight and user-friendly framework for fitting multi-model data, including custom mixture models (for example, gaussian + laplace + line). pyMultiFit also provides easy-to-use base classes that can be adapted to any distribution or fitter.
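
To make the "gaussian + laplace + line" case concrete, here is a minimal sketch of such a custom mixture fit written directly against scipy.optimize.curve_fit. It is not pyMultiFit's API; the function names, the synthetic data, and the starting values are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit


# Hypothetical component functions (not taken from pyMultiFit).
def gaussian(x, amp, mu, sigma):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


def laplace(x, amp, mu, b):
    return amp * np.exp(-np.abs(x - mu) / b)


def line(x, slope, intercept):
    return slope * x + intercept


def mixture(x, g_amp, g_mu, g_sigma, l_amp, l_mu, l_b, slope, intercept):
    # The composite model is simply the sum of the three components.
    return (gaussian(x, g_amp, g_mu, g_sigma)
            + laplace(x, l_amp, l_mu, l_b)
            + line(x, slope, intercept))


# Synthetic data: known parameters plus Gaussian noise (assumed values).
rng = np.random.default_rng(0)
x = np.linspace(-10.0, 10.0, 800)
true_params = (5.0, -3.0, 1.0, 3.0, 4.0, 0.8, 0.05, 1.0)
y = mixture(x, *true_params) + rng.normal(0.0, 0.05, x.size)

# A rough initial guess, one value per parameter of `mixture`.
p0 = (4.0, -2.5, 1.5, 2.5, 3.8, 1.0, 0.0, 0.5)
popt, pcov = curve_fit(mixture, x, y, p0=p0)

residual_rms = np.sqrt(np.mean((mixture(x, *popt) - y) ** 2))
print("fitted parameters:", np.round(popt, 3))
print("residual RMS:", residual_rms)
```

The bookkeeping visible here (one long parameter list, hand-written sum of components) is the kind of boilerplate the submission says the package aims to hide.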

One of the more prominent features of pyMultiFit is the BaseFitter template class that provides custom fitting to any definable function with minimal boilerplate code. All the plotting and boundary functionalities are handled inside the template class so that the user can focus solely on running through multiple models quickly without thinking about how to manage multiple models of the same type or even of different types.

Additionally, the generators template function provides the user with an N-model data generator function with added noise capability to mimic real-life scenarios of whatever distribution the user might want.
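
An N-model data generator with optional noise can be sketched in a few lines of NumPy. The function name generate_multi_component and its parameters are assumptions made for this illustration, not the signature of pyMultiFit's generators.

```python
import numpy as np


def gaussian(x, amp, mu, sigma):
    return amp * np.exp(-0.5 * ((x - mu) / sigma) ** 2)


def generate_multi_component(x, component, params, noise_level=0.0, seed=None):
    """Hypothetical generator: sum N copies of `component`, then add Gaussian noise."""
    y = np.zeros_like(x, dtype=float)
    for p in params:
        y += component(x, *p)
    if noise_level > 0:
        rng = np.random.default_rng(seed)
        y += rng.normal(0.0, noise_level, size=x.shape)
    return y


x = np.linspace(-10.0, 10.0, 400)
params = [(4.0, -4.0, 1.0), (2.5, 0.0, 0.7), (3.0, 5.0, 1.5)]

clean = generate_multi_component(x, gaussian, params)
noisy = generate_multi_component(x, gaussian, params, noise_level=0.2, seed=42)
print("max deviation introduced by noise:", np.max(np.abs(noisy - clean)))
```

Passing a seed makes the noisy data reproducible, which is convenient when the generated data feeds a benchmark or a tutorial.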

  • Any other questions or issues we should be aware of:

P.S. Have feedback/comments about our review process? Leave a comment here

@SimonMolinsky
Collaborator

Hi @syedalimohsinbukhari

Thanks for submitting your package! It's a great tool, but at this point, I wonder if it is within the scope of pyOpenSci. You have written that pymultifit overlaps with other packages in the ecosystem, but the differences are unclear. The package must fulfill at least one of the following conditions:

- More open in licensing or development practices
- Broader in functionality (e.g., providing access to more data sets, providing a greater suite of functions), but not only by duplicating additional packages
- Better in usability and performance
- Actively maintained while alternatives are poorly or no longer actively maintained

You have stated that your package has better usability and performance. I've checked your package's documentation (https://pymultifit.readthedocs.io/index.html#), and it doesn't provide examples of how to use the package, so I wasn't able to compare functionalities. Could you prepare some comparisons of overlapping functionalities as code examples? They could become part of your README in the future. Then we can decide if pymultifit can be accepted to pyOpenSci.

@syedalimohsinbukhari
Author

Hi @SimonMolinsky

Thank you for reaching out. I apologize for submitting with incomplete documentation before the presubmission inquiry. The package documentation is currently being developed in the docs branch and will be up ASAP, including proper usage examples and comparisons.

@SimonMolinsky
Collaborator

@syedalimohsinbukhari

In this situation, please let me know when the docs build is ready!

@syedalimohsinbukhari
Author

I will, and once again, thank you for your time and consideration @SimonMolinsky

@syedalimohsinbukhari
Author

Hi @SimonMolinsky

Thank you for the wait; the documentation is now up with API references, tutorials, and benchmarks for speed and accuracy with scipy as well.

@SimonMolinsky
Collaborator

@syedalimohsinbukhari

Thanks for the updates. Give me a few days, and I will come back with the final decision!

@syedalimohsinbukhari
Author

@SimonMolinsky

That'd be great.

@SimonMolinsky
Collaborator

@syedalimohsinbukhari could you check your documentation on Read the Docs? I got the 404 error when using this link: https://pymultifit.readthedocs.com/latest/

@syedalimohsinbukhari
Author

syedalimohsinbukhari commented Jan 19, 2025

Hi @SimonMolinsky

The documentation is up and working at https://pymultifit.readthedocs.io/latest, not on the .com version. I don't know how I mixed up io with com, but I'll correct it right away.

EDIT: The link is now corrected in the README.md file.

@SimonMolinsky
Collaborator

Hi!

I've checked the benchmarks and tutorials, and a few comments from me:

  • Change the placeholder named your_library to the name of your package.

  • benchmarking - generally it's ok, but it would be nice to introduce some summary at the top of the page (not for every plot, but for accuracy and speed pages)

  • The tutorials are great!

I have the last two questions:

  1. How have you achieved a computing time decrease over Scipy? You use mostly numpy; haven't you skipped some checks performed by scipy functions to achieve those results? Or is there another reason for the faster computations?
  2. Do you plan to maintain and develop this package in the future?

@syedalimohsinbukhari
Author

syedalimohsinbukhari commented Jan 19, 2025

Hi,

Thanks for taking the time to check it out. I will surely include the summary on the benchmarking pages and remove that placeholder mark.

> I have the last two questions:
>
> 1. How have you achieved a computing time decrease over Scipy? You use mostly numpy; haven't you skipped some checks performed by scipy functions to achieve those results? Or is there another reason for the faster computations?

As I understand it, the decrease in computation time relative to SciPy stems mostly from the fact that my implementations are single-purpose and do not need as many checks as SciPy performs. Matching SciPy's values did require a lot of tweaking in some distributions; you can see the results in the accuracy benchmark file.
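
The trade-off can be illustrated with a small comparison. The sketch below is not pyMultiFit's code; gaussian_pdf_sketch is a hypothetical single-purpose function that skips the argument validation and generic dispatch a general framework carries, and the timing numbers will vary by machine.

```python
import timeit

import numpy as np
from scipy.stats import norm


def gaussian_pdf_sketch(x, mean=0.0, std=1.0):
    # Direct formula, no validation: the caller must ensure std > 0.
    z = (x - mean) / std
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2.0 * np.pi))


x = np.linspace(-5.0, 5.0, 10_000)

# Accuracy check against SciPy's reference implementation.
direct = gaussian_pdf_sketch(x, mean=1.0, std=2.0)
reference = norm.pdf(x, loc=1.0, scale=2.0)
print("values agree:", np.allclose(direct, reference))

# Rough timing comparison; absolute numbers depend on hardware and versions.
t_direct = timeit.timeit(lambda: gaussian_pdf_sketch(x, 1.0, 2.0), number=200)
t_scipy = timeit.timeit(lambda: norm.pdf(x, loc=1.0, scale=2.0), number=200)
print(f"direct: {t_direct:.4f} s, scipy: {t_scipy:.4f} s")
```

The direct version offers no protection against invalid parameters, which is part of the price of dropping the checks.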

My reasoning comes from the slower or merely comparable computation times of the Beta, ArcSine, and SkewNormal distributions: matching SciPy's values for these required extensive checks or calls to other distributions, which incurred heavy overhead.

For example, BetaDistribution -> beta_pdf_ uses various masks for different values of alpha and beta parameters

https://github.com/syedalimohsinbukhari/pyMultiFit/blob/e1777ea6a5391839e14b9cefd43c78ebff0d73d6/src/pymultifit/distributions/utilities_d.py#L117-L143

Also, SkewNormalDistribution -> skew_normal_pdf_ internally calls gaussian_pdf_ and gaussian_cdf_ functions

https://github.com/syedalimohsinbukhari/pyMultiFit/blob/e1777ea6a5391839e14b9cefd43c78ebff0d73d6/src/pymultifit/distributions/utilities_d.py#L1131-L1143
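
For readers who do not follow the link, the composition can be sketched from the textbook skew-normal density, 2/scale · φ(z) · Φ(shape · z) with z = (x − loc)/scale. The code below is an independent illustration, not the linked implementation, and the helper names are assumptions.

```python
import numpy as np
from scipy.special import erf
from scipy.stats import skewnorm


def std_normal_pdf(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)


def std_normal_cdf(z):
    return 0.5 * (1.0 + erf(z / np.sqrt(2.0)))


def skew_normal_pdf_sketch(x, shape, loc=0.0, scale=1.0):
    # Each evaluation costs one Gaussian pdf call plus one Gaussian cdf call,
    # which is where the extra overhead comes from.
    z = (x - loc) / scale
    return 2.0 / scale * std_normal_pdf(z) * std_normal_cdf(shape * z)


x = np.linspace(-6.0, 8.0, 2_000)
ours = skew_normal_pdf_sketch(x, shape=3.0, loc=0.5, scale=2.0)
reference = skewnorm.pdf(x, 3.0, loc=0.5, scale=2.0)
print("values agree:", np.allclose(ours, reference))
```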

> 2. Do you plan to maintain and develop this package in the future?

Yes, I do.

@SimonMolinsky
Collaborator

SimonMolinsky commented Jan 19, 2025

@syedalimohsinbukhari

Considering your input, your package has the potential to be a great tool in the pyOpenSci ecosystem!

You may submit pymultifit to the pyOpenSci ecosystem.

I'm still slightly unsure about the overlap between your package and scipy, but you have provided extensive tutorials and benchmarks for the community, and this is a positive signal! Moreover, you intend to maintain pymultifit in the future, so I think that the overlap between scipy and pymultifit will be less noticeable.

Again, feel free to submit your package! I will close this thread when you submit the package.

Have a nice day!

@syedalimohsinbukhari
Author

Hi @SimonMolinsky

Thank you for getting back to me, and thank you for your assessment. I will submit the package, but I would like to understand what you mean by the overlap between pyMultiFit and scipy (just for my own clarity).

@SimonMolinsky
Collaborator

I meant the capabilities (not functionalities and their API implementation).

Projects
Status: pre-submission

2 participants