Aggregate models from the Hugging Face Hub based on search scenarios.
- Work on a generic important_models.yaml search scenario listing the models for which we want CSP support and great documentation.
- Work on a generic recommended_models.yaml to provide recommendations to CSPs for their curated catalogs.
- Add a component that can pull the best models from leaderboards (there doesn't seem to be programmatic access).
- Add a component that can list Merve's collections using HfApi (complicated).
- Provider-specific Model Selection
  - Compatibility checks for multiple cloud providers (GCP, AWS, Azure)
- Flexible Search Scenarios
  - YAML-based configuration for search scenarios
  - configs/important_models.yaml lists the models for which we want great documentation across all our CSPs.
  - configs/recommended_models.yaml lists the models which we think should be added to our CSP catalogs.
  - Create your own.
- Clone the repository:
git clone [repository-url]
- Install dependencies:
pip install -r requirements.txt
Required dependencies:
- huggingface-hub: For accessing the Hugging Face model hub
- pandas: For data processing and CSV output
The tool uses YAML configuration files located in the configs/ directory:
- search_scenarios.yaml: Example search scenarios
- providers/: Provider-specific compatibility rules
  - gcp.yaml: Google Cloud Platform configuration (Deploy to Google Cloud rules)
  - aws.yaml: Amazon Web Services configuration (Deploy to SageMaker rules)
  - azure.yaml: Microsoft Azure configuration (Azure HF Collection limitations)
src:
- config.py: Defines the classes used to load the provider and search scenario config files.
- providers.py: Defines one class per provider. These classes define the model compatibility rules.
- searcher.py: Defines the class used to query the Hub for each search query, determine model compatibility, and save results.

main.py: Takes a list of providers and a search scenario config file as input and outputs the results.
Each scenario in your scenarios YAML file requires:
- sort: Field to sort results by (e.g., "downloads", "trendingScore")
- direction: Sort direction (-1 for descending, 1 for ascending)

Optional parameters:
- tasks: List of Hugging Face tasks to search for
- tags: List of tags to filter models
Example scenario configuration:
finance:
  sort: "downloads"
  direction: -1
  tasks:
    - "text-classification"
    - "text-generation"
  tags:
    - "finance"
    - "fintech"
When both tasks and tags are specified, the tool performs searches for each combination of task and tag.
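The task and tag combination logic can be sketched as a Cartesian product. This is a minimal illustration, not the code in searcher.py: the `expand_scenario` function and the shape of the returned query dictionaries are assumptions made for this example, and the scenario is passed in as a plain dictionary (as it would look after loading the YAML file).

```python
from itertools import product


def expand_scenario(name, scenario):
    """Expand one scenario dict into a list of individual search queries.

    Hypothetical helper: the real tool may structure its queries differently.
    """
    # "sort" and "direction" are the two required fields of a scenario.
    missing = [key for key in ("sort", "direction") if key not in scenario]
    if missing:
        raise ValueError(f"scenario {name!r} is missing required fields: {missing}")

    # A missing or empty optional list becomes [None], so product() still
    # yields exactly one query for that dimension.
    tasks = scenario.get("tasks") or [None]
    tags = scenario.get("tags") or [None]

    return [
        {
            "scenario": name,
            "task": task,
            "tag": tag,
            "sort": scenario["sort"],
            "direction": scenario["direction"],
        }
        for task, tag in product(tasks, tags)
    ]


# The finance scenario from the example configuration, as a dict.
finance = {
    "sort": "downloads",
    "direction": -1,
    "tasks": ["text-classification", "text-generation"],
    "tags": ["finance", "fintech"],
}

queries = expand_scenario("finance", finance)
for query in queries:
    print(query["task"], query["tag"])
```

With two tasks and two tags, the finance scenario expands into four searches, one per (task, tag) pair.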
Basic usage:
python main.py --provider gcp,aws,azure --search_scenario_file configs/search_scenario.yaml
- --provider: Comma-separated list of providers (gcp,aws,azure)
- --search_scenario_file: Path to the YAML file containing the search scenarios
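For illustration, the two arguments could be parsed along the following lines with argparse. This is a sketch under assumptions, not the actual contents of main.py: the helper names, the validation against a fixed provider list, and marking both arguments as required are all choices made for this example.

```python
import argparse

# Providers named in this README; the real tool may derive this list elsewhere.
SUPPORTED_PROVIDERS = ("gcp", "aws", "azure")


def parse_providers(value):
    """Turn "gcp,aws" into ["gcp", "aws"] and reject unknown names."""
    providers = [item.strip().lower() for item in value.split(",") if item.strip()]
    unknown = [item for item in providers if item not in SUPPORTED_PROVIDERS]
    if not providers or unknown:
        raise argparse.ArgumentTypeError(
            f"expected a comma-separated subset of {SUPPORTED_PROVIDERS}, got {value!r}"
        )
    return providers


def build_parser():
    parser = argparse.ArgumentParser(description="Aggregate models from the Hugging Face Hub")
    parser.add_argument(
        "--provider",
        type=parse_providers,
        required=True,  # assumption: the tool may instead define a default
        help="Comma-separated list of providers (gcp,aws,azure)",
    )
    parser.add_argument(
        "--search_scenario_file",
        required=True,  # assumption, as above
        help="Path to the YAML file containing the search scenarios",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--provider", "gcp,aws", "--search_scenario_file", "configs/search_scenario.yaml"]
)
print(args.provider, args.search_scenario_file)
```

Validating provider names at parse time means a typo such as `--provider gpc` fails immediately with a usage message rather than after the Hub has been queried.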
- Search trending models for GCP:
python main.py --provider gcp --search_scenario_file configs/trending.yaml
- Get finance-specific models for AWS:
python main.py --provider aws --search_scenario_file configs/finance.yaml
- Run a search scenario across providers:
python main.py --provider gcp,aws --search_scenario_file configs/search_scenario.yaml
The tool generates a consolidated CSV file in the output/ directory with:
- Model ID
- Provider Compatibility (for each provider)
- Downloads
- Likes
- Tags
- Task
- Search Parameters Used (task and tag that found the model)
- Pipeline Compatibility
- Library Name
- Search Scenario
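To make the layout of the consolidated CSV concrete, here is a small sketch of one possible set of columns. The column names, the one-column-per-provider layout, and the row values are placeholders invented for this example and are not taken from the tool's real output. It also uses the standard library csv module in place of pandas so that it runs without extra dependencies.

```python
import csv
import io

# Hypothetical column names; the real file may name them differently.
PROVIDERS = ["gcp", "aws"]
COLUMNS = (
    ["model_id"]
    + [f"{provider}_compatible" for provider in PROVIDERS]
    + ["downloads", "likes", "tags", "task", "search_task", "search_tag",
       "pipeline_compatible", "library_name", "search_scenario"]
)

# Placeholder row, not a real model or real statistics.
rows = [
    {
        "model_id": "example-org/example-model",
        "gcp_compatible": True,
        "aws_compatible": False,
        "downloads": 0,
        "likes": 0,
        "tags": ";".join(["finance", "text-classification"]),
        "task": "text-classification",
        "search_task": "text-classification",
        "search_tag": "finance",
        "pipeline_compatible": True,
        "library_name": "transformers",
        "search_scenario": "finance",
    }
]


def write_results(rows, handle):
    """Write one header line followed by one line per model."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer; the tool itself writes under output/.
buffer = io.StringIO()
write_results(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```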
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request