pagezyhf/hub-model-search

Hub Model Search

Aggregate models from the Hugging Face Hub based on search scenarios.

TODO

  1. Work on a generic important_models.yaml search scenario listing the models for which we want CSP support and great documentation.
  2. Work on a generic recommended_models.yaml to provide recommendations to CSPs on their curated catalogs.
  3. Add a component that can pull the best models from leaderboards (there doesn't seem to be programmatic access).
  4. Add a component that can list Merve's collections using HfApi (complicated).

Features

  1. Provider-specific Model Selection

    • Compatibility checks for multiple cloud providers (GCP, AWS, Azure)
  2. Flexible Search Scenarios

    • YAML-based configuration for search scenarios
    • configs/important_models.yaml lists models for which we want great documentation across all our CSPs.
    • configs/recommended_models.yaml lists models that we think should be added to our CSPs' catalogs.
    • Create your own.

Installation

  1. Clone the repository:
git clone [repository-url]
  2. Install dependencies:
pip install -r requirements.txt

Required dependencies:

  • huggingface-hub: For accessing the Hugging Face model hub
  • pandas: For data processing and CSV output

Configuration

The tool uses YAML configuration files located in the configs/ directory:

  • search_scenarios.yaml: example search scenarios
  • providers/: Provider-specific compatibility rules
    • gcp.yaml: Google Cloud Platform configuration (Deploy to Google Cloud rules)
    • aws.yaml: Amazon Web Services configuration (Deploy to SageMaker rules)
    • azure.yaml: Microsoft Azure configuration (Azure HF Collection limitations)

Logic

src:

  • config.py: Defines the classes used to load the provider and search scenario config files.
  • providers.py: Defines one class per provider. These classes hold the model compatibility rules.
  • searcher.py: Defines the class that queries the Hub for each search query, determines model compatibility, and saves the results.

main.py takes as input a list of providers and a config file of search scenarios, and outputs the results.

Search Scenarios Configuration

Each scenario in your scenarios YAML file requires:

  • sort: Field to sort results by (e.g., "downloads", "trendingScore")
  • direction: Sort direction (-1 for descending, 1 for ascending)

Optional parameters:

  • tasks: List of Hugging Face tasks to search for
  • tags: List of tags to filter models
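The required and optional keys above can be checked before any search is run. The following is a minimal sketch, not the tool's actual validation: the function name validate_scenario and the error messages are hypothetical, and it assumes a scenario has already been loaded into a plain Python dict.

```python
# Hypothetical helper: checks a scenario dict against the keys listed above.
REQUIRED_KEYS = ("sort", "direction")
OPTIONAL_LIST_KEYS = ("tasks", "tags")


def validate_scenario(name, scenario):
    """Return a list of problems found; an empty list means the scenario looks valid."""
    problems = []
    # Both required keys must be present.
    for key in REQUIRED_KEYS:
        if key not in scenario:
            problems.append(f"{name}: missing required key '{key}'")
    # direction is documented as -1 (descending) or 1 (ascending).
    if "direction" in scenario and scenario["direction"] not in (-1, 1):
        problems.append(f"{name}: direction must be -1 or 1")
    # tasks and tags are optional, but must be lists when given.
    for key in OPTIONAL_LIST_KEYS:
        if key in scenario and not isinstance(scenario[key], list):
            problems.append(f"{name}: '{key}' must be a list")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a scenario file at once.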

Example scenario configuration:

finance:
  sort: "downloads"
  direction: -1
  tasks:
    - "text-classification"
    - "text-generation"
  tags:
    - "finance"
    - "fintech"

When both tasks and tags are specified, the tool performs searches for each combination of task and tag.
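The task and tag combinations can be pictured as a Cartesian product. The sketch below is an illustration under assumptions, not the code in searcher.py: expand_scenario is a hypothetical name, and the behaviour when only tasks or only tags is given (one query per listed value, with the other field left as None) is assumed, since the text above only describes the case where both are present.

```python
from itertools import product


def expand_scenario(scenario):
    """Expand one scenario dict into a list of individual search queries."""
    # Assumption: a missing or empty list contributes a single None placeholder.
    tasks = scenario.get("tasks") or [None]
    tags = scenario.get("tags") or [None]
    return [
        {
            "task": task,
            "tag": tag,
            "sort": scenario["sort"],
            "direction": scenario["direction"],
        }
        for task, tag in product(tasks, tags)
    ]


# The finance example above: 2 tasks x 2 tags gives 4 queries.
finance = {
    "sort": "downloads",
    "direction": -1,
    "tasks": ["text-classification", "text-generation"],
    "tags": ["finance", "fintech"],
}
queries = expand_scenario(finance)
```

Because every combination is a separate query, the number of searches grows multiplicatively with the length of the two lists.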

Usage

Basic usage:

python main.py --provider gcp,aws,azure --search_scenario_file configs/search_scenario.yaml

Command Line Arguments

  • --provider: Comma-separated list of providers (gcp,aws,azure)
  • --search_scenario_file: Path to the YAML file of search scenarios
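A parser for these two arguments could look like the sketch below. It is an assumption about how main.py might handle them, not a copy of it: the helper names, marking both flags as required, lowercasing provider names, and rejecting unknown providers are all hypothetical choices.

```python
import argparse

# The three providers named in this README.
KNOWN_PROVIDERS = ("gcp", "aws", "azure")


def parse_providers(value):
    """Split a comma-separated provider string and reject unknown names."""
    providers = [p.strip().lower() for p in value.split(",") if p.strip()]
    unknown = [p for p in providers if p not in KNOWN_PROVIDERS]
    if not providers or unknown:
        raise argparse.ArgumentTypeError(
            f"expected a comma-separated subset of {KNOWN_PROVIDERS}, got {value!r}"
        )
    return providers


def build_parser():
    parser = argparse.ArgumentParser(description="Hub Model Search")
    parser.add_argument("--provider", type=parse_providers, required=True,
                        help="Comma-separated list of providers (gcp,aws,azure)")
    parser.add_argument("--search_scenario_file", required=True,
                        help="Path to the YAML file of search scenarios")
    return parser
```

With this sketch, build_parser().parse_args(["--provider", "gcp,aws", "--search_scenario_file", "configs/search_scenario.yaml"]) yields a provider list of ["gcp", "aws"].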

Examples

  1. Search trending models for GCP:
python main.py --provider gcp --search_scenario_file configs/trending.yaml
  2. Get finance-specific models for AWS:
python main.py --provider aws --search_scenario_file configs/finance.yaml
  3. Run a search scenario across providers:
python main.py --provider gcp,aws --search_scenario_file configs/search_scenario.yaml
python main.py --provider gcp,aws --search_scenario_file configs/search_scenario.yaml

Output

The tool generates a consolidated CSV file in the output/ directory with:

  • Model ID
  • Provider Compatibility (for each provider)
  • Downloads
  • Likes
  • Tags
  • Task
  • Search Parameters Used (task and tag that found the model)
  • Pipeline Compatibility
  • Library Name
  • Search Scenario
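To make the column layout concrete, here is a rough sketch that writes rows with the fields listed above. Everything in it is an assumption: the column names are hypothetical, the tool's real headers may differ, and the sketch uses the standard library csv module to stay dependency-free, whereas the tool itself lists pandas for CSV output.

```python
import csv
import io

PROVIDERS = ("gcp", "aws", "azure")

# Hypothetical column names, one per field listed above.
COLUMNS = [
    "model_id",
    *[f"{provider}_compatible" for provider in PROVIDERS],
    "downloads",
    "likes",
    "tags",
    "task",
    "search_task",
    "search_tag",
    "pipeline_compatibility",
    "library_name",
    "search_scenario",
]


def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        record = {column: row.get(column, "") for column in COLUMNS}
        # Store a list of tags in a single cell, separated by "|".
        if isinstance(record["tags"], list):
            record["tags"] = "|".join(record["tags"])
        writer.writerow(record)
    return buffer.getvalue()
```

Keeping one compatibility column per provider means a single row can show at a glance where a model is deployable.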

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request
