
Update self instruct #24

Open
WenlongHuangResearch wants to merge 1 commit into yizhongw:main from WenlongHuangResearch:wenlonghuangresearch

Conversation

@WenlongHuangResearch WenlongHuangResearch commented May 5, 2025

Summary by Sourcery

Update the Self-Instruct project for compatibility with the latest OpenAI API: adapt the code to the API changes, update dependencies, and improve the task generation and fine-tuning data preparation workflow.

Enhancements:

  • Updated code to work with the new OpenAI API client
  • Modified API request handling to support the latest OpenAI library
  • Improved error handling and serialization of API responses

Documentation:

  • Completely rewrote the README with a detailed explanation of the task generation and fine-tuning process
  • Added comprehensive documentation for each step of the workflow

Chores:

  • Updated default model from 'davinci' to 'davinci-002'
  • Modified function signatures to match new OpenAI API client
  • Updated file paths and default parameters
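The engine-to-model rename and client switch listed above follow the v1 openai client pattern. A minimal sketch of the before/after call shape (the wrapper function below is hypothetical, not code from this PR):

```python
# Sketch of the engine -> model migration described above. The wrapper
# function name and defaults are illustrative assumptions, not the PR's code.

def build_request_kwargs(prompt, model="davinci-002",
                         temperature=0.7, top_p=0.5, max_tokens=1024):
    # The v1 client takes model= where the legacy API took engine=.
    return {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }

# Legacy (openai < 1.0):
#   response = openai.Completion.create(engine="davinci", prompt=prompt, ...)
#   text = response["choices"][0]["text"]
# Updated (openai >= 1.0):
#   client = OpenAI(api_key=..., organization=...)
#   response = client.completions.create(**build_request_kwargs(prompt))
#   text = response.choices[0].text

kwargs = build_request_kwargs("Write a task instruction:")
```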

sourcery-ai bot commented May 5, 2025

Reviewer's Guide

This pull request updates the codebase to be compatible with recent OpenAI API changes, modifies default parameters and paths, and adds a detailed README explaining the updated self-instruct workflow.

Sequence Diagram for Updated OpenAI API Call

sequenceDiagram
    participant Script as Calling Script
    participant API as gpt3_api.make_requests
    participant Client as OpenAI Client

    Script->>API: Call make_requests(...)
    API->>Client: Initialize OpenAI(api_key, organization)
    API->>Client: client.completions.create(model="davinci-002", ...)
    Client-->>API: Return Response object (response)
    API->>API: Process response.choices
    API-->>Script: Return formatted results
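The attribute-style response handling shown in the diagram can be illustrated with simplified stand-in types; these mock classes are illustrative only, not the real openai response models:

```python
# Simplified stand-in types showing why response parsing changed: the v1
# client returns typed objects with attributes, not plain dicts. These mock
# classes are assumptions for illustration, not the real openai models.
from dataclasses import dataclass
from typing import List

@dataclass
class Choice:
    text: str
    finish_reason: str

@dataclass
class Completion:
    choices: List[Choice]

response = Completion(choices=[Choice(text="42", finish_reason="stop")])

# Old dict-style access (response["choices"][0]["text"]) raises TypeError
# on the new object; attribute access is required instead:
text = response.choices[0].text
```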

Class Diagram for Module Updates

classDiagram
    namespace gpt3_api {
        class make_requests {
            +make_requests(..., model, temperature, top_p, ...)
            # Uses OpenAI Client
        }
    }
    namespace bootstrap_instructions {
        class bootstrap_instructions.py {
            +convert_to_serializable(obj)
            +main()
        }
    }
    namespace generate_instances {
        class generate_instances.py {
            +convert_to_serializable(obj)
            +main()
        }
    }
    class OpenAIClient {
        <<External>>
        +completions.create(...)
        +choices
        +text
        +finish_reason
    }
    make_requests ..> OpenAIClient : uses
    note for make_requests "Handles API calls using the updated OpenAI client library.\nParameter 'engine' changed to 'model'.\nResponse accessed via response.choices[...].text etc."
    note for bootstrap_instructions.py "Added helper to serialize API response metadata."
    note for generate_instances.py "Added helper to serialize API response metadata."

File-Level Changes

Updated OpenAI API interactions.
  • Replaced openai.Completion.create with client.completions.create.
  • Updated response object parsing (e.g., response["choices"][0]["text"] to response[0].text).
  • Changed the default engine/model from "davinci" to "davinci-002".
  • Handled potential JSON serialization issues in API responses with a new helper function.
Files: self_instruct/gpt3_api.py, self_instruct/bootstrap_instructions.py, self_instruct/generate_instances.py, self_instruct/prepare_for_finetuning.py, self_instruct/identify_clf_or_not.py
Updated documentation and configuration.
  • Added a detailed README explaining the 4-step workflow (task generation, classification, instance generation, fine-tuning preparation).
  • Updated default file paths and directory names (e.g., using data/gpt3_generations/).
  • Made some script arguments optional by removing required=True.
  • Adjusted the default number of instructions to generate in bootstrap_instructions.py.
Files: README.md, self_instruct/bootstrap_instructions.py, self_instruct/generate_instances.py, self_instruct/prepare_for_finetuning.py, self_instruct/identify_clf_or_not.py
Added new files for testing and output.
  • Added two Python scripts (test.py, test2.py), likely for experimentation or testing.
  • Added empty .jsonl files corresponding to the new default output paths.
Files: self_instruct/test.py, self_instruct/test2.py, data/gpt3_generations/all_generated_instances.jsonl, data/gpt3_generations/gpt3_finetuning_data_243.jsonl, data/gpt3_generations/is_clf_or_not_davinci-002_template_1.jsonl, data/gpt3_generations/machine_generated_instances.jsonl, data/gpt3_generations/machine_generated_instructions.jsonl
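The "removing required=True" change can be sketched with argparse; the argument name and default path mirror the excerpt quoted in the review comments, but this is a minimal sketch, not the PR's full parser:

```python
# Minimal sketch of making a previously required argument optional by
# giving it a default under the new data/gpt3_generations/ layout.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--instance_files",
        nargs="+",
        default=["data/gpt3_generations/machine_generated_instances.jsonl"],
        type=str,
        help="The input files that contain the machine generated instances.",
    )
    return parser.parse_args(argv)

args = parse_args([])  # no flags supplied; the default path applies
```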

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

sourcery-ai bot left a comment

Hey @WenlongHuangResearch - I've reviewed your changes - here's some feedback:

  • Consider moving the duplicated convert_to_serializable function to a shared utility module.
  • The new files test.py and test2.py seem like temporary development scripts that should be removed.
  • Ensure code comments align with the project's primary language; some new comments are in Chinese.
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 3 issues found
  • 🟢 Documentation: all looks good


return parser.parse_args()


# Add a helper function at the top of the file to handle non-serializable objects

issue (complexity): Consider extracting the convert_to_serializable helper function into a shared utility module to reduce code duplication.

Extract the helper into a shared utility module to avoid duplication and reduce the overall complexity. For example, create a file like utils/serialization.py:

# utils/serialization.py
import json

def convert_to_serializable(obj):
    if isinstance(obj, dict):
        return {k: convert_to_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_to_serializable(item) for item in obj]
    if hasattr(obj, 'to_dict') and callable(getattr(obj, 'to_dict')):
        return convert_to_serializable(obj.to_dict())
    if hasattr(obj, '__dict__'):
        return convert_to_serializable(vars(obj))
    try:
        json.dumps(obj)
        return obj
    except (TypeError, OverflowError):
        return str(obj)

Then in both files (including the current one and generate_instances.py), import and use it:

from utils.serialization import convert_to_serializable

# Use convert_to_serializable as before

This consolidates its definition, reduces code duplication, and makes the overall codebase easier to maintain.
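As a quick sanity check, the helper can be exercised end-to-end; the function is reproduced here so the example runs on its own, and the Meta class is a hypothetical stand-in for an API response object:

```python
# Self-contained demo of the convert_to_serializable helper proposed above.
# Meta is a hypothetical stand-in for a non-serializable response object.
import json

def convert_to_serializable(obj):
    if isinstance(obj, dict):
        return {k: convert_to_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_to_serializable(item) for item in obj]
    if hasattr(obj, "to_dict") and callable(obj.to_dict):
        return convert_to_serializable(obj.to_dict())
    if hasattr(obj, "__dict__"):
        return convert_to_serializable(vars(obj))
    try:
        json.dumps(obj)
        return obj
    except (TypeError, OverflowError):
        return str(obj)

class Meta:
    def __init__(self):
        self.model = "davinci-002"
        self.usage = {"total_tokens": 10}

# json.dumps(Meta()) would raise TypeError; after conversion it is JSON-safe.
serialized = convert_to_serializable(Meta())
print(json.dumps(serialized))
```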

return parser.parse_args()


def convert_to_serializable(obj):

issue (complexity): Consider extracting the convert_to_serializable function into a utility module and removing commented-out code to reduce complexity.

Consider extracting the duplicated convert_to_serializable function into a centralized utility module and removing legacy commented-out code. This will reduce maintenance overhead and simplify the file.

For example, create a new module (e.g., utils/serialization.py):

# utils/serialization.py
import json

def convert_to_serializable(obj):
    """Convert an object to a JSON-serializable format."""
    if isinstance(obj, dict):
        return {k: convert_to_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_serializable(item) for item in obj]
    elif hasattr(obj, 'to_dict') and callable(getattr(obj, 'to_dict')):
        return convert_to_serializable(obj.to_dict())
    elif hasattr(obj, '__dict__'):
        return convert_to_serializable(obj.__dict__)
    else:
        try:
            json.dumps(obj)
            return obj
        except (TypeError, OverflowError):
            return str(obj)

Then update the main file to import and use it:

# In your main file
from utils.serialization import convert_to_serializable

# Remove the duplicated definition and any commented-out legacy code.

This refactoring centralizes the logic and removes redundant inline comments, reducing complexity while preserving functionality.

@@ -10,24 +10,27 @@

random.seed(123)

issue (complexity): Consider removing the unnecessary inline comments and commented-out default values.

Consider cleaning out the extra inline comments (# finish_reading) and commented-out defaults that don't contribute to functionality. This will reduce visual clutter without affecting behavior. For example, change this:

# finish_reading
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--instance_files",
        nargs="+",
        # default=["data/batch_221203/machine_generated_instances.jsonl"],
        default=["data/gpt3_generations/machine_generated_instances.jsonl"],
        type=str,
        help="The input files that contain the machine generated instances."
    )
    ...

to a cleaner version:

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--instance_files",
        nargs="+",
        default=["data/gpt3_generations/machine_generated_instances.jsonl"],
        type=str,
        help="The input files that contain the machine generated instances."
    )
    ...

Actionable steps:

  1. Remove redundant inline markers: Delete all occurrences of # finish_reading as they do not add to code functionality.
  2. Clean out commented-out code: Remove commented-out default values that are no longer in use. If you might need older defaults, rely on version control history.

This clean-up maintains all existing functionality while reducing cognitive overhead for future maintenance.
