
Update self instruct #24

Open
WenlongHuangResearch wants to merge 1 commit into yizhongw:main from WenlongHuangResearch:wenlonghuangresearch

Conversation

@WenlongHuangResearch WenlongHuangResearch commented May 5, 2025

Summary by Sourcery

Update the Self-Instruct project for compatibility with the latest OpenAI API: adapt the code to the API changes, update dependencies, and improve the task generation and fine-tuning data preparation workflow.

Enhancements:

  • Updated code to work with the new OpenAI API client
  • Modified API request handling to support the latest OpenAI library
  • Improved error handling and serialization of API responses

Documentation:

  • Completely rewrote the README with a detailed explanation of the task generation and fine-tuning process
  • Added comprehensive documentation for each step of the workflow

Chores:

  • Updated default model from 'davinci' to 'davinci-002'
  • Modified function signatures to match new OpenAI API client
  • Updated file paths and default parameters
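The engine-to-model rename and client switch listed above follow the v1 openai client pattern. A minimal sketch of the before/after call shape (the wrapper function below is hypothetical, not code from this PR):

```python
# Sketch of the engine -> model migration described above. The wrapper
# function name and defaults are illustrative assumptions, not the PR's code.

def build_request_kwargs(prompt, model="davinci-002",
                         temperature=0.7, top_p=0.5, max_tokens=1024):
    # The v1 client takes model= where the legacy API took engine=.
    return {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }

# Legacy (openai < 1.0):
#   response = openai.Completion.create(engine="davinci", prompt=prompt, ...)
#   text = response["choices"][0]["text"]
# Updated (openai >= 1.0):
#   client = OpenAI(api_key=..., organization=...)
#   response = client.completions.create(**build_request_kwargs(prompt))
#   text = response.choices[0].text

kwargs = build_request_kwargs("Write a task instruction:")
```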

sourcery-ai bot commented May 5, 2025

Reviewer's Guide

This pull request updates the codebase to be compatible with recent OpenAI API changes, modifies default parameters and paths, and adds a detailed README explaining the updated self-instruct workflow.

Sequence Diagram for Updated OpenAI API Call

sequenceDiagram
    participant Script as Calling Script
    participant API as gpt3_api.make_requests
    participant Client as OpenAI Client

    Script->>API: Call make_requests(...)
    API->>Client: Initialize OpenAI(api_key, organization)
    API->>Client: client.completions.create(model="davinci-002", ...)
    Client-->>API: Return Response object (response)
    API->>API: Process response.choices
    API-->>Script: Return formatted results
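The attribute-style response handling shown in the diagram can be illustrated with simplified stand-in types; these mock classes are illustrative only, not the real openai response models:

```python
# Simplified stand-in types showing why response parsing changed: the v1
# client returns typed objects with attributes, not plain dicts. These mock
# classes are assumptions for illustration, not the real openai models.
from dataclasses import dataclass
from typing import List

@dataclass
class Choice:
    text: str
    finish_reason: str

@dataclass
class Completion:
    choices: List[Choice]

response = Completion(choices=[Choice(text="42", finish_reason="stop")])

# Old dict-style access (response["choices"][0]["text"]) raises TypeError
# on the new object; attribute access is required instead:
text = response.choices[0].text
```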

Class Diagram for Module Updates

classDiagram
    namespace gpt3_api {
        class make_requests {
            +make_requests(..., model, temperature, top_p, ...)
            # Uses OpenAI Client
        }
    }
    namespace bootstrap_instructions {
        class bootstrap_instructions.py {
            +convert_to_serializable(obj)
            +main()
        }
    }
    namespace generate_instances {
        class generate_instances.py {
            +convert_to_serializable(obj)
            +main()
        }
    }
    class OpenAIClient {
        <<External>>
        +completions.create(...)
        +choices
        +text
        +finish_reason
    }
    make_requests ..> OpenAIClient : uses
    note for make_requests "Handles API calls using the updated OpenAI client library.\nParameter 'engine' changed to 'model'.\nResponse accessed via response.choices[...].text etc."
    note for bootstrap_instructions.py "Added helper to serialize API response metadata."
    note for generate_instances.py "Added helper to serialize API response metadata."

File-Level Changes

Updated OpenAI API interactions.
  • Replaced openai.Completion.create with client.completions.create.
  • Updated response object parsing (e.g., response["choices"][0]["text"] to response[0].text).
  • Changed the default engine/model from "davinci" to "davinci-002".
  • Handled potential JSON serialization issues in API responses with a new helper function.
Files: self_instruct/gpt3_api.py, self_instruct/bootstrap_instructions.py, self_instruct/generate_instances.py, self_instruct/prepare_for_finetuning.py, self_instruct/identify_clf_or_not.py
Updated documentation and configuration.
  • Added a detailed README explaining the 4-step workflow (task generation, classification, instance generation, fine-tuning preparation).
  • Updated default file paths and directory names (e.g., using data/gpt3_generations/).
  • Made some script arguments optional by removing required=True.
  • Adjusted the default number of instructions to generate in bootstrap_instructions.py.
Files: README.md, self_instruct/bootstrap_instructions.py, self_instruct/generate_instances.py, self_instruct/prepare_for_finetuning.py, self_instruct/identify_clf_or_not.py
Added new files for testing and output.
  • Added two Python scripts (test.py, test2.py), likely for experimentation or testing.
  • Added empty .jsonl files corresponding to the new default output paths.
Files: self_instruct/test.py, self_instruct/test2.py, data/gpt3_generations/all_generated_instances.jsonl, data/gpt3_generations/gpt3_finetuning_data_243.jsonl, data/gpt3_generations/is_clf_or_not_davinci-002_template_1.jsonl, data/gpt3_generations/machine_generated_instances.jsonl, data/gpt3_generations/machine_generated_instructions.jsonl
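The "removing required=True" change can be sketched with argparse; the argument name and default path mirror the excerpt quoted in the review comments, but this is a minimal sketch, not the PR's full parser:

```python
# Minimal sketch of making a previously required argument optional by
# giving it a default under the new data/gpt3_generations/ layout.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--instance_files",
        nargs="+",
        default=["data/gpt3_generations/machine_generated_instances.jsonl"],
        type=str,
        help="The input files that contain the machine generated instances.",
    )
    return parser.parse_args(argv)

args = parse_args([])  # no flags supplied; the default path applies
```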

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

sourcery-ai bot left a comment

Hey @WenlongHuangResearch - I've reviewed your changes - here's some feedback:

  • Consider moving the duplicated convert_to_serializable function to a shared utility module.
  • The new files test.py and test2.py seem like temporary development scripts that should be removed.
  • Ensure code comments align with the project's primary language; some new comments are in Chinese.
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 3 issues found
  • 🟢 Documentation: all looks good


return parser.parse_args()


# Add a helper function at the top of the file to handle non-serializable objects

issue (complexity): Consider extracting the convert_to_serializable helper function into a shared utility module to reduce code duplication.

Extract the helper into a shared utility module to avoid duplication and reduce the overall complexity. For example, create a file like utils/serialization.py:

# utils/serialization.py
import json

def convert_to_serializable(obj):
    if isinstance(obj, dict):
        return {k: convert_to_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_to_serializable(item) for item in obj]
    if hasattr(obj, 'to_dict') and callable(getattr(obj, 'to_dict')):
        return convert_to_serializable(obj.to_dict())
    if hasattr(obj, '__dict__'):
        return convert_to_serializable(vars(obj))
    try:
        json.dumps(obj)
        return obj
    except (TypeError, OverflowError):
        return str(obj)

Then in both files (including the current one and generate_instances.py), import and use it:

from utils.serialization import convert_to_serializable

# Use convert_to_serializable as before

This consolidates its definition, reduces code duplication, and makes the overall codebase easier to maintain.
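As a quick sanity check, the helper can be exercised end-to-end; the function is reproduced here so the example runs on its own, and the Meta class is a hypothetical stand-in for an API response object:

```python
# Self-contained demo of the convert_to_serializable helper proposed above.
# Meta is a hypothetical stand-in for a non-serializable response object.
import json

def convert_to_serializable(obj):
    if isinstance(obj, dict):
        return {k: convert_to_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_to_serializable(item) for item in obj]
    if hasattr(obj, "to_dict") and callable(obj.to_dict):
        return convert_to_serializable(obj.to_dict())
    if hasattr(obj, "__dict__"):
        return convert_to_serializable(vars(obj))
    try:
        json.dumps(obj)
        return obj
    except (TypeError, OverflowError):
        return str(obj)

class Meta:
    def __init__(self):
        self.model = "davinci-002"
        self.usage = {"total_tokens": 10}

# json.dumps(Meta()) would raise TypeError; after conversion it is JSON-safe.
serialized = convert_to_serializable(Meta())
print(json.dumps(serialized))
```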

return parser.parse_args()


def convert_to_serializable(obj):

issue (complexity): Consider extracting the convert_to_serializable function into a utility module and removing commented-out code to reduce complexity.

Consider extracting the duplicated convert_to_serializable function into a centralized utility module and removing legacy commented-out code. This will reduce maintenance overhead and simplify the file.

For example, create a new module (e.g., utils/serialization.py):

# utils/serialization.py
import json

def convert_to_serializable(obj):
    """Convert an object to a JSON-serializable format."""
    if isinstance(obj, dict):
        return {k: convert_to_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [convert_to_serializable(item) for item in obj]
    elif hasattr(obj, 'to_dict') and callable(getattr(obj, 'to_dict')):
        return convert_to_serializable(obj.to_dict())
    elif hasattr(obj, '__dict__'):
        return convert_to_serializable(obj.__dict__)
    else:
        try:
            json.dumps(obj)
            return obj
        except (TypeError, OverflowError):
            return str(obj)

Then update the main file to import and use it:

# In your main file
from utils.serialization import convert_to_serializable

# Remove the duplicated definition and any commented-out legacy code.

This refactoring centralizes the logic and removes redundant inline comments, reducing complexity while preserving functionality.

@@ -10,24 +10,27 @@

random.seed(123)

issue (complexity): Consider removing the unnecessary inline comments and commented-out default values.

Consider cleaning out the extra inline comments (# finish_reading) and commented-out defaults that don't contribute to functionality. This will reduce visual clutter without affecting behavior. For example, change this:

# finish_reading
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--instance_files",
        nargs="+",
        # default=["data/batch_221203/machine_generated_instances.jsonl"],
        default=["data/gpt3_generations/machine_generated_instances.jsonl"],
        type=str,
        help="The input files that contain the machine generated instances."
    )
    ...

to a cleaner version:

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--instance_files",
        nargs="+",
        default=["data/gpt3_generations/machine_generated_instances.jsonl"],
        type=str,
        help="The input files that contain the machine generated instances."
    )
    ...

Actionable steps:

  1. Remove redundant inline markers: Delete all occurrences of # finish_reading as they do not add to code functionality.
  2. Clean out commented-out code: Remove commented-out default values that are no longer in use. If you might need older defaults, rely on version control history.

This clean-up maintains all existing functionality while reducing cognitive overhead for future maintenance.
