From 61e02fc8a92de110e59a82ae873c4c0e69cda5db Mon Sep 17 00:00:00 2001 From: Rob Chartier Date: Fri, 20 Sep 2024 16:04:39 -0700 Subject: [PATCH] pdf, text working --- Dockerfile | 6 +- echonotes-prompt.md | 71 ----------- main.py | 299 ++++++++++++++++++++++++++++++++++---------- requirements.txt | 9 +- run.sh | 7 +- summarize-notes.md | 42 ++++--- 6 files changed, 271 insertions(+), 163 deletions(-) delete mode 100644 echonotes-prompt.md diff --git a/Dockerfile b/Dockerfile index d5f5e7b..a7de725 100644 --- a/Dockerfile +++ b/Dockerfile @@ -2,14 +2,12 @@ FROM python:3.10-slim # Install dependencies RUN apt-get update && apt-get install -y \ - tesseract-ocr \ - libtesseract-dev \ - poppler-utils && \ + tesseract-ocr libtesseract-dev poppler-utils && \ rm -rf /var/lib/apt/lists/* # Install Python packages COPY requirements.txt /app/requirements.txt -RUN pip install --no-cache-dir -r /app/requirements.txt +RUN pip install --no-cache-dir -r /app/requirements.txt # Copy app files COPY main.py /app/main.py diff --git a/echonotes-prompt.md b/echonotes-prompt.md deleted file mode 100644 index b096137..0000000 --- a/echonotes-prompt.md +++ /dev/null @@ -1,71 +0,0 @@ -Lets create an app together. Here are the requirements: - - -## Requirements - -### Application Overview -Our application, called "echonotes", will be used to monitor a folder for PDF files, extract the contents, and then send those contents along with a summarize prompt to a local ollama instance. Be sure to use our cool name "echonotes" in the source code, dockerfile, etc. - -1. **Python Application**: - - A Python application that runs in a Docker container. - - The application monitors a specific folder (mounted as a Docker volume) for new PDF files. - - When a new PDF is detected, the app extracts handwritten notes from the PDF using OCR (Tesseract). - - The extracted text is written back to the same folder with a filename derived from the original PDF. - - The contents of the markdown prompt file will be **prepended** (not appended) to the extracted text from the PDF before sending it in the API request. - - HTTP requests to the API will be made directly without using third-party libraries. - - The API response will be saved to disk with a filename derived from the PDF. - - The application will be fully functional offline (including using Tesseract). - - The path to monitor for new PDFs, this will need to be a hard coded path to a folder within the container at /app/incoming, but mounted as a volume from the caller. - - The path to the markdown prompt file, this will need to be a hard coded path to a file within the container at /app/summarize-notes.md, but mounted as a volume from the caller. - - -2. **Configuration**: - - A `config.yml` file will be used for configuration, passed as a volume to the Docker container. - - This configuration file will include: - - The API URL. - - A bearer token for authentication. - - The model to be used in the API call. - - The configuration variables can be overwritten by command-line arguments. - -3. **Exception Handling**: - - Extensive exception handling and management will be expected. - - Be sure to intelligently catch all common exceptions and deal with them accordingly, incluiding instructing the user on how to deal with the issue. - - Never let the application crash, ever. It should just log exceptions, errors, fatals, etc.. and keep running. Never crash. - -4. **Logging**: - - Extensive logging will be implemented in the Python script to track operations and errors. - -5. 
**Docker Setup**: - - The application, including its dependencies (Tesseract OCR), will be built and packaged into a single Docker image. - - The app will be fully deployable offline. - -6. **GitHub Workflow**: - - Create a GitHub Actions workflow to automate the building and pushing of the Docker image to DockerHub. - - The workflow should: - - Trigger on new commits to the main branch. - - Build the Docker image. - - Push the Docker image to DockerHub using the appropriate credentials (supplied via GitHub secrets). - -7. **`run.sh` Bash Script**: - - Develop a separate `run.sh` script to automate the building and execution of the Docker container. - - Accept the docker image name as an optional argument, but by default to the latest for the project. - - Use named arguments to avoid ambiguity - - The script should: - - Validate that the required arguments are passed (e.g., config path, prompt file, incoming folder). - - Provide usage information and fail with a helpful message if invalid or missing arguments are provided. - - Build the Docker image locally. - - Run the Docker container, mounting the appropriate volumes (e.g., PDF monitoring folder, config file). - -8. **Project README.md** - - Write a README file, in markdown suitable for github - - It will provide an overview of the project, in a moderate level of detail - - It must have a professional and excited tone - - It will include instructions as to how to use the project via docker (include sample code) - - It will also include instructions as to how to use docker compose (include sample code) - - For the docker compose sample, assume relative paths to the files and folders - - -9. **Ollama System Prompt** - - Write a file, "prompt.md", which is the default value for the project's "markdown prompt file" - - In this file, create a LLM prompt appropriate for summarizing hand written notes. - - The structure should follow the "Cornell Method". Research this method to find an optimal structure to follow. 
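The main.py changes below import `Document` from python-docx and route `.docx` files to an `extract_text_from_word()` helper, but the patch never defines that function. A minimal sketch of what it could look like, mirroring the `(text, output_filename)` contract of the other extractors; the `_extracted.md` output name is an assumption, not something this diff specifies:

```python
# Hypothetical helper, not part of this patch: the FileHandler below calls
# extract_text_from_word() for .docx files, so something like this would be needed.
import os
import logging

from docx import Document  # python-docx, already added to requirements.txt


def extract_text_from_word(docx_path):
    try:
        logging.info(f"Extracting text from Word document: {docx_path}")
        doc = Document(docx_path)
        text = "\n".join(paragraph.text for paragraph in doc.paragraphs)

        # Write the extracted text next to the source file; the "_extracted.md"
        # suffix is an assumption chosen to match convert_audio_to_text's naming style.
        base_filename = os.path.splitext(os.path.basename(docx_path))[0]
        output_filename = os.path.join(os.path.dirname(docx_path), f"{base_filename}_extracted.md")
        with open(output_filename, 'w') as output_file:
            output_file.write(text)

        logging.info(f"Extracted text written to {output_filename}")
        return text, output_filename
    except Exception as e:
        logging.error(f"Error extracting text from Word document {docx_path}: {e}")
        raise
```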
diff --git a/main.py b/main.py index d310cba..23a3c04 100644 --- a/main.py +++ b/main.py @@ -1,4 +1,3 @@ - import os import time import pytesseract @@ -6,10 +5,19 @@ from watchdog.events import FileSystemEventHandler from PyPDF2 import PdfReader from pdf2image import convert_from_path +from docx import Document import requests import logging import yaml import json +import shutil +import ffmpeg +import whisper + + +# Setup logging +logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') + # Load config def load_config(config_path="/app/config.yml"): @@ -23,8 +31,52 @@ def load_config(config_path="/app/config.yml"): logging.error(f"Error reading configuration file {config_path}: {e}") raise -# Setup logging -logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') + +# Send extracted text to local API for summarization +def send_to_api(api_url, bearer_token, model, content): + try: + headers = { + "Authorization": f"Bearer {bearer_token}", + "Content-Type": "application/json", + } + payload = { + "model": model, + "prompt": content, + "stream": False + } + + # Log the details of the request + logging.info(f"Sending request to API: {api_url}") + logging.info(f"Request headers: {headers}") + logging.info(f"Request payload: {payload}") + + # Make the POST request + response = requests.post(api_url, json=payload, headers=headers) + + # Ensure the status code is successful; raises error for 4xx or 5xx + response.raise_for_status() + + # Attempt to parse the response as JSON + try: + parsed_response = response.json() # Should return a dict + logging.info(f"Parsed Response content: {parsed_response}") + return parsed_response.get('response', 'No text found in response') + except ValueError: + logging.error(f"Failed to parse response as JSON: {response.text}") + return 'No valid JSON response' + + except requests.exceptions.HTTPError as http_err: + logging.error(f"HTTP error occurred: {http_err}") + raise + except requests.exceptions.ConnectionError: + logging.error("Failed to connect to the API. Please ensure the API server is running and accessible.") + raise + except requests.exceptions.Timeout: + logging.error("Request to the API timed out. Consider increasing the timeout duration.") + raise + except Exception as e: + logging.error(f"An error occurred while sending a request to the API: {e}") + raise # Helper function to extract text from PDF using OCR and write it back to the same folder def extract_text_from_pdf(pdf_path): @@ -61,7 +113,7 @@ def extract_text_from_pdf(pdf_path): logging.info(f"Extracted text written to {output_filename}") - return text + return text, output_filename except FileNotFoundError: logging.error(f"The file {pdf_path} does not exist. 
Please ensure the file is available.") raise @@ -70,7 +122,110 @@ def extract_text_from_pdf(pdf_path): raise +# Ensure "completed" and "working" directories exist +def ensure_folders(path_to_watch): + completed_folder = os.path.join(path_to_watch, "completed") + working_folder = os.path.join(path_to_watch, "working") + for folder in [completed_folder, working_folder]: + if not os.path.exists(folder): + os.makedirs(folder) + logging.info(f"Created folder at: {folder}") + return working_folder, completed_folder + + +# Move files to the "working" folder +def move_to_working(file_path, working_folder): + try: + file_dest = os.path.join(working_folder, os.path.basename(file_path)) + shutil.move(file_path, file_dest) + logging.info(f"Moved {file_path} to {file_dest}") + return file_dest + except Exception as e: + logging.error(f"Error moving file to working folder: {e}") + raise + + +# Move processed files to the "completed" folder +def move_to_completed(file_path, output_files, completed_folder): + try: + # Move output files first + for output_file in output_files: + if os.path.exists(output_file): # Ensure the file exists before moving + output_dest = os.path.join(completed_folder, os.path.basename(output_file)) + shutil.move(output_file, output_dest) + logging.info(f"Moved {output_file} to {output_dest}") + + # Now move the original file + if os.path.exists(file_path): # Ensure the file exists before moving + file_dest = os.path.join(completed_folder, os.path.basename(file_path)) + shutil.move(file_path, file_dest) + logging.info(f"Moved {file_path} to {file_dest}") + except Exception as e: + logging.error(f"Error moving files to completed folder: {e}") + raise + + +# Extract audio from video and save as MP3 +def extract_audio_from_video(video_path): + try: + logging.info(f"Extracting audio from video: {video_path}") + base_filename = os.path.splitext(os.path.basename(video_path))[0] + mp3_output = os.path.join(os.path.dirname(video_path), f"{base_filename}.mp3") + # Use ffmpeg to extract the audio and save it as an MP3 file + ffmpeg.input(video_path).output(mp3_output).run(overwrite_output=True) + logging.info(f"Audio extracted and saved to {mp3_output}") + return mp3_output + except Exception as e: + logging.error(f"Error extracting audio from video {video_path}: {e}") + raise + +# Read the entire content of a text file and return it with the filename +def extract_text_from_txt(file_name): + try: + with open(file_name, 'r') as file: + return file.read(), file_name + except FileNotFoundError: + logging.error(f"The file {file_name} was not found. Please ensure the file is available.") + raise + except Exception as e: + logging.error(f"An error occurred while reading the file {file_name}: {e}") + raise + + +# Convert MP3 to text using Whisper +def convert_audio_to_text(audio_path): + try: + logging.info(f"Converting audio to text using Whisper: {audio_path}") + model = whisper.load_model("base") + result = model.transcribe(audio_path) + + # Save the transcribed text to a markdown file + base_filename = os.path.splitext(os.path.basename(audio_path))[0] + output_filename = os.path.join(os.path.dirname(audio_path), f"{base_filename}_transcribed.md") + with open(output_filename, 'w') as output_file: + output_file.write(f"# Transcribed Audio\n\n{result['text']}") + + logging.info(f"Transcribed text saved to {output_filename}") + return result['text'], output_filename + except Exception as e: + logging.error(f"Error transcribing audio from {audio_path}: {e}") + raise + +# Properly format the API response to Markdown +def
format_markdown(api_response): + try: + formatted_markdown = "" + + # Double the line breaks so paragraphs render correctly in Markdown + if api_response: + formatted_markdown = api_response.replace('\n', '\n\n') # Double line break for markdown paragraphs + + return formatted_markdown + except Exception as e: + logging.error(f"Error formatting API response to Markdown: {e}") + return "" + # Prepend the markdown prompt file content def prepend_markdown_prompt(pdf_text, prompt_path): try: @@ -84,79 +239,84 @@ def extract_text_from_pdf(pdf_path): logging.error(f"Error reading markdown prompt file {prompt_path}: {e}") raise -# Send extracted text to local API for summarization -def send_to_api(api_url, bearer_token, model, content): - try: - headers = { - "Authorization": f"Bearer {bearer_token}", - "Content-Type": "application/json", - } - payload = { - "model": model, - "prompt": content, - "stream": False - } +# Event handler for newly created files +class FileHandler(FileSystemEventHandler): + def __init__(self, config, working_folder, completed_folder): + self.config = config + self.working_folder = working_folder + self.completed_folder = completed_folder - # Log the details of the request - logging.info(f"Sending request to API: {api_url}") - logging.info(f"Request headers: {json.dumps(headers, indent=2)}") - logging.info(f"Request payload: {json.dumps(payload, indent=2)}") + def on_created(self, event): + try: + # Move the file to the working folder before processing + working_file_path = move_to_working(event.src_path, self.working_folder) + + if working_file_path.endswith(".pdf"): + logging.info(f"Processing PDF: {working_file_path}") + text, extracted_text_file = extract_text_from_pdf(working_file_path) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [extracted_text_file, output_filename], self.completed_folder) + elif working_file_path.endswith(".docx"): + logging.info(f"Processing Word document: {working_file_path}") + text, extracted_text_file = extract_text_from_word(working_file_path) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [extracted_text_file, output_filename], self.completed_folder) - response = requests.post(api_url, headers=headers, data=json.dumps(payload)) + elif working_file_path.endswith(".txt"): + logging.info(f"Processing text file: {working_file_path}") + text, extracted_text_file = extract_text_from_txt(working_file_path) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [extracted_text_file, output_filename], self.completed_folder) + elif working_file_path.endswith((".mp4", ".avi", ".mov", ".mkv")): +
logging.info(f"Processing video file: {working_file_path}") + mp3_file = extract_audio_from_video(working_file_path) + text, extracted_text_file = convert_audio_to_text(mp3_file) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [mp3_file, extracted_text_file, output_filename], self.completed_folder) - # Log the details of the response - logging.info(f"Response status code: {response.status_code}") - logging.info(f"Response headers: {json.dumps(dict(response.headers), indent=2)}") - logging.info(f"Response content: {response.text}") + elif working_file_path.endswith(".mp3"): + logging.info(f"Processing MP3 file: {working_file_path}") + text, extracted_text_file = convert_audio_to_text(working_file_path) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [extracted_text_file, output_filename], self.completed_folder) - response.raise_for_status() # Will raise an error for HTTP codes 4xx or 5xx - parsed_response = response.json() - logging.info(f"Parsed Response content: {parsed_response}") - return parsed_response.get('response', 'No text found in response') - except requests.exceptions.HTTPError as http_err: - logging.error(f"HTTP error occurred: {http_err}") - logging.error("Please check the API URL, bearer token, and model in the configuration.") - raise - except requests.exceptions.ConnectionError: - logging.error("Failed to connect to the API. Please ensure the API server is running and accessible.") - raise - except requests.exceptions.Timeout: - logging.error("Request to the API timed out. 
Consider increasing the timeout duration.") - raise - except Exception as e: - logging.error(f"An error occurred while sending a request to the API: {e}") - raise + except Exception as e: + logging.error(f"Error processing {event.src_path}: {e}") -# Event handler for new PDFs -class PDFHandler(FileSystemEventHandler): - def __init__(self, config): - self.config = config - def on_created(self, event): - if event.src_path.endswith(".pdf"): - logging.info(f"New PDF detected: {event.src_path}") - try: - extracted_text = extract_text_from_pdf(event.src_path) - full_text = prepend_markdown_prompt(extracted_text, "/app/summarize-notes.md") - - logging.info(f"Full Text to send to our API:{full_text}") - api_response = send_to_api( - self.config['api_url'], - self.config['bearer_token'], - self.config['model'], - full_text - ) - output_filename = f"{event.src_path}.summary.txt" - with open(output_filename, 'w') as f: - f.write(api_response.get("summary", "No summary provided")) - logging.info(f"Summary written to {output_filename}") - except Exception as e: - logging.error(f"Error processing {event.src_path}: {e}") +def show_ascii_art(): + ascii_art = """ + _ _ |_ _ __ _ _|_ _ _ +(/_(_ | |(_)| |(_) |_(/__> + """ + logging.info(ascii_art) if __name__ == "__main__": try: + show_ascii_art() + # Load configuration config = load_config() @@ -166,7 +326,10 @@ def on_created(self, event): logging.error(f"Directory {path_to_watch} does not exist. Please ensure the folder is mounted.") raise FileNotFoundError(f"Directory {path_to_watch} not found") - event_handler = PDFHandler(config) + # Ensure the "working" and "completed" folders exist + working_folder, completed_folder = ensure_folders(path_to_watch) + + event_handler = FileHandler(config, working_folder, completed_folder) observer = Observer() observer.schedule(event_handler, path=path_to_watch, recursive=False) observer.start() @@ -179,4 +342,4 @@ def on_created(self, event): observer.join() except Exception as e: - logging.critical(f"Application failed to start: {e}") + logging.critical(f"Application failed to start: {e}") \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 2442e77..b81a7a2 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,6 +1,9 @@ +pytesseract==0.3.10 PyPDF2==3.0.0 pdf2image==1.16.3 -pytesseract==0.3.10 -watchdog==2.1.9 requests==2.28.1 -pyyaml==6.0 +watchdog==2.1.9 +PyYAML==6.0 +python-docx==0.8.11 +ffmpeg-python==0.2.0 +whisper==1.0 diff --git a/run.sh b/run.sh index efc8ada..500273e 100755 --- a/run.sh +++ b/run.sh @@ -65,13 +65,16 @@ done # Validate required arguments validate_args +# clean up past runs + sudo rm -rf incoming/completed incoming/working incoming/* -R + # Build the Docker image echo "Building Docker image $IMAGE_NAME..." -docker build -t "$IMAGE_NAME" . +docker build --no-cache -t "$IMAGE_NAME" . # Run the Docker container echo "Running Docker container..." -docker run -v "$INCOMING_FOLDER:/app/incoming" \ +docker run --rm -v "$INCOMING_FOLDER:/app/incoming" \ -v "$CONFIG_FILE:/app/config.yml" \ -v "$PROMPT_FILE:/app/summarize-notes.md" \ "$IMAGE_NAME" diff --git a/summarize-notes.md b/summarize-notes.md index 38f43e8..975968e 100644 --- a/summarize-notes.md +++ b/summarize-notes.md @@ -1,23 +1,35 @@ -## Summarize the following handwritten notes using the Cornell Note-Taking Method: +Summarize the following meeting transcript, ensuring the output is structured in Markdown. Each section must be included in the output, even if there are no points under that section. 
-The Cornell Method structures notes into three sections: +## STEPS +- Fully digest the content provided. +- Extract all action items agreed within the meeting and owners. +- Extract any interesting ideas brought up in the meeting. -1. **Notes Section**: Summarize the main content and details, which can include keywords, concepts, explanations, and diagrams from the notes. -2. **Cue Section**: Generate concise questions or keywords that correspond to the main ideas, concepts, or key terms from the notes. -3. **Summary Section**: Write a brief summary of the entire set of notes, capturing the overall message and key takeaways. -Please follow this format when summarizing the handwritten notes below: +## The summary should include: ---- +1. **Context:** Briefly describe the meeting's purpose and key participants. +2. **Main Ideas:** Identify the core topics discussed and organize them by theme. Write a 15-word sentence that captures what's recommended for people to do based on each idea discussed. +3. **Decisions:** List any significant decisions made during the meeting. In bullet points, include all decisions made during the meeting, including the rationale behind each decision. +4. **Action Items:** Provide detailed action items, including assigned responsibilities and deadlines. Write bullet points for ALL agreed actionable details. This includes any case where a speaker agrees to do, or look into, something. If there is a deadline mentioned, include it here. +5. **Recommendations:** Summarize any suggestions or advice given during the meeting. +6. **Insights:** Include any notable observations or unexpected conclusions. +7. **Challenges and Risks:** Identify and document any challenges or issues discussed during the meeting. Note any potential solutions or strategies proposed to address these challenges. +8. **Next Steps:** Outline the next steps and action plan to be taken after the meeting. +9. **Minutes:** 20 to 50 bullet points tracking the conversation and highlighting the most surprising, insightful, and/or interesting ideas that come up. If there are fewer than 50, collect all of them. Make sure you extract at least 20. -### Notes Section: -Provide a detailed but organized list of all the key information and supporting details found in the handwritten notes. -### Cue Section: -List out key questions or prompts based on the main concepts and ideas from the handwritten notes. These should help the reader recall the critical points covered. -### Summary Section: -Summarize the key ideas in a concise paragraph that captures the essence of the notes, including major points and overarching themes. +## Output Instructions: +- The summary must be in Markdown. +- Each section header (e.g., Context, Main Ideas) must be present even if the section is empty. +- Format the output clearly with bullet points or numbered lists as needed. +- Do not give warnings or notes; only output the requested sections. +- Do not repeat ideas, quotes, facts, or resources. +- Do not start items with the same opening words. +- Ensure you follow ALL these instructions when creating your output. ---- -INPUT: + +# INPUT: + +
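Since the prompt above requires every section header to appear even when a section is empty, a small check like the following (illustrative only, not part of this patch) could flag summaries that drop a mandatory section before they land in the completed folder:

```python
# Illustrative sketch, not part of this patch: verifies that a generated summary
# contains every section header the summarize-notes.md prompt declares mandatory.
REQUIRED_SECTIONS = [
    "Context", "Main Ideas", "Decisions", "Action Items", "Recommendations",
    "Insights", "Challenges and Risks", "Next Steps", "Minutes",
]


def missing_sections(summary_markdown: str) -> list:
    """Return the required section names that do not appear in the summary text."""
    return [name for name in REQUIRED_SECTIONS if name not in summary_markdown]


# Example usage against a file produced by FileHandler, e.g. notes.pdf.summary.md:
#   gaps = missing_sections(open("notes.pdf.summary.md").read())
#   if gaps:
#       logging.warning(f"Summary is missing sections: {gaps}")
```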