From 61e02fc8a92de110e59a82ae873c4c0e69cda5db Mon Sep 17 00:00:00 2001 From: Rob Chartier Date: Fri, 20 Sep 2024 16:04:39 -0700 Subject: [PATCH] pdf, text working --- Dockerfile | 6 +- echonotes-prompt.md | 71 ----------- main.py | 299 ++++++++++++++++++++++++++++++++++---------- requirements.txt | 9 +- run.sh | 7 +- summarize-notes.md | 42 ++++--- 6 files changed, 271 insertions(+), 163 deletions(-) delete mode 100644 echonotes-prompt.md diff --git a/Dockerfile b/Dockerfile index d5f5e7b..a7de725 100644 --- a/Dockerfile +++ b/Dockerfile @@ -2,14 +2,12 @@ FROM python:3.10-slim # Install dependencies RUN apt-get update && apt-get install -y \ - tesseract-ocr \ - libtesseract-dev \ - poppler-utils && \ + tesseract-ocr libtesseract-dev poppler-utils && \ rm -rf /var/lib/apt/lists/* # Install Python packages COPY requirements.txt /app/requirements.txt -RUN pip install --no-cache-dir -r /app/requirements.txt +RUN pip install --no-cache-dir -r /app/requirements.txt # Copy app files COPY main.py /app/main.py diff --git a/echonotes-prompt.md b/echonotes-prompt.md deleted file mode 100644 index b096137..0000000 --- a/echonotes-prompt.md +++ /dev/null @@ -1,71 +0,0 @@ -Lets create an app together. Here are the requirements: - - -## Requirements - -### Application Overview -Our application, called "echonotes", will be used to monitor a folder for PDF files, extract the contents, and then send those contents along with a summarize prompt to a local ollama instance. Be sure to use our cool name "echonotes" in the source code, dockerfile, etc. - -1. **Python Application**: - - A Python application that runs in a Docker container. - - The application monitors a specific folder (mounted as a Docker volume) for new PDF files. - - When a new PDF is detected, the app extracts handwritten notes from the PDF using OCR (Tesseract). - - The extracted text is written back to the same folder with a filename derived from the original PDF. - - The contents of the markdown prompt file will be **prepended** (not appended) to the extracted text from the PDF before sending it in the API request. - - HTTP requests to the API will be made directly without using third-party libraries. - - The API response will be saved to disk with a filename derived from the PDF. - - The application will be fully functional offline (including using Tesseract). - - The path to monitor for new PDFs, this will need to be a hard coded path to a folder within the container at /app/incoming, but mounted as a volume from the caller. - - The path to the markdown prompt file, this will need to be a hard coded path to a file within the container at /app/summarize-notes.md, but mounted as a volume from the caller. - - -2. **Configuration**: - - A `config.yml` file will be used for configuration, passed as a volume to the Docker container. - - This configuration file will include: - - The API URL. - - A bearer token for authentication. - - The model to be used in the API call. - - The configuration variables can be overwritten by command-line arguments. - -3. **Exception Handling**: - - Extensive exception handling and management will be expected. - - Be sure to intelligently catch all common exceptions and deal with them accordingly, incluiding instructing the user on how to deal with the issue. - - Never let the application crash, ever. It should just log exceptions, errors, fatals, etc.. and keep running. Never crash. - -4. **Logging**: - - Extensive logging will be implemented in the Python script to track operations and errors. - -5. 
**Docker Setup**: - - The application, including its dependencies (Tesseract OCR), will be built and packaged into a single Docker image. - - The app will be fully deployable offline. - -6. **GitHub Workflow**: - - Create a GitHub Actions workflow to automate the building and pushing of the Docker image to DockerHub. - - The workflow should: - - Trigger on new commits to the main branch. - - Build the Docker image. - - Push the Docker image to DockerHub using the appropriate credentials (supplied via GitHub secrets). - -7. **`run.sh` Bash Script**: - - Develop a separate `run.sh` script to automate the building and execution of the Docker container. - - Accept the docker image name as an optional argument, but by default to the latest for the project. - - Use named arguments to avoid ambiguity - - The script should: - - Validate that the required arguments are passed (e.g., config path, prompt file, incoming folder). - - Provide usage information and fail with a helpful message if invalid or missing arguments are provided. - - Build the Docker image locally. - - Run the Docker container, mounting the appropriate volumes (e.g., PDF monitoring folder, config file). - -8. **Project README.md** - - Write a README file, in markdown suitable for github - - It will provide an overview of the project, in a moderate level of detail - - It must have a professional and excited tone - - It will include instructions as to how to use the project via docker (include sample code) - - It will also include instructions as to how to use docker compose (include sample code) - - For the docker compose sample, assume relative paths to the files and folders - - -9. **Ollama System Prompt** - - Write a file, "prompt.md", which is the default value for the project's "markdown prompt file" - - In this file, create a LLM prompt appropriate for summarizing hand written notes. - - The structure should follow the "Cornell Method". Research this method to find an optimal structure to follow. 
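The main.py changes below import `Document` from python-docx and route `.docx` files to an `extract_text_from_word()` helper, but the patch never defines that function. A minimal sketch of what it could look like, mirroring the `(text, output_filename)` contract of the other extractors; the `_extracted.md` output name is an assumption, not something this diff specifies:

```python
# Hypothetical helper, not part of this patch: the FileHandler below calls
# extract_text_from_word() for .docx files, so something like this would be needed.
import os
import logging

from docx import Document  # python-docx, already added to requirements.txt


def extract_text_from_word(docx_path):
    try:
        logging.info(f"Extracting text from Word document: {docx_path}")
        doc = Document(docx_path)
        text = "\n".join(paragraph.text for paragraph in doc.paragraphs)

        # Write the extracted text next to the source file; the "_extracted.md"
        # suffix is an assumption chosen to match convert_audio_to_text's naming style.
        base_filename = os.path.splitext(os.path.basename(docx_path))[0]
        output_filename = os.path.join(os.path.dirname(docx_path), f"{base_filename}_extracted.md")
        with open(output_filename, 'w') as output_file:
            output_file.write(text)

        logging.info(f"Extracted text written to {output_filename}")
        return text, output_filename
    except Exception as e:
        logging.error(f"Error extracting text from Word document {docx_path}: {e}")
        raise
```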
diff --git a/main.py b/main.py index d310cba..23a3c04 100644 --- a/main.py +++ b/main.py @@ -1,4 +1,3 @@ - import os import time import pytesseract @@ -6,10 +5,19 @@ from watchdog.events import FileSystemEventHandler from PyPDF2 import PdfReader from pdf2image import convert_from_path +from docx import Document import requests import logging import yaml import json +import shutil +import ffmpeg +import whisper + + +# Setup logging +logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') + # Load config def load_config(config_path="/app/config.yml"): @@ -23,8 +31,52 @@ def load_config(config_path="/app/config.yml"): logging.error(f"Error reading configuration file {config_path}: {e}") raise -# Setup logging -logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s') + +# Send extracted text to local API for summarization +def send_to_api(api_url, bearer_token, model, content): + try: + headers = { + "Authorization": f"Bearer {bearer_token}", + "Content-Type": "application/json", + } + payload = { + "model": model, + "prompt": content, + "stream": False + } + + # Log the details of the request + logging.info(f"Sending request to API: {api_url}") + logging.info(f"Request headers: {headers}") + logging.info(f"Request payload: {payload}") + + # Make the POST request + response = requests.post(api_url, json=payload, headers=headers) + + # Ensure the status code is successful; raises error for 4xx or 5xx + response.raise_for_status() + + # Attempt to parse the response as JSON + try: + parsed_response = response.json() # Should return a dict + logging.info(f"Parsed Response content: {parsed_response}") + return parsed_response.get('response', 'No text found in response') + except ValueError: + logging.error(f"Failed to parse response as JSON: {response.text}") + return 'No valid JSON response' + + except requests.exceptions.HTTPError as http_err: + logging.error(f"HTTP error occurred: {http_err}") + raise + except requests.exceptions.ConnectionError: + logging.error("Failed to connect to the API. Please ensure the API server is running and accessible.") + raise + except requests.exceptions.Timeout: + logging.error("Request to the API timed out. Consider increasing the timeout duration.") + raise + except Exception as e: + logging.error(f"An error occurred while sending a request to the API: {e}") + raise # Helper function to extract text from PDF using OCR and write it back to the same folder def extract_text_from_pdf(pdf_path): @@ -61,7 +113,7 @@ def extract_text_from_pdf(pdf_path): logging.info(f"Extracted text written to {output_filename}") - return text + return text, output_filename except FileNotFoundError: logging.error(f"The file {pdf_path} does not exist. 
Please ensure the file is available.") raise @@ -70,7 +122,110 @@ def extract_text_from_pdf(pdf_path): raise +# Ensure "completed" and "working" directories exist +def ensure_folders(path_to_watch): + completed_folder = os.path.join(path_to_watch, "completed") + working_folder = os.path.join(path_to_watch, "working") + for folder in [completed_folder, working_folder]: + if not os.path.exists(folder): + os.makedirs(folder) + logging.info(f"Created folder at: {folder}") + return working_folder, completed_folder + + +# Move files to the "working" folder +def move_to_working(file_path, working_folder): + try: + file_dest = os.path.join(working_folder, os.path.basename(file_path)) + shutil.move(file_path, file_dest) + logging.info(f"Moved {file_path} to {file_dest}") + return file_dest + except Exception as e: + logging.error(f"Error moving file to working folder: {e}") + raise + + +# Move processed files to the "completed" folder +def move_to_completed(file_path, output_files, completed_folder): + try: + # Move output files first + for output_file in output_files: + if os.path.exists(output_file): # Ensure the file exists before moving + output_dest = os.path.join(completed_folder, os.path.basename(output_file)) + shutil.move(output_file, output_dest) + logging.info(f"Moved {output_file} to {output_dest}") + + # Now move the original file + if os.path.exists(file_path): # Ensure the file exists before moving + file_dest = os.path.join(completed_folder, os.path.basename(file_path)) + shutil.move(file_path, file_dest) + logging.info(f"Moved {file_path} to {file_dest}") + except Exception as e: + logging.error(f"Error moving files to completed folder: {e}") + raise + + +# Extract audio from video and save as MP3 +def extract_audio_from_video(video_path): + try: + logging.info(f"Extracting audio from video: {video_path}") + base_filename = os.path.splitext(os.path.basename(video_path))[0] + mp3_output = os.path.join(os.path.dirname(video_path), f"{base_filename}.mp3") + # Use ffmpeg to extract the audio and save it as an MP3 file + ffmpeg.input(video_path).output(mp3_output).run(overwrite_output=True) + logging.info(f"Audio extracted and saved to {mp3_output}") + return mp3_output + except Exception as e: + logging.error(f"Error extracting audio from video {video_path}: {e}") + raise + +# Read the entire content of a text file and return it with the filename +def extract_text_from_txt(file_name): + try: + with open(file_name, 'r') as file: + return file.read(), file_name + except FileNotFoundError: + logging.error(f"The file {file_name} was not found. Please ensure the file is available.") + raise + except Exception as e: + logging.error(f"An error occurred while reading the file {file_name}: {e}") + raise + + +# Convert MP3 to text using Whisper +def convert_audio_to_text(audio_path): + try: + logging.info(f"Converting audio to text using Whisper: {audio_path}") + model = whisper.load_model("base") + result = model.transcribe(audio_path) + + # Save the transcribed text to a markdown file + base_filename = os.path.splitext(os.path.basename(audio_path))[0] + output_filename = os.path.join(os.path.dirname(audio_path), f"{base_filename}_transcribed.md") + with open(output_filename, 'w') as output_file: + output_file.write(f"# Transcribed Audio\n\n{result['text']}") + + logging.info(f"Transcribed text saved to {output_filename}") + return result['text'], output_filename + except Exception as e: + logging.error(f"Error transcribing audio from {audio_path}: {e}") + raise + +# Properly format the API response to Markdown +def
format_markdown(api_response): + try: + formatted_markdown = "" + + # Double the line breaks so paragraphs render correctly in Markdown + if api_response: + formatted_markdown = api_response.replace('\n', '\n\n') # Double line break for markdown paragraphs + + return formatted_markdown + except Exception as e: + logging.error(f"Error formatting API response to Markdown: {e}") + return "" + # Prepend the markdown prompt file content def prepend_markdown_prompt(pdf_text, prompt_path): try: @@ -84,79 +239,84 @@ def extract_text_from_pdf(pdf_path): logging.error(f"Error reading markdown prompt file {prompt_path}: {e}") raise -# Send extracted text to local API for summarization -def send_to_api(api_url, bearer_token, model, content): - try: - headers = { - "Authorization": f"Bearer {bearer_token}", - "Content-Type": "application/json", - } - payload = { - "model": model, - "prompt": content, - "stream": False - } +# Event handler for newly created files +class FileHandler(FileSystemEventHandler): + def __init__(self, config, working_folder, completed_folder): + self.config = config + self.working_folder = working_folder + self.completed_folder = completed_folder - # Log the details of the request - logging.info(f"Sending request to API: {api_url}") - logging.info(f"Request headers: {json.dumps(headers, indent=2)}") - logging.info(f"Request payload: {json.dumps(payload, indent=2)}") + def on_created(self, event): + try: + # Move the file to the working folder before processing + working_file_path = move_to_working(event.src_path, self.working_folder) + + if working_file_path.endswith(".pdf"): + logging.info(f"Processing PDF: {working_file_path}") + text, extracted_text_file = extract_text_from_pdf(working_file_path) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [extracted_text_file, output_filename], self.completed_folder) + elif working_file_path.endswith(".docx"): + logging.info(f"Processing Word document: {working_file_path}") + text, extracted_text_file = extract_text_from_word(working_file_path) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [extracted_text_file, output_filename], self.completed_folder) - response = requests.post(api_url, headers=headers, data=json.dumps(payload)) + elif working_file_path.endswith(".txt"): + logging.info(f"Processing text file: {working_file_path}") + text, extracted_text_file = extract_text_from_txt(working_file_path) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [extracted_text_file, output_filename], self.completed_folder) + elif working_file_path.endswith((".mp4", ".avi", ".mov", ".mkv")): +
logging.info(f"Processing video file: {working_file_path}") + mp3_file = extract_audio_from_video(working_file_path) + text, extracted_text_file = convert_audio_to_text(mp3_file) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [mp3_file, extracted_text_file, output_filename], self.completed_folder) - # Log the details of the response - logging.info(f"Response status code: {response.status_code}") - logging.info(f"Response headers: {json.dumps(dict(response.headers), indent=2)}") - logging.info(f"Response content: {response.text}") + elif working_file_path.endswith(".mp3"): + logging.info(f"Processing MP3 file: {working_file_path}") + text, extracted_text_file = convert_audio_to_text(working_file_path) + full_text = prepend_markdown_prompt(text, "/app/summarize-notes.md") + api_response = send_to_api(self.config['api_url'], self.config['bearer_token'], self.config['model'], full_text) + output_filename = f"{working_file_path}.summary.md" + with open(output_filename, 'w') as f: + f.write(format_markdown(api_response)) + move_to_completed(working_file_path, [extracted_text_file, output_filename], self.completed_folder) - response.raise_for_status() # Will raise an error for HTTP codes 4xx or 5xx - parsed_response = response.json() - logging.info(f"Parsed Response content: {parsed_response}") - return parsed_response.get('response', 'No text found in response') - except requests.exceptions.HTTPError as http_err: - logging.error(f"HTTP error occurred: {http_err}") - logging.error("Please check the API URL, bearer token, and model in the configuration.") - raise - except requests.exceptions.ConnectionError: - logging.error("Failed to connect to the API. Please ensure the API server is running and accessible.") - raise - except requests.exceptions.Timeout: - logging.error("Request to the API timed out. 
Consider increasing the timeout duration.") - raise - except Exception as e: - logging.error(f"An error occurred while sending a request to the API: {e}") - raise + except Exception as e: + logging.error(f"Error processing {event.src_path}: {e}") -# Event handler for new PDFs -class PDFHandler(FileSystemEventHandler): - def __init__(self, config): - self.config = config - def on_created(self, event): - if event.src_path.endswith(".pdf"): - logging.info(f"New PDF detected: {event.src_path}") - try: - extracted_text = extract_text_from_pdf(event.src_path) - full_text = prepend_markdown_prompt(extracted_text, "/app/summarize-notes.md") - - logging.info(f"Full Text to send to our API:{full_text}") - api_response = send_to_api( - self.config['api_url'], - self.config['bearer_token'], - self.config['model'], - full_text - ) - output_filename = f"{event.src_path}.summary.txt" - with open(output_filename, 'w') as f: - f.write(api_response.get("summary", "No summary provided")) - logging.info(f"Summary written to {output_filename}") - except Exception as e: - logging.error(f"Error processing {event.src_path}: {e}") +def show_ascii_art(): + ascii_art = """ + _ _ |_ _ __ _ _|_ _ _ +(/_(_ | |(_)| |(_) |_(/__> + """ + logging.info(ascii_art) if __name__ == "__main__": try: + show_ascii_art() + # Load configuration config = load_config() @@ -166,7 +326,10 @@ def on_created(self, event): logging.error(f"Directory {path_to_watch} does not exist. Please ensure the folder is mounted.") raise FileNotFoundError(f"Directory {path_to_watch} not found") - event_handler = PDFHandler(config) + # Ensure the "working" and "completed" folders exist + working_folder, completed_folder = ensure_folders(path_to_watch) + + event_handler = FileHandler(config, working_folder, completed_folder) observer = Observer() observer.schedule(event_handler, path=path_to_watch, recursive=False) observer.start() @@ -179,4 +342,4 @@ def on_created(self, event): observer.join() except Exception as e: - logging.critical(f"Application failed to start: {e}") + logging.critical(f"Application failed to start: {e}") \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index 2442e77..b81a7a2 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,6 +1,9 @@ +pytesseract==0.3.10 PyPDF2==3.0.0 pdf2image==1.16.3 -pytesseract==0.3.10 -watchdog==2.1.9 requests==2.28.1 -pyyaml==6.0 +watchdog==2.1.9 +PyYAML==6.0 +python-docx==0.8.11 +ffmpeg-python==0.2.0 +whisper==1.0 diff --git a/run.sh b/run.sh index efc8ada..500273e 100755 --- a/run.sh +++ b/run.sh @@ -65,13 +65,16 @@ done # Validate required arguments validate_args +# clean up past runs + sudo rm -rf incoming/completed incoming/working incoming/* -R + # Build the Docker image echo "Building Docker image $IMAGE_NAME..." -docker build -t "$IMAGE_NAME" . +docker build --no-cache -t "$IMAGE_NAME" . # Run the Docker container echo "Running Docker container..." -docker run -v "$INCOMING_FOLDER:/app/incoming" \ +docker run --rm -v "$INCOMING_FOLDER:/app/incoming" \ -v "$CONFIG_FILE:/app/config.yml" \ -v "$PROMPT_FILE:/app/summarize-notes.md" \ "$IMAGE_NAME" diff --git a/summarize-notes.md b/summarize-notes.md index 38f43e8..975968e 100644 --- a/summarize-notes.md +++ b/summarize-notes.md @@ -1,23 +1,35 @@ -## Summarize the following handwritten notes using the Cornell Note-Taking Method: +Summarize the following meeting transcript, ensuring the output is structured in Markdown. Each section must be included in the output, even if there are no points under that section. 
-The Cornell Method structures notes into three sections: +## STEPS +- Fully digest the content provided. +- Extract all action items agreed within the meeting and owners. +- Extract any interesting ideas brought up in the meeting. -1. **Notes Section**: Summarize the main content and details, which can include keywords, concepts, explanations, and diagrams from the notes. -2. **Cue Section**: Generate concise questions or keywords that correspond to the main ideas, concepts, or key terms from the notes. -3. **Summary Section**: Write a brief summary of the entire set of notes, capturing the overall message and key takeaways. -Please follow this format when summarizing the handwritten notes below: +## The summary should include: ---- +1. **Context:** Briefly describe the meeting's purpose and key participants. +2. **Main Ideas:** Identify the core topics discussed and organize them by theme. Write a 15-word sentence that captures what's recommended for people to do based on each idea discussed. +3. **Decisions:** List any significant decisions made during the meeting. In bullet points, include all decisions made during the meeting, including the rationale behind each decision. +4. **Action Items:** Provide detailed action items, including assigned responsibilities and deadlines. Write bullet points for ALL agreed actionable details. This includes any case where a speaker agrees to do, or look into, something. If there is a deadline mentioned, include it here. +5. **Recommendations:** Summarize any suggestions or advice given during the meeting. +6. **Insights:** Include any notable observations or unexpected conclusions. +7. **Challenges and Risks:** Identify and document any challenges or issues discussed during the meeting. Note any potential solutions or strategies proposed to address these challenges. +8. **Next Steps:** Outline the next steps and action plan to be taken after the meeting. +9. **Minutes:** 20 to 50 bullet points tracking the conversation and highlighting the most surprising, insightful, and/or interesting ideas that come up. If there are fewer than 50, collect all of them. Make sure you extract at least 20. -### Notes Section: -Provide a detailed but organized list of all the key information and supporting details found in the handwritten notes. -### Cue Section: -List out key questions or prompts based on the main concepts and ideas from the handwritten notes. These should help the reader recall the critical points covered. -### Summary Section: -Summarize the key ideas in a concise paragraph that captures the essence of the notes, including major points and overarching themes. +## Output Instructions: +- The summary must be in Markdown. +- Each section header (e.g., Context, Main Ideas) must be present even if the section is empty. +- Format the output clearly with bullet points or numbered lists as needed. +- Do not give warnings or notes; only output the requested sections. +- Do not repeat ideas, quotes, facts, or resources. +- Do not start items with the same opening words. +- Ensure you follow ALL these instructions when creating your output. ---- -INPUT: + +# INPUT: + +
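Since the prompt above requires every section header to appear even when a section is empty, a small check like the following (illustrative only, not part of this patch) could flag summaries that drop a mandatory section before they land in the completed folder:

```python
# Illustrative sketch, not part of this patch: verifies that a generated summary
# contains every section header the summarize-notes.md prompt declares mandatory.
REQUIRED_SECTIONS = [
    "Context", "Main Ideas", "Decisions", "Action Items", "Recommendations",
    "Insights", "Challenges and Risks", "Next Steps", "Minutes",
]


def missing_sections(summary_markdown: str) -> list:
    """Return the required section names that do not appear in the summary text."""
    return [name for name in REQUIRED_SECTIONS if name not in summary_markdown]


# Example usage against a file produced by FileHandler, e.g. notes.pdf.summary.md:
#   gaps = missing_sections(open("notes.pdf.summary.md").read())
#   if gaps:
#       logging.warning(f"Summary is missing sections: {gaps}")
```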