ScrapeGraphAI
diff --git a/.gitignore b/.gitignore
Lines changed: 5 additions & 0 deletions
diff --git a/graphs/abstract_graph.json b/graphs/abstract_graph.json
Lines changed: 32 additions & 0 deletions
diff --git a/graphs/base_graph.json b/graphs/base_graph.json
Lines changed: 27 additions & 0 deletions
diff --git a/graphs/csv_scraper_graph.json b/graphs/csv_scraper_graph.json
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,5 @@
+**/.DS_Store
+autogenerator.py
+code.txt
+stringify.py
+reply.txt
@@ -0,0 +1,27 @@
+[
+    {
+        "prompt": "What is the purpose of the BaseGraph class in scrapegraph ai?",
+        "answer": "The `BaseGraph` class in scrapegraph ai manages the execution flow of a graph composed of interconnected nodes. It takes care of traversing the nodes and executing them in the correct order, starting from the entry point node."
+    },
+    {
+        "prompt": "What are the main attributes of the BaseGraph class in scrapegraph ai?",
+        "answer": "The main attributes of the `BaseGraph` class in scrapegraph ai are `nodes`, which is a list of node instances; `edges`, a dictionary representing the directed edges of the graph; and `entry_point`, the name of the entry point node from which the graph execution begins."
+    },
+    {
+        "prompt": "What is the role of the _create_edges method in the BaseGraph class of scrapegraph ai?",
+        "answer": "The `_create_edges` method in the `BaseGraph` class of scrapegraph ai is a helper method that creates a dictionary of edges from the given iterable of tuples. The keys in the dictionary are the from-nodes, and the values are the to-nodes."
+    },
+    {
+        "prompt": "What does the _execute_standard method in the BaseGraph class of scrapegraph ai do?",
+        "answer": "The `_execute_standard` method in the `BaseGraph` class of scrapegraph ai executes the graph by traversing nodes starting from the entry point using the standard method. It keeps track of the execution time and other information, such as token usage and cost, and returns the final state and a list of execution info."
+    },
+    {
+        "prompt": "What is the difference between the execute and _execute_standard methods in the BaseGraph class of scrapegraph ai?",
+        "answer": "The `execute` method in the `BaseGraph` class of scrapegraph ai is responsible for executing the graph, using either the BurrBridge or the standard method, depending on the `use_burr` attribute. The `_execute_standard` method, on the other hand, executes the graph using only the standard method and is called by `execute` when `use_burr` is set to False."
+    },
+    {
+        "prompt": "Show me the implementation of Scrapegraph's BaseGraph.",
+        "answer": "Sure, here's the implementation for the `BaseGraph` class in ScrapeGraphAI:\n\n```python\n\"\"\"\nBaseGraph Module\n\"\"\"\n\nimport time\nimport warnings\nfrom langchain_community.callbacks import get_openai_callback\nfrom typing import Tuple\n\n\nclass BaseGraph:\n    \"\"\"\n    BaseGraph manages the execution flow of a graph composed of interconnected nodes.\n\n    Attributes:\n        nodes (list): A list of the node instances that make up the graph.\n        edges (dict): A dictionary representing the directed edges of the graph where each\n                      key-value pair corresponds to the from-node and to-node relationship.\n        entry_point (str): The name of the entry point node from which the graph execution begins.\n\n    Args:\n        nodes (iterable): An iterable of node instances that will be part of the graph.\n        edges (iterable): An iterable of tuples where each tuple represents a directed edge\n                          in the graph, defined by a pair of nodes (from_node, to_node).\n        entry_point (BaseNode): The node instance that represents the entry point of the graph.\n\n    Raises:\n        Warning: If the entry point node is not the first node in the list.\n\n    Example:\n        >>> BaseGraph(\n        ...    nodes=[\n        ...        fetch_node,\n        ...        parse_node,\n        ...        rag_node,\n        ...        generate_answer_node,\n        ...    ],\n        ...    edges=[\n        ...        (fetch_node, parse_node),\n        ...        (parse_node, rag_node),\n        ...        (rag_node, generate_answer_node)\n        ...    ],\n        ...    entry_point=fetch_node,\n        ...    use_burr=True,\n        ...    burr_config={\"app_instance_id\": \"example-instance\"}\n        ... )\n    \"\"\"\n\n    def __init__(self, nodes: list, edges: list, entry_point: str, use_burr: bool = False, burr_config: dict = None):\n\n        self.nodes = nodes\n        self.edges = self._create_edges({e for e in edges})\n        self.entry_point = entry_point.node_name\n        self.initial_state = {}\n\n        if nodes[0].node_name != entry_point.node_name:\n            # raise a warning if the entry point is not the first node in the list\n            warnings.warn(\n                \"Careful! The entry point node is different from the first node of the graph.\")\n        \n        # Burr configuration\n        self.use_burr = use_burr\n        self.burr_config = burr_config or {}\n\n    def _create_edges(self, edges: list) -> dict:\n        \"\"\"\n        Helper method to create a dictionary of edges from the given iterable of tuples.\n\n        Args:\n            edges (iterable): An iterable of tuples representing the directed edges.\n\n        Returns:\n            dict: A dictionary of edges with the from-node as keys and to-node as values.\n        \"\"\"\n\n        edge_dict = {}\n        for from_node, to_node in edges:\n            edge_dict[from_node.node_name] = to_node.node_name\n        return edge_dict\n\n    def _execute_standard(self, initial_state: dict) -> Tuple[dict, list]:\n        \"\"\"\n        Executes the graph by traversing nodes starting from the entry point using the standard method.\n\n        Args:\n            initial_state (dict): The initial state to pass to the entry point node.\n\n        Returns:\n            Tuple[dict, list]: A tuple containing the final state and a list of execution info.\n        \"\"\"\n        current_node_name = self.entry_point\n        state = initial_state\n\n        # variables for tracking execution info\n        total_exec_time = 0.0\n        exec_info = []\n        cb_total = {\n            \"total_tokens\": 0,\n            \"prompt_tokens\": 0,\n            \"completion_tokens\": 0,\n            \"successful_requests\": 0,\n            \"total_cost_USD\": 0.0,\n        }\n\n        while current_node_name:\n            curr_time = time.time()\n            current_node = next(node for node in self.nodes if node.node_name == current_node_name)\n\n            with get_openai_callback() as cb:\n                result = current_node.execute(state)\n                node_exec_time = time.time() - curr_time\n                total_exec_time += node_exec_time\n\n                cb_data = {\n                    \"node_name\": current_node.node_name,\n                    \"total_tokens\": cb.total_tokens,\n                    \"prompt_tokens\": cb.prompt_tokens,\n                    \"completion_tokens\": cb.completion_tokens,\n                    \"successful_requests\": cb.successful_requests,\n                    \"total_cost_USD\": cb.total_cost,\n                    \"exec_time\": node_exec_time,\n                }\n\n                exec_info.append(cb_data)\n\n                cb_total[\"total_tokens\"] += cb_data[\"total_tokens\"]\n                cb_total[\"prompt_tokens\"] += cb_data[\"prompt_tokens\"]\n                cb_total[\"completion_tokens\"] += cb_data[\"completion_tokens\"]\n                cb_total[\"successful_requests\"] += cb_data[\"successful_requests\"]\n                cb_total[\"total_cost_USD\"] += cb_data[\"total_cost_USD\"]\n\n            if current_node.node_type == \"conditional_node\":\n                current_node_name = result\n            elif current_node_name in self.edges:\n                current_node_name = self.edges[current_node_name]\n            else:\n                current_node_name = None\n\n        exec_info.append({\n            \"node_name\": \"TOTAL RESULT\",\n            \"total_tokens\": cb_total[\"total_tokens\"],\n            \"prompt_tokens\": cb_total[\"prompt_tokens\"],\n            \"completion_tokens\": cb_total[\"completion_tokens\"],\n            \"successful_requests\": cb_total[\"successful_requests\"],\n            \"total_cost_USD\": cb_total[\"total_cost_USD\"],\n            \"exec_time\": total_exec_time,\n        })\n\n        return state, exec_info\n\n    def execute(self, initial_state: dict) -> Tuple[dict, list]:\n        \"\"\"\n        Executes the graph by either using BurrBridge or the standard method.\n\n        Args:\n            initial_state (dict): The initial state to pass to the entry point node.\n\n        Returns:\n            Tuple[dict, list]: A tuple containing the final state and a list of execution info.\n        \"\"\"\n\n        self.initial_state = initial_state\n        if self.use_burr:\n\n            from ..integrations import BurrBridge\n            \n            bridge = BurrBridge(self, self.burr_config)\n            result = bridge.execute(initial_state)\n            return (result[\"_state\"], [])\n        else:\n            return self._execute_standard(initial_state)\n```"
+    }
+]
+  
@@ -0,0 +1,27 @@
+[
+    {
+        "prompt": "What is the purpose of the CSVScraperGraph class in scrapegraph ai?",
+        "answer": "The `CSVScraperGraph` class in scrapegraph ai is a smart scraper tool that automates the process of extracting information from CSV files or directories using a natural language model to interpret and answer prompts."
+    },
+    {
+        "prompt": "What are the main attributes of the CSVScraperGraph class in scrapegraph ai?",
+        "answer": "The main attributes of the `CSVScraperGraph` class in scrapegraph ai are inherited from the `AbstractGraph` class, such as `prompt`, `config`, `source`, and `schema`. Additionally, it has the `input_key` attribute, which is either 'csv' or 'csv_dir' based on the source."
+    },
+    {
+        "prompt": "What is the role of the _create_graph method in the CSVScraperGraph class of scrapegraph ai?",
+        "answer": "The `_create_graph` method in the `CSVScraperGraph` class of scrapegraph ai is responsible for creating the graph of nodes representing the workflow for CSV scraping. It includes nodes for fetching the CSV data, processing it with a RAG (Retrieval-Augmented Generation) model, and generating the final answer."
+    },
+    {
+        "prompt": "What does the run method in the CSVScraperGraph class of scrapegraph ai do?",
+        "answer": "The `run` method in the `CSVScraperGraph` class of scrapegraph ai executes the CSV scraping process by providing the initial inputs to the graph and calling the graph's `execute` method. It then stores the final state and execution info and returns the answer to the prompt."
+    },
+    {
+        "prompt": "How does the CSVScraperGraph class in scrapegraph ai handle CSV files and directories?",
+        "answer": "The `CSVScraperGraph` class in scrapegraph ai determines the `input_key` attribute based on the source. If the source ends with 'csv', the `input_key` is set to 'csv', and if not, it is set to 'csv_dir'. This `input_key` is then used to provide the correct input to the graph's fetch node."
+    },
+    {
+        "prompt": "Show me the implementation of the CSVScraperGraph class in scrapegraph.",
+        "answer": "Sure, here's the implementation for the `CSVScraperGraph` class in ScrapeGraphAI:\n\n```python\n\"\"\"\nModule for creating the CSV scraper graph\n\"\"\"\n\nfrom typing import Optional\n\nfrom .base_graph import BaseGraph\nfrom .abstract_graph import AbstractGraph\n\nfrom ..nodes import (\n    FetchNode,\n    RAGNode,\n    GenerateAnswerCSVNode\n)\n\n\nclass CSVScraperGraph(AbstractGraph):\n    \"\"\"\n    CSVScraperGraph is a scraping tool that automates the process of extracting\n    information from CSV files or directories using a natural language model to interpret and answer prompts.\n    \"\"\"\n\n    def __init__(self, prompt: str, source: str, config: dict, schema: Optional[str] = None):\n        \"\"\"\n        Initializes the CSVScraperGraph with a prompt, source, and configuration.\n        \"\"\"\n        super().__init__(prompt, config, source, schema)\n\n        self.input_key = \"csv\" if source.endswith(\"csv\") else \"csv_dir\"\n\n    def _create_graph(self):\n        \"\"\"\n        Creates the graph of nodes representing the workflow for CSV scraping.\n        \"\"\"\n        fetch_node = FetchNode(\n            input=\"csv | csv_dir\",\n            output=[\"doc\"],\n        )\n        rag_node = RAGNode(\n            input=\"user_prompt & doc\",\n            output=[\"relevant_chunks\"],\n            node_config={\n                \"llm_model\": self.llm_model,\n                \"embedder_model\": self.embedder_model,\n            }\n        )\n        generate_answer_node = GenerateAnswerCSVNode(\n            input=\"user_prompt & (relevant_chunks | doc)\",\n            output=[\"answer\"],\n            node_config={\n                \"llm_model\": self.llm_model,\n                \"schema\": self.schema,\n            }\n        )\n\n        return BaseGraph(\n            nodes=[\n                fetch_node,\n                rag_node,\n                generate_answer_node,\n            ],\n            edges=[\n                (fetch_node, rag_node),\n                (rag_node, generate_answer_node)\n            ],\n            entry_point=fetch_node\n        )\n\n    def run(self) -> str:\n        \"\"\"\n        Executes the CSV scraping process and returns the answer to the prompt.\n        \"\"\"\n        inputs = {\"user_prompt\": self.prompt, self.input_key: self.source}\n        self.final_state, self.execution_info = self.graph.execute(inputs)\n\n        return self.final_state.get(\"answer\", \"No answer found.\")\n```"
+    }
+]
+