Welcome! This tutorial provides a comprehensive, step-by-step guide to implementing the "Automated Systematic Retrieval and Review" (ASRR) project in three phases.
The goal of this phase is to build the ASRR's "mind" by creating a sophisticated, private search engine over a curated set of documents. By the end of this guide, you will have a functional retrieval backend on Google Cloud, meeting the requirements of Phase 1 of the ASRR project. For a complete overview of the ASRR project, see the SOMMAS Wiki.
Before you begin, ensure you have the following setup:
- Windows Subsystem for Linux (WSL) with Ubuntu installed and integrated into VS Code. This tutorial assumes all commands are run from the VS Code terminal connected to your WSL Ubuntu environment. Install WSL
- Visual Studio Code (VS Code) installed. Visual Studio Code
- A Google Cloud Platform (GCP) Account with a project created and billing enabled. Google Cloud
- A GitHub Account. GitHub Account | See this Guide for setting up Authentication with GitHub.
- (Optional) Docker Desktop installed with WSL. Docker | WSL 2
- (Optional) Windows Terminal installed. Windows Terminal
We will begin by creating a copy of the project template, which contains the necessary scripts and sample documents.
- Create Your Repository:
  - Navigate to the template repository in your browser: https://github.com/edsponsler/sommas
  - Click the green "Use this template" button and select "Create a new repository".
  - Give your new repository a name (e.g., `my-asrr-project`) and click "Create repository".
- Clone Your Repository:
  - Open VS Code with a terminal connected to WSL Ubuntu.
  - First, ensure Git is installed in your Ubuntu environment: `sudo apt update && sudo apt install git -y`
  - Clone the repository you just created, replacing `<your-github-username>` and `<your-repo-name>` with your details: `git clone https://github.com/<your-github-username>/<your-repo-name>.git`
  - Navigate into your new project directory: `cd <your-repo-name>`
- Create a Python Virtual Environment:
  - It is a best practice to keep project dependencies isolated. We will create a virtual environment inside your project folder: `python3 -m venv .venv`
  - Activate the environment. You must do this every time you open a new terminal to work on this project: `source .venv/bin/activate`
  - Your terminal prompt should now be prefixed with `(.venv)`, indicating the environment is active.
The gcloud command-line interface (CLI) is essential for interacting with your GCP resources. For this WSL-based setup, a dual installation is required.
- Install the gcloud CLI:
  - On Windows: Follow the official instructions to install the gcloud CLI on Windows. This is required because Windows can seamlessly open a browser for authentication.
  - In WSL Ubuntu: Follow the official instructions to install the gcloud CLI on Debian/Ubuntu.
- Authenticate from WSL:
  - You need to perform two separate authentications: one for the gcloud CLI tool itself, and one for your application code.
  - For the gcloud CLI: This authorizes you to run gcloud commands in the terminal. The `--no-browser` flag is crucial, as it prevents WSL from trying (and failing) to open a browser window: `gcloud auth login --no-browser`
  - This command will output a long `gcloud auth ...` command. Copy this entire command.
  - Open a standard Windows Command Prompt (CMD), paste the copied command, and press Enter.
  - This will launch a browser window on your Windows desktop. Complete the login and grant the necessary permissions.
  - For your Application Code (ADC): This creates a credential file that your Python script will automatically find and use: `gcloud auth application-default login --no-browser`
  - You will need to repeat the same copy/paste process in a Windows CMD to complete the browser-based login.
- Set Your Project:
  - Once authentication is complete, return to your WSL terminal. Tell gcloud which project to work on, replacing `your-project-id` with your actual GCP Project ID: `gcloud config set project your-project-id`
We need two secure locations in Google Cloud Storage to hold our documents: one for the original raw files and another for the processed, machine-readable data.
- Choose Unique Bucket Names: Bucket names must be globally unique. A good practice is to append your unique project ID, for example: `asrr-raw-data-your-project-id`.
- Create the Buckets: Run these commands from your WSL terminal, replacing the placeholder names with your unique names:
  - Create the raw bucket: `gcloud storage buckets create gs://your-unique-raw-bucket-name --project=your-project-id --uniform-bucket-level-access`
  - Create the processed bucket: `gcloud storage buckets create gs://your-unique-processed-bucket-name --project=your-project-id --uniform-bucket-level-access`
This step uses the provided Python script to transform our raw documents into a structured format suitable for Vertex AI. We will use pip-tools to ensure a consistent and reproducible Python environment.
- Install Python Dependencies:
  - Ensure your virtual environment is active (`source .venv/bin/activate`).
  - First, install the pip-tools package itself: `pip install pip-tools`
  - Next, use pip-compile to generate a `requirements.txt` file from your `requirements.in`. This command resolves and pins all dependencies, creating a complete "lock file" for your environment: `pip-compile requirements.in`
  - Finally, use pip-sync to install all the packages listed in the newly generated `requirements.txt`. pip-sync is powerful because it ensures your environment exactly matches the requirements file, adding missing packages and removing any that don't belong: `pip-sync`
- Upload Raw Corpus to Cloud Storage:
  - The `proto-corpus` folder in your repository contains the sample documents. Upload them to the raw bucket you created: `gcloud storage cp proto-corpus/* gs://your-unique-raw-bucket-name/`
- Configure and Run the Script:
  - The script reads its configuration from a local `.env` file. To set this up, create a copy of the example file: `cp .env-example .env`
  - Open the new `.env` file in VS Code and replace the placeholder values with your actual GCP Project ID and the unique bucket names you created in Step 3.
  - Save the `.env` file. The script will automatically load these settings when you execute it: `python preprocess.py`
  - The script will download the raw files, perform intelligent chunking, add source metadata, and upload a single `processed_corpus.jsonl` file to your processed bucket.
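As a reference point, here is a minimal sketch of the kind of chunking and JSONL output a script like `preprocess.py` produces. The field names (`id`, `content`, `source_file`) and the word-count chunking strategy are assumptions for illustration; the actual script's schema and chunking logic may differ.

```python
import json


def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Split a document into word-bounded chunks of roughly max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def to_jsonl_records(filename: str, text: str) -> list[str]:
    """Turn one raw document into JSONL lines: content plus source metadata."""
    records = []
    for i, chunk in enumerate(chunk_text(text)):
        record = {
            "id": f"{filename}-{i}",     # unique ID per chunk
            "content": chunk,            # the searchable text
            "source_file": filename,     # metadata used later for citations
        }
        records.append(json.dumps(record))
    return records


# Example: one short document becomes a single JSONL line.
lines = to_jsonl_records("minsky.txt", "A K-line is a memory structure.")
print(lines[0])
```

Each line of the resulting `processed_corpus.jsonl` is one self-contained JSON object, which is what Vertex AI Search's structured-data (JSONL) import expects.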
- Verify the Output:
  - In the Google Cloud Console, navigate to your processed bucket and confirm that the `processed_corpus.jsonl` file exists.
With our data prepared, we can now build the search engine itself. We will first create the backend datastore and then create an "App" to act as the frontend interface.
- Create the Datastore:
  - In the GCP Console, search for or navigate to AI Applications.
  - Select the Data Stores tab from the left menu.
  - Click + CREATE NEW DATASTORE.
  - Configure it with the following settings:
    - Data Source: Select Cloud Storage.
    - Import data from Cloud Storage: Select Structured data (JSONL).
    - Synchronization frequency: Select One time.
    - Select a folder or a file you want to import: Choose the Folder tab, click Browse, and select the row for your processed bucket (`gs://your-unique-processed-bucket-name`). Do not click into the bucket; the path should be to the bucket itself.
    - Select Continue.
    - Review schema and assign key properties: The defaults are fine. The Key property fields will be blank.
    - Select Continue.
    - Configure your data store: For Multi-region, select global (Global).
    - Datastore name: Give it a name like `asrr-corpus-datastore`.
  - Click Create and wait 15-30 minutes for the indexing process to complete.
- Create the App and Link the Datastore:
  - Select the Apps tab from the left menu.
  - Click + CREATE APP and select the Custom search (general) type.
  - Ensure both Enterprise edition features and Advanced LLM features are checked.
  - App name: Give it a name like `asrr-search-engine`.
  - Fill in your company name.
  - Location of your app: For Multi-region, select global (Global).
  - Select Continue.
  - Check the box next to the `asrr-corpus-datastore` you just created and click Create.
This final step validates all our work. You will now be able to query your knowledge base.
- Navigate to the Preview Page:
  - From AI Applications, select the Apps tab; you will now see your `asrr-search-engine` app. Click on it.
  - Select the Preview tab.
- Run Test Queries:
  - Use the search bar to ask questions based on the documents you indexed, for example: `What is a K-line?`, `What is the concept of "mindless" agents?`, `How is Cloud Run serverless?`
  - Observe the results. Each result should show a relevant text snippet and the original source file, confirming your entire pipeline is working correctly.
To make your ASRR more powerful, you can add more documents to its knowledge base.
- Find Documents: You can find scientific and philosophical texts on sites like Google Scholar or public repositories like arXiv.org and PubMed Central. For technical documentation, you can save web pages from official sites as text files.
- Update Your Corpus:
  - Add the new files (PDF, TXT, etc.) to your local `proto-corpus` folder.
  - Re-run the upload command: `gcloud storage cp proto-corpus/* gs://your-unique-raw-bucket-name/`
  - Re-run the processing script: `python preprocess.py`
  - Vertex AI Search will automatically detect the changes in the `processed_corpus.jsonl` file and re-index your datastore.
You have now successfully completed Phase 1 of the ASRR Project. Congratulations!
You've built a powerful, private search engine. In Phase 2, we will transform this search engine into a conversational partner. The objective is to create a true subject matter expert—an AI you can dialogue with to clarify concepts, ask follow-up questions, and receive answers that are synthesized and grounded only in your curated documents.
We will use the Vertex AI Agent Builder framework to wrap our search engine in a conversational layer, and Streamlit to create a simple but effective web interface for interaction.
The app.py script, which runs the conversational agent, needs to know the ID of your search app and which Cloud Storage bucket to use for temporary files.
- Find Your Data Store ID:
  - In the Google Cloud Console, navigate to AI Applications.
  - Select the Apps tab and click on the `asrr-search-engine` app you created in Phase 1.
  - In the app's menu, select the Data tab.
  - You will see your datastore listed. Copy its ID (it will be a long alphanumeric string).
- Update Your `.env` File:
  - Open the `.env` file in your project.
  - You will see placeholders for `DATA_STORE_ID` and `STAGING_BUCKET`.
  - Paste the Data Store ID you just copied as the value for `DATA_STORE_ID`.
  - For `STAGING_BUCKET`, provide the full path to the staging bucket you created in Step 3 (e.g., `gs://your-unique-staging-bucket-name`).
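For reference, a `.env` file is just plain `KEY=value` lines. The project likely loads it with a library such as python-dotenv; this stdlib-only sketch shows the equivalent parsing, so you can see exactly what the file must contain (the variable names match the placeholders above):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


# A sample .env body with the placeholders from this step.
sample = """
# Phase 2 settings
DATA_STORE_ID=your-long-alphanumeric-id
STAGING_BUCKET=gs://your-unique-staging-bucket-name
"""
config = parse_env(sample)
print(config["DATA_STORE_ID"])
```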
The app.py script orchestrates three main components to bring your conversational analyst to life:
- The Search Tool (`search_knowledge_base`): This Python function is the bridge to the search engine you built in Phase 1. It takes a user's query string as input, initializes the `discoveryengine` client, and points it to your specific data store using the `GCP_PROJECT_ID` and `DATA_STORE_ID` from your `.env` file. It sends the query and formats the raw search results into a clean string, including the content and the source file for each result. This formatted string is what the agent will "read" to find an answer.
- The Agent (`Agent`): The agent is the "brain" of the operation, created using `from google.adk.agents import Agent`. It's initialized with a specific generative model (like Gemini 2.5 Flash) and, most importantly, the `search_knowledge_base` function is passed into its `tools` list. This tells the agent: "When you need to answer a question, you have one tool you can use: this search function." The agent learns to automatically call this tool with a relevant search query whenever it needs information.
- The Web Interface (Streamlit): Streamlit is a framework that turns Python scripts into interactive web apps. `app.py` uses it to create the chat interface:
  - `st.title` sets the page title.
  - `st.session_state` is used to remember the chat history, so the conversation persists as you interact with the app.
  - The main loop waits for user input with `st.chat_input`. When you send a message, it's added to the history and displayed on the screen.
  - The agent's response is streamed back to the UI using `st.write_stream`, providing a real-time, "typing" effect for a better user experience.
With the configuration complete, you can now launch the web application.
- Launch the App:
  - In your WSL terminal (with the `.venv` virtual environment active), run the following command: `python -m streamlit run app.py`
- Interact with the Analyst:
  - This command will start a local web server and should automatically open a new tab in your browser. You'll be greeted by the "ASRR: The Conversational Analyst" interface.
  - Try asking the same questions from the end of Phase 1. Notice the difference: instead of just a list of search results, the agent now provides a synthesized, conversational answer with citations pointing back to the source documents.
Congratulations! You have successfully completed Phase 2 of the ASRR project. You now have a functional conversational agent that can reason over your private document set, providing grounded, synthesized answers through an interactive web interface. This lays the critical foundation for Phase 3, where we will explore deploying, evaluating, and extending the agent's capabilities.
You've built a conversational analyst. In this final phase, we elevate the agent from a mere subject matter expert to a true Implementation Strategist. The objective is to empower the agent to deconstruct high-level, complex challenges and synthesize concrete, actionable implementation plans on Google Cloud.
This new capability is powered by LangGraph, a library for building stateful, multi-actor applications with LLMs. We use it to define a more sophisticated, multi-step tool that can first perform research and then synthesize a detailed proposal based on its findings.
The core of Phase 3 is the new propose_gcp_architecture tool, which is defined in the new tools.py file. This isn't just a single function call; it's a stateful graph that executes a sequence of steps.
- The State (`GraphState`): LangGraph works by passing a "state" object between nodes. Our state is a simple dictionary that tracks the progress of the task:
  - `user_request`: The original, complex query from the user.
  - `research_results`: The context gathered from our Vertex AI Search knowledge base.
  - `final_proposal`: The final, synthesized architectural plan.
- The Nodes (Functions): Each step in our workflow is a "node," which is just a Python function that modifies the state.
  - `survey_technologies_node`: This is the first step. It takes the `user_request` from the state, performs a comprehensive search against our knowledge base, and populates the `research_results` field in the state.
  - `synthesize_proposal_node`: This node runs after the research is complete. It takes both the original `user_request` and the new `research_results` from the state, feeds them into an expert-level prompt, and uses a generative model to write a detailed architectural proposal. The output is saved to the `final_proposal` field.
- The Graph (`workflow`): We define the order of operations by adding nodes and connecting them with "edges". Our graph is a straightforward sequence:
  - The entry point is `survey_technologies`.
  - After `survey_technologies` completes, the graph transitions to `synthesize_proposal`.
  - After `synthesize_proposal` completes, the graph finishes, and the final state (containing the proposal) is returned.
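The state, nodes, and edges described above can be sketched end-to-end. The real `tools.py` wires this up with LangGraph's `StateGraph`; the following dependency-free sketch shows the same state flow with plain functions, and the node bodies are stand-ins, not the project's actual prompts or search calls:

```python
from typing import TypedDict


class GraphState(TypedDict, total=False):
    user_request: str
    research_results: str
    final_proposal: str


def survey_technologies_node(state: GraphState) -> GraphState:
    # Real node: query the Vertex AI Search knowledge base for context.
    state["research_results"] = f"[research notes for: {state['user_request']}]"
    return state


def synthesize_proposal_node(state: GraphState) -> GraphState:
    # Real node: feed request + research into a generative model.
    state["final_proposal"] = (
        f"Proposal for '{state['user_request']}' based on "
        f"{state['research_results']}"
    )
    return state


def run_workflow(user_request: str) -> GraphState:
    """Execute the two nodes in sequence, mimicking the graph's edges."""
    state: GraphState = {"user_request": user_request}
    for node in (survey_technologies_node, synthesize_proposal_node):
        state = node(state)
    return state


final = run_workflow("Deploy thousands of simple agents")
print(final["final_proposal"])
```

In LangGraph proper, the loop in `run_workflow` is replaced by a compiled graph whose entry point and edges encode the same sequence, which makes it easy to later add branching or looping steps.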
The main app.py script is updated to integrate this powerful new tool:
- Dual Tools: The agent is now initialized with a list of two tools: the simple `search_knowledge_base` function and the new, advanced `propose_gcp_architecture` graph-based tool.
- Intelligent Routing: The system prompt has been enhanced to act as a router. It instructs the agent on how to choose the correct tool based on the user's query.
This routing is the key to the agent's new strategic capability. For example:
- Simple Question: If you ask, "Who is Minsky?", the agent, guided by the prompt, recognizes this as a factual lookup and calls the simple, efficient `search_knowledge_base` tool.
- Complex Question: If you ask, "Propose a scalable architecture for deploying thousands of simple, specialized agents.", the agent identifies this as a high-level design task. It then invokes the `propose_gcp_architecture` tool, triggering the entire LangGraph workflow of research followed by synthesis.
The method for launching the app remains the same. The new, more powerful agent is now the default.
- Ensure your virtual environment is active: `source .venv/bin/activate`
- Launch the Streamlit application from the project root folder: `python -m streamlit run app.py`
This concludes the three-phase implementation of the ASRR project. You have successfully built:
- Phase 1: A robust, private knowledge base using Google Cloud Storage and Vertex AI Search.
- Phase 2: A conversational analyst capable of answering questions grounded in that knowledge base.
- Phase 3: An implementation strategist that uses a multi-step LangGraph agent to research and formulate detailed architectural proposals.
This project serves as a powerful foundation. Here are some ideas for extending its capabilities:
- Extend the Corpus: The single most impactful improvement is to expand the knowledge base. You can add extensive documentation for other cloud providers like AWS and Azure for comparative analysis, ingest more advanced research papers on topics like Multi-Agent Reinforcement Learning (MARL), or add internal best-practice documents.
- Advanced Tool Development: You can build more specialized LangGraph tools as described in the project plan, such as a tool to explicitly compare and contrast frameworks (e.g., Mesa vs. Ray) or a tool to design the "K-line" analogue using a graph database.
- Team Deployment: Package the application using Docker and deploy it as a service on Google Cloud Run, making it accessible to your entire team.