[Feature Request] Deep Research #2115

fire17 · 2025-02-13T22:49:36Z

Hi there
Excited to try alto out :)

Was wondering if you're planning on making a deep research pipeline.

Plenty of open-source implementations are starting to pop up. I would love to see several pipelines.
For example, the DeepSeek style: 30-50 search results + thinking.
And the OpenAI style, where it keeps researching until it is satisfied that it has compiled everything it needs.
And maybe other styles.

Thanks a lot and all the best!

pritipsingh · 2025-02-14T07:18:30Z

Hey @fire17 that's a great request! Would love for you to elaborate more on it. This is something easily achievable via our workflows concept. Is this something you're looking for in the playground front end, or something we can improve on in the SDK?

samyogdhital · 2025-02-14T14:22:57Z

@pritipsingh My recommendation here is that we go the completely open-source route.

Precise User query understanding

The master agent is given a detailed prompt by the user.
The master agent first analyzes the query itself, then asks the user 3-5 follow-up questions to understand precisely how broad and how deep the user wants the research to go.

Multiple small precise queries

Once the master agent knows precisely what the user wants, it divides the user's prompt, plus what it learned from the questionnaire, into multiple small queries.

Use SearXNG to discover the URLs most likely to have the information the master agent needs

The master agent has the list of small queries. It now needs to discover websites that are highly likely to answer these queries.

For this we will use SearXNG. Its API returns a list of websites for a given query (just like Google). In fact, it combines multiple search engines to produce that list.

Now the master agent makes a tool call to the SearXNG API for each of these small queries, in parallel. (Obviously we will make our own tool to handle their API in agno and offer it as a tool for others to use.)
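The parallel search step could look roughly like the sketch below. It is only an illustration, not an existing agno tool: the names SEARXNG_URL, build_search_url, fetch_results and search_all are hypothetical, and it assumes a self-hosted SearXNG instance that has the JSON output format enabled and returns its hits under a top-level "results" key.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a self-hosted SearXNG instance with JSON output enabled.
SEARXNG_URL = "http://localhost:8080"


def build_search_url(query, base_url=SEARXNG_URL):
    """Build the SearXNG search URL that asks for JSON results."""
    return f"{base_url}/search?{urlencode({'q': query, 'format': 'json'})}"


def fetch_results(query, base_url=SEARXNG_URL):
    """Run one query and return its list of result objects."""
    with urlopen(build_search_url(query, base_url), timeout=30) as response:
        return json.load(response).get("results", [])


def search_all(queries, fetch=fetch_results, max_workers=8):
    """Run every query in parallel and map each query to its results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(queries, pool.map(fetch, queries)))
```

Passing the fetch function in as a parameter keeps the parallel part testable without a running SearXNG instance, and threads are enough here because the work is network-bound.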

The master agent will get a response for each of these queries from SearXNG in the format below.

    [
      {
        "url": "https://openai.com/",
        "title": "OpenAI",
        "content": "Our work to create safe and beneficial AI requires a deep understanding of the potential risks and benefits, as well as careful consideration of the impact · We research generative models and how to align them with human values",
        "publishedDate": null,
        "thumbnail": "",
        "engine": "brave",
        "template": "default.html",
        "parsed_url": [
          "https",
          "openai.com",
          "/",
          "",
          "",
          ""
        ],
        "engines": [
          "brave"
        ],
        "positions": [
          1
        ],
        "score": 1.0,
        "category": "general"
      }
    ]

Notice that each object in that array has url and content keys.

From SearXNG's response for each query, the master agent analyzes the results and picks only the websites whose content field is most relevant (to save context window), then scrapes those individual websites.
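Before the agent even reads the content fields, a cheap pre-filter can shrink the list. The sketch below is one possible heuristic, not the selection logic proposed above: it ranks by the score field that SearXNG returns, drops duplicate URLs, and keeps only the keys the agent needs. The function name pick_top_results and its parameters are hypothetical.

```python
def pick_top_results(results, limit=3, min_score=0.0):
    """Return up to `limit` results, highest score first, one per URL."""
    seen = set()
    picked = []
    ranked = sorted(results, key=lambda r: r.get("score", 0.0), reverse=True)
    for item in ranked:
        if len(picked) >= limit:
            break
        url = item.get("url")
        # Skip entries without a URL, repeated URLs, and low-scoring hits.
        if not url or url in seen or item.get("score", 0.0) < min_score:
            continue
        seen.add(url)
        # Keep only the fields the agent needs, to save context window.
        picked.append({
            "url": url,
            "title": item.get("title", ""),
            "content": item.get("content", ""),
        })
    return picked
```

The agent would then judge relevance from the content field of this shorter list and decide which URLs are worth scraping.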

Web scraping individual websites with open-source, self-hostable tools

Crawl4AI, Firecrawl, or Scrapegraph-ai can be used here. (Of course, only one tool 😄)
(I would prefer the self-hosted version of Firecrawl.)

The master agent will have access to a tool that scrapes all the websites it got from SearXNG's API for each small query.
As the information from each of these websites comes back from the scraper tool, the master agent checks it against the information it needs. There are 2 scenarios:

Information is incomplete

If the information is still incomplete after the master agent has:
- Divided the user prompt into multiple queries
- Got the list of websites for each of these queries from SearXNG
- Scraped all of those websites using Firecrawl or a similar web scraping tool
then the master agent will form new queries that target the gaps in its knowledge, rephrased with some twist and aimed at the exact information it needs. It then repeats the process: getting websites from SearXNG, scraping them, and analyzing the results.

If it gets stuck in a loop and cannot get quality information, or we see the context window filling up quickly, a timeout variable (say, a certain number of minutes) will force the agent to move to the output phase with the limited information it has.
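The retry-until-complete loop with a time budget might be sketched as follows. Everything here is hypothetical scaffolding: search, scrape, is_complete and next_queries stand in for the SearXNG tool, the scraper tool, and two LLM judgments, and the sketch only enforces the timeout, not a context-window budget.

```python
import time


def research_loop(queries, search, scrape, is_complete, next_queries,
                  timeout_seconds=300, clock=time.monotonic):
    """Search and scrape until the notes are complete or time runs out."""
    deadline = clock() + timeout_seconds
    notes = []
    while queries and clock() < deadline:
        for query in queries:
            for url in search(query):
                notes.append({"query": query, "url": url, "text": scrape(url)})
        if is_complete(notes):
            return notes, "complete"
        # Ask for follow-up queries that target what is still missing.
        queries = next_queries(notes)
    # Either the budget ran out or there was nothing left to ask.
    status = "timed_out" if queries else "no_more_queries"
    return notes, status
```

The deadline is only checked between rounds, so a single slow round can overrun the budget; a stricter version would also check it inside the inner loops.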

Information is complete

If the master agent thinks the information is complete and will fully satisfy the user query, it will move to the output phase.

OUTPUT

Finally, the master agent is in the output phase.

It will analyze all the information it was able to gather during this whole process.
Then it will write a detailed, research-style response for the user. It will leave out any irrelevant information and will cite every important piece of information from an accurate source.
It will pass the response to the user.
Also, since this process is long, saving the output to an output.md file makes sense.
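Saving a cited report to output.md could be as simple as the sketch below. The write_report function and the shape of the findings (a list of dicts with text and url keys) are assumptions made for illustration; in practice the agent would write the prose itself and this helper would only handle the citation numbering and the file.

```python
from pathlib import Path


def write_report(title, findings, path="output.md"):
    """Write findings as a Markdown report with numbered source citations."""
    sources = []  # unique source URLs, in first-cited order
    lines = [f"# {title}", ""]
    for finding in findings:
        url = finding["url"]
        if url not in sources:
            sources.append(url)
        # Cite each finding by the number of its source.
        lines.append(f"- {finding['text']} [{sources.index(url) + 1}]")
    lines += ["", "## Sources", ""]
    lines += [f"{i}. {url}" for i, url in enumerate(sources, start=1)]
    report = Path(path)
    report.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return report
```

Findings that share a URL reuse the same citation number, so the source list stays short even when one page supports several claims.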

Important consideration

We can do this with a single agent as well as with multiple agents. But we have to make sure that the multi-agent flow is actually controllable.
We will have to make a tool for SearXNG, since that is not implemented in agno.

I hope this is clear.
Feel free to comment or ask further questions, @pritipsingh.
