Feature: Research & Learning AI Agent with Educational Source Integrations #102

@Deodat-Lawson


Summary
Build a Research & Learning AI Agent that retrieves, ranks, and synthesizes answers from academic APIs (Semantic Scholar, arXiv, CrossRef, PubMed, OpenAlex) and from trusted educational sources (Khan Academy, Wikipedia/Wikimedia, OpenStax, MIT OCW, PhET, NASA, etc.). The agent will return citation-grounded answers with links/DOIs and optional “learning mode” explanations, quizzes, and follow-ups.


Motivation
Users need authoritative yet accessible explanations. Academic papers provide rigor; educational sources provide pedagogy. Combining both yields reliable answers that are also teachable, making PDR AI v2 more valuable for onboarding, training, and education in enterprises and classrooms.


Scope of Sources (Phase 1 → Phase 2)

Academic (Phase 1):

  • Semantic Scholar, arXiv, CrossRef, PubMed, OpenAlex

Educational (Phase 1):

  • Wikipedia REST API (summary + page HTML), Wikimedia Commons (media)
  • OpenStax (open textbooks; OER, API/endpoints where available)
  • NASA (factsheets, articles, imagery: open data)
  • PhET Interactive Simulations (concept explanations, educator pages where allowed)

Educational (Phase 2 / backlog):

  • Khan Academy content: use official public endpoints only where they exist and are permitted (e.g., topic trees, exercise metadata), and respect the TOS. For videos, rely on YouTube Data API metadata and captions for the official Khan Academy channel.
  • MIT OpenCourseWare (open license; no official API—HTML fetch + cache with license compliance)
  • CK-12, OpenLearn, Saylor, OpenIntro (OER—evaluate per-site API/TOS)
  • Stanford Encyclopedia of Philosophy (open access; no API—HTML fetch with attribution)

Note: Only integrate sources with explicitly allowed API use or OER licenses. Add per-source adapters with license notes and rate-limit guards.


Core Capabilities

  1. Multi-source Retrieval & Ranking
    • Generate queries → call source adapters → normalize results → score by authority, recency, pedagogical clarity, and topical match.

  2. Grounded Answers with Citations
    • Inline numbered citations [1] linking to DOI/URL; add a References section with title, year, authors, and license/attribution when required (e.g., Wikipedia/CC BY-SA).

  3. Learning Mode
    • Simplified explanation, key takeaways, quick quiz (2–3 questions), and suggested next steps/readings from OER sources.

  4. Research Mode
    • Concise synthesis with limitations/uncertainties and direct links to papers/sections.

  5. Caching & Dedup
    • Cache normalized records (Redis/Supabase) by query hash + source; deduplicate by DOI/URL.
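
The caching and dedup rule in item 5 could be sketched roughly as follows. This is only an illustration: the `SourceRecord` shape, the `research:` key prefix, and the helper names are assumptions, and the small djb2 string hash stands in for whatever hash the real cache layer would use (SHA-256 would be a more typical choice).

```typescript
// Hypothetical record shape, trimmed to the fields this sketch needs.
interface SourceRecord {
  title: string;
  url: string;
  doi?: string;
  source: string;
}

// djb2 string hash, used here only as a dependency-free stand-in.
// A production cache key would more likely use SHA-256.
function hashQuery(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i++) {
    hash = ((hash << 5) + hash + text.charCodeAt(i)) | 0; // keep it 32-bit
  }
  return (hash >>> 0).toString(16);
}

// Cache key = hash of the normalized query, namespaced by source.
function cacheKey(query: string, source: string): string {
  const normalized = query.trim().toLowerCase().replace(/\s+/g, " ");
  return "research:" + source + ":" + hashQuery(normalized);
}

// Identity key: prefer the DOI (lowercased, resolver prefix stripped),
// otherwise the URL without its fragment or trailing slash.
function identityKey(record: SourceRecord): string {
  if (record.doi) {
    const doi = record.doi
      .trim()
      .toLowerCase()
      .replace(/^https?:\/\/(dx\.)?doi\.org\//, "");
    return "doi:" + doi;
  }
  const url = record.url.trim().replace(/#.*$/, "").replace(/\/+$/, "");
  return "url:" + url;
}

// Keep the first record seen for each identity key, preserving order.
function dedupe(records: SourceRecord[]): SourceRecord[] {
  const seen: Record<string, boolean> = {};
  const unique: SourceRecord[] = [];
  for (const record of records) {
    const key = identityKey(record);
    if (!seen[key]) {
      seen[key] = true;
      unique.push(record);
    }
  }
  return unique;
}
```

Keeping the first occurrence means that if adapters are called in priority order, the higher-priority source's copy of a paper survives deduplication.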

Proposed Implementation

  • Agent: research_learning_agent within LangGraph (planner → retriever → ranker → synthesizer → (optional) quizzer).

  • Adapters (/server/research/adapters/*):

    • semantic_scholar.ts, arxiv.ts, crossref.ts, pubmed.ts, openalex.ts
    • wikipedia.ts (REST + page summary), wikimedia.ts (media), openstax.ts, nasa.ts, phet.ts
    • (Phase 2) khan_academy.ts, mit_ocw.ts, ck12.ts, etc.
  • Normalizer: Common schema { title, authors, year, url, doi?, snippet, license?, source, weight }.

  • Ranker: Heuristic + embedding re-rank (optionally via Qdrant Cloud if enabled).

  • Synthesizer: Model composes answer; enforces citation injection and license-aware attribution.

  • UI/API Toggle:

    { "agent": "research_learning", "mode": "learning", "max_sources": 8 }
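
As a rough illustration of the Normalizer schema and the heuristic part of the Ranker, the sketch below types the common record and combines three of the signals named earlier (authority, recency, topical match). The weights, the 20-year recency window, and the reading of `weight` as a per-source authority prior in [0, 1] are all assumptions made for the example. Pedagogical clarity and the embedding re-rank are left out because they need signals the schema does not carry.

```typescript
// Common schema from the Normalizer bullet.
interface NormalizedRecord {
  title: string;
  authors: string[];
  year: number;
  url: string;
  doi?: string;
  snippet: string;
  license?: string;
  source: string;
  weight: number; // assumed: per-source authority prior in [0, 1]
}

type Mode = "research" | "learning";

// Fraction of query terms (longer than 2 chars) found in title or snippet.
function topicalMatch(query: string, record: NormalizedRecord): number {
  const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 2);
  if (terms.length === 0) return 0;
  const haystack = (record.title + " " + record.snippet).toLowerCase();
  const hits = terms.filter((t) => haystack.indexOf(t) !== -1).length;
  return hits / terms.length;
}

// Linear decay: 1 for the current year, 0 at 20 or more years old (assumed window).
function recency(year: number, currentYear: number): number {
  const age = Math.max(0, currentYear - year);
  return Math.max(0, 1 - age / 20);
}

// Weighted sum; the per-mode weights are illustrative, not tuned.
function score(
  query: string,
  record: NormalizedRecord,
  mode: Mode,
  currentYear: number
): number {
  const w =
    mode === "research"
      ? { authority: 0.4, recency: 0.2, match: 0.4 }
      : { authority: 0.3, recency: 0.1, match: 0.6 };
  return (
    w.authority * record.weight +
    w.recency * recency(record.year, currentYear) +
    w.match * topicalMatch(query, record)
  );
}

// Sort a copy by descending score and keep at most maxSources records.
function rank(
  query: string,
  records: NormalizedRecord[],
  mode: Mode,
  maxSources: number,
  currentYear: number
): NormalizedRecord[] {
  return records
    .slice()
    .sort(
      (a, b) =>
        score(query, b, mode, currentYear) - score(query, a, mode, currentYear)
    )
    .slice(0, maxSources);
}
```

With the toggle shown above, a request would map to something like `rank(question, records, "learning", 8, new Date().getFullYear())`.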

API & Config (examples)

RESEARCH_AGENT_ENABLED=true
RESEARCH_AGENT_MODE_DEFAULT=learning

# Academic
SEMANTIC_SCHOLAR_API_KEY=...
OPENALEX_API_KEY=...
ARXIV_BASE_URL=https://export.arxiv.org/api/query

# Educational
WIKIPEDIA_BASE_URL=https://en.wikipedia.org/api/rest_v1
OPENSTAX_BASE_URL=https://openstax.org/api
NASA_API_KEY=...

# Optional: YouTube for official edu channels (e.g., Khan Academy)
YOUTUBE_API_KEY=...

# Vector re-ranking / grounding
VECTOR_DB=QDRANT_CLOUD
QDRANT_URL=https://<cluster>.cloud.qdrant.io
QDRANT_API_KEY=...
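
One way to read these variables into a typed config is sketched below. The `ResearchAgentConfig` shape and `loadConfig` helper are hypothetical, and the fallback URLs simply reuse the example values above. The environment is passed in as a plain object so the function is easy to test.

```typescript
type Mode = "research" | "learning";

// Hypothetical typed view over a subset of the variables listed above.
interface ResearchAgentConfig {
  enabled: boolean;
  defaultMode: Mode;
  arxivBaseUrl: string;
  wikipediaBaseUrl: string;
  semanticScholarApiKey?: string;
  nasaApiKey?: string;
}

type Env = { [key: string]: string | undefined };

// Accept only the two modes the issue defines; fail loudly otherwise.
function parseMode(value: string): Mode {
  if (value === "research" || value === "learning") return value;
  throw new Error("RESEARCH_AGENT_MODE_DEFAULT must be 'research' or 'learning'");
}

// Build the config from an env-like object, e.g. loadConfig(process.env).
function loadConfig(env: Env): ResearchAgentConfig {
  return {
    enabled: env["RESEARCH_AGENT_ENABLED"] === "true",
    defaultMode: parseMode(env["RESEARCH_AGENT_MODE_DEFAULT"] || "learning"),
    // Fallbacks reuse the example values from the config listing (assumption).
    arxivBaseUrl: env["ARXIV_BASE_URL"] || "https://export.arxiv.org/api/query",
    wikipediaBaseUrl:
      env["WIKIPEDIA_BASE_URL"] || "https://en.wikipedia.org/api/rest_v1",
    // Keys stay undefined when unset so adapters can decide how to degrade.
    semanticScholarApiKey: env["SEMANTIC_SCHOLAR_API_KEY"] || undefined,
    nasaApiKey: env["NASA_API_KEY"] || undefined,
  };
}
```

Validating the mode at startup surfaces a typo in the environment immediately rather than on the first request.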

Licensing & Compliance

  • Wikipedia/Wikimedia: Provide attribution; note CC BY-SA 4.0 / GFDL obligations for article text in References, and check the per-file license for Wikimedia Commons media.
  • OpenStax: Attribute per OER license (usually CC BY).
  • Khan Academy / MIT OCW / others: Only fetch content allowed by TOS/API; attribute per license; avoid scraping protected endpoints.
  • Create /docs/data-sources.md listing each source, license, and usage limits.
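
To make the attribution requirement concrete, a References entry could be rendered along these lines. This is a sketch only: the `Citable` shape, the output format, and the "et al." cutoff are assumptions, not a prescribed citation style.

```typescript
// Hypothetical subset of the normalized record needed for a reference line.
interface Citable {
  title: string;
  authors: string[];
  year: number;
  url: string;
  doi?: string;
  license?: string;
  source: string;
}

// Render one numbered reference; append the license only when one is known.
function formatReference(index: number, record: Citable): string {
  let authors: string;
  if (record.authors.length === 0) {
    authors = record.source; // fall back to the source name
  } else if (record.authors.length > 3) {
    authors = record.authors[0] + " et al.";
  } else {
    authors = record.authors.join(", ");
  }
  // Prefer a DOI link when available, otherwise the record URL.
  const link = record.doi ? "https://doi.org/" + record.doi : record.url;
  const license = record.license ? " (License: " + record.license + ")" : "";
  return (
    "[" + index + "] " + authors + " (" + record.year + "). " +
    record.title + ". " + link + license
  );
}

// Number references from 1, matching the inline [n] citations.
function formatReferences(records: Citable[]): string {
  return records.map((r, i) => formatReference(i + 1, r)).join("\n");
}
```

For a record with no authors and a known license, this produces a line such as `[1] OpenStax (2022). Sample Chapter. https://example.org/chapter (License: CC BY 4.0)`, where the values are placeholders.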

Rate Limits & Reliability

  • Per-adapter retry with backoff; respect Retry-After.
  • Fallback order when a source fails: academic sources → educational sources → cached results.
  • Per-adapter circuit breaker so a failing source does not cause UI latency spikes.
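
The retry rule in the first bullet might look something like the sketch below. `AdapterResponse`, `withRetry`, and the default delays are hypothetical names and values chosen for illustration. Jitter and the circuit breaker are omitted to keep the example short.

```typescript
// Minimal response shape an adapter call is assumed to return.
interface AdapterResponse {
  status: number;
  retryAfter?: string | null; // raw Retry-After header value, if any
}

// Retry-After is either delay-seconds or an HTTP-date; return milliseconds.
function parseRetryAfter(
  header: string | null | undefined,
  nowMs: number
): number | null {
  if (!header) return null;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return parseInt(trimmed, 10) * 1000;
  const when = Date.parse(trimmed);
  return isNaN(when) ? null : Math.max(0, when - nowMs);
}

// Exponential backoff: base * 2^attempt, capped.
function backoffDelay(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

function defaultSleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Retry on 429 and 5xx, preferring the server's Retry-After hint
// over the computed backoff. Returns the last response once attempts run out.
async function withRetry(
  call: () => Promise<AdapterResponse>,
  maxAttempts: number = 3,
  sleep: (ms: number) => Promise<void> = defaultSleep,
  baseMs: number = 250
): Promise<AdapterResponse> {
  for (let attempt = 0; ; attempt++) {
    const res = await call();
    const retryable = res.status === 429 || res.status >= 500;
    if (!retryable || attempt + 1 >= maxAttempts) return res;
    const hinted = parseRetryAfter(res.retryAfter, Date.now());
    await sleep(hinted !== null ? hinted : backoffDelay(attempt, baseMs, 8000));
  }
}
```

Passing `sleep` as a parameter lets tests substitute a no-op and assert on the requested delays without waiting.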

Acceptance Criteria

  • End-to-end: question → multi-source retrieval → ranked synthesis → grounded citations.
  • Two modes supported: research and learning (pedagogical tone + quiz).
  • At least 5 sources integrated (≥3 academic, ≥2 educational) in Phase 1.
  • Citations include DOI/URL and, where applicable, license/attribution.
  • Caching implemented; duplicate suppression by DOI/URL.
  • Configuration & docs added: /docs/agents/research_learning_agent.md and /docs/data-sources.md.
  • Unit tests for adapters; integration tests for ranking + synthesis pipeline.

Nice-to-Have (Backlog)

  • Per-source trust scores surfaced in UI (hover for source details).
  • Section-level grounding (quote + link to exact section/anchor).
  • Learning paths built from OpenStax chapters / Khan topic trees.
  • Instructor dashboard with anonymized analytics.
