Add openreview helper to fetch papers from conferences (#879)
Co-authored-by: pre-commit-ci-lite[bot] <117423508+pre-commit-ci-lite[bot]@users.noreply.github.com>
Co-authored-by: James Braza <[email protected]>
3 people authored Feb 24, 2025
1 parent 81d7b15 commit 495e9e0
Showing 8 changed files with 402 additions and 101 deletions.
1 change: 1 addition & 0 deletions .mailmap
@@ -1,5 +1,6 @@
Andrew White <[email protected]> <[email protected]>
Anush008 <[email protected]> Anush <[email protected]>
Dmitrii Magas <[email protected]> eamag
Geemi Wellawatte <[email protected]> <[email protected]>
Geemi Wellawatte <[email protected]> <[email protected]>
Harry Vu <[email protected]> <[email protected]>
4 changes: 0 additions & 4 deletions CONTRIBUTING.md
@@ -73,8 +73,4 @@ from responses to ensure sensitive information is excluded from the cassettes.
Please ensure cassettes are less than 1 MB
to keep tests loading quickly.

## Additional resources

For more information on contributing, please refer to the [CONTRIBUTING.md](CONTRIBUTING.md) file in the repository.

Happy coding!
96 changes: 1 addition & 95 deletions README.md
Expand Up @@ -39,8 +39,6 @@ question answering, summarization, and contradiction detection.
- [Using Clients Directly](#using-clients-directly)
- [Settings Cheatsheet](#settings-cheatsheet)
- [Where do I get papers?](#where-do-i-get-papers)
- [Zotero](#zotero)
- [Paper Scraper](#paper-scraper)
- [Callbacks](#callbacks)
- [Caching Embeddings](#caching-embeddings)
- [Customizing Prompts](#customizing-prompts)
@@ -836,99 +834,7 @@ will return much faster than the first query and we'll be certain the authors match

Well that's a really good question! It's probably best to just download PDFs of papers you think will help answer your question and start from there.

### Zotero

_It's been a while since we've tested this, so let us know if you run into issues!_

If you use [Zotero](https://www.zotero.org/) to organize your personal bibliography,
you can use the `paperqa.contrib.ZoteroDB` to query papers from your library,
which relies on [pyzotero](https://github.com/urschrei/pyzotero).

Install `pyzotero` via the `zotero` extra for this feature:

```bash
pip install paper-qa[zotero]
```

First, note that PaperQA2 parses the PDFs of papers to store in the database,
so all relevant papers should have PDFs stored in your Zotero library.
You can get Zotero to automatically do this by highlighting the references
you wish to retrieve, right clicking, and selecting _"Find Available PDFs"_.
You can also manually drag-and-drop PDFs onto each reference.

To download papers, you need to get an API key for your account.

1. Get your library ID, and set it as the environment variable `ZOTERO_USER_ID`.
- For personal libraries, this ID is given [here](https://www.zotero.org/settings/keys) at the part "_Your userID for use in API calls is XXXXXX_".
- For group libraries, go to your group page `https://www.zotero.org/groups/groupname`, and hover over the settings link. The ID is the integer after /groups/. (_h/t pyzotero!_)
2. Create a new API key [here](https://www.zotero.org/settings/keys/new) and set it as the environment variable `ZOTERO_API_KEY`.
- The key will need read access to the library.

With this, we can download papers from our library and add them to PaperQA2:

```python
from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user") # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)
```

which will download the first 20 papers in your Zotero database and add
them to the `Docs` object.

We can also do specific queries of our Zotero library and iterate over the results:

```python
for item in zotero.iterate(
    q="large language models",
    qmode="everything",
    sort="date",
    direction="desc",
    limit=100,
):
    print("Adding", item.title)
    docs.add(item.pdf, docname=item.key)
```

You can read more about the search syntax by typing `zotero.iterate?` in IPython.

### Paper Scraper

If you want to search for papers outside of your own collection, I've found an unrelated project called [`paper-scraper`](https://github.com/blackadad/paper-scraper) that looks
like it might help. But beware: this project appears to use some scraping tools that may violate publishers' rights or fall into a legal gray area.

First, install `paper-scraper`:

```bash
pip install git+https://github.com/blackadad/paper-scraper.git
```

Then run with it:

```python
import paperscraper
from paperqa import Docs

keyword_search = "bispecific antibody manufacture"
papers = paperscraper.search_papers(keyword_search)
docs = Docs()
for path, data in papers.items():
    try:
        docs.add(path)
    except ValueError as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print("Could not read", path, e)
session = docs.query(
    "What manufacturing challenges are unique to bispecific antibodies?"
)
print(session)
```
See the detailed docs [about Zotero, OpenReview, and parsing](docs/tutorials/where_do_I_get_papers.md).

## Callbacks

Expand Down
122 changes: 122 additions & 0 deletions docs/tutorials/where_do_I_get_papers.md
@@ -0,0 +1,122 @@
# Where to get papers

## OpenReview

You can use papers from [https://openreview.net/](https://openreview.net/) as your database!
Here's a helper that fetches the list of all papers from a selected conference (e.g. ICLR, ICML, NeurIPS), uses an LLM to pick the papers relevant to your question, and downloads those papers to a local directory that paper-qa can use in the next step. Install `openreview-py` with

```bash
pip install paper-qa[openreview]
```

and get your username and password from the website. You can put them in a `.env` file as the `OPENREVIEW_USERNAME` and `OPENREVIEW_PASSWORD` variables, or pass them directly in the code.
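
As a quick sketch of the `.env` approach (the variable names come from the text above; the values here are placeholders, not real credentials):

```shell
# Write placeholder OpenReview credentials to a .env file
# (replace the values with your real username and password)
cat > .env <<'EOF'
OPENREVIEW_USERNAME=you@example.com
OPENREVIEW_PASSWORD=change-me
EOF

# Confirm both variables are present
grep -c '^OPENREVIEW_' .env
```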

```python
from paperqa import Settings
from paperqa.contrib.openreview_paper_helper import OpenReviewPaperHelper

# These settings require a Gemini API key, which you can get from https://aistudio.google.com/
# e.g.: import os; os.environ["GEMINI_API_KEY"] = "<your key>"
# The 1M-token context window helps with suggesting papers.
# These settings are not required, but they are useful for an initial setup.
settings = Settings.from_name("openreview")
helper = OpenReviewPaperHelper(settings, venue_id="ICLR.cc/2025/Conference")
# If you don't know the venue_id, you can find it via
# helper.get_venues()

# Now we can ask the LLM to select relevant papers and download their PDFs
question = "What is the progress on brain activity research?"

submissions = helper.fetch_relevant_papers(question)

# aadd_docs saves tokens by using OpenReview metadata for citations
# (top-level await works in notebooks; in scripts, wrap it with asyncio.run)
docs = await helper.aadd_docs(submissions)

# Now you can continue asking questions as in the main tutorial (../../README.md)
session = docs.query(question, settings=settings)
print(session.answer)
```

## Zotero

_It's been a while since we've tested this, so let us know if you run into issues!_

If you use [Zotero](https://www.zotero.org/) to organize your personal bibliography,
you can use the `paperqa.contrib.ZoteroDB` to query papers from your library,
which relies on [pyzotero](https://github.com/urschrei/pyzotero).

Install `pyzotero` via the `zotero` extra for this feature:

```bash
pip install paper-qa[zotero]
```

First, note that PaperQA2 parses the PDFs of papers to store in the database,
so all relevant papers should have PDFs stored in your Zotero library.
You can get Zotero to automatically do this by highlighting the references
you wish to retrieve, right clicking, and selecting _"Find Available PDFs"_.
You can also manually drag-and-drop PDFs onto each reference.

To download papers, you need to get an API key for your account.

1. Get your library ID, and set it as the environment variable `ZOTERO_USER_ID`.
- For personal libraries, this ID is given [here](https://www.zotero.org/settings/keys) at the part "_Your userID for use in API calls is XXXXXX_".
- For group libraries, go to your group page `https://www.zotero.org/groups/groupname`, and hover over the settings link. The ID is the integer after /groups/. (_h/t pyzotero!_)
2. Create a new API key [here](https://www.zotero.org/settings/keys/new) and set it as the environment variable `ZOTERO_API_KEY`.
- The key will need read access to the library.
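
For example, in a shell (the values below are placeholders, not real credentials):

```shell
# Placeholder Zotero credentials; substitute your own library ID and API key
export ZOTERO_USER_ID=1234567
export ZOTERO_API_KEY=replace-with-your-key

# Sanity-check that the variables are set
echo "$ZOTERO_USER_ID"
```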

With this, we can download papers from our library and add them to PaperQA2:

```python
from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user") # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)
```

which will download the first 20 papers in your Zotero database and add
them to the `Docs` object.

We can also do specific queries of our Zotero library and iterate over the results:

```python
for item in zotero.iterate(
    q="large language models",
    qmode="everything",
    sort="date",
    direction="desc",
    limit=100,
):
    print("Adding", item.title)
    docs.add(item.pdf, docname=item.key)
```

You can read more about the search syntax by typing `zotero.iterate?` in IPython.

## Paper Scraper

If you want to search for papers outside of your own collection, I've found an unrelated project called [paper-scraper](https://github.com/blackadad/paper-scraper) that looks
like it might help. But beware: this project appears to use some scraping tools that may violate publishers' rights or fall into a legal gray area.

```python
import paperscraper
from paperqa import Docs

keyword_search = "bispecific antibody manufacture"
papers = paperscraper.search_papers(keyword_search)
docs = Docs()
for path, data in papers.items():
    try:
        docs.add(path)
    except ValueError as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print("Could not read", path, e)
session = docs.query(
    "What manufacturing challenges are unique to bispecific antibodies?"
)
print(session)
```
36 changes: 36 additions & 0 deletions paperqa/configs/openreview.json
@@ -0,0 +1,36 @@
{
  "llm": "gemini/gemini-2.0-flash-exp",
  "llm_config": {
    "model_name": "gemini/gemini-2.0-flash-exp",
    "litellm_params": {
      "model": "gemini/gemini-2.0-flash-exp",
      "api_key": null
    }
  },
  "summary_llm": "gemini/gemini-2.0-flash-exp",
  "summary_llm_config": {
    "model_name": "gemini/gemini-2.0-flash-exp",
    "litellm_params": {
      "model": "gemini/gemini-2.0-flash-exp",
      "api_key": null
    }
  },
  "embedding": "ollama/granite3-dense",
  "paper_directory": "my_papers",
  "verbosity": 3,
  "agent": {
    "agent_llm": "gemini/gemini-2.0-flash-exp",
    "agent_llm_config": {
      "model_name": "gemini/gemini-2.0-flash-exp",
      "litellm_params": {
        "model": "gemini/gemini-2.0-flash-exp",
        "api_key": null
      }
    },
    "return_paper_metadata": false
  },
  "parsing": {
    "chunk_size": 3000000,
    "use_doc_details": false
  }
}
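
The `"api_key": null` entries mean the key is resolved at runtime rather than stored in the config (litellm can read `GEMINI_API_KEY` from the environment). As a minimal stdlib-only sketch of that pattern (the inline JSON is an illustrative subset of the config above, and the fallback logic is hypothetical, not paper-qa's actual loading code):

```python
import json
import os

# Illustrative subset of the openreview.json config above
raw = """
{
  "llm": "gemini/gemini-2.0-flash-exp",
  "llm_config": {"litellm_params": {"model": "gemini/gemini-2.0-flash-exp", "api_key": null}},
  "parsing": {"chunk_size": 3000000, "use_doc_details": false}
}
"""
config = json.loads(raw)

# A null api_key is typically filled from the environment at runtime
params = config["llm_config"]["litellm_params"]
if params["api_key"] is None:
    params["api_key"] = os.environ.get("GEMINI_API_KEY")

print(config["parsing"]["chunk_size"])  # → 3000000
```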
