Add openreview helper to fetch papers from conferences (#879)
Co-authored-by: pre-commit-ci-lite[bot] <117423508+pre-commit-ci-lite[bot]@users.noreply.github.com>
Co-authored-by: James Braza <[email protected]>
3 people authored Feb 24, 2025
1 parent 81d7b15 commit 495e9e0
Showing 8 changed files with 402 additions and 101 deletions.
1 change: 1 addition & 0 deletions .mailmap
@@ -1,5 +1,6 @@
Andrew White <[email protected]> <[email protected]>
Anush008 <[email protected]> Anush <[email protected]>
Dmitrii Magas <[email protected]> eamag
Geemi Wellawatte <[email protected]> <[email protected]>
Geemi Wellawatte <[email protected]> <[email protected]>
Harry Vu <[email protected]> <[email protected]>
4 changes: 0 additions & 4 deletions CONTRIBUTING.md
@@ -73,8 +73,4 @@ from responses to ensure sensitive information is excluded from the cassettes.
Please ensure cassettes are less than 1 MB
to keep tests loading quickly.

## Additional resources

For more information on contributing, please refer to the [CONTRIBUTING.md](CONTRIBUTING.md) file in the repository.

Happy coding!
96 changes: 1 addition & 95 deletions README.md
Expand Up @@ -39,8 +39,6 @@ question answering, summarization, and contradiction detection.
- [Using Clients Directly](#using-clients-directly)
- [Settings Cheatsheet](#settings-cheatsheet)
- [Where do I get papers?](#where-do-i-get-papers)
- [Zotero](#zotero)
- [Paper Scraper](#paper-scraper)
- [Callbacks](#callbacks)
- [Caching Embeddings](#caching-embeddings)
- [Customizing Prompts](#customizing-prompts)
@@ -836,99 +834,7 @@ will return much faster than the first query and we'll be certain the authors match

Well that's a really good question! It's probably best to just download PDFs of papers you think will help answer your question and start from there.

### Zotero

_It's been a while since we've tested this, so let us know if you run into issues!_

If you use [Zotero](https://www.zotero.org/) to organize your personal bibliography,
you can use the `paperqa.contrib.ZoteroDB` to query papers from your library,
which relies on [pyzotero](https://github.com/urschrei/pyzotero).

Install `pyzotero` via the `zotero` extra for this feature:

```bash
pip install paper-qa[zotero]
```

First, note that PaperQA2 parses the PDFs of papers to store in the database,
so all relevant papers should have PDFs stored in your Zotero library.
You can get Zotero to automatically do this by highlighting the references
you wish to retrieve, right clicking, and selecting _"Find Available PDFs"_.
You can also manually drag-and-drop PDFs onto each reference.

To download papers, you need to get an API key for your account.

1. Get your library ID, and set it as the environment variable `ZOTERO_USER_ID`.
- For personal libraries, this ID is given [here](https://www.zotero.org/settings/keys) at the part "_Your userID for use in API calls is XXXXXX_".
- For group libraries, go to your group page `https://www.zotero.org/groups/groupname`, and hover over the settings link. The ID is the integer after /groups/. (_h/t pyzotero!_)
2. Create a new API key [here](https://www.zotero.org/settings/keys/new) and set it as the environment variable `ZOTERO_API_KEY`.
- The key will need read access to the library.

With this, we can download papers from our library and add them to PaperQA2:

```python
from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user") # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)
```

which will download the first 20 papers in your Zotero database and add
them to the `Docs` object.

We can also do specific queries of our Zotero library and iterate over the results:

```python
for item in zotero.iterate(
    q="large language models",
    qmode="everything",
    sort="date",
    direction="desc",
    limit=100,
):
    print("Adding", item.title)
    docs.add(item.pdf, docname=item.key)
```

You can read more about the search syntax by typing `zotero.iterate?` in IPython.

### Paper Scraper

If you want to search for papers outside of your own collection, I've found an unrelated project called [`paper-scraper`](https://github.com/blackadad/paper-scraper) that looks
like it might help. But beware: this project appears to use some scraping tools that may violate publishers' rights or fall into a legal gray area.

First, install `paper-scraper`:

```bash
pip install git+https://github.com/blackadad/paper-scraper.git
```

Then run with it:

```python
import paperscraper
from paperqa import Docs

keyword_search = "bispecific antibody manufacture"
papers = paperscraper.search_papers(keyword_search)
docs = Docs()
for path, data in papers.items():
    try:
        docs.add(path)
    except ValueError as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print("Could not read", path, e)
session = docs.query(
    "What manufacturing challenges are unique to bispecific antibodies?"
)
print(session)
```
See the detailed docs [about Zotero, OpenReview, and parsing](docs/tutorials/where_do_I_get_papers.md).

## Callbacks

Expand Down
122 changes: 122 additions & 0 deletions docs/tutorials/where_do_I_get_papers.md
@@ -0,0 +1,122 @@
# Where to get papers

## OpenReview

You can use papers from [https://openreview.net/](https://openreview.net/) as your database!
Here's a helper that fetches the list of all papers from a selected conference (e.g. ICLR, ICML, NeurIPS), uses an LLM to pick the papers relevant to your question, and downloads those papers to a local directory that paper-qa can use in the next step. Install `openreview-py` with

```bash
pip install paper-qa[openreview]
```

and get your username and password from the website. You can put them in a `.env` file as the `OPENREVIEW_USERNAME` and `OPENREVIEW_PASSWORD` variables, or pass them directly in the code.
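
As a quick sketch of the `.env` approach (the variable names come from the text above; the values here are placeholders, not real credentials):

```shell
# Write placeholder OpenReview credentials to a .env file
# (replace the values with your real username and password)
cat > .env <<'EOF'
OPENREVIEW_USERNAME=you@example.com
OPENREVIEW_PASSWORD=change-me
EOF

# Confirm both variables are present
grep -c '^OPENREVIEW_' .env
```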

```python
from paperqa import Settings
from paperqa.contrib.openreview_paper_helper import OpenReviewPaperHelper

# These settings require a Gemini API key, which you can get from https://aistudio.google.com/
# e.g.: import os; os.environ["GEMINI_API_KEY"] = "<your key>"
# The 1M-token context window helps with suggesting papers.
# These settings are not required, but they are useful for an initial setup.
settings = Settings.from_name("openreview")
helper = OpenReviewPaperHelper(settings, venue_id="ICLR.cc/2025/Conference")
# If you don't know the venue_id, you can find it via
# helper.get_venues()

# Now we can ask the LLM to select relevant papers and download their PDFs
question = "What is the progress on brain activity research?"

submissions = helper.fetch_relevant_papers(question)

# aadd_docs saves tokens by using OpenReview metadata for citations
# (top-level await works in notebooks; in scripts, wrap it with asyncio.run)
docs = await helper.aadd_docs(submissions)

# Now you can continue asking questions as in the main tutorial (../../README.md)
session = docs.query(question, settings=settings)
print(session.answer)
```

## Zotero

_It's been a while since we've tested this, so let us know if you run into issues!_

If you use [Zotero](https://www.zotero.org/) to organize your personal bibliography,
you can use the `paperqa.contrib.ZoteroDB` to query papers from your library,
which relies on [pyzotero](https://github.com/urschrei/pyzotero).

Install `pyzotero` via the `zotero` extra for this feature:

```bash
pip install paper-qa[zotero]
```

First, note that PaperQA2 parses the PDFs of papers to store in the database,
so all relevant papers should have PDFs stored in your Zotero library.
You can get Zotero to automatically do this by highlighting the references
you wish to retrieve, right clicking, and selecting _"Find Available PDFs"_.
You can also manually drag-and-drop PDFs onto each reference.

To download papers, you need to get an API key for your account.

1. Get your library ID, and set it as the environment variable `ZOTERO_USER_ID`.
- For personal libraries, this ID is given [here](https://www.zotero.org/settings/keys) at the part "_Your userID for use in API calls is XXXXXX_".
- For group libraries, go to your group page `https://www.zotero.org/groups/groupname`, and hover over the settings link. The ID is the integer after /groups/. (_h/t pyzotero!_)
2. Create a new API key [here](https://www.zotero.org/settings/keys/new) and set it as the environment variable `ZOTERO_API_KEY`.
- The key will need read access to the library.
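
For example, in a shell (the values below are placeholders, not real credentials):

```shell
# Placeholder Zotero credentials; substitute your own library ID and API key
export ZOTERO_USER_ID=1234567
export ZOTERO_API_KEY=replace-with-your-key

# Sanity-check that the variables are set
echo "$ZOTERO_USER_ID"
```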

With this, we can download papers from our library and add them to PaperQA2:

```python
from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user") # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)
```

which will download the first 20 papers in your Zotero database and add
them to the `Docs` object.

We can also do specific queries of our Zotero library and iterate over the results:

```python
for item in zotero.iterate(
    q="large language models",
    qmode="everything",
    sort="date",
    direction="desc",
    limit=100,
):
    print("Adding", item.title)
    docs.add(item.pdf, docname=item.key)
```

You can read more about the search syntax by typing `zotero.iterate?` in IPython.

## Paper Scraper

If you want to search for papers outside of your own collection, I've found an unrelated project called [paper-scraper](https://github.com/blackadad/paper-scraper) that looks
like it might help. But beware: this project appears to use some scraping tools that may violate publishers' rights or fall into a legal gray area.

```python
import paperscraper
from paperqa import Docs

keyword_search = "bispecific antibody manufacture"
papers = paperscraper.search_papers(keyword_search)
docs = Docs()
for path, data in papers.items():
    try:
        docs.add(path)
    except ValueError as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print("Could not read", path, e)
session = docs.query(
    "What manufacturing challenges are unique to bispecific antibodies?"
)
print(session)
```
36 changes: 36 additions & 0 deletions paperqa/configs/openreview.json
@@ -0,0 +1,36 @@
{
  "llm": "gemini/gemini-2.0-flash-exp",
  "llm_config": {
    "model_name": "gemini/gemini-2.0-flash-exp",
    "litellm_params": {
      "model": "gemini/gemini-2.0-flash-exp",
      "api_key": null
    }
  },
  "summary_llm": "gemini/gemini-2.0-flash-exp",
  "summary_llm_config": {
    "model_name": "gemini/gemini-2.0-flash-exp",
    "litellm_params": {
      "model": "gemini/gemini-2.0-flash-exp",
      "api_key": null
    }
  },
  "embedding": "ollama/granite3-dense",
  "paper_directory": "my_papers",
  "verbosity": 3,
  "agent": {
    "agent_llm": "gemini/gemini-2.0-flash-exp",
    "agent_llm_config": {
      "model_name": "gemini/gemini-2.0-flash-exp",
      "litellm_params": {
        "model": "gemini/gemini-2.0-flash-exp",
        "api_key": null
      }
    },
    "return_paper_metadata": false
  },
  "parsing": {
    "chunk_size": 3000000,
    "use_doc_details": false
  }
}
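
The `"api_key": null` entries mean the key is resolved at runtime rather than stored in the config (litellm can read `GEMINI_API_KEY` from the environment). As a minimal stdlib-only sketch of that pattern (the inline JSON is an illustrative subset of the config above, and the fallback logic is hypothetical, not paper-qa's actual loading code):

```python
import json
import os

# Illustrative subset of the openreview.json config above
raw = """
{
  "llm": "gemini/gemini-2.0-flash-exp",
  "llm_config": {"litellm_params": {"model": "gemini/gemini-2.0-flash-exp", "api_key": null}},
  "parsing": {"chunk_size": 3000000, "use_doc_details": false}
}
"""
config = json.loads(raw)

# A null api_key is typically filled from the environment at runtime
params = config["llm_config"]["litellm_params"]
if params["api_key"] is None:
    params["api_key"] = os.environ.get("GEMINI_API_KEY")

print(config["parsing"]["chunk_size"])  # → 3000000
```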
