This project is a community effort to build a Neo4j Knowledge Graph (KG) that links heterogeneous data about COVID-19 to help fight this outbreak! It serves as a sandbox and incubator project, and the best ideas will be incorporated into the Covid-19-Net KG.
Join "GraphHackers, Let’s Unite to Help Save the World — Graphs4Good 2020".
What kind of data can you contribute? Here are some of our ideas.
- File an issue to discuss your idea so we can coordinate efforts
- Help with specific issues
- Suggest publicly accessible data sets
- Suggest graph queries to gain new insights from the KG
- Add Jupyter Notebooks with data analyses
- Add data and map visualizations
- Help improve the data model
- Report bugs or issues
The left side of the schema shows the geolocation hierarchy from the world to the city level (> 1000 citizens). Geolocations are linked by COVID-19 case counts to information about host organisms, virus strains, genomes, genes, and proteins, and publications that mention the virus strains.
View of Neo4j Browser showing the result of a query about publications on the origin of the SARS-CoV-2 virus.
You can browse the KG with the Neo4j Browser here:
- Launch Browser
- Enter username: reader, password: demo
- Click on the database icon on the top left, then click on any node label to start exploring the KG
- Run a Cypher query
MATCH (s:Strain)-[:FOUND_IN]->(l:Location{name: 'Los Angeles'}) RETURN s, l
This subgraph shows two viral strains (green) of the SARS-CoV-2 virus carried by a human host in Los Angeles (organisms in yellow). The strains have several variants (e.g., mutations) (red) in common. Details of the highlighted variant are shown at the bottom. This variant is a missense mutation: the base "G" (Guanine) found in the Wuhan-HU-1 reference genome was mutated to a "C" (Cytosine) at position 28007 in this strain (ORF8:c.184Gtg>Ctg), changing the encoded ORF8 protein (QHD43422.1) from a "V" (Valine) to an "L" (Leucine) amino acid at position 62 (QHD43422.1:p.62V>L). Two publications, PMC7166309 and PMC7106203 (blue), mention this strain.
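The two annotations in this example are consistent with each other, which can be checked by hand: coding position 184 is the first base of codon 62, and replacing the first base of GTG with C gives CTG. The following is a minimal Python sketch of that arithmetic. The function names and the two-entry codon table are illustrative assumptions and are not part of this project's code; the sketch also assumes that the uppercase letter in each codon marks the mutated base.

```python
import re

# Only the two codons needed for this example; a full table has 64 entries.
CODON_TO_AMINO_ACID = {"GTG": "V", "CTG": "L"}


def parse_codon_change(annotation):
    """Parse a codon-level annotation such as 'ORF8:c.184Gtg>Ctg'.

    Assumption: the uppercase letter in each codon marks the mutated base.
    """
    match = re.fullmatch(r"(\w+):c\.(\d+)([A-Za-z]{3})>([A-Za-z]{3})", annotation)
    if match is None:
        raise ValueError(f"unrecognized annotation: {annotation}")
    gene, position, ref_codon, alt_codon = match.groups()
    return gene, int(position), ref_codon.upper(), alt_codon.upper()


def codon_number(coding_position):
    """Map a 1-based coding-sequence position to its 1-based codon number."""
    return (coding_position - 1) // 3 + 1


def describe(annotation):
    """Derive the protein-level change implied by a codon-level annotation."""
    gene, position, ref_codon, alt_codon = parse_codon_change(annotation)
    ref_aa = CODON_TO_AMINO_ACID[ref_codon]
    alt_aa = CODON_TO_AMINO_ACID[alt_codon]
    kind = "missense" if ref_aa != alt_aa else "synonymous"
    return f"{gene}:p.{codon_number(position)}{ref_aa}>{alt_aa} ({kind})"


print(describe("ORF8:c.184Gtg>Ctg"))  # ORF8:p.62V>L (missense)
```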
MATCH (o:Outbreak{id: "COVID-19"})<-[:RELATED_TO]-(c:Cases{date: date("2020-05-04")})-[:REPORTED_IN]->(a:Admin2)-[:IN]->(a1:Admin1)
RETURN a1.name as state, sum(c.cummulativeConfirmed) as confirmed, sum(c.cummulativeDeaths) as deaths
ORDER BY deaths;
Note: some cases in the COVID-19 Data Repository by Johns Hopkins University cannot be mapped to a county or state location (e.g., cruise ships, correctional facilities, missing location data). Therefore, the results of this query underreport the actual number of cases.
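The same query can also be run from Python instead of the Neo4j Browser. The sketch below assumes the official `neo4j` Python driver is installed and that you know the Bolt address of the KG server; the URI shown in the usage comment is a placeholder, and the helper names `fetch_state_totals` and `top_states` are hypothetical. The date is passed as a query parameter rather than hard-coded.

```python
STATE_TOTALS_QUERY = """
MATCH (o:Outbreak{id: "COVID-19"})<-[:RELATED_TO]-(c:Cases{date: date($day)})-[:REPORTED_IN]->(a:Admin2)-[:IN]->(a1:Admin1)
RETURN a1.name AS state,
       sum(c.cummulativeConfirmed) AS confirmed,
       sum(c.cummulativeDeaths) AS deaths
ORDER BY deaths
"""


def fetch_state_totals(uri, user, password, day):
    """Run the state-level query and return one dict per state."""
    # Imported here so this module loads even if the driver is not installed.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(STATE_TOTALS_QUERY, day=day)
            return [record.data() for record in result]
    finally:
        driver.close()


def top_states(rows, n=5):
    """Return the n rows with the most deaths, highest first."""
    return sorted(rows, key=lambda row: row["deaths"], reverse=True)[:n]


# Usage (placeholder URI, replace with the address of the KG server):
# rows = fetch_state_totals("bolt://localhost:7687", "reader", "demo", "2020-05-04")
# for row in top_states(rows):
#     print(row["state"], row["confirmed"], row["deaths"])
```

Because unmapped cases are missing from the `Admin2` path, totals computed this way are lower bounds, as noted above.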
[more documentation will come soon]
This project uses Jupyter Notebooks to download and curate the latest data files, create a Neo4j graph database, and run Cypher queries on the graph database. The results of the queries can then be used in the Jupyter Notebooks for further analysis and visualizations.
You can run the Jupyter Notebooks in this repo in your web browser:
Once Jupyter Lab launches, navigate to the notebooks folder and run the following notebooks:
| Notebook | Description |
|---|---|
| 00e-GeoNamesCountry | Downloads country information from GeoNames.org |
| 00f-GeoNamesAdmin1 | Downloads first administrative divisions (State, Province, Municipality) information from GeoNames.org |
| 00g-GeoNamesAdmin2 | Downloads second administrative divisions (Counties in the US) information from GeoNames.org |
| 00h-GeoNamesCity | Downloads city information (cities > 1000 citizens) from GeoNames.org |
| 00i-USCensusRegionDivisionState2017 | Downloads US regions, divisions, and assigns state FIPS codes from the US Census Bureau |
| 00j-USCensusCountyCity2017 | Downloads US County FIPS codes from the US Census Bureau |
| 00k-UNRegion | Downloads UN geographic regions, subregions, and intermediate region information from United Nations |
| 01a-NCBIStrain | Downloads the latest SARS-CoV-2 strain data from NCBI [currently not used, replaced by 01d-CNCBStrain] |
| 01b-Nextstrain | Downloads the SARS-CoV-2 strain metadata from Nextstrain |
| 01c-NCBIRefSeq | Downloads the SARS-CoV-2 reference genome, genes, and protein products from NCBI |
| 01d-CNCBStrain | Downloads SARS-CoV-2 viral strains and variation data from CNCB (China National Center for Bioinformation) [takes about 12 hours to run the first time, results are cached] |
| 01h-PMC-Accession | Downloads PubMed Central articles that mention NCBI and GISAID strains |
| 02a-JHUCases | Downloads cumulative confirmed cases and deaths from the COVID-19 Data Repository by Johns Hopkins University |
| ... | Future notebooks that add new data to the knowledge graph |
| 2-CreateKnowledgeGraph | Creates a Neo4j Knowledge Graph by running the Cypher scripts in the cypher directory [does not work on Binder!] |
| 3-ExampleQueriesRemote | Runs Cypher queries on the Knowledge Graph server |
1. Fork this project
A fork is a copy of a repository in your GitHub account. Forking a repository allows you to freely experiment with changes without affecting the original project.
In the top-right corner of this GitHub page, click Fork.
Then, download all materials to your laptop by cloning your copy of the repository, where your-user-name is your GitHub user name. To clone the repository from a Terminal window or the Anaconda prompt (Windows), run:
git clone https://github.com/your-user-name/covid-19-community.git
cd covid-19-community
2. Create a conda environment
The file environment.yml specifies the Python version and all packages required by the tutorial.
conda env create -f environment.yml
Activate the conda environment
conda activate covid-19-community
3. Install Neo4j Desktop
Then, launch the Neo4j Browser, create an empty database, and set the password to "neo4jbinder"
4. Set Environment Variable
TODO add more documentation here ...
Set a NEO4J_HOME environment variable with the path to the database installation.
(Example path from Mac OS: /Users/username/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-993db298-6374-4f0a-9a9a-d0783480877a/installation-3.5.14)
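Before running the notebooks, it can help to confirm that the variable is visible to Python. The following is a minimal sketch, not part of this project: `check_neo4j_home` is a hypothetical helper, and the check for a `bin` subdirectory is an assumption about the layout of a Neo4j installation.

```python
import os
from pathlib import Path


def check_neo4j_home(env=None):
    """Return the NEO4J_HOME path if it looks usable, else raise an error."""
    env = os.environ if env is None else env
    value = env.get("NEO4J_HOME")
    if not value:
        raise RuntimeError("NEO4J_HOME is not set")
    home = Path(value)
    if not home.is_dir():
        raise RuntimeError(f"NEO4J_HOME is not a directory: {home}")
    # Assumption: a Neo4j installation contains a 'bin' subdirectory.
    if not (home / "bin").is_dir():
        raise RuntimeError(f"no 'bin' directory under {home}")
    return home
```

Calling `check_neo4j_home()` in the first notebook cell would fail early with a readable message instead of a later, less obvious error.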
5. Launch Jupyter Lab
Run the Jupyter Notebooks in order to download the latest data, create a new graph database, and then query the graph database.
jupyter lab
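The notebook names sort lexicographically into their run order (the 00, 01 and 02 series, then 2-CreateKnowledgeGraph, then 3-ExampleQueriesRemote), so the sequence can be derived from the file names. The sketch below only builds the `jupyter nbconvert` commands and does not execute anything. It assumes the notebooks are `.ipynb` files in a single `notebooks` folder; `execution_commands` and its `skip` parameter are hypothetical names.

```python
from pathlib import Path


def execution_commands(notebook_dir, skip=()):
    """Build one 'jupyter nbconvert' command per notebook, in name order.

    skip: notebook names (without .ipynb) to leave out, for example a
    notebook that is no longer used or one that takes hours to run.
    """
    paths = sorted(Path(notebook_dir).glob("*.ipynb"), key=lambda p: p.name)
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", str(p)]
        for p in paths
        if p.stem not in skip
    ]


# Usage: print the commands for review before running any of them.
# for cmd in execution_commands("notebooks", skip=("01a-NCBIStrain",)):
#     print(" ".join(cmd))
```

Each command could then be passed to `subprocess.run(cmd, check=True)`; keep in mind that 01d-CNCBStrain takes about 12 hours on its first run.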
6. Browse KG in Neo4j Browser
After you create the graph database by running the Jupyter Notebooks, start the database in Neo4j Browser to interactively explore the KG.



