Community effort to build a Neo4j Knowledge Graph (KG) that links heterogeneous data about COVID-19

Harikantipudi/covid-19-community

 
 

Covid-19-Community

This project is a community effort to build a Neo4j Knowledge Graph (KG) that links heterogeneous data about COVID-19 to help fight this outbreak. It serves as a sandbox and incubator project; the best ideas will be incorporated into the Covid-19-Net KG.

Join "GraphHackers, Let’s Unite to Help Save the World — Graphs4Good 2020".

What kind of data can you contribute? Here are some of our ideas.

How can you contribute?

  • File an issue to discuss your idea so we can coordinate efforts

  • Help with specific issues

  • Suggest publicly accessible data sets

  • Suggest graph queries to gain new insights from the KG

  • Add Jupyter Notebooks with data analyses

  • Add data and map visualizations

  • Help improve the data model

  • Report bugs or issues

Preliminary Knowledge Graph Schema

The left side of the schema shows the geolocation hierarchy from the world level down to the city level (cities with more than 1,000 inhabitants). Geolocations are linked by COVID-19 case counts to information about host organisms, virus strains, genomes, genes, and proteins, as well as to publications that mention the virus strains.

Browsing the Knowledge Graph with the Neo4j Browser

View of Neo4j Browser showing the result of a query about publications on the origin of the SARS-CoV-2 virus.

You can browse the KG with the Neo4j Browser here:

  1. Launch Browser
  2. Enter username: reader, password: demo
  3. Click on the database icon on the top left, then click on any node label to start exploring the KG
  4. Run a Cypher query

Example Cypher query: find viral strains collected in Los Angeles

MATCH (s:Strain)-[:FOUND_IN]->(l:Location{name: 'Los Angeles'}) RETURN s, l

This subgraph shows two viral strains (green) of the SARS-CoV-2 virus carried by a human host in Los Angeles (organisms in yellow). The strains have several variants (e.g., mutations) (red) in common. Details of the highlighted variant are shown at the bottom. This variant is a missense mutation: the base "G" (guanine) found in the Wuhan-HU-1 reference genome was mutated to a "C" (cytosine) at position 28007 in this strain (ORF8:c.184Gtg>Ctg), which changes the encoded ORF8 protein (QHD43422.1) from a "V" (valine) to an "L" (leucine) amino acid at position 62 (QHD43422.1:p.62V>L). Two publications, PMC7166309 and PMC7106203 (blue), mention this strain.
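The variant annotation in this example can be checked by hand. The following sketch is not part of the project: the helper names are hypothetical, and the codon table is a hand-written excerpt of the standard genetic code that lists only the two codons involved. It shows how coding position 184 maps to codon 62 and why the codon change Gtg>Ctg replaces valine with leucine.

```python
# Illustrative check of the variant described above (ORF8:c.184Gtg>Ctg).
# Helper names are hypothetical; only the two codons needed here are listed.

CODON_TO_AMINO_ACID = {
    "GTG": "V",  # valine
    "CTG": "L",  # leucine
}


def codon_number(coding_position: int) -> int:
    """Return the 1-based codon index for a 1-based coding-sequence position."""
    return (coding_position - 1) // 3 + 1


def describe_substitution(coding_position: int, ref_codon: str, alt_codon: str) -> str:
    """Format a protein-level change such as 'p.62V>L' from a codon change."""
    ref_aa = CODON_TO_AMINO_ACID[ref_codon.upper()]
    alt_aa = CODON_TO_AMINO_ACID[alt_codon.upper()]
    return f"p.{codon_number(coding_position)}{ref_aa}>{alt_aa}"


print(describe_substitution(184, "Gtg", "Ctg"))  # p.62V>L
```

Position 184 is the first base of codon 62, because codons 1 to 61 occupy coding positions 1 to 183. This matches the protein-level annotation QHD43422.1:p.62V>L.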

Example Cypher query: aggregate cumulative COVID-19 case numbers at the US state (Admin1) level

MATCH (o:Outbreak{id: "COVID-19"})<-[:RELATED_TO]-(c:Cases{date: date("2020-05-04")})-[:REPORTED_IN]->(a:Admin2)-[:IN]->(a1:Admin1)
RETURN a1.name as state, sum(c.cummulativeConfirmed) as confirmed, sum(c.cummulativeDeaths) as deaths
ORDER BY deaths;

Note: some cases in the COVID-19 Data Repository by Johns Hopkins University cannot be mapped to a county or state location (e.g., cruise ships, correctional facilities, missing location data). Therefore, this query underreports the actual number of cases.
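The same aggregation can also be run from Python, for example inside a notebook. The sketch below is one possible approach, not the project's own code. It assumes the official `neo4j` Python driver is installed and that you know the Bolt address of a KG server (the `uri` argument is a placeholder; this README does not state it). The date is passed as a query parameter instead of being hard-coded, and `fetch_state_totals` and `national_totals` are hypothetical helper names.

```python
# Sketch: run the state-level aggregation from Python.
# Assumptions (not from this README): the official `neo4j` driver package is
# installed and `uri` is the Bolt address of a server hosting the KG.

STATE_TOTALS_QUERY = """
MATCH (o:Outbreak {id: "COVID-19"})<-[:RELATED_TO]-(c:Cases {date: date($day)})
      -[:REPORTED_IN]->(a:Admin2)-[:IN]->(a1:Admin1)
RETURN a1.name AS state,
       sum(c.cummulativeConfirmed) AS confirmed,
       sum(c.cummulativeDeaths) AS deaths
ORDER BY deaths
"""


def fetch_state_totals(uri, user, password, day):
    """Return one dict per state with confirmed cases and deaths for `day` (YYYY-MM-DD)."""
    # Imported here so that the rest of this module works without the driver.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(STATE_TOTALS_QUERY, day=day)
            # Read all records while the session is still open.
            return [record.data() for record in result]
    finally:
        driver.close()


def national_totals(rows):
    """Sum the per-state rows into overall confirmed and death counts."""
    return {
        "confirmed": sum(row["confirmed"] for row in rows),
        "deaths": sum(row["deaths"] for row in rows),
    }


# Example call (requires a reachable server; the URI is a placeholder):
# rows = fetch_state_totals("bolt://<server>:7687", "reader", "demo", "2020-05-04")
# print(national_totals(rows))
```

Because unmapped cases are missing from the state-level rows, the totals computed this way inherit the underreporting described above.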

[more documentation will come soon]

How to use this project?

This project uses Jupyter Notebooks to download and curate the latest data files, create a Neo4j graph database, and run Cypher queries on the graph database. The results of the queries can then be used in the Jupyter Notebooks for further analysis and visualizations.

You can run the Jupyter Notebooks in this repo in your web browser:

Binder

Once Jupyter Lab launches, navigate to the notebooks folder and run the following notebooks:

| Notebook | Description |
| --- | --- |
| 00e-GeoNamesCountry | Downloads country information from GeoNames.org |
| 00f-GeoNamesAdmin1 | Downloads first administrative divisions (State, Province, Municipality) information from GeoNames.org |
| 00g-GeoNamesAdmin2 | Downloads second administrative divisions (Counties in the US) information from GeoNames.org |
| 00h-GeoNamesCity | Downloads city information (cities > 1000 citizens) from GeoNames.org |
| 00i-USCensusRegionDivisionState2017 | Downloads US regions, divisions, and assigns state FIPS codes from the US Census Bureau |
| 00j-USCensusCountyCity2017 | Downloads US County FIPS codes from the US Census Bureau |
| 00k-UNRegion | Downloads UN geographic regions, subregions, and intermediate region information from the United Nations |
| 01a-NCBIStrain | Downloads the latest SARS-CoV-2 strain data from NCBI [currently not used, replaced by 01d-CNCBStrain] |
| 01b-Nextstrain | Downloads the SARS-CoV-2 strain metadata from Nextstrain |
| 01c-NCBIRefSeq | Downloads the SARS-CoV-2 reference genome, genes, and protein products from NCBI |
| 01d-CNCBStrain | Downloads SARS-CoV-2 viral strains and variation data from CNCB (China National Center for Bioinformation) [takes about 12 hours to run the first time, results are cached] |
| 01h-PMC-Accession | Downloads PubMed Central articles that mention NCBI and GISAID strains |
| 02a-JHUCases | Downloads cumulative confirmed cases and deaths from the COVID-19 Data Repository by Johns Hopkins University |
| ... | Future notebooks that add new data to the knowledge graph |
| 2-CreateKnowledgeGraph | Creates a Neo4j Knowledge Graph by running the Cypher scripts in the cypher directory [does not work on Binder!] |
| 3-ExampleQueriesRemote | Runs Cypher queries on the Knowledge Graph server |

How to run this project locally

1. Fork this project

A fork is a copy of a repository in your GitHub account. Forking a repository allows you to freely experiment with changes without affecting the original project.

In the top-right corner of this GitHub page, click Fork.

Then, download all materials to your laptop by cloning your copy of the repository, where your-user-name is your GitHub user name. To clone the repository from a Terminal window or the Anaconda prompt (Windows), run:

git clone https://github.com/your-user-name/covid-19-community.git
cd covid-19-community

2. Create a conda environment

The file environment.yml specifies the Python version and all packages required by this project.

conda env create -f environment.yml

Activate the conda environment:

conda activate covid-19-community

3. Install Neo4j Desktop

Download Neo4j

Then, launch the Neo4j Browser, create an empty database, and set the password to "neo4jbinder".

4. Set Environment Variable

TODO add more documentation here ...

Set a NEO4J_HOME environment variable with the path to the database installation.

(Example path on macOS: /Users/username/Library/Application Support/Neo4j Desktop/Application/neo4jDatabases/database-993db298-6374-4f0a-9a9a-d0783480877a/installation-3.5.14)
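Before running the notebooks, it can help to confirm that the variable is visible to Python. The following is a minimal sketch, not part of the project: `resolve_neo4j_home` is a hypothetical helper, and the check for a `bin` folder rests on the assumption that a Neo4j installation directory contains one. On macOS or Linux the variable is typically set in the shell with `export NEO4J_HOME="..."`. The quotes matter when the path contains spaces, as the example path above does.

```python
import os
from pathlib import Path


def resolve_neo4j_home(environ=os.environ):
    """Return NEO4J_HOME as a Path, or raise an error with a readable message."""
    value = environ.get("NEO4J_HOME")
    if not value:
        raise RuntimeError("NEO4J_HOME is not set")
    home = Path(value).expanduser()
    if not home.is_dir():
        raise RuntimeError(f"NEO4J_HOME does not point to a directory: {home}")
    # Assumption: a Neo4j installation directory contains a 'bin' folder.
    if not (home / "bin").is_dir():
        raise RuntimeError(f"No 'bin' folder found in {home}")
    return home


# Example: print(resolve_neo4j_home()) in the shell or notebook session
# that will be used to create the graph database.
```

Running the check in the same session that launches Jupyter Lab verifies that the notebooks will see the same value.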

5. Launch Jupyter Lab

Run the Jupyter Notebooks in order to download the latest data, create a new graph database, and then query the graph database.

jupyter lab
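As an alternative to running each notebook by hand, the notebooks can be executed in sequence from a script. This is a sketch under stated assumptions, not the project's documented workflow. It assumes the notebooks are `.ipynb` files in the notebooks folder, that sorting their file names gives the intended order (prefixes 00e, 01a, 02a, then 2- and 3-), and that `jupyter nbconvert` is available in the active conda environment. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def notebooks_in_order(names):
    """Sort notebook file names so that their prefixes run in sequence."""
    return sorted(names)


def run_notebooks(folder="notebooks"):
    """Execute every notebook in `folder` in place, stopping at the first failure."""
    paths = notebooks_in_order(str(p) for p in Path(folder).glob("*.ipynb"))
    for path in paths:
        print(f"Running {path}")
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", path],
            check=True,  # raise CalledProcessError if a notebook fails
        )


# run_notebooks()  # uncomment to execute all notebooks in the notebooks folder
```

Plain string sorting is enough here because "0" sorts before "2" and "3", so the download notebooks run before 2-CreateKnowledgeGraph and 3-ExampleQueriesRemote.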

6. Browse KG in Neo4j Browser

After you create the graph database by running the Jupyter Notebooks, start the database in Neo4j Browser to interactively explore the KG.
