Alliance of Genome Resources on AWS

Welcome to the Alliance of Genome Resources (Alliance) open data repository on AWS. This documentation helps you access and use comprehensive genomic, genetic, and molecular data from multiple model organisms.

What is the Alliance of Genome Resources?

The Alliance of Genome Resources is a consortium integrating data from leading model organism databases:

Drosophila melanogaster and other Drosophila species
Caenorhabditis elegans
Danio rerio (zebrafish)
Mus musculus (mouse)
Rattus norvegicus (rat)
Saccharomyces cerevisiae (yeast)
Xenopus laevis and Xenopus tropicalis (frogs)
Homo sapiens (human reference data)

Mission: Provide unified, high-quality genomic data to accelerate biological research and human disease understanding.

Quick Start

1. Browse Available Data

Visit the Alliance downloads page (https://www.alliancegenome.org/downloads) or explore the S3 buckets:

FlyBase Data (Public HTTPS):

# Browse via wget
wget -qO- https://s3ftp.flybase.org/releases/current/precomputed_files/ | grep -F '.tsv.gz'

Alliance Data (S3):

# List Alliance releases
aws s3 ls s3://mod-datadumps/ --no-sign-request

# List disease data for a specific release (8.3.0 shown here)
aws s3 ls s3://mod-datadumps/8.3.0/DISEASE-ALLIANCE/ --no-sign-request
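Because the release number is hard-coded in the path above, it can help to look up the newest release programmatically. The following is a minimal sketch, assuming that releases appear as top-level prefixes named like 8.3.0 in the mod-datadumps bucket (inferred from the paths in this README, not confirmed). The function names latest_release and list_release_prefixes are hypothetical helpers, not part of any Alliance tooling.

```python
import re

# Assumption: release directories look like "8.3.0/" (three numeric parts).
_VERSION = re.compile(r"^(\d+)\.(\d+)\.(\d+)/?$")


def latest_release(prefixes):
    """Return the highest version-like name among the prefixes, or None."""
    versions = []
    for prefix in prefixes:
        match = _VERSION.match(prefix)
        if match:
            # Compare numerically so that 8.10.0 sorts after 8.3.0.
            versions.append(tuple(int(part) for part in match.groups()))
    if not versions:
        return None
    return ".".join(str(part) for part in max(versions))


def list_release_prefixes(client, bucket="mod-datadumps"):
    """Collect top-level 'directory' prefixes using a boto3 S3 client."""
    paginator = client.get_paginator("list_objects_v2")
    prefixes = []
    for page in paginator.paginate(Bucket=bucket, Delimiter="/"):
        for entry in page.get("CommonPrefixes", []):
            prefixes.append(entry["Prefix"])
    return prefixes
```

With an anonymous boto3 client (created as in the Python (boto3) section), latest_release(list_release_prefixes(s3)) would return the newest version-like prefix, if the naming assumption holds.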

2. Download Your First File

# Download gene annotation IDs
wget https://s3ftp.flybase.org/releases/current/precomputed_files/genes/fbgn_annotation_ID_current.tsv.gz

# Decompress
gunzip fbgn_annotation_ID_current.tsv.gz

# View
head fbgn_annotation_ID_current.tsv
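The same first look can be taken from Python without decompressing the file on disk. This is a minimal sketch using only the standard library; it assumes that comment lines start with '#', as the grep example in the Quick Reference suggests. The function name read_tsv_rows is a hypothetical helper.

```python
import csv
import gzip


def read_tsv_rows(path, limit=10):
    """Return up to `limit` data rows from a gzip-compressed TSV file."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # Skip blank lines and '#' comment lines before parsing.
        data_lines = (
            line for line in handle
            if line.strip() and not line.startswith("#")
        )
        reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            rows.append(row)
            if len(rows) >= limit:
                break
    return rows
```

For example, read_tsv_rows("fbgn_annotation_ID_current.tsv.gz") would return the first ten data rows as lists of column values.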

3. Explore Tutorials

Check out TUTORIAL.md for step-by-step guides on:

Downloading and accessing data
Gene annotation lookups
Disease gene discovery
Expression analysis
Building interaction networks
Cross-species orthology mapping

Documentation

Data Organization Documentation

Complete reference guide covering:

Dataset Overview - Scale, update frequency, data categories
S3 Bucket Structure - Directory organization and file naming
Data Categories - Detailed descriptions of all data types:
- Gene annotations
- Expression data (bulk RNA-Seq, single-cell RNA-Seq)
- Disease associations
- Molecular and genetic interactions
- Orthology relationships
- Variants and alleles
- Genome sequences and annotations
Access Methods - S3, HTTPS, Python boto3, PostgreSQL, API
File Formats - TSV, JSON, FASTA, GFF3, GTF, VCF, MITAB
Common Use Cases - Real-world examples

Tutorial Guide

Hands-on tutorials with working code examples:

Getting Started - Download your first dataset (10 min)
Gene Annotation Lookups - ID conversion and batch processing (15 min)
Disease Gene Discovery - Find disease-associated genes (20 min)
RNA-Seq Expression Analysis - Analyze developmental expression (25 min)
Protein Interaction Networks - Build and visualize PPI networks (20 min)
Cross-Species Orthology - Map genes between species (15 min)

Data Categories

Core Annotation Data

Gene Annotations - IDs, symbols, descriptions, mappings
Gene Ontology - Molecular function, biological process, cellular component
Gene Groups - Pathway and functional groupings

Functional Genomics

Bulk RNA-Seq - RPKM/FPKM expression matrices across tissues and stages
Single-Cell RNA-Seq - Cell cluster expression from multiple datasets
Curated Expression - Spatiotemporal expression patterns with ontology terms

Disease and Phenotypes

Disease Associations - Links to human diseases (Disease Ontology)
Phenotypes - Mutant and variant phenotype annotations
Human Disease Models - Model organism connections to human disease

Interactions

Physical Interactions - Protein-protein, protein-RNA, RNA-RNA (PSI-MI TAB format)
Genetic Interactions - Suppression, enhancement, synthetic lethality

Comparative Genomics

Orthology - Cross-species gene relationships with DIOPT scores
Paralogy - Within-species gene duplications

Variants and Alleles

VCF Files - Genomic variants in standard format
Allele Annotations - Detailed allele and variant descriptions
Genotype-Phenotype - Links between genetic changes and phenotypes

Genome Sequences

FASTA - Chromosomes, genes, transcripts, proteins
GFF3/GTF - Genome annotations for analysis pipelines
Transposable Elements - TE sequences and insertion sites

Access Methods

Web Browser

Alliance Portal: https://www.alliancegenome.org
Downloads Page: https://www.alliancegenome.org/downloads
FTP Browser: https://s3ftp.flybase.org/releases/current/

AWS S3 (Recommended for Large Downloads)

# Anonymous access - no AWS account needed
# FlyBase data
aws s3 ls s3://s3ftp.flybase.org/releases/current/precomputed_files/genes/ --no-sign-request
aws s3 cp s3://s3ftp.flybase.org/releases/current/precomputed_files/genes/fbgn_annotation_ID_current.tsv.gz . --no-sign-request

# Alliance data
aws s3 ls s3://mod-datadumps/8.3.0/DISEASE-ALLIANCE/COMBINED/ --no-sign-request
aws s3 cp s3://mod-datadumps/8.3.0/DISEASE-ALLIANCE/COMBINED/DISEASE-ALLIANCE_COMBINED_2.tsv.gz . --no-sign-request

Direct Download (wget/curl)

# wget
wget https://s3ftp.flybase.org/releases/current/precomputed_files/genes/fbgn_annotation_ID_current.tsv.gz

# curl
curl -O https://s3ftp.flybase.org/releases/current/precomputed_files/genes/fbgn_annotation_ID_current.tsv.gz
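When neither wget nor curl is available, the same HTTPS download can be done with the Python standard library. The sketch below is one possible approach; filename_from_url and download are hypothetical helper names, and the file is streamed to disk rather than loaded into memory.

```python
import os
import shutil
import urllib.parse
import urllib.request


def filename_from_url(url):
    """Use the last path component of the URL as the local file name."""
    return os.path.basename(urllib.parse.urlparse(url).path)


def download(url, dest_dir="."):
    """Stream the resource at `url` into dest_dir and return the local path."""
    dest = os.path.join(dest_dir, filename_from_url(url))
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so large files do not need to fit in memory.
        shutil.copyfileobj(response, out)
    return dest
```

Calling download() with the fbgn_annotation_ID_current.tsv.gz URL shown above would save the file under the same name in the current directory.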

Python (boto3)

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Create S3 client with anonymous access
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Download FlyBase data
s3.download_file('s3ftp.flybase.org',
                'releases/current/precomputed_files/genes/fbgn_annotation_ID_current.tsv.gz',
                'fbgn_annotation_ID_current.tsv.gz')

# Download Alliance data
s3.download_file('mod-datadumps',
                '8.3.0/DISEASE-ALLIANCE/COMBINED/DISEASE-ALLIANCE_COMBINED_2.tsv.gz',
                'DISEASE-ALLIANCE_COMBINED_2.tsv.gz')

PostgreSQL Database

# Public read-only access
psql -h chado.flybase.org -U flybase flybase

REST API

# Get gene information
curl https://www.alliancegenome.org/api/gene/FBgn0000001

# Search genes
curl 'https://www.alliancegenome.org/api/search?category=gene&q=white'
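The two REST calls above can also be issued from Python. This is a minimal sketch using only the standard library: the endpoint paths and the category and q parameters are taken from the curl examples, while the helper names (gene_url, search_url, fetch_json) are hypothetical and the shape of the JSON response is not assumed.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://www.alliancegenome.org/api"


def gene_url(gene_id):
    """Build the gene endpoint URL for an identifier."""
    return f"{API_ROOT}/gene/{urllib.parse.quote(gene_id, safe=':')}"


def search_url(query, category="gene"):
    """Build a search URL; urlencode escapes spaces and special characters."""
    params = urllib.parse.urlencode({"category": category, "q": query})
    return f"{API_ROOT}/search?{params}"


def fetch_json(url, timeout=30):
    """Fetch a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For instance, fetch_json(search_url("white")) would request the same search as the curl example. Building the query string with urlencode avoids the shell-quoting pitfalls of a raw '&' in the URL.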

File Naming Conventions

FlyBase precomputed and sequence files follow predictable patterns:

Current Release (Always Latest)

<data_type>_<details>_current.tsv.gz
dmel-all-<feature>-current.fasta.gz

Examples:

fbgn_annotation_ID_current.tsv.gz
gene_rpkm_matrix_current.tsv.gz

Versioned Releases

<data_type>_<details>_fb_YYYY_MM.tsv.gz
dmel-all-<feature>-rX.YY.fasta.gz

Examples:

fbgn_annotation_ID_fb_2023_06.tsv.gz (FB2023_06 release)
dmel-all-chromosome-r6.55.fasta.gz (genome release 6.55)
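The naming patterns above are regular enough to parse when organizing downloaded files by release. The following sketch covers only the four patterns listed in this section; release_label is a hypothetical helper, and other file types may not follow these patterns.

```python
import re


def release_label(filename):
    """Return 'current', an FBYYYY_MM label, an rX.YY label, or None."""
    # Current release: ..._current.tsv.gz or ...-current.fasta.gz
    if re.search(r"[_-]current\.(tsv|fasta)\.gz$", filename):
        return "current"
    # Versioned table: ..._fb_YYYY_MM.tsv.gz
    match = re.search(r"_fb_(\d{4})_(\d{2})\.tsv\.gz$", filename)
    if match:
        return f"FB{match.group(1)}_{match.group(2)}"
    # Versioned sequence: ...-rX.YY.fasta.gz
    match = re.search(r"-(r\d+\.\d+)\.fasta\.gz$", filename)
    if match:
        return match.group(1)
    return None
```

Recording the returned label alongside analysis results is one way to meet the reproducibility guidance on release version numbers.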

Data Updates

Release Schedule:

Major Releases: Quarterly (every ~3 months)
Hot Fixes: As needed for critical corrections
Continuous Updates: Daily for time-sensitive annotations

Versioning:

FlyBase releases: FB[YEAR]_[MONTH] (e.g., FB2023_06)
Alliance releases: numeric version strings (e.g., 8.3.0, as used in the mod-datadumps paths)
Genome releases: r[MAJOR].[MINOR] (e.g., r6.55)

Current vs. Archived:

/releases/current/ - Always points to latest release
/releases/FB2023_06/ - Specific archived release

Common Use Cases

Gene Research

Convert gene symbols to database IDs
Retrieve gene descriptions and annotations
Find genes in specific pathways or GO terms
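The symbol-to-ID conversion listed above could be sketched as a simple lookup table built from parsed TSV rows. The helper names build_lookup and convert_symbols are hypothetical, and the column positions are deliberately left as parameters: which columns hold the symbol and the database ID is an assumption to verify against the header comments of the file you download.

```python
def build_lookup(rows, key_col, value_col):
    """Map the value in key_col to the value in value_col for each row."""
    lookup = {}
    needed = max(key_col, value_col)
    for row in rows:
        # Skip rows that are too short to contain both columns.
        if len(row) > needed:
            # Keep the first mapping seen if a key repeats.
            lookup.setdefault(row[key_col], row[value_col])
    return lookup


def convert_symbols(symbols, lookup):
    """Batch-convert symbols; unknown symbols map to None."""
    return {symbol: lookup.get(symbol) for symbol in symbols}
```

Swapping key_col and value_col gives the reverse mapping from database IDs back to symbols.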

Disease Studies

Identify genes associated with human diseases
Find model organism disease models
Map disease genes to orthologs

Expression Analysis

Compare gene expression across developmental stages
Analyze tissue-specific expression
Explore single-cell expression patterns

Interaction Networks

Build protein-protein interaction networks
Analyze genetic interactions
Find interaction partners for proteins of interest
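As a rough illustration of the network-building use case, the sketch below turns PSI-MI TAB lines into an undirected adjacency map. It relies on the PSI-MI TAB convention that the first two tab-separated columns hold the identifiers of interactor A and interactor B; whether a given Alliance or FlyBase file follows that layout exactly should be checked against its header. The names mitab_edges and build_network are hypothetical.

```python
from collections import defaultdict


def mitab_edges(lines):
    """Yield (interactor_a, interactor_b) ID pairs from PSI-MI TAB lines."""
    for line in lines:
        # Ignore blank lines and '#' header or comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2:
            continue
        yield fields[0], fields[1]


def build_network(edges):
    """Build an undirected adjacency map: node -> set of partner nodes."""
    network = defaultdict(set)
    for a, b in edges:
        network[a].add(b)
        network[b].add(a)
    return dict(network)
```

Looking up a protein's key in the returned map gives its interaction partners. The edge pairs could also be exported for visualization in a tool such as Cytoscape.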

Comparative Genomics

Map orthologs between species
Find conserved genes and pathways
Compare genomic features across model organisms

Variant Analysis

Access genomic variants in VCF format
Link variants to phenotypes
Study allele effects

Data Formats

Format	Description	Use Case
TSV	Tab-separated values	General data tables
JSON	JavaScript Object Notation	Structured data, API responses
FASTA	Sequence data	Genomic/protein sequences
GFF3	Genome annotations	Genome browsers, analysis
GTF	Gene Transfer Format	RNA-Seq pipelines
VCF	Variant Call Format	Variant analysis
MITAB	PSI-MI TAB	Protein interactions
XML	Chado XML	Complete database dumps

All compressed files use gzip (.gz extension).

System Requirements

Minimal Setup (Browser Only)

Web browser
No special software needed

Command Line Access

Linux, macOS, or Windows (WSL)
wget or curl
gunzip (usually pre-installed)

Data Analysis

Python:

Python 3.7+
pandas, boto3, matplotlib, biopython

R:

R 4.0+
tidyverse, data.table

Tools:

AWS CLI (optional, for S3 access)
IGV, JBrowse (genome visualization)
Cytoscape (network analysis)

Citation and License

Citing Alliance Data

Primary Citation:

Alliance of Genome Resources Consortium. Alliance of Genome Resources Portal: unified model organism research platform. Nucleic Acids Research (2023). https://doi.org/10.1093/nar/gkac1003

Data License

Most Alliance data is available under CC0 1.0 Universal (Public Domain Dedication). Some datasets may use CC-BY 4.0 (attribution required).

License Details: https://www.alliancegenome.org/terms-of-use

Attribution Requirements

When publishing research using Alliance data:

Cite the Alliance consortium paper (above)
Include release version numbers for reproducibility
Acknowledge specific data sources when applicable
Link to https://www.alliancegenome.org in web applications

Support and Help

AWS Open Data Sponsorship Program

This dataset is part of the AWS Open Data Sponsorship Program, which provides free hosting for publicly available high-value datasets.

Benefits:

✓ No AWS account required for downloads
✓ No data egress fees
✓ High-speed S3 access
✓ Global availability
✓ Automatic backups and archiving

Registry Entry: https://registry.opendata.aws/alliance-genome-resources/

Quick Reference

Essential Files

Data Type	File	Location
Gene IDs	`fbgn_annotation_ID_*.tsv.gz`	`precomputed_files/genes/`
Expression Matrix	`gene_rpkm_matrix_*.tsv.gz`	`precomputed_files/genes/`
Disease Data	`disease_model_annotations_*.tsv.gz`	`precomputed_files/disease/`
Interactions	`physical_interactions_mitab_*.tsv.gz`	`precomputed_files/interactions/`
Orthologs	`dmel_human_orthologs_disease_*.tsv.gz`	`precomputed_files/orthologs/`
Genome Sequence	`dmel-all-chromosome-*.fasta.gz`	`genomes/.../fasta/`
Genome Annotation	`dmel-all-*.gff.gz`	`genomes/.../gff/`

Example Commands

# Download gene annotations via wget
wget https://s3ftp.flybase.org/releases/current/precomputed_files/genes/fbgn_annotation_ID_current.tsv.gz

# Download FlyBase data using AWS CLI
aws s3 cp s3://s3ftp.flybase.org/releases/current/precomputed_files/genes/fbgn_annotation_ID_current.tsv.gz . --no-sign-request

# Download Alliance data using AWS CLI
aws s3 cp s3://mod-datadumps/8.3.0/DISEASE-ALLIANCE/COMBINED/DISEASE-ALLIANCE_COMBINED_2.tsv.gz . --no-sign-request

# List files in FlyBase
aws s3 ls s3://s3ftp.flybase.org/releases/current/precomputed_files/genes/ --no-sign-request

# List files in Alliance
aws s3 ls s3://mod-datadumps/8.3.0/DISEASE-ALLIANCE/ --no-sign-request

# Decompress
gunzip fbgn_annotation_ID_current.tsv.gz

# View first 10 data rows (skip comment lines)
grep -v '^#' fbgn_annotation_ID_current.tsv | head -10

Changelog

Version 1.0 (2025-10-17)

Initial documentation release
Comprehensive data organization guide
Six hands-on tutorials
AWS S3 access instructions
Registry of Open Data submission

Contributing

We welcome feedback and contributions:

Report Issues: Use GitHub issues for bugs or documentation improvements
Suggest Tutorials: Email [email protected] with tutorial ideas
Share Use Cases: Tell us how you're using Alliance data
Contribute Code: Submit pull requests for example scripts

Additional Resources

Alliance Components

Related Projects

Gene Ontology: http://geneontology.org
Disease Ontology: https://disease-ontology.org
UniProt: https://www.uniprot.org
NCBI: https://www.ncbi.nlm.nih.gov

Tools and Browsers

JBrowse: https://jbrowse.org
IGV: https://software.broadinstitute.org/software/igv/
Cytoscape: https://cytoscape.org

Contact

Alliance of Genome Resources

Website: https://www.alliancegenome.org
Email: [email protected]
Community Forum: https://community.alliancegenome.org/categories
Facebook: https://www.facebook.com/AllianceOfGenomeResources
Mastodon: https://genomic.social/@AllianceGenome
Bluesky: https://bsky.app/profile/alliancegenome.bsky.social
GitHub: https://github.com/alliance-genome

Documentation Version: 1.0
Last Updated: 2025-10-17
Maintained by: Alliance of Genome Resources Consortium
