Skip to content

3. Functions and Use Cases

Huu Minh Liem Le edited this page Jan 10, 2025 · 4 revisions

Here, we outline the component functions of blessy along with use-cases and arguments for each, to assist users in customizing the module to fit their specific needs. The functions are listed based on their order in the blessy pipeline:

Function Description
blessy.getTranscriptTrack Fetch transcript-type annotation tracks from the UCSC database.
blessy.getDomainTrack Fetch domain-type annotation tracks UCSC database.
blessy.dfToGRangesList Convert BED-like R data frame to GRangesList.
blessy.mapDomainToTranscript Create a data frame of domains mapped to transcripts.
blessy.addStartsEnds Add important coordinates columns for domain deduplication.
blessy.domainDeduplication Remove non-exact domain mappings.
blessy.domainPhasing Create a 'DoCo' column based on domain phasing.
blessy.createPhasingDictionary Summarize gene, DoCo and transcript levels from given annotations.
blessy.createDoCoCount Aggregate given transcript count to DoCo level.


3.1. Fetch Annotation Tracks

blessy.getTranscriptTrack(genomeAssembly, transcriptAnnotation) and blessy.getDomainTrack(genomeAssembly, domainAnnotation)

The purpose of these two functions is for retrieving UCSC annotations. As transcript and domain tracks often have different syntax on the UCSC Database, each function is designed to a different track type. Both functions require two arguments: an assembly identifier and an annotation identifier, which correspond to the 'Assembly' and 'Table' options in the UCSC Table Browser respectively. Again, we highly recommend using the GENCODE or NCBI RefSeq tracks for transcript annotation and UniProt or Pfam tracks for domain annotation. The output of blessy.getTranscriptTrack() and blessy.getDomainTrack() are annotations stored in BED-like R data frames.

# Fetch annotation tracks
transcript_annotation <- blessy.getTranscriptTrack("hg38", "wgEncodeGencodeBasicV44")
domain_annotation <- blessy.getDomainTrack("hg38", "unipDomain")

# Output visualization
head(transcript_annotation)
  chrom chromStart  chromEnd              name score strand thickStart  thickEnd itemRgb blockCount
1  chr1   67092164  67134970 ENST00000684719.1     0      -   67093004  67127240   0,0,0          8
2  chr1   67092164  67231852 ENST00000371007.6     0      -   67093004  67127240   0,0,0          8
3  chr1   67092175  67127261 ENST00000371006.5     0      -   67093004  67127240   0,0,0          6
4  chr1   67092175  67127261 ENST00000475209.6     0      -   67093579  67127240   0,0,0          7
5  chr1   67092396  67127261 ENST00000621590.4     0      -   67096311  67127240   0,0,0          3
6  chr1  201283451 201332993 ENST00000263946.7     0      +  201283702 201328836   0,0,0         15

There can be several scenarios where users would prefer NOT to use annotations directly fetched from the UCSC Genome Browser:

a. Using user-provided annotations

To run blessy using annotations NOT from UCSC, user must ensure that their provided annotations are first read into BED12-like R data frames. The data frame must include the following columns:

chrom- Chromosome or scaffold name of the feature.

chromStart - Start coordinate on the chromosome or scaffold for the feature considered.

chromEnd - End coordinate on the chromosome or scaffold for the feature considered.

blockCount - Number of blocks (exon for transcripts, domain-block for domains) on the line of the BED-like data frame.

blockSizes - List of values separated by commas corresponding to the size of the blocks (the number of values must correspond to that of the 'blockCount').

blockStarts - List of values separated by commas corresponding to the starting coordinates of the blocks, coordinates calculated relative to those present in the chromStart column (the number of values must correspond to that of the 'blockCount').

thickStart - (for transcript only) Start coordinate of the coding sequence of the transcript.

thickEnd - (for transcript only) End coordinate of the coding sequence of the transcript.

geneName - (for transcript only) Identifier of the gene the transcript belongs to.

For GTF annotations, blessy offers the function blessy.GTFtoBEDdf() to convert a GTF file into the required BED-like format. For example, user can choose a given annotation track from GENCODE and load it as below:

customTranscriptAnnotation <- blessy.GTFtoBEDdf("path/to/gencode.v47.basic.annotation.gtf", geneID = FALSE)

blessy can run on this annotation via the blessy.usingCustomAnnotation() custom wrapper to define the DoCo class and aggregate the provided transcript count:

blessy_outs_custom <- blessy.usingCustomAnnotation(customTranscriptAnnotation, DomainAnnotation, transcriptCount, unique_domain = FALSE, coordinates = TRUE)

customTranscriptAnnotation - A BED-like data frame or a GRangesList representing transcript annotation with required columns/information.

customDomainAnnotation - A BED-like data frame or a GRangesList representing domain annotation with required columns/information.

transcriptCount - An R data frame containing RNA-Seq transcript count. The first column must be named 'TranscriptID' storing string values of transcript identifiers. Other columns are considered numeric count across different biological samples.

unique_domain - A logical value indicating whether to further deduplicate domains by keeping only unique domains per transcript. Default is set to FALSE.

coordinates - A boolean value to decide if the coordinates of protein domains would be kept in the final DoCo string or not. Generally, setting this value to FALSE further deduplicates and reduces the number of DoCo values in a given dataset. Default is set to TRUE.

In this example, we worked with the Basic gene annotation GTF of the GENCODE human annotation Release 47. To acess the GTF, click here.

# Load library
library(longhaul)

# Load count
tx_count <- read.table("example_transcript_count.txt")

# Get domain annotation via blessy
domain_df <- blessy.getDomainTrack("hg38", "unipDomain")

# Load and convert the GTF transcript annotation
tx_df <- blessy.GTFtoDataFrame("gencode.v47.basic.annotation.gtf", geneID = T)

# Run blessy using custom annotation 
gtf_run <- blessy.usingCustomAnnotation(tx_df, domain_df, tx_count)
doco_count <- gtf_run$doco_count
doco_dict <- gtf_run$doco_dict

b. Using multiple tracks for transcript or domain annotations

Users can choose to retrieve multiple domain annotation tracks via blessy.getDomainTrack() (see Functions and Use Cases), and integrate these domain tracks into one comprehensive domain annotation using simple R commands. Once the final annotation data frame is created, it can be use as an input to blessy.usingCustomAnnotation().

# Load libraries
library(longhaul)
library(dplyr)

# Fetch domain tracks from the UCSC Genome Browser
unipDomain <- blessy.getDomainTrack("hg38", "unipDomain")
unipInterest <- blessy.getDomainTrack("hg38", "unipInterest")

# Combine the two data frames
combined_df <- bind_rows(unipDomain, unipInterest)

# Filter out rows containing "Disordered"
filtered_df <- combined_df %>% 
  filter(!grepl("Disordered", name)) 

# Sort 
domain_df <- filtered_df %>% 
  arrange(chrom, chromStart) 

# Load transcript count
tx_count <- read.table("example_transcript_count.txt")

# Fetch transcript annotation from the UCSC browser
tx_df <- blessy.getTranscriptTrack("hg38", "wgEncodeGencodeCompV44")

# Perform blessy using custom annotations
blessy_outs_custom <- blessy.usingCustomAnnotation(tx_df, domain_df, tx_count)


3.2. Convert BED-like R Data Frame to GRangesList

blessy.dfToGRangesList(annotation_df)

blessy.dfToGRangesList takes in a BED-like annotation data frame and convert it into a big GRanges object, based on the columns 'chrom', 'chromStart', 'chromEnd', and 'strand'. Next, this single GRanges objects will be splitted into a GRangesList where each GRanges object correspond to a transcript or a domain in the initial annotation using this function.

# Example BED-like annotation data frame
transcript_annotation <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  chromStart = c(1000, 2000, 3000),
  chromEnd = c(1500, 2500, 3500),
  name = c("tx1", "tx2", "tx3"),
  strand = c("+", "-", "+"),
  thickStart = c(1000, 2000, 3000),
  thickEnd = c(1500, 2500, 3500),
  blockCount = c(2, 2, 2),
  blockSizes = c("100,200", "150,250", "200,300"),
  blockStarts = c("0,400", "0,500", "0,600"),
  geneName = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)

# Convert the data frame to a GRangesList
transcript_GRL <- blessy.dfToGRangesList(transcript_annotation)

# Output visualization
head(transcript_GRL)

GRangesList object of length 3:
$`1`
GRanges object with 1 range and 7 metadata columns:
      seqnames    ranges strand |        name thickStart  thickEnd blockCount  blockSizes blockStarts    geneName
         <Rle> <IRanges>  <Rle> | <character>  <numeric> <numeric>  <numeric> <character> <character> <character>
  [1]     chr1 1000-1500      + |         tx1       1000      1500          2     100,200       0,400       geneA
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$`2`
GRanges object with 1 range and 7 metadata columns:
      seqnames    ranges strand |        name thickStart  thickEnd blockCount  blockSizes blockStarts    geneName
         <Rle> <IRanges>  <Rle> | <character>  <numeric> <numeric>  <numeric> <character> <character> <character>
  [1]     chr1 2000-2500      - |         tx2       2000      2500          2     150,250       0,500       geneB
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

$`3`
GRanges object with 1 range and 7 metadata columns:
      seqnames    ranges strand |        name thickStart  thickEnd blockCount  blockSizes blockStarts    geneName
         <Rle> <IRanges>  <Rle> | <character>  <numeric> <numeric>  <numeric> <character> <character> <character>
  [1]     chr2 3000-3500      + |         tx3       3000      3500          2     200,300       0,600       geneC
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths


3.3. Map Domain to Transcript

blessy.mapDomainToTranscript(transcript_GRL, domain_GRL, transcript_annotation, domain_annotation)

This function serves to map the domains that intersect each transcript based on genomic coordinates. The function takes in the two GRangesLists of domain and transcript, and find their coordinate overlaps using the findOverlaps() function of GenomicRanges. findOverlaps() returns index pairs of matching domains and transcripts, which is used to join columns of the initial transcript and domain annotations based on matching features. The return object of this function is a mapping data frame with information on transcript and its matched domains.

# Example BED-like annotation data frames
transcript_annotation <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  chromStart = c(1000, 2000, 3000),
  chromEnd = c(1500, 2500, 3500),
  name = c("tx1", "tx2", "tx3"),
  strand = c("+", "-", "+"),
  thickStart = c(1000, 2000, 3000),
  thickEnd = c(1500, 2500, 3500),
  blockCount = c(2, 2, 2),
  blockSizes = c("100,200", "150,250", "200,300"),
  blockStarts = c("0,400", "0,500", "0,600"),
  geneName = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)


domain_annotation <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  chromStart = c(1200, 2100, 3100),
  chromEnd = c(1300, 2200, 3200),
  name = c("domainA", "domainB", "domainC"),
  strand = c("+", "-", "+"),
  blockStarts = c("0", "0", "0"),
  blockSizes = c("100", "100", "100"),  # Assuming block size is 100 for each domain
  stringsAsFactors = FALSE
)

# Convert to GRangesList
tx_grangesList <- blessy.dfToGRangesList(transcript_annotation)
domain_grangesList <- blessy.dfToGRangesList(domain_annotation)

# Create domain mapping 
mapping_df <- blessy.mapDomainToTranscript(tx_grangesList, domain_grangesList, transcript_annotation, domain_annotation)

# Output visualization
head(mapping_df)
  chrom txStart txEnd Transcript strand cdsStart cdsEnd exonCount exonSizes exonRelativeStarts  Gene chromStart chromEnd  Domain
1  chr1    1000  1500        tx1      +     1000   1500         2   100,200              0,400 geneA       1200     1300 domainA
2  chr1    2000  2500        tx2      -     2000   2500         2   150,250              0,500 geneB       2100     2200 domainB
3  chr2    3000  3500        tx3      +     3000   3500         2   200,300              0,600 geneC       3100     3200 domainC
  chromStarts blockSizes
1           0        100
2           0        100
3           0        100


3.4. Add Block Coordinates to Mapping Data Frame

blessy.addStartsEnds(mapping_df)

This function adds several columns needed for the domain deduplication process to the input mapping data frame, based on existing columns. More specifically, this functions creates exonStarts and exonEnds columns corresponding to the start and end genomic coordinates of exons within a transcript, and likewise for its domains with blockStarts and blockEnds.

# Example mapping data frame
mapping_df <- data.frame(
  txStart = c(1000, 2000),
  exonRelativeStarts = c("0,100,200", "0,50,100"),
  exonSizes = c("100,50,25", "80,40,30"),
  chromStart = c(1000, 2000),
  chromStarts = c("0,150,300", "0,100,200"),
  blockSizes = c("100,50,25", "80,40,30")
)

# Add block coordinates
mapping_df <- blessy.addStartsEnds(mapping_df)

# Output visualization
head(mapping_df)
  txStart exonRelativeStarts exonSizes chromStart chromStarts blockSizes     exonStarts       exonEnds
1    1000          0,100,200 100,50,25       1000   0,150,300  100,50,25 1000,1100,1200 1100,1150,1225
2    2000           0,50,100  80,40,30       2000   0,100,200   80,40,30 2000,2050,2100 2080,2090,2130
     blockStarts      blockEnds
1 1000,1150,1300 1100,1200,1325
2 2000,2100,2200 2080,2140,2230


3.5. Domain Mapping Deduplication

blessy.domainDeduplication(tx_domain_df, unique_domain = FALSE)

While the blessy.mapDomainToTranscript function allows for finding coordinate overlaps between domains and transcripts, non-exact matches that incorrectly assigns a domain to a transcript can happen. In the figure below, given a hypothetical transcript A, only domain 1 to domain 4 (D1-D4) should be considered correct matches for transcript A. For D5, we observe a non-consecutive exon match. The first D6 block does not align to an exon and while both blocks aligns with consecutive exons, the second block of D7 exhibited a gap to the left-hand side. Similarly, first block alignment of D8 showed overflown. The blessy.domainDeduplication function thus aims to remove D5 to D8 from this hypothetical domain-transcript mapping, using the 'exonStarts', 'exonEnds', 'blockStarts', 'blockEnds' columns created by blessy.addStartsEnds.

domainDedup_concept

These hypothetical transcripts and domains have been converted to BED files, which can be found as data/example_domains.bed or data/example_transcript.bed. By default, we expect D1, D2, D3 and D4 to remain after deduplication. The use case for this process is as follows:

# Read and rename hypothetical transcript A BED into an R BED-like data frame
tx_df <- read.table("example_transcript.bed")
colnames(tx_df) <- c("chrom", "chromStart", "chromEnd", "name", "score", "strand",
                        "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts", "geneName")

# Read and rename hypothetical domains BED into an R BED-like data frame
domain_df <- read.table("example_domains.bed")
colnames(domain_df) <- c("chrom", "chromStart", "chromEnd", "name", "score", "strand",
                     "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts")

# Convert dataframes to GRanges object
tx_grangesList <- blessy.dfToGRangesList(tx_df)
domain_grangesList <- blessy.dfToGRangesList(domain_df)
  
# Map domains to transcripts
mapped_df <- blessy.mapDomainToTranscript(tx_grangesList, domain_grangesList, tx_df, domain_df)
  
# Add exon and block starts/ends
starts_ends_df <- blessy.addStartsEnds(mapped_df)
  
# Deduplicate domain mappings
deduplicated_df <- blessy.domainDeduplication(starts_ends_df, unique_domain = FALSE)

> unique(deduplicated_df$Domain)
[1] "D1" "D2" "D3" "D4"

Additionally, the function provides the parameter 'unique_domain' which is set to FALSE by default. By switching unique_domain to TRUE, the function removes repetitive domain names in each transcript so that the domain values are now unique for a transcript. For example, if transcript X is mapped to domain A twice, only one domain A is kept in this mapping. Because of this, unique_domain = TRUE is preferably used in tandem with coordinates = FALSE (see below)



3.6. Domain Phasing

blessy.domainPhasing(mapping_df, coordinates = TRUE)

The blessy.domainPhasing() generates the DoCo string of transcripts with matched domains and includes them to the input mapping data frame. The function iterates through the domains in each transcript based on their coordinates and strand direction, and output the DoCo string in a new 'DoCo' column. The coordinates boolean parameter decides if the coordinates are kept in the DoCo string or not. Switching coordinates to FALSE will remove the genomic coordinates from a DoCo string, grouping transcripts with domains in different locations but similar names. For example, transcript X with DoCo "D1::chr1:1500-2000;;geneZ" and transcript Y with DoCo "D1::chr1:2000-2500;;geneZ" will both belong to DoCo "D1;;geneZ" when coordinates is set to FALSE.

# Example domain mapping data frame
mapping_df <- data.frame(
  Transcript = c("tx1", "tx1", "tx2"),
  Domain = c("domainA", "domainB", NA),
  chrom = c("chr1", "chr1", "chr2"),
  chromStart = c(1000, 2000, 3000),
  chromEnd = c(1100, 2100, 3100),
  strand = c("+", "-", "+"),
  Gene = c("geneA", "geneA", "geneB")
)

# Apply domain phasing and create a new 'DoCo' column
mapping_df <- blessy.domainPhasing(mapping_df)

# Output visualization
head(mapping_df)
# A tibble: 3 × 8
  Transcript Domain  chrom chromStart chromEnd strand Gene  DoCo                                                
  <chr>      <chr>   <chr>      <dbl>    <dbl> <chr>  <chr> <chr>                                               
1 tx1        domainA chr1        1000     1100 +      geneA domainA::chr1:1000-1100(+),domainB::chr1:2000-2100(…
2 tx1        domainB chr1        2000     2100 -      geneA domainA::chr1:1000-1100(+),domainB::chr1:2000-2100(…
3 tx2        NA      chr2        3000     3100 +      geneB ;;; geneB                                           


3.7. Create Phasing Dictionary

blessy.createPhasingDictionary(mapping_df, transcript_annotation)

This function summarizes the hierarchical relationship between Gene, DoCo and transcript for transcripts with and without matched domains, in the chosen annotation. The function requires the data frame consisting mapped transcripts, and the initial transcript annotation for handling unmapped transcripts. The output of the function is a data frame with 'Gene', 'DoCo' and 'Transcript' columns, which will be used to aggregate transcripts belonging to the same DoCo in a given RNA-Seq transcript count.

# Example domain mapping data frame containing a 'DoCo' column
mapping_df <- data.frame(
  Transcript = c("tx1", "tx2"),
  DoCo = c("domainA::chr1:1000-1100(+);;; geneA", ";;; geneB"),
  Gene = c("geneA", "geneB")
)
#'
# Example transcript annotation
transcript_annotation <- data.frame(
  name = c("tx1", "tx2", "tx3"),
  geneName = c("geneA", "geneB", "geneC")
)

# Create the phasing dictionary
phasing_dict <- blessy.createPhasingDictionary(mapping_df, transcript_annotation)

# Output visualization
head(phasing_dict)
  Transcript                                DoCo              Gene
1        tx1             domainA::chr1:1000-1100(+);;; geneA geneA
2        tx2                           ;;; geneB             geneB
3        tx3                           ;;; geneC             geneC


3.8. Create DoCo Count from Transcript Count

blessy.createDoCoCount(phasing_dict, transcriptCount)

This function aggregates an input transcript count to acquire DoCo-level count. Transcripts are grouped based on their DoCo assignment from an input dictionary data frame, and counts are summed across biological samples. Transcripts not found in the dictionary are grouped into the DoCo class ";;;". For the transcript count input, the first column must be named 'TranscriptID' storing string values of transcript identifiers. Other columns are considered numeric count across different biological samples.

# Example dictionary data frame
dict <- data.frame(
  Transcript = c("Tx1", "Tx2", "Tx3"),
  DoCo = c("D1,D2;;; GeneA", "D3;;; GeneB", ";;; GeneC"),
  stringsAsFactors = FALSE
)

# Example count data frame
count_df <- data.frame(
  TranscriptID = c("Tx1", "Tx2", "Tx4", "Tx5"),
  Sample1 = c(10, 20, 5, 7),
  Sample2 = c(15, 25, 8, 10),
  stringsAsFactors = FALSE
)

# Create DoCo-level counts
doco_count <- blessy.createDoCoCount(dict, count_df)
Warning messages:
1: In blessy.createDoCoCount(dict, count_df) :
  Uncommon transcripts found, grouping them into the empty ';;;' DoCo group.
2: In blessy.createDoCoCount(dict, count_df) :
  Number of uncommon transcripts grouped: 2

# Output visualization
head(doco_count)
            DoCo       Sample1     Sample2
1           ;;;            12          18
2     D1,D2;;; GeneA       10          15
3      D3;;; GeneB         20          25