Commit 9a96f06

Merge pull request #16 from OpenKBC/engineering_dev
New updates for engineering, confirmed
2 parents 2e484ee + c9f35d7 commit 9a96f06

16 files changed, with 522 additions and 161 deletions

R/IDconverter.R

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+# ID converter from Ensembl ID to Entrez ID, please cleanup data by using python code before using this code
+# Only working with csv file
+# Usage Rscript
+# Rscript IDconverter.R path/counts_norm_CD8.csv path/outputfile.csv
+
+library("AnnotationDbi")
+library("org.Hs.eg.db")
+
+args = commandArgs(trailingOnly=TRUE)
+inputFile = args[1] # with path
+outputFile = args[2] # with path
+
+#str_split
+data<-read.table(inputFile, row.names=1, sep=',', header=TRUE) # Read data
+names(data) <- sub("^X", "", names(data)) # drop "X" string in columns name
+
+### Warning ###
+# Entrez ID might duplicate for Ensemble ID
+data$entrez = mapIds(org.Hs.eg.db, keys=row.names(data), column="ENTREZID", keytype="ENSEMBL", multiVals="first")
+row.names(data)<-make.names(data$entrez, unique=TRUE)
+row.names(data) <- sub("^X", "", row.names(data)) # drop "X" string in index name
+write.table(data, outputFile, sep=',', row.names = TRUE, col.names = TRUE) # Write result
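Review note: the warning in this script is about several Ensembl IDs mapping to the same Entrez ID, which the script handles by making the row names unique. The following Python sketch illustrates that idea only; it is not a port of `mapIds` or `make.names`. The helper names `make_unique` and `convert_index`, the `"NA"` label for unmapped IDs, and the toy IDs are all assumptions made for illustration.

```python
import pandas as pd


def make_unique(labels):
    """Keep the first occurrence of a label bare and suffix repeats with .1, .2, ..."""
    seen = {}
    unique = []
    for label in labels:
        count = seen.get(label, 0)
        unique.append(label if count == 0 else f"{label}.{count}")
        seen[label] = count + 1
    return unique


def convert_index(data, ensembl_to_entrez):
    """Re-index a gene-by-sample table by Entrez ID.

    `ensembl_to_entrez` is a hypothetical dict standing in for an
    annotation lookup; unmapped IDs are labelled "NA" in this sketch.
    """
    entrez = [ensembl_to_entrez.get(gene, "NA") for gene in data.index]
    converted = data.copy()
    converted["entrez"] = entrez          # keep the raw mapping as a column
    converted.index = make_unique(entrez)  # unique labels for the index
    return converted


# Toy placeholder IDs, not real annotations.
toy = pd.DataFrame({"s1": [1.0, 2.0, 3.0]}, index=["ENSG_A", "ENSG_B", "ENSG_C"])
mapping = {"ENSG_A": "100", "ENSG_B": "100"}
converted = convert_index(toy, mapping)
print(converted.index.tolist())  # ['100', '100.1', 'NA']
```

Keeping the raw `entrez` column next to the suffixed index makes it possible to collapse or filter duplicated genes later, instead of treating `100.1` as a separate gene.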

R/README.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# R Utils for the project
+
+### Requirements
+```shell
+Rscript notebook/installers/installer_Rpackage.R
+```
+
+#### 1. deseq2_normalizaiton.R
+This code is an example to get normalized matrix from raw files, it does not have instruction
+
+#### 2. IDconverter.R
+This code is converter for Ensembl ID to Entrez ID, and input should be cleaned up. Input file should be CSV format.
+
+**Example:**
+```shell
+Rscript IDconverter.R inputpath/input.csv outputpath/output.csv
+```

README.md

Lines changed: 7 additions & 0 deletions
@@ -17,3 +17,10 @@
 * Slides (Ask to members)
 * S3 Bucket (Ask to members)
 * https://openkbc.github.io/multiple_sclerosis_proj/
+
+### Usage of docker container
+* Use docker-compose for using jupyter notebook
+```
+docker-compose up
+```
+* Access http://localhost:8888/token

notebook/README.md

Lines changed: 5 additions & 1 deletion
@@ -4,6 +4,7 @@

 ## Guide for docker volumes
 * Please mount or bind with this information
+* For getting data, please ask members to have s3 access
 ```yaml
 ## Local path:container path
 - notebook/notebook_lib:/home/jovyan/work/notebook_lib
@@ -15,9 +16,12 @@
 ## Library List
 | Name | Description | Reference or link |
 |---------|---------|---------|
-| NWPV2 | DEG function with pvalue integration | [github](https://github.com/swiri021/NWPV2/blob/master/README.md), [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3135688/) |
+| NWPV2 | DEG function with pvalue integration | [github](https://github.com/swiri021/NWPV2), [paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3135688/) |
+| gene_zscore | Getting Gene-set Zscore(Activation Score) for data | [github](https://github.com/swiri021/Threaded_gsZscore), [paper](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2006-7-10-r93) |
+

 ## Utils List
 | Name | Description | Reference or link |
 |---------|---------|---------|
 | OpenKbcMSToolkit | Handy toolkit for data extraction | No reference |
+| OpenKbcMSCalculator | Advanced calculators for getting result | No reference |

notebook/RFECV_with_allgenes.ipynb

Lines changed: 233 additions & 0 deletions
Large diffs are not rendered by default.

notebook/getDEG_with_nwpv.ipynb

Lines changed: 0 additions & 149 deletions
This file was deleted.

notebook/installers/installer_Rpackage.R

Lines changed: 3 additions & 1 deletion
@@ -2,4 +2,6 @@
 if (!requireNamespace("BiocManager", quietly = TRUE))
     install.packages("BiocManager", repos='http://cran.us.r-project.org')
 BiocManager::install("DESeq2")
-BiocManager::install("tximport")
+BiocManager::install("tximport")
+BiocManager::install("AnnotationDbi")
+BiocManager::install("org.Hs.eg.db")

notebook/installers/requirements.txt

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
-feather-format==0.4.1
+feather-format==0.4.1
+scikit-learn==0.24.2
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
+__author__ = "Junhee Yoon"
+__version__ = "1.0.0"
+__maintainer__ = "Junhee Yoon"
+__email__ = "[email protected]"
+
+"""
+Manual: https://github.com/swiri021/Threaded_gsZscore
+Reference: https://genomebiology.biomedcentral.com/articles/10.1186/gb-2006-7-10-r93
+Description: calculating activation score by using threaded z score
+"""
+import pandas as pd
+import numpy as np
+import threading
+import functools
+import itertools
+
+class funcThread(object):
+    def __init__(self):
+        print ("Loaded Threads")
+
+    def __call__(self, func):
+        @functools.wraps(func)
+        def run(*args, **kwargs):
+            print ("Number of Threads : %d"%(kwargs['nthread']))
+
+            threads = [None]*kwargs['nthread']
+            container = [None]*kwargs['nthread']
+
+            ####Divide Samples by number of threads
+            i_col = len(args[1].columns.tolist())
+            contents_numb = i_col/kwargs['nthread']
+            split_columns = [args[1].columns.tolist()[i:i+contents_numb] for i in range(0, len(args[1].columns.tolist()), contents_numb)]
+            if len(split_columns)>kwargs['nthread']:
+                split_columns = split_columns[:kwargs['nthread']-1] + [list(itertools.chain(*split_columns[kwargs['nthread']-1:]))]
+                #split_columns[len(split_columns)-2] = split_columns[len(split_columns)-2]+split_columns[len(split_columns)-1]
+                #split_columns = split_columns[:len(split_columns)-1]
+
+            ####Running threads
+            for i, item in enumerate(split_columns):
+                threads[i] = threading.Thread(target = func, args=(args[0], args[1].ix[:,item], container, i), kwargs=kwargs)
+                threads[i].start()
+            for i in range(len(threads)):
+                threads[i].join()
+
+            return pd.concat(container, axis=0)
+
+        return run
+
+
+class calculator(object):
+
+    def __init__(self, df):
+        if df.empty:
+            raise ValueError("Input Dataframe is empty, please try with different one.")
+        else:
+            self.df = df
+
+    # Wrapper for controlling Threads
+    def gs_zscore(self, nthread=5, gene_set=[]):
+        arr1 = self.df
+        container = None
+        i = None
+
+        return self._calculating(arr1, container, i, nthread=nthread, gene_set=gene_set)
+
+    # function structure
+    # args(input, container, thread_index , **kwargs)
+    @funcThread()
+    def _calculating(self, arr1, container, i, nthread=5, gene_set=[]):
+        zscore=[]
+        arr1_index = arr1.index.tolist()
+        inter = list(set(arr1_index).intersection(gene_set))
+
+        diff_mean = arr1.loc[inter].mean(axis=0).subtract(arr1.mean(axis=0))
+        len_norm = arr1.std(ddof=1, axis=0).apply(lambda x: np.sqrt(len(inter))/x)
+        zscore = diff_mean*len_norm
+        zscore = zscore.to_frame()
+        zscore.columns = ['Zscore']
+        container[i] = zscore
+        ##No Return
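Review note: as committed, this module appears to assume Python 2 and an older pandas. Under Python 3, `i_col/kwargs['nthread']` is a float and cannot serve as a slice bound or `range` step, and `DataFrame.ix` was removed in pandas 1.0. The sketch below shows one way the same column splitting and gene-set z-score could be written for Python 3. It is a standalone illustration, not the author's implementation: the function names `split_columns`, `gs_zscore_chunk` and `gs_zscore` are hypothetical, and the `max(..., 1)` guard for fewer samples than threads is an added assumption.

```python
import threading

import numpy as np
import pandas as pd


def split_columns(columns, nthread):
    """Split column labels into at most `nthread` contiguous chunks."""
    size = max(len(columns) // nthread, 1)  # integer chunk size, never 0
    chunks = [columns[i:i + size] for i in range(0, len(columns), size)]
    if len(chunks) > nthread:
        # Fold the overflow chunks into the last allowed chunk.
        chunks = chunks[:nthread - 1] + [sum(chunks[nthread - 1:], [])]
    return chunks


def gs_zscore_chunk(df, gene_set, container, i):
    """Compute the gene-set z-score per sample and store it in container[i]."""
    inter = sorted(set(df.index).intersection(gene_set))
    diff_mean = df.loc[inter].mean(axis=0) - df.mean(axis=0)
    len_norm = np.sqrt(len(inter)) / df.std(ddof=1, axis=0)
    container[i] = (diff_mean * len_norm).to_frame("Zscore")


def gs_zscore(df, gene_set, nthread=4):
    """Run gs_zscore_chunk on column chunks in threads and stack the results."""
    chunks = split_columns(df.columns.tolist(), nthread)
    container = [None] * len(chunks)  # one slot per chunk actually created
    threads = [
        threading.Thread(
            target=gs_zscore_chunk,
            args=(df.loc[:, cols], gene_set, container, i),
        )
        for i, cols in enumerate(chunks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return pd.concat(container, axis=0)
```

Sizing `threads` and `container` by the number of chunks, rather than by `nthread`, avoids calling `join()` on an unused `None` slot when there are fewer chunks than threads.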

notebook/notebook_lib/nwpv/nwpv.py

Lines changed: 8 additions & 4 deletions
@@ -1,13 +1,17 @@
-from .statistics import STAT
-from scipy import stats
-import numpy as np
+__author__ = "Junhee Yoon"
+__version__ = "1.0.0"
+__maintainer__ = "Junhee Yoon"
+__email__ = "[email protected]"
+
 """
 Manual: https://github.com/swiri021/NWPV2
 Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3135688/
 Description: Method of combined p-values for getting DEG in dataset
 """

-
+from .statistics import STAT
+from scipy import stats
+import numpy as np

 class nwpv_calculation(object):
     def _preprocessing(self, min_adj=1e-16, max_adj=0.9999999999999999):
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+from notebook_lib.nwpv.nwpv import nwpv_calculation
+from notebook_lib.gene_zscore.threaded_gzscore import calculator as gzscore_class
+
+class AdvancedCalculators(object):
+    def nwpv_calculator(self, test : list, contol : list, data, save : bool = True):
+        """
+        NWPV calculator
+        Input
+        test : test sample list
+        control : control sample list
+        data : actual input data
+        save : saving output or not ?
+
+        """
+        #NWPV calculation
+        nwpv_class = nwpv_calculation(data, test, contol)
+        result = nwpv_class.get_result()
+
+        if save==True:
+            result.to_csv("resultFiles/nwpv_result.csv")
+
+        return result
+
+    def activation_score(self, data, gene_set : list):
+
+        """
+        gene zscore calculator
+        Input
+        data : actual input data
+        gene set : gene set input for calculating activate score
+        save : saving output or not ?
+
+        """
+
+        #### Init Class and check input file
+        zscore_calculator = gzscore_class(data)
+
+        #### Input list should be EntrezIDs(Pathways)
+        result = zscore_calculator.gs_zscore(nthread=4, gene_set=gene_set)
+        return result
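Review note: `nwpv_calculator` writes to the relative path `resultFiles/nwpv_result.csv`, and `DataFrame.to_csv` does not create missing directories, so the call fails when `resultFiles/` is absent from the working directory. A minimal sketch of a guard follows, assuming a hypothetical helper `save_result` that is not part of this commit; the default directory and file name are copied from the diff.

```python
from pathlib import Path

import pandas as pd


def save_result(result: pd.DataFrame,
                out_dir="resultFiles",
                filename="nwpv_result.csv") -> Path:
    """Write `result` as CSV, creating the output directory if needed."""
    out_path = Path(out_dir) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)  # no error if it exists
    result.to_csv(out_path)
    return out_path


# Possible use inside nwpv_calculator (hypothetical):
#     if save:
#         save_result(result)
```

Returning the written path lets a notebook cell report where the file ended up, which helps when the working directory differs between the container and the host.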
