From ead71d1b9880e3319d815a72208c092fb53e570a Mon Sep 17 00:00:00 2001
From: ducku
Date: Tue, 5 Mar 2024 13:57:05 -0800
Subject: [PATCH 1/4] Update README to include gfa node renaming and using gaf files in prepare_local_chunk.sh

---
 README.md | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index a9fea47a..b0ef0d18 100644
--- a/README.md
+++ b/README.md
@@ -188,11 +188,13 @@ It assumes that the graph represents some region along some reference path that
 
 It assumes that path names in the subgraph *don't* use subregion suffixes (bracket-enclosed numbers). The path name used in the region should *exactly* match the name of one of the paths in the graph.
 
+`prepare_local_chunk.sh` also accepts `.gaf` files, which will automatically converted into a gam file using `vg convert`.
+
 For example, you can run it like:
 
 ```
 cd exampleData/
-../scripts/prepare_local_chunk.sh -x subgraph.gbz -r chr5:1023911-1025911 -g subgraph_reads.gam -g other_sample_reads.gam -o subgraph1 >> subgraphs.bed
+../scripts/prepare_local_chunk.sh -x subgraph.gbz -r chr5:1023911-1025911 -g subgraph_reads.gam -g other_sample_reads.gam -g another_sample_reads.gaf -o subgraph1 >> subgraphs.bed
 ```
 
 Your graph can be a `.vg`, `.xg`, `.gfa`, or any other graph format understood by vg, but it *must* be in the same node ID space as your reads, and the script does *not* check this for you! In particular, indexing a GFA graph and mapping to it with `vg giraffe` can result in the original GFA nodes being cut into manageable pieces and assigned new numbers in the graph that the reads actually are aligned to, meaning the original GFA won't work here. You can check your reads against your graph with `vg validate subgraph.gfa --gam subgraph_reads.gam`. If your read alignments look completely absurd and jump all over the place, this is likely the problem.
@@ -201,6 +203,19 @@ If the original subgraph file does not remain in place under the configured `dat
 
 The net result will be that you can select the BED file, select the region it specifies, and view a precomputed view of the subgraph, with coordinates computed assuming it covers the region provided to `prepare_local_chunk.sh`.
 
+A note on naming node IDs when using `.gfa` files:
+VG keeps node IDs the same when all node names are integers. However, node IDs are renamed upon encountering string-named nodes. Renaming begins at the first encounter of a string-named node, using the highest integer encountered so far (+1), or 1 if the first node is string-named in the GFA file. Future nodes are renamed in a +1 manner regardless of their datatype.
+
+Here's an example of a rename
+
+```
+Original -> Renamed
+3 -> 3
+1 -> 1
+five -> 4
+7 -> 5
+four -> 6
+```
 
 #### Development Mode
 

From c2b5d1488b612b9d51784d0eee369c06d7b58ff4 Mon Sep 17 00:00:00 2001
From: ducku
Date: Tue, 5 Mar 2024 13:57:41 -0800
Subject: [PATCH 2/4] Allow prepare_local_chunk.sh to convert gaf files into gam files

---
 scripts/prepare_local_chunk.sh | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/scripts/prepare_local_chunk.sh b/scripts/prepare_local_chunk.sh
index a6463eff..ee3c3fdf 100755
--- a/scripts/prepare_local_chunk.sh
+++ b/scripts/prepare_local_chunk.sh
@@ -65,6 +65,24 @@ echo >&2 "Node colors: " ${NODE_COLORS[@]}
 rm -fr $OUTDIR
 mkdir -p $OUTDIR
 
+TEMP="script_temp"
+
+rm -fr $TEMP
+mkdir $TEMP
+
+
+# Covert GAF files to GAM
+for i in "${!GAM_FILES[@]}"; do
+    if [[ ${GAM_FILES[$i]} == *.gaf ]]; then
+        # Filename without path
+        filename=$(basename -- ${GAM_FILES[$i]})
+        # Remove file extension
+        filename=${filename%.*}
+        vg convert --gaf-to-gam ${GAM_FILES[$i]} ${GRAPH_FILE} > $TEMP/${filename}.gam
+        GAM_FILES[$i]="$TEMP/${filename}.gam"
+    fi
+done
+
 # Parse the region
 REGION_END="$(echo ${REGION} | rev | cut -f1 -d'-' | rev)"
 REGION_START="$(echo ${REGION} | rev | cut -f2 -d'-' | cut -f1 -d':' | rev)"
@@ -135,3 +153,5 @@ done
 
 cat $OUTDIR/regions.tsv | cut -f1-3 | tr -d "\n"
 printf "\t${DESC}\t${OUTDIR}\n"
+rm -fr $TEMP
+

From b94e8d185f9af7a5cd41afaf8ab050c0ef42ad88 Mon Sep 17 00:00:00 2001
From: ducku
Date: Tue, 5 Mar 2024 14:01:52 -0800
Subject: [PATCH 3/4] Fix typo

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index b0ef0d18..778f8b2f 100644
--- a/README.md
+++ b/README.md
@@ -188,7 +188,7 @@ It assumes that the graph represents some region along some reference path that
 
 It assumes that path names in the subgraph *don't* use subregion suffixes (bracket-enclosed numbers). The path name used in the region should *exactly* match the name of one of the paths in the graph.
 
-`prepare_local_chunk.sh` also accepts `.gaf` files, which will automatically converted into a gam file using `vg convert`.
+`prepare_local_chunk.sh` also accepts `.gaf` files, which will automatically be converted into a gam file using `vg convert`.
 
 For example, you can run it like:
 
@@ -206,7 +206,7 @@ The net result will be that you can select the BED file, select the region it sp
 A note on naming node IDs when using `.gfa` files:
 VG keeps node IDs the same when all node names are integers. However, node IDs are renamed upon encountering string-named nodes. Renaming begins at the first encounter of a string-named node, using the highest integer encountered so far (+1), or 1 if the first node is string-named in the GFA file. Future nodes are renamed in a +1 manner regardless of their datatype.
 
-Here's an example of a rename
+Here's an example of a rename:
 
 ```
 Original -> Renamed

From 2037f61ca060e9d0705c55e01850617e88b866be Mon Sep 17 00:00:00 2001
From: Adam Novak
Date: Wed, 6 Mar 2024 15:20:06 -0500
Subject: [PATCH 4/4] Apply suggestions from code review

---
 README.md                      | 2 +-
 scripts/prepare_local_chunk.sh | 5 +----
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 778f8b2f..3cf13b80 100644
--- a/README.md
+++ b/README.md
@@ -204,7 +204,7 @@ If the original subgraph file does not remain in place under the configured `dat
 The net result will be that you can select the BED file, select the region it specifies, and view a precomputed view of the subgraph, with coordinates computed assuming it covers the region provided to `prepare_local_chunk.sh`.
 
 A note on naming node IDs when using `.gfa` files:
-VG keeps node IDs the same when all node names are integers. However, node IDs are renamed upon encountering string-named nodes. Renaming begins at the first encounter of a string-named node, using the highest integer encountered so far (+1), or 1 if the first node is string-named in the GFA file. Future nodes are renamed in a +1 manner regardless of their datatype.
+VG keeps node IDs the same when all node names are strictly positive integers. However, node IDs are renamed upon encountering string-named nodes. Renaming begins at the first encounter of a string-named node, using the highest integer encountered so far (+1), or 1 if the first node is string-named in the GFA file. Future nodes are renamed in a +1 manner regardless of their datatype.
 
 Here's an example of a rename:
 
diff --git a/scripts/prepare_local_chunk.sh b/scripts/prepare_local_chunk.sh
index ee3c3fdf..7734f0f2 100755
--- a/scripts/prepare_local_chunk.sh
+++ b/scripts/prepare_local_chunk.sh
@@ -65,10 +65,7 @@ echo >&2 "Node colors: " ${NODE_COLORS[@]}
 rm -fr $OUTDIR
 mkdir -p $OUTDIR
 
-TEMP="script_temp"
-
-rm -fr $TEMP
-mkdir $TEMP
+TEMP="$(mktemp -d)"
 
 
 # Covert GAF files to GAM
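
Note on the README renaming rule: the rule this series documents for `.gfa` node names can be sketched in shell as below. This is only an illustration of the rule as the README text states it, not vg's actual implementation. The function name `rename_nodes` is hypothetical, and treating `0` or a zero-padded name as "string-named" is an assumption drawn from the "strictly positive integers" wording.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the README's renaming rule; not vg's real code.
# Takes GFA node names in file order and prints "original -> renamed".
rename_nodes() {
    local max_id=0 renaming=0 next_id=0 name
    for name in "$@"; do
        if (( ! renaming )) && [[ $name =~ ^[1-9][0-9]*$ ]]; then
            # Strictly positive integer seen before any string name: ID is kept.
            (( name > max_id )) && max_id=$name
            echo "$name -> $name"
        else
            if (( ! renaming )); then
                # First string-named node: start at highest integer so far + 1
                # (which is 1 when no integer name has been seen yet).
                renaming=1
                next_id=$(( max_id + 1 ))
            fi
            # From here on every node gets the next ID, whatever its name is.
            echo "$name -> $next_id"
            next_id=$(( next_id + 1 ))
        fi
    done
}

# Reproduces the example table from the README hunk.
rename_nodes 3 1 five 7 four
```

Run on the names from the README example, this prints the same mapping as the table in the patch (`five -> 4`, `7 -> 5`, `four -> 6`). It shows why reads aligned against the renamed graph will not line up with the original GFA once a string-named node has appeared.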