Several Perl scripts were used in this paper: Li X*, Yang Z, Wang Z, Li W, Zhang G, Yan H. Comparative Genomics of Pseudomonas stutzeri Complex: Taxonomic Assignments and Genetic Diversity. Frontiers in Microbiology 2022, DOI: 10.3389/fmicb.2021.755874.
- This script extracts protein sequences from genomes using multiple threads.
Usage: -dir genbank_file_directory [options]
First, place all genomes used for pan-genome analysis in the same directory, then run the following command: "perl stutzeri_123 -m 3". Here "stutzeri_123" is the directory containing the annotated genomes in GenBank format, and the "-m" parameter sets the number of threads. Detailed usage instructions are included in
After all predicted protein-coding sequences (CDSs) have been extracted from each of the 123 genomes separately using the script above, the output directory is used as input for OrthoFinder to infer orthogroups. When OrthoFinder finishes, Orthogroups.tsv and Orthogroups_UnassignedGenes.tsv are obtained. The former is a tab-separated text file in which each row contains the genes belonging to a single orthogroup. The latter is a tab-separated text file that is identical in format to Orthogroups.tsv but contains all of the genes that were not assigned to any orthogroup. By running the "cat" command on a Unix/Linux system, users can merge these two files into a single file (refer to merged_Orthogroups.txt), which contains all gene families for all analyzed genomes. This merged file is then reformatted using the Perl script described below, after which PGAP can use the reformatted file and "genomic_name.txt" as inputs to perform pan-genome analysis.
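The merge step can also be sketched in Python for users who cannot run "cat". This is only an illustration, not part of the original scripts: the function name merge_tables is hypothetical, and the sketch adds a newline after a file whose last line is unterminated, which plain "cat" does not do. If both tables begin with a header row, a plain concatenation keeps both rows.

```python
from pathlib import Path


def merge_tables(input_paths, output_path):
    """Write the input files to output_path one after another, in order."""
    with open(output_path, "w") as out:
        for path in input_paths:
            text = Path(path).read_text()
            out.write(text)
            # Unlike plain `cat`, terminate an unterminated last line so that
            # rows from two different files are never joined together.
            if text and not text.endswith("\n"):
                out.write("\n")
    return output_path
```

A call such as merge_tables(["Orthogroups.tsv", "Orthogroups_UnassignedGenes.tsv"], "merged_Orthogroups.txt") would produce the merged file named in the text.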
- This Perl script reformats the merged file so that PGAP can use it as input.
Usage: genomic_name.txt merged_Orthogroups.txt OUT.txt
"genomic_name.txt" is a tab-separated text file containing a single row that lists the names of all analyzed genomes.
"merged_Orthogroups.txt" is generated by merging Orthogroups.tsv and Orthogroups_UnassignedGenes.tsv.
"OUT.txt" is the output file, which is used as the input file for PGAP to perform pan-genome analysis.
Please refer to the PGAP manual for pan-genome analysis.
- This script concatenates alignment files into a pseudo-DNA FASTA file.
FOR EXAMPLE: perl -dir alignment_dir
"alignment_dir" is a directory containing the alignment files in FASTA format.
After the run, a concatenated sequence file named "concatenation.fasta" is produced in the directory where the script is located. Detailed usage instructions are included in
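To illustrate what concatenating alignments involves, here is a minimal Python sketch. It is not the Perl script itself and rests on assumptions that the text does not confirm: alignment files are matched by a hypothetical "*.fasta" pattern, sequences are matched across files by the first word of the FASTA header, files are joined in sorted filename order, and a genome missing from one alignment is padded with gaps.

```python
from pathlib import Path


def read_fasta(path):
    """Return {sequence_id: sequence} for one FASTA file."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]  # first word of the header
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def concatenate_alignments(alignment_dir, pattern="*.fasta"):
    """Join all alignments in a directory into one sequence per genome."""
    files = sorted(Path(alignment_dir).glob(pattern))
    alignments = [read_fasta(f) for f in files]
    taxa = sorted({name for aln in alignments for name in aln})
    merged = {taxon: [] for taxon in taxa}
    for path, aln in zip(files, alignments):
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{path} is empty or not a proper alignment")
        width = lengths.pop()
        for taxon in taxa:
            # Assumption: pad genomes absent from this alignment with gaps.
            merged[taxon].append(aln.get(taxon, "-" * width))
    return {taxon: "".join(parts) for taxon, parts in merged.items()}
```

The returned dictionary can then be written out as a single FASTA file, one record per genome.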
- This Perl script runs the whole COG annotation process, taking the COG database and protein sequences in FASTA format as inputs.
Usage: cog-20.fa cog-20.cog.csv merged_cogs.fa Pan_repersnet_seq.fasta COG_out 16
The latest version of the COG database can be downloaded from the website; it contains the cog-20.fa, cog-20.cog.csv, and merged_cogs.fa files.
"Pan_repersnet_seq.fasta" is the protein sequence file in FASTA format.
"16" is the number of threads used when running the script.
"COG_out" is the COG annotation result, a tab-separated text file in which each row contains the protein ID, the COG functional category (which may include multiple letters in order of importance), and the COG functional category as a single letter. If a gene is assigned to more than one COG category, each category is shown on a separate row.
For example, the content of "COG_out" file is as follows:
Shell|6(323_genes|109_taxa) LX L
Shell|6(323_genes|109_taxa) LX X
Softcore|7(320_genes|119_taxa) G G
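The row layout shown above can be reproduced with a short Python sketch. It only illustrates the one-row-per-letter expansion and is not the annotation script; the function names expand_categories and count_categories are hypothetical.

```python
from collections import Counter


def expand_categories(annotations):
    """Turn (protein_id, categories) pairs into one row per category letter."""
    rows = []
    for protein_id, categories in annotations:
        for letter in categories:
            # Keep the full category string and add the single letter.
            rows.append((protein_id, categories, letter))
    return rows


def count_categories(rows):
    """Count how many rows fall into each single-letter category."""
    return Counter(letter for _, _, letter in rows)
```

Given the pairs ("Shell|6(323_genes|109_taxa)", "LX") and ("Softcore|7(320_genes|119_taxa)", "G"), expand_categories returns the three rows of the example, and count_categories tallies one gene each for L, X, and G.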
The three housekeeping genes were obtained from the assemblies as follows. First, all genomes were placed in the same folder, which was used as input for gene sequence extraction. 16S rRNA genes were extracted using a script included in the Gcluster tool (Li et al., 2020). The nucleotide sequences of all genes in each genome were then obtained using the gene extraction script described below, and the gene IDs of gyrB and rpoD were identified using a script included in Gcluster, with the gyrB and rpoD genes of P. stutzeri A1501 as reference sequences. Finally, the nucleotide sequences of gyrB and of rpoD were each retrieved using the ID-based extraction script described below, with the nucleotide sequences of all genes for each genome and the corresponding gene IDs as input.
- This script extracts the nucleotide sequences of all genes for each genome.
Usage: -dir genbank_file_directory [options]
- This script extracts gene sequences by gene ID from a folder containing the nucleotide sequences of all genes for each genome.
Usage: folder (containing all gene sequences) gene_id_list (each row contains one gene ID)
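The ID-based lookup can be sketched in Python as follows. This is an illustration under assumptions rather than the Perl script: it treats every regular file in the folder as FASTA, takes the gene ID to be the first word of each FASTA header, and uses the hypothetical function names iter_fasta and extract_by_id.

```python
from pathlib import Path


def iter_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_by_id(fasta_dir, id_list_path):
    """Return {gene_id: sequence} for the IDs listed one per row."""
    lines = Path(id_list_path).read_text().splitlines()
    wanted = {line.strip() for line in lines if line.strip()}
    found = {}
    for fasta in sorted(Path(fasta_dir).iterdir()):
        if not fasta.is_file():
            continue
        for header, sequence in iter_fasta(fasta):
            words = header.split()
            # Assumption: the gene ID is the first word of the header.
            if words and words[0] in wanted:
                found[words[0]] = sequence
    return found
```

IDs from the list that do not occur in any file are simply absent from the result, so comparing the returned keys with the ID list shows which genes were not found.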