I am by no way an expert in R or in bioinformatics in general but my last job in which I was providing various bioinformatics services I picked up a few skills, mostly code and result troubleshooting-related and I think its time I contributed to the masses of help I’ve received from other online blogs, forums and scientists via e-mail and GitHub.
In this blog I am going to describe the steps you can take to install R, genoPlotR, a few plugins in a linux machine (I have Ubuntu 16 which I upgraded in my biolinux 8 operating system) and other software and make a publication quality ACT-like image of a comparison of genomic DNA sequences or genomic regions.
1) Installing or upgrading R
My machine had R-3.2.3. This may be good enough but if you’re going to start installing things and creating files you want your work to give you “rewards” in the form of a platform you can use for as long as possible before you start finding that the new packages you want to use are incompatible with your old R version and you start having to search in archives for compatible packages and help. There are various help pages on the web on how to migrate your work if you do need to move from an older version of R to a newer one, I’m not going to go into that here but you can find most of the information you need here:
The guide directs you to https://CRAN.R-project.org/mirrors.html to obtain R. This link gives you a list of mirrors from which you have to chose one. In my case I chose the first UK link what is for the University of Bristol. If you select that mirror you get some information on R such as the latest release (Monday 2016-10-31, Sincere Pumpkin Patch) R-3.3.2.tar.gz) and options to download R for Linux, Mac, or Windows. Once you click on the option to download you then need to select your linux version from the next page (I picked ubuntu).
The lazy way in which you can then install R is using typing the following text into a terminal:
sudo apt-get install r-base r-base-dev
In my case I got the following messages:
r-base is already the newest version (3.2.3-4).
r-base-dev is already the newest version (3.2.3-4).
Be aware that because of how rapidly things change and how databases and links are not always updated or maintained you have to expect that you are being lied to and double-check everything.
If you have an eye for detail and followed the guide above you’ll be blinking at the 3.2.3 and wondering how that is the new version when you’ve just been told the new version is 3.3.2.
Unfortunately that’s life. Ignore the message and download the newest version of R . There’s a link to “R Sources” in the left sidebar under Software which gives you a link to R-3.3.2.tar.gz.
Download the file to a folder of your choice and unpack it with
tar -xf R-3.3.2.tar.gz
The R manual gives a list of ways to install the new R version on your system.
I just used
You should get a list of messages on the lines of:
R is now configured for x86_64-pc-linux-gnu
Source directory: .
Installation directory: /usr/local
Congratulations, you now have a up-to-date version or R.
In my case I already had R-3.2.3 and because its not very big, just in case I happen to need to use a package which is incompatible with newer versions I decided to keep it. In order to have the newer version open when I type R in my command line I renamed R in the folder /usr/bin to R-3.2.3 and copied my new R file to /usr/local/bin
R 3.3.2 now opens by simply typing R in a terminal from any folder location.
2) Installing some R packages
To use R as a tool to aid in the analysis of biological data you probably don’t want to have to write your own scripts. The next thing to do if this is the case is install packages which will do the jobs you need to do for you.
A nice and easy to use package for creating genomic or cluster comparison files is genoPlotR, made and maintained by Lionel Guy and I’m going to continue talking about the 0.8.4 version (2015-07-02).
I chose this for several reasons. One is that it includes very convenient methods for importing your data from genbank files, blast files and even alignment files. This issue is not trivial because if your data is in various formats of gff, gff3, gtf annotation files and the tools you want to use tell you that you have to supply files with three columns with specific information then you either have to rewrite the information you need based on the annotation files you have or spend a few days tidying them up with various scripts. Also in my case I am a fan of ACT (Sanger) and find the similarity in its images and the convenience of using the same files to manipulate and annotate your files in ACT and Artemis then feed the same files into genoPlotR very useful.
To install packages start up R in your working folder by typing:
From there if you have a newly installed R equip yourself with Bioconductor:
If you’re new to R check out http://bioconductor.org/packages/release/BiocViews.html#___Software to get an idea as to what other packages you can use from bioconductor.
Next install the genoPlotR, its dependencies and any of the other packages you may need.
You can exit R by typing q()
3) Using R from command line
Now you have everything installed you need to start creating your R analysis pipeline.
You can just write commands in your terminal but I’m going to recommend creating a file in “getedit” or “notepad++” with your pipeline. I’m going to call mine “R_cluster_graphics.analysis”
You can add comments of what you are doing by beginning lines with #.
At any point you can run the script to see what comes out with the command:
R CMD BATCH R_cluster_graphics.analysis
This will create file named R_cluster_graphics.analysis.Rout in the folder you ran it from which you can open and check for error messages and for messages which answer questions in your analysis pipeline for example is.dna_seg which returns TRUE or FALSE.
First add commands to load the packages you need to use:
## State where you would like your images to be save to:
imgPath <- “../img”
pdfPath <- “../pdfs”
imgPath <- “/media/…/genoPlotR”
pdfPath <- “/media/…/genoPlotR”
##Import your genbank data (referred to as dna_seg):
#check dna_seq identity
#Create your Comparison files.
You can make these with blastall or blast+ in a terminal by formatting one sequence as a database and running a blastn or tblastx using the first dna_seg as input and the second one as the database. Here is an example of how to do this for two fasta files that match your genbank files and are in the same folder:
1) Navigate to your folder.
2) Format your database with:
formatdb -i SampleB.fasta -p F
3) Run a blastall search:
blastall -p tblastx -d /media/…/ SampleB.fasta -i /media/…/ SampleA.fasta -o SampleA_SampleB.tblastx -m 9
Alternatively you can also use the NCBI online blast tool at https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome and download your comparison files from there.
#Import your Comparison files.
SampleA_SampleB.comparison <- read_comparison_from_blast(“/media/…/SampleA_SampleB.tblastx”, sort_by = “per_id”,filt_high_evalue = NULL,filt_low_per_id =NULL,filt_length = NULL,color_scheme = NULL)
#check that your comparison file is acceptable:
If you would like to have a distance tree in your image make an alignment and a corresponding dendrogram. There are a number of command line tools and GUI software packages which can do this.
In my case I used MEGA7 to make an alignment with the muscle algorithm, you can try other algorithms and especially if your sequences are large or dissimilar I would recommend trying a few algorithms and picking the most appropriate alignment. You can get various version of MEGA from http://megasoftware.net/ I used the command line version of MEGA7 by opening megaproto (go to the installation folder and type megaproto), selected “non coding nucleotide alignment” and “muscle nuc alignment” from the Align option in the toolbar and changed the cluster method to Neighbor Joining for all iterations then saved these settings as a .mao file.
To make life easier I joined all my fasta sequences with the command
cat sequence1.fasta sequence2.fasta etc > allsequences.fasta
I then ran the alignment with the command:
/media/…/MEGA7/megacc -a /media/…/nucleotide_align.mao -d /media/…/allsequences.fasta -o MEGA7_allsequences.alignment
I then used the alignment to make a tree with RAxML version 8.2.4. which you can get from GitHub at https://github.com/stamatak/standard-RAxML
If you would are also going to use use RAxML check which version you need. If you are able to use multiple threads I would definitely pick the Pthreads option. Especially if you are planning on making trees out of huge alignments in the future I would definitely pick the best version you can. In my case I used the AVX version which is compatible with Intel i7.
You can download the files with the command
git clone https://github.com/stamatak/standard-RAxML.git
and install the appropriate version in the manner described in the Readme file in the software folder. In my case it was:
make -f Makefile.AVX.PTHREADS.gcc
Run RAxML to make your tree. I used the BS and ML recommended settings from the RAxML manual. It didn’t like my MEGA7 alignment so I converted it to a phylip alignment. There are many tools that can handle this but since I don’t have much on my new computer yet I used this link http://sequenceconversion.bugaco.com/converter/biology/sequences/
RAxMLHPC-PTHREADS-AVX -f a -x 12345 -p 12345 -# 100 -m GTRGAMMA -s /media/…/MEGA7_alignment.phylip -n MEGA7_alignment.RAxML -T 6
Note: If you can’t use threads remove the -T 6 bit
This creates a RAxML_bestTree.MEGA7_alignment.RAxML file
You can view your tree using various software packages. One example is dendroscope which comes with biolinux and by typing dendroscope into a terminal it opens a graphical user interphase (GUI) from which you can navigate to your tree file and quickly visualize it.
Give R your tree information by pasting the data from the RAxML_bestTree.MEGA7_alignment.RAxML file into a tree<- newick2phylog(“”) command :
#Pick the shapes you would like your CDS to appear as. I used arrows. You can see the other options with the command: gene_types()
gene_types(auto = FALSE)
SampleA$gene_type <- “arrows”
SampleB$gene_type <- “arrows”
#Give you dna_seg sequences CDS annotation
# Make your DNA-comparison image
pdf(file.path(pdfPath, “My_SampleABetc_comparison.pdf”), h=7, w=10)
plot_gene_map(dna_segs=list(SampleA, SampleB, next samples), comparisons=list(SampleA_SampleB.comparison, next comparisons),annotations=list(annotA,annotB,next annotations),override_color_schemes=TRUE, global_color_scheme=c(“e_value”, “auto”, “red_blue”, “0.5”),dna_seg_labels=c(“SampleA”,”SampleB”,”next labels”),tree=tree)
This should give you a nice comparison pdf which looks like what you always wanted ACT images to look like, in the folder you specified for pdfPath above.
The manual gives lots of detailed information on how you can change the parameters above to alter your image size, colours, annotation etc but I would like to mention that the power of using a file with the commands above to make your image is that once you have it and see what the output looks like you can go back to your genbank files and add any new CDS you discover or change their names then run the script again with “R CMD BATCH R_cluster_graphics.analysis” to get the new image. You can also make several different trees, call them tree1, tree2, tree3 then make several comparison images just by rerunning the script once with more copies of the “# Make your DNA-comparison image” commands with the appropriate tree name specified and the other lists and labels in the correct order. Here is an example:
## Make your comparison tree1
pdf(file.path(pdfPath, “My_SampleABetc_tree1.pdf”), h=7, w=10)
plot_gene_map(dna_segs=list(SampleA, SampleB, next samples), comparisons=list(SampleA_SampleB.comparison, next comparisons),annotations=list(annotA,annotB,next annotations),override_color_schemes=TRUE, global_color_scheme=c(“e_value”, “auto”, “red_blue”, “0.5”),dna_seg_labels=c(“SampleA”,”SampleB”,”next labels”),tree=tree1)
## Make comparison image for tree2 (you may need to make and specify more blastall comparison files)
pdf(file.path(pdfPath, “My_SampleABetc_tree2.pdf”), h=7, w=10)
plot_gene_map(dna_segs=list(SampleB, SampleA, next samples), comparisons=list(SampleB_SampleA.comparison, next comparisons),annotations=list(annotB,annotA,next annotations),override_color_schemes=TRUE, global_color_scheme=c(“e_value”, “auto”, “red_blue”, “0.5”),dna_seg_labels=c(“SampleB”,”SampleA”,”next labels”),tree=tree2)
#You can also add different types of annotation and gene_types by giving them different names then making even more images each with a different type of annotation or graphics for your CDS.
4) Check your images
Hopefully if you followed this guide you should have some very pretty pdf files of your genomic DNA comparisons now. It is a good idea to scrutinize your images and make sure they look realistic. If it looks like the comparison lines don’t seem quite right between two or more samples, for example you have 3 conserved geneA sets but the comparison lines give good identity to irrelevant regions, check that you used the correct reference and input sequences in your blast and try supplying a comparison file with the sequences give in the reverse order. Unfortunately genoPlotR does not automatically get the sample information from the blast files so this is something you have to be careful about.
For examples of the output images please see the genoPlotR page and I’ll add my images here once my paper is published.
5) Final words
I would like to give thanks to Lionel Guy himself in this post as besides creating this very useful and easy-to-use R tool (compared to many other R tools) in two instances where I could not tell from the manual how to deal with my annotations and tree, he responded in less than 24 hours to my e-mailed questions with precise and very helpful answers. This blog is by no means a substitute for the software manuals or a claim of expertise but rather what I hope is a simple guide for biologists which are unfamiliar with software manuals, R code and the sort of mistakes they can make if you are not aware of issues such as order of commands and recognition of data structures and need a simple guide on how to go about starting to use R to get a publication-quality image that looks like your lovely ACT comparison.
- ACT:Carver, Tim J., et al. “ACT: the Artemis comparison tool.” Bioinformatics 21.16 (2005): 3422-3423 http://www.sanger.ac.uk/science/tools/artemis-comparison-tool-act
- Artemis: Rutherford, Kim, et al. “Artemis: sequence visualization and annotation.” Bioinformatics 16.10 (2000): 944-945http://www.sanger.ac.uk/science/tools/artemis
- genoPlotR: Guy, Lionel, Jens Roat Kultima, and Siv GE Andersson. “genoPlotR: comparative gene and genome visualization in R.” Bioinformatics 26.18 (2010): 2334-2335 https://cran.r-project.org/web/packages/genoPlotR/index.html
- BLAST: http://www.ncbi.nlm.nih.gov/blast
- Bioconductor: http://bioconductor.org/
- MEGA7: http://megasoftware.net/
- RAxML: A. Stamatakis: “RAxML Version 8: A tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies”. In Bioinformatics, 2014, open access.
- GitHub: https://github.com/
- Biolinux: http://environmentalomics.org/bio-linux/
Difficult long Greek surnames.
I still remember the look of utter incomprehension and disbelief when, after being asked my surname in the UK I used to just tell people.
Me> My surname is Hatziioanou.
This was usually followed by a blank stare or silence over the phone (depending on the circumstance) during which I could actually picture strong “brain machines” grinding to an abrupt halt for a few seconds and then… ever so slowly and irregularly… grinding to a forced start as they battled to retain the echoes of this cryptic word,… and process it in order to find some familiar sounds which they can manage to repeat or scribble down.
This was always said with either annoyance, certainty or fear.
Person > Can you say that again please?
Me > Sure, it’s Hhhaaaaattzziiiiiiiioooaaannnnnouuuu
I would repeat slowly to make it easier for them.
After a few weeks in the UK I realized it was completely futile to tell people my surname, it was simply impossible for anyone to make head or tail from it. It was only when I had to face Chinese names while living in Taiwan years later that I realized what my surname must sound like to foreigners. For me it is a normal everyday easy to remember Greek surname which has a common structure composing of the words Hatzi and the name of one of my ancestors, Ioannis (or John), with the name’s ending changed to Ioannou to give it the meaning “of Ioannis”. I miss one of the “n”s out in English to make it slightly easier for people. That’s as easy as a name can get right? In Greece we often don’t even write the full name but use a shortened version such as H”Ioannou instead because “Hatzi+name” surnames are so common everyone immediately understands the name and how it is written in full.
I quickly learned to stop telling people my surname if it was to be recorded. Instead, when asked I would answer:
Me > Sure, would you like me to spell it out for you?
Person > *Suspicious confusion* um ok…
Me > That’s H, A, T, Z, double I, O, A, N, O, U. Would you like me to repeat that for you?
Person > *Look of disbelief* No, that’s alright……..
Person > …..And how do you pronounce that?
Me > Hatziioanou *smile*, thats Hhhaaaaattzziiiiiiiioooaaannnnnouuuu.
Person > Hjnoyidj?… znoyaaaanov?? right! *look of slightly guilty uncertainty and incomprehension*
It might not be the name they expected or the way they preferred to have surnames delivered but I quickly came to find my surname was wonderfully useful for starting a conversation or breaking the ice with people who had written surnames down as a matter of routine and had hardly looked at me before! Frequently I would be asked:
Person > So what does your surname mean?
Just so you know that I am not making up this story about my surname I would like to direct you to the following text copied from the link: http://genealogy.familyeducation.com/surname-origin/hatzis
“ Last name origin & meaning of Hatzis
Greek: from the vocabulary word khatzis ‘pilgrim (to Jerusalem)’, from the Arabic hajji ‘pilgrim (to Mecca)’, borne originally by Greek Muslims. Having completed a pilgrimage to the Holy Land was a mark of high social distinction. Often, this surname is a reduced form of a surname with Hatzi- as a prefix to a patronymic, naming the ancestor who performed the pilgrimage; e.g. Hatzimarkou ‘son of Mark the Pilgrim’, Hatzioannou ‘son of John the Pilgrim’.”
So there you are! I had an ancestor called John who went on a pilgrimage to Jerusalem and since then his descendents have had the surname: of John the pilgrim!
As with any translation of a name you will find it spelled in many different ways in English, the main one being Chatziioannou (which sounds completely different from the name in Greek) but they are all the same name, written the same way in Greek which people have one way or another hammered into Latin letters.
My conclusion from the whole story: I’ve come across so many people with difficult surnames that seem embarrassed to say their name and stutter it in low, barely audible voices or repeatedly shout it at you defensively to your surprise and confusion. If you have a difficult surname, make the most out of it and spell it out to people, smile at their efforts and explain it to them. If you’re the shy sort you have a powerful tool to start conversations and get to know people. If you’re faced with an impossible surname, don’t panic, ask for it to be spelled out and test your ability to say it back to them, you can both get a laugh out of it!