diff --git a/_posts/labnews/2011-02-21-fire-ant-genome-out.markdown b/_posts/labnews/2011-02-21-fire-ant-genome-out.markdown index f7dc050c..94d35294 100644 --- a/_posts/labnews/2011-02-21-fire-ant-genome-out.markdown +++ b/_posts/labnews/2011-02-21-fire-ant-genome-out.markdown @@ -13,7 +13,7 @@ tags: - work --- -Two papers just out! Our [Solenopsis invicta fire ant genome ](http://www.pnas.org/cgi/doi/10.1073/pnas.1009690108) paper is out in PNAS. Win! And a study [on fire ant Odorant Binding Proteins](http://www.plosone.org/article/info:doi/10.1371/journal.pone.0016289) in PLoS ONE. [Anurag Priyam](http://yeban.in) and are developing a [generic BLAST web interface](http://www.sequenceserver.com) in ruby. It's already super useful for our [fourmidable ant genome database](http://www.antgenomes.org), and I'm sure will be for others working with non-model organisms. (easy to use; less of a hassle to set up than gmod...). Using the server, you can [blast ant genome sequences](http://www.antgenomes.org/blast) (and predicted genes). +Two papers just out! Our [Solenopsis invicta fire ant genome ](http://www.pnas.org/cgi/doi/10.1073/pnas.1009690108) paper is out in PNAS. Win! And a study [on fire ant Odorant Binding Proteins](http://www.plosone.org/article/info:doi/10.1371/journal.pone.0016289) in PLoS ONE. Anurag Priyam and are developing a [generic BLAST web interface](http://www.sequenceserver.com) in ruby. It's already super useful for our [fourmidable ant genome database](http://www.antgenomes.org), and I'm sure will be for others working with non-model organisms. (easy to use; less of a hassle to set up than gmod...). Using the server, you can [blast ant genome sequences](http://www.antgenomes.org/blast) (and predicted genes). @@ -23,4 +23,4 @@ Two papers just out! 
Our [Solenopsis invicta fire ant genome ](http://www.pnas.o -Photo of fire ants on their genome (C) [Romain Libbrecht](http://www.unil.ch/dee/page50472_en.html) & [Yannick Wurm](http://www.sbcs.qmul.ac.uk/staff/yannickwurm.html) +Photo of fire ants on their genome (C) Romain Libbrecht & [Yannick Wurm](https://www.qmul.ac.uk/sbbs/staff/yannickwurm.html) diff --git a/_posts/labnews/2011-06-09-june-update.markdown b/_posts/labnews/2011-06-09-june-update.markdown index 8d3cde65..0f811493 100644 --- a/_posts/labnews/2011-06-09-june-update.markdown +++ b/_posts/labnews/2011-06-09-june-update.markdown @@ -6,12 +6,12 @@ layout: post slug: june-update title: May Taiwan Conf & June genome updates wordpress_id: 29 -categories: +categories: - labnews --- -Had a great two weeks visiting [John Wang's lab at Academia Sinica, Taiwan](http://biodiv.sinica.edu.tw/en2007/index.php?pi=157), and join National Taiwan University's [International Symposium on Social Insects](http://twentomolsoc.blogspot.com/2011/03/international-symposium-on-social.html) for wonderfully stimulating talks by [Jo Billen](http://bio.kuleuven.be/ento/), [Lars Chittka](http://chittkalab.sbcs.qmul.ac.uk/), [James Nieh,](http://www-biology.ucsd.edu/labs/nieh/) [Kenji Matsuura](http://www.agr.okayama-u.ac.jp/LIECO/englishpage.html) & [Bob Vander Meer](http://ars.usda.gov/pandp/people/people.htm?personid=5796). The symposium gave me the opportunity to share some thoughts about [sequencing genomes with high throughput technologies](http://yannick.poulet.org/publications/wurm2011antGenomeBehindTheScenes.pdf) in the journal of the Taiwan Entomological Society, [Formosan Entomologist](http://140.112.100.38/english.htm). 
+Had a great two weeks visiting John Wang's lab at Academia Sinica, Taiwan, and join National Taiwan University's [International Symposium on Social Insects](http://twentomolsoc.blogspot.com/2011/03/international-symposium-on-social.html) for wonderfully stimulating talks by [Jo Billen](http://bio.kuleuven.be/ento/), [Lars Chittka](http://chittkalab.sbcs.qmul.ac.uk/), [James Nieh,](http://www-biology.ucsd.edu/labs/nieh/) Kenji Matsuura & [Bob Vander Meer](http://ars.usda.gov/pandp/people/people.htm?personid=5796). The symposium gave me the opportunity to share some thoughts about [sequencing genomes with high throughput technologies](http://yannick.poulet.org/publications/wurm2011antGenomeBehindTheScenes.pdf) in the journal of the Taiwan Entomological Society, Formosan Entomologist. @@ -21,9 +21,9 @@ Had a great two weeks visiting [John Wang's lab at Academia Sinica, Taiwan](http - -In genomic news, the _Acromyrmex echinatior _leafcutter ant genome, led by [Sanne Nygaard](http://www1.bio.ku.dk/english/research/oe/cse/personer/sanne/) & [Koos Boosma](http://www1.bio.ku.dk/english/research/oe/cse/personer/koos/) is _in press_! The data are already on [Fourmidable](http://www.antgenomes.org); and Fourmdiable's [ant genome BLAST interface](http://www.antgenomes.org) was updated to the latest [SequenceServer](http://www.sequenceserver.com). + +In genomic news, the _Acromyrmex echinatior _leafcutter ant genome, led by Sanne Nygaard & Koos Boosma is _in press_! The data are already on [Fourmidable](http://www.antgenomes.org); and Fourmdiable's [ant genome BLAST interface](http://www.antgenomes.org) was updated to the latest [SequenceServer](http://www.sequenceserver.com). 
diff --git a/_posts/labnews/2011-07-03-shenzhen-social-insect-conference.markdown b/_posts/labnews/2011-07-03-shenzhen-social-insect-conference.markdown index cfd70f73..c8439634 100644 --- a/_posts/labnews/2011-07-03-shenzhen-social-insect-conference.markdown +++ b/_posts/labnews/2011-07-03-shenzhen-social-insect-conference.markdown @@ -6,12 +6,12 @@ layout: post slug: shenzhen-social-insect-conference title: Social insect genomics conference 2011 wordpress_id: 41 -categories: +categories: - labnews - genomics --- -Many interesting talks and stimulating discussions during [Shenzhen's Social Insect Genomics Conference](http://ldl.genomics.org.cn/event/conference.jsp?conId=31) which coincided with the release of Sanne Nygaard's [_Acromyrmex echinatior_ leaf-cutter ant genome paper](http://www.genome.org/cgi/doi/10.1101/gr.121392.111) showing adaptations linked to fungal farming. More excitement is on its way with next generation sociogenetics projects bubbling up around the world & across the phylogeny! +Many interesting talks and stimulating discussions during Shenzhen's Social Insect Genomics Conference which coincided with the release of Sanne Nygaard's [_Acromyrmex echinatior_ leaf-cutter ant genome paper](http://www.genome.org/cgi/doi/10.1101/gr.121392.111) showing adaptations linked to fungal farming. More excitement is on its way with next generation sociogenetics projects bubbling up around the world & across the phylogeny! diff --git a/_posts/labnews/2011-12-01-Solenopsis-invicta-fire-ant-genome-paper.markdown b/_posts/labnews/2011-12-01-Solenopsis-invicta-fire-ant-genome-paper.markdown index 6ca45cfd..9991a5c5 100644 --- a/_posts/labnews/2011-12-01-Solenopsis-invicta-fire-ant-genome-paper.markdown +++ b/_posts/labnews/2011-12-01-Solenopsis-invicta-fire-ant-genome-paper.markdown @@ -24,7 +24,7 @@ categories:
31.01.2011
Arizona State University: New quartet of ant genomes advanced by experts [pdf]
02.02.2011
-
Swiss Institute of Bioinformatics: Fire ant: The biggest genome ever sequenced in Switzerland [pdf]
+
Swiss Institute of Bioinformatics: Fire ant: The biggest genome ever sequenced in Switzerland [pdf]

Scientific Coverage

@@ -42,11 +42,11 @@ categories:

Traditional News Media

04.02.2011
-
Tribune de Geneve: Le genome, nouvelle arme contre les vilaines fourmis [pdf]
+
Tribune de Geneve: Le genome, nouvelle arme contre les vilaines fourmis [pdf]
01.02.2011
myScience: Fourmi de feu: le plus grand genome jamais sequence en Suisse [pdf]
01.02.2011
-
Tribune de Geneve: Universite de Lausanne: le genome de la fourmi de feu sequence [pdf]
+
Tribune de Geneve: Universite de Lausanne: le genome de la fourmi de feu sequence [pdf]
01.02.2011
Le Matin: Universite de Lausanne: le genome de la fourmi de feu sequence [pdf]
01.02.2011
@@ -54,13 +54,13 @@ categories:
01.02.2011
Basler Zeitung: Forscher entschlusseln das Erbgut von drei Ameisenarten [pdf]
01.02.2011
-
SwissInfo: Genome research could combat ant pest [pdf]
+
SwissInfo: Genome research could combat ant pest [pdf]
01.02.2011
Der Bund : Forscher entschlusseln das Erbgut von drei Ameisenarten [pdf]
01.02.2011
-
Thuner Tagblatt: Forscher entschlusseln das Erbgut von drei Ameisenarten [pdf]
+
Thuner Tagblatt: Forscher entschlusseln das Erbgut von drei Ameisenarten [pdf]
01.02.2011
-
24heures: Universite de Lausanne: le genome de la fourmi de feu sequence [pdf]
+
24heures: Universite de Lausanne: le genome de la fourmi de feu sequence [pdf]
02.02.2011
Radio Suisse Romande: informations matinales
@@ -71,9 +71,9 @@ categories:
01.02.2011
GenomeWeb: International Teams Publish Three New Ant Genome Studies [pdf]
01.02.2011
-
SF Chronicle: Ants' genome project might unlock mysteries [pdf]
+
SF Chronicle: Ants' genome project might unlock mysteries [pdf]
01.02.2011
-
Futurity: Genomes of menacing ants sequenced [pdf]
+
Futurity: Genomes of menacing ants sequenced [pdf]
diff --git a/_posts/labnews/2012-02-07-new-publications-new-job.markdown b/_posts/labnews/2012-02-07-new-publications-new-job.markdown index 7d50a4e9..a80a38b2 100644 --- a/_posts/labnews/2012-02-07-new-publications-new-job.markdown +++ b/_posts/labnews/2012-02-07-new-publications-new-job.markdown @@ -6,27 +6,27 @@ layout: post slug: new-publications-new-job title: New publications & New job wordpress_id: 67 -categories: +categories: - labnews - genomics --- New year, new country, new job: I am now a [Lecturer](http://en.wikipedia.org/wiki/Lecturer#United_Kingdom) at [Queen Mary University of London](http://www.qmul.ac.uk/). I will continue to use genomics and bioinformatics approaches to examine the interplay between social evolution and genome evolution. Get in touch if you're interested in working with me in a great place. -[![Queen mary qmul logo blue]({{ site.url }}/img/news/queen_mary_qmul_logo_blue.gif)](http://www.sbcs.qmul.ac.uk/staff/yannickwurm.html) +[![Queen mary qmul logo blue]({{ site.url }}/img/news/queen_mary_qmul_logo_blue.gif)](https://www.qmul.ac.uk/sbbs/staff/yannickwurm.html) And a few nice papers on which I am coauthor are now out. - * [PNAS: Relaxed selection is a precursor to the evolution of phenotypic plasticity](http://yannick.poulet.org/publications/hunt2011phenotypicPlasticity.pdf). Led by [Brendan Hunt](http://www.goodismanlab.biology.gatech.edu/hunt/) at Georgia Tech + * [PNAS: Relaxed selection is a precursor to the evolution of phenotypic plasticity](http://yannick.poulet.org/publications/hunt2011phenotypicPlasticity.pdf). 
Led by Brendan Hunt at Georgia Tech * [Bioinformatics: Visualization and quality assessment of de novo genome assemblies](http://yannick.poulet.org/publications/Bioinformatics-2011-Riba-Grognuz-3425-6.pdf) - led by [Oksana Riba-Grognuz](http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fwww.unil.ch%2Fdee%2Fpage81073_en.html&ei=fRcxT9HTMMi_0QWrhNSzBw&usg=AFQjCNE8Ei6RfZ8mIpMA03zncpmkodTzbg&sig2=7AxdNSLp5YWPl93zXvcf5Q) - [Source code available](https://github.com/ksanao/TGNet). - * [Trends in Genetics: The genomic impact of 100 million years of social evolution in seven ant species](http://yannick.poulet.org/publications/TiG2011.pdf). Reviews some of the findings from the [ant genome projects](http://www.antgenomes.org) & lays the foundation for coming analyses. [Romain Libbrecht](http://www.unil.ch/dee/page50472_en.html) & I made the photo that was selected as cover image! + * [Trends in Genetics: The genomic impact of 100 million years of social evolution in seven ant species](http://yannick.poulet.org/publications/TiG2011.pdf). Reviews some of the findings from the [ant genome projects](http://www.antgenomes.org) & lays the foundation for coming analyses. Romain Libbrecht & I made the photo that was selected as cover image! [![TIGs ant genomes]({{ site.url }}/img/news/TIGs_ant_genomes.png)](http://www.antgenomes.org) diff --git a/_posts/labnews/2012-02-18-oxford-nanopore-sequencing-a-revolution-for-non-model-organisms.markdown b/_posts/labnews/2012-02-18-oxford-nanopore-sequencing-a-revolution-for-non-model-organisms.markdown index 4df92df0..b1693c27 100644 --- a/_posts/labnews/2012-02-18-oxford-nanopore-sequencing-a-revolution-for-non-model-organisms.markdown +++ b/_posts/labnews/2012-02-18-oxford-nanopore-sequencing-a-revolution-for-non-model-organisms.markdown @@ -21,7 +21,7 @@ Exciting announcement of a new dirt-cheap machine-less DNA sequencing technology * it supposedly provide 100,000bp long reads. 
This will eliminate [most scaffolding issues]({{ site.url }}/news/2011-09-21-genome-analyses-for-non-model-organisms/) we have with assembling _de novo _genome sequence. - * using the [USB thumb-chip version](http://www.nanoporetech.com/technology/minion-a-miniaturised-sensing-instrument), no machine is required. Thus when you are [out in the field](http://vimeo.com/21287431), you can sequence right then and there - a potential workaround for worrying about tissue sample export permits... at least until new regulations appear! + * using the USB thumb-chip version, no machine is required. Thus when you are [out in the field](http://vimeo.com/21287431), you can sequence right then and there - a potential workaround for worrying about tissue sample export permits... at least until new regulations appear! @@ -31,7 +31,7 @@ Exciting announcement of a new dirt-cheap machine-less DNA sequencing technology - + diff --git a/_posts/labnews/2013-12-22-news.markdown b/_posts/labnews/2013-12-22-news.markdown index f4d9abb6..38eae552 100644 --- a/_posts/labnews/2013-12-22-news.markdown +++ b/_posts/labnews/2013-12-22-news.markdown @@ -5,18 +5,17 @@ date: 2013-12-22 17:07:12 layout: post slug: 2013-nonews title: Long time no update -categories: +categories: - labnews - teaching --- -No real update here for a while! Major events include: +No real update here for a while! 
Major events include: * Publication of our [social chromosome paper in Nature](http://www.nature.com/nature/journal/vaop/ncurrent/full/nature11832.html) - * BBSRC & NESCent funding to [crowdsource gene curation](http://afra.sbcs.qmul.ac.uk) + * BBSRC & NESCent funding to crowdsource gene curation * Google Summer of Code funding to help [identify problematic gene predictions](https://github.com/monicadragan/GeneValidator) * Support form the [Software Sustainability Institute](http://software.ac.uk) * [Lots of teaching](/teaching/) - * Some wonderful new [team members](/team), visitors, [colleagues](http://www.sbcs.qmul.ac.uk/people/index.html) & collaborators. - -Perhaps hope for more regular updates in 2014? We'll see. + * Some wonderful new [team members](/team), visitors, [colleagues](https://www.qmul.ac.uk/sbbs/people/index.html) & collaborators. +Perhaps hope for more regular updates in 2014? We'll see. diff --git a/_posts/labnews/2013-12-23-vacancies.markdown b/_posts/labnews/2013-12-23-vacancies.markdown index 20c0bc73..d2d49257 100644 --- a/_posts/labnews/2013-12-23-vacancies.markdown +++ b/_posts/labnews/2013-12-23-vacancies.markdown @@ -14,11 +14,11 @@ categories: Positions to be filled: * [Postdoc (renewable for up to 3 years) of genomics, transcriptomics, bioinformatics & population genomics work involving pollinators](/news/2014-10-31-pollinator-population-genomicist). This will be in collaboration with [Richard Gill](http://www3.imperial.ac.uk/people/r.gill) at Imperial College Silwood Park. 
- * Two four-year PhD positions available as part of [NERC's London Doctoral Training Program](http://www.sbcs.qmul.ac.uk/prospectivestudents/research/nercdtpstudentships/118400.html): - * [Social Chromosome Evolution](http://london-nerc-dtp.org/2013/11/27/social-chromosome-evolution/) co-supervised with [Judith Mank](http://www.ucl.ac.uk/mank-group/people.htm) - * [Evolutionary Genomics in Ants](http://london-nerc-dtp.org/2013/11/27/evolutionary-genomics-in-ants/) co-supervised by [Steve Rossiter](http://www.sbcs.qmul.ac.uk/staff/stephenrossiter.html) + * Two four-year PhD positions available as part of NERC's London Doctoral Training Program: + * Social Chromosome Evolution co-supervised with Judith Mank + * Evolutionary Genomics in Ants co-supervised by [Steve Rossiter](https://www.qmul.ac.uk/sbbs/staff/stephenrossiter.html) Please get in touch by email with a CV for more details. NERC PhD application deadline is mid-February. Cheers, Yannick. -NERC logo +NERC logo diff --git a/_posts/labnews/2014-01-27-reference.markdown b/_posts/labnews/2014-01-27-reference.markdown index e67c7881..01ffe1d3 100644 --- a/_posts/labnews/2014-01-27-reference.markdown +++ b/_posts/labnews/2014-01-27-reference.markdown @@ -3,49 +3,49 @@ layout: post title: Reference Letters date: 2014-01-27 comments: true -categories: +categories: - labnews - teaching - writing --- -Current or former students *very regularly* ask me for a reference to help them apply for a job or a new study program. The process is facilitated & the letter is improved by the following advice. +Current or former students *very regularly* ask me for a reference to help them apply for a job or a new study program. The process is facilitated & the letter is improved by the following advice. -If you need a reference letter from me, I need you to write a first draft. First, you are best positioned to know what makes you great for what you're applying for. 
Second, you'll end up with a better letter if my time is spent revising something than if I try to create something from scratch. +If you need a reference letter from me, I need you to write a first draft. First, you are best positioned to know what makes you great for what you're applying for. Second, you'll end up with a better letter if my time is spent revising something than if I try to create something from scratch. -Your draft should in the form of a letter from me about you (yes, it can feel awkward to write like this). You basically need to say that you are a great and justify why). Some general tips: +Your draft should in the form of a letter from me about you (yes, it can feel awkward to write like this). You basically need to say that you are a great and justify why). Some general tips: * Please respect the style guidelines given by Strunk & White's "The Elements of Style". -* Keep things concise. +* Keep things concise. * Use a spell-checker and a grammar-checker (on strict mode!). * It's better if the examples you use are relevant to the degree you're applying to. -* Don't highlight weaknesses. E.g. if you have a "C" in something don't mention it. -* Whatever you do, don't lie. Any lies will come back to hurt you 1000-fold (karma). +* Don't highlight weaknesses. E.g. if you have a "C" in something don't mention it. +* Whatever you do, don't lie. Any lies will come back to hurt you 1000-fold (karma). * Send it as a document I can edit (not a PDF). ** 2015 update: a much more [exhaustive list of writing tips here]({% post_url /labnews/2015-02-05-scientific-writing %}).** ** 2020 update: ** please use an automatic grammar and style checker such as [Grammarly](https://grammarly.go2cloud.org/SH2na) or Microsoft Word's grammar checker. They aren't magical solutions, but can help you a lot! -### Structure +### Structure -Introductory paragraph. This should include: +Introductory paragraph. 
This should include: * Why I am writing - * Why I know you well (e.g., I am your academic advisor/tutor/supervisor/lecturer since at Queen Mary since xxx when you started your degree in XX). - * Which degree you are doing and when you are expected to graduate. - * The last sentence should be a small list of ideas (see below), summarizing why you are great for the opportunity you're applying to. This also announces the structure of the subsequent pre-conclusion paragraphs (i.e., it should end with a list of 2 or 3 or 4 items as below). + * Why I know you well (e.g., I am your academic advisor/tutor/supervisor/lecturer since at Queen Mary since xxx when you started your degree in XX). + * Which degree you are doing and when you are expected to graduate. + * The last sentence should be a small list of ideas (see below), summarizing why you are great for the opportunity you're applying to. This also announces the structure of the subsequent pre-conclusion paragraphs (i.e., it should end with a list of 2 or 3 or 4 items as below). -One paragraph per idea (no ping-ponging back and forth!). Some examples of ideas: +One paragraph per idea (no ping-ponging back and forth!). Some examples of ideas: * academic achievements (e.g., coursework or overall grades, predicted final grade ("first?")) - * evidence that you are dedicated/serious/hardworking/intelligent/creative (e.g., based on your project, punctuality, behavior in tutorials). - * evidence that you have a good personality (e.g., social intelligence, teamwork, helping others). + * evidence that you are dedicated/serious/hardworking/intelligent/creative (e.g., based on your project, punctuality, behavior in tutorials). + * evidence that you have a good personality (e.g., social intelligence, teamwork, helping others). * extra-curricular activities (jobs, volunteering) -Conclusion: a quick summary stating that you're great for the degree/program/job because of the 3 or 4 ideas. 
+Conclusion: a quick summary stating that you're great for the degree/program/job because of the 3 or 4 ideas. -Overall, the reference should not take more than 1 page - people are unlikely to read anything that is longer. +Overall, the reference should not take more than 1 page - people are unlikely to read anything that is longer. --- -Thanks to Rob Hammond for telling me about The Elements of Style years ago. +Thanks to Rob Hammond for telling me about The Elements of Style years ago. diff --git a/_posts/labnews/2015-06-02-avoidgenomicsretractions.md b/_posts/labnews/2015-06-02-avoidgenomicsretractions.md index 883d6baa..7955c1cb 100644 --- a/_posts/labnews/2015-06-02-avoidgenomicsretractions.md +++ b/_posts/labnews/2015-06-02-avoidgenomicsretractions.md @@ -23,7 +23,7 @@ categories: ## Biology is a data-science -The dramatic [plunge in DNA sequencing costs](http://www.genome.gov/images/content/cost_megabase_.jpg) means that a single MSc or PhD student can now generate data that would have cost $15,000,000 only ten years ago. We are thus leaping from lab-notebook-scale science to research that requires extensive programming, statistics and high performance computing. +The dramatic plunge in DNA sequencing costs means that a single MSc or PhD student can now generate data that would have cost $15,000,000 only ten years ago. We are thus leaping from lab-notebook-scale science to research that requires extensive programming, statistics and high performance computing. This is exciting & empowering – in particular for small teams working on emerging model organisms that lacked genomic resources. But with great powers come great responsibilities... and risks of doing things wrong. These risks are far greater for genome biologists than, say physicists or astronomers who have strong traditions of working with large datasets. 
In particular: @@ -83,13 +83,13 @@ Additionally, the essentials of experimental design are long established: ensuri There is no way around it: analysing large datasets is hard. -When genomics projects involved tens of millions of $, much of this went to teams of dedicated data scientists, statisticians and bioinformaticians who could ensure data quality and analysis rigor. As sequencing has gotten cheaper the challenges [and costs](http://genomebiology.com/2011/12/8/125/figure/F1?highres=y) have shifted even further towards data analysis. For large scale human resequencing projects this is well understood. Despite the challenges being even greater for organisms with only few genomic resources, surprisingly many PIs, researchers and funders focusing on such organisms suppose that individual researchers with little formal training will be able to perform all necessary analysis. This is worrying and suggests that important stakeholders who still have limited experience of large datasets underestimate how easily mistakes with major negative consequences occur and go undetected. We may have to see additional publication retractions for awareness of the risks to fully take hold. +When genomics projects involved tens of millions of $, much of this went to teams of dedicated data scientists, statisticians and bioinformaticians who could ensure data quality and analysis rigor. As sequencing has gotten cheaper the challenges and costs have shifted even further towards data analysis. For large scale human resequencing projects this is well understood. Despite the challenges being even greater for organisms with only few genomic resources, surprisingly many PIs, researchers and funders focusing on such organisms suppose that individual researchers with little formal training will be able to perform all necessary analysis. 
This is worrying and suggests that important stakeholders who still have limited experience of large datasets underestimate how easily mistakes with major negative consequences occur and go undetected. We may have to see additional publication retractions for awareness of the risks to fully take hold. Thankfully, multiple initiatives are improving visibility of the data challenges we face (e.g., [1](http://www.nature.com/news/core-services-reward-bioinformaticians-1.17251), [2](https://www.epsrc.ac.uk/funding/calls/rsefellowships/), [3](http://www.nature.com/nature/journal/v498/n7453/full/498255a.html), [4](http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?_r=0), [5](http://ivory.idyll.org/blog/2015-docker-and-replicating-papers.html), [6](http://www.software.ac.uk)). Such visibility of the risks – and of how easy it is to implement practices that will improve research robustness – needs to grow among funders, researchers, PIs, journal editors and reviewers. This will ultimately bring more people to do better, more trustworthy science that will never need to be retracted. ## Acknowledgements -*This post came together thanks to the [SSI Collaborations workshop](http://software.ac.uk), [Bosco K Ho's post on Geoffrey Chang](http://boscoh.com/protein/a-sign-a-flipped-structure-and-a-scientific-flameout-of-epic-proportions.html), discussions in [my lab](http://wurmlab.github.io) and through interactions with colleagues at the [social insect genomics conference](https://meetings.cshl.edu/meetings/2015/insect15.shtml) and the [NESCent Genome Curation group](http://genomecuration.github.io). 
YW is funded by the Biotechnology and Biological Sciences Research Council [BB/K004204/1], the Natural Environment Research Council [NE/L00626X/1, [EOS Cloud](http://environmentalomics.org/portfolio/big-data-infrastructure/)] and is a fellow of the [Software Sustainablity Institute](http://software.ac.uk).* +*This post came together thanks to the [SSI Collaborations workshop](http://software.ac.uk), [Bosco K Ho's post on Geoffrey Chang](http://boscoh.com/protein/a-sign-a-flipped-structure-and-a-scientific-flameout-of-epic-proportions.html), discussions in [my lab](http://wurmlab.github.io) and through interactions with colleagues at the social insect genomics conference and the [NESCent Genome Curation group](http://genomecuration.github.io). YW is funded by the Biotechnology and Biological Sciences Research Council [BB/K004204/1], the Natural Environment Research Council [NE/L00626X/1, EOS Cloud] and is a fellow of the [Software Sustainablity Institute](http://software.ac.uk).* [Please cite The Winnower version of this article](https://thewinnower.com/papers/avoid-having-to-retract-your-genomics-analysis) diff --git a/_posts/labnews/2016-02-01-sequenceserverpaper.markdown b/_posts/labnews/2016-02-01-sequenceserverpaper.markdown index 950c54f7..c6565955 100644 --- a/_posts/labnews/2016-02-01-sequenceserverpaper.markdown +++ b/_posts/labnews/2016-02-01-sequenceserverpaper.markdown @@ -11,12 +11,12 @@ categories: --- - +Interactive Figure Happy to announce that we now have a manuscript describing the rationale and current features of SequenceServer - our easy to setup BLAST frontend. Importantly, the manuscript also provides extensive detail about the sustainable software development and user-centric design approaches we used to build this software. The full bioRxiv reference is:

Sequenceserver: a modern graphical user interface for custom BLAST databases 2015. Priyam, Woodcroft, Rai, Munagala, Moghul, Ter, Gibbins, Moon, Leonard, Rumpf and Wurm. bioRxiv doi: 10.1101/033142 [PDF].

-Be sure to check out the interactive figure giving a guided tour of Sequenceserver's BLAST results. +Be sure to check out the interactive figure giving a guided tour of Sequenceserver's BLAST results. Finally, I'll note that Sequenceserver arose from our own needs; these are clearly shared by many as Sequenceserver has already been cited in ≥20 publications and has been downloaded ≥30,000 times! Thanks to all community members who have made this tool successful. diff --git a/_posts/labnews/2016-04-25-GoogleSummerOfBioinformaticsCode.markdown b/_posts/labnews/2016-04-25-GoogleSummerOfBioinformaticsCode.markdown index 882f9d80..db229020 100644 --- a/_posts/labnews/2016-04-25-GoogleSummerOfBioinformaticsCode.markdown +++ b/_posts/labnews/2016-04-25-GoogleSummerOfBioinformaticsCode.markdown @@ -13,5 +13,5 @@ categories: Congratulations to our 2016 [Google Summer of Code](https://en.wikipedia.org/wiki/Google_Summer_of_Code) students! We are pround & excited to host them: * Hiten Chowdhary (Indian Institue of Technology, Karaghpur) will create a **BLAST result visualization methods** for [BioRuby](http://bioruby.org) and [SequenceServer](http://www.sequenceserver.com). This work should significantly facilitate the interpretation of results produced with our Sequenceserver custom BLAST-ing tool (see Sequenceserver: a modern graphical user interface for custom BLAST databases; Priyam et al. 2015 BioRxiv). This project is part of the [Open Genome Informatics](https://summerofcode.withgoogle.com/organizations/6212058194378752/) organization; supervision by [Priyam](/team/priyam/) & Yannick. - - * Julian Mazzitelli (U Toronto) will improve Bionode's capabilities for performing **analyses of streams of biological data** in real-time as they are downloaded, computed, or generated. 
This project is part of the [Open Bioinformatics Foundation](https://summerofcode.withgoogle.com/organizations/5693436329984000/); supervision by [Bruno Vieira](/team/bmpvieira.html), [Max Ogden](http://maxogden.com/), [Mathias Buus](https://github.com/mafintosh) & Yannick. + + * Julian Mazzitelli (U Toronto) will improve Bionode's capabilities for performing **analyses of streams of biological data** in real-time as they are downloaded, computed, or generated. This project is part of the [Open Bioinformatics Foundation](https://summerofcode.withgoogle.com/organizations/5693436329984000/); supervision by [Bruno Vieira](/team/bmpvieira.html), [Max Ogden](http://maxogden.com/), [Mathias Buus](https://github.com/mafintosh) & Yannick. diff --git a/_posts/labnews/2016-08-23-hiten-blast-visualization-gsoc.markdown b/_posts/labnews/2016-08-23-hiten-blast-visualization-gsoc.markdown index f6202100..328ff2d2 100644 --- a/_posts/labnews/2016-08-23-hiten-blast-visualization-gsoc.markdown +++ b/_posts/labnews/2016-08-23-hiten-blast-visualization-gsoc.markdown @@ -3,7 +3,7 @@ layout: post author: Hiten Chowdhary title: Blast Visualization Google Summer of Code modified: 2016-08-23 -categories: +categories: - labnews image: feature: gsoc.png @@ -13,7 +13,7 @@ comments: true share: true --- -*Written by [Hiten Chowdhary](http://www.hiten.io/), cross-posted from [http://www.hiten.io/blog/articles/gsoc-16/](http://www.hiten.io/blog/articles/gsoc-16/)* +*Written by [Hiten Chowdhary](http://www.hiten.io/), cross-posted from www.hiten.io/blog/articles/gsoc-16/* This post is going to be about my GSoC 2016 project under Open Genome Informatics organisation along with Anurag Priyam and Yannick Wurm as my mentors. 
diff --git a/_posts/labnews/2016-09-01-GreatGoogleSummerOfBioinformaticsCode.markdown b/_posts/labnews/2016-09-01-GreatGoogleSummerOfBioinformaticsCode.markdown index f9f5cff4..a392f57a 100644 --- a/_posts/labnews/2016-09-01-GreatGoogleSummerOfBioinformaticsCode.markdown +++ b/_posts/labnews/2016-09-01-GreatGoogleSummerOfBioinformaticsCode.markdown @@ -16,11 +16,10 @@ Google summer of code 2016 has just came to an end. Thanks to our host organisat - - * After reviewing in detail the [strengths and weaknesses of bash, make, snakemake and nextflow as biological analysis pipelines](//github.com/thejmazz/jmazz.me/blob/master/content/post/ngs-workflow.md), [Julian Mazzitelli](//www.jmazz.me) created [Bionode waterwheel](//github.com/bionode/bionode-watermill), a tool demonstrating the capabilities of javascript streams for real-time analysis of biological data. [Read more about how it works.](//github.com/bionode/bionode-watermill/blob/master/README.md) + + * After reviewing in detail the [strengths and weaknesses of bash, make, snakemake and nextflow as biological analysis pipelines](https://github.com/thejmazz/jmazz.me/blob/master/_posts/NGS-Workflows.md), Julian Mazzitelli created [Bionode waterwheel](//github.com/bionode/bionode-watermill), a tool demonstrating the capabilities of javascript streams for real-time analysis of biological data. [Read more about how it works.](//github.com/bionode/bionode-watermill/blob/master/README.md) As the finishing touches are implemented, we look forward to being able to deploy the work of these students into production releases of [SequenceServer](//www.sequenceserver.com) and [Bionode](//bionode.io). 
- diff --git a/_posts/labnews/2017-01-30-BriefNewYearsUpdate.markdown b/_posts/labnews/2017-01-30-BriefNewYearsUpdate.markdown index 4e6912df..8bb9da2f 100644 --- a/_posts/labnews/2017-01-30-BriefNewYearsUpdate.markdown +++ b/_posts/labnews/2017-01-30-BriefNewYearsUpdate.markdown @@ -13,6 +13,4 @@ Just a brief update to: * congratulate Emeline Favreau, Carlos Martinez-Ruiz and Eckart Stolle on their great presentations at the [London NW-Europe IUSSI meeting](http://www.iussi.org/NWEurope/meetings.htm) and at [Popgroup 50 in Cambridge](http://populationgeneticsgroup.org.uk). * congratulate [Anurag Priyam](/team/priyam) who is *finally* joining us to begin a PhD. - * congratulate [Bruno Vieira](/team/bmpvieira.html) on his [Mozilla Science Fellowship](https://science.mozilla.org/programs/fellowships/fellows). - - + * congratulate [Bruno Vieira](/team/bmpvieira.html) on his Mozilla Science Fellowship. diff --git a/_posts/labnews/2017-02-17-social-supergene-evolution.markdown b/_posts/labnews/2017-02-17-social-supergene-evolution.markdown index 2aec38b9..079a9bcd 100644 --- a/_posts/labnews/2017-02-17-social-supergene-evolution.markdown +++ b/_posts/labnews/2017-02-17-social-supergene-evolution.markdown @@ -24,7 +24,7 @@ queens. The team had previously discovered that colony type is determined by a chromosome that carries one of two variants of a ‘supergene’ region containing more than 500 genes.

In a new research paper, published in the journal Molecular Ecology, the team from QMUL’s School of Biological and Chemical +"https://www.qmul.ac.uk/sbbs/">School of Biological and Chemical Sciences sequenced the DNA and compared the genomes of two types of individuals: those carrying the supergene version responsible for colonies with a single queen, and those carrying @@ -35,9 +35,9 @@ homogeneously over the entire length of the supergene. This suggests that a single event, such as a large chromosomal rearrangement, was responsible for the origin of this remarkable system for determining social organisation,” said lead author -Dr +Dr Yannick Wurm from QMUL’s School of Biological and Chemical +"https://www.qmul.ac.uk/sbbs/">School of Biological and Chemical Sciences.

Evolutionary advantage?

The team also discovered a large number of unfavourable @@ -50,9 +50,7 @@ advantages of having several queens in the colony outweigh the costs of the unfavourable mutations in the supergene region.”

This finding can help scientists understand how chromosomes evolve over time.

-

Rodrigo -Pracana, a PhD student at QMUL and first author of the study, +

Rodrigo Pracana, a PhD student at QMUL and first author of the study, said: “We know that the Y chromosome in mammals has also been affected by unfavourable mutations. It is exciting to see that the fire ant social chromosome has evolved in a similar way to the diff --git a/_posts/labnews/2017-02-21-scientists_explore_the_evolution_of_a_social_supergene_in_the_red_fire_ant.md b/_posts/labnews/2017-02-21-scientists_explore_the_evolution_of_a_social_supergene_in_the_red_fire_ant.md index dc37977d..12242691 100644 --- a/_posts/labnews/2017-02-21-scientists_explore_the_evolution_of_a_social_supergene_in_the_red_fire_ant.md +++ b/_posts/labnews/2017-02-21-scientists_explore_the_evolution_of_a_social_supergene_in_the_red_fire_ant.md @@ -20,9 +20,9 @@ Red fire ants are found in two different types of colonies: some colonies have a ![Credit: Romain Libbrecht and Yannick Wurm](/img/news/red-fire-ant-c-yannick-wurm-640.jpg#center){: width="307" height="197" style="max-width:100%; height: auto"} -In a new research paper, published in the journal [*Molecular Ecology*](//onlinelibrary.wiley.com/journal/10.1111/(ISSN)1365-294X){:target="_blank"}, the team from QMUL’s [School of Biological and Chemical Sciences](//www.sbcs.qmul.ac.uk/){:target="_blank"} sequenced the DNA and compared the genomes of two types of individuals: those carrying the supergene version responsible for colonies with a single queen, and those carrying the supergene variant responsible for colonies with multiple queens. 
+In a new research paper, published in the journal [*Molecular Ecology*](//onlinelibrary.wiley.com/journal/10.1111/(ISSN)1365-294X){:target="_blank"}, the team from QMUL’s [School of Biological and Chemical Sciences](https://www.qmul.ac.uk/sbbs/){:target="_blank"} sequenced the DNA and compared the genomes of two types of individuals: those carrying the supergene version responsible for colonies with a single queen, and those carrying the supergene variant responsible for colonies with multiple queens. -“We found that the two versions of the chromosome differ homogeneously over the entire length of the supergene. This suggests that a single event, such as a large chromosomal rearrangement, was responsible for the origin of this remarkable system for determining social organisation,” said lead author [Dr Yannick Wurm](//www.sbcs.qmul.ac.uk/staff/yannickwurm.html){:target="_blank"} from QMUL’s [School of Biological and Chemical Sciences](//www.sbcs.qmul.ac.uk/){:target="_blank"}. +“We found that the two versions of the chromosome differ homogeneously over the entire length of the supergene. This suggests that a single event, such as a large chromosomal rearrangement, was responsible for the origin of this remarkable system for determining social organisation,” said lead author [Dr Yannick Wurm](https://www.qmul.ac.uk/sbbs/staff/yannickwurm.html){:target="_blank"} from QMUL’s [School of Biological and Chemical Sciences](https://www.qmul.ac.uk/sbbs/){:target="_blank"}. #### Evolutionary advantage? @@ -32,7 +32,7 @@ Dr Wurm added: “It is likely that only a few genes among the hundreds present This finding can help scientists understand how chromosomes evolve over time. -[Rodrigo Pracana](//www.sbcs.qmul.ac.uk/staff/rodrigopracana.html){:target="_blank"}, a PhD student at QMUL and first author of the study, said: “We know that the Y chromosome in mammals has also been affected by unfavourable mutations. 
It is exciting to see that the fire ant social chromosome has evolved in a similar way to the human Y chromosome, although it controls social organisation and not sex.” +Rodrigo Pracana, a PhD student at QMUL and first author of the study, said: “We know that the Y chromosome in mammals has also been affected by unfavourable mutations. It is exciting to see that the fire ant social chromosome has evolved in a similar way to the human Y chromosome, although it controls social organisation and not sex.” #### A real pest @@ -47,4 +47,4 @@ Rodrigo Pracana added: “Our discoveries could help to develop novel pest contr - [The Wurm lab study](//wurmlab.github.io/){:target="_blank"} the lives of social insects including ants and bees. They combine behavioural experiments with genomics and bioinformatics approaches. -- Find out more about studying [postgraduate Ecological and Evolutionary Genomics MSc](//www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/121430.html){:target="_blank"} at QMUL's [School of Biological and Chemical Sciences](//www.sbcs.qmul.ac.uk/){:target="_blank"}. +- Find out more about studying [postgraduate Ecological and Evolutionary Genomics MSc](//www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/121430.html){:target="_blank"} at QMUL's [School of Biological and Chemical Sciences](/https://www.qmul.ac.uk/sbbs/){:target="_blank"}. diff --git a/_posts/labnews/2017-02-21-supergene-diversity-accepted.markdown b/_posts/labnews/2017-02-21-supergene-diversity-accepted.markdown index 56566835..c4061ff0 100644 --- a/_posts/labnews/2017-02-21-supergene-diversity-accepted.markdown +++ b/_posts/labnews/2017-02-21-supergene-diversity-accepted.markdown @@ -16,7 +16,7 @@ The fire ant social chromosomes carry a supergene that controls the number of qu * There is a large number non-synonymous substitutions between the two variants. * The never recombining variant Sb is almost fixed in the North American population. 
-You can check out [the press release](http://www.qmul.ac.uk/media/news/items/se/192904.html), which covers some of the details about our work. +You can check out the press release, which covers some of the details about our work. The full reference is: R Pracana, A Priyam, I Levantis, RA Nichols and Y Wurm. (2017) *The fire ant social chromosome supergene variant Sb shows low diversity but high divergence from SB* Molecular Ecology. DOI: 10.1111/mec.14054 diff --git a/_posts/labnews/2018-02-15-iussi_symposium_evolution_of_social_organization.markdown b/_posts/labnews/2018-02-15-iussi_symposium_evolution_of_social_organization.markdown index e2d26fc3..9dfc9c39 100644 --- a/_posts/labnews/2018-02-15-iussi_symposium_evolution_of_social_organization.markdown +++ b/_posts/labnews/2018-02-15-iussi_symposium_evolution_of_social_organization.markdown @@ -13,12 +13,12 @@ Join us in Guarujá! We (Emeline, Carlos & Yannick) are excited to host a symposium on the evolution of social organisation at the upcoming [IUSSI conference](http://iussi2018.com/). * [Tim Linksvayer](http://www.bio.upenn.edu/people/timothy-linksvayer) will give a plenary talk. - * We invite abstract submissions for talks and posters (deadline March 2nd!). + * We invite abstract submissions for talks and posters deadline March 2nd!. -We welcome a diversity of approaches and study systems. If you're unsure about the relevance of your work, don't hesitate to get in touch. +We welcome a diversity of approaches and study systems. If you're unsure about the relevance of your work, don't hesitate to get in touch. -Full symposium title and abstract below: +Full symposium title and abstract below: ### Evolution of social organization @@ -29,7 +29,4 @@ Understanding how and when changes in social lifestyle occur is central to the s Encompassing the complexities of such multifaceted topics requires interdisciplinary discussion. 
This symposium will thus include both theoretical and empirical research addressing the topic from a variety of scales and angles. - - - - + diff --git a/_posts/labnews/2018-10-10-better_genomics_analysis_code_at_IUSSI.markdown b/_posts/labnews/2018-10-10-better_genomics_analysis_code_at_IUSSI.markdown index 5cc248f8..8f87b121 100644 --- a/_posts/labnews/2018-10-10-better_genomics_analysis_code_at_IUSSI.markdown +++ b/_posts/labnews/2018-10-10-better_genomics_analysis_code_at_IUSSI.markdown @@ -24,7 +24,7 @@ This disruptive shift is largely due to the **50,000-fold drop in DNA sequencing A major challenge for small research labs now wielding in large genomic datasets is that it is easy to make a small mistake that [has](http://science.sciencemag.org/content/314/5807/1856.full) [high](http://science.sciencemag.org/content/351/6275/aaf3945) [costs](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0649-6). -In light of this, as part of a [workshop on genomics approaches](https://www.iussi2018.com/news) organised with Tim Linksvayer and Alex Mikheyev, I gave an overview of some of the lessons we can transfer from the worlds of "other" data sciences to our expanding world of social insect genomics. This includes: +In light of this, as part of a workshop on genomics approaches organised with Tim Linksvayer and Alex Mikheyev, I gave an overview of some of the lessons we can transfer from the worlds of "other" data sciences to our expanding world of social insect genomics. This includes: - writing analysis code for humans; - respecting style guides for code (e.g., [R style guide](http://adv-r.had.co.nz/Style.html)), and for [how to structure a genomic analysis](http://wurmlab.github.io/news/2018-10-01-project_structures/); - benefits of peer-reviewing code, and of peer-coding sessions; @@ -47,5 +47,3 @@ It is worth highlighting three additional, important points raised during the co A fun and highly stimulating conference. 
- - diff --git a/_posts/labnews/2019-02-27-phd_studentship.markdown b/_posts/labnews/2019-02-27-phd_studentship.markdown index 2f8040fe..4ce88840 100644 --- a/_posts/labnews/2019-02-27-phd_studentship.markdown +++ b/_posts/labnews/2019-02-27-phd_studentship.markdown @@ -9,28 +9,28 @@ categories: - labnews --- -We have an exciting PhD position open through the London NERC DTP. +We have an exciting PhD position open through the London NERC DTP. -[**Apply by March 18th**](https://www.qmul.ac.uk/sbcs/postgraduate/phd-programmes/projects/display-title-655614-en.html) on the QMUL website. +**Apply by March 18th** on the QMUL website. The studentship is funded by the London NERC DTP will cover tuition fees and provide an annual tax-free maintenance allowance for 4 years at the Research Council rate (£17,009 in 2019/20). Candidates must meet RCUK eligibility criteria (I think this means ok for UK citizens and medium-term residents). The project is *highly* interdisciplinary. -Great candidates fulfill at least 3 of the following 4 criteria: +Great candidates fulfill at least 3 of the following 4 criteria: * smart * hard working * understands genomes or social insects * not scared of data analysis or coding. -We can adapt the project to the students’ interests and background. +We can adapt the project to the students’ interests and background. If you have any questions regarding prerequisites, scope or nature of the project, please don't hesitate to get in touch with me (Yannick). ## Research context -We have two main lines of research, in collaboration with national and international colleagues and stakeholders. +We have two main lines of research, in collaboration with national and international colleagues and stakeholders. **Genetics of social behaviour**. Social animals exhibit a broad range of behaviors, and some theoretical understanding exists of the tradeoffs between different forms of social organisation. 
However, we know little about the genes and processes underpinning social organisation or how it evolves. The diversity of social behaviors across the 20,000 species of ants represents a unique opportunity to empirically understand the mechanisms and tradeoffs involved in social change. We use highly molecular approaches, including genomics and bioinformatics but also potentially behavioural or field work to address major questions about social evolution. We aim to generate exciting new insights into genes and processes underpinning a major social transition, with implications on understanding evolution of complex phenotypes. @@ -39,4 +39,3 @@ We have two main lines of research, in collaboration with national and internati ## Training The student will receive extensive training in big data bioinformatics, phylogenomics, data visualisation, and experimental research approaches in evolution and genomics. Furthermore, they will receive hands-on training in interdisciplinary project management, communicating science in writing and verbally, including by presenting at workshops and conferences. - diff --git a/_posts/labnews/2020-11-01-student_controlled_unix_cloud_servers.markdown b/_posts/labnews/2020-11-01-student_controlled_unix_cloud_servers.markdown index 0d8d7df3..eab29a7f 100644 --- a/_posts/labnews/2020-11-01-student_controlled_unix_cloud_servers.markdown +++ b/_posts/labnews/2020-11-01-student_controlled_unix_cloud_servers.markdown @@ -13,7 +13,7 @@ Getting into big data science can be a big leap if you're a biologist who is new We try to cut that down into a series of smaller, more manageable steps. -As part of that, we run a hands-on [genome bioinformatics course](http://wurmlab.github.io/genomicscourse/practicals) that introduces students to UNIX, and covers topics from Illumina read cleaning to genome assembly, annotation, population genomics and genome-wide association mapping. 
+As part of that, we run a hands-on genome bioinformatics course that introduces students to UNIX, and covers topics from Illumina read cleaning to genome assembly, annotation, population genomics and genome-wide association mapping. For obvious 2020 reasons, we needed to do this online in a manner that: - has **manageable costs but sufficient power for genomics analyses**; @@ -58,4 +58,3 @@ We can potentially deploy our solution for other courses. If you're interested, ![/img/news/2020-11-01-unix_bioinf_cloud/panel.png](/img/news/2020-11-01-unix_bioinf_cloud/panel.png){: width="499" height="397" style="max-width:100%; height: auto"} ![/img/news/2020-11-01-unix_bioinf_cloud/cloud-computer-web-interface.png](/img/news/2020-11-01-unix_bioinf_cloud/cloud-computer-web-interface.png){: width="1141" height="532" style="max-width:100%; height: auto"} - diff --git a/_posts/oldblogarchive/2004-12-09-fire-ants-whats-the-point.markdown b/_posts/oldblogarchive/2004-12-09-fire-ants-whats-the-point.markdown index 2c883a56..3d9e9557 100644 --- a/_posts/oldblogarchive/2004-12-09-fire-ants-whats-the-point.markdown +++ b/_posts/oldblogarchive/2004-12-09-fire-ants-whats-the-point.markdown @@ -10,7 +10,7 @@ categories: - oldblogarchive --- -[Red Fire Ants](http://en.wikipedia.org/wiki/Red_Imported_Fire_Ant) are natives of South America where they occupy an ecologic niche, under pressure of predators and competitors. In other places, such as the southern [United States](http://www.invasivespecies.gov/profiles/fireant.shtml) or [Australia](http://www.dpi.qld.gov.au/fireants/), fire ants are considered an _invasive species_: given almost no predators or competitors, their proliferation is unlimited. They have become a considerable agricultural and thus **economic pest** as well as a significant **health hazard**. 
+[Red Fire Ants](http://en.wikipedia.org/wiki/Red_Imported_Fire_Ant) are natives of South America where they occupy an ecologic niche, under pressure of predators and competitors. In other places, such as the southern United States or Australia, fire ants are considered an _invasive species_: given almost no predators or competitors, their proliferation is unlimited. They have become a considerable agricultural and thus **economic pest** as well as a significant **health hazard**. Understanding these guys could contribute to solving these issues. It might also help understand how a useful social insect, the honey bee, works. We could get **better honey**! @@ -18,19 +18,19 @@ Other issues which could be of interest for ants as well as generally concerning - + * Eggs laid by a queen are practically identical. How does the environment (temperature variations...) and handling (by nurses) determine that a larvae will become a queen while another will become a worker? Which worker will become a soldier? a nurse? a scout? - + * A queen can live a long time - maybe 20 or 40 years. But a worker's life lasts only one or two years. And a male only one or two weeks. And yet they carry identical genetic information. Could we also live longer? - + * How do ants form alliances with other colonies? How do they use slavery, propaganda, deception, appeasement, spying? How does an individual know what it should do and communicate the result? - + * To which extent is it possible to use ideas from social insects to solve our problems? A large number of cooperating small interchangeable robots might solve certain issues better than one big robot... - + * ... 
diff --git a/_posts/oldblogarchive/2005-03-02-refined-nucleotide-blast-matrix.markdown b/_posts/oldblogarchive/2005-03-02-refined-nucleotide-blast-matrix.markdown index 8ffc82ea..a95f8893 100644 --- a/_posts/oldblogarchive/2005-03-02-refined-nucleotide-blast-matrix.markdown +++ b/_posts/oldblogarchive/2005-03-02-refined-nucleotide-blast-matrix.markdown @@ -18,13 +18,13 @@ blastn is not good at finding these sequence's homologues: - + * blastn searches for homologous sequences by trying to identify windows of 12 identical nucleotides. - + * for blastn, a C-T mismatch is just like any other mismatch. For bisulfite treated sequences, we know that many Ts are in fact Cs which have been modified by chemical treatment. Thus we should penalize them less. - + * blastn is optimized for speed, not flexibility. That means the window-size and scoring matrix are hard-coded - the user cannot edit them. @@ -41,13 +41,13 @@ Poking around on the internet for alternatives did not turn anything up, so I as > >6. Remember that your scores will be making some wrong assumptions about using proteins. You should still find the hits you are looking for. -Contacting NCBI confirmed this... Wayne Matten pointed me towards a METHODS [paper](http://blast.wustl.edu/doc/ntmats.pdf) describing *The Use of BlastP For Nucleic Acid Searches*. He also indicated [example matrices](ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/). +Contacting NCBI confirmed this... Wayne Matten pointed me towards a METHODS paper describing *The Use of BlastP For Nucleic Acid Searches*. He also indicated [example matrices](ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/). So the next step was downloading and compiling [NCBI Blast](http://www.ncbi.nlm.nih.gov/BLAST/) sources, and getting [Apple-Genentech's G5-optimized Blastall](http://www.apple.com/acg/). 
Then for each nucleotide sequence database I wanted to blast against, I had to: - + * call formatdb (supplied with ncbi's Blast: `~/bin/blast-2.2.10/bin/formatdb -i Group10_20050120.fa -l Group10.formatdb.log -t "Apis Contig Group10"` - + * blast my sequences against this database: `~/bin/blastall-2.2.9-apple-genentech -p blastp -d genomes/Amel20050120-freeze/contigs/Group10_20050120.fa -i ~/treatedSequence.fasta -o /Users/admin/Documents/Perl/generated\ data/heleneTest.2005-feb-25-mini -M BLOSUM80 -F F` @@ -59,22 +59,22 @@ This let me test the custom scoring matrix to give an increased difference in sc - + * not penalizing Ns or other non-ACGT bases - + * giving increased importance to conserved C-C alignments (rare since in in a lightly methylated sequence, most Cs are transformed to Ts) - + * not penalizing C-T alignments when C is in a "normal" sequence and T is in bisulfite-treated sequence. - + * reducing positive influence of T-T alignments (in bisulfite-treated sequence, T could really be a modified C). - + * Venues not explored include: - + * modifying influence of transversions and transitions, since the probability of their occuring differs, especially between related species. diff --git a/_posts/oldblogarchive/2006-11-18-development.markdown b/_posts/oldblogarchive/2006-11-18-development.markdown index e388e830..6d83adcb 100644 --- a/_posts/oldblogarchive/2006-11-18-development.markdown +++ b/_posts/oldblogarchive/2006-11-18-development.markdown @@ -36,9 +36,9 @@ iConvert Images Bebetes Project Along with 5 others... Les Fourmis - + Projet Regulation - Ok. This isn't code. But Am,ao?(C)lie V,ao?(C)ron and I spent a lot of time on the computer for it! It's an attempt at modeling part of _E. coli_'s global regulation, using a tool called [Genetic Network Analyzer](http://www.inrialpes.fr/helix/logic_GNA_mn.html), developed at Helix Inria. We did this as part of a fourth-year project at Insa de Lyon. (more info in the pdf file's intro). 
[This is it](/attic/dev/gna2003.pdf). + Ok. This isn't code. But Am,ao?(C)lie V,ao?(C)ron and I spent a lot of time on the computer for it! It's an attempt at modeling part of _E. coli_'s global regulation, using a tool called Genetic Network Analyzer, developed at Helix Inria. We did this as part of a fourth-year project at Insa de Lyon. (more info in the pdf file's intro). [This is it](/attic/dev/gna2003.pdf). [Timepark](/attic/dev/timepark) 2003-2004 first semester project: an edo-based modeling framework (C++) and graphical end-user app (Obj-C). diff --git a/_posts/oldblogarchive/2006-11-18-timepark.markdown b/_posts/oldblogarchive/2006-11-18-timepark.markdown index e8266461..3c4730c2 100644 --- a/_posts/oldblogarchive/2006-11-18-timepark.markdown +++ b/_posts/oldblogarchive/2006-11-18-timepark.markdown @@ -16,7 +16,7 @@ categories: ## Timepark -Development report (for my school, Insa de Lyon). [Timepark-report January 2004](http://yannick.poulet.org/dev/timepark_report-jan2004.pdf). +Development report (for my school, Insa de Lyon). @@ -46,18 +46,16 @@ An open source modeling and simulation framework, used as the backend for Timepa - + * An object's position is defined by it's (x,y,z) properties; object classes may inherit these and additionally defined properties. - + * A property's value can be defined by ordinary differential equations (ODEs). - + * A property can evolve differently depending on the system's state through the use of control statements which are functions of any of the system's objects properties (eg: if _light is green_ then _d(x)/dt = 10_ else _d(x)/dt =0_. Technologies: C++ STL, Flex/Yacc, Xerces, OpenGL Download Source and documentation soon... - - diff --git a/data/supergene_introgression/gt.vcf.gz/index.html b/data/supergene_introgression/gt.vcf.gz/index.html index 05274b4f..49b4a694 100644 --- a/data/supergene_introgression/gt.vcf.gz/index.html +++ b/data/supergene_introgression/gt.vcf.gz/index.html @@ -9,7 +9,7 @@

Fire ant supergene introgression files:

VCF:

+ href="https://github.com/wurmlab/wurmlab.github.io/blob/master/data/supergene_introgression/gt.vcf.gz/gt.vcf.gz">

Variant calling file

diff --git a/index.html b/index.html index 8cc9f7e7..10e88aea 100644 --- a/index.html +++ b/index.html @@ -84,7 +84,7 @@

Empowering genomic data scientists

-The sequences of *A. rubens* precursor proteins or the putative neuropeptides/polypeptide hormones derived from them were aligned with homologous proteins/peptides in other bilaterian species, some of which were identified here for the first time. Alignments were generated and edited using Jalview [47] and MAFFT [48] with JABAWS web service [49], employing default settings (gap opening penalty at local pairwise alignment = −2, similarity matrix = Blosum62, gap open penalty = 1.53, group size = 20, group-to-group gap extension penalty = 0.123). GeneDoc ([https://genedoc.software.informer.com/](https://genedoc.software.informer.com/){:target="_blank"}) was used to annotate the alignments and prepare alignment figures. +The sequences of *A. rubens* precursor proteins or the putative neuropeptides/polypeptide hormones derived from them were aligned with homologous proteins/peptides in other bilaterian species, some of which were identified here for the first time. Alignments were generated and edited using Jalview [47] and MAFFT [48] with JABAWS web service [49], employing default settings (gap opening penalty at local pairwise alignment = −2, similarity matrix = Blosum62, gap open penalty = 1.53, group size = 20, group-to-group gap extension penalty = 0.123). GeneDoc was used to annotate the alignments and prepare alignment figures.
diff --git a/publications/2022-the-era-of-reference-genomes-in-conservation-genomics/index.md b/publications/2022-the-era-of-reference-genomes-in-conservation-genomics/index.md index 2e86a8ee..d1256c87 100644 --- a/publications/2022-the-era-of-reference-genomes-in-conservation-genomics/index.md +++ b/publications/2022-the-era-of-reference-genomes-in-conservation-genomics/index.md @@ -190,7 +190,7 @@ the entire set of DNA sequences (or genes) of a species represented by the core the inference of the phylogenetic relationships among different lineages of organisms from genome-wide data. ##### Reference genome -a contiguous and accurate genome assembly representative of a species in which the coordinates of genes and other important features are annotated. Current definitions of reference genome quality are given in [2.] and [https://www.earthbiogenome.org/assembly-standards](https://www.earthbiogenome.org/assembly-standards). +a contiguous and accurate genome assembly representative of a species in which the coordinates of genes and other important features are annotated. Current definitions of reference genome quality are given in [2.] and assembly standards.
diff --git a/talks/index.html b/talks/index.html index a4673fb6..9cfb0d87 100644 --- a/talks/index.html +++ b/talks/index.html @@ -17,18 +17,15 @@ slideshareurl: //www.slideshare.net/yannickwurm/2014-1015nextbug-edinburgh about: "Presenting our ant genomics research, some bioinformatics challenges we've faced & the solutions we're overcoming them with. These include our software for: - " + " - context: "2014 May: Balti and Bioinformatics @ Birmingham" label: "balti2014" - contexturl: "//pathogenomics.bham.ac.uk/blog/2014/05/balti-and-bioinformatics-27th-may-2014" + #contexturl: "//pathogenomics.bham.ac.uk/blog/2014/05/balti-and-bioinformatics-27th-may-2014" youtubeid: mmMQw2gIozI about: "Nick Loman asked me (with 2 days notice!!) to share some thoughts on challenges and opportunities for computing infrastructure for genomics analysis." @@ -40,13 +37,13 @@ finding that a social chromosome determines social behavior in the red fire ant." - context: "2013 July: European Society for Evolutionary Biology @ Lisbon" - contexturl: "//www.eseb2013.com" + #contexturl: "//www.eseb2013.com" youtubeid: yAj0BVoXrsc about: "Conference talk on our finding that a social chromosome determines social behavior in the red fire ant." - context: "2013 September: Cream Teas and Bioinformatics Exeter" - contexturl: "//pathogenomics.bham.ac.uk/blog/2013/09/cream-teas-and-bioinformatics-meeting-report-in-pictures-and-slides/" + #contexturl: "//pathogenomics.bham.ac.uk/blog/2013/09/cream-teas-and-bioinformatics-meeting-report-in-pictures-and-slides/" slideshareid: 25977819 slidesharekey: "iLF5FFTszRKT5x" slideshareurl: "//www.slideshare.net/yannickwurm/2013-0905cream-teasexeter" diff --git a/teaching/DAY1_exercises.docx.html b/teaching/DAY1_exercises.docx.html index 8f143686..e7a62675 100644 --- a/teaching/DAY1_exercises.docx.html +++ b/teaching/DAY1_exercises.docx.html @@ -38,127 +38,127 @@ padding: 0 6px; }

DAY 1 EXERCISES

Oksana Riba-Grognuz

UNIX

First of all, let's log in to the Vital-IT infrastructure. Each of you received a user name to use in the secure connection command below.

$ ssh username@prd.vital-it.ch

Now we are connected to a front-end node (prd.vital-it.ch) that can only be used to submit jobs to the Vital-IT cluster. For this practical we will be using 2 big-memory machines of the UNIL Department of Ecology and Evolution (dee-serv01 and dee-serv02). Half of you will connect to one of these machines and half to the other. Use the machine name you were assigned below.

$ ssh dee-serv0X

Once you get there check where you are with the command printing the current directory.

$ pwd

Alternatively, - use the "echo" command to find out the address of your home directory. -If you did not change your directory upon login this address should be + use the "echo" command to find out the address of your home directory. +If you did not change your directory upon login this address should be the same as when you did "pwd".

$ echo $HOME

If you forgot which username you used, you can always check with the command below.

$ whoami

List files and directories in your current location.

$ ls

Wherever - you are located there is always an easy way to get back to home + you are located there is always an easy way to get back to home directory. Just type "cd" (change directory) without arguments.

How to get home?

$ cd

Adding bioinformatics software to your $PATH

To see which programs you can run, simply press the TAB key twice.

$ TAB TAB

Possible - options are the program files found in locations specified in $PATH. + options are the program files found in locations specified in $PATH. What are these locations? Look at $PATH with "echo" command:

$ echo $PATH

Right now your $PATH is missing the software we will need.  For this one would typically do the following: 

$ export PATH="$PATH:/some/additional/software/location"
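If the export line ends up being run more than once, the same location gets appended to $PATH repeatedly. A minimal sketch of a guard against that follows; `path_append` is a hypothetical helper name and the directory is the same placeholder as above, neither is part of the Vital-IT setup.

```shell
# Append a directory to PATH only if it is not already listed.
# path_append is a hypothetical helper, not part of the course material.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present: do nothing
        *) PATH="$PATH:$1" ;;        # otherwise append at the end
    esac
    export PATH
}

path_append "/some/additional/software/location"
path_append "/some/additional/software/location"   # second call is a no-op

# Count how many times the location appears in PATH
echo "$PATH" | tr ':' '\n' | grep -cxF "/some/additional/software/location"
```

Wrapping $PATH in colons before matching avoids treating a longer directory name as a match for a shorter one.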

Vital-IT - have prepared something that does this for you. So that it happens + have prepared something that does this for you. So that it happens automatically when you login, you’ll need to edit the .bashrc in your home directory.

To see your .bashrc file with the command "ls" we need to add additional arguments, as this file is hidden by default.

$ ls -lah

Browse the contents of .bashrc using command "less".

$ less .bashrc

Does your .bashrc contain the following lines?

source /mnt/common/R-BioC/R-BioC.bashrc

source /mnt/common/DevTools/DevTools.bashrc

source /mnt/common/UHTS/UHTS.bashrc

These - lines add to your $PATH the locations and configurations of R, and of - Ultra High Throughput Sequencing (UHTS) applications. If they are + lines add to your $PATH the locations and configurations of R, and of + Ultra High Throughput Sequencing (UHTS) applications. If they are missing from your .bashrc append them to .bashrc using ">>": 

$ echo "source /mnt/common/R-BioC/R-BioC.bashrc" >> .bashrc

$ echo "source /mnt/common/DevTools/DevTools.bashrc" >> .bashrc

$ echo "source /mnt/common/UHTS/UHTS.bashrc"     >> .bashrc
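Because ">>" always appends, running these commands twice leaves duplicate lines in .bashrc. One way to avoid that, sketched here under the assumption that an exact line match is good enough, is to test for the line first. `add_line_once` is a hypothetical helper, and a temporary file stands in for the real .bashrc so the sketch can be tried safely.

```shell
# Stand-in for ~/.bashrc so the sketch does not touch the real file.
rcfile=$(mktemp)

# add_line_once is a hypothetical helper: append a line to a file
# only if no identical line is already there.
add_line_once() {
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

add_line_once "source /mnt/common/UHTS/UHTS.bashrc" "$rcfile"
add_line_once "source /mnt/common/UHTS/UHTS.bashrc" "$rcfile"   # no duplicate added

cat "$rcfile"
```

To apply it for real, one would pass "$HOME/.bashrc" instead of the temporary file.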

Now please logout and then ssh to the server again. Check how your $PATH has changed:

$ echo $PATH

Files and folders

If you are not familiar with file and folder operations try the following commands.

Make a new empty folder.

$ mkdir newfolder

Go to this folder.

$ cd newfolder

Go one folder back.

$ cd ..

Remove the empty folder.

$ rmdir newfolder

Make the folder again

$ mkdir newfolder

Create a file in that folder by redirecting printed output of "echo" command to a file.

$ echo "Some text" > newfolder/file.txt

Try removing the folder again

$ rmdir newfolder

Q: Why does this not work?

Try command "rm" (removes files and folders)

$ rm newfolder

This still does not work. Check the manual (type "man" before the command name) to see which options to use.

$ man rm

Editing files

Files - can be edited locally on Vital-IT (using nano or vi or emacs), or on -your laptop using a text editor of your choice (Aquamacs, -TextWrangler... NOT Microsoft Word!). To edit a file locally you must + can be edited locally on Vital-IT (using nano or vi or emacs), or on +your laptop using a text editor of your choice (Aquamacs, +TextWrangler... NOT Microsoft Word!). To edit a file locally you must first download it from Vital-IT. For this you can use scp (or Cyberduck or something like FileZilla). 

For example, if you choose scp do the following locally on your computer:

$ mkdir Scripts

$ scp username@prd.vital-it.ch:/location/of/Scripts/MyScript.rb Scripts/

Once the script has been modified, upload it back to Vital-IT.

$ scp Scripts/MyScript.rb username@prd.vital-it.ch:/location/of/Scripts/


DATA ANALYSIS

We will be working on Solenopsis invicta (the red fire ant) using Illumina DNA and RNA-seq reads. For the official release of de novo genome assembly (Wurm et al. 2011, PNAS) we combined Illumina and 454 technologies in a hybrid assembly approach. For this course we will use only Illumina because it has become clear that it is possible to perform de novo genome assembly using only this technology. Furthermore, we are only considering for this practical a very small subset (less than 5%) of the fire ant genome.

Please form groups of two (one more computational; one less computational). From now on you will use only one Vital-IT login per group.

Login to Vital-IT. The - command below connects first to the prd server and then to the dee + command below connects first to the prd server and then to the dee server. We are only allowed to use ssh to connect to prd, but cannot use it for calculations.  

$ ssh -t username@prd.vital-it.ch ssh dee-serv0X

According - to the Vital-IT rules, no calculations can be carried out in the home -directory. We will use /scratch/cluster/weekly/ for all practicals -(files will be deleted after one week!). Create there a directory named -according to your user-name and change from your home directory to this + to the Vital-IT rules, no calculations can be carried out in the home +directory. We will use /scratch/cluster/weekly/ for all practicals +(files will be deleted after one week!). Create there a directory named +according to your user-name and change from your home directory to this new directory.

$ mkdir /scratch/cluster/weekly/username 

$ cd    /scratch/cluster/weekly/username

Extract the files required for today's practical.

$ unzip /scratch/cluster/monthly/oribagro/summer2012_Oksana.zip


Quality Control: FastQC

We will use the FastQC package - installed locally on your computer. We will analyse files located in -DNA-seq/Raw/ and DNA-seq/Nr/. They are big. So please take a copy of these files from provided USB key or from the local web server (zipped as DNA-seq.zip). [probably http://192.168.167.32 ]

$ ls DNA-seq/Raw

$ ls DNA-seq/Nr

Our - goal is to make decisions on de novo assembly strategy based on FastQC + installed locally on your computer. We will analyse files located in +DNA-seq/Raw/ and DNA-seq/Nr/. They are big. So please take a copy of these files from provided USB key or from the local web server (zipped as DNA-seq.zip). [probably  ]

$ ls DNA-seq/Raw

$ ls DNA-seq/Nr

Our + goal is to make decisions on de novo assembly strategy based on FastQC quality report. Open FastQC and open files to analyse. Process first Raw - files, which are subsets of Illumina Hi-seq lanes as they came out of + files, which are subsets of Illumina Hi-seq lanes as they came out of the sequencer.

Q: What do the file names mean?

Q: Do you think both lanes should be used for assembly?

Q: Do we need to trim or filter reads?

Q: Which information is important to take the decision about trimming/filtering?

Q: How can you explain a significant drop in the quality in the beginning of the reads of 2nd pair members of lane 7?

Process Nr reads now.  Nr means “Non-redundant”: reads were processed to remove exact duplicates.

Q: Do you see a big change in duplication levels?

Q: What can be the reason for that (consider information given during presentation)?


Quality Control: FASTX-Toolkit

We - will use FASTX-Toolkit to implement filtering/trimming decided in the -previous step. This part takes place at Vital-IT. Process Raw or Nr -(non-redundant) reads and output "clean" files to -/scratch/cluster/weekly/username/DNA-seq/Clean. Toolkit allows us to do + will use FASTX-Toolkit to implement filtering/trimming decided in the +previous step. This part takes place at Vital-IT. Process Raw or Nr +(non-redundant) reads and output "clean" files to +/scratch/cluster/weekly/username/DNA-seq/Clean. Toolkit allows us to do different types of filtering/trimming.

Find out how to trim reads based on coordinates:

$ fastx_trimmer -h

Find out how to trim/filter reads based on quality:

$ fastq_quality_trimmer -h

Use - fastx_trimmer or fastq_quality_trimmer on each file individually + fastx_trimmer or fastq_quality_trimmer on each file individually (replace first base -f and/or last base -l values to desired trimming in - the command below). Or alternatively use a provided launch script that + the command below). Or alternatively use a provided launch script that will do the job on all files at once.

$ fastx_trimmer -f xxxx -l yyyy -i DNA-seq/Nr/101104_s_7_1.subset.fastq -o DNA-seq/Clean/101104_s_7_1.subset.fastq

$ fastx_trimmer -f xxxx -l yyyy -i DNA-seq/Nr/101104_s_7_2.subset.fastq -o DNA-seq/Clean/101104_s_7_2.subset.fastq

Ideally we want to remember what exactly was done to Raw data. A good way to achieve this is by doing - all processing using bash wrapper scripts. You have an example of such -script in Scripts folder. Another advantage of using + all processing using bash wrapper scripts. You have an example of such +script in Scripts folder. Another advantage of using scripts that execute the command on each input file automatically, is to handle large data sets comprised of multiple files.

$ less Scripts/run_fastx_trimmer.sh

This - script should be launched in project directory containing DNA-seq + script should be launched in project directory containing DNA-seq directory in it. You can use vi or nano to modify this script on Vital-IT, - or if you do not know how to use these, you can download this file to -your local computer and modify using your preferred text editor to do -what you judge necessary as trimming/filtering. Hint: you + or if you do not know how to use these, you can download this file to +your local computer and modify using your preferred text editor to do +what you judge necessary as trimming/filtering. Hint: you need to modify the line that launches fastx_trimmer command by adjusting - trimming coordinates (-f xxx and -l yyy specifying the first and the + trimming coordinates (-f xxx and -l yyy specifying the first and the last base respectively).  If you forgot how to edit files, check the end of the UNIX introduction.

To launch the script you should go to the following folder on Vital-IT:

$ cd /scratch/cluster/weekly/username

$ Scripts/run_fastx_trimmer.sh
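For readers who want to see the general shape of such a wrapper before opening the provided one, here is a dry-run sketch. It is not the contents of Scripts/run_fastx_trimmer.sh: the function name and the FIRST/LAST values are placeholders chosen for illustration, and the commands are only printed, not executed.

```shell
# Dry-run sketch of a wrapper: print (do not execute) one fastx_trimmer
# command per input file. FIRST and LAST are placeholder coordinates,
# not a trimming recommendation.
FIRST=1
LAST=90
IN_DIR="DNA-seq/Nr"
OUT_DIR="DNA-seq/Clean"

print_trim_commands() {
    for infile in "$IN_DIR"/*.fastq; do
        [ -e "$infile" ] || continue          # skip if the glob matched nothing
        outfile="$OUT_DIR/$(basename "$infile")"
        echo "fastx_trimmer -f $FIRST -l $LAST -i $infile -o $outfile"
    done
}

# Run from the project directory that contains DNA-seq/
print_trim_commands
```

Printing the commands first makes it easy to check file names and coordinates before any data is processed; the printed lines also serve as a record of what was done to the raw data.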


DE NOVO GENOME ASSEMBLY

Few definitions are important for de novo assembly: contigs (contiguous sequences) and scaffolds, illustrated in the figure - below. A genome assembly consists in hundreds to thousands of + below. A genome assembly consists in hundreds to thousands of scaffolds.

We - will use SOAPdenovo for genome assembly. Depending on genome -characteristics different software might be the most appropriate. For + will use SOAPdenovo for genome assembly. Depending on genome +characteristics different software might be the most appropriate. For the red fire ant data in 2009 SOAPdenovo was the best performing - assembler of Illumina reads. To keep track of our actions and ensure -the reproducibility of all steps we will rely on bash scripts to run -assemblies. A usual approach to Illumina assembly is to do multiple -assemblies using different combinations of data quality -trimming/filtering and different assembler parameters. Due to time and + assembler of Illumina reads. To keep track of our actions and ensure +the reproducibility of all steps we will rely on bash scripts to run +assemblies. A usual approach to Illumina assembly is to do multiple +assemblies using different combinations of data quality +trimming/filtering and different assembler parameters. Due to time and resource constraints, each pair of students will only perform 2 or 3 assemblies.

Illumina - assemblers rely on de Bruijn graph constructed from K-mers (K-length + assemblers rely on de Bruijn graph constructed from K-mers (K-length words) of all reads. This makes K-mer length a key software parameter to optimise.
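To make "K-length words" concrete, the sketch below lists all K-mers of a short made-up read. `kmers` is a hypothetical helper for illustration only; it has nothing to do with how SOAPdenovo stores K-mers internally.

```shell
# Print every K-length word (K-mer) of a read, one per line.
# kmers is a hypothetical helper for illustration only.
kmers() {
    awk -v seq="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(seq); i++)
            print substr(seq, i, k)
    }'
}

# A 6-base read contains 6 - 3 + 1 = 4 words of length 3
kmers ACGTAC 3
```

A read of length L yields L - K + 1 K-mers, so larger K values give fewer, more specific words per read.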

SOAPdenovo package consists of four programs: pregraph, contig, map and scaff. Assembly with paired end reads involves the use of all four programs. - Command "all" allows executing full SOAPdenovo package easily. We -basically need to specify "SOAPdenovo all -s config_file -o + Command "all" allows executing full SOAPdenovo package easily. We +basically need to specify "SOAPdenovo all -s config_file -o output_prefix".

$ cd /scratch/cluster/weekly/username/SOAPdenovo/Assembly

This folder contains two files conf01-lanes47_maplen60 and Run_conf01-RL200D.sh.

$ less conf01-lanes47_maplen60

$ less Run_conf01-RL200D.sh

Don’t run anything yet! Config - file is created according to a format defined in SOAPdenovo -requirements. Run_conf01-RL200D.sh will use config file to run -SOAPdenovo with the specified parameters and config files and output -results to folders named according to parameter values.

Check the explanations of commands and config file specifications at http://soap.genomics.org.cn/soapdenovo.html .

Modify conf01-lanes47_maplen60. You need at least to specify the correct location of input fastq files that you want to use (replace username). Optionally you can change some of the parameters.

 

Q: Can you trim reads within SOAPdenovo config file? 

Modify Run_conf01-RL200D.sh. Use K-mer values of at least 1/3 of read length. Do not put more than three K-mer values as it will increase run time (Please do not run more than one assembly at a time -  we are a big group of students sharing few compute resources). 

Q: What are the different parameters used to run SOAPdenovo in Run_conf01-RL200D.sh?

After you have edited the script, make sure you are in the correct folder and launch it:

$ ./Run_conf01-RL200D.sh

After the assembly is finished, examine the contents of the output folders. Look at the contents of the LOG file from the assembly.

$ less LOG

Files out.contig and out.scafSeq contain FASTA-format contig and scaffold sequences, respectively.

        Optional

Q: What does asm_flags=3 in the config file mean?

There are multiple versions of SOAPdenovo. To see them, do:

$ SOAPdenovo TAB TAB

Q: Why are there several versions? When is it dangerous to use SOAPdenovo-127mer?


ASSESSING ASSEMBLY QUALITY

Comparing assembly metrics

A - common way to select the optimal assembly strategy is to look at -various types of statistics, like total number of assembled base pairs, + file is created according to a format defined in SOAPdenovo +requirements. Run_conf01-RL200D.sh will use config file to run +SOAPdenovo with the specified parameters and config files and output +results to folders named according to parameter values.

Check the explanations of commands and config file specifications at soap.genomics.org.cn/soapdenovo.html .

Modify conf01-lanes47_maplen60. You need at least to specify the correct location of input fastq files that you want to use (replace username). Optionally you can change some of the parameters.

 

Q: Can you trim reads within SOAPdenovo config file? 

Modify Run_conf01-RL200D.sh. Use K-mer values of at least 1/3 of read length. Do not put more than three K-mer values as it will increase run time (Please do not run more than one assembly at a time -  we are a big group of students sharing few compute resources). 

Q: What are the different parameters used to run SOAPdenovo in Run_conf01-RL200D.sh?

After you have edited the script, make sure you are in the correct folder and launch it:

$ ./Run_conf01-RL200D.sh

After the assembly is finished, examine the contents of the output folders. Look at the contents of the LOG file from the assembly.

$ less LOG

Files out.contig and out.scafSeq contain FASTA-format contig and scaffold sequences, respectively.

        Optional

Q: What does asm_flags=3 in the config file mean?

There are multiple versions of SOAPdenovo. To see them, do:

$ SOAPdenovo TAB TAB

Q: Why are there several versions? When is it dangerous to use SOAPdenovo-127mer?


ASSESSING ASSEMBLY QUALITY

Comparing assembly metrics

A + common way to select the optimal assembly strategy is to look at +various types of statistics, like total number of assembled base pairs, number of scaffolds, N50 length, maximum scaffold length etc.
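As a reminder of what N50 measures, here is a small sketch that computes it from a FASTA file with awk and sort. It uses the common definition (the length of the sequence at which the running total, counted from the longest sequence down, first reaches half of the total size); tools may apply extra filters, so the numbers need not agree exactly with theirs. `n50` and the demo sequences are made up for illustration.

```shell
# Compute N50 from a FASTA file: sort sequence lengths from longest to
# shortest and report the length at which the running total first reaches
# half of the total assembly size. n50 is a hypothetical helper.
n50() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= total / 2) { print lens[i]; exit }
             }
         }'
}

# Made-up sequences of length 8, 6, 4 and 2 (total 20, half is 10)
demo=$(mktemp)
printf '>s1\nAAAAAAAA\n>s2\nCCCCCC\n>s3\nGGGG\n>s4\nTT\n' > "$demo"
n50 "$demo"
```

In the demo the longest sequence alone (8 bp) does not reach 10 bp, but the two longest together (14 bp) do, so the reported N50 is the length of the second one.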

A script distributed with the assembly software Abyss called "abyss-fac" allows generating this kind of statistics.

Run the abyss-fac script on one of your scaffold files (replace the last assembly folder with the one you generated).

$ cd /scratch/cluster/weekly/username/SOAPdenovo/Assembly/conf01_K35_R_L200_D

$ abyss-fac out.scafSeq

Now run a program called seqstat to generate statistics.

$ seqstat out.scafSeq

Q: Why do some numbers of seqstat differ from those generated with abyss-fac?

Q: Which do you find the most appropriate to use in selecting best assembly?

We - can compare statistics between contig fasta and scaffold fasta using a + can compare statistics between contig fasta and scaffold fasta using a cutoff equal to the value of -L parameter used in the assembly.

Run - abyss-fac script on scaffolds again specifying -t 200 (the value of -L + abyss-fac script on scaffolds again specifying -t 200 (the value of -L used in the assembly) as a cutoff for minimum scaffold length.

$ abyss-fac -t 200 out.scafSeq

Do the same for contigs

$ abyss-fac -t 200 out.contig

Q: Why is the number of bp not the same?

Optional (crucial for submitting genome assembly to NCBI)

In order to submit a newly assembled genome to a public database one needs - to submit contig sequences and an AGP file specifying how these -sequences are arranged in scaffolds. Scaffolds are built from contigs -(contiguous sequences), that are joined by stretches of N bases using -the information from paired reads with known insert size. Unfortunately -SOAPdenovo, as most other de Bruijn graph based programs does not -provide such output and reports a contig file with all sequences -(regardless of inclusion in scaffolds). We can generate an -AGP file containing only contigs that were used for scaffold -construction by defining contigs based on stretches of N sequences + to submit contig sequences and an AGP file specifying how these +sequences are arranged in scaffolds. Scaffolds are built from contigs +(contiguous sequences), that are joined by stretches of N bases using +the information from paired reads with known insert size. Unfortunately +SOAPdenovo, as most other de Bruijn graph based programs does not +provide such output and reports a contig file with all sequences +(regardless of inclusion in scaffolds). We can generate an +AGP file containing only contigs that were used for scaffold +construction by defining contigs based on stretches of N sequences within scaffolds.

$ /scratch/cluster/weekly/username/Scripts/fasta2agp.pl -f out.scafSeq -p out

This will generate files out.contigs.fa and out.agp. Run abyss-fac on new contigs file

$ abyss-fac -t 200 out.contigs.fa

Now the total number of bp is the same as in scaffold fasta file. 
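The idea of defining contigs by the stretches of N inside scaffolds can be sketched in a few lines of awk. This is only an illustration of the principle, not the provided fasta2agp.pl script: `contig_lengths` is a hypothetical helper, the demo scaffold is made up, and the sketch assumes each sequence sits on a single line.

```shell
# Split scaffolds into contigs at runs of N and print, for each piece,
# the scaffold name, the piece index and the piece length.
# Illustration only; assumes one line per sequence (hypothetical helper).
contig_lengths() {
    awk '/^>/ { name = substr($0, 2); next }
         {
             n = split(toupper($0), parts, /N+/)
             for (i = 1; i <= n; i++)
                 if (length(parts[i]) > 0)
                     print name, i, length(parts[i])
         }' "$1"
}

# Made-up scaffold: 8 bases, a gap of 5 N, 4 bases, a gap of 2 N, 3 bases
demo=$(mktemp)
printf '>scaffold1\nACGTACGTNNNNNACGTNNACG\n' > "$demo"
contig_lengths "$demo"
```

An AGP file records the same pieces together with their start and end coordinates on the scaffold and the gaps between them; see the specification linked below for the exact columns.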

Have a look at AGP file. Read about format specifications at the following link: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/AGP_Specification.shtml 

$ less out.agp

Often one would generate multiple assemblies to refine the strategy (many more than three). - Although we have generated 2-3 assemblies, we will act as if we had a -lot and process statistics from these files into R-readable format. We -will save the file in folder called SOAPdenovo/Statistics .

$ cd /scratch/cluster/weekly/username/SOAPdenovo/Assembly

$ ./get.stats.sh > stats

Now use scp or other means to get file “stats” locally on your computer to use with R. If you obtained no statistics, modify the read.table command in what follows by replacing “stats” with  “http://www.antgenomes.org/~yannickwurm/tmp/manyAssemblyStats.txt”.

Let's use R to find the best assembly in terms of quantitative metrics. Locally on your computer:

$ R

stats    <- read.table("stats", header=T, sep="\t")

## get K-mer values from file name

myKmers    <- substr(stats$file, 8,10)

## get config names

myConf     <- as.numeric(substr(stats$file, 5,6))

## get all color names containing string "dark"

allColors  <- colors()[grep("dark",colors())]

## randomly sample the necessary number of colors

myColors   <- sample(allColors, length(unique(myConf)))

## make vector of colors per assembly setting

confColors <- myColors[as.factor(myConf)]

configName <- paste("config", as.character(unique(myConf)), sep="")

## Let's plot this:

myTitle <- "N50 vs Number of contigs >= 200bp\nSize is proportional to total assembly bp"

plot(x    = stats$n.200,

     y    = stats$N50,

     cex  = (stats$sum/1000000),  # circle size

     col  = confColors,

     pch  = 19,                   # symbol "filled circle"

     main = myTitle,

     xlab = "Number of Contigs >= 200 bp",

     ylab = "N50")

legend("topright", configName, col=unique(confColors), pch=19)

text(stats$n.200, stats$N50, myKmers, cex=0.7, col="white")

myTitle <- "N50 vs Number of contigs >= N50\nSize is proportional to total assembly bp"

plot(x    = stats$n.N50,

     y    = stats$N50,

     cex  = (stats$sum/1000000),

     col  = confColors,

     pch  = 19,

     main = myTitle,

     xlab = "Number of Contigs >= N50",

     ylab ="N50")

legend("topright", configName, col=unique(confColors), pch=19)

text(stats$n.N50, stats$N50, myKmers, cex=0.7, col="white")

## Finally make a plot with the values we expect to get.

## The same subset of data represented 3066758 bp across eight

## scaffolds with N50 of 1867187

myTitle <- "N50 vs Number of contigs >= 200bp\nSize is proportional to total assembly bp"

plot(x    = c(stats$n.200, 8),

     y    = c(stats$N50, 1867187),

     cex  = c(stats$sum, 3066758)/1000000,

     col  = c(confColors, "red"),

     pch  = 19,

     main = myTitle,

     xlab = "Number of Contigs >= 200 bp",

     ylab ="N50")

legend("topright",

       c(configName, "Official"),

       col=unique(c(confColors,"red")),

       pch=19)

Q: Which assembly is the best in terms of quantitative metrics?

Q: Why is there a significant difference from the official release?


Quality assessment using independent information

Independently obtained information can provide the most reliable measures of whether or not your assembly is accurate.

\ No newline at end of file + diff --git a/teaching/DAY3_exercises_Oksana_WORKS_denovo.docx.html b/teaching/DAY3_exercises_Oksana_WORKS_denovo.docx.html index 9b3d837c..8209e0b4 100644 --- a/teaching/DAY3_exercises_Oksana_WORKS_denovo.docx.html +++ b/teaching/DAY3_exercises_Oksana_WORKS_denovo.docx.html @@ -38,103 +38,103 @@ padding: 0 6px; }

DAY 3 EXERCISES

Oksana Riba-Grognuz

Differential Gene Expression

We - will be using one of the most popular programs for differential -expression analyses and transcript isoform assembly from spliced + will be using one of the most popular programs for differential +expression analyses and transcript isoform assembly from spliced alignments of RNA-seq reads: TopHat and Cufflinks. - These programs are actively maintained and updated. TopHat relies on -Bowtie to carry out short-read alignments. Recently, following the + These programs are actively maintained and updated. TopHat relies on +Bowtie to carry out short-read alignments. Recently, following the release of Bowtie2, new TopHat and Cufflinks versions were released. We - will be using the latest release of TopHat and Cufflinks with Bowtie1. -To do so, login to vital-it and source file -“TopHat/latest_versions.bashrc” as shown below. Source command will -modify your $PATH as specified in latest_versions.bashrc and update + will be using the latest release of TopHat and Cufflinks with Bowtie1. +To do so, login to vital-it and source file +“TopHat/latest_versions.bashrc” as shown below. Source command will +modify your $PATH as specified in latest_versions.bashrc and update environment variables specifying versions of TopHat and Cufflinks.

$ ssh -t username@prd.vital-it.ch ssh dee-serv0X

$ cd /scratch/cluster/weekly/username

$ source TopHat/latest_versions.bashrc

YOU NEED R 2.15!!!!

If everything worked the installed software versions should be:

Please - check which versions of Cufflinks and TopHat you have. If these are -different from the ones listed above, give us a sign and we will set up + check which versions of Cufflinks and TopHat you have. If these are +different from the ones listed above, give us a sign and we will set up the correct ones.

$ which tophat

$ echo $TOPHAT_VERSION

$ which cufflinks

$ echo $CUFFLINKS_VERSION

Tophat: mapping reads to genome.

We will start by preparing all necessary input files to run TopHat. As it uses Bowtie to - align short reads to a reference genome, we need to generate Bowtie -index for genome. For this go to GenomeSubset/Assembly/ folder and run + align short reads to a reference genome, we need to generate Bowtie +index for genome. For this go to GenomeSubset/Assembly/ folder and run the command bowtie-build on scaffold sequences contained in FASTA file SINV_subset_1.fa. - Start by looking how to launch this command (HINT: we do not need any -options). Note that it is convenient to use the same prefix for input + Start by looking how to launch this command (HINT: we do not need any +options). Note that it is convenient to use the same prefix for input fasta file and bowtie index.

$ cd /scratch/cluster/weekly/username/GenomeSubset/Assembly

$ ls

$ bowtie-build -h

Please launch the appropriate command.

Q: How many index files were generated?

We will use 75 bp single Illumina RNA-seq data from 3 conditions, each with 4 biological replicates: - Queens, Workers and Males, prefixed with Q, W, M respectively. Each -replicate represents a pool of individuals. A 3 digit number in file -names indicates red fire ant colony from which sample was taken. Have a + Queens, Workers and Males, prefixed with Q, W, M respectively. Each +replicate represents a pool of individuals. A 3 digit number in file +names indicates red fire ant colony from which sample was taken. Have a look in the folder RNA-seq/Raw

$ ls /scratch/cluster/weekly/username/RNA-seq/Raw

Due to time constraints we will launch TopHat and Cufflinks on 3 files. For - the remainder of the pipeline we will use pre-calculated TopHat and + the remainder of the pipeline we will use pre-calculated TopHat and Cufflinks results with all files. Go to the directory TopHat.

$ cd /scratch/cluster/weekly/username/TopHat

In - this directory you will find a file called “file.list” that lists 3 -file names in column 1 and corresponding fragment sizes in column 2. -Fragment sizes must be specified when running Cufflinks on single reads -(note that for paired read analyses Cufflinks calculates fragment size + this directory you will find a file called “file.list” that lists 3 +file names in column 1 and corresponding fragment sizes in column 2. +Fragment sizes must be specified when running Cufflinks on single reads +(note that for paired read analyses Cufflinks calculates fragment size distribution based on alignments). Have a look on TopHat parameters.

$ tophat -h

A - file launching TopHat is called runTopHat.sh, have a look on it. This -script takes all files in “file.list” and processes 1 by 1. Output + file launching TopHat is called runTopHat.sh, have a look on it. This +script takes all files in “file.list” and processes 1 by 1. Output directories are named according to sample names (prefix of fastq files).

$ less runTopHat.sh

In - the beginning of this file there is a little trick to avoid manually -changing username: the variable with username is set using command -“whoami”. Thus if you used your username as working directory name you + the beginning of this file there is a little trick to avoid manually +changing username: the variable with username is set using command +“whoami”. Thus if you used your username as working directory name you do NOT need to change links to - input files (genome index and raw files location) in launch script. -Optionally you can change TopHat parameters. Use nano or vi editor -within Vital-IT, or scp the script to edit locally as we did on Day1. + input files (genome index and raw files location) in launch script. +Optionally you can change TopHat parameters. Use nano or vi editor +within Vital-IT, or scp the script to edit locally as we did on Day1. Once done, launch the script.

$ ./runTopHat.sh
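The two ideas described above (deriving paths from "whoami" and walking through a two-column list) can be sketched as follows. This is an assumption about how such a launch script could look, not the contents of runTopHat.sh: `list_samples` is a hypothetical helper and the fragment sizes in the demo list are invented.

```shell
# Derive the working directory from the login name instead of hard-coding it.
username=$(whoami)
workdir="/scratch/cluster/weekly/$username"

# list_samples is a hypothetical helper: read a two-column list
# (file name, fragment size) and print sample name and fragment size.
list_samples() {
    while read -r fastq fragsize; do
        [ -n "$fastq" ] || continue            # skip empty lines
        sample=${fastq%%.*}                    # prefix of the fastq file name
        echo "$sample $fragsize"
    done < "$1"
}

# Made-up list in the same two-column layout as file.list
demo=$(mktemp)
printf 'W403.fastq 200\nM350b.fastq 180\n' > "$demo"

echo "$workdir"
list_samples "$demo"
```

In a real launch script the echo inside the loop would be replaced by the actual command, with the sample name used for the output directory.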

Cufflinks: mapping-based (= reference-based) gene/transcript identification - separately for each sample

While - TopHat is running we can get ready for the next part of the pipeline: -transcript assembly. Cufflinks assembles putative transcripts based on -the alignments generated by TopHat. Have a look on Cufflinks parameters + TopHat is running we can get ready for the next part of the pipeline: +transcript assembly. Cufflinks assembles putative transcripts based on +the alignments generated by TopHat. Have a look on Cufflinks parameters and on respective launch script “runCufflinks.sh”. To do this open a new terminal window or tab.

$ cufflinks -h

$ less runCufflinks.sh

When - TopHat is over launch Cufflinks and while it is running examine TopHat + TopHat is over launch Cufflinks and while it is running examine TopHat output files. Use samtools to look into bam files. Bam is a binary version of SAM (Sequence Alignment/Map) format used for storing large nucleotide sequence alignments.

$ ./runCufflinks.sh

$ samtools view W403/accepted_hits.bam | less

Column - 6 in bam file contains CIGAR strings. These strings specify the -structure of alignment. For example if all 75 bp align it will state + 6 in bam file contains CIGAR strings. These strings specify the +structure of alignment. For example if all 75 bp align it will state 75M, meaning 75 Match. If there is a 315bp gap in read alignment it will - state 61M315N14M, meaning that first 61bp Match, then there are 315 -Non-matching genomic bases, and finally 14bp Match (total 75bp of read -matched). CIGAR strings with N, are thus reads likely containing splice + state 61M315N14M, meaning that first 61bp Match, then there are 315 +Non-matching genomic bases, and finally 14bp Match (total 75bp of read +matched). CIGAR strings with N, are thus reads likely containing splice junctions (span over several exons).  
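To practise reading CIGAR strings, the sketch below adds up the M operations and reports whether an N operation (a skipped stretch of the reference, as produced by a read spanning an intron) is present. `cigar_summary` is a hypothetical helper written for this handout; it handles only the operations needed here.

```shell
# Summarise a CIGAR string: total bases in M operations and whether it
# contains an N operation (skipped region of the reference).
# cigar_summary is a hypothetical helper.
cigar_summary() {
    printf '%s\n' "$1" | awk '{
        cigar = $0; matched = 0; spliced = "no"
        # Consume the string one <number><operation> pair at a time
        while (match(cigar, /^[0-9]+[A-Z=]/)) {
            n  = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op == "M") matched += n
            if (op == "N") spliced = "yes"
            cigar = substr(cigar, RLENGTH + 1)
        }
        print matched, spliced
    }'
}

cigar_summary 75M
cigar_summary 61M315N14M
```

Both examples account for all 75 bases of the read; only the second one is flagged as containing a gap. On real data, `samtools view W403/accepted_hits.bam | awk '$6 ~ /N/' | less` shows only the alignments whose CIGAR string contains N.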

This transcript identification was performed once per sample. When - Cufflinks finished assembling transcripts for each lane have a look on + Cufflinks finished assembling transcripts for each lane have a look on its output files. Next step is to merge all assemblies and thus generate - a reference set that can be used for differential expression part. - For this part we will be using Tophat/Cufflinks output pre-run on + a reference set that can be used for differential expression part. + For this part we will be using Tophat/Cufflinks output pre-run on all RNA-seq files. This output is in the folder TopHat_full.

$ less M350b/Cufflinks/genes.fpkm_tracking

$ less M350b/Cufflinks/transcripts.gtf

$ cd /scratch/cluster/weekly/username/TopHat_full

Cuffmerge: merging results from the 3*4=12 cufflinks analyses.

See how to run cuffmerge.

$ cuffmerge -h

As you can see, the program requires a list of GTF files as input. Generate one.

$ ls -1 */Cufflinks/transcripts.gtf > gtf.list

Run - Cuffmerge to combine transcripts assembled for each replicate. We will -use reference file generated by Cuffmerge for differential expression + Cuffmerge to combine transcripts assembled for each replicate. We will +use reference file generated by Cuffmerge for differential expression analysis with Cuffdiff.

$ genomeRef=/scratch/cluster/weekly/username/GenomeSubset/Assembly/SINV_subset_1.fa

$ cuffmerge -s $genomeRef gtf.list

Have a look at the merged GTF file.

$ less merged_asm/merged.gtf

Renaming our favorite candidate genes (so we can find them easily)

We will want to be able to easily identify our favorite genes after analysis. To identify the 4 Vitellogenin proteins (Vg1, Vg2, Vg3, Vg4) and the 2 Transformer proteins in the assembled “merged.gtf”, we need to extract the fasta sequences corresponding to the coordinates in the gtf file. We will then BLAST these sequences against the file “Proteins.fasta”, which contains the proteins of interest: we format “Proteins.fasta” as a BLAST database and run a translated transcript query against it.

$ gffread merged_asm/merged.gtf -g $genomeRef -w merged_asm/merged.fa

$ formatdb -p T -i Proteins.fasta

$ blastall -p blastx -a 1 -m 8 -i merged_asm/merged.fa -d Proteins.fasta -v 1 -b 1 -e 1.0e-5  > merged_asm/merged.blastx.all

One of the Vg4 hits  Additionally we will retain only hits with a minimum alignment length of 150 for Transformer and 1500 for Vitellogenins (column 4).

Q: one of the fs

$ awk '{if($2~/Tra/ && $4>=150){print $0} else if($4>=1500){print $0}}' merged_asm/merged.blastx.all > merged_asm/merged.blastx
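To see what the length filter above keeps, it can help to run the same rule on a few made-up hits. This is only a sketch: the rows below are invented and reduced to the first four tabular columns, and the subject names (Tra1, Vg1, ...) are assumed to look like those in “Proteins.fasta”.

```shell
# Same rule as the awk filter above: keep Transformer hits with an alignment
# length (column 4) of at least 150, and any hit of at least 1500.
filter_hits() {
    awk '($2 ~ /Tra/ && $4 >= 150) || $4 >= 1500'
}

# Made-up hits: query, subject, percent identity, alignment length.
mock_hits() {
    cat <<'EOF'
TCONS_00000001 Tra1 91.2 160
TCONS_00000002 Tra2 85.0 90
TCONS_00000003 Vg1 78.4 1720
TCONS_00000004 Vg3 80.1 640
EOF
}

mock_hits | filter_hits
```

Only the first and third rows pass: the short Tra2 hit and the short Vg3 hit are both dropped.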

BLAST identifies matching transcript sequences, which are prefixed “TCONS_” by default. As we will be working at the gene level, we need to identify the corresponding genes, prefixed “XLOC_”. Use the script “get.xloc.sh” to do this, redirecting its output to a file using “>”.

$ cat merged_asm/merged.blastx

$ ./get.xloc.sh > myLoci.txt

$ cat myLoci.txt
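The contents of “get.xloc.sh” are not shown here, so the following is not that script. It is only a sketch of how a transcript id can be traced back to its gene id, assuming that column 9 of merged.gtf carries attributes of the form gene_id "XLOC_..."; transcript_id "TCONS_...";. The function name `xloc_for_transcript` is hypothetical.

```shell
# Print the gene_id (XLOC_...) recorded for one transcript_id (TCONS_...).
# $1 = transcript id, $2 = GTF file (tab-separated, attributes in column 9)
xloc_for_transcript() {
    awk -v tid="$1" -F'\t' '
        index($9, "transcript_id \"" tid "\"") {
            if (match($9, /gene_id "[^"]+"/)) {
                # Skip the 9 characters of the gene_id prefix and the closing quote.
                print substr($9, RSTART + 9, RLENGTH - 10)
                exit
            }
        }' "$2"
}

# Example call (the transcript id is a placeholder):
# xloc_for_transcript TCONS_00000012 merged_asm/merged.gtf
```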

Finally, we would like to update our gtf reference file with the names listed in column 2 of myLoci.txt. For this we will make a backup copy of the original file and then use the script “update_reference.sh” to make the changes.

$ cp merged_asm/merged.gtf merged_asm/merged_original.gtf

$ ./update_reference.sh

$ ./update_reference.sh myLoci.txt

Next we will run Cuffdiff, which calculates differential expression levels for genes and isoforms, as well as differential splicing and promoter use. We will need to estimate our average fragment length (remember: this is only necessary when analyzing single-end Illumina reads). Use R (on Vital-IT) to read the file file.list and calculate the average fragment length (specified for each file in column 2). Launch Cuffdiff directly or use the provided launch script “runCuffdiff.sh”. If you choose to use the launch script, first edit the input file locations and the average fragment length in it.
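If you would rather stay in the shell than start R, the average can also be computed with awk. This is a sketch under the assumption that file.list is whitespace-separated with the fragment length in column 2; any line whose second column is not a number (for example a header) is skipped. The function name `mean_of_column2` is ours.

```shell
# Average of the numeric values found in column 2.
# Reads the named files, or standard input if no file is given.
mean_of_column2() {
    awk '$2 ~ /^[0-9.]+$/ { sum += $2; n++ }
         END { if (n) printf "%.1f\n", sum / n }' "$@"
}

# Example call:
# mean_of_column2 file.list
```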

$ cuffdiff -h

$ ./runCuffdiff.sh

Have a look at output files of Cuffdiff.

$ ls Diff_FDR0.01

$ less Diff_FDR0.01/gene_exp.diff

$ less Diff_FDR0.01/genes.count_tracking

$ less Diff_FDR0.01/genes.read_group_tracking

$ grep Vg Diff_FDR0.01/gene_exp.diff
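Beyond grepping for single genes, it is often useful to pull out every test that Cuffdiff marked as significant. The sketch below assumes that gene_exp.diff is tab-separated and has a header row with a column named significant holding yes or no; the column is looked up by name, so its position does not matter. The function name `significant_rows` is ours.

```shell
# Print the header line plus every row whose "significant" column is "yes".
significant_rows() {
    awk -F'\t' '
        NR == 1 {
            # Find the column by its header name.
            for (i = 1; i <= NF; i++) if ($i == "significant") col = i
            print; next
        }
        col && $col == "yes"' "$@"
}

# Example call:
# significant_rows Diff_FDR0.01/gene_exp.diff | less
```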

CummeRbund: Insanely great plots

We will now use the R package cummeRbund, developed for the analysis of Cuffdiff output. Make sure you are using the latest R version (2.15). Start by downloading all files contained in the Cuffdiff output directory to your local computer. It will be faster if you “zip -r Diff_FDR0.01.zip Diff_FDR0.01” first. [failproof backup]

$ mkdir Diff_FDR0.01

$ scp username@prd.vital-it.ch:/scratch/cluster/weekly/username/TopHat_full/Diff_FDR0.01/* Diff_FDR0.01/

$ scp username@prd.vital-it.ch:/scratch/cluster/weekly/username/TopHat_full/myLoci.txt .

$ R

# install package

# source("http://bioconductor.org/biocLite.R")

# biocLite(c("cummeRbund", "Hmisc", "gplots"))

library(cummeRbund)

options(width=65)

# import data

cuff <- readCufflinks(dir = "Diff_FDR0.01", rebuild=T)

# look at object structure

str(cuff)

genes <- genes(cuff)

# plot gene expression density

dens<-csDensity(genes)

dens

# boxplots per sample

b<-csBoxplot(genes)

b

# assess general similarity between samples

# using a scatter plot comparing gene FPKM values for each comparison

(s<-csScatter(genes,"Q", "W", smooth=T))

(s<-csScatter(genes,"Q", "M", smooth=T))

(s<-csScatter(genes,"M", "W", smooth=T))

# inspect differentially expressed genes

(v<-csVolcano(genes,"Q","W"))

(v<-csVolcano(genes,"Q","M"))

(v<-csVolcano(genes,"M","W"))

#  gene fpkm values

gene.fpkm<-fpkm(genes)

head(gene.fpkm)

# access gene fpkm as a matrix

gene.matrix<-fpkmMatrix(genes)

head(gene.matrix)

# isoform fpkm

isoform.fpkm<-fpkm(isoforms(cuff))

head(isoform.fpkm)

# gene expression differences

gene.diff<-diffData(genes)

head(gene.diff)

# extract data for significant differences

sig_gene_data <- subset(gene.diff, (significant == "yes"))

# Q: why are the significant gene names not unique?

sigGenes <- unique(sig_gene_data$gene_id)

length(sigGenes)

# cummeRbund is currently implementing the support for replicates

# we extract replicate values by reading file "genes.read_group_tracking"

repData  <- read.table("Diff_FDR0.01/genes.read_group_tracking",

                       header=T, sep="\t")

sampleId <- paste(repData$condition, repData$replicate, sep="")

repData  <- data.frame(repData, "sampleId"=sampleId)

repFPKM  <- reshape(repData,  idvar="tracking_id",

                    timevar="sampleId", direction="wide",

                    drop=c(names(repData)[2:6], names(repData)[8:9]))

rowNames <- repFPKM[,1]

repFPKM  <- repFPKM[-1]

row.names(repFPKM) <- rowNames

colnames(repFPKM)  <- substr(colnames(repFPKM), 6,7)

#heatmap for significant gene values in replicates

library(gplots)  

FPKM.sig     <- repFPKM[sigGenes,]

FPKM.sig     <- log2(FPKM.sig+0.5)

genes.dend   <- as.dendrogram (hclust (as.dist (1-cor (t(FPKM.sig)))))

samples.dend <- as.dendrogram (hclust (as.dist (1-cor (FPKM.sig))))

heatmap.2(as.matrix(FPKM.sig), scale="row",

          Rowv=genes.dend, Colv=samples.dend, trace="none",

          col=colorRampPalette(c("white", "yellow", "red"))(100))

sample.names<-samples(genes)

head(sample.names)

gene.featurenames <- featureNames(genes)

head(gene.featurenames)

# let's look at Vitellogenins

myGene <- getGenes(cuff,"Vg1")

myGene

expressionPlot(myGene)

fpkm(myGene)

myGene <- getGenes(cuff,"Vg2")

expressionPlot(myGene)

fpkm(myGene)

myGene <- getGenes(cuff,"Vg3")

expressionPlot(myGene)

fpkm(myGene)

myGene <- getGenes(cuff,"Vg4")

expressionPlot(myGene)

fpkm(myGene)

# Q: What do you conclude about expression of different Vitellogenins?

# let's take a Vg overexpressed in Queens and find 20 similar gene profiles

mySimilar            <- findSimilar(cuff,"Vg2",n=20)

mySimilar.expression <- expressionPlot(mySimilar,logMode=T,showErrorbars=F)

print(mySimilar.expression)

# we can also define a profile to search with numeric values

# let's find genes overexpressed in both female castes: queens and workers

sample.names

myProfile             <- c(0,1000,1000)

mySimilar2            <- findSimilar(cuff,myProfile,n=10)

mySimilar2.expression <- expressionPlot(mySimilar2,logMode=T,showErrorbars=F)

print(mySimilar2.expression)

# let's find genes overexpressed in both sexual castes: queens and males

myProfile             <- c(1000,1000,0)

mySimilar2            <- findSimilar(cuff,myProfile,n=10)

mySimilar2.expression <- expressionPlot(mySimilar2,logMode=T,showErrorbars=F)

print(mySimilar2.expression)

# k-means clustering

myGenes<-getGenes(cuff,sig_gene_data$gene_id)

ic <- csCluster(myGenes, k = 4)

head(ic$cluster)

icp <- csClusterPlot(ic)

icp

Optional: Alternative Splicing

We know that Transformer genes (Tra1 and Tra2) have sex-specific alternative splicing. Prepare a gff file to examine possible isoforms in Apollo. The fasta sequence for the scaffold containing these genes is in the file “SIgn00005.fa”. First we get the coordinates of the Tra1 and Tra2 genes into Tra.gtf. Then we convert the gtf to gff. Finally, the gff and fasta files must be copied to your local computer to visualize them with Apollo.

$ grep Tra merged_asm/merged.gtf > Tra.gtf

$ gffread Tra.gtf -o Tra.gff

$ scp username@prd.vital-it.ch:/scratch/cluster/weekly/username/TopHat_full/Tra.gff .

$ scp username@prd.vital-it.ch:/scratch/cluster/weekly/username/TopHat_full/SIgn00005.fa .

Q: Can you identify Tra1 and Tra2 isoforms that are likely to be artefacts?

Q: Which ones look plausible?

\ No newline at end of file + diff --git a/teaching/MWAS practical.docx.html b/teaching/MWAS practical.docx.html index 7035c79a..1c30353b 100644 --- a/teaching/MWAS practical.docx.html +++ b/teaching/MWAS practical.docx.html @@ -38,71 +38,71 @@ padding: 0 6px; }

MWAS practical

The data for today's practical can be downloaded from the Vital-IT server using the command scp. It will need to be on your local machine:

$ scp username@prd.vital-it.ch:/scratch/cluster/monthly/kridout/summer2012_Day2.zip ./ 

To annotate our genome we will need an automated pipeline of repeat masking, homology searching and gene prediction. To do this we will be using the program MAKER on the webserver MWAS.

To avoid overloading the MWAS server, PLEASE WORK IN PAIRS AND SEND ONLY A SINGLE MAKER REQUEST.

Depending on the type of analysis to be performed, we could try to annotate all scaffolds, or we might identify scaffolds of interest. Here we will pick a number of interesting scaffolds out of yesterday's assemblies. We will do this using BLAST.

First, we will need to login to Vital-It.

$ ssh username@prd.vital-it.ch

$ ssh dee-serv02.vital-it.ch

Make the directories that we will work in:

$ mkdir /scratch/cluster/weekly/username/MWAS

$ mkdir /scratch/cluster/weekly/username/MWAS/BLAST

$ mkdir /scratch/cluster/weekly/username/MWAS/BLAST/db

Change into the MWAS folder and collect a list of interesting proteins.

$ cd /scratch/cluster/weekly/username/MWAS

$ unzip /scratch/cluster/monthly/kridout/Interesting_genes.fa.zip

We need to temporarily add the BLAST executables to your $PATH (do not place this permanently into your .bashrc as you will need to run an older version of BLAST for another practical):

$ source /mnt/common/Blast/Blast.bashrc

Next we need to turn the scaffolds into a BLAST database for searching. First change into the BLAST directory and make the database:

$ cd /scratch/cluster/weekly/username/MWAS/BLAST

$ makeblastdb -in [path_to_your_best_scaffolds] -dbtype nucl -out [db/db_name]

Now we blast our interesting proteins against the scaffolds assembled yesterday to find those of interest.

$ tblastn -db [db/db_name] -query [path_to_Interesting_genes.fa] -evalue 2e-8 -out [blast_output] -outfmt 6

Finally, we will run a script to make a fasta file of all the matching scaffolds.

$ perl /scratch/cluster/monthly/kridout/scaffoldsFromBlast.pl [path_to_your_best_scaffolds] [blast_output] > [scaffold_subset]
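The Perl script is provided as is and its source is not shown here. As an illustration of the idea only (not the script itself), the sketch below collects the subject ids from column 2 of the tabular BLAST output and prints the fasta records whose header starts with one of those ids. The function name `scaffolds_with_hits` is hypothetical.

```shell
# Print the FASTA records of every scaffold that appears as a subject
# (column 2) in tabular BLAST output.
# $1 = BLAST output (-outfmt 6), $2 = scaffold FASTA
scaffolds_with_hits() {
    awk '
        NR == FNR { wanted[$2] = 1; next }   # first file: remember subject ids
        /^>/      { id = substr($1, 2); keep = (id in wanted) }
        keep' "$1" "$2"
}

# Example call with the placeholders used above:
# scaffolds_with_hits [blast_output] [path_to_your_best_scaffolds] > [scaffold_subset]
```

Each scaffold is printed once, however many proteins hit it, because the ids are stored as array keys.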

Use abyss-fac to see how many scaffolds you have collected and the range of scaffold lengths.
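If abyss-fac is not at hand, the count and length range can be cross-checked with a few lines of awk. This is a rough sketch, not a replacement for abyss-fac: it reports only the number of sequences and the shortest and longest length, and the function name `fasta_length_summary` is ours.

```shell
# Count the sequences in a FASTA file and report the shortest and longest.
# Reads the named files, or standard input if no file is given.
fasta_length_summary() {
    awk '
        /^>/ { if (seen) record(); seen = 1; len = 0; next }
        { len += length($0) }   # sequences may be wrapped over several lines
        END { if (seen) record(); printf "n=%d min=%d max=%d\n", n, min, max }
        function record() {
            n++
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }' "$@"
}

# Example call with the placeholder used above:
# fasta_length_summary [scaffold_subset]
```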

   

Exit Vital-IT and collect the file using scp.

$ scp username@prd.vital-it.ch:/scratch/cluster/weekly/username/MWAS/BLAST/[scaffold_subset] ./

The remainder of this practical will be done on your local machine.

>Navigate to the MAKER website:

http://derringer.genetics.utah.edu/cgi-bin/MWAS/maker.cgi

>Make a new login using your username (e.g. student4) and select the New Job tab

The Denovo Annotation section should contain any sequences to be annotated. In our case, this means the scaffold subset that we have just collected.  

>Upload your scaffold file

EST evidence from our study species or a relative can be added.  

>Upload the transcript file Sinvicta_tr.fa from the day 2 files

Searching the full swissprot database and the full repeat database will take too much time, so a subset of proteins and repeats has been prepared for you.

>Upload the protein file Swissprot_subset.prot

>Upload the repeats file Sinvicta_repeats.fa

>Under Denovo Annotation select your scaffold file

>Under EST Evidence select the transcript file

>Under Protein Homology Evidence select the swissprot protein file

>Under the Configure Repeat Masking section select the repeats file

>Select the course training files for SNAP (we will not be using AUGUSTUS and GeneMark, although this is recommended)

Very short scaffolds might still contain whole genes; however, there is a high likelihood of broken genes in these regions. In the interest of time and accuracy we will skip them.

>Under Annotation Properties set the minimum contig length to 1000

>Select Add Job to Queue to launch MAKER

This process will take several hours. You can monitor the status of your job from the Running Jobs tab.

MWAS Questions

You should have now started the process of genome annotation. Generating a set of reliable genes is not a trivial task. It is very important to understand the potential problems with genes predicted by automated methods. Consider the data that you have produced thus far and answer the following questions:

Q1. Genome/transcript assembly does not yet produce scaffolds that are guaranteed to be accurate. What effects could misassemblies in the data have on genome annotation?

Q2. Errors in the EST or protein annotation databases will be perpetuated through this annotation process. This is a problem for all homology-based annotation methods. How might you try to deal with this uncertainty?

Q3. What are the pros and cons of a homology-based gene finding approach?

Q4. What are the pros and cons of a de novo prediction approach (e.g. AUGUSTUS) for gene finding?

Q5. What are the potential biases generated by predicting genes in this way?

WHILE MAKER IS RUNNING CHECK THAT YOU HAVE APOLLO INSTALLED AND FUNCTIONING

If Apollo is not installed, you should download and install it before we continue with the Apollo practical. Downloading it from the local server will be fastest.

Apollo manual gene editing

Sometimes we might want to view or manually edit specific genes. To perform this task, programs such as IGV or Apollo can be used.

>Download all data from MWAS (under View Results)

>Open Apollo and load the gff file (this is gff3 format) of all scaffolds downloaded from MAKER. Untick the box labeled Embedded sequence and select the original scaffold file

The top 2 panes represent the forward strand and the bottom 2 the reverse strand. Results from the different prediction methods are displayed in the black panes and the final annotations in the blue panes.

>Click on the squares (predictions) in the black panes to see which programs have produced the different results

>To color the different prediction methods, right click on the prediction (in the black panel) and select  Change color of this feature type

Scaffold and gff3 files have been prepared for you to examine and compare with your own annotations. Open the fasta file Sinvicta_good_scaffolds.fa and the gff3 file Sinvicta_good_predictions.gff3 in Apollo.

Q6. What are the main differences that you see between the files (do not try to investigate individual genes)?

Q7. Based on these results, what can you say about which programs over-predict and which under-predict? What do you think of the reliability of these programs?

The gff3 file given to you contains examples of correctly and incorrectly annotated genes. In some places genes have been forced together, or separated at introns. Some genes also have incorrect splice sites.

Q8. Using Apollo, identify likely genes that might have been badly annotated. Why did you choose these genes?

Using tools such as Apollo we can perform manual editing of gene predictions. Splice sites occur at the intron/exon boundary and are highly conserved. Apollo highlights the ends of exons that do not have the necessary splice sites using an orange triangle:

 

Acceptable splice sites can be colored as in the above example.

>Right click on the nucleotide sequence to select color by splice site potential  (you will not see the nucleotides until you have zoomed in sufficiently far)

Simply choosing the closest splice site is not a reliable method for fixing this prediction.
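As background for what "acceptable" means here: in most eukaryotic genes an intron begins with GT and ends with AG when read in the direction of the transcript. The sketch below checks exactly that for one intron sequence. The function name `is_canonical_intron` and the two example sequences are made up for illustration, and for genes on the reverse strand the sequence must first be reverse-complemented. Passing this check does not show that a splice site is real, which is why the closest candidate site is not necessarily the right one.

```shell
# Report whether an intron sequence, given in transcript orientation,
# starts with GT and ends with AG.
is_canonical_intron() {
    intron=$(printf '%s' "$1" | tr 'acgt' 'ACGT')   # accept lower-case input
    case "$intron" in
        GT*AG) echo "canonical" ;;
        *)     echo "non-canonical" ;;
    esac
}

is_canonical_intron gtaagtttcag   # invented example, starts GT and ends AG
is_canonical_intron ctaagtttcag   # invented example, wrong first base
```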

Q9. How might you determine the true splice positions of a predicted gene?

The gene maker-Sign00006-snap-gene-0.10-mRNA-1 contains a number of incorrect splicing sites identified by Apollo.

>Select this gene under the Annotation tab for a better view and zoom in to examine the sequence

Q10. Do you think that this gene represents a real protein? Explain your answer. What else might this gene be?

A protein domain scan on the whole genome determined that the fire ant genome contains multiple Vitellogenins. This is potentially interesting because the single Vitellogenin of the honey bee plays roles in determining lifespan and division of labor (Amdam et al. 2012). Protein2genome suggests there are three Vitellogenins in our scaffold:

maker-Sign00006-snap-gene-1.41-mRNA-1

maker-Sign00006-snap-gene-1.42-mRNA-1

maker-Sign00006-snap-gene-1.43-mRNA-1

Q11. What evidence supports these predictions?

Gene maker-Sign00006-snap-gene-1.40-mRNA-1 is considerably shorter, but has also matched this protein. Experimental evidence suggests that there is indeed a 4th Vitellogenin gene, despite the missing EST evidence.

A closer examination of the BLAST evidence suggests that this gene has been truncated by the prediction software.

Q12. Why do you think this is?

We will attempt to extend this gene. Given more time we would examine the BLAST results against a number of Vitellogenin homologues to determine the correct splice, start and stop sites. Here, we will try to combine the BLASTX prediction and EST results.

First, we should duplicate the gene(s) that we intend to work on in case we make a mistake.

>Double click on each of these genes to select it as a whole. Right click on the selected gene and choose Duplicate transcript

The first new BLAST exon (from right to left) does not begin at a splice site; however, it can be extended using EST evidence, which does.

>Right click in the blue panel underneath the nucleotide that your new exon should start at. Select Create new annotation -> gene

>Set the new gene length to a single base

>Find the end of your exon and repeat this process

>Hold the shift key and click on the two new genes to select them both.  Right click on one of these and select Merge transcripts

>Double click on the new gene to select the whole thing. Right click on this gene and select Merge exons

You have built your first new exon. Apollo will place an orange triangle at the end of the gene if the splice site was not correct.

>Check your gene for correct splicing and frame. If the frame is incorrect there will be stop codons.

>If your exon is correctly spliced, you should try to join it to the rest of the gene. Select the gene and the new exon, right click and Merge transcripts

>You may now find that the splicing does not work. This is because the original gene was built to end at a stop codon. Use the EST and BLAST evidence to find the correct splice site. Make sure that the correct reading frame is maintained

To move the end of the exon by small amounts, right click on the transcript and select Exon detail editor. Navigate to the necessary part of the sequence (remember that the sequence will be displayed from left to right, whereas these genes on the reverse strand are viewed in the pale blue panel from right to left).

>Repeat this process until you reach the end of the gene. Remember to stop the gene at a stop codon.

Predicted genes:

maker-Sign00006-snap-gene-6.41-mRNA-1

maker-Sign00006-snap-gene-6.43-mRNA-1

Both map in parts to the same EST. It is possible that these genes have been fragmented by Apollo and belong to the same transcript.

Q13. How might you try to determine whether these are the same gene?

Extension

Despite possible errors in the databases, genes can be annotated using automated BLAST searching. A commonly used tool for automated annotation is BLAST2GO, which allows the user to annotate using a gene name, domain, function, ontology and pathway. This can be very useful for downstream processes.

If you have finished editing your genes, choose a subset (maximum 10) and explore the functionality of BLAST2GO:

http://www.blast2go.com/b2glaunch

\ No newline at end of file + diff --git a/teaching/bachelor3-tp4Microarray/index.html b/teaching/bachelor3-tp4Microarray/index.html index 361b28d4..ce387e7c 100644 --- a/teaching/bachelor3-tp4Microarray/index.html +++ b/teaching/bachelor3-tp4Microarray/index.html @@ -7,7 +7,7 @@

TP4: Expression data analysis

Please send your answers to the highlighted questions to (ask for the address) with [TP4 Bioinfo] as the subject. Since this practical requires several programs to be installed, it is advisable to do it in rooms Pol204.2 or Bio1928.

Your research topic is muscle tissue. In particular, you want to characterize the heart from a molecular point of view.

Through a literature search you find an extensive study of the human and mouse transcriptomes. Perhaps it contains interesting data! In 2004, Su et al. quantified gene expression in 79 human tissues and 61 mouse tissues using DNA microarrays. For human, they used an Affymetrix array based on the Affymetrix Human U133A array. The (highly cited) article includes a link to the raw data.

Setting up the data

Download the dataset Human U133A + GNF1H (MAS5-condensed).

Decompress this dataset. From Excel, open the data file GNF1Hdata.txt, which is in text format with columns separated by tabs. Try to understand the contents of the file.

What does each row represent? Each column?
@@ -59,7 +59,7 @@

Data visualization

In a real case we would carry out several other preliminary analyses to validate the quality of the data. Several Bioconductor packages can help us.

Data transformation

We have absolute expression values for each gene. Two biological questions can arise: