Final Project Guidelines

Overview

The final project is intended to give you an opportunity to gain real-world experience performing research related to personal genomics. Projects may be performed in groups of 1-3 students, so you may work alone if you wish. We encourage you to use Piazza to help you find teammates.

Projects may take one of the following forms:

  • Implementing an analysis method we covered in class, and benchmarking your implementation against existing tools on real datasets.
  • A novel analysis of public human genome datasets, or simulated genomes, to answer an original research question.

We will use this Google spreadsheet to keep track of teams and who is doing what: https://docs.google.com/spreadsheets/d/1jJlKEqkC2gjWmuD5blRrCVQMcNLRxMR4BCYImoG-4LY/edit?usp=sharing

We have provided suggested project ideas for each of these categories to give you a sense of the scope projects should take. However, you may also propose your own topic. In that case, we encourage you to discuss your idea with the instructors as early as possible. We will continue to add to this list over the next several weeks.

Scope: You have only a couple of weeks to do this. That might sound like a lot of time, but it is not. You can be ambitious, but make sure you will have results to discuss in the report. Consider the scope of the project to be around 1-2 problem sets' worth of work.

Deliverables

The final project is worth a total of 30% of your grade and consists of the following components:

  • Proposal (5%) Due 05/05/21
  • Paper (20%) Due 06/04/21
  • Presentation (5%) To take place 05/26/21 and 06/02/21

Awards

We will be giving out several awards for top projects! Each award is worth 2 points of extra credit (out of 100 points) on the final report.

  • Best figure
  • Best presentation
  • Best documentation

One group may receive more than one award. Awards will be announced during week 11 on Piazza.

Resources

We anticipate that projects will be completed on JupyterHub. If your team requires additional storage or compute resources, let us know early on and we can accommodate reasonable requests.

Large datasets to be used by multiple teams (e.g. 1000 Genomes Project) will be made available in the public course directory.

The following public datasets would be good candidates for the project:

  • The 1000 Genomes Project http://www.internationalgenome.org/ Sequencing data and variant calls for about 3,000 samples worldwide.
  • The Simons Genome Diversity Project https://www.simonsfoundation.org/life-sciences/simons-genomediversity-project-dataset/ 300 high-coverage whole genomes from diverse population groups.
  • The ENCODE Project https://www.genome.gov/10005107 Histone modification, transcription factor binding, expression, and other data from a variety of human cell types.
  • The GTEx Project gtexportal.org Expression data and QTLs from dozens of human tissues.