One of the main things a bioinformatician does all day is wrangle data from one file format to another. Here we will briefly summarize the main file formats genotypes are stored in, some tools for wrangling them, and some additional genotype file formats we'll occasionally be dealing with.
Data you will download from 23andme is in VCF format. VCF files generally have:

- Header lines starting with ##, which describe the data in each field. They also contain helpful things like the commands used to generate the file, information about the reference genome used, etc.
- A single line starting with #CHROM, which gives the column headers of the data in the following rows.

The most important FORMAT field is "GT", which gives the genotype of each sample: 0=reference allele, 1=alternate allele 1, 2=alternate allele 2, etc. For example, GT=0/0 means a sample is homozygous for the reference allele, 0/1 means heterozygous, and 1/1 means homozygous for the first alternate allele.
For unphased genotypes, alleles are separated by a "/", meaning they are unordered. For phased genotypes, alleles are separated by a "|", meaning they are ordered (the first allele is on one haplotype and the second on the other).
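In code, the separator tells you whether a call is phased. A minimal Python sketch of parsing a GT string (the function name `parse_gt` is my own, for illustration):

```python
import re

def parse_gt(gt):
    """Split a VCF GT string into allele indices and a phased flag.

    "0/1" is unphased, "0|1" is phased; missing alleles (".") become None.
    """
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased

print(parse_gt("0/1"))  # ([0, 1], False) -- unphased heterozygote
print(parse_gt("1|1"))  # ([1, 1], True)  -- phased homozygous alternate
```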
The full spec is described here: https://samtools.github.io/hts-specs/VCFv4.2.pdf
You can look at these example VCF files on datahub:
/datasets/cs284s-sp20-public/ps1/pset1_1000Genomes_chr16.vcf (phased, multi-sample)
/datasets/cs185s-sp20-public/week3/friend_genome.vcf.gz (unphased, single-sample)

Note the second is zipped. You can zip and index VCF files using bgzip and tabix:
cp /datasets/cs284s-sp20-public/ps1/pset1_1000Genomes_chr16.vcf . # copy to current dir
bgzip pset1_1000Genomes_chr16.vcf
tabix -p vcf pset1_1000Genomes_chr16.vcf.gz
Indexing allows us to easily extract specific genomic locations, e.g.:
tabix pset1_1000Genomes_chr16.vcf.gz 16:31011634-31011635
You can use bcftools to do many manipulations to VCF files, e.g.:
bcftools query -l pset1_1000Genomes_chr16.vcf.gz # list the samples in the VCF file
bcftools index -n pset1_1000Genomes_chr16.vcf.gz # list the number of records (variants) in the VCF file
bcftools query -e 'AF<0.01 || AF>0.99' -f "[%GT\t]\n" pset1_1000Genomes_chr16.vcf.gz # extract tab-separated genotypes at common variants
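The `-e` expression above excludes rare and near-fixed variants, keeping those with 0.01 ≤ AF ≤ 0.99. If you were computing allele frequencies yourself from GT strings, the same check can be sketched in Python (function names here are my own):

```python
import re

def allele_frequency(gts):
    """Alternate-allele frequency from a list of GT strings like "0|1"."""
    alleles = [int(a) for gt in gts for a in re.split(r"[/|]", gt) if a != "."]
    return sum(1 for a in alleles if a > 0) / len(alleles)

def is_common(gts, min_af=0.01, max_af=0.99):
    """Mirror the bcftools filter: keep variants with 0.01 <= AF <= 0.99."""
    return min_af <= allele_frequency(gts) <= max_af

print(is_common(["0|0", "0|1", "1|1"]))  # AF = 0.5 -> True
print(is_common(["0|0", "0|0", "0|0"]))  # AF = 0.0 -> False
```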
If you want to parse a VCF file in Python, rather than writing your own tools to do this, you're better off using existing libraries.
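Libraries like cyvcf2 and pysam handle the format robustly. Purely for illustration of what such a parser does with one tab-separated data line, here is a hand-rolled sketch (all names are my own; don't use this for real work):

```python
def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dict.

    Assumes the 8 standard fixed columns, then FORMAT, then one column
    per sample. A sketch only -- use cyvcf2/pysam for real parsing.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    fmt_keys = fields[8].split(":")
    samples = [dict(zip(fmt_keys, s.split(":"))) for s in fields[9:]]
    return {"chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
            "alt": alt.split(","), "samples": samples}

rec = parse_vcf_line("16\t31011634\trs123\tA\tG\t100\tPASS\tAF=0.5\tGT\t0|1\t1|1")
print(rec["pos"], rec["samples"][0]["GT"])  # 31011634 0|1
```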
You will also come across genotypes in the format used by Plink (https://www.cog-genomics.org/plink/1.9/formats), which can perform a ton of different functions, including filtering, association testing, IBD calculation, and more.
The text version of plink files consists of .ped and .map files. The binary versions of these are .bed, .bim, and .fam files.
You can find example plink data in: /datasets/cs284s-sp20-public/ps2/
When running plink, you will almost always use one of these options:
- --bfile <prefix>: uses <prefix>.bed, <prefix>.bim, and <prefix>.fam as input
- --file <prefix>: uses <prefix>.ped and <prefix>.map as input

You will probably eventually encounter the need to convert between VCF and plink formats. Plink can do that:
# VCF->plink
echo rs112607901 > exclude.txt # this ID was duplicated
plink \
--vcf pset1_1000Genomes_chr16.vcf.gz \
--recode \
--exclude exclude.txt \
--out pset1_1000Genomes_chr16
# Plink->VCF
zcat pset1_1000Genomes_chr16.vcf.gz | grep -v "^#" | cut -f 1-5 > gtdata_alleles.tab
plink \
--file pset1_1000Genomes_chr16 \
--recode vcf bgz \
--a2-allele gtdata_alleles.tab 4 3 '#' \
--real-ref-alleles \
--exclude exclude.txt \
--out pset1_1000Genomes_chr16_converted
We will encounter other ways of representing genotypes, including the Oxford .gen format used by IMPUTE2. You can convert a VCF file using the vcf2impute_gen script:

wget https://mathgen.stats.ox.ac.uk/impute/scripts/vcf2impute_gen
perl vcf2impute_gen -vcf pset1_1000Genomes_chr16.vcf.gz \
    -gen pset1_1000Genomes_chr16.gen.gz
See a description of the format here: https://www.cog-genomics.org/plink/1.9/formats#gen
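The .gen format stores three genotype probabilities per sample (hom-ref, het, hom-alt); for hard calls from a VCF these become 0/1 indicators. A sketch of that mapping (the function name is my own, and the all-zeros encoding for missing calls is one common convention, not necessarily what vcf2impute_gen does):

```python
def gt_to_gen_probs(gt):
    """Map a hard diploid GT call to a .gen-style probability triple:
    (P(hom ref), P(het), P(hom alt)). Missing calls get all zeros
    (a common convention; this is a sketch, not the script's source)."""
    if "." in gt:
        return (0, 0, 0)
    n_alt = sum(int(a) > 0 for a in gt.replace("|", "/").split("/"))
    return tuple(int(n_alt == k) for k in range(3))

print(gt_to_gen_probs("0/0"))  # (1, 0, 0)
print(gt_to_gen_probs("0|1"))  # (0, 1, 0)
print(gt_to_gen_probs("1/1"))  # (0, 0, 1)
```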
All the problem sets for CSE284 are done in Jupyter notebooks. You might find the following tips helpful.
You can run bash commands directly in cells if you precede your command with a !. You can even use tab completion, e.g.:
! ls /datasets/cs284s-sp20-public/ps2
You can also write multi-line bash scripts directly in cells using the bash magic %%bash
at the top of the cell:
%%bash
echo "Hello world"
ls /datasets/cs284s-sp20-public/ps1
You can pass arguments to the bash cells, which can either be constants or variables you had defined earlier in Python:
chrom="1"
file="myfile"
%%bash -s "$chrom" "$file" "test"
chrom=$1 # $1 reads first input argument to the bash script
file=$2 # $2 reads second argument
mystring=$3 # etc. ...
echo "Run command for chr$chrom on file $file. $mystring"
You can also use other types of cell magics, e.g.:

- %%file <filename> writes the contents of the cell to a file
- %%timeit times the contents of the cell
- %%R lets you write R code (after you run %load_ext rpy2.ipython in an earlier cell; you might need to do pip install --user simplegeneric to get that to work)

%%file testfile.txt
1 2 3
4 5 6
%%timeit
x = 2
%load_ext rpy2.ipython
# Suppress warnings (most notable from rpy2)
import warnings
warnings.filterwarnings('ignore')
%%R
library(dplyr)
paste("Look I can run R code here")
x <- 2
paste(x)
%%bash
DATADIR=/datasets/cs284s-sp20-public/ps2/rfmix/
for chrom in $(seq 1 22)
do
ls -ltrh ${DATADIR}/ps2_rfmix_chr${chrom}.map
done
Sometimes a command will take a long time, and you'd like to close your computer while it's running and come back to it later. You can use the nohup command for this. The syntax is:
nohup <mycommand> &
This will run your command and output what would normally get printed to the screen to nohup.out
.
Sometimes if you use > in your command, it can cause problems with where nohup writes the output. The safest ways to avoid this are either:
Make a string variable with your command, then run it with sh -c:
mycommand="echo 1 2 3 > test.txt"
nohup sh -c "$mycommand" &
Alternatively, put all your commands in a bash script, e.g. myscript.sh, with the header line #!/bin/bash at the top, then run the script with nohup:
chmod +x myscript.sh
nohup ./myscript.sh &
(The example is for a bash script, but you can do the same with a Python, perl, etc. script.)
Sometimes you have a lot of jobs to run (like one per chromosome) and would like to parallelize them and run them in the background with nohup. But maybe you don't want to run them all at once due to CPU/RAM constraints, e.g. if you are running imputation and it takes 10GB of memory and your server has only 50GB of RAM, you should probably only run ~2-3 jobs at once. xargs is a versatile tool that does lots of things, but one of the most useful is the ability to run a set number of commands at once. Below gives an example. Note:
- -n 1 tells xargs to run on one line at a time
- -P 5 tells xargs to use at most 5 processors (run at most 5 jobs at a time)
- -I % tells it to replace % in the command run with each line from standard input

The command is a little cryptic (it is the only xargs command I ever use...), but I find it super helpful.
You can put the cell below in a script myscript.sh
, then run nohup ./myscript.sh &
to run in the background, making sure you only run up to 5 jobs at a time.
%%bash
DATADIR=/datasets/cs284s-sp20-public/ps2/rfmix/
for chrom in $(seq 1 22)
do
cmd="ls -ltrh ${DATADIR}/ps2_rfmix_chr${chrom}.map"
echo $cmd
done | xargs -n1 -I% -P5 sh -c "%"
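The same throttle-to-N pattern can be sketched in Python with the standard library, which is handy when your per-chromosome work is Python code rather than a shell command (the toy `run_job` function here stands in for one real job):

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(chrom):
    """Stand-in for one per-chromosome command (toy task for illustration)."""
    return f"processed chr{chrom}"

# max_workers=5 plays the role of xargs -P5: at most 5 jobs in flight at once
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(run_job, range(1, 23)))

print(results[0], "...", results[-1])  # processed chr1 ... processed chr22
```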