Part 1: Genotype Data Formats and Wrangling Tools

One of the main things a bioinformatician does all day is wrangle data from one file format to another. Here we will briefly summarize the main file formats genotypes are stored in, some tools for wrangling them, and some additional genotype file formats we'll occasionally be dealing with.

Part 1.1: VCF files

Data you will download from 23andme is in VCF format. VCF files generally have:

  • Header lines beginning with ## which describe the data in each field. They also contain helpful things like the commands uesd to generate the file, information about the reference genome used, etc.
  • A final header line beginning with #CHROM which gives the column headers of the data in the following rows.
  • One line per variant describinging:
    • CHROM: the chromosome of the variant
    • POS: the position on the chromosome of the variant
    • ID: the identifier of the variant, for example a dbSNP rsid.
    • REF: the allele in the reference genome
    • ALT: a comma-separated list of alternate alleles (usually there is just 1)
    • QUAL: Describes confidence this is a true variant site.
    • FILTER: contains locus-level filters, e.g. PASS if it passes, or "." if not set, or something else if it fails.
    • INFO: ";"-separated list of locus-specific metrics, such as allele frequencies, imputation quality, etc.
    • FORMAT: ":"-separated list of data fields reported for each sample
    • The remaining part of the line has one column per sample. Each sample has a ":"-separated list of data values according to the order specified in FORMAT>

The most important FORMAT field is "GT", which gives the genotype of each sample. 0=reference allele, 1=alternate allele 1, 2=alternate allele 2, etc. For example GT=0/0 means a sample is homozygous for the reference allele, 1/1 means heterozygous.

For unphased genotypes, alleles are separated by a "/", meaning they are unordered. For phased genotypes, alleles are separated by a "|", meaning they are ordered.

The full spec is described here: https://samtools.github.io/hts-specs/VCFv4.2.pdf

You can look at these example VCF files on datahub:

  • /datasets/cs284s-sp20-public/ps1/pset1_1000Genomes_chr16.vcf (phased, multi-sample)
  • /datasets/cs185s-sp20-public/week3/friend_genome.vcf.gz (unphased, single-sample)

Note the second is zipped. You can zip and index VCF files using bgzip and tabix:

cp /datasets/cs284s-sp20-public/ps1/pset1_1000Genomes_chr16.vcf . # copy to current dir
bgzip pset1_1000Genomes_chr16.vcf 
tabix -p vcf pset1_1000Genomes_chr16.vcf.gz

Indexing allows us to easily extract specific genomic locations, e.g.:

tabix pset1_1000Genomes_chr16.vcf.gz 16:31011634-31011635

You can use bcftools to do many manipulations to VCF files, e.g.:

bcftools query -l pset1_1000Genomes_chr16.vcf.gz # list the samples in the VCF file
bcftools index -n pset1_1000Genomes_chr16.vcf.gz # list the number of records (variants) in the VCF file
bcftools query -e'AF<0.01' -e'AF>0.99' -f "[%GT\t]\n" pset1_1000Genomes_chr16.vcf.gz # extract tab-separated genotypes

If you want to parse a VCF file in Python, rather than writing your own tools to do this, you're better off using existing libraries:

You will also come across genotypes in the format used by Plink (https://www.cog-genomics.org/plink/1.9/formats), which can perform a ton of different functions, including filter, association testing, IBD calculation, and more.

The text version of plink files includes:

  • FAM (sample info): A text file with no header line, and one line per sample with the following six fields:
    • Family ID ('FID')
    • Within-family ID ('IID'; cannot be '0')
    • Within-family ID of father ('0' if father isn't in dataset)
    • Within-family ID of mother ('0' if mother isn't in dataset)
    • Sex code ('1' = male, '2' = female, '0' = unknown)
    • Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
  • PED (genotypes): Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file.
  • MAP (variants info) : A text file with no header file, and one line per variant with the following 3-4 fields:
    • Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.
    • Variant identifier
    • Position in morgans or centimorgans (optional; also safe to use dummy value of '0')
    • Base-pair coordinate

The binary versions of these files are bed/bim files.

You can find example plink data in: /datasets/cs284s-sp20-public/ps2/:

When running plink, you will almost always use one of these options:

  • --bfile <prefix>: uses <prefix>.bed and <prefix>.bim as input
  • --file <prefix>: uses <prefix>.ped, <prefix>.map, and <prefix>.fam as input

You will probably eventually encounter the need to convert things between VCF/plink formats. Plink can do that:

# VCF->plink
echo rs112607901 > exclude.txt # this ID was duplicated
plink \
  --vcf pset1_1000Genomes_chr16.vcf.gz \
  --recode  \
  --exclude exclude.txt \
  --out pset1_1000Genomes_chr16

# Plink->VCF
zcat pset1_1000Genomes_chr16.vcf.gz | grep -v "^#" | cut -f 1-5 > gtdata_alleles.tab
plink \
  --file pset1_1000Genomes_chr16 \
  --recode vcf bgz \
  --a2-allele gtdata_alleles.tab 4 3 '#' \
  --real-ref-alleles \
  --exclude exclude.txt \
  --out pset1_1000Genomes_chr16_converted

Part 1.3 Other formats

We will encounter other ways of representing genotypes, including:

  • IMPUTE2 uses a "gen" file format. You can convert from VCF with the script vcf2impute_gen
wget (https://mathgen.stats.ox.ac.uk/impute/scripts/vcf2impute_gen
perl vcf2impute_gen -vcf pset1_1000Genomes_chr16.vcf.gz \
    -gen pset1_1000Genomes_chr16.gen.gz

See a description of the format here: https://www.cog-genomics.org/plink/1.9/formats#gen

Part 2: Jupyter Notebook Tips

All the problem sets for CSE284 are done in Jupyter notebooks. You might find the following tips helpful.

Part 2.1 Running bash commands in notebooks - oneliners

You can run bash commands directly in cells if you preceed your command with a !. You can even use tab completion. e.g.:

In [2]:
! ls /datasets/cs284s-sp20-public/ps2
imputation	 ps2_ibd.lwk.nosex	    ps2_impute.subset.gen.gz
ps2_ibd.lwk.bed  ps2_ibd.lwk.ped	    ps2_pca.genotypes.vcf.gz
ps2_ibd.lwk.bim  ps2_ibd.lwk.vcf.gz	    ps2_pca.genotypes.vcf.gz.tbi
ps2_ibd.lwk.fam  ps2_ibd.lwk.vcf.gz.tbi     ps2_reference_labels.csv
ps2_ibd.lwk.log  ps2_impute.heldout.gen.gz  rfmix
ps2_ibd.lwk.map  ps2_impute.heldout.vcf.gz

Part 2.2 Running bash commands - cell magics

You can also write multi-line bash scripts directly in cells using the bash magic %%bash at the top of the cell:

In [4]:
%%bash

echo "Hello world"

ls /datasets/cs284s-sp20-public/ps1
Hello world
pset1_1000Genomes_chr16.vcf
VKORC1_exons_NM_001311311.bed

You can pass arguments to the bash cells, which can either be constants or variables you had defined earlier in Python:

In [9]:
chrom="1"
file="myfile"
In [10]:
%%bash -s "$chrom" "$file" "test"

chrom=$1 # $1 reads first input argument to the bash script
file=$2 # $2 reads second argument
mystring=$3 # etc. ...

echo "Run command for chr$chrom on file $file. $mystring"
Run command for chr1 on file myfile. test

Part 2.3 Other types of cell magics

You can also use other types of cell magics, e.g.:

  • %%file <filename> writes the contents of the cell to a file
  • %%timeit times the contents of the cell
  • %%R lets you write R code. (After you run %load_ext rpy2.ipython in a cell before that. You might need to do pip install --user simplegeneric to get that to work)
In [12]:
%%file testfile.txt
1 2 3
4 5 6
Overwriting testfile.txt
In [16]:
%%timeit
x = 2
11.4 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)
In [4]:
%load_ext rpy2.ipython
# Suppress warnings (most notable from rpy2)
import warnings
warnings.filterwarnings('ignore')
The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
In [5]:
%%R
library(dplyr)
paste("Look I can run R code here")
x <- 2
paste(x)
[1] "2"

Part 3: Bash tips

You might find the following bash tips helpful.

Part 3.1 For loops

You can use for loops to run similar commands over in over again while avoiding some typing. For example, if you want to run the same command on each chromosome, using a for loop is very convenient:

In [8]:
%%bash

DATADIR=/datasets/cs284s-sp20-public/ps2/rfmix/

for chrom in $(seq 1 22)
do
    ls -ltrh ${DATADIR}/ps2_rfmix_chr${chrom}.map
done
-rw-rw-r-- 1 77076 1645 99M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr1.map
-rw-rw-r-- 1 77076 1645 108M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr2.map
-rw-rw-r-- 1 77076 1645 90M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr3.map
-rw-rw-r-- 1 77076 1645 91M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr4.map
-rw-rw-r-- 1 77076 1645 81M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr5.map
-rw-rw-r-- 1 77076 1645 79M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr6.map
-rw-rw-r-- 1 77076 1645 73M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr7.map
-rw-rw-r-- 1 77076 1645 70M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr8.map
-rw-rw-r-- 1 77076 1645 54M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr9.map
-rw-rw-r-- 1 77076 1645 62M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr10.map
-rw-rw-r-- 1 77076 1645 62M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr11.map
-rw-rw-r-- 1 77076 1645 60M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr12.map
-rw-rw-r-- 1 77076 1645 45M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr13.map
-rw-rw-r-- 1 77076 1645 41M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr14.map
-rw-rw-r-- 1 77076 1645 37M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr15.map
-rw-rw-r-- 1 77076 1645 40M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr16.map
-rw-rw-r-- 1 77076 1645 35M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr17.map
-rw-rw-r-- 1 77076 1645 35M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr18.map
-rw-rw-r-- 1 77076 1645 29M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr19.map
-rw-rw-r-- 1 77076 1645 28M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr20.map
-rw-rw-r-- 1 77076 1645 18M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr21.map
-rw-rw-r-- 1 77076 1645 17M Mar 25 16:05 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr22.map

Part 3.2 Running things in the background with nohup

Some times a command will take a long time, and you'd like to close your computer while it's running and come back to it later. You can use the nohup command for this. The syntax is:

nohup <mycommand> &

This will run your command and output what would normally get printed to the screen to nohup.out. Sometimes if you use > in your command it can cause problems with where nohup writes the output. The safest way to avoid issues with this are either:

  • Make a string variable with your command, then run with sh -c:

    mycommand="echo 1 2 3 > test.tx"
    nohup sh -c "$mycommand" &
    
  • Alternatively, put all your commands in a bash script e.g. myscript.sh, with a header line #!/bin/bash at the top, then run the script with nohup:

    chmod +x myscript.sh
    nohup ./myscript.sh &
    

(the example is for a bash script, you can do the same with a Python, perl, etc. script).

Part 3.3 Parallelizing jobs with xargs

Sometimes you have a lot of jobs to run (like one per chromosome) and would like to parallelize them and run them in the background with nohup. But maybe you don't want to run them all at once due to CPU/RAM constraints. e.g. if you are running imputation and it takes 10GB of memory and your server has only 50GB of RAM, you should probably only run ~2-3 jobs at once. xargs is a versatile tool that does lots of things, but one of the most useful is the ability to run a set number of commands at once. Below gives an example. Note:

  • -n 1 tells xargs to run on one line at a time
  • -P 5 tells xargs to only use at most 5 processors (run at most 5 jobs at a time)
  • -I % tells it to replace % in the command run with each line from standard input.

The command is a little cryptic (it is the only xargs command I ever use...). But I find it super helpful. You can put the cell below in a script myscript.sh, then run nohup ./myscript.sh & to run in the background, making sure you only run up to 5 jobs at a time.

In [10]:
%%bash

DATADIR=/datasets/cs284s-sp20-public/ps2/rfmix/
for chrom in $(seq 1 22)
do
    cmd="ls -ltrh ${DATADIR}/ps2_rfmix_chr${chrom}.map"
    echo $cmd
done | xargs -n1 -I% -P5 sh -c "%"
-rw-rw-r-- 1 77076 1645 99M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr1.map
-rw-rw-r-- 1 77076 1645 108M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr2.map
-rw-rw-r-- 1 77076 1645 90M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr3.map
-rw-rw-r-- 1 77076 1645 91M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr4.map
-rw-rw-r-- 1 77076 1645 81M Mar 25 16:06 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr5.map
-rw-rw-r-- 1 77076 1645 79M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr6.map
-rw-rw-r-- 1 77076 1645 73M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr7.map
-rw-rw-r-- 1 77076 1645 70M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr8.map
-rw-rw-r-- 1 77076 1645 54M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr9.map
-rw-rw-r-- 1 77076 1645 62M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr10.map
-rw-rw-r-- 1 77076 1645 62M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr11.map
-rw-rw-r-- 1 77076 1645 60M Mar 25 16:07 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr12.map
-rw-rw-r-- 1 77076 1645 45M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr13.map
-rw-rw-r-- 1 77076 1645 41M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr14.map
-rw-rw-r-- 1 77076 1645 37M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr15.map
-rw-rw-r-- 1 77076 1645 40M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr16.map
-rw-rw-r-- 1 77076 1645 35M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr17.map
-rw-rw-r-- 1 77076 1645 35M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr18.map
-rw-rw-r-- 1 77076 1645 29M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr19.map
-rw-rw-r-- 1 77076 1645 28M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr20.map
-rw-rw-r-- 1 77076 1645 18M Mar 25 16:08 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr21.map
-rw-rw-r-- 1 77076 1645 17M Mar 25 16:05 /datasets/cs284s-sp20-public/ps2/rfmix//ps2_rfmix_chr22.map