A few months ago I send a DNA sample for Whole Genome Sequencing offered by Dante Labs.

My goal is to extract information from the genome and use it for personalized medicine, prevention, training, nutrition and more. The report generated by DanteLabs’s partners sequencing.com looks long (163 pages), but 140 pages of this report are hard to read table with SNP id and possible condition. The export is PDF and this data is not machine-readable.

Luckily, Sequencing.com provide the same data in gzipped VCF format which allowed me to start developing what I wanted to do.

VCF (Variant Call Format) is just a text format with tab separated values.

chr1  13116 . T G 479.77  PASS  AC=2;AF=1.00;AN=2;DP=12;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=22.63;QD=34.24;SOR=2.670;VQSLOD=2.23;culprit=QD GT:AD:DP:GQ:PGT:PID:PL  1/1:0,12:12:36:1|1:13116_T_G:508,36,0

The ID column is supposed to be the SNP id, but unfortunately, it was not provided in the export.

I spend a few days searching for a tool to convert the chromosome position to SNP id without success. The mapping is quite simple but requires all the possible positions in the genome mapped to SNP id.

This data is provided by the The National Center for Biotechnology Information but what I need is a small part of 160GB bzipped JSON files. Since I had no other ideas I downloaded the information from their FTP ftp://[email protected]/snp/.redesign/.archive/b152/JSON.

I have 256GB HDD so extraction of the files was not an option. Here the power of Unix comes in hand. Unix Philosophy suggests that programs should do one thing, do it better and handle text streams so the programs can work together.

I started with bzcat which can not only extract the content of an archive, but also stream the content instead of saving it into a new file. The following command will take the first line of the archive and print it to the terminal.

$ bzcat refsnp-chr1.json.bz2 | head -1

The JSON object is quite big so I won’t paste it here, but I need few things from it:

  • The SNP id which is attribute refsnp_id;
  • The position where this SNP can be. One SNP can start in multiple positions or at least the file sometimes contain two positions for the same SNP in the attribute hgvs in format "NC_000009.12:g.119661416A=" where the number between g. and the = is the position in the chromosome.

For extracting the positions I wrote a python script:

import sys, re;

for line in sys.stdin:
  snp_id = re.findall('refsnp_id":"([\d:]+)"', line)
  positions = re.findall('hgvs":"NC_0[\w.]+:g.([\d:]+)', line)
  if positions:
    for position in list(set(positions)):
      print(position+","+ snp_id[0])

I named the file extract_position_and_snp.py and now I can execute:

$ bzcat refsnp-chr1.json.bz2 | head -1 | python3 extract_position_and_snp.py

Now I got the response I am looking for. The reason I wanted comma separated file is that I am using PostgreSQL for now and the psql terminal command supports loading data from CSV which is way faster than importing through SQL.

The next step is to generate files for import. Unix supports redirection of the output from the command to a file so now I can just remove the head command and stream the entire file line by line through the python script and save the output to a file called 1_pos_snp.csv where 1 is the chromosome.

$ bzcat refsnp-chr1.json.bz2| python3 insert.py > 1_pos_snp.csv

The script takes quite a lot of time, but I spread the execution into two machines and a day later I got the files for the all 24 chromosomes.

The CSV files are between 1.1 and 2.4 gigabytes where the first chromosome has 106 708 907 lines. The lines count can be checked with another useful Unix command wc -l ([w]ord [c]ount with parameter [l]ines). I was checking the script execution from time to time just to be sure something is happening by looking at how many lines are saved in the file.

$ wc -l 1_pos_snp.txt
 106708907 1_pos_snp.txt

Now after I have the 24 files for the 24 chromosomes it’s time to import them into the database. I created a table for each chromosome. Not sure if this helps for the performance but at least I can have one column short since the chromosome is known from the name of the table instead of a column.

CREATE TABLE public.refsnp_1 (
  snp_id bigint NOT NULL,
  "position" bigint NOT NULL

Instead of changing the table names manual in the copy command for every file I wrote a script which allowed me to just pass the name of the .csv file and extract the chromosome from the filename.


CHROMOSOME=$(echo $1 | cut -d'_' -f 1)

psql "postgresql://username:[email protected]/database" -c "\COPY refsnp_$CHROMOSOME(position, snp_id) FROM '$1' DELIMITER ',' CSV;"

The import is quite fast. It took more time to transfer the file from my computer to the remote machine than the import.

For my genome I need to map position and get the snp_id. So I added indexes to position column of all the tables:

CREATE INDEX "position_refsnp_1" ON public.refsnp_1 USING btree ("position");

This takes a while, but this time will be saved when I run the update query.

After the data for mapping is ready I can import my genome into the same database. First I will also use the Unix commands and python script to get what I need from the VCF file.

import sys

for line in sys.stdin:
  cols = line.split("\t")

  chromosome = cols[0].replace('chr', '')
  position = cols[1]
  reference = cols[3]
  alternative = cols[4]
  quality = cols[5]
  filter_ = cols[6]
  info = cols[7]
  format_ = cols[8]
  data = cols[9].replace("\n", '')

  my_1 = reference if data[0] == '0' else alternative
  my_2 = reference if data[2] == '0' else alternative

  print("\"{}\",{},\"{}\",\"{}\",\"{}\",\"{}\",{},\"{}\",\"{}\",\"{}\",\"{}\"".format(chromosome, position, reference, alternative, my_1, my_2, quality, filter_, info, format_, data))

The only additional logic for the conversion to comma separated values is to read the genotype. The VCF specification specifies that the positions are represented in the last column with 1 or 0.

The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on

In my case for the first line in my results, I have G/G for position 13116.

chr1  13116 . T G 479.77  PASS  AC=2;AF=1.00;AN=2;DP=12;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=22.63;QD=34.24;SOR=2.670;VQSLOD=2.23;culprit=QD GT:AD:DP:GQ:PGT:PID:PL  1/1:0,12:12:36:1|1:13116_T_G:508,36,0

The processing is another example of the Unix power.

$ cat ~/my_genome.snp.vcf | grep "^[^#]" | python3 import_vcf.py > genome.csv

The VCF file starts with around 100 lines of comments which are filtered out by the grep command.

Importing the genome is done in the same way as importing the mapping above. I created a table in the database:

CREATE TABLE public.my_genome (
  id integer NOT NULL,
  chromosome character varying(2),
  "position" bigint,
  reference_allele character varying(10),
  alternative_allele character varying(20),
  my_1 character varying(1),
  my_2 character varying(1),
  quality real,
  filter character varying(10),
  info character varying,
  format character varying,
  data character varying,
  snp_id bigint

The import is quite simple and way faster since I have just around 3.2 million rows in my genome.


psql "postgresql://user:[email protected]/database" -c "\COPY my_genome(chromosome,position,reference_allele,alternative_allele,my_1,my_2,quality,filter,info,format,data) FROM '$1' DELIMITER ',' CSV;"

Now I just need to update the SNP ids in my_genome table:

update my_genome set snp_id=(select snp_id from refsnp_1 where my_genome."position"=refsnp_1."position" limit 1) where snp_id is null and my_genome.chromosome='1';

Note that there can be SNP ids in the same positions in different chromosomes so you need to specify the chromosome from my_genome table. The need for limit 1 is explained at the end.

Doing this for all the chromosomes I have just 5945 rows not mapped with SNP id.

Great! Not I can query my SNPs and see the genotype:

# select chromosome, position, concat('rs', snp_id) as snp, concat(my_1, my_2) as genotype from my_genome where position=13116;
 chromosome | position |     snp    | genotype
 1          |    13116 | rs62635286 | GG
(1 row)

I checked few SNP ids with the data viewer app in sequencing.com and it looks correct.

There is just one problem. One snp SNP be in more than one position. I decided to check if I have such data in the database.

select snp_id from my_genome group by snp_id having count(snp_id) > 1;

It turns out I have 2747 duplicated SNP ids. Checking the data viewer app turns out that I am not having the same SNP as Sequencing.com. So the question is how to chose the right SNP when there can be two SNPs in the same location?

I will cover this in the next blogpost.