A few months ago I sent a DNA sample for Whole Genome Sequencing offered by Dante Labs.
My goal is to extract information from the genome and use it for personalized medicine, prevention, training, nutrition and more. The report generated by Dante Labs's partner sequencing.com looks long (163 pages), but 140 pages of it are a hard-to-read table of SNP ids and possible conditions. The export is a PDF, so this data is not machine-readable.
Luckily, Sequencing.com provides the same data in gzipped VCF format, which allowed me to start building what I wanted.
VCF (Variant Call Format) is just a text format with tab-separated values. The ID column is supposed to contain the SNP id, but unfortunately it was not provided in the export.
I spent a few days searching for a tool to convert a chromosome position to a SNP id, without success. The mapping is quite simple but requires every possible position in the genome mapped to a SNP id.
This data is provided by The National Center for Biotechnology Information, but what I need is a small part of 160 GB of bzipped JSON files. Since I had no other ideas, I downloaded the information from their FTP ftp://[email protected]/snp/.redesign/.archive/b152/JSON.
I have a 256 GB HDD, so extracting the files was not an option. Here the power of Unix comes in handy. The Unix philosophy suggests that programs should do one thing and do it well, and that they should handle text streams so programs can work together.
I started with bzcat, which can not only extract the content of an archive but also stream the content instead of saving it to a new file. The following command takes the first line of the archive and prints it to the terminal:
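A sketch of it, assuming the chromosome 1 archive keeps the refsnp-chr1.json.bz2 name from NCBI's dump:

```sh
# Decompress to stdout and stop after the first line (one JSON object per line)
bzcat refsnp-chr1.json.bz2 | head -n 1
```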
The JSON object is quite big, so I won't paste it here, but I need a few things from it:
- The SNP id, which is in the refsnp_id attribute;
- The position where this SNP can be. One SNP can start at multiple positions, or at least the file sometimes contains two positions for the same SNP, in the hgvs attribute, in the format "NC_000009.12:g.119661416A=", where the number between g. and the = is the position in the chromosome.
To extract the positions I wrote a Python script:
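A minimal sketch of it (it pulls the positions out of the raw line with a regex rather than walking the full JSON structure, and prints "position,snp_id" CSV lines):

```python
import json
import re
import sys

# Positions show up in HGVS notations such as "NC_000009.12:g.119661416A=",
# where the number between "g." and the "=" is the position in the chromosome.
POSITION_RE = re.compile(r'NC_\d+\.\d+:g\.(\d+)')

for line in sys.stdin:
    record = json.loads(line)
    snp_id = record['refsnp_id']
    # The same SNP can be listed at more than one position,
    # so collect every distinct position mentioned in the record.
    for position in sorted(set(POSITION_RE.findall(line))):
        print('%s,%s' % (position, snp_id))
```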
I named the file extract_position_and_snp.py and now I can execute:
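Roughly, with the same archive name as above:

```sh
# Test the script on the first record only
bzcat refsnp-chr1.json.bz2 | head -n 1 | python extract_position_and_snp.py
```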
Now I get the output I am looking for. The reason I wanted a comma-separated file is that I am using PostgreSQL for now, and the psql terminal command supports loading data from CSV, which is way faster than importing through SQL statements.
The next step is to generate the files for import. Unix supports redirecting the output of a command to a file, so now I can just remove the head command and stream the entire archive line by line through the Python script, saving the output to a file called 1_pos_snp.csv, where 1 is the chromosome.
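With the same filenames as above, the pipeline becomes:

```sh
# Stream the whole archive through the script and redirect to a CSV file
bzcat refsnp-chr1.json.bz2 | python extract_position_and_snp.py > 1_pos_snp.csv
```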
The script takes quite a lot of time, but I spread the execution across two machines and a day later I had the files for all 24 chromosomes.
The CSV files are between 1.1 and 2.4 gigabytes; the one for the first chromosome has 106 708 907 lines. The line count can be checked with another useful Unix command, wc -l ([w]ord [c]ount with the [l]ines flag). I was checking on the script's execution from time to time, just to be sure something was happening, by looking at how many lines had been saved to the file.
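For example:

```sh
# Print the number of lines written to the CSV so far
wc -l 1_pos_snp.csv
```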
Now that I have the 24 files for the 24 chromosomes, it's time to import them into the database. I created a table for each chromosome. I am not sure if this helps performance, but at least I can have one column fewer, since the chromosome is known from the name of the table instead of from a column.
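The tables look something like this (the chr_1_pos_snp naming is my placeholder, reused below):

```sql
-- One table per chromosome; here chromosome 1 (placeholder name)
CREATE TABLE chr_1_pos_snp (
    position bigint,
    snp_id   bigint
);
```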
Instead of changing the table name manually in the copy command for every file, I wrote a script that lets me just pass the name of the .csv file and extracts the chromosome from the filename.
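A sketch of such a script, assuming the table naming above and a database called genome (both placeholders):

```sh
#!/bin/sh
# Usage: ./import_pos_snp.sh 1_pos_snp.csv
FILE="$1"
# The chromosome is the part of the filename before the first underscore
CHROMOSOME=$(basename "$FILE" | cut -d_ -f1)
psql genome -c "\copy chr_${CHROMOSOME}_pos_snp (position, snp_id) FROM '$FILE' WITH (FORMAT csv)"
```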
The import is quite fast. It took more time to transfer the files from my computer to the remote machine than to import them.
For my genome I need to look up a position and get the snp_id, so I added an index on the position column of every table:
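With the placeholder names from above:

```sql
-- Repeated for each of the 24 tables
CREATE INDEX chr_1_pos_snp_position_idx ON chr_1_pos_snp (position);
```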
This takes a while, but this time will be saved when I run the update query.
After the mapping data is ready, I can import my genome into the same database. First, I again use Unix commands and a Python script to get what I need from the VCF file.
The only additional logic in the conversion to comma-separated values is reading the genotype. The VCF specification says that the genotype is encoded in the last column with 0s and 1s, where 0 refers to the reference allele and 1 to the alternate. In my case, for the first line in my results, I have G/G at position 13116.
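A sketch of the genotype decoding (assuming the GT value is the first field of the last column and that every row has a called genotype):

```python
import re
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    chromosome, position, ref, alt = fields[0], fields[1], fields[3], fields[4]
    # The genotype, e.g. "1/1" or "0|1": 0 points to the REF allele
    # and 1, 2, ... to the comma-separated ALT alleles.
    gt = fields[-1].split(':')[0]
    alleles = [ref] + alt.split(',')
    genotype = '/'.join(alleles[int(i)] for i in re.split(r'[/|]', gt))
    print('%s,%s,%s' % (chromosome, position, genotype))
```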
The processing is another example of the power of Unix. The VCF file starts with around 100 lines of comments, which are filtered out with grep:
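Assuming the script above is saved as extract_genome.py and the export is called genome.vcf.gz (both placeholder names):

```sh
zcat genome.vcf.gz | grep -v '^#' | python extract_genome.py > my_genome.csv
```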
Importing the genome is done in the same way as importing the mapping above. I created a table in the database:
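Something like this (the snp_id column stays empty for now and is filled in by the update below):

```sql
CREATE TABLE my_genome (
    chromosome text,
    position   bigint,
    genotype   text,
    snp_id     bigint
);
```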
The import is quite simple and way faster since I have just around 3.2 million rows in my genome.
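With the same \copy approach as before, for example:

```sh
psql genome -c "\copy my_genome (chromosome, position, genotype) FROM 'my_genome.csv' WITH (FORMAT csv)"
```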
Now I just need to update the SNP ids in the my_genome table:
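For chromosome 1 it looks roughly like this (again with my placeholder table names):

```sql
UPDATE my_genome
SET snp_id = (
    SELECT snp_id
    FROM chr_1_pos_snp
    WHERE chr_1_pos_snp.position = my_genome.position
    LIMIT 1
)
-- The chromosome filter avoids matching positions from other chromosomes
WHERE my_genome.chromosome = '1';
```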
Note that there can be SNP ids at the same position in different chromosomes, so you need to filter by the chromosome column of the my_genome table. The need for the limit 1 is explained at the end.
After doing this for all the chromosomes, I have just 5945 rows not mapped to a SNP id.
Great! Now I can query my SNPs and see the genotype:
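For example (the SNP id here is just an arbitrary one for illustration):

```sql
SELECT chromosome, position, genotype
FROM my_genome
WHERE snp_id = 53576;
```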
I checked a few SNP ids with the data viewer app on sequencing.com and they look correct.
There is just one problem. One SNP can be in more than one position. I decided to check if I have such data in the database.
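A duplicate check along these lines:

```sql
-- SNP ids assigned to more than one row; NULLs are the unmapped rows
SELECT snp_id, count(*)
FROM my_genome
WHERE snp_id IS NOT NULL
GROUP BY snp_id
HAVING count(*) > 1;
```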
It turns out I have 2747 duplicated SNP ids. Checking the data viewer app shows that I do not always have the same SNP as Sequencing.com. So the question is: how to choose the right SNP when there can be two SNPs at the same location?
I will cover this in the next blog post.