1 Once upon a time
There was a time when the volume of data was bigger than the
capacity to process it. Back in the last millennium, more
precisely in the last century, and even more precisely some five
decades ago, I turned a storage room adjoining the garage at
the Institute of Animal Breeding at the University of Göttingen
into a computer lab. It housed a Hewlett Packard 1000, a
rack-mounted 16-bit computer with a huge 15MB disk, 16KB RAM (if I
remember correctly) and a FORTRAN 66 compiler. Further, it ran the
IMAGE network-type database system.
By today’s standards, all of that sounds pretty reasonable, were
it not for the obvious typos. Replace the ‘MB’ by ‘TB’, the
‘KB’ by ‘GB’, the ‘66’ by ‘90’, and the ‘network’ database by
‘relational’, and we have an up-to-date computer system as we might
have it now. However, these were not typos; that was what we had
to live with.
We got this system to develop a central database for the nucleus
of the national crossbreeding program in Germany. I had been
taught Fortran IV by Bill Hill in Edinburgh, but I had no experience
in running a computer system and had never done anything with
databases. Thus, embarking on such a project can be viewed as bold
or stupid, or both; take your pick.
It became clear very soon that storage and computing
capacity were limited. Upon selection of each breeding female, a
sow card was printed with 80 IDs to be used for her offspring. Each
ID consisted of 6 digits plus a check digit, which was used for
ID validation on data entry. This setup required keeping track of
which of the IDs allocated to a dam were in fact in use.
The only problem was that there was no disk space to store all
potential piglet IDs. So the solution was to create a bit array
of 80 bits per dam, 1 bit for each ID that had been used to ear
tag an offspring. In this way we could keep track of 32 IDs in
just 4 bytes, as opposed to a single ID stored explicitly in those
32 bits. Well, that was a compression factor of 32 and did solve
the space problem.
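
The bit array idea can be sketched in a few lines of C (not the
FORTRAN 66 of the day). This is only an illustration under stated
assumptions: the type sow_ids, the function names and the numbering
of slots from 0 to 79 are hypothetical and not taken from the
original system.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define IDS_PER_SOW   80                       /* IDs printed on one sow card */
#define BYTES_PER_SOW ((IDS_PER_SOW + 7) / 8)  /* 80 bits fit into 10 bytes   */

/* Hypothetical record: one bit per pre-allocated piglet ID of a dam. */
typedef struct {
    uint8_t used[BYTES_PER_SOW];
} sow_ids;

/* Start with no ID in use. */
static void sow_ids_clear(sow_ids *s)
{
    memset(s->used, 0, sizeof s->used);
}

/* Record that the ID in the given slot (0..79) was used for an ear tag. */
static void sow_ids_mark_used(sow_ids *s, unsigned slot)
{
    assert(slot < IDS_PER_SOW);
    s->used[slot / 8] |= (uint8_t)(1u << (slot % 8));
}

/* Return 1 if the ID in the given slot is in use, 0 otherwise. */
static int sow_ids_is_used(const sow_ids *s, unsigned slot)
{
    assert(slot < IDS_PER_SOW);
    return (s->used[slot / 8] >> (slot % 8)) & 1u;
}

/* Count how many of the 80 IDs of this dam are in use. */
static unsigned sow_ids_count_used(const sow_ids *s)
{
    unsigned n = 0;
    for (unsigned slot = 0; slot < IDS_PER_SOW; slot++)
        n += (unsigned)sow_ids_is_used(s, slot);
    return n;
}
```

With this layout the usage information for all 80 IDs of a dam
occupies 10 bytes, whatever the number of piglets actually tagged.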
Why this story from the dark ages of computing in these glorious
times of computational abundance? You will likely guess what I am
getting at: it’s SNP data.
In a meeting some time ago someone told me about his SNP
genotype database, which was acting up after hours of export and
occupied terabytes of disk space for a meagre
100,000 samples. That got me thinking, taking me those 50 years
back, and now we have TheSNPpit :).
2 Back to the future
50 years later we apparently had the same problem: too much data
given the computational and storage capacity. Not
surprisingly, the bit vector setup came back to mind in a
similarly simple form: for a given SNP panel each genotyping
produces the same number of SNPs, basically creating a matrix with
SNPs as columns and one row of genotypes for each genotyped
animal.
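
To make the matrix picture concrete, here is a minimal sketch of one
animal's row packed as a bit vector. It assumes 2 bits per genotype
with the codes 0, 1, 2 and 3 for missing; this coding and all names
are assumptions made for illustration and are not necessarily the
storage format TheSNPpit actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed genotype codes, 2 bits each (not confirmed by the source). */
enum { GT_HOM_A = 0, GT_HET = 1, GT_HOM_B = 2, GT_MISSING = 3 };

/* Bytes needed for one animal's row of n_snps genotypes. */
static size_t row_bytes(size_t n_snps)
{
    return (n_snps + 3) / 4;   /* 4 genotypes per byte, rounded up */
}

/* Store genotype gt (0..3) at column snp of a packed row. */
static void set_genotype(uint8_t *row, size_t snp, unsigned gt)
{
    assert(gt <= 3);
    unsigned shift = (unsigned)(snp % 4) * 2;
    /* clear the two bits of this column, then write the new code */
    row[snp / 4] = (uint8_t)((row[snp / 4] & ~(3u << shift)) | (gt << shift));
}

/* Read the genotype code stored at column snp of a packed row. */
static unsigned get_genotype(const uint8_t *row, size_t snp)
{
    unsigned shift = (unsigned)(snp % 4) * 2;
    return (row[snp / 4] >> shift) & 3u;
}
```

Under these assumptions a row for a hypothetical panel of 50,000
SNPs takes 12,500 bytes, and because every row of a panel has the
same length, the row of any animal can be located by a simple
offset calculation.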
Needless to say, TheSNPpit as it is now had a long way
to go, starting with the idea, moving to a proof of concept (PoC)
implementation written in Perl and on to the final implementation
in C. The PoC implementation already sped up SNP data extraction
to a rate of 1 million/sec. That was much faster than the
normalized SQL data structure currently often used in relational
databases. In particular, data volume was no longer an issue.
However, looking back, it was clear to me that exports from the
bitstring-based database to a file had to go much faster than the
Perl code managed. The only problem was the language.
3 Back to the future 2
As mentioned above, Fortran was the programming language I learned
and have used for my software packages like VCE and PEST. The
only problem with Fortran is its unsatisfactory database support.
After having played with ODBC for a while I had to admit that no
stable application could be written that way. The choice was: use
the slow Perl code or switch to C, the latter having been around
for nearly as long as FORTRAN. The only problem was: I had to
learn it from scratch.
Well, this is now TheSNPpit setup: an efficient wrapper with
great database access in Perl, and the time-critical import and
export written in the good old C language. This combo does indeed
live up to expectations: it is about 100 times faster than the
Perl code, often spitting out SNP data at the speed of the hard
disk.
4 Open Source Future
TheSNPpit has been released under the Open Source model. Whenever
something is used regularly, there is always the issue of
software and user support and further development. Therefore,
during the second SNPpit workshop in Mariensee in 2018
TheSNPpitGroup was set up with its terms of reference listed in
the Memorandum of Understanding, which can be found here.
Because TheSNPpit supports direct database access, users
can write their own software and make it available to the SNPpit
community. A first example is the Parentage Testing package
written by Hermann Schwarzenbacher. A short description of
Parentage Testing is available in the section on
contributed software.
Eildert Groeneveld