Big Data Analytics for Viral Ecology

"Big data" is pervasive in biology and can yield new insights into interconnected biological processes. This is particularly true for research in viral ecology, where large-scale metagenomic datasets are uncovering the extensive genetic diversity of viruses and their role in host-driven nutrient and energy cycles in aquatic systems. Yet, despite innovations in sequencing technology, bottlenecks remain in analyzing these massive and highly contextualized datasets. Specifically, ecosystem-wide analyses require the harmonization, integration, and analysis of multiple types of biological data, such as genes, protein functions, pathways, and environmental or host-related factors. Here we describe a strategy to perform massive comparative metagenomic sequence analysis using the Hadoop big data architecture, and to interconnect these data with biological annotations stored in a scalable Neo4j graph database for functional, taxonomic, and ecosystem-level analyses. We demonstrate the utility of our toolkit using a large-scale viral metagenomics dataset from the TARA Oceans Expedition. This work represents a first step in storing, comparing, and querying massive metagenomic datasets using scalable big data architectures, toward understanding viruses and their impact on host processes in the ocean.