"Big data" is pervasive in biology and can yield new insights into interconnected biological processes. This is particularly true in viral ecology, where large-scale metagenomic datasets are uncovering the extensive genetic diversity of viruses and their role in host-driven nutrient and energy cycles in aquatic systems. Yet, despite innovations in sequencing technology, bottlenecks remain in analyzing these massive and highly contextualized datasets. Specifically, ecosystem-wide analyses require harmonizing, integrating and analyzing multiple types of biological data, such as genes, protein functions, pathways and environmental or host-related factors. Here we describe a strategy to perform massive comparative metagenomic sequence analysis using the Hadoop big data architecture, and to interconnect these data with biological annotations stored in a scalable Neo4j graph database for functional, taxonomic and ecosystem-level analyses. We demonstrate the utility of our toolkit using a large-scale viral metagenomics dataset from the TARA Oceans Expedition. This work represents a first step in storing, comparing, and querying massive metagenomic datasets using scalable big data architectures, toward understanding viruses and their impact on host processes in the ocean.