Maximizing the Impact of Genomic Data Sharing: Benefits, Best Practices, and Applications in Phage Genomics
As sequencing prices drop, new genomics techniques emerge, and bioinformatics pipelines become more sophisticated, the amount of genomics data generated grows astronomically. Microbiologists, immunologists, and scientists from many other disciplines are increasingly dabbling in (or diving head first into) genomics, where storing and analyzing “big data” is a critical skill.
Since the birth of genomics, data sharing, open access, and collaboration have been major pillars of the research community. As in any research field, the unrestricted flow of information increases the overall pace of research, leading to insights that would be impossible if individual research groups kept their data isolated in a bubble.
However, there are still many barriers to the widespread sharing and distribution of genomics data. While genomic data management standards from funding agencies and journals have become commonplace and emphasize sharing, practical and cultural hurdles continue to hamper the widespread implementation of an open genomics data model.1
Here, we discuss some of these challenges, the benefits of data sharing, best practices for increasing the impact of shared genomic data, and how these apply to phage genomics, a growing field with unique therapeutic applications.
The Benefits and Drawbacks of Genomic Data Sharing
One of the fundamental ways science moves forward is by building upon the vast amount of data previously published. Genomics has fostered a culture of sharing, enabling researchers like you to “stand on the shoulders of giants” rather than start from scratch.
Here are some additional benefits of genomic data sharing:2
● Higher reproducibility: Reproducibility is critical for ensuring the validity of, and trust in, novel discoveries. When raw sequencing data and bioinformatics workflows are shared, other researchers can reproduce and replicate results, building a solid foundation for future research questions.
● Greater statistical power: With open data access, future researchers can pool available data with newer datasets, increasing sample size and statistical power.
● More efficient use of funds: Open access to genomics data can increase visibility and awareness about studies and prevent distinct research groups from rehashing the same research avenues. Rather than competing, groups can use their funding to collaborate or investigate unique questions.
● Increased impact and visibility: Studies that publish their data in accessible repositories are often cited more than those with inaccessible data.3 Furthermore, open genomics data can be linked with other datasets and data types (e.g., in multi-omics studies), providing unique insights and a greater understanding of biology.
There are also some drawbacks to being open with genomics data. Practically, researchers may not know how or where to share their genomics data. Scientists may be concerned about getting scooped on future publications, misuse of their data, or loss of priority on patent filings. There is also a competitive culture in the sciences, particularly in academia, where competition for grants and high-impact publications can be fierce.
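To put a rough number on the "greater statistical power" point in the list above, here is a minimal sketch of how the standard error of a mean shrinks as sample size grows. It assumes, for illustration only, that the pooled samples come from the same population with the same standard deviation; the sample counts and the standard deviation are made-up values, and real pooled datasets usually need batch-effect correction first.

```python
import math


def standard_error(std_dev: float, n_samples: int) -> float:
    """Standard error of the mean for n independent samples."""
    return std_dev / math.sqrt(n_samples)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
own_samples = 25      # samples sequenced by one lab
public_samples = 75   # comparable samples pulled from a public repository
std_dev = 2.0         # assumed identical in both datasets

se_own = standard_error(std_dev, own_samples)
se_pooled = standard_error(std_dev, own_samples + public_samples)

print(f"Own data only: SE = {se_own:.2f}")
print(f"Pooled data:   SE = {se_pooled:.2f}")
```

Under these assumptions, quadrupling the sample size halves the standard error, which is why pooled open data can reveal effects that a single dataset cannot.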
Best Practices for Data Sharing in Genomics
Despite reluctance to share genomics data openly, the increasing popularity of preprint publications and reinforcement from funding agencies and journals are boosting engagement in genomics data sharing.
To help ensure that this data reaches its full potential, below are some best practices and guidelines for sharing genomic data, which should be particularly useful for those researchers dipping their toes into genomics for the first time.
We are enthusiastic about phage genomics at Rime Bioinformatics, so we’ve provided some more focused context about analyzing phage genomes.
Publish data in an easily accessible specialized or general data repository
All too often, research groups will publish results without making the accompanying raw genomic data available.4 To remedy this and tap into the benefits listed above, publish all raw genomics data in one of the many large public genomic data repositories, including the DNA Data Bank of Japan, the European Molecular Biology Laboratory-European Bioinformatics Institute, or the National Center for Biotechnology Information. In doing so, you make your data accessible to any other researcher for further analysis and validation.
Submitting your data to these large databases also helps ensure that the data is accurate, high-quality, suitably annotated, and accompanied by the appropriate metadata. It also helps preserve the information long-term, as most data repositories have procedures for backing up and securing data against theft or loss.
For certain niche research areas, you may have smaller, specialized databases to which you can submit sequence information. In the field of phage genomics, there are several specialized databases such as SEA-PHAGES – focused on phage discovery and driven by undergraduate researchers – and PhagesDB: “a comprehensive, interactive, database-backed website that collects and shares information related to the discovery, characterization, and genomics of viruses that infect Actinobacterial hosts.”5
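Once data is deposited, other researchers can retrieve it programmatically. The sketch below shows one way to do that with the NCBI E-utilities `efetch` endpoint, using only the Python standard library. The helper names are our own, and the accession used (NC_001416, the RefSeq record for phage lambda) is just an example; substitute the accession for your own record.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession: str) -> str:
    """Build an efetch URL that returns a nucleotide record as FASTA text."""
    params = {
        "db": "nuccore",      # NCBI nucleotide database
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta(accession: str) -> str:
    """Download the FASTA record (requires network access)."""
    with urlopen(build_fasta_url(accession), timeout=30) as response:
        return response.read().decode("utf-8")


# Example accession only; nothing is downloaded until fetch_fasta() is called.
example_url = build_fasta_url("NC_001416")
print(example_url)
```

Calling `fetch_fasta("NC_001416")` would return the sequence as a FASTA string. For heavy use, NCBI's usage guidelines on request rates and API keys apply.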
Make data generation and analysis reproducible and replicable
Reproducibility (which, in genomics, means getting consistent results using the same input data, computational workflows, methods, code, and conditions of analysis) and replicability (i.e., getting consistent results across studies that aim to answer the same question using distinct data sets) are pillars of confidence in scientific knowledge for scientists and non-scientists alike.6 Therefore, supporting both is of the utmost importance.
To support reproducibility and replicability in genomics, sharing the computational methods used to produce derived data (i.e., new data created by combining and processing existing raw data) is just as important as sharing the raw data itself (as described in the section above).
You can do this by:4
- Reporting input data, parameters, and tools/applications (including version information) for each computational step
- Publishing source code for custom scripts, tools, applications, and complete bioinformatics pipelines in a public code repository, such as GitHub, or in a notebook, such as Jupyter
- Publishing complete computational workflows in a public workflow management system, along with applicable documentation, so a user can understand, modify, or troubleshoot the workflow
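As an illustration of the first point in the list above, here is a minimal sketch of a per-step provenance record: it captures the tool, its version, the parameters, and a checksum of each input file, then serializes everything to JSON so it can be published alongside the results. The function names, the tool name "example-assembler", its version, and the parameter are all hypothetical placeholders, not part of any real pipeline.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Checksum a file in chunks so large sequencing files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(step, tool, tool_version, parameters, input_paths):
    """Describe one computational step so others can rerun it exactly."""
    return {
        "step": step,
        "tool": tool,
        "tool_version": tool_version,
        "parameters": parameters,
        "inputs": [
            {"file": Path(p).name, "sha256": sha256_of_file(Path(p))}
            for p in input_paths
        ],
        "python_version": platform.python_version(),
    }


# Demo with a tiny throwaway FASTQ file; all values are placeholders.
with tempfile.TemporaryDirectory() as tmp_dir:
    reads = Path(tmp_dir) / "reads.fastq"
    reads.write_bytes(b"@read1\nACGT\n+\nIIII\n")
    record = provenance_record(
        step="assembly",
        tool="example-assembler",   # hypothetical tool name
        tool_version="1.2.3",       # hypothetical version
        parameters={"kmer_size": 31},
        input_paths=[reads],
    )

print(json.dumps(record, indent=2))
```

Writing one such record per step, and committing the records with the code, gives readers the inputs, parameters, and versions they need to rerun the analysis. Workflow managers can capture much of this automatically.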
In phage genomics, workflows such as those published by Turner et al. and Shen and Millard provide comprehensive guidance on open-access, well-validated tools for phage genome assembly and annotation.7,8 Recently, the Center for Phage Technology published a user-friendly phage genome annotation tool that runs on the Galaxy and Apollo platforms and uses a powerful suite of exclusively open-source tools.9 There are also a handful of standalone tools, such as PHANOTATE, that focus specifically on annotating phage genomes.10
Follow published standards from funding agencies and journals
To extract the maximum benefit from the whopping amount of genomics data being generated, maintained, used, and archived, researchers should follow clearly defined standards for “good data management.”11
Most journals, funding agencies, and the researchers they serve have agreed upon the FAIR Principles, which ensure the findability, accessibility, interoperability, and reuse of data assets.11,12 These principles place many published best practices, including the two points made above about accessibility and reuse (the “A” and “R” in the “FAIR” acronym), into a framework that all researchers can easily follow and reference.
In brief, the FAIR data principles follow a stepwise process of:
- Ensuring all relevant data and metadata is easy for humans and computers to search for and find
- Providing access to the required data and metadata, with authentication or authorization where necessary
- Enabling the use of the data through interoperation with apps and workflows that will allow analysis, storage, or processing
- Facilitating reuse by others through the comprehensive description of metadata and data
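One practical way to apply the four steps above is to check a dataset's metadata before submission. The sketch below is a loose illustration, not an official FAIR validator: the field names and their grouping under each principle are our own assumptions, and real repositories define their own required fields.

```python
# Hypothetical metadata fields, grouped by the FAIR principle they support.
REQUIRED_FIELDS = {
    "findable": ["identifier", "title"],
    "accessible": ["access_url"],
    "interoperable": ["file_format"],
    "reusable": ["license", "methods_description"],
}


def missing_fair_fields(metadata: dict) -> dict:
    """Return the empty or absent fields, grouped by FAIR principle."""
    missing = {}
    for principle, fields in REQUIRED_FIELDS.items():
        absent = [
            field
            for field in fields
            if metadata.get(field) is None or not str(metadata[field]).strip()
        ]
        if absent:
            missing[principle] = absent
    return missing


# Example record with placeholder values; reuse information is left out.
example_metadata = {
    "identifier": "example-accession-0001",
    "title": "Example phage genome assembly",
    "access_url": "https://example.org/datasets/0001",
    "file_format": "FASTA",
}

print(missing_fair_fields(example_metadata))
```

Here the check reports that the license and methods description are missing, the kind of gap that makes otherwise accessible data hard to reuse.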
A Final Note on Phage Genomics
Genomic data sharing is an essential part of scientific research. It is especially important in phage research, where data sharing is vital to discovering and developing therapeutic applications in various industries.
Difficulties in phage genome assembly and annotation
Despite the sophisticated workflows and phage-specific automated tools reported, assembling and annotating phage genomes is far from a perfect computational process. Errors in genome assembly can occur, leading to downstream misannotations of the genome.7 In addition, auto-annotation tools (like PHANOTATE and others) can frequently miss start codons and annotate coding regions as non-coding or vice versa.13 Phage genome repositories are rife with these types of mistakes, which can perpetuate legacy annotation errors in newly sequenced phage genomes.13
To deal with these challenges, researchers depend on the manual annotation of phage genomes, which can be a feasible approach for smaller sequences. However, for large genomes this process can become cumbersome, labor-intensive, and time-consuming, and it often relies on subjective criteria.
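To see why start codons are easy to miss, consider that a single open reading frame often contains several plausible start codons upstream of the same stop codon. The sketch below is a deliberately naive scan of one forward reading frame, not how PHANOTATE or any other annotation tool works. It assumes the common bacterial and phage start codons ATG, GTG, and TTG, and the toy sequence is invented for illustration.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # common bacterial/phage start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}


def candidate_starts(sequence: str, frame: int = 0):
    """List (stop_position, [start_positions]) for one forward reading frame.

    Positions are 0-based offsets of the first base of each codon.
    """
    seq = sequence.upper()
    results = []
    starts = []  # in-frame start codons seen since the previous stop codon
    for pos in range(frame, len(seq) - 2, 3):
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            if starts:
                results.append((pos, starts))
            starts = []
        elif codon in START_CODONS:
            starts.append(pos)
    return results


# Toy sequence: ATG AAA GTG CCC TTG GGG TAA
toy_sequence = "ATGAAAGTGCCCTTGGGGTAA"
print(candidate_starts(toy_sequence))  # [(18, [0, 6, 12])]
```

Even in this 21-base example, three different start positions end at the same stop codon, and nothing in the sequence alone says which one is correct. Automated tools resolve this with scoring heuristics, which is where misannotations can creep in and why manual review is still common.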
High-performance phage genomics with Rime Bioinformatics
At Rime Bioinformatics, we specialize in phage genome assembly and annotation and have helped our clients solve challenging bioinformatics issues. We actively work with phage leaders worldwide, in both the private and public sectors. With customers from academia, we adhere to strict academic standards for genome formatting and facilitate file preparation before database publication.
If you’re new to phage genomics or struggling with building a bioinformatics pipeline, genome assembly and annotation, or database submission, we can help you get the most out of your sequencing data.
Get in touch with us today and tell us about your project!
References
1. Data Management. NIH website. Accessed January 4, 2023.
2. Byrd JB, Greene AC, Prasad DV, Jiang X, Greene CS. Responsible, practical genomic data sharing that accelerates research. Nat Rev Genet. 2020;21(10):615-629.
3. Piwowar HA, Vision TJ. Data reuse and the open data citation advantage. PeerJ. 2013;1:e175.
4. Brown AV, Campbell JD, Assefa T, et al. Ten quick tips for sharing open genomic data. PLoS Comput Biol. 2018;14(12):e1006472.
5. Russell DA, Hatfull GF. PhagesDB: the actinobacteriophage database. Bioinformatics. 2017;33(5):784-786. doi:10.1093/bioinformatics/btw711
6. New Report Examines Reproducibility and Replicability in Science, Recommends Ways to Improve Transparency and Rigor in Research. National Academies website. Published April 7, 2019. Accessed January 6, 2023.
7. Turner D, Adriaenssens EM, Tolstoy I, Kropinski AM. Phage Annotation Guide: Guidelines for Assembly and High-Quality Annotation. Phage (New Rochelle). 2021;2(4):170-182.
8. Shen A, Millard A. Phage Genome Annotation: Where to Begin and End. Phage (New Rochelle). 2021;2(4):183-193.
9. Ramsey J, Rasche H, Maughmer C, et al. Galaxy and Apollo as a biologist-friendly interface for high-quality cooperative phage genome annotation. PLoS Comput Biol. 2020;16(11):e1008214.
10. McNair K, Zhou C, Dinsdale EA, Souza B, Edwards RA. PHANOTATE: a novel approach to gene identification in phage genomes. Bioinformatics. 2019;35(22):4537-4542.
11. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018.
12. FAIR Principles. Go Fair website. Published November 23, 2017. Accessed January 5, 2023.
13. Salisbury A, Tsourkas PK. A Method for Improving the Accuracy and Efficiency of Bacteriophage Genome Annotation. Int J Mol Sci. 2019;20(14):3391.