Training in High-Throughput Sequencing: Common Guidelines to Enable Material Sharing, Dissemination, and Reusability

The advancement of high-throughput sequencing (HTS) technologies and the rapid development of numerous analysis algorithms and pipelines in this field have resulted in an unprecedentedly high demand for training scientists in HTS data analysis. Embarking on developing new training materials is challenging for many reasons. Trainers often do not have prior experience in preparing or delivering such materials and struggle to keep them up to date. A repository of curated HTS training materials would support trainers in materials preparation, reduce the duplication of effort by increasing the usage of existing materials, and allow for the sharing of teaching experience among the HTS trainers’ community. To achieve this, we have developed a strategy for materials’ curation and dissemination. Standards for describing training materials have been proposed and applied to the curation of existing materials. A Git repository has been set up for sharing annotated materials that can now be reused, modified, or incorporated into new courses. Because this repository uses Git, it is decentralized and self-managed by the community and can be forked and built upon by all users. The repository is accessible at http://bioinformatics.upsc.se/htmr.


Introduction
The advent of high-throughput sequencing (HTS) has revolutionized biological and biomedical research [1], allowing researchers to generate an overwhelming amount of genome-wide data. Now that sequencing is accessible to most, the major bottleneck has shifted from HTS data generation to analysis and interpretation. These remain challenging tasks due to the complexity of the analytical pipelines, the extensive list of available tools, and the evolving nature of the field. To support the need of researchers to carry out their data analysis, institutions worldwide offer highly specialized training on HTS data analysis (Table 1). However, despite the increase in the number of training solutions, the demand still largely exceeds what is currently offered.
Typically, the training offered by most institutions consists of short (2-6 days), intense courses, often focusing on a particular HTS pipeline, e.g., RNA-Seq, ChIP-Seq, whole genome sequencing, or variant analysis. During training, instructors and course organizers aim to provide a well-balanced mixture of lectures, which cover the data generation steps and illustrate the theory behind the analysis, and practical sessions, in which trainees can practice running HTS pipelines on real datasets [2]. Post-course surveys have revealed that participants regard the practical sessions as the most valuable components of a training course, as they represent an opportunity to run complex pipelines under the supervision of highly skilled trainers and to discuss the issues associated with the analysis of such datasets with experts in this field [3]. Years of experience in delivering such courses have taught us that the best-suited trainers are scientists who deal with HTS data analysis on a daily basis. Trainers can therefore include researchers working on HTS projects, computer scientists developing relevant algorithms and software, as well as bioinformaticians providing data analysis support to research groups. Consequently, for most instructors training is not formally part of their job and is done in addition to an already heavy workload.
Generating effective training materials (e.g., lectures and practical exercises) and testing them to ensure the smooth and successful delivery of a training course are time-consuming activities that all trainers need to undertake prior to a training event. In the last few years, a large body of training material on HTS data analysis has been generated; however, such materials are rarely shared among trainers, leaving instructors around the world with the need to constantly reinvent the wheel. Therefore, there is a need to develop mechanisms and best practices to both increase the visibility of complete training materials and enable their reusability, reducing trainers' workload and fostering interactions within the trainers' community.
Several initiatives have been established in recent years to support bioinformatics training and create community resources, most notably ELIXIR [4], a research infrastructure for coordinating biological data across Europe, which collaborates with global efforts such as GOBLET [5,6] to ensure that an adequate provision of training is put in place to reach a large and diversifying user base. Both initiatives are developing training portals (ELIXIR's TeSS [7] and the GOBLET training portal [8,9]) to allow for collation of training materials, increasing their discoverability.
The establishment of such portals is of great importance, but this does not guarantee materials' reusability. The development and production of training materials is usually undertaken by individual trainers for their personal use, often with a particular course or learning context in mind. As there is a sole developer and initial user of the materials, they are often lacking in detailed description or documentation, thereby making it difficult for another trainer to determine what the purpose of the training session was, who the materials were aimed at, and what resources are required to run such a session. Additionally, there is often great variety in the style of training materials presented (due to personal preference) and the level of detail in the content (e.g., PowerPoint slides annotated with notes).
If materials could be described consistently, so that either trainers or trainees understood what they could gain from using these materials, reusability of materials may be easier to achieve. For consistent description to take place, however, some form of guidance or best practice standard is required that allows for the addition of metadata to describe the materials and their use. HTS-related material also has additional issues when sharing is considered: (i) the constant evolution of the technologies that requires frequent materials revision; (ii) the incessant development of analysis tools, which prevents the establishment of standardized analytical procedures and training materials; and (iii) their complexity, as training materials are often linked to large datasets that require dedicated storage.
The ability to easily share materials via online portals is a fairly new phenomenon. Providing guidance or best practice standards to trainers for describing their materials is a step in the right direction for more reusability, but these standards also need to be adopted by those who provide the portals, so that the appropriate and correct information is displayed.
This raises the following questions: What can trainers do to improve the sharing and reusability of their materials? What best practice can they adopt to enable the effective delivery of materials developed by another trainer?
To tackle these issues, on 13-14 January 2015, a workshop on "Best practices in next-generation sequencing data analysis" took place at the University of Cambridge, bringing together 29 trainers in the field of HTS data analysis, representing seven ELIXIR nodes and ten GOBLET partners (see S1 Table), with the aim to: (i) meet and discuss issues associated with the reusability of training materials, (ii) define a collective strategy to tackle such issues and identify an approach to the curation of training materials to enable their reusability among trainers, and (iii) implement this curation strategy by creating a unified collection of consistently described and well-annotated training materials.
This article provides a summary of the issues that were discussed during the workshop and presents the workshop's major outcomes: the best practices guidelines that have been adopted for the curation of HTS training materials and a curated repository of materials that is now accessible to the entire training community through Git and is discoverable via the TeSS and GOBLET portals.

Material sharing and reusability
Developing a strategy for sharing training materials is a key step towards reusability. The first step of this approach is the identification of training materials that should be shared and in which modality.
Materials that most trainers want to share include, but are not limited to, presentations, hands-on practicals, and datasets. Presentations are rarely reused as they are. Typically, trainers use this type of material as a source of inspiration, to see how other instructors cover a particular topic, and end up reusing a subset of slides or just following the overall structure to create their own set of slides. Tutorials, instead, tend to be reused in their original format; therefore, their completeness and consistency are crucial. Obtaining a well-documented tutorial would substantially speed up course preparation. Trainers were also unanimous in indicating that finding adequate datasets (in terms of size, content, and relevance to the audience) is often challenging. Datasets should be derived from real experiments, as opposed to simulated data; they need to be publicly available and also suitable to demonstrate particular analysis steps and their caveats. Moreover, they should be reasonable in size to allow for the fast execution of a typical HTS pipeline and be well annotated.
The ideal solution for sharing materials among trainers would be to build a unified collection of consistently annotated materials, easy to search and expand, and link this collection to existing training portals for the benefit of the trainers' community at large. Such a collection would provide a framework (i) where trainers could share ideas, (ii) where materials could be tested, improved, or further developed in a collaborative manner, and (iii) where iterative versions of the same material could be archived.
Solutions to achieve these goals, and to address most of the issues discussed in the introduction, have already been developed in other contexts, e.g., Software Carpentry (http://softwarecarpentry.org/lessons/). It is common practice in computer science to collaboratively document, develop, test, and version programs. Based on this observation, it was decided to draft and develop a common collection of materials using a version control system, namely Git [10], as the backbone.
To address the reusability issue, a core set of descriptors and a controlled vocabulary were devised and implemented as part of the chosen annotation strategy.

Standards for describing training materials
Utilizing a minimal set of descriptors will allow for training materials that have been developed independently to be annotated in a consistent manner. The aim of such metadata would be to summarize basic information about the training material, describing its content, suitability for different audiences, and provenance, capturing all the information that we consider essential for the materials to be reused by someone else rather than the original author.
In particular, clear and concise learning objectives (LOs) should be adopted to describe courses and annotate training materials. LOs should help trainers to identify materials that they might want to reuse, to plan a course based on what individuals need to be able to achieve by the end of it, and also to track learning progression throughout a course. LOs should also help trainees to gather if a course or training materials are suitable for their needs and what they can expect to learn by attending a course or utilizing training materials.
The minimal set of descriptors agreed upon for each training material is given in Table 2 (including exemplary usage of the descriptors and, for the dataset descriptor, the exemplary use of a study by Buecker et al. [11]).
Example description (Table 2): "In this course we provide a basic introduction to conducting ChIP-Seq data analysis using the Galaxy framework. We will be retracing most of the steps required to get from an Illumina FASTQ sequence file all the way to performing peak calling and identifying over-represented sequence motifs and functional annotation. The aim is to give biologists the tools to independently run a basic analysis of ChIP-Seq data."
Once the minimal set of descriptors was agreed upon, the training workflows for which training materials were available at the workshop were selected and, for each workflow's module, materials were collaboratively annotated.
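To illustrate how such descriptors could be checked for completeness, the following is a minimal sketch in Python. It is not the repository's implementation: the field names (`title`, `description`, `learning_objectives`, `audience`, `dataset`, `author`) are assumptions made for this example and are not the descriptor names listed in Table 2; the example values paraphrase the ChIP-Seq course description quoted in this section.

```python
from dataclasses import dataclass, field, fields


@dataclass
class MaterialMetadata:
    """Hypothetical descriptor record; field names are illustrative only."""
    title: str = ""
    description: str = ""
    learning_objectives: list = field(default_factory=list)
    audience: str = ""
    dataset: str = ""
    author: str = ""


def missing_descriptors(record):
    """Return the names of descriptors that were left empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Example values paraphrased from the ChIP-Seq description in the text;
# the dataset and author descriptors are deliberately left empty.
example = MaterialMetadata(
    title="ChIP-Seq data analysis with Galaxy",
    description="Basic introduction to ChIP-Seq data analysis using Galaxy.",
    learning_objectives=["Run a basic ChIP-Seq analysis from FASTQ to peak calling"],
    audience="Biologists",
)
print(missing_descriptors(example))  # → ['dataset', 'author']
```

A check of this kind could flag incomplete annotations before a material is shared, whatever the actual descriptor names are.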

Training workflows
Three workflows were selected, "RNA-Seq," "ChIP-Seq," and "variant analysis," alongside two general topics: "Prerequisites" and "NGS-Introduction," which provide the basic building blocks for the three workflows.
"Prerequisites" includes materials covering basic skills and knowledge in programming and statistics. For example, familiarity with the Unix shell and the R environment is a crucial prerequisite. Depending on the target audience, introductions to Unix and R should be incorporated into a course's program or training materials, possibly utilizing materials developed by the Software Carpentry Foundation [12,13]. "NGS-Introduction" covers common concepts such as sequencing platforms, sequencing applications, and data formats.
The three workflows, "RNA-Seq," "ChIP-Seq," and "Variant calling," aim to provide introductory as well as advanced training for these types of HTS analysis.
Each workflow was discussed and a set of modules was defined for each. As an example, the RNA-Seq topic contains the following modules: Pre-processing, Alignment, Alignment Quality Control (QC), Feature summarization, Feature summarization QC, Exploratory analysis, De novo transcriptome assembly, and Differential Expression (DE). All the topics, their modules, and the corresponding descriptions are detailed in Table 3. The modules were furthermore grouped into essential and optional, distinguishing those that are fundamental for each analysis workflow from those that can be used at the discretion of the trainer.
A set of controlled vocabularies, adapted from the EMBRACE Data and Methods (EDAM) ontology [14], was selected to tag individual modules in order to facilitate annotation and enable keyword-based searching of the repository content.
All materials provided by the trainers attending the workshop were assigned to the appropriate workflow module, annotated as described in the previous section, and uploaded to a Git repository.
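The role of a controlled vocabulary in tagging and keyword-based searching can be sketched as follows. This is an illustration only: the keywords below are placeholders rather than actual EDAM terms or identifiers, and the tag assignments for the three RNA-Seq modules are hypothetical.

```python
# Placeholder keywords; NOT actual EDAM terms or identifiers.
VOCABULARY = {"rna-seq", "alignment", "quality control", "differential expression"}

# Hypothetical tag assignments for three RNA-Seq modules named in the text.
MODULE_TAGS = {
    "Alignment": {"rna-seq", "alignment"},
    "Alignment Quality Control (QC)": {"rna-seq", "alignment", "quality control"},
    "Differential Expression (DE)": {"rna-seq", "differential expression"},
}


def invalid_tags(tags, vocabulary=VOCABULARY):
    """Return tags that are not part of the controlled vocabulary."""
    return sorted(set(tags) - vocabulary)


def search(keyword, module_tags=MODULE_TAGS):
    """Return module names tagged with the keyword (case-insensitive)."""
    wanted = keyword.strip().lower()
    return sorted(name for name, tags in module_tags.items() if wanted in tags)


print(search("Alignment"))
# → ['Alignment', 'Alignment Quality Control (QC)']
print(invalid_tags({"alignment", "mapping"}))
# → ['mapping']
```

Restricting tags to a fixed vocabulary means that a free-text synonym such as "mapping" is rejected at annotation time instead of silently fragmenting the search results.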

Repository implementation
The current implementation of the HTS training materials repository has two components: a back-end, based on Git to handle the content, and a front end [15], served as web-content, which builds on the Git repository, as detailed in Fig 1. The Git versioning system is used to keep track of materials' updates, in the same way as it is used in software development to keep track of changes applied to source code.
Contributors who wish to apply changes to the repository content can do so by forking it. Modified materials can then be reloaded after passing consistency checks, which are in place to prevent accidental data modification or deletion. Additionally, the API associated with the selected Git implementation (GitLab [16]) allows for the development of applications that programmatically access the materials and their metadata. This API was used to: (i) index the module metadata held in the repository and (ii) program the search functionality, which is available from the repository's landing page. Leveraging such metadata, a trainer can query the repository based on their teaching interests and retrieve relevant training materials. Finally, to ensure easier navigation within the repository, hyperlinks are used to connect related materials, e.g., to collate all the material used in a given course.
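As a rough sketch of programmatic access, the Python snippet below builds request URLs for two endpoints of the current GitLab REST API (v4): one listing the files of a repository and one returning the raw content of a single file. It rests on several assumptions: the host `gitlab.example.org`, the project ID, the `master` branch, and the `RNA-Seq/Alignment/metadata.json` path are all hypothetical, and the API version in use when the repository was built may have differed. The snippet only constructs URLs and sends no requests.

```python
from urllib.parse import quote, urlencode

# Hypothetical values: replace with the real host and project ID.
BASE_URL = "https://gitlab.example.org/api/v4"
PROJECT_ID = 1


def tree_url(path="", ref="master"):
    """URL listing repository files under `path` (GitLab REST API v4)."""
    query = urlencode({"path": path, "ref": ref, "recursive": "true"})
    return f"{BASE_URL}/projects/{PROJECT_ID}/repository/tree?{query}"


def raw_file_url(file_path, ref="master"):
    """URL for the raw content of one file; the file path is URL-encoded."""
    encoded = quote(file_path, safe="")
    query = urlencode({"ref": ref})
    return f"{BASE_URL}/projects/{PROJECT_ID}/repository/files/{encoded}/raw?{query}"


print(tree_url("RNA-Seq"))
print(raw_file_url("RNA-Seq/Alignment/metadata.json"))
```

An indexing application could walk the file listing, fetch each metadata file, and feed its descriptors into a search index such as the one offered on the repository's landing page.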

Contributing to the repository
The project's wiki provides contributors with templates for the submission of new materials, the set of descriptors (i.e., the controlled vocabulary necessary to tag materials), and submission instructions, including an introduction to the minimal set of Git commands essential to contribute to the repository. Trainers who wish to contribute to the repository can do so either programmatically or by interacting with a curator. In either case, they need to provide the materials and the corresponding metadata (see descriptors in Table 2). For a programmatic submission, the contributor can either log in via an existing Google or GitHub account or create a new user account for the repository and then follow the aforementioned instructions. If necessary, e.g., for a novel type of data or analysis, the descriptors can be extended. For interactive submissions, we provide online forms, available from the repository's landing page, to create the material's metadata. Upon successful generation of the metadata, the contributor will be contacted by a curator to complete the submission. Regardless of the submission approach, new materials are checked for annotation inconsistencies before being made publicly available. During the initial testing phase, the repository was met with great interest from the community. Currently, 47 members have created an account in the system, of whom 29 attended the workshop and 28 have submitted materials. To facilitate communication between contributors, we have set up a mailing list, which currently includes 26 people.

Conclusions
We have developed a strategy for the curation of HTS training materials and established a working framework for sharing such materials among trainers, in order to promote and strengthen interactions among them and to let them learn from each other's teaching experience.
The Git repository of curated HTS materials that we have created is now publicly available at http://bioinformatics.upsc.se/htmr and discoverable through both the ELIXIR and GOBLET training portals. It has the potential to ease the preparation of training courses via a community-driven sharing strategy. In addition, it enables trainers to update and modify their material while keeping track of the changes. This solution is scalable and has been made robust through the use of an easily manageable API in combination with consistent curation.
Community initiatives are already planned to refine the training materials curation strategy and extend the coverage of this collection.
ELIXIR has recently organized a thematic hackathon focusing on the use of the EDAM ontology to annotate training materials currently available in the collection. The hackathon had the dual purpose of tagging materials with EDAM ontology annotations and, at the same time, improving the ontology's coverage. Additionally, training materials were mapped to the bioinformatics tools and resources from ELIXIR's Tool and Data Service registry [17,18] to increase the simultaneous discoverability of bioinformatics tools and related training materials.
GOBLET and ELIXIR are now planning a second workshop that will bring together trainers working in the field of metagenomics, to enhance the training network within this field. The strategy presented in this paper will be applied to the curation of existing metagenomics analysis training materials, with the aim to define a generic approach for the curation and dissemination of training materials through training portals such as TeSS and GOBLET.
We encourage trainers active in delivering HTS training, as well as trainers who might be new to this field, to get involved, utilize the materials already available in the repository to deliver training in this area, and contribute to this initiative with new materials.