NTB-T07

Title

Software containerization in bioinformatics: how to make reproducible, portable and reusable bioinformatics software & pipelines

Tutorial details
  • Date: Sunday, September 18th
  • Time: 14:00 to 18:00 CEST (Slot 27)
  • Format: Face-to-face
  • Room: TBD
Instructors
  • Giacomo Baruzzo, PhD. Department of Information Engineering, University of Padova, Padova (Italy)
  • Prof. Barbara Di Camillo, PhD. Department of Information Engineering, University of Padova, Padova (Italy)
  • Marco Cappellato, MSc. Department of Information Engineering, University of Padova, Padova (Italy)
  • Giulia Cesaro, MSc., PhD student, Department of Information Engineering, University of Padova, Padova (Italy) – Helper
  • Mikele Milia, MSc., research fellow, Department of Information Engineering, University of Padova, Padova (Italy) – Helper
Summary

Abstract

Reproducibility of scientific results is a major issue in all fields of science and research, including bioinformatics. Containers address many of the issues that undermine the reproducibility of bioinformatics analyses, and they also provide an efficient way to ensure the reusability and portability of bioinformatics software and pipelines. A container encapsulates an entire execution environment (software dependencies, libraries, runtime code, data, etc.) that can be easily shared, deployed and executed in an efficient and fully reproducible way on a variety of computing systems.

This tutorial will introduce attendees to software containerization and its impact on science and research in terms of reproducibility, portability and reusability of software and related results, with practical application to the bioinformatics field. First, the tutorial will provide an overview of the different virtualization strategies (e.g. containers vs. virtual machines) from both users’ and developers’ points of view, and the differences between the available container engines. Second, it will explain how to develop, build, run and share containers using two of the most widely used container engines, Docker and Singularity. Since many fields of bioinformatics involve computationally intensive tasks and the analysis of large datasets, the tutorial will also present how to work with containers on both desktop/workstation and High-Performance Computing (HPC) infrastructures, including practical guides for parallel frameworks and GPU applications.

Motivation

The lack of reproducibility of scientific results has a negative impact on several research fields, including bioinformatics. Specifically, bioinformatics analyses are the result of sets or pipelines of software packages, each with multiple running options, which makes it difficult to fully reproduce a specific analysis workflow. Moreover, bioinformatics software evolves quickly and relies extensively on external libraries and packages, two factors that limit reproducibility across software versions and operating systems. This reliance on external libraries and packages also limits the portability and reusability of the software, since a large number of dependencies must be properly installed and configured, and the application must be built or compiled on each target system.

Containers address most of the above issues. Containers use lightweight virtualization to encapsulate an entire execution environment that can be easily shared, deployed and executed in an efficient and fully reproducible way on a variety of computing systems. Compared to other virtualization strategies, software containerization has a very low computational overhead, supports modern parallel computing (multithreading, MPI, GPU computing, etc.) and is widely used even in High-Performance Computing scenarios.

Therefore, we propose this tutorial with the aim of providing all researchers who have to develop/use bioinformatics software with a powerful tool to enhance the reproducibility, usability and reusability of their software/pipelines and related scientific results.

Learning outcomes:

  • Understand software containerization and its impact on science and research in terms of reproducibility, portability and reusability of software and results, with practical application to the bioinformatics field.
  • Understand the different virtualization strategies (e.g. containers vs. virtual machines) from both users’ and developers’ points of view, and the differences between the available container engines (e.g. Docker and Singularity).
  • Develop, build, run and share simple containers based on the Singularity and Docker container engines on both desktop/workstation and HPC infrastructures.
Intended audience

Master's or PhD students and researchers in the fields of bioinformatics, computational biology and medical informatics who use and/or develop bioinformatics software and pipelines, and who are interested in enhancing the reproducibility, reusability and portability of bioinformatics software and analyses.

Prerequisites

Basic knowledge of Linux-based operating systems and the Linux terminal is strongly recommended.

Maximum number of attendees

40

Material required (for participants)

All the course material (slides, container definition files, container images, etc.) will be made freely available at https://sysbiobig.dei.unipd.it/eccb-2022-new-trends-in-bioinformatics-tutorial-software-containerization. Virtual machines containing a working environment to reproduce the tutorial examples will be made freely available to attendees who want to practise during or after the tutorial (see link above). Participants interested in practising during the tutorial should bring their own laptops.

Programme
  • Introduction to software containerization – Part 1 (definition of containers; why we need containers in bioinformatics; container pros: efficiency, reproducibility, reusability, portability; container cons: learning curve, same kernel/architecture constraint; how containers work; discussion and questions) [30 minutes]
  • Introduction to software containerization – Part 2 (main container engines: Docker, Singularity, etc.; container registries: Docker Hub, NVIDIA GPU Cloud, Singularity Libraries, etc.; main operations: run a container, build a container, develop a container; virtualization strategies: containers vs. virtual machines; containers in High-Performance Computing; discussion and questions) [30 minutes]
  • Working with containers – Part 1 (introduction to Singularity; how to run Singularity containers: singularity exec, singularity run, singularity shell; how to download and build Singularity containers: singularity build; discussion and questions) [30 minutes]
  • Coffee break
  • Working with containers – Part 2 (how to develop a Singularity container: Singularity definition files; Singularity advanced topics: sandboxes, encryption, remote build, advanced build options; discussion and questions) [30 minutes]
  • Working with containers – Part 3 (working with Docker containers: similarities and differences with Singularity containers; from Singularity to Docker: Docker syntax and terminology) [30 minutes]
  • Working with containers – Part 4 (containers in HPC: container job files, containers and MPI, containers and GPUs; examples of Docker and Singularity containers for bioinformatics software; discussion and questions) [30 minutes]
TIME            CONTENT
14:00 – 14:30   Introduction to software containerization – Part 1
                Definition of containers; why we need containers in bioinformatics; container pros: efficiency, reproducibility, reusability, portability; container cons: learning curve, same kernel/architecture constraint; how containers work; discussion and questions
14:30 – 15:00   Introduction to software containerization – Part 2
                Main container engines: Docker, Singularity, etc.; container registries: Docker Hub, NVIDIA GPU Cloud, Singularity Libraries, etc.; main operations: run a container, build a container, develop a container; virtualization strategies: containers vs. virtual machines; containers in High-Performance Computing; discussion and questions
15:00 – 15:30   Working with containers – Part 1
                Introduction to Singularity; how to run Singularity containers: singularity exec, singularity run, singularity shell; how to download and build Singularity containers: singularity build; discussion and questions
                Break
16:00 – 16:30   Working with containers – Part 2
                How to develop a Singularity container: Singularity definition files; Singularity advanced topics: sandboxes, encryption, remote build, advanced build options; discussion and questions
16:30 – 17:00   Working with containers – Part 3
                Working with Docker containers: similarities and differences with Singularity containers; from Singularity to Docker: Docker syntax and terminology
17:00 – 17:30   Working with containers – Part 4
                Containers in HPC: container job files, containers and MPI, containers and GPUs; examples of Docker and Singularity containers for bioinformatics software; discussion and questions
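
As a rough illustration of the Singularity operations named in the programme (singularity build, exec, run and shell), the Python sketch below only assembles the corresponding command lines; it does not require Singularity to be installed. The helper names, the image file name and the docker://ubuntu:22.04 source are placeholders chosen for this example and are not taken from the tutorial material.

```python
import shlex
from typing import List, Sequence

# Hypothetical helpers for illustration only; they are not part of the
# tutorial material. They build argument lists and never execute anything.

def singularity_build(target: str, source: str) -> List[str]:
    """Command line that builds the image file `target` from `source`."""
    return ["singularity", "build", target, source]


def singularity_cmd(action: str, image: str,
                    args: Sequence[str] = (), gpu: bool = False) -> List[str]:
    """Command line for `singularity exec`, `run` or `shell` on an image."""
    if action not in {"exec", "run", "shell"}:
        raise ValueError(f"unsupported action: {action}")
    if action == "exec" and not args:
        raise ValueError("exec needs a command to run inside the container")
    if action == "shell" and args:
        raise ValueError("shell opens an interactive session and takes no command")
    cmd = ["singularity", action]
    if gpu:
        cmd.append("--nv")  # Singularity flag that enables NVIDIA GPU support
    cmd.append(image)
    cmd.extend(args)
    return cmd


if __name__ == "__main__":
    image = "example.sif"  # placeholder image name (assumption)
    for cmd in (
        singularity_build(image, "docker://ubuntu:22.04"),
        singularity_cmd("exec", image, ["cat", "/etc/os-release"]),
        singularity_cmd("run", image, gpu=True),
        singularity_cmd("shell", image),
    ):
        # Print each command in a form that can be pasted into a terminal.
        print(shlex.join(cmd))
```

On a machine where Singularity is available, any of the returned lists could be passed to subprocess.run(cmd, check=True) to execute it; keeping command construction separate from execution makes the sketch easy to inspect first.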