bertha: Project skeleton for scientific software

Science depends heavily on reliable and easy-to-use software packages, such as mathematical libraries or data analysis tools. Developing such packages requires considerable effort, which is too often avoided due to a lack of funding or recognition. In order to reduce the effort required to create sustainable software packages, we present a project skeleton that ensures best software engineering practices from the start of a project, or serves as a reference for existing projects.


Introduction
In a recent essay in Nature [1], a familiar dilemma in science was addressed. On the one hand, science relies heavily on open-source software packages such as libraries for mathematical operations, implementations of numerical methods, or data analysis tools. As a consequence, those software packages need to work reliably and should be easy to use. On the other hand, scientific software is notoriously underfunded and the required effort is often invested in side projects or even in the spare time of scientists.
Indeed, there is a lot of effort to invest beyond the work on the actual implementation, which is typically a formidable challenge on its own. This becomes apparent after consulting literature on software engineering in general (such as the influential "Pragmatic Programmer" [2]) and in scientific contexts in particular (e.g., [3][4][5][6]). The vast number of best-practice guides and development guidelines (e.g., those published by the German Aerospace Center (DLR) [7] and the Netherlands eScience Center [8]) further underlines the importance of the topic and may serve as guidance, but scientists often lack the time and/or formal training in software engineering required to ensure sustainable software development [1,5,6]. Too often, this results in poorly maintained software projects of questionable reliability and usability.
Given all this, the challenge is once more to achieve much with little effort. Therefore, in this paper we present a project skeleton that may serve as a solid yet lightweight base for a small to medium scale scientific software project. In the envisaged use case, scientists can create an instance of this template with one click of a button. This instance implements essential best practices in software engineering from the very start. After performing a minimal amount of customization, the scientist can soon start working on the actual implementation and concentrate on what really matters. To the best of our knowledge, such a project skeleton has not been published yet.
In the scope of this work, we focus on scientific software libraries which are written in the C++ programming language for performance reasons and feature bindings for Python in order to provide an easy-to-use interface to the user. These programming languages are widely accepted in both the open-source and high performance computing (HPC) communities and should therefore constitute a reasonable choice. It should be noted that the skeleton does not (and should not) cover every eventuality (e.g., where support for the Fortran programming language is required) but concentrates on one particular use case. This is contrary to the recommendations in related literature, which are kept general and language agnostic on purpose. The rationale behind this decision is to keep the template lightweight and avoid clutter.
The paper at hand is organized as follows: In Section 2, we identify the essential best practices that are required to ensure high quality scientific software, based on related literature and our own experience with our software projects (e.g., the mbsolve software, a solver for the generalized Maxwell-Bloch equations [9,10]). Subsequently, we present our project skeleton and discuss the specific implementation of the identified measures in Section 3. As already stated above, some minor customization steps are required. Section 4 gives an overview of these steps and thereby serves as an introduction for the (potential) user. Finally, we conclude with a short summary and give an outlook on future work, i.e., additional tools and measures that further improve the quality of scientific software projects.

Best practices in scientific software engineering
This section describes the essential recommendations and best practices from related literature [1][2][3][4][5][6][7][8] which serve as the basis for the project skeleton. All recommendations are language agnostic and grouped into seven categories with no particular order of importance. Table 1 gives an overview of the best practices.

Project management
Most software projects in a scientific context start with a single developer. However, over time the projects are likely to grow, be extended, and possibly be taken over by other developers. Building a developer community is crucial for the success of the project in general and for open-source projects in particular [4]. Therefore, the project infrastructure should be able to handle multiple developers from the very start.
All of the guidelines in the literature we have found mention the usage of a version control system (VCS). Even for the single developer this brings advantages, as a VCS intrinsically features a backup solution and synchronization between different machines. Once more developers are working on the project, collaboration is enabled in a transparent way. By using a VCS, the "Make Incremental Changes" paradigm [5,6] can be implemented easily, and the intrinsically generated development history may serve as rudimentary documentation of design decisions [4].
In a more advanced scenario, the VCS is coupled to a project management tool which provides means of communication within the developer team and thereby further enhances transparency. As the communication logs are available to developers who join the team at a later stage, they also constitute a certain form of documentation [4]. One essential element of a project management tool is a ticket system or issue tracker. Issues are requests for a certain change (such as a bug fix or feature implementation) and play a crucial role in modern iterative and incremental software development processes such as feature-based development [11]. As the name suggests, issue trackers keep track of issues from their creation (by users or developers) to their completion in the form of a solution accepted by the developer [7]. Modern project management tools also include convenient mechanisms for code review. Similar to a scientific paper, a rigorous review process may be time-intensive and annoying but eventually yields solutions of higher quality and wider acceptance [5].

Code quality
Just as we care about language style when writing a scientific article, we should care about coding style when writing scientific software. Here, we should bear the mottos "Write Programs for People, Not Computers" [5] and "Don't Repeat Yourself" [2,5] in mind and produce easily readable and modular code. In developer teams it is crucial to agree on a certain coding style at the beginning of the project. The coding style usually consists of two parts: rules for formatting code and best practices for programming in the respective language. Code formatting tools enable manual and automated checks of whether source code is compliant with the agreed code formatting rules [8]. Analogously, static code analysis tools check whether the agreed best practices are violated [7].

Independence
Some guidelines recommend that open standards, protocols, and file formats should be used wherever possible (e.g., the HDF5 format for large data sets [8]). Thereby, vendor lock-in situations are avoided, which would arise, for example, if a certain source code can only be compiled using a certain compiler brand or version. Our general recommendation here is to provide solutions that work with the most widely used operating systems and compilers (and possibly combinations thereof) from the very start.
Following the advice that one should never reinvent the wheel, established software libraries and tools are often used to speed up development. Here, we recommend using open-source components unless there is a strong reason not to. This is in agreement with the interoperability and reusability parts of the FAIR principle [12,13].

Automation
We should "Let the Computer Do the Work" [2,5] and automate repetitive tasks such as building the software, running tests, performing quality checks, and deploying the generated artifacts (typically, software in binary form and documentation) to a software repository. Otherwise, those tedious tasks are most likely postponed, not done at all, or performed only partially. Here, continuous integration (CI) tools are helpful, as different jobs can be defined and grouped into stages, which are executed every time the developers push changes to the version control repository. Then, the developers receive feedback on their changes, which is an essential part of the "Make Incremental Changes" strategy [5].
The feedback typically consists of (at least) two parts, which are briefly outlined in the following. First, the build process should run in an automated and platform independent fashion. Here, it is particularly important that third-party dependencies are found without hard-coded paths. The output of the build process tells the developers whether the build on different platforms was successful. This is especially beneficial as most developers work on one particular platform and the code is not intrinsically tested on other platforms (different operating systems, different compiler versions, etc.). Second, test programs can be executed automatically on different platforms. For example, unit tests can help to verify the correct behavior of certain functions or modules of the software. Functional tests, on the other hand, help to gain more confidence in the overall function of the software [7].
It makes sense to define the continuous integration pipelines as early as possible, so that the developers benefit from the feedback from the very beginning. Thereby, bugs in the software (in particular regressions) can be detected early. Furthermore, the effectiveness of optimizations can be assessed while the correct operation of the software is ensured.

Documentation
In order to make scientific software reusable, providing documentation to users and developers is one of the most important steps [1][2][3][4][5][6][7][8]. Bangerth and Heister [4] list five items which the documentation should contain: traditional comments, function level documentation, class level documentation, an overview of how modules interact, and complete examples in tutorial form. As to traditional comments, it is good practice to "Document Design and Purpose, Not Mechanics" [5] and avoid obvious comments. Function and class level documentation is typically generated based on comments in the code that use special annotations. The resulting reference manual is particularly interesting for developers and advanced users who need to know the details. On the other hand, the module overview documentation should inform new users about the big picture. This information is typically written into the files README (aim of the software, installation notes, list of dependencies), CHANGELOG (overview of releases, features, known bugs), CONTRIBUTING (guide for (potential) developers), and TUTORIAL (guide for (potential) users) [6].

Testing
As mistakes are natural and bound to happen, we should plan for them and develop strategies to detect them as early as possible [5]. Automated testing, whose importance has already been underlined in Section 2.4, is the cornerstone of such strategies. It should be noted that the effectiveness of the tests should be monitored as well. Here, code coverage tools are useful as they are able to detect code parts which are not covered by the executed tests [7].
Again, we stress that certain measures such as writing unit tests should be carried out from the very beginning. Apart from their use in automated testing, unit tests may have a positive effect on the code design. Since testable code is usually also modular, writing unit tests from the start encourages a modular design [2].

Deployment
Whether or not a certain software project is used depends to a large degree on the ability to distribute it [4]. Hence, it is recommended to package the software and distribute it using an established software repository [1]. Similar to the practices discussed above, it is important that the deployment is carried out automatically and as early as possible [3].

Implementation of the project skeleton
Based on the (general and programming language agnostic) best practices introduced in the section above, we implement measures for a C++ software library with bindings for the Python language in this section. The result is publicly available [14] and may serve as a template for new projects or a reference for existing projects. Figure 1 sketches the skeleton approach.
It should be noted that there may be different ways to implement a certain measure. For the sake of simplicity, we discuss only one or two possibilities for most measures. Following the recommendations in Section 2.3, we have selected open-source tools and libraries exclusively. Thereby, one particular lightweight solution is provided for scientists who are new to the topic, while advanced users may replace a certain implementation of a measure with another library or tool of their choice.
Since a project skeleton does not include a real implementation, best practices regarding planning, structuring and writing code can hardly be demonstrated. In this regard, we refer the reader to the available literature on the topic, such as [2], and focus on the project skeleton that provides the required infrastructure.

Usage of a version control system (VCS) and appropriate work flow
A multitude of version control systems has been published and used over the last three decades. We stick to our criterion that the software must be open-source and note that git has received much attention since its first release. For the development workflow, it makes sense to use something established such as the GitLab Flow [16]. This workflow uses feature branches to develop and test new features or bug fixes. Once the changes on the feature branch fulfill the requirements and pass the automated tests and quality checks, the developer can open a merge request. A maintainer can subsequently merge the changes into the main development branch. Additionally, the GitLab Flow allows stable branches and different environments (such as production) where further restrictions can apply. The latter features are not required at the initial stage of a project, but underline that the GitLab Flow is simple enough for small projects yet powerful enough for large and established projects.

Usage of a project management tool including issue tracking
There are several management tools and hosting platforms that can be combined with the git version control system, each with different strengths and drawbacks. Here, we would like to leave the choice to the developers and provide two possible solutions for the undecided. Over the last decade, the GitHub platform has received significant attention. It provides free public git repositories and integrations of other services (such as the zenodo repository for storing research output). Due to its prominence, we have decided to provide a mirror repository of the project skeleton on GitHub [17]. This repository is marked as a project template, therefore a new project can be instantiated with one click of a button. As to continuous integration, GitHub offers support for external CI providers such as Travis CI, AppVeyor, or Microsoft Azure. These services are free for open-source projects and are typically configured using a YAML file, where the CI jobs can be described.
Alternatively, the GitLab platform can be used, which is conceptually similar to GitHub, the main difference being the possibility to self-host the platform on a local server. While the concepts (such as Pages and Releases) are similar, there are slight differences. For example, the project template instantiation mechanism is different.
At this point, it is not possible to create an instance of the project skeleton with a single click. However, we aim to provide this feature in the near future [18].
GitLab.com provides free hosting and internal continuous integration services for open-source projects. Currently, those internal CI services are restricted to the Linux operating system. It is possible, however, to install GitLab's CI suite on a local machine and connect it to GitLab. Alternatively, an external service can be used for the Windows or macOS operating systems. In case the project should not be open-source, the self-hosted operation mode may be selected. Here, the CI suite must be installed on local machines which can subsequently be connected to the local GitLab installation.
It should be noted that we did not add configuration files for all options to the template in order to keep the skeleton lightweight. Instead, we included the configuration file for the GitLab internal CI, which calls the targets generated by the build system. From this configuration file, corresponding files for other CI services can be derived.
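To illustrate the structure of such a configuration file, a minimal .gitlab-ci.yml could be sketched as follows (the stage and job names are illustrative and not the exact contents of the skeleton):

```yaml
stages:
  - build
  - tests

# build the project via the build system targets
build-gcc:
  stage: build
  script:
    - cmake -S . -B build
    - cmake --build build

# execute the registered tests
unit-tests:
  stage: tests
  script:
    - cd build && ctest --output-on-failure
```

Each job belongs to a stage, and the stages are executed in order whenever changes are pushed to the repository.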

Automated build system
In particular when the C++ programming language is involved, the CMake project provides well-established tools to build, test and package software. The main advantage of CMake (compared to alternatives such as GNU make, Visual Studio, or Eclipse) is that a level of abstraction is introduced. The configuration files consist of directives such as add_library or find_package and are therefore quite easy to read and understand. Based on those configuration files, project files for the aforementioned alternatives (and many other build systems) can be generated. Thereby, the software project can be built for different operating systems or using different compilers. Additionally, CMake features a mechanism to find third-party libraries and tools. This feature is essential for cross-platform dependency management.
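As an illustration of these directives, a minimal CMakeLists.txt for a shared library with one third-party dependency might look as follows (the project, target, and file names are hypothetical):

```cmake
cmake_minimum_required(VERSION 3.9)
project(bertha CXX)

# build the core shared library from its source files
add_library(bertha-core SHARED src/device.cpp)
target_include_directories(bertha-core PUBLIC include)

# locate a third-party dependency in a platform independent way
find_package(Threads REQUIRED)
target_link_libraries(bertha-core PRIVATE Threads::Threads)
```

Since no paths are hard-coded, the same configuration can be used to generate build files for different operating systems and compilers.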
As a proof of concept, we have added a simple shared library written in C++ to our project skeleton. It features a simple class device with two member variables that represent its start and end coordinates, respectively. An instance of this class can be created using one of two constructors, where either the coordinates are specified directly, or the length is set and the start coordinate is assumed to be at the origin. Finally, a method returns the length of the device.
For such a shared library, Python bindings can be generated conveniently using the SWIG project. It is fully supported by CMake and requires only a minimal configuration file, which basically specifies which C++ header files should be considered when creating the interface. SWIG scans the specified header files and automatically generates a Python module, which can subsequently be imported and used in a Python project.
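For reference, such a SWIG interface file could be sketched as follows (the module and header names are illustrative):

```swig
%module bertha

%{
/* headers to include in the generated wrapper code */
#include "device.hpp"
%}

/* headers to scan when creating the Python interface */
%include "device.hpp"
```

With this file in place, SWIG generates a Python module that exposes the classes declared in the listed headers.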

Unit testing
Ideally, the software is designed so that each unit of software (e.g., a function) fulfills a certain, unique task ("Design by Contract" technique [2]), and the implementation of each unit is flawless. While the first goal can be achieved by careful design and refactoring, the second is rarely true. As mentioned above, mistakes will happen and we have to test whether the implementation of each unit works correctly.
In the case of our simple C++ library, we have to check, for instance, whether the calculation of the length yields the correct result. This can be achieved by writing a unit test that creates an instance of the device class, calls its get_length method, and compares the result to the expected value. Also, whenever the user specifies input data, the implementation should check whether those values are reasonable and handle invalid values (most likely, by throwing an exception). Error handling code must be tested as well, for example by creating a unit test in which the error is provoked on purpose and a check verifies that the error handling code yields the correct behavior. As the number of unit tests is expected to be large for a real-life project, it is recommended to use a unit test framework.
We chose the Catch2 library as it is open-source, lightweight, and header-only. Based on this library, we added a test executable with several unit tests to our CMake build system. Here, we could rely on the CTest functionality of the CMake project. Whether or not the unit tests cover all possible situations can be assessed using code coverage tools. We have added the possibility of using the gcov tool to the project skeleton. This tool generates profiling information during the execution of the tests. This information can subsequently be converted into a human-readable report, where metrics such as line coverage are given on a per-file basis.

Automatic code formatting
Here, the clang-format tool constitutes a helpful and versatile instrument. It can be configured using a single file, in which the code formatting rules are specified. There are several predefined styles which can be used as-is or alternatively serve as a basis. It is also possible to define a certain style from scratch, but we recommend using an existing style (with slight modifications, if required).
In our project skeleton, the clang-format tool is integrated into the CMake build system. Thereby, the user can easily format all source files automatically. This functionality is also used as a check whether the source code conforms to the specified style in the scope of continuous integration.
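For illustration, a .clang-format file that derives from a predefined style could look as follows (the concrete options chosen in the skeleton may differ):

```yaml
# start from a predefined style and adjust selected options
BasedOnStyle: LLVM
IndentWidth: 4
ColumnLimit: 80
```

Running clang-format with this file in the project root then reformats the source files accordingly, and the same configuration is used for the automated style check.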

Documentation generation
From the implementation point of view, we can separate the different types of documentation listed in Section 2.5 into two groups, namely the function reference and the overview documentation. The function reference is based on comments in the source code that use special annotations. The information in those comments can be extracted using the Doxygen tool. For the overview documentation, which provides the "big picture", it makes sense to use a structured text format. Since Doxygen supports the Markdown language, we chose to write files such as README.md and CONTRIBUTING.md in this notation. Both overview documentation and function reference are then transformed into static HTML pages which can be viewed locally or uploaded to a web server.
We note that while Doxygen provides unchallenged support for in-source C++ documentation, the design of the generated HTML files appears a bit dated. There are more advanced workflows that use Doxygen as an input parser and alternative tools to generate the static HTML pages. However, this is beyond the scope of the work at hand.

Automated packaging and deployment to a public repository
While many operating systems and programming languages feature a common repository to exchange programs and libraries in binary form, it would be beneficial to have a language agnostic repository that covers all operating systems. Fortunately, the conda system provides exactly this. Once a software project is in a stable state, a recipe can be created on conda-forge that defines the source of the project, the steps required to build it, and meta information such as the name of the responsible maintainer. Based on this recipe, the conda-forge build system automatically generates the binaries for different platforms. Then, on each platform the resulting package can be easily installed within a conda environment.
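The core of such a recipe is a meta.yaml file, which could be sketched as follows (the entries are placeholders for illustration and do not reproduce the actual recipe [15]):

```yaml
package:
  name: bertha
  version: "0.1.0"

source:
  url: <url of the source archive>
  sha256: <checksum of the source archive>

requirements:
  build:
    - cmake
    - {{ compiler('cxx') }}
  host:
    - python
    - swig

about:
  license: <project license>
  summary: Project skeleton for scientific software
```

The build requirements mirror the tools described above, and the conda-forge build service uses this description to produce binaries for the supported platforms.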
Most likely, the package has dependencies on other libraries. The conda system offers a vast amount of third-party components and convenient methods to install them. The already mentioned environment approach has a positive effect on dependency management: on Windows it is generally impossible to distinguish between different versions of a library (at least when considering unmanaged C++ code), a situation dubbed the "DLL Hell". Using conda environments, however, it is possible to separate different versions in a clean and convenient way.
The generated documentation could be included in a conda package as well. However, we found it more appropriate to publish it on a web server for visibility reasons. Both GitHub and GitLab offer the possibility to host static HTML pages such as those generated by Doxygen. With a few lines of CI configuration, the documentation is generated and uploaded automatically. See [19] for an example.

Creating a skeleton instance
In order to create a new project, the project skeleton can be cloned either using the mechanisms of GitHub or GitLab (as described above and in Figure 2). Alternatively, the files can be copied manually and added to a new repository. After the cloning procedure, the skeleton can be adjusted to the needs of the new project. The required and recommended steps are briefly outlined in the following. For more detailed instructions, please follow the tutorial in the bertha documentation [19].

Figure 2: Creating an instance of the project skeleton with a few button clicks. On GitHub (left), navigate to the bertha mirror [17] and click on "Use this template" (green button). The same can be achieved on GitLab (right) by creating a new project and selecting the bertha template (in development, see [18]).

Setup stage
At the beginning, it is important to define a meaningful name for the project and replace bertha with it throughout the project (e.g., in the CMake build structure). Ensure that the name is not already in use (e.g., on conda-forge) in case the project should be open-source. Then, the project team should agree on where to host the project (for internal use only or publicly available), on the license for the project, and on the workflow. The latter mainly includes the coding style and the version control workflow. Both should be documented as soon as possible.

Implementation stage
At this point, the software project has a solid initial state. Now it is time to add functionality. Here, consider writing the documentation first (the contract), then implementing the functionality, and at the same time writing unit tests. This approach will seem slow but improves the quality of the design and helps to detect mistakes early on. Also, the CMake build structure can be adjusted to add requirements (e.g., software libraries) or additional modules (besides the existing core library).

Publication stage
In the case of an open-source project, the code should be distributed and communicated as soon as there is a state with some first functionality. For the distribution of the project in binary form, the conda recipe for bertha [15] may serve as a reference.

Conclusion
In the work at hand, we have presented a skeleton for scientific software projects which consist of libraries written in the C++ programming language and feature a Python interface. The skeleton contains the essential elements required to ensure best software engineering practices. Thereby, we hope to provide the scientific community with a helpful tool that saves time during the setup of a new project. Based on the experience gained during the development of the skeleton, creating a bertha instance may replace at least one person-month of evaluating tools, reading documentation, and searching for answers on the internet.
Furthermore, this contribution may serve as a checklist and reference for existing projects. We hope that in both use cases, building a project from scratch or adapting an existing one, the project skeleton will aid the implementation of good practices in scientific software engineering and consequently improve the quality and reusability of scientific software projects.
As a next step, the implementation of further measures is envisaged. For example, a static code analysis tool could further improve the quality of the code. Also, the generated documentation and quality reports should be presented in a modern appearance. Finally, the project skeleton concept can be transferred to other project classes in scientific software engineering, such as a pure Python project or the combination of a Fortran library with a Python interface.

Figure 1: Overview of the project skeleton. The source code and dependencies of a scientific software project are denoted in orange. These are the parts the developer has to provide. The presented skeleton guides the project from creation to deployment. Here, the arrows denote jobs that are created by the CMake build system. These jobs are triggered during the different continuous integration stages (build, tests, quality, deploy) or (in the case of the dashed arrows) by the conda-forge build service that follows the recipe [15]. The job names indicate the tools in use, where CXX represents one of the C++ compilers that are supported by CMake.

Table 1
Overview of best practices in software engineering for scientific software projects.For each best practice, implementation candidates are listed where the selected choice is denoted in bold.