Rethomics: An R framework to analyse high-throughput behavioural data

The recent development of automatised methods to score various behaviours on a large number of animals provides biologists with an unprecedented set of tools to decipher these complex phenotypes. Analysing such data comes with several challenges that are largely shared across acquisition platform and paradigms. Here, we present rethomics, a set of R packages that unifies the analysis of behavioural datasets in an efficient and flexible manner. rethomics offers a computational solution to storing, manipulating and visualising large amounts of behavioural data. We propose it as a tool to bridge the gap between behavioural biology and data sciences, thus connecting computational and behavioural scientists. rethomics comes with a extensive documentation as well as a set of both practical and theoretical tutorials (available at https://rethomics.github.io).


Introduction
The biological implications and the determinism of animal behaviour have long been a subject of significant scientific interest. In the 1960s, behavioural geneticists, such as Seymour Benzer, showed that some apparently complex behaviours could in fact be governed by simple genetic determinants, which connected genetics and behaviour [1]. In the last few decades, our ability to acquire and analyse vast amounts of biological data has tremendously increased [2], deeply transforming both genetics [3] and neuroscience [4]. In fact, ethology itself is undergoing its own transition towards data sciences, which has prompted the terms 'ethomics' [5,6] and 'computational ethology' [7]. It is now accepted that the study of behaviour can also benefit from quantitative sciences such as machine learning, physics and computational linguistics [8,9].
Although general questions regarding the environmental, evolutionary, neural and genetic determinants of behaviours are shared within the community, the multiplicity of model organisms, hypotheses and paradigms has led to the existence of a very diverse palette of specific recording techniques. Some tools were developed, for instance, to continuously record simple behavioural features such as walking activity [10] and position [11] over long durations (days or weeks); to score more complex ones such as feeding [12,13], aggression [14] and courtship [15]; and to study the behaviour of multiple interacting animals [16][17][18] recording platforms are unrelated to each other, there are also some attempts to build general purpose tools that can be adapted by researchers to suit their specific goals [5,[19][20][21]. However, when it comes to the subsequent analysis of the generated results, there is still no unified programmatic framework that could be used as a set of building blocks in a pipeline. The fields of structural biology and bioinformatics are good examples of communities that have taken advantage of sharing standard files formats, modular command line tools [22] and software packages [23] that can be assembled into pipelines [24]. In these research areas, which are closely linked to data sciences and statistics, scripting interfaces are the standard since they help to deliver reproducible results [25,26]. In addition, they can be used on remote resources such as computer clusters, which makes them more scalable to the context of 'big data' [27]. Since many aspects of behaviour analysis are also becoming increasingly dependent on data sciences, the development of such common tools and data structures would be very valuable.
At first, it may seem as though behavioural experiments are prohibitively heterogeneousin terms of model organisms, paradigm and timescale-for a similar community to arise. However, low-level conceptual consistencies and methodological challenges are shared across experiments. For instance, the results (i.e. the 'data') feature a set of long time series (sometimes multivariate and irregular), but also contain a formal description of the treatment applied to each individual, the 'metadata'. Storing and accessing data and metadata efficiently involves the implementation of a nested data structure which, in principle, can be shared between acquisition platforms and experimental paradigms.
Here, we describe the rethomics platform, an effort to promote the interaction between behavioural biologists and data scientists. rethomics is implemented as a collection of interconnected packages, offering solutions to importing, storing, manipulating and visualising large amounts of behavioural results. We also present two practical examples of its application to the analysis of behavioural rhythm in fruit flies, a widely studied subject. results in increased testability, maintainability and adaptability. In this model, each task of the analysis workflow (i.e. data import, manipulation and visualisation) is handled by a different package, and new ones can be designed to suit specific needs. At the core of rethomics lies the behavr table, a structure used to store large amounts of data (e.g. position and activity) and metadata (e.g. treatment and genotype) in a unique data.table-derived object [30]. Any input package will import experimental data as a behavr table which can, in turn, be analysed and visualised regardless of the original input platform. Numerical results and plots are standard objects that can, therefore, be further analysed inside the wide R package ecosystem.

Internal data structure
Ethomics results can easily scale and data structure therefore gains from being computationally efficient-both in terms of memory footprint and processing speed. For instance, there could be very long time series, sampled several times per second, over multiple days, for each individual. In addition, time series can be multivariate-encoding coordinates, orientation, dimensions, activity, colour intensity and so on. Furthermore, experiments may feature a large number of individuals. Each individual is also associated with some metadata: a set of 'metavariables' that describe experimental conditions. For instance, metadata stores information regarding the date and location of the experiment, treatment, genotype, sex, post hoc observations and other arbitrary metavariables. A large set of metavariables is an important asset since they can later be treated as covariates.
behavr tables link metadata and data within the same object, extending the syntax of data.table to manipulate, join and access metadata (Fig 2 and S1 Fig). This approach guarantees that any data point can be mapped correctly to its parent metadata thanks to a shared key (id). Furthermore, it allows implicit update of metadata when data is altered. For instance, when data is subsetted, only the remaining individuals should be in the new metadata. It is also important that metadata and data can interoperate-for example, when updating a variable according to the value of a metavariable (say, altering the variable x only for animals with the metavariable sex = 'male'). The online tutorials and documentation provide a detailed set of examples and concrete use cases of behavr.

Data import
Data import packages translate results from a specific recording platform (e.g. text files and relational databases) into a single behavr object. Currently, we provide two packages: one to import data from single and multi-beam Drosophila Activity Monitor Systems (Trikinetics Inc.) and another one for Ethoscopes [21]. Although the structure of the raw results is very different, conceptually, loading data is very similar. In all cases, users must provide a metadata table, with one row per individual, and featuring both mandatory and optional columns (Fig 2). The mandatory ones are the necessary and sufficient information to fetch data (e.g. machine id, region of interest and date). The optional columns are user-defined arbitrary fields that translate experimental conditions (e.g. treatment, genotype and sex).
In this respect, the metadata file is a standardised and comprehensive data frame describing an experiment. It explicitly lists all treatments and individuals, which facilitates interspersion of conditions. Furthermore, it streamlines the inclusion and analysis of further replicates in the same workflow. Indeed, additional replicates can simply be added as new rows-and the id of the replicate later used, if needed, as a covariate.

Visualisation
To integrate visualisation in rethomics, we implemented ggetho, a package that offers new tools that extend the widely adopted ggplot2 [31]. ggetho makes full use of the internal behavr structure to summarise temporal trends. We implemented a set of new 'layers' and 'scales' that particularly applies to the visualisation of long experiments, with the ability to, for instance, display 'double-plotted actograms', periodograms, annotate light and dark phases and wrap time over a given period. Importantly, ggetho is fully compatible with ggplot2. For instance, ggplot2 operations such as faceting, transforming axes and adding new layers functions natively with ggetho.

Results
In order to illustrate the usefulness of rethomics, we provide two examples. The first one is a detailed and reproducible description of the loading and analysis of activity data in the context of circadian rhythm, using DAM2 (Trikinetics Inc.) data. The second one shows how rethomics integrates within R to perform a multi-scale analysis of a periodic behaviour, using continuous wavelet transform, on data generated with ethoscopes [21].

Canonical circadian analysis in Drosophila
The zeitgebr package implements a comprehensive suite of methods to analyse circadian rhythms, including the computation of autocorrelograms, χ 2 [32] and Lomb-Scargle [33] periodograms, and automatic peak detection.
The study of the rhythmical activity of fruit flies has played a crucial role in the development of circadian biology [34]. To date, most of the behavioural data used in the field is acquired through the Drosophila Activity Monitor System (DAMS). The package damr was developed to import DAMS results in the rethomics framework, which we envision will be a common use case. To illustrate the ability of rethomics to analyse pre-existing results, we gathered a subset of the data from a recent publication [35], kindly made publicly available by the authors [36].
Wild type flies are highly rhythmic in Light-Dark (LD) cycles and become arrhythmic in constant light (LL). In their study, the authors gain understanding of the function of the molecular clock by showing that overexpression of the gene NKCC makes the flies rhythmic in LL, and that the endogenous period in LL is longer than 24 hours.
Here, we guide the reader through the analysis of two of the genotypes employed in that study; one control group (NKCC ox /+, which is arrhythmic under LL) and one where NKCC ox is overexpressed in clock neurons (TIM/NKCC ox , which is rhythmic under LL). In particular, we outline the necessary steps to analyse two repetitions of the same experiment in which a total of 58 animals were recorded for three to four days in LD and then subjected to constant light for six or seven days. The metadata.csv file and all the associated result files can be downloaded on zenodo [36].
Fist, we install the rethomics packages (see the webpage for installation instructions), download the zip archive containing the raw data and extract all files into our working directory. Then, we load the necessary rethomics packages: Then, the metadata file is read and linked to the .txt result files.
Preprocessing. Since the two original replicates do not have the same baseline duration and we want to analyse them together, we align their respective times to the experimental perturbation: the transition from LD to LL (t = 0). This is achieved by subtracting the The id is a long string of characters (for instance, '2013-11-19 09:00:00| Monitor36.txt|01'), which makes it difficult to read and display as a label on a plot. To address this issue, we create our own label metavariable, as the combination of a number and genotype (e.g. '1.NKCCOX/+'). In the restricted context of this analysis, label acts as a unique identifier. Importantly, we also retain id as an unambiguous unique identifier. Indeed, two animals in separate experiments may have the same label, but different ids. In addition, if the metadata changes-for instance by the addition or removal of individuals-the label is likely to change, not the id.
Curation. It is important to visualise an overview of how each individual behaved and, if necessary, amend the metadata accordingly. For this, we generate a tile plot (Fig 3A).
# baseline subtraction À À note the use of xmv  # make a ggplot object with label on the y and moving on the z axis fig3A <À ggethoðdt; aesðy ¼ label; z ¼ movingÞÞ þ # show data as a tile plot # that is z is a pixel whose intensity maps moving stat À tile À ethoðÞ þ # add layers to draw annotations to show L and D phases # as white and black; respectively # the first layer is for the baseline ( until t ¼ 0 ) stat À ld À annotationsðx À limits ¼ cðdt½;minðtÞ�; 0ÞÞ þ # in the 2nd one; we start at 0 and use grey The activity of dead or escaped animals is scored as long series of zeros, which may be erroneously interpreted as inactivity (see, for instance, individual labelled 30 and 18 in Fig 3A). The sleepr package offers a tool to detect and remove such data. It proceeds by detecting the first time an animal is immobile for more than 99% of the time (the default) for at least time_window seconds and then discard any subsequent data.
In addition, we can trim the data to the same number of days across experiments and individuals.

# filter dt with the valid ids
Note that, as a consequence of the curation procedure, we now have 53 'valid' individuals. Double-plotted actograms. Double-plotted actograms are a common choice to visualise the periodicity and rhythmicity in circadian experiments. In S2A Fig, we show a double-plotted actogram for each animal. A selected sample of four individuals for each genotype is shown in Fig 4A. Periodograms. Ultimately, in order to quantify periodicity and rhythmicity, we compute periodograms. Several methods are implemented in zeitgebr: χ 2 , Lomb-Scargle, autocorrelation and Fourier. In this example, we generate χ 2 periodograms and lay them out in a grid. Periodograms for the subset of eight animals used in Fig 4A are   Population statistics. As shown in the original study [35], double-plotted actograms and periodograms suggest that NKCC ox /+ flies are mostly arhythmic in LL whilst Tim/NKCC ox appear to have a long-period rhythm. To visualise this difference at the population scale, we can plot an average periodogram (see Fig 5A): To further quantify this difference, we opt to show the number of rhythmic animals-i.e. individuals for which a peak was found-in each group (see Fig 5B). Then, we can compare the average value of the peak for the rhythmic animals (see Fig 5C). First of all, we compute a summary per individual (by = id):

# display periodogram
fig5A <À ggperioðper À dt; aesðy ¼ power À signif À threshold; colour ¼ genotypeÞÞ þ # periododogram shown as a line for population mean # and bootstrap error bars stat À pop À ethoðmethod ¼ ggplot2::mean À cl À bootÞ þ # rename x and y axis scale À y À continuousðname ¼ "Relative power"Þ þ scale À x À hoursð"Period"Þ summary À dt <À per À dt½; :ð first À peak À period ¼ period½peak ¼¼ 1�; # fg can be used for tmp variables first À peak À rel À power ¼ f summary_dt is a regular data frame with one row per individual, containing both metadata and our summary statistics. It can therefore be used directly by ggplot: fig5C <À ggplotðsummary À dt; aesðy ¼ first À peak À period; x ¼ genotypeÞÞ þ # draw the mean of each genotype group stat À summaryðfun:y ¼ mean; geom ¼ "point"; shape ¼ 3Þ þ # draw bootstrap confidence intervals stat À summaryðfun:data ¼ mean À cl À boot; geom ¼ "errorbar"Þ þ # shows all individuals as points # the size of the point represents the power of the peak geom À jitterðaesðcolour ¼ genotype; size ¼ first À peak À rel À powerÞ; R provides one of the richest statistical toolboxes available. Using base R we could perform a χ 2 test on the number of rhythmic vs arhythmic flies in both genotypes, or, like in this case, fit a binomial generalised linear model: The result shows a strong positive effect of genotype Tim/NKCC ox on the probability of being rhythmic (p-value 5.15 × 10 −06 ): Lastly, we can generate a table that computes arbitrary population statistics for each genotype: :ð mean À period ¼ meanðfirst À peak À period; na:rm ¼ TÞ = hoursð1Þ; sd À period ¼ sdðfirst À peak À period; na:rm ¼ TÞ = hoursð1Þ; percent À rhythmic ¼ 100 � sumðis À rhythmicÞ = :N; n À rhythmic ¼ sumðis À rhythmicÞ; by ¼ genotype � # to round all numeric columns to two digits result À dt½; lapplyð:SD; Open source software for behavioural analysis The case study described so far shows how rethomics can be employed to generate publication-quality figures and state-of-the-art statistics on pre-existing data. We were able to comprehensively analyse the data from a circadian experiment with a few lines of code, presenting a workflow that applies equally well to much larger datasets.

Multi-scale analysis of position
One of the challenges of behaviour analysis is the 'nesting' of events happening over different timescales. In other words, a behavioural variable can be modulated by the circadian rhythm, but also by co-occurring ultradian and infradian rhythms. For instance, an animal could have rhythmic bursts of locomotor activity recurring at high frequency (e.g. 1 min), but also a circadian (i.e. 24 h) regulation of the same variable, both rhythms would then happen at timescales separated by approximately three orders of magnitudes, which makes them difficult to visualise and integrate into the same analysis. Being able to keep frequency information over multiple scales is however important in some cases. In particular, this is relevant to study the frequency modulation of a rhythm by another-that is, if the periodicity of a high-frequency rhythm itself can be a function of a lower frequency one.
The problem of understanding time series at different scales is not uncommon in fields such as economics [37], climate sciences [38] and ecology [39] where variables are governed by multiple underlying rhythms (e.g. tidal, daily, yearly and multi-yearly). One approach is to study a variable of interest in the time/period domain using, for instance, continuous wavelet transform (CWT) [40]. In the context of chronobiology, CWT has been suggested as a tool to investigate ultradian rhythms [41].
To illustrate how rethomic integrates with other packages and render such non-mainstream analysis possible, we developped, as part of the zeitgebr package, utilities to perform wavelet analysis, which we then applied to the position of 80 naive fruit flies (40 females and 40 males) in their glass tubes. We used the package scopr, to load five days of ethoscope positional data, which we sampled at 0.1 Hz. Our variable of interest is the position of animals in their tube (from the food end, Position = 0, to the cotton end, Position = 1). Fig 6A-6C shows the raw position data at two different scales for two representative animals.
We implemented specific utilities to compute and display CWT spectrograms in rethomics, based on the WaveletComp package [42]. We then used this interface to average the result of the five consecutive days in the time/period domain over one circadian day both for the two representative animals ( Fig 6D) and for the population (Fig 6E).
As suggested by the slow oscillations of the mean position (Fig 6A), we observe peaks in power corresponding to high-period (12 h and 24 h) rhythms. In addition, a large amount of signal is detected for low-period (around 60 s) events-likely corresponding to the position of animals walking along (back and forward) their tubes in a very paced manner. Interestingly, in females, this low period pace appears to be frequency modulated during the L phase, suggesting a slower walking speed around ZT6 h. In contrast, males display only a high-frequency rhythm around the phase transitions (L to D and D to L). Surprisingly, the peak of high-frequency rhythm implies a faster pace in males (approximately 60 s) than in females (approximately 120 s)-indicating that, when active, males walk faster than females. Average spectrogram across 40 males and 40 females. In D and E, the lines on the right show the marginal power spectra corresponding to the spectrograms shownthat is the average across 12 hours in either the light (white bar) or dark (black bar) phase. The male data was collected and described in our previous study [21] (controls in Fig 5M-5P) and the females data was acquired in parallel, in the same experimental conditions, but not previously published. https://doi.org/10.1371/journal.pone.0209331.g006 Open source software for behavioural analysis This non-exhaustive proof of principle illustrates how analysis of behavioural data can be taken further by adapting the wide range of numerical tools already available in the R ecosystem.

Discussion
The framework we presented in this article describes a general computational set of tools to store, manipulate and analyse behavioural data. As a practical case study, in this article, we showed how rethomics can readily be used to analyse circadian and sleep data. As such, it already fills a clear gap in the sleep and circadian community by providing an open, extensible and programmatic toolbox as opposed to a rigid array of GUI tools which cannot interoperate. In particular, within the R environment, rethomic features a streamlined workflow (Fig 1): data loading; quality control ( Fig 3); per-individual visualisation ( Fig 4A) and computation ( Fig 4B); group averages and statistical analysis (Fig 5), ultimately resulting in publicationquality figures. In addition to the canonical analysis we presented in this study, the richness of the R environment makes rethomics particularly interesting for exploratory work: as a case in point, we showed here the use of continuous wavelet transform to study the modulation of walking pace (Fig 6). Altogether, rethomics will allow the community to share solutions for the analysis and visualisation of behavioural data originated from a plethora of recording equipment. By design, rethomics is meant to be flexible and certainly not limited to the study of circadian rhythms. In fact, behavr tables (Fig 2) are a general structure to link data to metadata that come with no assumption regarding the type and number of variables stored. Therefore, we anticipate it will apply to a variety of contexts and scales. In the last decade, multiple devices have been developed-either commercially or academically-to record longitudinal data on several animal models (i.e. feeding, metabolic activity, heart rate, position and so on). All these devices and behavioural paradigm could take advantage of the skeletal structure in rethomics and lead to the development of device-specific packages. Such an approach becomes particularly useful in order to compare methods or study behaviours that are recorded by different devices on the same individual animals.
Supporting information S1 Table. List of packages and main functionalities implemented in rethomics. The last column shows a list of user-space functions for each role. Extensive list of all functions is available in the documentation of each package. (PDF) S1 Fig. Non-exhaustive list of uses of a behavr table (referred as dt). In addition to operations on data, which are inherited from data.table, we provide utilities designed specifically to act on both metadata and data. Jones, Diana Bicazan and Florencia Fernandez-Chiappe for their comments as early users; Marcus Ghosh and Tara Kane for their feedback on the manuscript; Brenna Williams, for her help to support multi-beam DAMS; Patrick Krätschmer, for his time discussing the conceptual framework.