The hair cell analysis toolbox is a precise and fully automated pipeline for whole cochlea hair cell quantification

Our sense of hearing is mediated by sensory hair cells, precisely arranged and highly specialized cells subdivided into outer hair cells (OHCs) and inner hair cells (IHCs). Light microscopy tools allow for imaging of auditory hair cells along the full length of the cochlea, often yielding more data than is feasible to analyze manually. Currently, there are no widely applicable tools for fast, unsupervised, unbiased, and comprehensive image analysis of auditory hair cells that work well with imaging datasets containing either an entire cochlea or smaller sampled regions. Here, we present a highly accurate machine learning-based hair cell analysis toolbox (HCAT) for the comprehensive analysis of whole cochleae (or smaller regions of interest) across light microscopy imaging modalities and species. HCAT is software that automates common image analysis tasks such as counting hair cells, classifying them by subtype (IHCs versus OHCs), determining their best frequency based on their location along the cochlea, and generating cochleograms. These automated tools remove a considerable barrier in cochlear image analysis, allowing for faster, unbiased, and more comprehensive data analysis practices. Furthermore, HCAT can serve as a template for deep learning-based detection tasks in other types of biological tissue: with some training data, HCAT’s core codebase can be trained to develop a custom deep learning detection model for any object in an image.


Introduction
The cochlea is the organ in the inner ear responsible for the detection of sound. It is tonotopically organized in an ascending spiral, with mechanosensitive sensory cells responding to high-frequency sounds at its base and low-frequency sounds at the apex. These mechanically sensitive cells of the cochlea, known as hair cells, are classified into two functional subtypes: outer hair cells (OHCs) that amplify sound vibrations and inner hair cells (IHCs) that convert these vibrations into neural signals [1]. Each hair cell carries a bundle of actin-rich microvillus-like protrusions called stereocilia. Hair cells are regularly organized into one row of IHCs

Analysis pipeline
HCAT combines a deep learning algorithm, which has been trained to detect and classify cochlear hair cells, with a novel procedure for cell frequency estimation to extract information from cochlear imaging datasets quickly and in a fully automated fashion. An overview of the analysis pipeline is shown in Fig 1. The model accepts common image formats (tif, png, and jpeg), in which the order of the fluorescence channels within the images, or their assigned color, does not affect the outcome. Multi-page tif images are automatically converted to a 2D maximum intensity projection. When working with large confocal micrographs, HCAT analyzes small crops of the image and subsequently merges the results to form a contiguous detection dataset. These cropped regions are set to have 10% overlap along all edges, ensuring that each cell is fully represented at least once. Regions that do not contain any fluorescence above a certain threshold may be optionally skipped, increasing the speed of large image analysis while limiting false positive errors. When the entire cochlea is contained as a contiguous piece (Fig 1A), which is common for neonatal cochlear histology, HCAT will estimate the cochlear path and each cell will be assigned a best frequency. Following cell detection and best frequency estimation, HCAT performs two post processing steps to refine the output and improve overall accuracy. First, cells detected multiple times are identified and removed based on a user-defined bounding box overlap threshold, set to 30% by default. The second step, optional and only applicable for whole cochlear coil analysis, removes cells too far from the estimated cochlear path, reducing false positive detections in datasets with suboptimal anti-MYO7A labeling outcomes, such as high background fluorescence levels or instances of nonspecific labeling away from the organ of Corti. As outlined below, for each detection analysis HCAT outputs diagnostic images with overlaid cell-specific data, in addition to an associated CSV data table, enabling further data analysis or downstream post processing, and, when applicable, automatically generates cochleograms. HCAT is computationally efficient and can execute detection analysis on a whole cochlea on a timescale vastly faster than manual analysis, regularly completing in under 90 s when utilizing GPU acceleration on affordable computational hardware. HCAT is available in two user interfaces: (1) a command line interface that offers full functionality, including cell frequency estimation and batch processing of multiple images or image stacks across multiple folders; and (2) a graphical user interface (GUI), which is user-friendly and is optimized for analysis of individual or multiple images contained within a single folder. The GUI cannot infer a cell's best frequency and is suited to analysis of small regions of the cochlea.

Fig 1. The entire cochlea is then used to infer each cell's best frequency along the cochlear coil. First, all suprathreshold anti-MYO7A-positive pixels are converted to polar coordinates (D) and fit by the Gaussian process nonlinear curve fitting algorithm (E). The resulting curve is converted back to cartesian coordinates and the resulting line is converted to frequency by the Greenwood function; the apical end of the cochlea (teal circle) is inferred by the region of greatest curl (F), and the opposite end of the cochlea is assigned as the basal end (red circle). Cells are then assigned a best frequency based on their position along the predicted curve, and cochleograms (G) are generated in a fully automated way for each cell type (IHCs and OHCs), with a bin size by default set to 1% of the total cochlear length. HCAT, hair cell analysis toolbox; IHC, inner hair cell; OHC, outer hair cell. https://doi.org/10.1371/journal.pbio.3002041.g001

PLOS BIOLOGY
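The tiling of a large micrograph into overlapping crops, as described above, can be sketched as follows. This is a minimal illustration and not HCAT's own code: the function names and the 256 px crop size are hypothetical, and only the 10% overlap comes from the text. Note that a cell is guaranteed to appear whole in at least one crop only if it is smaller than the overlap region.

```python
def crop_origins(length, crop, overlap=0.10):
    """Start offsets of crops of size `crop` that cover [0, length).

    Consecutive crops share at least `overlap` (as a fraction of `crop`).
    """
    if length <= crop:
        return [0]
    stride = max(1, int(crop * (1.0 - overlap)))
    origins = list(range(0, length - crop, stride))
    origins.append(length - crop)  # final crop sits flush with the far edge
    return origins


def crop_windows(height, width, crop=256, overlap=0.10):
    """Return (y0, x0, y1, x1) windows tiling an image of the given size."""
    return [
        (y, x, min(y + crop, height), min(x + crop, width))
        for y in crop_origins(height, crop, overlap)
        for x in crop_origins(width, crop, overlap)
    ]


if __name__ == "__main__":
    for window in crop_windows(300, 1000):
        print(window)
```

Detections from each window would then be shifted by the window's (y0, x0) offset before merging, which is why duplicate detections in the overlap regions need to be removed afterwards.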

Detection and classification
To perform cell detection, we leverage the Faster R-CNN [9] deep learning algorithm with a ConvNext [13] backbone trained on a varied dataset of cochlear hair cells from multiple species, at different ages, and from different experimental conditions (Table 1, Fig 2). Most of the hair cells used to train the detection model were stained with two markers: (1) anti-MYO7A, a hair cell specific cell body marker; and (2) the actin label, phalloidin, to visualize the stereocilia bundle. Bounding boxes for each cell along with class identification labels were manually generated to serve as the ground truth reference by which we trained the detection model (Fig 2). Boxes were centered around stereocilia bundles and included the hair cell cuticular plate, as these were determined to be the most robust features per cell in a maximum intensity projection image. The trained Faster R-CNN model predicts three features for each detected cell: a bounding box, a classification label (IHC or OHC), and a confidence score (Fig 3).
To limit false positive detections, cells predicted by Faster R-CNN can be rejected based on their confidence score or their overlap with another detection through an algorithm called non-maximum suppression (NMS). To find optimal values for the confidence and overlap thresholds, we performed a grid search in which we assessed model performance at each combination of values and selected the values leading to the most accurate model performance (Figs 3E-3G and S1).
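The two rejection criteria described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not HCAT's implementation: the dictionary layout, the function names, and the 0.5 score cutoff are assumptions, and measuring overlap as intersection over union (IoU) is also an assumption; the 0.3 overlap value simply mirrors the 30% default quoted in the text.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, min_score=0.5, max_iou=0.3):
    """Drop low-confidence detections, then apply greedy non-maximum suppression.

    `detections` is a list of dicts with 'box', 'label', and 'score' keys
    (a hypothetical layout chosen for this example).
    """
    confident = [d for d in detections if d["score"] >= min_score]
    kept = []
    # Visit detections from most to least confident; keep one only if it
    # does not overlap too strongly with any detection already kept.
    for det in sorted(confident, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) <= max_iou for k in kept):
            kept.append(det)
    return kept
```

In a PyTorch pipeline the same suppression step is usually delegated to a library routine such as `torchvision.ops.nms`; the explicit loop is shown here only to make the logic visible.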
The trained Faster R-CNN detection algorithm performs best on maximum intensity projections of 3D confocal z-stacks of hair cells labeled with a cell body stain (such as anti-MYO7A) and a hair bundle stain (such as phalloidin), imaged at an X-Y resolution of approximately 290 nm/px (Fig 4D and 4E). However, the model can perform well with combinations of other markers, including antibody labeling against ESPN, Calbindin, Calcineurin, p-AMPKα, as well as following FM1-43 dye loading. HCAT can accurately detect cells in healthy and pathologic cochlear samples, collected within a range of imaging modalities, resolutions, and signal-to-noise ratios. While the pixel resolution requirements for the imaging data are not very demanding, imaging artifacts and low fluorescence signal intensity can limit detection accuracy. Although there is one row of IHCs and three rows of OHCs in most cochlear samples, there are rare instances where two rows of IHCs or four rows of OHCs can be seen in normal cochlear samples; the algorithm is robust and largely accurate in such instances (Fig 4D).

Cochlear path determination
For images containing an entire contiguous cochlear coil, HCAT can additionally predict each cell's best frequency via automated cochlear path determination. To do this, HCAT fits a Gaussian process nonlinear regression [14] through the ribbon of anti-MYO7A-positive pixels, effectively treating each hair cell as a point in cartesian space. A line of best fit can then be predicted through the hair cells, approximating the curvature of the cochlea. The length of this curve is then used as an approximation for the length of the cochlear coil. For example, a cell that is 20% along the length of this curve could be interpreted as one positioned at 20% along the length of the cochlea, assuming the entire cochlear coil was imaged.
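The idea of expressing a cell's position as a fraction of the fitted curve's length can be illustrated with a short sketch. It assumes the fitted curve is available as a densely sampled polyline and assigns each cell to its nearest vertex; both the function names and the nearest-vertex simplification are assumptions made for this example, not a description of HCAT's code.

```python
import math


def cumulative_length(path):
    """Running arc length at each vertex of a polyline [(x, y), ...]."""
    lengths = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        lengths.append(lengths[-1] + math.hypot(x1 - x0, y1 - y0))
    return lengths


def fractional_position(cell, path):
    """Fraction (0 to 1) of the path length at the vertex nearest to `cell`."""
    lengths = cumulative_length(path)
    nearest = min(range(len(path)), key=lambda i: math.dist(cell, path[i]))
    return lengths[nearest] / lengths[-1]


if __name__ == "__main__":
    # A straight 100-unit path; a cell sitting just off the path at x = 20
    # is reported as lying 20% of the way along it.
    path = [(float(i), 0.0) for i in range(101)]
    print(fractional_position((20.0, 3.0), path))
```

Which end of the path counts as 0% depends on the ordering of the vertices, so the apex and base must be identified before the fraction is interpreted.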
To optimally perform the initial regression, individual cell detections are rasterized and then downsampled by a factor of ten using local averaging (increasing the execution speed of this step), then converted to a binary image. Next, a binary hole closing operation is used to close any gaps, and subsequent binary erosion is used to reduce the effect of nonspecific staining. Each positive binary pixel of the resulting 2D image is then treated as an X/Y coordinate that may be regressed against (Fig 1D). The resulting set of points is unlikely to form a mathematical function in cartesian space, as the cochlea may curve over itself such that for a single location on the X axis, there may be multiple clusters of cells at different Y values. To rectify this overlap, the data points are converted from cartesian to polar coordinates by shifting the points and centering the cochlear spiral around the origin, then converting each X/Y coordinate to a corresponding angle/radius coordinate. As the cochlea is not a closed loop, the resulting curve will have a gap; the algorithm detects this gap and shifts the points on one side of it by one period, creating a continuous function. A Gaussian process [14], a generalized nonlinear function, is then fit to the polar coordinates and a line of best fit is predicted. This line is then converted back to cartesian coordinates and scaled up to correct for the earlier down-sampling (Fig 1E).
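The cartesian-to-polar step, including the one-period shift across the gap, can be sketched as below. This is an illustration under stated assumptions rather than HCAT's implementation: the spiral is centered on the centroid of the points, the gap is taken to be the widest angular interval containing no points, and the coil is assumed to wind less than one full turn around that center. The function name is hypothetical.

```python
import math


def to_unwrapped_polar(points):
    """Convert (x, y) points to (angle, radius) about their centroid.

    Angles on one side of the widest empty angular interval are shifted by
    one period (2*pi) so that the open coil maps to one continuous range.
    """
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    polar = [
        (math.atan2(y - cy, x - cx), math.hypot(x - cx, y - cy))
        for x, y in points
    ]
    angles = sorted(a for a, _ in polar)
    # Gaps between consecutive sorted angles; the last entry is the gap that
    # wraps around the +/- pi seam of atan2.
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(angles[0] + 2 * math.pi - angles[-1])
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    if widest == len(angles) - 1:
        return polar  # the gap already coincides with the seam
    cut = angles[widest]  # last angle before the gap
    return [(a + 2 * math.pi if a <= cut else a, r) for a, r in polar]
```

After this transform the radius can be regressed against the angle as an ordinary single-valued function, which is what makes a curve fit such as a Gaussian process regression applicable.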
The apex of the cochlea is then inferred by comparing the curvature at each end of the line of best fit, based on the observation that the apex has a tighter curl when mounted on a slide. The resulting curve closely tracks the hair cells on the image. Next, the curve's length is measured, and each detected cell is then mapped to it as a function of the total cochlear length (%). Each cell's best frequency is calculated using the Greenwood function, a species-specific method of determining a cell's best frequency from its cochlear position [15] (Fig 1F). Upon completion of this analysis, the automated frequency assignment tool generates two cochleograms, one for IHCs and one for OHCs (Fig 1G).
To validate this method of best frequency assignment, we compared it to the existing standard in the field: manual frequency estimation. We manually mapped the cochlear length to cochlear frequency using a widely used ImageJ plugin, developed by the Histology Core at the Eaton-Peabody Laboratories (Mass Eye and Ear), and compared the result to the frequencies predicted by our automatic tool (Figs 4G and S1). Over 8 manually analyzed cochleae, the maximum error of the automated best frequency relative to the manually mapped best frequency was under 15% of an octave, with the discrepancy between the two methods less than 5% of an octave (60% of a semitone) for most cells. In one cochlea, the overall cochlear path was predicted to be shorter than manually assigned, due to the threshold settings of the MYO7A fluorescence channel, causing an error at very low and very high frequencies (Fig 4G, dark blue). While this error was less than 15% of an octave, it is an outlier in the dataset. It is recommended, when using this tool, to evaluate the automated cochlear path estimation, and if poor, perform manual curve annotation to facilitate best frequency assignment. If required, the user is also able to switch the designation of automatically detected points representing the apical and basal ends of the cochlear coil (Fig 1F, red and cyan circles).

Fig 4. A validation output image is generated for each detection analysis performed by the software. An image is automatically generated by the software similar to the one shown here for a dataset that includes an entire cochlea (A), with the vast majority of cells accurately detected (B). For each image, the model embeds information on each cell's ID, its location along the cochlear coil (distance in μm from the apex), its best frequency, cell classification (IHC as yellow squares, OHC as green circles), and the line that represents the tool's cochlear path estimation ((C), blue line). The very few examples of poor performance are highlighted in (D) and (E) (arrowheads point to three missed IHCs and two OHCs). A set of cochleograms reporting cell counts per every 1% of total cochlear length, generated with manual cell counts and frequency assignment (gray), closely agrees with an HCAT-predicted cochleogram (red) generated in a fully automated fashion (F). HCAT is accurate along the entire length of the cochlea (G), as evident by assessing the accuracy with a bin size of 10% of cochlear length. To assess the accuracy of the tool's best frequency assignment, the magnitude difference between every cell's best frequency calculated manually, and automatically, with respect to frequency for eight different cochleae is at maximum 15% of an octave across all frequencies (H). Each color represents one cochlea. HCAT, hair cell analysis toolbox; IHC, inner hair cell; OHC, outer hair cell. https://doi.org/10.1371/journal.pbio.3002041.g004
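The position-to-frequency mapping and the octave-based error measure used in this section can be made concrete with a short sketch. The Greenwood function has the general form F = A(10^(ax) − k), where x is the fractional distance from the apex. The constants below are the commonly cited human values and are used here purely as an assumption for illustration; the source does not state which species-specific constants HCAT applies, and the function names are hypothetical.

```python
import math

# Commonly cited human constants for the Greenwood function. These are an
# assumption for illustration only, not the values used by HCAT.
HUMAN = {"A": 165.4, "a": 2.1, "k": 0.88}


def greenwood(fraction_from_apex, A, a, k):
    """Best frequency (Hz) at a fractional distance (0 to 1) from the apex."""
    return A * (10 ** (a * fraction_from_apex) - k)


def octave_error(f_auto, f_manual):
    """Absolute difference between two frequencies, expressed in octaves."""
    return abs(math.log2(f_auto / f_manual))


if __name__ == "__main__":
    f_manual = greenwood(0.50, **HUMAN)
    f_auto = greenwood(0.51, **HUMAN)  # a 1% disagreement in position
    err = octave_error(f_auto, f_manual)
    print(f"{err:.3f} octaves = {12 * err:.2f} semitones")
```

Because an octave spans 12 semitones, an error of 5% of an octave corresponds to 0.6 semitones, which is the conversion quoted in the text.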

Performance
Overall, cochleograms generated with HCAT track remarkably well to those generated manually (Fig 4F). Comparing HCAT to manually annotated cochlear coils (not used to train the model), we report a 98.6 ± 0.005% true positive accuracy for cell identification and a <0.01% classification error (8 cochlear coils, 4,428 IHCs and 15,754 OHCs; S1 Fig). We found no bias in accuracy with respect to estimated best frequency. To assess HCAT performance on a diverse set of cochlear micrographs, we sampled 88 images from 15 publications [16-30] that represent a wide variety of experimental conditions, including ototoxic treatment using aminoglycosides, genetic manipulations that could affect the hair cell anatomy, noise exposure, blast trauma, and age-related hearing loss (Table 2). We performed a manual quantification and automated detection analysis of these images after they were histogram-adjusted and scaled via the HCAT GUI for optimal accuracy. HCAT achieved an overall OHC detection accuracy of 98.6 ± 0.5% and an IHC detection accuracy of 96.9 ± 2.8% for 3,545 OHCs and 1,110 IHCs, with a mean error of 0.34 OHC and 0.32 IHC per image. Of the 88 images we used for this validation, no errors were detected on 62 of them, and HCAT was equally accurate in images of low and high absolute cell count (Fig 5). The multi-piece cochleogram generation workflow is shown in S3 Fig.

Validation on published datasets
We further evaluated HCAT on whole, external datasets (generously provided by the Cunningham [31] and the Richardson and Kros [7] laboratories) and replicated analyses from their publications. Each dataset presented examples of organ of Corti epithelia treated with ototoxic compounds resulting in varying degrees of hair cell loss. The two datasets complement each other in several ways, covering most use cases of data analysis needs following ototoxic drug use in the organ of Corti to assess hair cell survival: in vivo versus in vitro drug application, confocal fluorescence versus widefield fluorescence microscopy imaging, and early postnatal versus adult organ of Corti imaging. HCAT succeeded in quantifying the respective datasets in a fully automated fashion with an accuracy sufficient to replicate the main finding in each study (Fig 6), underestimating the total number of cells for Gersten and colleagues, 2020 by 7.3%, and overestimating the total number of cells from Kenyon and colleagues, 2021 by 1.47%. It is worth noting that these datasets were collected without optimization for an automated analysis. Thus, we expect an even higher performance accuracy with an experimental design optimized for HCAT-based automated analysis.

Fig 6. … [7], generously provided by the Richardson and Kros laboratories. Images were collected using epifluorescence microscopy, following a 48-h incubation in either 0 μM gentamicin (Control), 5 μM gentamicin, or 5 μM gentamicin + 50 μM test compound UoS-7692. Each symbol represents the number of OHCs in a mid-basal region from 1 early postnatal in vitro cultured cochlea [7]. One-way ANOVA with Tukey's multiple comparison tests. ***, p < 0.001; ns, not significant. In some cases, HCAT detections overestimated the total number of surviving hair cells in the gentamicin-treated tissue. However, overall, the software-generated results are in agreement with those of the original study, drawing the same conclusion. (B) Original imaging data of adult mouse cochleae, underlying the finding in Fig 7A-B in Gersten and colleagues, 2020 [31] were generously provided by the Cunningham laboratory. In this study, mice were treated by in vivo application of clinically proportional levels of ototoxic compounds, Cisplatin, Carboplatin, Oxaliplatin, and Saline (control), in an intraperitoneally cyclic delivery protocol [31]. Regions of interest were imaged at the base, middle, and apex of each cochlea. HCAT's automated detections were comparable to manual quantification and were sufficient to draw a conclusion that is consistent with the original publication. Upon comparison, HCAT had higher detection accuracy in OHCs, compared to IHCs, likely due to the variability of the MYO7A fluorescence intensity levels in IHCs across the dataset. HCAT, hair cell analysis toolbox; IHC, inner hair cell; OHC, outer hair cell. https://doi.org/10.1371/journal.pbio.3002041.g006

Discussion
Here, we present the first fully automated cochlear hair cell analysis pipeline for analyzing multiple micrographs of cochleae, quickly detecting and classifying hair cells. HCAT can analyze whole cochleae or individual regions and can be easily integrated into existing experimental workflows.
While there were previous attempts at automating this analysis, each was too limited in its use to achieve widespread application [3,5]. HCAT allows for unbiased, automated hair cell analysis with detection accuracy levels approaching that of human experts at a speed so much faster that it is desirable even with rare errors. Furthermore, we validate HCAT on data from various laboratories and find it is accurate across different imaging modalities, staining, age, and species. Deep learning-based detection infers information from the pixels of an image to make decisions about what objects are and where they are located. To this end, the information is devoid of any context. HCAT's deep learning detection model was trained largely using anti-MYO7A and phalloidin labels; however, the model can perform on specimens labeled with other markers, as long as they are visually similar to examples in our training data. For example, some of the validation images of cochlear hair cells sampled from published figures contained a cell body label other than MYO7A, such as Calbindin [16,32], Calcineurin [20,33], and p-AMPKα [34], while in other images, phalloidin staining of the stereocilia bundle was substituted by anti-espin [35] labeling. Although no images containing hair cell-specific nuclear markers, such as pou4f3 [36], were included in the pool of training data, HCAT performed reasonably well when tested on such images, especially when they also contained a bundle stain. Of higher importance is the quality of the imaging data: proper focus adjustment, high signal-to-noise ratio, image resolution, and adequately adjusted brightness and contrast settings. Furthermore, the quality of the training dataset greatly affects model performance; upon validation, HCAT performed slightly worse when evaluated on community provided datasets due to fewer representative examples within the pool of our training data.
We will strive to periodically update our published model when new data arise, further improving performance over time. At present, HCAT has proven to be sufficiently accurate to consistently replicate major findings even with occasional discrepancies to a manual analysis, even when used on datasets that were collected without any optimization for automated analysis. The strength of this software is in automation, allowing for processing thousands of hair cells over the entire cochlear coil without human input. Recent advancements in tissue-clearing techniques enable the acquisition of the intact 3D architecture of the cochlear coil using confocal or two-photon laser scanning microscopy, allowing for future development of the HCAT tool as the wealth of such imaging data is made available to the public. Although no tissue-cleared data were used to develop HCAT, we tested it on a few published examples of tissue-cleared mouse and pig cochlear imaging data [3,4]. While HCAT showed reasonable hair cell detection rates, the tool was unable to perform as accurately as we report for high-resolution confocal imaging data, most likely because the tissue-cleared datasets were collected at lower resolution (0.65 to 0.99 μm/px) and contained only anti-MYO7A fluorescence.
It is common for the population of missing cells, rather than absolute counts, to be reported in cell survival studies. We were unable to support missing cell detection or quantification in HCAT: we found that there was insufficient robust information on the locations of missing cells to automate their detection consistently and accurately. In some cases, a distinctive "X-shaped" phalangeal scar may be seen in the sensory epithelium following hair cell loss [37,38] that may be sufficient to determine the presence of a missing cell; however, this is often visible with an actin stain or on scanning electron microscopy images, and not so in the other pathologic cases HCAT attempts to support. While the detection model was trained and cochlear path estimation designed specifically for cochlear tissue, HCAT can serve as a template for deep learning-based detection tasks in other types of biological tissue in the future. While developing HCAT, we employed best practices in model training, data annotation, and augmentation. With minimal adjustment and a small amount of training data, one could adapt the core codebase of HCAT to train and apply a custom deep learning detection model for any object in an image.
To our knowledge, this is the first whole cochlear analysis pipeline capable of accurately and quickly detecting and classifying cochlear hair cells. HCAT enables expedited cochlear imaging data analysis while maintaining high accuracy. This highly accurate and unsupervised data analysis approach will both facilitate ease of research and improve experimental rigor in the field.

Preparation and imaging of in-house training data
Organs of Corti were dissected in one contiguous piece at P5 in Leibovitz's L-15 culture medium (21083-027, Thermo Fisher Scientific) and fixed in 4% formaldehyde for 1 h. The samples were permeabilized with 0.2% Triton-X for 30 min and blocked with 10% goat serum in calcium-free HBSS for 2 h. To visualize the hair cells, samples were labeled with an anti-Myosin 7A antibody (#25-6790 Proteus Biosciences, 1:400) and goat anti-rabbit CF568 (Biotium) secondary antibody. Additionally, samples were labeled with Phalloidin to visualize actin filaments (Biotium CF640R Phalloidin). Samples were then flattened into one turn, mounted on slides using ProLong Diamond Antifade Mounting kit (P36965, Thermo Fisher Scientific), and imaged with a Leica SP8 confocal microscope (Leica Microsystems) using a 63×, 1.3 NA objective. Confocal Z-stacks of 512 × 512 pixel images with an effective pixel size of 288 nm were collected using the tiling functionality of the Leica LASX acquisition software and maximum intensity projected to form 2D images. All experiments were carried out in compliance with ethical regulations and approved by the Animal Care Committee of Massachusetts Eye and Ear.

Training data
Varied data are required for the training of generalizable deep learning models. In addition to imaging data collected in our lab, we sourced generous contributions from the larger hearing research community from previously reported [7,31,39-46], and in some cases unpublished, studies. Bounding boxes for hair cells seen in maximum intensity projected z-stacks were manually annotated using the labelImg [47] software and saved as an XML file. For whole cochlear cell annotation, a "human in the loop" approach was taken, first evaluating the deep learning model on the entire cochlea, visually inspecting it, then manually correcting errors. Our dataset contained examples from three different species, multiple ages, microscopy types, and experimental conditions. Only the images generated in-house contain an entire, intact cochlea. A summary of our training data is presented in Table 1.

Training procedure
The deep learning architectures were trained with the AdamW [48] optimizer with a learning rate starting at 1 × 10⁻⁴ and decaying based on cosine annealing with warm restarts with a period of 10,000 epochs. In cases with a small number of training images, deep learning models tend to fail to generalize and instead "memorize" the training data. To avoid this, we made heavy use of image transformations that randomly add variability to the original set of training images and synthetically increase the variety of our training datasets [49] (S2 Fig).
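The learning rate schedule named above follows a simple closed form, sketched here in plain Python. The starting rate and period are taken from the text; the minimum rate of zero and the function name are assumptions, since the source does not give them.

```python
import math


def cosine_warm_restart_lr(epoch, base_lr=1e-4, period=10_000, min_lr=0.0):
    """Learning rate under cosine annealing with warm restarts.

    Within each period the rate decays from `base_lr` towards `min_lr` along
    half a cosine wave, then jumps back to `base_lr` at the next restart.
    """
    t = epoch % period  # position within the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / period))


if __name__ == "__main__":
    for epoch in (0, 2_500, 5_000, 9_999, 10_000):
        print(epoch, cosine_warm_restart_lr(epoch))
```

In PyTorch the same schedule is available as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`, where the period corresponds to the `T_0` argument and the floor to `eta_min`.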

Hyperparameter optimization
Eight manually annotated cochleae were evaluated with the Faster R-CNN detection algorithm without either rejection method (via detection confidence or non-maximum suppression). A grid search was performed by breaking each threshold value into 100 steps from zero to one, and each combination was applied to the resulting cell detections, reducing their number, before calculating the true positive (TP), false negative (FN), and false positive (FP) rates (S1D and S1E Fig). An accuracy metric of the TP rate minus both the FN and FP rates was calculated and averaged over the cochleae. The combination of values that produced the highest accuracy metric was then chosen as the default for the HCAT algorithm.
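The threshold grid search can be sketched as follows. The scoring rule used here is the true positive rate penalized by the false negative and false positive rates, as stated in the S1 Fig caption. The `evaluate` callback is hypothetical: it stands in for whatever routine applies a pair of thresholds to the raw detections and compares the survivors against the manual annotations.

```python
def grid_search(evaluate, steps=100):
    """Exhaustively search confidence and NMS thresholds on a regular grid.

    `evaluate(confidence, nms)` must return (tp_rate, fn_rate, fp_rate).
    Returns (best_score, confidence, nms), where the score is TP - FN - FP.
    """
    best = None
    for i in range(steps + 1):
        for j in range(steps + 1):
            confidence, nms = i / steps, j / steps
            tp, fn, fp = evaluate(confidence, nms)
            score = tp - fn - fp
            if best is None or score > best[0]:
                best = (score, confidence, nms)
    return best


if __name__ == "__main__":
    # Synthetic stand-in for a real evaluation: performance peaks at
    # confidence = 0.6 and nms = 0.3 and falls off smoothly around it.
    def toy_evaluate(confidence, nms):
        tp = 1.0 - (confidence - 0.6) ** 2 - (nms - 0.3) ** 2
        return tp, 1.0 - tp, 0.0

    print(grid_search(toy_evaluate))
```

With 100 steps per axis the search evaluates 101 × 101 threshold pairs including both endpoints, which is cheap because the detector itself only needs to run once per cochlea.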

Computational environment
HCAT is operating system agnostic and requires at least 8 GB of system memory; an NVIDIA GPU with at least 8 GB of video memory is needed only for optional GPU acceleration. All scripts were run on an analysis computer running Ubuntu 20.04.1 LTS, an open-source Linux distribution from Canonical based on Debian. The workstation was equipped with two NVIDIA A6000 graphics cards for a total of 96 GB of video memory. Many scripts were custom written in Python 3.9 using open-source scientific computation libraries including numpy [50], matplotlib, and scikit-learn [51]. All deep learning architectures, training logic, and much of the data transformation pipeline were written in PyTorch [52], making heavy use of the torchvision [52] library.

S1 Fig. Validation of hair cell detection analysis and location estimation.
Whole cochlear turns (A) were manually annotated and evaluated with the HCAT detection analysis pipeline. Each analysis generated cochleograms (B), reporting the "ground truth" result obtained from manual segmentation (dark lines) superimposed onto the cochleogram generated from hair cells detected by the HCAT analysis (light lines). The best frequency estimation error was calculated as an octave difference of predicted best frequency for every hair cell versus their manually assigned frequency using the ImageJ plugin (C). Optimal cell detection and non-maximum suppression thresholds were discerned via a grid search by maximizing the true positive rate penalized by the false positive and false negative rates (D). Black lines on the curves (E) denote the optimal hyperparameter value.
HCAT (B). First, each piece must have its curvature annotated manually in ImageJ from base to apex (C) using the EPL cochlea frequency ImageJ plugin. Then, these annotations and images are passed one-at-a-time to the HCAT command line interface. This will generate a CSV for each file, which are then manually compiled (E). This allows for the generation of a complete cochleogram from a multi-piece dissection (F). (EPS)
S1 Data. A compressed folder with spreadsheets containing, in separate files, the underlying numerical data and statistical analysis for Figs 1G, 3E, 3F, 3G, 4F, 4H, 5A, 5B, 5C, 6A, 6B, S1B, S1C, S1D, S1E, S1F. (ZIP)