Strategies of data layout and cache writing for input-output optimization in high performance scientific computing: Applications to the forward electrocardiographic problem

Input-output (I/O) optimization at the low-level design of data layout on disk drastically impacts the efficiency of high performance computing (HPC) applications. However, such a low-level optimization is in general challenging, especially when using popular scientific file formats designed with an emphasis on portability and flexibility. To reconcile these two aspects, we present a novel low-level data layout for HPC applications, fully independent of the number of dimensions in the dataset. The new data layout improves reading and writing efficiency in large HPC applications using many processors, and in particular during parallel post-processing. Furthermore, its combination with a cached write mode, in order to aggregate multiple writes into larger ones, substantially decreased the writing times of the proposed strategy. When applied to our simulation framework for the forward calculation of the human electrocardiogram, the combined strategy resulted in drastic improvements in I/O performance, of up to 40% in writing and 93–98% in reading for post-processing tasks. Given the generality of the proposed strategies and scientific file formats used, our results may represent significant improvements in I/O performance of HPC applications across multiple disciplines, reducing execution and post-processing times and leading to a more efficient use of HPC resource envelopes.


Introduction
The optimization of high performance computing (HPC) codes is an area of active research, underpinning a continuous and cost-effective development of both established and emergent industrial and scientific sectors. As a representative example, the progress we are experiencing in computational medicine based on HPC applications is allowing the translation of mathematical models of physiological systems such as the heart to biomedical research and clinical practice. Within the field of cardiac electrophysiology, these include investigations on multiscale mechanisms of disease and lethal arrhythmias [1,2], drug action [3,4] [5,6], causes of inter-patient variability in response to treatment [7,8], the role of myocardial structure in modulating heart function [9,10], identification of novel biomarkers for clinical diagnosis [11,12], and the stratification of patients at high risk of sudden cardiac death [13], among others. In scientific computing, substantial efforts to improve HPC performance are frequently placed on the optimization of the numerical solution of the underlying physical models. As in other disciplines, in cardiac electrophysiology this involves the development of strongly scalable solvers [14,15], improved problem-specific preconditioners [16,17], temporal and/or spatial adaptivity [18,19], higher-order numerical schemes [20,21], or the use of reduced models to alleviate model complexity [22,23]. Additional areas of active HPC performance improvements include multithreading, load-balance, compiler optimization, or optimization at the application and operating system levels. Critically, scientific applications in large HPC systems often read and write vast amounts of data. However, much less attention is given in general to the input-output (I/O) optimization of these codes, frequently assumed as an inevitable burden with little scope for improvement, leaving I/O as a challenging factor in the overall performance of HPC applications [24].
Ideally, the first stage of I/O optimization in HPC applications should take place at the lowlevel design of the output structure, based on the most frequent access patterns to data. This is, however, a laborious task, in particular when using popular scientific file formats, designed with a focus on portability and flexibility (such as the HDF5 file format considered here [25]). To circumvent this complexity, the use of higher level analysis tools is usually preferred for HPC I/O optimization [26][27][28][29], commonly based on the profiling of communication, latencies and computation overheads in parallel applications. Other diagnostic tools also provide comprehensive summaries of data access patterns [30][31][32], which can then be used in later stages of I/O optimization. Additional middleware file formats with improved write and read performance have also been developed [33], although their acceptance in scientific applications still remains low.
To simplify such a delicate crafting process of low-level I/O optimization, in this work we present a general algorithm for the automatic design of data layout, solely based on the size of the dataset and one additional parameter, the target chunk size in bytes. When combined with a cached write mode, the new algorithm (applied to our simulation framework for the forward calculation of the human electrocardiogram) resulted in overall improvements in I/O performance of up to 40% in writing to disk, and between 93% to 98% in reading for different post-processing tasks. Given the generality of the proposed strategies and the scientific file formats used (HDF5 as a standard for portability and flexibility, and of widespread use among the scientific computing community), our methodology may be broadly applied to other scientific areas, yielding significant improvements in I/O performance of HPC applications and a more efficient use of HPC resources across multiple scientific disciplines.

Bidomain equations in a bath
The bidomain equations [17] describe the evolution of the electrical activity in the heart (O h ), surrounded by a conductive passive medium (i.e. the bath, O b ). Two overlapping domains are assumed in the heart: the intracellular and extracellular domains, with respective potentials ϕ i and ϕ e , whose difference provide the transmembrane potential (V m = ϕ i − ϕ e ). The formulation of the problem is then given by the system of partial differential equations (PDEs): together with boundary conditions n Á (σ i rϕ i ) = 0 and n Á (σ b rϕ e ) = I e on the heart and the external bath boundaries, respectively, where n represents the outward-facing unit normal. In the equations above, σ i and σ e are the intracellular and extracellular conductivity tensors, σ b is the bath conductivity, χ is the surface-area-to-volume ratio, and C m the membrane capacitance per unit area. The vector u contains cell-level variables (such as ionic concentrations and membrane gating variables), and I ion (u, V m ) is the ionic current per unit surface area, as given by the cellular electrophysiological model f. The source term I i is the intracellular stimulus per unit volume, whereas I e is a stimulus current per unit area at the external boundary of the bath (zero in the absence of an external electrical field).

Simulation environment
For the numerical solution of the bidomain equations we used Chaste (Cancer, Heart, and Soft Tissue Environment) [34,35], an open-source electrophysiology solver package using the finite element method.

Benchmark problem
The HPC scientific computing framework for which the I/O strategy is optimized in this work is illustrated in Fig 1. The bidomain with bath equations were solved in an anatomically realistic human ventricular and torso (i.e. the bath, also containing lungs and bones) mesh. The combined heart-torso mesh has a total of about 3.25 million nodes and 19.4 million tetrahedra.
A detailed description of model parameterization is provided in [36]. The ten Tusscher-Panfilov model [37] was used to describe human ventricular electrophysiology at the cellular level. For this mesh resolution with two double-precision outputs per printing time step (V m and ϕ e ), each printing time step requires storage of 8B × 3253316 × 2 % 52.1MB. The PDE time step was 25 μs, with temporal adaptivity between consecutive PDE time steps for the numerical solution of the cellular electrophysiology model. For feasibility in generating the benchmark solutions, the printing time step was set to the PDE time step, and the total run time was 150 printing time steps. To include other types of data access, the post-processing of ventricular maps of activation times and peak transmembrane voltage were enabled, which involve reading from the main results dataset. Finally, all datasets were converted to VTK visualization files as an additional post-processing step.

General considerations on HDF5 default data layout
At the hardware level, digital data storage is one-dimensional, so obviously there must be a mapping between the multidimensional datasets represented in HDF5 files and the disk. The default method simply serializes the in "row-major order", which might or might not be suitable depending on the access pattern (the order in which each process accesses values in the dataset).
Suppose we have a two-dimensional 10 × 10 dataset using this default layout (Fig 2). In row-major order the data are serialized by rows, e.g. elements 1 to 10 (labelled) are contiguous on disk, and the same for the following rows. Using conventional matrix notation indexed from 1, if a process wants to read entries (3,4) to (3,8) inclusive, i.e. the 24 th to 27 th elements (shaded, top panel), then it can do this very efficiently by reading a contiguous region on disk (solid arrows). On the contrary, if a process wants to read entries (4,3) to (8,3) inclusive, i.e. the 33 rd , 43 rd , 53 rd , 63 rd , and 73 rd elements (shaded, bottom panel), then it must perform a number of relatively expensive disk seeks between each row (dotted arrows). Clearly, for good performance the data layout must be based on the access pattern. Since data layout on disk has a significant influence on performance, HDF5 allows the specification of customized chunk shapes based on typical access patterns. A chunked dataset is divided into repeating units of the chunk dimensions, and space for each chunk is allocated contiguously on the disk. In the second example on column-reading above, we might utilize chunking by setting the chunk dimensions to the shape of a column in the dataset. With chunks that coincide with columns, the library would lay the data out on the disk column by column, so that any access from a column becomes a single, fast, contiguous read.
For cardiac applications, HDF5 datasets in our simulation package are three-dimensional objects, over time in the first dimension, nodes in the second, and variables in the third. A typical chunk has size {C t , D N , D v }, where: C t depends on the number of printing time steps, D t ; D N is the total number of nodes; and D v is the total number of variables. In other words, each chunk spans the dataset in the 'nodes' and 'variables' dimensions, with ceil(D t /C t ) chunks in the time dimension. This strategy is fairly efficient for simulations with a small number of processors and nodes as each chunk will be relatively small and easy to cache. It is, however, poorly suited to large parallel applications as discussed below.
Considering parallel performance, using chunks that span the node dimension is suboptimal due to the way the problem is partitioned across parallel processes. At the start of a simulation, the mesh is partitioned and each process is assigned a subset of the nodes that remains unchanged for the duration of the simulation. For simplicity, the nodes are reordered so that the nodes owned by each process are indexed contiguously. Using the earlier notation, each process will access a contiguous block of approximately size {D t , D N /j, D v }, where j is the number of processes (assuming an equal partition of nodes between processes). Recalling the chunk shape, we note that it is "orthogonal" to the regions owned by each process (Fig 3).
At this point it is necessary to briefly introduce two of the HDF5 drivers of our simulation package, and how they differ in reading and writing a chunked dataset. When writing, it uses the MPI-IO driver, which is specialized for parallel applications. The chunk shape has relatively little impact on writing because the MPI-IO driver uses direct access to the disk, and collective writes are used (designed to improve performance when writing from many processes to a single file by first concentrating data onto intermediate aggregators). When reading, however, (e.g. post-processing or generating visualization files) the default driver is used which uses standard POSIX operations. Unlike the MPI-IO driver, the default driver attempts to cache an entire chunk when data in it are accessed. Furthermore, the caching implementation in the default driver dictates that access can only be done on whole chunks. In other words, even if a process attempts to read just one entry from a chunk and it is possible for the chunk to be cached, then the process will read the entire chunk into the cache before continuing. If the chunk is too large to be cached then the cache is bypassed completely and direct access is used.
From the preceding paragraphs it becomes clear why using chunks that span the node dimension is suboptimal. First, consider an HDF5 reader being used to perform post-processing on a dataset in parallel. Each process is expected to access all the variables for all its nodes for all times. Depending on the size of each chunk compared to the size of the chunk cache there are two possibilities: • If chunks are small relative to cache size, then when a process requests a value in its block it must first read the entire chunk into its cache, despite most of the nodes belonging to other processors and being therefore of no interest. This might yield acceptable performance if enough chunks can be cached and the access pattern is conducive, which is not the case in frequent post-processing tasks. For example, imagine the reader accesses all the time points for one node, then all the time points for the next node, etc. Unless all the chunks can be cached at once, the reader will be forced to read the entire dataset on every iteration, just to get the values for each node.
• If chunks are larger than cache size, then each process will access the dataset independently and directly. In this case, the potential performance improvement from using the cache is lost, but the requirement to read in whole chunks is dropped, possibly resulting in less disk activity. Still, for optimal performance the use of caching should be favored.
As mentioned above, writing to the HDF5 file is expected to be less affected by chunk settings than reading, since the MPI-IO driver only uses direct access with collective writes. Nevertheless, the chunk shape results in every chunk being accessed simultaneously by all processes (Fig 3), possibly resulting in increased library overhead to track modifications and maintain consistency between processes. A chunk layout more closely resembling the process boundaries would alleviate this issue.

I/O optimization in large HPC systems
A new chunking algorithm. The most methodical way for optimal chunk design would be to set the chunk shape based on an analysis of the most frequent access patterns, within some chunk capacity limits. The chunk size is important because (1) disks are generally better at reading fewer, larger regions than more, smaller regions, and (2) it influences the size of chunk cache needed for good read performance. Unfortunately, two opposing modes of access coexist in our case at different stages of the simulation. When solving, the fastest varying dimension is the variable dimension, followed by nodes, and finally time. When performing post-processing, it is not uncommon to instead access the variables for each node over all time. An access-based approach might also result in highly problem-and/or machine-specific algorithms, that might show good performance in some applications at the expense of others. Input-output optimization in high performance scientific computing Instead, a general algorithm was developed to set the chunk size based only on the size of the dataset and one parameter, the target chunk size in bytes (T B ). For maximum generality and in order to ensure its applicability to any number of dimensions, the new chunk algorithm is designed to treat all dimensions equally. Another design requirement is that the chunk shape should result in high storage efficiency. HDF5 allocates the minimum integer number of chunks required to contain the dataset (recall Fig 3). Most conceivable chunk shapes a priori might therefore result in unacceptable amounts of wasted space at the edges of the dataset, because the chunk size is unlikely to be (close to) a factor of the dataset size in every dimension.
Central to the proposed solution is the variable "target size", T (not to be confused with T B ). First, for each dimension, the rounded-up division of the dataset size by the target size gives the minimum number of target-sized chunks that would span the dataset. Second, the rounded-up division of the dataset size by this number of chunks gives the actual size of chunk that is closest to the target size while still being close to a multiple. The problem then reduces to finding the target size that best satisfies the chunk size requirement in bytes. The solution can be written concisely as follows. Let the chunk and dataset sizes be vectors denoted byC andD, respectively:C where N is the number of dimensions in the dataset (usually 3 for time, nodes and variables). We therefore getC by finding the largest T such that where in Eq (7) represents the chunk size constraint (each element is 8 B), in Eq (8) limits the chunk to the dataset size in each dimension, and Eq (9) defines the chunk size given the dataset size and target size in such a way as to minimize wasted space. Algorithm 1 New HDF5 chunk size algorithm The method for solving the above is described in Algorithm 1. The dataset size (D) and target chunk size (T B ) are assumed as predetermined. The first while loop (line 9) increases the target size (T) and calculates the resulting chunk dimensions (C) until the target size in bytes (T B ) is reached, or the chunk spans the entire dataset (U is True). Note that a binary search for T between 1 and max ðDÞ (for example) would be faster, but as invoked only once per dataset (resulting in a negligible overhead in overall wall times) this does not represent a significant increase in performance, and we chose to present the algorithm here in incremental form in the interest of clarity. After leaving the while loop, if T B has been exceeded (line 13), the algorithm brings the size back below the target. Once given a target size in elements, the CALCULA-TECHUNKDIMS function (line 19) calculates chunk dimensions that aim for the target size in each dimension while minimizing wasted space at the dataset edge as outlined above. First, it calculates the minimum number of chunks of size T that would be required to span the dataset (x, line 23). Then, given x chunks, it calculates the minimum number of elements required in each chunk to span the dataset (line 24). This function also calculates the actual chunk size in bytes (C B ) and determines U. The chunk size in bytes is the product of 8 B and all the elements ofC (line 24). Finally, if x = 1 on every iteration, then one chunk spans the entire dataset and U is True (line 26).
The choice of T B depends on the problem size and the computer. The default value was set to 128 kB as this resulted in consistent performance in the small profiling tests that are run regularly to monitor performance. For large problems it was increased to 1 MB, in agreement with the default stripe size in Archer as discussed next.
Data striping. At the filesystem level, data striping is a common technique in HPC systems (including the Lustre filesystem in Archer) to increase data throughput by splitting files into segments and dividing the segments amongst multiple physical storage targets (e.g. hard disk drives).
For performance improvements at this level, two parameters may be set on a file or directory basis: the stripe size (S) and stripe count (c). The former is the size (in bytes) of each stripe, whereas the latter is the number of Object Storage Targets (OSTs) over which to divide the stripes (see Fig 4). The system defaults on Archer are 1 MB and 4, respectively, and there are 48 OSTs at the time of writing this work, each capable of writing at roughly 500 MB/s.
Optimal values for S and c can only be found through experimentation. For reference, the Lustre documentation (see Section 18.2.1 in [38]) recommends a stripe size between 512 kB and 4 MB. Smaller sizes are not recommended 'because the Lustre file system sends 1 MB chunks over the network'; more is not recommended because 'stripe sizes larger than 4 MB may result in longer lock hold times and contention during shared file access'. Finally, the stripe size must be a multiple of the page size (enforced to a multiple of 64 kB for compatibility). The default stripe size on Archer (1 MB) is generally optimal and other values will not be considered here.
The stripe count c will be investigated below. For writes from many processors to a single file a large stripe count is recommended, but not too many counts as this results in extra overhead for diminishing returns. A starting point based on the Lustre documentation is to use approximately '1 stripe per GB' to '1 stripe per 4 GB' of file size. As an example, for 100 printing time steps of our benchmark problem (expected dataset size of *5.21 GB), c should probably be between 2 and 6. Another is to "load balance" by using an integer factor of the number of processors, such as one stripe per compute node so that each node gets one aggregator.
Cached writes. Regardless of the data layout used, the simulation results are written to the HDF5 file every print time step of simulation time. Recall from our benchmark description (see Methods) that one printing time step of data in our benchmark consumes about 50 MB on disk. Large HPC parallel file systems have good sustained throughput and are typically optimized for high bandwidth (such as the 500 MB/s per OST in Archer as mentioned above), but performance for small writes is much lower. They work best with a small number of large, contiguous I/O requests whereas small ones are generally discouraged. We should therefore expect several hundred 50 MB writes to show worse performance than, say, one 40 GB write.
The chunk cache provided by HDF5 might have been a viable answer, but it is currently not available when using the MPI-IO driver in write mode. The selected solution was to implement a memory cached mode in the HDF5 writer, whose constructor now takes an argument specifying if the cache will be enabled. This simplifies the implementation of our strategy over making the cache switchable, which would require extra logic like flushing the cache when switching. A new vector member with a reserved size of C t × N n × N v acts as the cache, where C t is the size of a chunk in the time dimension (as calculated by the new chunking algorithm), N n is the number of nodes owned by a process, and N v is the number of variables. Once the number of elapsed print time steps equals C t , each process writes the contents of its cache to the HDF5 file. As collective writes are still used, the library then takes care of consolidating the data onto aggregators as normal.   Table. https://doi.org/10.1371/journal.pone.0202410.g005 Input-output optimization in high performance scientific computing the three repeats. Performing additional repeats was unfeasible due to time and resource requirements.
The principal results of this study are as follows. In all cases, the default chunk layout (1) spent little time in Output and substantial time in DataConv and PostProc. The former point is likely due to the small number of chunks (just three), resulting in very little overhead when coordinating collective writes. The latter point, however, clearly illustrates the high cost of reading results from a poorly laid out dataset for additional post-processing tasks. Conversely, the new layout (2) spent the vast majority of time in Output and little in DataConv or Post-Proc. The writing of the resulting 7497 new style chunks is evidently slow, but the smaller, squarer chunks allow the post-processing and conversion steps to happen extremely quickly by leveraging the built-in HDF5 chunk-caching functionality. Moreover, the effect of enabling the custom chunk-writing cache on the new style method is striking (3). In this case, the benefits of the new data layout to DataConv and PostProc are retained, whereas the time spent in Output is reduced to under 30 s in all cases. Clearly the new algorithm with cached writes offers superior performance on Archer compared to the considered alternatives. Fig 5 further illustrates the results for each method with respect to stripe and node counts, of importance for performance optimization. First, comparison by rows (stripe count effects) for the default chunk (1) illustrates increased DataConv times with 24 or 42 stripes compared to 4 stripes, both with either 8 or 20 nodes. Clearly, the DataConv process is unable to leverage the extra bandwidth offered by the large stripe counts, probably due to either an overhead of communicating with many OSTs, or technical advantage from concentrating on a small number of OSTs (e.g. internal OST caching). PostProc showed the same trend. In contrast, the uncached new chunks (2) were faster with 24 or 42 stripes than 4, showing performance benefits in parallel I/O. The severe bottleneck in Output is alleviated with a larger number of stripes, suggesting that the performance with 4 stripes is either limited by the OSTs or due to overwhelming the 4 threads assigned to aggregators. If we interpret the large error bars on Output as a sign of sensitivity to the machine load then the answer is probably the former. Finally, this method performed better with 24 stripes than 4 or 42, supporting the rule of thumb that a single large file written to by many processors should be striped, but not excessively, to avoid incurring large overheads. Another well-suited characteristic of the new-chunk cached method (3) for large HPC systems is its insensitivity to stripe counts, which alleviates optimization needs at the file system level, in particular for novice users.
Second, by comparing columns (node count effects), performance is improved for the default chunks (1) going from 8 to 20 nodes, both in DataConv and PostProc. This suggests that in spite of the sub-optimal I/O performance of these chunk layouts, the post-processing stage is somewhat able to utilize additional cores. The un-cached new algorithm (2), however, scales poorly at best, showing no significant difference between 8 and 20 nodes. In the case of 4 stripes, Output is still slow, perhaps exacerbated by the larger number of nodes. Yet again, there were no significant differences in performance for the cached new method (3) with node counts, highlighting its robustness for scalable applications.
The previous results also indicate that a strip count c of 4 simultaneously yields the best studied I/O performance for both the default chunks and new cached algorithm. This agrees with the initial estimate on that c should be between 2 and 6. A comparative summary of benchmark times for these 4 stripes on 8 compute nodes is presented in Table 1 for all the considered I/O strategies (see S1 Table for rest of stripes and compute nodes). These represent relative improvements in I/O performance of the new strategy over the default layout method of 40% in HDF5 writing to disk, 93% in post-processing, and 98% in HDF5 conversion to visualization files.
Alignment of new chunks with caching. As described in the presentation of our new chunking algorithm, any chunking algorithm is unlikely to produce chunks of exactly the target size. In addition, chunks are located by default at irregular locations within the file. Together, these two statements imply that a given chunk is unlikely to align perfectly with the stripe boundaries of the file. In such circumstances, accessing a chunk requires reading stripes from more than one OST. For example, if the stripe size and chunk target size are 1 MB and the true chunk size is slightly less than 1 MB, then reading a chunk is likely to involve requests to two OSTs. When the chunk and stripe size are similar, it might therefore be preferable for each chunk to be padded slightly with empty space so that each chunk starts on a stripe boundary. Note however that such an approach should be used with caution in order to avoid excessive wasted space.
The benchmark problem was used to test the performance with and without alignment (as implemented natively in the HDF5 library) in the new cached chunking algorithm. Results (mean ± S.E.M., in seconds) are shown in Table 2 for 110 and 113 repetitions of the unaligned and aligned cases, respectively.
Whereas HDF5 alignment did not substantially affect the overall performance of the new method, no improvements were attained in any of the considered I/O categories. A possible explanation for this is that the processes regularly read and write across chunk boundaries (i.e. the partition boundaries rarely fall exactly on chunk boundaries), so placing each chunk into its own stripe rarely results in a reduction in the number of OSTs accessed. Enabling alignment might also introduce small gaps into otherwise contiguous data, reducing performance slightly. The theoretical advantage of aligning chunks to stripes might however become apparent in larger problems.

Conclusion
In this work, we have presented significant improvements in the I/O performance of our electrocardiogram-simulation framework in large HPC infrastructures, particularly in the challenging and frequently neglected areas of data writing, post-processing, and data conversion. A general algorithm for the efficient design of data layouts in HDF5 files (as a leading scientific file format for data storage and portability) was developed, and further optimized using cached writes. The efficiency of the resulting I/O strategy with respect to native concurrent layouts has been shown, independently to stripe and node count effects in large HPC filesystems, as well as to data alignment within the resulting files. Furthermore, as a single parameter (the target Input-output optimization in high performance scientific computing chunk size in bytes) is responsible in our algorithm for the low-level design of the underlying datasets regardless their number of dimensions, this guarantees a maximum generality and its applicability to other scientific areas beyond the one considered in this work. The most substantial contribution is the method in which HDF5 files are written to disk, including the design of a novel low-level data layout independent to the number of dimensions in the dataset. Two actions are central to this I/O strategy. Firstly, the data layout (the so-called chunk shape) was modified to improve efficiency when reading small amounts of data, which is common in large HPC applications using many processors. This yielded a significant reduction in times for post-processing of simulation results and their conversion to other visualization formats, which are common scientific requirements across disciplines. A side effect of the new data layout was an increase in the output times required for the writing of results to the HDF5 file, due to the nature of storage systems in large HPC systems. This was overcome by implementing a cached write mode which bundles multiple small writes into larger ones, substantially reducing the aggregate writing times. The overall result was a drastic reduction in the time spent in all I/O stages of our simulation framework, with relative improvements over default HDF5 layouts of 40% in writing, 93% in post-processing, and 98% in data conversion.
Previous efforts have also been deployed for the optimization of I/O performance in HPC applications. Of particular relevance for our work are write-optimized middleware systems, such as ADIOS (Adaptable IO System, [39]) or PLFS (Parallel Log-structured Filesystem, [40]). These high-level I/O Application Programming Interfaces (APIs) allow for a more aggressive writing and efficient reordering of data locations in the case of ADIOS, and for a decoupling of concurrent writes to improve the speed of checkpoints in the case of PFLS, resulting in up to 100 × improvements in writing in selected applications [40,41]. Importantly, these write-optimized APIs have been also shown to not penalize read speeds [33]. However, they both introduce intermediate file formats that require conversion for analysis to standard scientific formats, or to be mounted as stackable filesystems on top of an existing parallel filesystem.
On the contrary, the simplicity of the I/O strategies presented in this work, solely based on the size of the dataset (independent to its number of dimensions) and the straightforward implementation of a cached write mode, could easily be incorporated into codes using popular scientific file formats like HDF5, which has a history of optimization on popular HPC platforms [42]. This would alleviate the need of using intermediate API layers and the associated additional complexity to end users, while resulting in important savings in writing, reading and post-processing times in scientific applications.
For applications involving mesh adaptivity, an inherent limitation of the HDF5 file format is that the chunk size is set at the dataset creation time and cannot be changed later, which forces the use of a fixed chunk size. Based on our results for fixed chunk sizes, we hence still expect an increased I/O efficiency for the new designed chunks compared to the default HDF5 layout in the presence of adaptivity. Such investigations (including the estimation of an optimal chunk size) fall however beyond the scope of our present work. In addition, our benchmark experiments were performed using a single (Lustre) parallel I/O environment. Although our cached results demonstrate almost complete independence to the number of stripe counts (see Fig 5), which in turn minimizes sensitivity to the choice of this parameter as a common technique to increase data throughput across multiple HPC filesystems, the evaluation of our methodology in other parallel environments also constitutes an interesting aspect for future research.
In conclusion, given the generality of our I/O strategies and file formats used, the improvements presented in this work might enable a more efficient use of HPC resources and accelerated progress in multiple areas of scientific research. This may allow researchers to achieve a wider range of functionalities using standard scientific file formats, and therefore more complete simulation frameworks, within tolerable HPC resource envelopes.
Supporting information S1 Table. Benchmark results for considered data layouts. I/O times for data writing to disk, data conversion, and post-processing in default-style chunks, new-style chunks, and new-style chunks with caching. (XLSX) S2 Table. Benchmark results under HDF5 alignment. I/O times for data writing to disk, data conversion, and post-processing in new-style chunks with caching, with and without HDF5 alignment. (XLSX)