A DNA-Based Semantic Fusion Model for Remote Sensing Data

Semantic technology plays a key role in various domains, from conversation understanding to algorithm analysis. As the most efficient semantic tool, ontology can represent, process and manage widespread knowledge. Nowadays, many researchers use ontologies to collect and organize the semantic information of data in order to maximize research productivity. In this paper, we first describe our work on the development of a remote sensing data ontology, with a primary focus on semantic fusion-driven research for big data. Our ontology is made up of 1,264 concepts and 2,030 semantic relationships. However, the growth of big data is straining the capacities of current semantic fusion and reasoning practices. Considering the massive parallelism of DNA strands, we propose a novel DNA-based semantic fusion model. In this model, a parallel strategy is developed to encode the semantic information in DNA for a large volume of remote sensing data. The semantic information is read in a parallel and bit-wise manner, and each individual bit is converted to a base. By doing so, a considerable amount of conversion time can be saved: the cluster-based multi-process program reduces the conversion time from 81,536 seconds to 4,937 seconds for 4.34 GB of source data files. Moreover, the size of the result file recording the DNA sequences is 54.51 GB for the parallel C program compared with 57.89 GB for the sequential Perl program. This shows that our parallel method can also reduce the DNA synthesis cost. In addition, data types are encoded in our model, which provides a basis for building a type system in our future DNA computer. Finally, we describe theoretically an algorithm for DNA-based semantic fusion. This algorithm enables the integration of knowledge from disparate remote sensing data sources into a consistent, accurate, and complete representation. This process depends solely on ligation reaction and screening operations instead of the ontology.


Introduction
As the hereditary basis of every living organism, DNA has the ability to store and process information. This information is determined by the sequence of four distinct bases (A, C, G, T). An oligonucleotide is a short, single-stranded DNA molecule, and complementary base pairing enables hybridization into a double-stranded polymer. These features of DNA have inspired the idea of DNA computing [1][2][3]. DNA computing, also known as molecular computing, offers major advantages both in vivo and in vitro, such as massive parallelism, extraordinary information density and exceptional energy efficiency. In contrast to traditional silicon-based technology, DNA computing has natural potential for semantic fusion and reasoning over big data.
Nowadays, ontology has gained more and more acceptance as one of the semantic technologies for solving the problem of heterogeneous knowledge sharing [4]. Many research efforts have been devoted to ontology modeling over the past decade [5][6][7][8][9], and quite a few running systems based on manual ontologies have been developed [10][11][12]. However, data is accumulating at an astounding rate with increasing computing power. Many activities, for instance encoding an organism's DNA [13], collecting satellite data [14], and conducting scientific experiments at the Large Hadron Collider [15], can create a staggering amount of data. The growth of such big data outstrips the capacities of current ontology engineering practices and tools. In bioinformatics, the semantic integration of big data has been identified as a new frontier [16]. The same trend can also be observed in other scientific domains. For example, with a vast amount of geographical data becoming available from satellites, especially after the recent opening of the Landsat archive [17], there is an increasing demand for automatic semantic processing of remote sensing images (RSIs) in a reasonable amount of time. Up to now, reasoning over big data has remained challenging. As the winner of the Semantic Web Challenge, Williams provided experimental results showing that reasoning over the Billion Triple Dataset required 3712 processors from IBM LS21 blade servers and that the computation time was 1314 seconds per processor [18]. Although this dataset contains 898,966,813 triples and the size of the combined dataset is around 17 GB, the amount of data obtained per day from satellite devices and open sources on the Internet is much higher and is beyond the capabilities of analysts to process with the help of ontology [19]. Novel tools and approaches are needed to address this problem, which has arisen during the current period of rapid data and knowledge growth.
DNA computing has now become an active research area [20][21][22][23][24]. DNA-based parallel computing takes advantage of many different DNA molecules to solve NP-complete problems in polynomial or even linear time, whereas a silicon-based computer requires exponentially increasing time. In this paper, a DNA model is introduced for semantic fusion of the RSIs. It utilizes DNA computing and ontology technologies to enable the complete representation of the knowledge in the RSIs in linear time regardless of the amount of data obtained.
There are few published works in the literature on the application of DNA-based approaches to semantic fusion. Tsuboi proposed a pattern matching algorithm based on the stickiness of DNA molecules [25], in which semantic network technology is used to solve the information recognition problem. However, the fusion of semantic relationships is not involved, which restricts the analysis and reasoning capacity of the processing system. Moreover, the encoding scheme in this algorithm is not suitable for arbitrary digital information, and different data objects have to be encoded by different oligonucleotides; such an exhaustive representation is considered unrealistic. Church proposed a novel strategy to store digital information in DNA [20]. In Church's work, all data blocks can be programmed into a bitstream and then encoded onto thousands of oligonucleotides. However, the sequential conversion code (Perl) is challenged by big data. Xu provided a new DNA computing model for the graph vertex coloring problem [26], which can effectively reduce the solution space by semi-nested polymerase chain reaction. All of the approaches described above lack support for semantic reasoning, and little attention has been given to big data; both have become key problems of knowledge sharing and semantic representation in the web environment.
In an attempt to overcome these difficulties, we propose here a novel DNA-based semantic fusion model as an extension of our previous research on distributed data applications in the remote sensing field [27]. In previous work, we implemented a semantic fusion and reasoning system for the retrieval of RSIs. At present, the use of DNA computing in semantic fusion presents numerous opportunities for our future DNA reasoner. The inherent massive parallelism of DNA strands allows for big data storage and reasoning. The main efforts in this paper are to 1) develop a remote sensing data ontology with 1,264 concepts and 2,030 semantic relationships to annotate the RSIs; 2) encode arbitrary semantic properties, property values, semantic relationships and data types in DNA, and organize the semantic information into a directed acyclic graph; 3) evaluate the performance of our parallel conversion method against the sequential approach with the Rest dataset [28]; 4) create an algorithm that takes advantage of the biochemical reaction to fuse the semantic information.

Results and Discussion
Remote sensing data ontology
Ontology, as a formal representation of both implicit and explicit domain knowledge, can help to deal with heterogeneous representations of data and their interrelationships. There exist several forms of ontology with different semantic richness. As a specification developed by the World Wide Web Consortium, the Resource Description Framework (RDF) [29] can present semantic information of web resources. RDF Schema [30] provides a type system for RDF and defines classes and properties that may be used to describe classes, properties and other data resources. It can also be used to build a lightweight ontology by describing RDF vocabularies. Figure 1 illustrates the remote sensing data ontology using the RDF Schema language. The computer code of the ontology is provided in File S1. All terms in the ontology vocabulary are divided into five groups (namely, Identification Information, Data Quality Information, Spatial Data Organization Information, Instrument Information, and Location Information) to represent the content, quality, condition, and other characteristics of data. To enable the extensibility of the ontology, we evaluated the suitability of several existing geospatial metadata standards, including the Content Standard for Digital Geospatial Metadata: Extension for Remote Sensing Metadata [31], ISO 19115 [32] and ISO/TS 19139 [33]. The Extension defines the metadata elements published by the U.S. Federal Geographic Data Committee and documents digital remote sensing datasets in the US. ISO 19115, in contrast, only provides a structure for describing digital geographic data, and many of its elements come from the Extension standard. ISO/TS 19139 defines an XML schema implementation derived from ISO 19115. These two ISO standards are very simple, but they are not suitable for ontology modeling.
Because the conceptual model in the Extension does not provide enough semantic description of geographic data, we construct a hierarchical structure for the ontology. The relationships among specific classes are encoded into the ontology structure. The RDF Schema properties rdfs:range and rdfs:domain describe the relationships between specific properties and classes, and many image data relationships have been described using the domain properties from the Extension standard.
The real RSIs must first be preprocessed with a semantic annotation technique, in which semantic tags defined in the ontology are assigned to the phrases in the descriptive metadata of the RSIs. This facilitates fusion and reasoning based on image semantics. An RDF instance of an RSI is shown in Figure 2, where the metadata of RSI 103001001E1EB700 are annotated with properties such as imagequal (image quality), Cloud_Cover and spatresv (spatial resolution value). Many of the property values are "intermediate" anonymous resources that represent constant values (called literals) such as Excellent, 0 or 1.85, or aggregate concepts such as the structured Nominal_Spatial_Resolution values of the RSI. Anonymous resources cannot be referred to from outside their defining RDF instance, and hence do not require meaningful names.

Semantic property and data type
In order to convert the classes and properties representing data semantics into a sequence of nucleotides, we propose a property representation and type design suited for DNA implementation. For example, this paper annotates three RSIs, E1EB7, D87C9 and B8EF1, with three properties: city (cty), imagequal (qa) and Cloud_Cover (cc). The property values of the first image are Guang Zhou (GZ), Excellent (E), and 0, respectively. The values of the other two are Hong Kong (HK), Good (G), 0, and HK, G, 16. Considering the linear structure of DNA strands, we arrange these properties and their values in sequence as shown in Figure 3. The label of a vertex is denoted as a two-tuple (property name, property value). An edge denotes the connection between two vertices in the directed graph. To simplify the graph structure, two new vertices labeled "Start" and "End" are added to the directed graph, and vertices are merged into one if they have the same property and property value. As shown in Figure 4, the directed paths between the initial and terminal vertices in the property network represent the annotation results of the RSIs.
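The construction of the property network can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `build_property_network` and the `ANNOTATIONS` dictionary are hypothetical, and the abbreviation cty for the city property follows the usage in the encoding sections of this paper.

```python
# Annotations of the three example RSIs as (property, value) pairs.
ANNOTATIONS = {
    "E1EB7": [("cty", "GZ"), ("qa", "E"), ("cc", "0")],
    "D87C9": [("cty", "HK"), ("qa", "G"), ("cc", "0")],
    "B8EF1": [("cty", "HK"), ("qa", "G"), ("cc", "16")],
}

# Extra vertices added to simplify the graph structure.
START = ("Start", None)
END = ("End", None)


def build_property_network(annotations):
    """Merge per-image property chains into one directed graph.

    Vertices with the same (property, value) label are stored only once,
    and every chain is wrapped between the Start and End vertices.
    """
    vertices = []
    edges = []
    for pairs in annotations.values():
        path = [START, *pairs, END]
        for vertex in path:
            if vertex not in vertices:
                vertices.append(vertex)
        for src, dst in zip(path, path[1:]):
            if (src, dst) not in edges:
                edges.append((src, dst))
    return vertices, edges


vertices, edges = build_property_network(ANNOTATIONS)
```

For the three example images this yields 8 vertices (including Start and End) and 9 edges, because identical (property, value) pairs such as (cty,HK) are merged.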
Everything would be simple if every property value to be recorded took the form of a simple character string literal (plain literal), as illustrated so far. However, most RSI data involve structures that are more complex than that. Many constant values that serve as property values in the RSIs are numbers (e.g. the value of a Nominal_Spatial_Resolution property) or other kinds of more specialized values. For example, Figure 4 illustrates a network diagram recording information about three RSIs, where the values of the Cloud_Cover property are the literals "0%" and "16%". However, there is no explicit indication that "0%" or "16%" should be interpreted as a number. The common practice in computer programming or database systems is to provide additional information about how to interpret a literal by associating a data type, such as integer, boolean, or string, with this literal. In our new DNA model, 4-nt oligonucleotides are used to provide this kind of information. Since a DNA strand has no built-in data type system of its own, our model simply provides a way to explicitly indicate, for a given data type, which oligonucleotide should be associated with it. Table 1 shows the common data types. The data types in this model refer to the XML Schema Datatypes defined in [34]. An advantage of this approach is that it gives our model the flexibility to directly represent information obtained from various RSIs or web sources. It is worth noting that type conversions may still be required when moving data between systems that have different sets of data types.
Moreover, a property value may sometimes appear to be simple but actually be more complex. For example, the unit of the spatial resolution for satellite imagery is the meter, but in some cases this information is not given explicitly; it is omitted in contexts where it can be assumed that anyone accessing the property value will understand the unit being used. However, this assumption is generally unsafe in the wider context of the imagery. One might give a resolution value in kilometers or degrees, whilst others might assume that it is in meters. In general, comprehensive consideration should be given to the explicit representation of unit information.

Encoding the semantic information
Before the semantic information is converted into DNA, an encoding model is required. Although diverse coding strategies for DNA sequences have been developed and some have been demonstrated [20,35,36], no standard model exists. Church [20] first proposed a simple, universal strategy. In Church's work, arbitrary digital information can be converted into bitstreams by utilizing the ASCII code. These bits are then encoded onto the oligonucleotide library. Unlike conventional approaches, Church encodes one bit per base in order to meet the appropriate GC content and introduces a 19-nt oligonucleotide to represent the address space of the data.
However, the common type system is not considered in Church's encoding method. Thus, we propose a novel data encoding approach for semantic information. Firstly, the vertices and edges in Figure 4 are converted into DNA sequences in order to efficiently represent the semantic properties. Every vertex is associated with a 48-nt oligonucleotide, which is denoted V. The full description of the mapping from the vertex property to the DNA sequence is provided in the Materials and Methods section. Each V, except the start and end vertices, is decomposed into four oligonucleotides whose lengths are 24, 4, 4 and 16: V = NTUA. N, T, U, and A represent the property name, data type, unit (or comment), and property value, respectively. The unit value U depends on N and T. For example, the property name cc and property value 0 in the vertex (cc,0) are represented by the first and last parts of V_(cc,0), respectively, where N_(cc,0) = aaCgaagagCTaagCCgCCgaaTC and A_(cc,0) = gaCTgagaggTTggag. The oligonucleotide GCAT in V_(cc,0) represents the unit %, as shown in Table 2.
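To make the field layout concrete, the following Python sketch splits a 48-nt vertex oligonucleotide into N, T, U and A and decodes the two data fields back to text. It assumes the base-to-bit mapping given in the Materials and Methods section (a or g for 0, T or C for 1). The helper names `split_vertex` and `decode_field` are hypothetical, and the vertex sequence used is the one quoted for (cty,GZ) in that section.

```python
# Base-to-bit mapping described in Materials and Methods: a/g -> 0, T/C -> 1.
BASE_TO_BIT = {"A": "0", "G": "0", "T": "1", "C": "1"}

# Field lengths of V = N T U A (property name, data type, unit, property value).
FIELD_LENGTHS = (24, 4, 4, 16)


def split_vertex(oligo):
    """Split a 48-nt vertex oligonucleotide into its N, T, U and A fields."""
    if len(oligo) != sum(FIELD_LENGTHS):
        raise ValueError("a vertex oligonucleotide must be 48 nt long")
    fields, start = [], 0
    for length in FIELD_LENGTHS:
        fields.append(oligo[start:start + length])
        start += length
    return tuple(fields)


def decode_field(oligo):
    """Decode a data field: one bit per base, 8 bits per ASCII character."""
    bits = "".join(BASE_TO_BIT[base] for base in oligo.upper())
    if len(bits) % 8 != 0:
        raise ValueError("field length must be a multiple of 8 nt")
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


V_CTY_GZ = "aCCggaTTgTCCgCaggCCTTggCTCGATGCAaTagaCCTaCgTTaCa"
name, data_type, unit, value = split_vertex(V_CTY_GZ)
print(decode_field(name), data_type, unit, decode_field(value))
# prints: cty TCGA TGCA GZ
```

Applied to the two fields quoted above for the vertex (cc,0), `decode_field` returns ' cc' (with a leading space) for N and '00' for A under this mapping.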
Since the volume of electronic data expands rapidly, it is important to choose the optimal computer architecture for converting big data sets. Conversion solutions range from cluster-based computing [37] to cloud-based computing [38]. Because cluster computing is a cost-effective way to achieve supercomputer-level performance, we use it here. All the conversion experiments in this paper were carried out on the HPC-JNU cluster system. A description of the HPC-JNU is provided in the Materials and Methods section. The sequential and parallel codes in the C language are provided in File S3 and File S4, respectively. To evaluate the performance of these conversion programs, our semantic data are partly taken from the Rest dataset in the BTC2012 dataset (http://km.aifb.kit.edu/projects/btc-2012/rest/). This dataset is encoded in N-Quads format [39] and includes three data files that range in size from 409.99 MB to 2.69 GB. Figure 5 shows the conversion results for the 4.34 GB source dataset on the HPC-JNU cluster system. As an interpreted scripting language, Perl has poor disk I/O performance. The parallel method performs best, even though each user of the cluster system is limited to a maximum of 80 cores.
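The decomposition that makes the conversion parallel can be illustrated with a small Python sketch. It is not the MPI-based C program of File S4: a thread pool stands in for the worker processes, the 16-byte chunk size (one 128-bit data block) and all function names are assumptions made for illustration, and the sketch demonstrates only that chunks can be converted independently, not the reported speed-up.

```python
import random
from concurrent.futures import ThreadPoolExecutor

ZERO_BASES = "ag"  # bases that encode bit 0
ONE_BASES = "TC"   # bases that encode bit 1


def convert_chunk(task):
    """Convert one chunk of bytes to bases, one base per bit."""
    index, chunk = task
    rng = random.Random(index)  # independent generator for each chunk
    bases = []
    for byte in chunk:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            bases.append(rng.choice(ONE_BASES if bit else ZERO_BASES))
    return index, "".join(bases)


def parallel_convert(data, chunk_size=16, workers=4):
    """Split data into chunks and convert them concurrently."""
    tasks = [(i, data[offset:offset + chunk_size])
             for i, offset in enumerate(range(0, len(data), chunk_size))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(convert_chunk, tasks))
    return [results[i] for i in range(len(tasks))]


def decode(blocks):
    """Inverse mapping, used here only to check the conversion."""
    bits = "".join("0" if base in ZERO_BASES else "1"
                   for base in "".join(blocks))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


source = b"cty GZ qa E cc 0 " * 5
blocks = parallel_convert(source)
```

Because each chunk carries its own index, the converted blocks can be written out in any order and still be reassembled, which is the property a manager/worker scheme relies on.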

DNA's storage density
At present, remote sensing data are dramatically increasing in volume. For example, the U.S. National Climatic Data Center holds the world's largest archive of weather data and has archived 3 PB (petabytes) of satellite imagery [40]. DNA, in contrast, is extremely compact. Because the mean molecular weight of a nucleotide is 330 g/mol [41] and a 200-bp oligonucleotide encodes 128 bits in our encoding method, one gram of DNA can store $5.84 \times 10^{20}$ bits. If we approximate the density of DNA by that of water ($10^{-3}$ g/mm$^3$), then the volume of all DNA sequences encoding 3 PB of data is $4.63 \times 10^{-2}$ mm$^3$. DNA compares favorably with contemporaneous storage technologies, as shown in Table 3 [42][43][44][45][46][47][48][49][50]. DNA storage clearly has the potential to store data 100 times more compactly than other technologies.
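The density figures can be checked with a short calculation. The sketch below is an illustration under stated assumptions rather than the authors' own computation: it takes the Avogadro constant as 6.022e23 per mole, counts both strands of a 200-bp duplex (400 nucleotides per 128 bits), and treats 1 PB as 2^50 bytes, since these choices reproduce the quoted values.

```python
AVOGADRO = 6.022e23               # nucleotides per mole (assumed constant)
NUCLEOTIDE_MASS = 330.0           # g/mol, mean molecular weight of a nucleotide
BITS_PER_OLIGO = 128              # payload of one oligonucleotide
NUCLEOTIDES_PER_OLIGO = 2 * 200   # assumption: both strands of a 200-bp duplex
WATER_DENSITY = 1e-3              # g/mm^3


def bits_per_gram():
    """Number of bits that one gram of DNA can store."""
    nucleotides = AVOGADRO / NUCLEOTIDE_MASS
    return nucleotides * BITS_PER_OLIGO / NUCLEOTIDES_PER_OLIGO


def volume_mm3(n_bytes):
    """Volume of the DNA needed to store n_bytes, at the density of water."""
    grams = n_bytes * 8 / bits_per_gram()
    return grams / WATER_DENSITY


density = bits_per_gram()          # about 5.84e20 bits per gram
volume = volume_mm3(3 * 2 ** 50)   # about 4.63e-2 mm^3 for 3 PB
```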

Semantic fusion based on DNA
Semantic fusion is the key operation that ontology technology supports. It can automatically implement the union of properties and semantic relationships. A resource, such as an RSI, and its replicas may be widely distributed over several image replica databases. The owners of the resource may select different kinds of feature properties to annotate this RSI. We must merge these properties and relationships in order to improve the efficiency and accuracy of the knowledge. As shown in Figure 6, semantic fusion enables the semantic information of an image from disparate data sources to be merged. The initial properties dissolve into the new properties and do not preserve their duplicate internal structures. However, the performance of ontology fusion and reasoning degrades rapidly as data grows. Therefore, we build a semantic fusion model based on DNA. Table 2 shows a set of oligonucleotides representing the possible properties labeling the vertices in Figure 6A. As regards orientation, all of the oligonucleotides are written 5′ to 3′. Each V in Figure 6A is now divided into two oligonucleotides, each of length 24: V = V'V''. V' and V'' are the first and second halves of V. An edge from the vertex i to the vertex j is encoded as a 48-nt oligonucleotide, obtained as the Watson-Crick complement of the second half of the oligonucleotide encoding vertex i followed by the first half of the oligonucleotide encoding vertex j. For example, the encoding of the edge from the vertex (cty,GZ) to the vertex (qa,E) is e_(cty,GZ)→(qa,E) = AGCTACGTTaTCTggaTgCaaTgTCTaTTCTTTaagTTCaCaaCCTCa. For every vertex and every edge in Figure 6A, large quantities of V_i and e_ij are mixed together in the hybridization and ligation reaction, as shown in Figure 7. The oligonucleotides V_i serve as splints to bring oligonucleotides associated with compatible edges together for ligation. Consequently, many DNA molecules encoding property strings are created.
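The edge construction can be checked against the quoted example with a short Python sketch. The helper names are hypothetical, and the vertex sequence for (cty,GZ) is taken from the Materials and Methods section. Note that the quoted edge matches a position-by-position complement written in the same order as the vertex halves, which is what the sketch assumes; the first half of V_(qa,E) is not quoted in the text, so it is recovered here by complementing the second half of the edge.

```python
# Position-by-position Watson-Crick complement (letter case is preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
BASE_TO_BIT = {"A": "0", "G": "0", "T": "1", "C": "1"}


def complement(oligo):
    return oligo.translate(COMPLEMENT)


def edge_oligo(second_half_i, first_half_j):
    """Encode the edge i -> j from V''_i and V'_j (24 nt each)."""
    if len(second_half_i) != 24 or len(first_half_j) != 24:
        raise ValueError("each vertex half must be 24 nt long")
    return complement(second_half_i + first_half_j)


def decode_field(oligo):
    """Decode one bit per base (a/g -> 0, T/C -> 1) into ASCII text."""
    bits = "".join(BASE_TO_BIT[base] for base in oligo.upper())
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


V_CTY_GZ = "aCCggaTTgTCCgCaggCCTTggCTCGATGCAaTagaCCTaCgTTaCa"
EDGE = "AGCTACGTTaTCTggaTgCaaTgTCTaTTCTTTaagTTCaCaaCCTCa"

# The first half of the edge should be the complement of V''_(cty,GZ).
first_half_matches = complement(V_CTY_GZ[24:]).upper() == EDGE[:24].upper()

# Complementing the second half of the edge recovers V'_(qa,E).
v_prime_qa_e = complement(EDGE[24:])
rebuilt_edge = edge_oligo(V_CTY_GZ[24:], v_prime_qa_e)
```

Decoding the recovered half with `decode_field(v_prime_qa_e)` gives ' qa' (with a leading space), which is consistent with the property name of the target vertex.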
The remaining steps, including the final read-out, are filtering and screening procedures. We use an Adleman-style [1,51] algorithm to obtain the resulting property string:
Input: DNA molecules generated randomly in large quantities.
Step 1: Reject all DNA molecules that do not begin with V_start and end with V_end.
Step 2: Reject all DNA molecules encoding property strings that do not involve exactly 5 vertices.
Step 3: Reject all DNA molecules that contain the oligonucleotide TGCATGCA encoding the null value.
Output: Read out the property strings (if any).
As shown in Figure 8, we can obtain the resulting property string by using the DNA-based semantic fusion method. It is consistent with the semantic properties in Figure 6B.

Abstract representation of semantic fusion
The above algorithm can be formally described by an abstract model. This abstract model is based on the data structure of tubes. A tube is a multiset of finite strings over the alphabet {A, C, G, T}, namely the DNA alphabet. Given a tube, one can perform the following operations:
This model starts with the input tube T, containing the result of the ligation reaction. All separate operations select oligonucleotides and thus require the amplification of the resulting tubes by PCR (polymerase chain reaction).
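The three screening steps can be mimicked in software by treating a tube as a list of strings. The Python sketch below is only a toy simulation under stated assumptions: the vertex oligonucleotides are hypothetical 48-nt placeholders rather than sequences from Table 2, the number of vertices is inferred from the molecule length, and the null value is detected by a case-insensitive substring search.

```python
NULL_VALUE = "TGCATGCA"  # oligonucleotide encoding the null value
VERTEX_LENGTH = 48       # every vertex is a 48-nt oligonucleotide


def screen(tube, v_start, v_end, n_vertices=5):
    """Apply the three rejection steps to a tube (a list of strings)."""
    # Step 1: keep molecules that begin with V_start and end with V_end.
    tube = [m for m in tube if m.startswith(v_start) and m.endswith(v_end)]
    # Step 2: keep molecules made of exactly n_vertices vertices.
    tube = [m for m in tube if len(m) == n_vertices * VERTEX_LENGTH]
    # Step 3: reject molecules that contain the null-value oligonucleotide.
    tube = [m for m in tube if NULL_VALUE not in m.upper()]
    return tube


# Hypothetical placeholder vertices, used only to exercise the filter.
V_START = "TTCC" * 12
V_END = "CCTT" * 12
V_A = "ag" * 24
V_B = "ga" * 24
V_C = "aagg" * 12
V_NULL = "ag" * 12 + NULL_VALUE + "ga" * 8

good = V_START + V_A + V_B + V_C + V_END
tube = [
    good,
    V_A + V_B + V_C + V_A + V_END,           # wrong first vertex
    V_START + V_A + V_B + V_END,             # only 4 vertices
    V_START + V_A + V_NULL + V_C + V_END,    # contains the null value
]
result = screen(tube, V_START, V_END)
```

Only the first molecule survives all three steps. In the laboratory each step corresponds to a separation operation on the tube; the list comprehensions here merely stand in for those operations.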
Indeed, the semantic fusion problem has been shown to be NP-complete [52,53], which means that an algorithm working in polynomial time is unlikely to be found. Semantic fusion on image properties of even modest size requires an altogether impractical amount of time on a conventional electronic computer [54,55]. However, we use a finite sequence of the ligation reaction and screening operations described above to solve the semantic fusion problem. A fusion starts with an initial tube and ends with one final tube. The fusion time depends solely on the total time of the ligation reaction and five screening steps instead of the number of semantic properties and the ontology complexity. The massive parallelism of DNA thus reduces the exponential time complexity of semantic fusion to linear time.

Conclusions
Semantic fusion is a process that is ubiquitous in nature. In this paper, a novel DNA-based semantic fusion model is proposed. The model organically combines a parallel strategy with DNA encoding, which makes semantic conversion more efficient and storage density higher. Furthermore, we describe the abstract representation of semantic fusion and thus show that the fusion time of semantic properties in remote sensing images depends solely on the biochemical reactions and operations instead of the ontology. However, there are still many issues to be considered. The foremost issue is error. DNA molecules are fragile and break easily, and errors in the separate operations on DNA strands can make a dramatic difference. Thus, steps towards coping with errors should be taken. In future work, we will also implement the ligation reaction and screening procedures based on biochemical techniques and clarify the details in another paper.

Mapping from semantic information to an oligonucleotide
All properties and property values are converted to binary strings based on ASCII encoding. Each character corresponds to an 8-bit binary code. For example, the property cty has the binary code 011000110111010001111001. The conversion code in File S4 then converts these bits to a or g for 0 and T or C for 1. Bases are chosen randomly according to the result of the function rand(). Considering the big dataset, we add a 32-bit address starting from 00000000000000000000000000000000. For example, the properties and property values of the RSI E1EB7 in Figure 3 are represented by the string ␣startctyGZ␣qa␣E␣cc00␣␣␣end, where the symbol ␣ represents a whitespace character, and start and end are the labels of the new vertices added in Figure 4. This property string has the ASCII code 001000000111001101110100011000010111001001110100011000110111010001111001010001110101101000100000011100010110000100100000010001010010000001100011011000110011000000110000001000000010000000100000011001010110111001100100. It is then encoded into two 200-nt oligonucleotides by the conversion code given in File S4. Each encodes a 128-bit data block (128 nt). Before synthesis, the sequence is augmented to include the bases representing the data type and data unit. For example, the oligonucleotide aCCggaTTgTCCgCaggCCTTggCaTagaCCTaCgTTaCa is the result of encoding the property ctyGZ in the vertex (cty,GZ). Since the data type is string and the data unit is undefined, we add TCGA and TGCA to the original oligonucleotide according to Table 1. Thus, the final oligonucleotide of the vertex (cty,GZ) is aCCggaTTgTCCgCaggCCTTggCTCGATGCAaTagaCCTaCgTTaCa, as shown in Table 2.
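The mapping described above can be sketched in Python as follows. This is an illustrative sketch and not the conversion code of File S4: the function names are hypothetical, Python's `random` module replaces the C function rand(), and the tag dictionaries contain only the entries quoted in the text (string, undefined unit, and the unit %), since Table 1 is not reproduced here.

```python
import random

# Only the tags quoted in the text; the full lists are given in Table 1.
TYPE_TAGS = {"string": "TCGA"}
UNIT_TAGS = {"undefined": "TGCA", "%": "GCAT"}


def text_to_bits(text):
    """ASCII-encode text as a string of bits, 8 bits per character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bits_to_bases(bits, rng):
    """Map 0 to a or g and 1 to T or C, choosing between the two at random."""
    return "".join(rng.choice("ag") if bit == "0" else rng.choice("TC")
                   for bit in bits)


def bases_to_bits(bases):
    """Inverse mapping, used to check an encoded field."""
    return "".join("0" if base in "agAG" else "1" for base in bases)


def encode_vertex(name, value, data_type, unit, rng):
    """Build a 48-nt vertex oligonucleotide V = N T U A."""
    n_field = bits_to_bases(text_to_bits(name), rng)
    a_field = bits_to_bases(text_to_bits(value), rng)
    if len(n_field) != 24 or len(a_field) != 16:
        raise ValueError("name must be 3 characters and value 2 characters")
    return n_field + TYPE_TAGS[data_type] + UNIT_TAGS[unit] + a_field


rng = random.Random(0)  # fixed seed so the example is repeatable
vertex = encode_vertex("cty", "GZ", "string", "undefined", rng)
```

Because the bases are chosen randomly, the resulting sequence will generally differ from the one in Table 2, but it decodes to the same bits: `bases_to_bits(vertex[:24])` returns 011000110111010001111001, the binary code of cty.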

Specification of the cluster system
The HPC-JNU cluster system (http://hpc.jnu.edu.cn/) has 20 computational nodes. Each node is connected via the InfiniBand network. Table 4 shows the specifications of the HPC-JNU cluster system. Figure S1 and Figure S2 show photographs of the computational nodes and the storage node.

File S1 Code for remote sensing data ontology (see also http://cs.jnu.edu.cn/sun/ontology). Computer code in the RDF Schema language is used to generate the remote sensing data ontology in Figure 1. The RDF/OWL API is required. (RDFS).

File S2 Code for ID 103001001E1EB700 instance (see also http://cs.jnu.edu.cn/sun/ontology). Computer code in the RDF language provides the ontology annotation file for the remote sensing data instance (catalog ID 103001001E1EB700) in Figure 2. The RDF/OWL API is required. (RDF).

File S3 The sequential conversion code in C language. The code accesses and converts the data stored contiguously on disk. Despite the cache provided by the operating system, an application that performs a large number of reads, conversions and writes usually faces a performance challenge. The GCC compiler is required. (C).

File S4 The parallel conversion code in C language. To support the run-time allocation of conversion tasks, a manager/worker-style parallel C program has been built. The multiple processes of this parallel program can simultaneously access and convert big data by utilizing MPI-IO. The MPI API is required. (C).