Improved scheduling algorithm for signal processing in asynchronous distributed ultrasonic total-focusing-method system

Compared to the conventional ultrasonic phased-array system, a large-element phased-array system employing the total focusing method (TFM) can yield improved image resolution and accuracy, providing more flexible scanning methods and image merging functionality. In order to meet various forms of ultrasonic multi-group scanning, an architecture for multi-group scan integration called the “asynchronous distributed ultrasonic TFM system” is proposed, and a novel scheduling algorithm called “the sum of start time and processing time adjacent (SSPA) algorithm” is presented. The architecture adds a focus and group scheduler (FGS) and signal processing scheduler (SPS) to the traditional ultrasonic phased array system and constructs a signal processing arbitration (SPA) with several signal processing modules (SPMs). The FGS provides the focus parameters, pixel memory range, and number of pixels in each group. The SPS controls the SPA for the ultrasonic scanning data obtained from the elements, with SPM-sharing output data; hence, the optimal priority order and SPM assignment are realized, enabling switching of reading operations among the first-in−first-out memories for signal processing and minimal time-slot waiting. The SSPA algorithm is used to solve the job-shop scheduling problem with start time, which considers the processing time and start time, in order to reduce the time slot after each scheduling using adjacent operations. Therefore, the architecture enhances the flexibility of the multi-group scan, and this algorithm decreases the makespan, achieving higher efficiency compared to conventional scheduling algorithms. The reliability and validity of the algorithm are substantiated after its implementation using FPGA technology. The SPM utilization rate and the real-time performance of the ultrasonic TFM are improved. Thus, the proposed algorithm and architecture have considerable potential application in multi-sensor systems.


Introduction
Recently, techniques associated with multi-group ultrasonic sensors, which include numerous piezoelectric elements and various ultrasonic phased array (UPA) scanning patterns, have attracted widespread attention in the field of nondestructive testing [1,2].
The full matrix capture-total focusing method (FMC-TFM) is a high-resolution imaging technique used in UPA systems, which was proposed by Holmes et al. in 2005 [3]. Full matrix capture (FMC) acquisition obtains the most complete imaging information for subsequent processing by acquiring all the transmit-receive pairs of the A-sweep dataset in multiple ultrasonic sensors. Fig 1 depicts the concept behind the total focusing method (TFM) algorithm. This algorithm meshes the region of interest (ROI) into a grid of pixels; each pixel is generated by the summation of the data from all the transmission-reception pairs [3]. The intensity, I, of a pixel, P(x, z), is expressed as
$$I(P(x,z)) = \left| \sum_{i=1}^{N_e} \sum_{j=1}^{N_e} h_{ij}\left(T_{ip} + T_{pj}\right) \right|,$$
where $h_{ij}$ is the analytical version of the echo received by element j when element i transmits;
$$T_{ip} + T_{pj} = \frac{\sqrt{(x_i - x)^2 + z^2} + \sqrt{(x_j - x)^2 + z^2}}{c}$$
is the time of flight between P and the element pair (i, j); $N_e$ is the number of elements; $x_i$ and $x_j$ are the coordinates of elements i and j, respectively; and c is the longitudinal velocity of sound.
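As a minimal sketch of the pixel summation described above, the following Python functions accumulate the FMC samples of every transmit-receive pair at the corresponding time of flight. The function names `time_of_flight` and `tfm_pixel`, the sampling-rate parameter `fs`, and the nearest-sample lookup are illustrative assumptions and are not taken from the original system; the echoes `h[i][j]` are assumed to be already available as analytic (complex) signals.

```python
import math


def time_of_flight(xi, xj, x, z, c):
    """Travel time from element i at (xi, 0) to pixel (x, z) and back to element j."""
    return (math.hypot(xi - x, z) + math.hypot(xj - x, z)) / c


def tfm_pixel(h, elem_x, x, z, c, fs):
    """Intensity of pixel (x, z) from full-matrix-capture data.

    h[i][j] is the sampled analytic echo received by element j when
    element i transmits; fs is the sampling rate (assumed parameter).
    """
    total = 0j
    for i, xi in enumerate(elem_x):
        for j, xj in enumerate(elem_x):
            # Nearest-sample lookup (an assumption; real systems may interpolate).
            k = round(time_of_flight(xi, xj, x, z, c) * fs)
            if 0 <= k < len(h[i][j]):
                total += h[i][j][k]
    return abs(total)
```

Evaluating `tfm_pixel` over every grid point of the ROI would yield one TFM frame; the double loop makes explicit why the data volume grows with the square of the number of elements.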
To increase the focusing ability, the UPA instrument is often equipped with multiple ultrasonic sensors to collect the ultrasonic echo data from different directions. Sensors can work in one or more groups to generate a variety of scanning modes [4][5][6][7][8]; this is called multi group scanning, and each group scan includes several focused beams. The advantage of multi group scanning is improved scan flexibility through grouping, and realization of image merging [9][10][11]. For example, Song et al. verified that a large-aperture hemispherical phased array can restore sharp focus and maximize acoustic-energy delivery to target tissue [12]. However, this strategy increases the scanning data considerably in the defect detection process [13], rendering it difficult to transmit these data to a signal-processing module (SPM) for subsequent production of an ultrasound image.
Therefore, an asynchronous distributed ultrasonic system that applies the TFM method is proposed in this work. The proposed "asynchronous distributed ultrasonic TFM system" meets the following requirements: (i) distributed scanning can be realized, with each scanning group having potentially different start times, sample depths, and pixel synthesis parameters; (ii) compared to conventional arrays, more echo data can be derived by using an array with a larger number of sensor elements, thus achieving better detection resolution; (iii) asynchronous sampling and scanning can alleviate the requirement for a synchronous clock in the field-programmable gate array (FPGA); (iv) integrated focus module and SPMs in a single FPGA for the TFM can process the echo data in the same domain, reducing the system complexity and solving the clock jitter between scan groups; and (v) the FMC-TFM system can flexibly group the array without restricting the number of elements in each group.
However, asynchronous distributed ultrasonic TFM systems have certain problems that must be addressed: (i) multi-group scanning involves different focal laws, scanning depths, and sound velocities, and the distributed architecture also has different start times, data numbers, and focus parameters; this yields different frame tasks (beam forming image data) with different numbers of data elements, and the arrival of these tasks may result in time overlap; (ii) limited by the resources of a single FPGA chip, the system cannot build an SPM for each element; however, it should realize real-time image display, which necessitates an SPM scheduling mechanism; (iii) in a distributed image combination application, the FPGA must ensure that several images are focused and processed in real time, within a 0.04-s period, to ensure real-time image combination.
As a result of the above-mentioned reasons, the asynchronous distributed TFM ultrasonic system needs to schedule the pixel data in first-in−first-out memories (FIFOs) before the data enter the SPM, to ensure correct and orderly pixel-data processing. To overcome a large number of problems associated with asynchronous focusing data and its scheduling, the "sum of start time and processing time adjacent (SSPA) algorithm" is also proposed in this paper. This algorithm performs multiplexing of the SPMs, maximizes the resource utilization, and ensures real-time performance.
The remainder of this paper is organized as follows: In the section Related Work, we discuss the related work of the job-shop scheduling problem (JSSP), FPGA parallel architecture and application, the distributed ultrasonic system, and multi-group scanning in an ultrasonic system. In the section Problem Description, we describe the target problem. The architecture of the asynchronous distributed ultrasonic TFM system, and the signal processing scheduling problem for multi-group scanning and its mathematical model are discussed. In the section Proposed SSPA Algorithm, we study the scheduling mechanism of the SSPA algorithm. In the section Experiment Results and Discussion, we present the results of an experiment comparing the first come, first served (FCFS) and shortest processing time (SPT) algorithms with the proposed SSPA scheduling algorithm implemented using the FPGA technology. Finally, in the section Conclusions and Future Research, we summarize the research and derive conclusions.

Related work
The SPM scheduling algorithm in the asynchronous distributed TFM ultrasonic system involves a job-shop scheduling problem with a start time (JSSP with ST). As the scheduling algorithm is intended to ensure real-time imaging, it cannot involve considerable calculation; moreover, the algorithm flow and structure should not only ease migration to the FPGA but also avoid dependence on the pseudo-random numbers used by randomized algorithms.
The research methods of the JSSP can be divided into two categories: optimization methods and approximate/heuristic methods. The optimization methods mainly include mixed integer linear programming, branch and bound, and Lagrangian relaxation. Approximate/heuristic algorithms were originally introduced into JSSP problems because of their low computational complexity and easy implementation [14,15]. They mainly include priority dispatching rules (PDRs), artificial intelligence, neural networks, and neighborhood search methods. The neighborhood search methods include tabu search, genetic algorithms, and simulated annealing, which are meta-heuristic approximate optimization methods. The earliest PDRs were presented by Johnson [16] and Smith [17]. Other PDRs for the JSSP include the SPT, longest processing time, most work remaining, least work remaining, most operation remaining, least operation remaining, earliest due date (EDD), and selection of the first procedure in the work piece queue on the same machine (i.e., FCFS) [18]. Panwalkar [19] presented a summary of approximately 113 dispatch rules, classified through performance indexes. Furthermore, Wu [20] stated that scheduling rules can be divided into three categories: priority rules related to the job information, a combination of priority rules and switching, and the weighted priority dispatching rules. The key lies in selecting the best rule for a given problem. As per the optimization effect of each rule, the SPT can reduce the average process time of all the jobs, and the EDD is used for optimizing the target related to the maximum delay. Ying et al. [21] investigated no-wait flowshop scheduling problems (NWFSPs) with sequence-independent setup times (SISTs) and sequence-dependent setup times (SDSTs) with the aim of minimizing the makespan, and proposed an efficient two-phase matheuristic (TPM). Their study is recent, and the proposed TPM algorithm achieved high performance. Furthermore, Lin et al. [22] investigated the system testing scheduling problem and used a method applicable to a computer manufacturing company; however, their algorithm was applied to computer production. The JSSP with the ST generated by the ultrasonic multi-group scan system is not large in terms of the number of tasks, and the PDR of the approximate/heuristic method can reduce the calculation of the dispatch parameters and increase the scan verification time. Therefore, the SSPA algorithm adopts an improved PDR as a part of the algorithm.
In recent years, a large number of FPGA parallel architectures have been reported. Suzuki et al. [23] have proposed an FPGA architecture and implementation for a shared synapse architecture for autoencoders; this architecture utilizes less of the limited resources of an FPGA than an architecture that does not share the synapse weights, and it reduces the amount of synapse modules used by half. Rodríguez-Flores et al. [24] have proposed the evaluation of a scalable low-area FPGA hardware architecture for security protocols relying on public key encryption; the design can process operands of different sizes using the same data path, which exhibits a significant reduction in the area without loss of efficiency. Hossain et al. [25] have developed a novel parallel architecture for fast hardware implementation of elliptic curve point multiplication. The area-time product of the proposed point multiplication is low, and the performance product of the proposed design is improved. Kim et al [26] have suggested a pipelined non-deterministic finite automaton-based string matching scheme using FPGA implementation. By cutting down the number of used LUTs for implementing state transitions, the hardware overhead of combinational logic circuits is reduced. All the above parallel architectures are RTL-level to reduce area and resource usage, and asynchronous distributed signal processing module scheduling is not discussed.
Multiple examples of scheduling algorithm implementation are available in the literature [27][28][29][30][31][32][33][34][35][36][37]. Among the relevant achievements, Srinivasan and Pandharipande [27] designed a self-configuring scheduling protocol for ultrasonic sensor systems using a timeslot allocation algorithm, which simplified the deployment of ultrasonic sensor systems. Further, Long et al. [32] proposed a time-division-multiple-access-based energy consumption balancing algorithm for general k-hop wireless sensor networks, where one data packet is collected in a cycle; the results demonstrated the effectiveness of the algorithm in terms of the energy efficiency and timeslot scheduling. A distributed ultrasonic sensor system has also been used in many application scenarios [36][37][38][39][40]. For example, Caicedo and Pandharipande [36,38,40] have presented ultrasonic array sensor solutions for reliable presence detection in indoor spaces. In addition, Priyantha et al. [37] have presented the design, implementation, and evaluation of the cricket location-support system for in-building, mobile, and location-dependent applications. Furthermore, Zhang et al. [39] have proposed a dynamic distributed sensor scheduling scheme, where the tasking sensor is elected spontaneously from sensors with pending sensing tasks via random competition based on carrier sense multiple access. Although the above implementations are effective in different distributed environment, application in the UPA system has not been discussed, and the distributed algorithm of its internal signal processing has not been studied. Therefore, in this work, we focus on the distributed UPA system and its algorithm.

Problem description
Asynchronous distributed ultrasonic TFM system architecture
Fig 2 displays a block diagram of the asynchronous distributed ultrasound TFM system, which has four sensor groups and two SPMs. To realize integration of multi-group scan systems, the focus and group scheduler (FGS) and the signal processing scheduler (SPS) are crucial components of this architecture. The FGS focuses the echo data from the analog-digital converter (ADC) and low-voltage differential signaling (LVDS) and saves the focused data in the corresponding location in the pixel memory group. The SPS schedules the signal data processing in each signal processing module. The other components include the ADC and LVDS, pixel memory group, focus module, FIFOs, signal processing arbitration (SPA), cache and bus arbitration, Avalon streaming to memory map (Avalon ST-MM), external memory interface (EMIF), scatter-gather direct memory access (SGDMA), peripheral component interconnect express (PCI-E), double-data-rate three synchronous dynamic random access memory (DDR3 SDRAM), and a personal computer (PC).
The signal flow is as follows: the ultrasonic emission signal is sent to the test block, the ultrasonic probe group receives the echo signal, and after ADC, it enters the focus module. The focus module is controlled by the FGS. When the scan group begins to scan, the focus module receives the signal from the ADC of the corresponding group and sends it to the assigned position of the pixel memory group after focusing. The FGS provides two types of information: the focus time parameter and the grouping parameters. The focus time parameter gives the delay time of each pixel, which is calculated from the TFM equation. The grouping parameters determine the scan group elements to be scanned, the receiving elements of the group, and the pixel position in the pixel memory group. After focusing, the readouts of multiple FIFOs are controlled by the signal process scheduler (SPS), which contains scheduling information on the signal processing order and SPM allocation. If the FIFOs have loaded the pixel data that have been focused by the focus module and an interrupt signal is transmitted, the data transmission waits for the signal-processing scheduler to prioritize the incoming SPM. After the SPM is processed, the data transmission waits for bus arbitration and dispatch. Finally, the data are sent to the PC through the PCI-E bus (S1 Appendix).
Next, we describe the FGS and SPS in detail, by considering one of the sensor groups in the system and their corresponding modules. Two scheduling parameters are introduced: the start time and processing time. The FGS and its corresponding module with four elements in a sensor group, shown in Fig 3(A), includes the focus parameter, pixel memory assign, and FIFO depth modules. The focus parameter module provides the focus parameter, the pixel memory assign provides the pixel number and assigns the random access memory (RAM) address in each pixel memory group, and the FIFO depth depends upon the number of pixels in each group.
In the scan process, each scan in a group must be performed in accordance with four states. State-0 reads the parameters, which include the focusing and grouping parameters for each group, and its duration is related to the number of parameters. State-1 receives and focuses the echo ultrasonic data, and its duration is related to the sampling depth. State-2 is a buffer to wait for the next scan. In state-3, if the focal law is not completed for a group, the process returns to state-0; if the focal law is completed, the pixel data are output to the corresponding FIFO. The state diagram is shown in Fig 3(B). From Fig 3(B), scanning is performed as per the $N_{FL}$ focal laws in states-0 and 1 between $N_{FL}$ time conversions. State-2 waits for the next scan and state-3 outputs the focused pixel data. The different focal laws (element numbers), scan depths, and numbers of ROI pixels generate different data start times and volumes of echo data in the related group.
For the j-th sensor group, states-0, 1, 2, and 3 correspond to the processing times $T^j_{RD}$, $T^j_{SD}$, $T^j_{wait}$, and $T^j_{write}$, respectively. $T^j_{RD}$ is the processing time for each focal law to read the parameter, and is related to the focus parameter and the number of ROI pixels. $T^j_{SD}$ is the processing time for ultrasonic echo data acquisition and focus for each focal law; $T^j_{SD} = D^j_{SD} \times T_{CLK}$, where $D^j_{SD}$ is the sample depth and $T_{CLK}$ is the FPGA clock cycle. $T^j_{wait}$ is the wait and system buffering time, which can be set. $T^j_{write}$ is the output-pixel-data time (writing to RAM); Px is the number of pixels in the ROI. $N^j_{FL}$ and $N^j_e$ are the number of focal laws and the number of elements in the j-th group, respectively. In the TFM system, $N^j_{FL} = N^j_e$. The entire scanning sequence diagram in the ModelSim simulation in the TFM ultrasonic system with four elements in one sensor group is depicted in Fig 3(C). A focused image is depicted in Fig 3(D).
Fig 4(A) depicts the signal processing architecture. The SPS includes the signal process schedule finite state machine (FSM) and the scheduler table. It controls the mux and demux in order to determine the time when the pixel signals are output to the FIFOs and input to the SPMs, the mux and demux are dispatched, and the delay control is implemented. According to the schedule table, the states of the signal process schedule FSM control the signal process scheduling in each SPM, ensuring that the dispatch is implemented in sequence and is not disrupted. The signal process schedule FSM flowchart is depicted in Fig 4(B).
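The timing quantities above can be illustrated with a short Python sketch. Only the relation between the sampling time, the sample depth, and the clock period is taken from the text; the helper `assumed_group_start_time`, which adds up states 0 to 2 once per focal law to estimate when a group's pixel data become ready, is a hypothetical simplification of the state diagram and not a formula from this paper.

```python
def sampling_time(sample_depth, t_clk):
    """T_SD = D_SD x T_CLK for one focal law (relation stated in the text)."""
    return sample_depth * t_clk


def assumed_group_start_time(n_fl, t_rd, t_sd, t_wait):
    """Hypothetical estimate of when a group's frame is ready for output.

    Assumption: states 0, 1 and 2 (read parameters, sample and focus, wait)
    are visited once per focal law before state 3 writes the pixel data.
    """
    return n_fl * (t_rd + t_sd + t_wait)
```

With all durations expressed in FPGA clock cycles (`t_clk = 1`), the result is an integer number of cycles, which matches the time unit used for scheduling later in the paper.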
In the SPS process, the mux and demux in the modules require delay control, which can be used to handle the delay in the input and output signals through the SPMs. The mux and demux delays are constant; we label the sum of the mux and demux delays as $T_{dl}$. The SPM delay is determined according to the structure of the SPMs. For a designed filter and circuit structure, the delay is constant, ensuring the timing.
From the above, the system has a number of parameters; these are transformed into the parameters of the scheduling algorithm in the next section. Our goal is not only to satisfy the scanning with different parameters in multi-group, but also to ensure that they pass through the limited SPM resources by scheduling, without affecting the real-time performance and data transmission.

Mathematical model
We assume that the asynchronous distributed ultrasonic TFM system described in the previous section has M SPMs and N tasks to scan. A system can have multiple sensor groups, and each group can have one or multiple scan tasks; therefore, the number of sensor groups $N_g$ is less than or equal to N. In this paper, we only consider $N_g = N$. We replace the "job" of the classical scheduling problem with a "task." There are N tasks ready to be processed at their start times, which must be distributed among the M SPMs. A processing schedule must be derived in the scheduler for the tasks assigned to each SPM. Each task has a required processing time, $P_j$, and start time, $S_j$. The optimization criterion is the minimization of the maximum completion time (makespan) among the SPMs.
Similar to the classic scheduling problem, it is generally assumed that at any time, each task can be processed by at most one SPM, and each SPM can process only one task. The other characteristics assumed in this paper are as follows:
1. All the data in all the problems are known deterministically when scheduling is undertaken.
2. There are N independent tasks available at the start time.
3. SPMs are always available if they are not busy, with no breakdowns.
4. A task once started on the SPM must be completed without interruption.
Before the model is presented, we introduce the employed parameters and indices in Table 1.
The processing time of the j-th task is
The start time of the j-th task is
The problem can now be formulated as follows:
s.t.
Eq (4) defines the variable X, while Eq (5) minimizes the maximum completion time ($C_{max}$), i.e., the makespan, which is the maximum completion time of all SPMs. Constraint (6) provides the definition of $C_j$, while Constraint (7) states that task i requires only one SPM. Constraints (8) and (9) define the "0 and 1" and the non-negative variables, respectively. Constraint (10) establishes the relationship between the completion times of tasks i and j that are assigned to the same SPM. Finally, Constraint (11) defines the makespan and Constraint (12) gives the relationship between the start times of two adjacent tasks in the same SPM. The total permutation of the problem is as follows (total number of possible solution space combinations): The upper bound of this problem is and the lower bound is
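The constraints described in words above (one SPM per task, no task beginning before its start time, no overlap on an SPM, no preemption) can be checked mechanically for any candidate schedule. The following Python sketch does so and returns the makespan. The function names and data layout are illustrative assumptions, and `simple_lower_bound` is a generic textbook-style bound added for orientation; it is not claimed to be the lower bound derived in this paper.

```python
def makespan(tasks, schedule, num_spms):
    """Validate a schedule and return C_max.

    tasks:    name -> (S_j, P_j)
    schedule: name -> (spm_index, begin_time)
    """
    per_spm = [[] for _ in range(num_spms)]
    for name, (start, proc) in tasks.items():
        spm, begin = schedule[name]  # each task runs on exactly one SPM
        if begin < start:
            raise ValueError(f"{name} begins before its start time")
        per_spm[spm].append((begin, begin + proc, name))
    c_max = 0
    for slots in per_spm:
        slots.sort()
        # Non-preemptive tasks on one SPM must not overlap.
        for (_, end0, n0), (begin1, _, n1) in zip(slots, slots[1:]):
            if begin1 < end0:
                raise ValueError(f"{n0} and {n1} overlap")
        if slots:
            c_max = max(c_max, slots[-1][1])
    return c_max


def simple_lower_bound(tasks, num_spms):
    """Generic bound: no task can finish before S_j + P_j, and all work
    must fit on the SPMs after the earliest start time."""
    longest = max(s + p for s, p in tasks.values())
    packed = min(s for s, _ in tasks.values()) + sum(p for _, p in tasks.values()) / num_spms
    return max(longest, packed)
```

A checker of this kind is useful when comparing heuristics, because it separates the question of whether a schedule is feasible from the question of how good its makespan is.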

Proposed SSPA algorithm
The aim is to effectively solve the problem of the different start times and data quantities in the signal processing scheduling problem for multi-group scanning that are generated by the SPM multiplexing and distributed scanning, i.e., JSSP with ST. To achieve this, we propose the SSPA algorithm. This algorithm has four main parameters: the start time, processing time, number of tasks, and number of SPMs. The main concept behind the algorithm is to order the tasks by the sum of their start and processing times, and to schedule them sequentially. In task scheduling, a task is tentatively allocated to each SPM, beginning at its start time. If the SPM is idle, the task is deferred backward until it is near the next adjacent task. If the dispatch location is occupied, all other tasks in the SPM are deferred, enabling the dispatch task to be inserted into the SPM for scheduling. Next, the completion times of all the SPMs are compared, and the SPM with the earliest completion time is selected as the dispatch SPM of the task. Finally, after each task is scheduled for the SPMs, the scheduling is completed.
Compared to the conventional FCFS and SPT algorithms, the proposed algorithm has the following advantages:
i. The task is deferred backward to the next adjacent task, providing time for scheduling of the next task and increasing SPM utilization.
ii. Based on the results of the previous scheduling iteration and comparison, the scheduling of the proposed algorithm is confirmed to be optimal.
iii. The amount of calculation increases only slightly.
The algorithm steps are as follows:
1. Add the start time and processing time; the sum is $R_j$.
2. Sort the tasks according to $R_j$.
3. Dispatch the first M tasks to M SPMs, based on the order of R.
4. According to the R order, dispatch the other N-M tasks.
5. Allocate task $T_j$ based on its start time to all the SPMs. If the processing of $T_j$ overlaps the other tasks in the SPMs, defer the other tasks to insert $T_j$. If $T_j$ does not overlap, defer the task as much as possible, near the first task in the SPMs.
We here present a simple example. For tasks T1-T4 depicted in Fig 6(A), the start times are 0, 1, 2, and 3 time units, and the processing times are 2, 3, 4, and 5 time units, respectively. The SPMs are M1 and M2. According to the flow of the SSPA, T3 and T4 are first assigned to M1 and M2, respectively, as depicted in Fig 6(B). Next, allocation of T2 to all SPMs is attempted, as depicted in Fig 6(C). In M1 and M2, T3 and T4 are deferred to insert the dummy T2. M2 is then selected for scheduling of T2, as shown in Fig 6(D). Allocation of dummy task T1 is attempted. If the T1 processing time overlaps the original task in the SPM, the other tasks in the SPM are postponed, e.g., for M2 in Fig 6(E). If there is no overlap, T1 is postponed near the adjacent task, e.g., M1 in Fig 6(E). Finally, the schedule is completed, as depicted in Fig 6(F).
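The steps above can be sketched in Python as one possible reading of the prose. Several details are assumptions made only for illustration: tasks are processed in descending order of R, a trial insertion keeps tasks that finish before the new task's start time in place and defers the rest, and ties between SPMs go to the lower index. The function names `sspa_schedule` and `_trial_insert` are hypothetical, and this sketch is not guaranteed to reproduce the SPM choices drawn in Fig 6.

```python
def _trial_insert(slots, name, start, proc):
    """Tentatively insert a task into one SPM; slots are (name, begin, end)."""
    before = [s for s in slots if s[2] <= start]   # already finished, untouched
    after = [s for s in slots if s[2] > start]     # may need to be deferred
    begin = start
    if after and after[0][1] >= start + proc:
        # No overlap: defer the new task so it sits next to the adjacent task.
        begin = after[0][1] - proc
    trial = before + [(name, begin, begin + proc)]
    cursor = begin + proc
    for other, b, e in after:
        nb = max(b, cursor)                        # defer only when occupied
        trial.append((other, nb, nb + (e - b)))
        cursor = nb + (e - b)
    return trial


def sspa_schedule(tasks, num_spms):
    """tasks: list of (name, S_j, P_j). Returns one slot list per SPM."""
    # Assumption: larger R = S + P is dispatched first.
    order = sorted(tasks, key=lambda t: t[1] + t[2], reverse=True)
    spms = [[] for _ in range(num_spms)]
    for idx, (name, start, proc) in enumerate(order):
        if idx < num_spms:                         # first M tasks, one per SPM
            spms[idx] = [(name, start, start + proc)]
            continue
        best = None
        for m, slots in enumerate(spms):
            trial = _trial_insert(slots, name, start, proc)
            finish = trial[-1][2]                  # completion time of this SPM
            if best is None or finish < best[0]:
                best = (finish, m, trial)
        spms[best[1]] = best[2]
    return spms
```

Because deferred tasks only ever move later and the new task never begins before its own start time, every schedule produced by this sketch respects the start-time and no-overlap constraints, whatever the quality of its makespan.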
In the literature, the JSSP is known to be NP-hard [41] and is regarded as one of the most difficult problems in this class [42]. This is because every job in a JSSP can have a different and separate processing time; thus, the problem complexity grows with the number of jobs (tasks). For this reason, JSSP with ST is also an NP-hard problem. The classical SPT algorithm requires sorting and has a time complexity of $O(N^2)$. The FCFS algorithm does not need iteration and its time complexity is $O(N)$. The SSPA algorithm proposed in this paper first requires sorting; then, the completion times of the SPMs are compared during scheduling. Thus, the time complexity is $O(N^2 M^2)$.

Experiment results and discussion
We designed a comparison test for the algorithms to prove that the SSPA algorithm enables high utilization and reduces the makespan. The compared algorithms were the FCFS and SPT [43] algorithms, as these are deterministic algorithms, which are often used for processor scheduling. The FCFS algorithm is dispatched to the current free SPM, according to the order of task arrival. The SPT is implemented in accordance with the shortest processing time and is scheduled to the current idle SPM.
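For orientation, the two baseline rules can be sketched as simple list schedulers in Python. This is a minimal interpretation of the descriptions above, not the authors' implementation: FCFS is assumed to order tasks by start time, SPT by processing time, and in both cases the next task goes to the SPM that becomes free earliest and begins no earlier than its own start time. The names `list_schedule`, `fcfs`, and `spt` are illustrative.

```python
def list_schedule(tasks, num_spms, key):
    """Dispatch tasks in sorted order to the SPM that is free earliest.

    tasks: list of (name, S_j, P_j). Returns (plan, makespan), where plan
    holds (name, spm_index, begin, end) tuples.
    """
    free_at = [0] * num_spms
    plan = []
    for name, start, proc in sorted(tasks, key=key):
        m = min(range(num_spms), key=lambda i: free_at[i])
        begin = max(start, free_at[m])   # wait for the task's start time
        free_at[m] = begin + proc
        plan.append((name, m, begin, begin + proc))
    return plan, max(free_at)


def fcfs(tasks, num_spms):
    """First come, first served: order by start (arrival) time."""
    return list_schedule(tasks, num_spms, key=lambda t: t[1])


def spt(tasks, num_spms):
    """Shortest processing time first."""
    return list_schedule(tasks, num_spms, key=lambda t: t[2])
```

Both rules never move a task once it has been placed, which is the main behavioral difference from an insertion-based scheduler that may defer previously placed tasks.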
In the experiment performed in this study, the start time was taken as being related to the number of focal laws and the scanning depth, and the processing time was taken as being related to the number of pixels in the ROI area; both were independent of each other. The PC used in our experiment was an Intel i7-4970, DDR3 8G RAM, and the programming environment was Matlab 2016a. In order to evaluate the algorithm performance, we used two criteria: the makespan and the utilization rate. We also varied the four parameters of the algorithms: the start time, processing time, number of tasks, and number of SPMs, so as to study the impact of parameter changes on the algorithm performance.
As noted above, the makespan is defined as the maximum completion time on the SPM after scheduling and is labeled C max . The makespan data take positive integers and the unit is the FPGA clock-cycle; however, for general purposes, the unit of makespan was taken as the time unit in the following experiments.
The utilization rate is defined as follows:
where M is the number of SPMs and $P_j$ is the task processing time. Here, the value was reported as a percentage with two decimal places.
The experiment settings were as follows: The number of SPMs was M, and the number of tasks was N. Each task had a start time and a processing time, each of which was a uniform random positive integer within a given range. The FCFS, SPT, and SSPA algorithms were implemented and the values of M, N, the SPM range, and the task range were changed. All algorithms were executed 100 times, and the average $C_{max}$ and utilization rate were calculated. We performed four tests; the parameters are listed in Table 2.
In test 1, as detailed in Table 2, the start time was taken as the variable. The start time value range was from 1 to the upper bound in integers, where the upper bound was changed from 10 to 100 in steps of 10. The processing time was set to uniform random integers in the range of 1−100. N was 16, and M was 4.
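The experimental setup can be illustrated with a short Python sketch. The utilization formula used here, total processing time divided by the capacity M times the makespan, is an assumption inferred from the symbols named in the text, since the paper's own expression is not reproduced above; `utilization_rate` and `random_tasks` are hypothetical helper names.

```python
import random


def utilization_rate(proc_times, num_spms, c_max):
    """Assumed definition: busy time over available SPM time, in percent,
    rounded to two decimal places as reported in the experiments."""
    return round(100.0 * sum(proc_times) / (num_spms * c_max), 2)


def random_tasks(n, start_hi, proc_hi, rng):
    """Draw n tasks with uniform random integer start and processing times
    in 1..start_hi and 1..proc_hi, respectively."""
    return [
        (f"T{k + 1}", rng.randint(1, start_hi), rng.randint(1, proc_hi))
        for k in range(n)
    ]
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps each of the 100 repetitions reproducible, which helps when the same task sets must be fed to several scheduling algorithms.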
Increasing the start time will further postpone the task makespan. In test 1, the upper bound of the start time range was varied and the makespan and utilization rates of the three considered algorithms, FCFS, SPT, and SSPA, were compared. The results are listed in Table 3. With the increase in the upper bound, the start time range increased also, and so too did the makespan curve, as shown in Fig 7(A). The SSPA had the smallest makespan, followed by the FCFS. Thus, the SPT makespan was the largest. These results indicate that the SSPA algorithm had the best scheduling performance. The utilization rate increased gradually with the increase in the upper bound and the corresponding gradual increase in the start time range (Fig 7(B)). Gradually, the utilization rate tended to become flat. This is because a bigger ratio between the start time and processing time allows more time slots generated by the start time to be inserted into the small task processing time; hence, a higher utilization rate is achieved. The SSPA utilization rate was always highest in these experiments, followed by those of the FCFS and SPT, as shown in Fig 7(B).
In test 2, the variable was the processing time, as indicated in Table 2. The processing time range was from 1 to the upper bound in integers, where the upper bound was varied from 10 to 100 in steps of 10. The start time was set to uniform random integers in the range of 1−100. N was 16 and M was 4.
Increasing the processing time also delays the task's makespan. In test 2, the upper bound of the processing time range was changed. The makespan and utilization rates of the three algorithms, FCFS, SPT, and SSPA, were compared. The results are listed in Table 4. With the increase in the upper bound, the processing time range increased, along with the makespan curve, as shown in Fig 8(A). The SSPA makespan was always the smallest, followed by the FCFS and SPT. Thus, the SSPA algorithm again exhibited the best scheduling performance. As previously, the utilization rate increased gradually and gradually tended to become flat as the upper bound was increased and the processing time range gradually increased. This is because the processing time of each task was different. Compared to the task with a large processing time, it is easier to insert a task with a small processing time into an idle time slot. Hence, the utilization rate of the problem, which has a small task range, is higher than that of a large tasks range. The SSPA utilization rate was always the highest, followed by that of the FCFS, and finally that of the SPT, as shown in Fig 8(B). In test 3, N was changed from 10−100 in steps of 10 (Table 2), the processing time and start time ranges were 1−10 in integers, and M was 4.
Increasing the number of tasks increases the scheduling burden and makespan delay. In test 3, the number of tasks was increased and the makespan and utilization rates of the three algorithms (FCFS, SPT, and SSPA) were compared. The results are listed in Table 5. The makespan increased linearly with the increase in the number of tasks, as shown in Fig 9(A). The SSPA makespan was always the smallest, followed by those of the FCFS and SPT. The SSPA algorithm exhibited the best scheduling performance, as shown in Fig 9(A). As the number of tasks increased, the utilization rate curve increased and tended to become flat (Fig 9(B)). Furthermore, the utilization rate was higher when the number of tasks increased relative to the number of SPMs. The SSPA utilization rate was always the highest, followed by that of the FCFS and finally the SPT, as shown in Fig 9(B).
In test 4, M was changed from 3−10 in steps of 1 (Table 2), the processing time and start time ranges were 1−10 in integers, and N was 16.
Increasing the number of SPMs is equivalent to increasing the resources, and the makespan decreases correspondingly. In test 4, the number of SPMs was increased and the makespan and utilization rates of the three algorithms (FCFS, SPT, and SSPA) were compared. The results are listed in Table 6. As the number of SPMs increased, the makespan decreased, and the SSPA makespan was always the smallest, followed by those of the FCFS and SPT, as shown in Fig 10(A). The SSPA algorithm therefore exhibited the best scheduling performance. As the number of SPMs increased, the utilization rate decreased continuously and approximately linearly. The SSPA utilization rate was always the highest, followed by that of the FCFS and finally that of the SPT, as shown in Fig 10(B). With more resources, the number of SPMs approaches the number of tasks; thus, the makespans and utilization rates of the algorithms converge. That is, if the number of tasks is equal to the number of resources, the makespan of any algorithm is the maximum completion time of a single task and the utilization rates of the algorithms are equal.
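The limiting case described above (N equal to M) can be checked with a minimal list-scheduling sketch. It covers only the two baseline rules, FCFS and SPT, and is not the SSPA algorithm. The function names, the static priority ordering, and the utilization definition (total processing time divided by M times the makespan) are assumptions made for illustration and are not taken from the paper.

```python
import heapq
import random


def list_schedule(tasks, m, key):
    """Dispatch tasks, in the order given by `key`, to the earliest-idle SPM.

    tasks: list of (start_time, processing_time) pairs.
    Returns (makespan, utilization), where utilization is assumed to be
    total processing time / (m * makespan).
    """
    free_at = [0] * m          # min-heap of times at which each SPM becomes idle
    heapq.heapify(free_at)
    makespan = 0
    for start, proc in sorted(tasks, key=key):
        idle = heapq.heappop(free_at)
        finish = max(idle, start) + proc   # a task cannot begin before its start time
        heapq.heappush(free_at, finish)
        makespan = max(makespan, finish)
    busy = sum(proc for _, proc in tasks)
    return makespan, busy / (m * makespan)


def fcfs(tasks, m):
    """First come, first served: order by start time."""
    return list_schedule(tasks, m, key=lambda t: t[0])


def spt(tasks, m):
    """Shortest processing time first: order by processing time."""
    return list_schedule(tasks, m, key=lambda t: t[1])


def random_tasks(n, rng, low=1, high=10):
    """Uniform random integer start and processing times in [low, high]."""
    return [(rng.randint(low, high), rng.randint(low, high)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    tasks = random_tasks(4, rng)
    # With N == M every task gets its own SPM, so both rules give
    # makespan == max(start + processing) and the same utilization.
    print(fcfs(tasks, 4), spt(tasks, 4))
```

When N equals M, each task is placed on its own idle SPM regardless of the priority order, which is why the two rules return identical results in that case.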
Thus, it is established that, compared to the FCFS and SPT algorithms, the SSPA algorithm improves the utilization rate and reduces the makespan. In addition, a fifth test, test 5, was performed to compare the statistics of the FCFS, SPT, and SSPA algorithm experiments. The experimental conditions were as follows: M was 4, N was 16, and the start and processing times were uniform random integers in the range 1−10.
The results for test 5, which compared the statistical performance of the three algorithms (FCFS, SPT, and SSPA), are presented in Figs 11 and 12 using box plots and 95% confidence interval plots (95% CI plots), respectively. As in the analyses of the makespan and utilization rate, each algorithm was run 100 times with the given parameters and randomly generated processing and start times. In the box plots, the median and interquartile range (IQR) results were compared. As regards the makespan, the SSPA had the smallest median, the narrowest IQR, and the lowest abnormal upper limit, as shown in Fig 11(A). These results explain why the SSPA algorithm exhibited the smallest overall makespan. Note that the abnormal data above the upper quartile were also lower for the SSPA in the makespan data set, which indicates the best statistical performance. For the utilization rate comparison, the SSPA had the largest median, which was similar to that of the SPT (Fig 11(B)). The SSPA algorithm also had the smallest IQR, the largest abnormal lower limit, and the smallest abnormal range, as shown in Fig 11(B); these results indicate that the SSPA algorithm yields more concentrated data, and that its data overall have the highest utilization rate and fewer abnormal points than those of the other two algorithms.
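The quantities read from the box plots and 95% CI plots can be computed from a list of repeated-trial results as in the following minimal sketch. The paper does not state how the abnormal limits or the confidence intervals were obtained, so the Tukey 1.5 × IQR fences, the inclusive quartile method, and the normal-approximation interval used here are assumptions, and `summarize` is a hypothetical helper name.

```python
import math
import statistics


def summarize(samples):
    """Return box-plot statistics and a 95% CI of the mean for `samples`.

    Assumptions (not from the paper): quartiles use the 'inclusive' method,
    abnormal limits are the Tukey fences at 1.5 * IQR, and the CI uses the
    normal approximation mean +/- 1.96 * s / sqrt(n).
    """
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    iqr = q3 - q1
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return {
        "median": median,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,   # values below this are treated as abnormal
        "upper_fence": q3 + 1.5 * iqr,   # values above this are treated as abnormal
        "ci95": (mean - half_width, mean + half_width),
    }


if __name__ == "__main__":
    # Hypothetical makespan values from a handful of trials, for illustration only.
    print(summarize([12, 14, 13, 15, 13, 16, 12, 14]))
```

Applying the same function to the 100 makespan values and the 100 utilization values of each algorithm would give the numbers that the two plot types visualize.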
From the 95% CI plots shown in Fig 12, the SSPA had the minimum mean makespan and the maximum mean utilization rate. Furthermore, the three algorithms had almost identical confidence intervals. Compared with the FCFS and SPT algorithms, the SSPA algorithm had the best performance in the 95% CI plot, as shown in Fig 12. Finally, Table 7 compares the average elapsed times of the three algorithms for the test 5 calculation. Although the average elapsed time of the SSPA calculation was greater than those of the FCFS and SPT algorithms, it is acceptable because it is of the same order of magnitude.

Scheduling algorithm implementation and performance evaluation
The scheduling algorithm was implemented using a UPA instrument (PA2000 model, Guangzhou Doppler Electronic Technologies Co., Ltd., Guangzhou, China) and a Cyclone V GT FPGA development board (Intel Corporation, Santa Clara, CA, USA) [44,45], with the latter serving as the PCI-E communication module with the PC. The UPA data were transmitted to the PC through the PCI-E interface, and the multi-group scanning images were processed. The SPM delay was 15 clock cycles and the main clock frequency was 100 MHz (10 ns per clock cycle).
To better demonstrate the performance of the scheduling algorithm, a virtual scheduling process was used in this experiment to test the application of the SSPA algorithm in the SPM, with the parameter settings listed in Table 8; the parameters correspond to the symbols defined in the sub-section "Asynchronous distributed ultrasonic TFM system architecture". The reading parameter time was approximately converted from the DDR3 hard core frequency of 400 MHz, with a 16-bit width and an efficiency of 0.8. The sample depth and number of ROI pixels could be directly converted to $T_j^{SD}$ and $T_j^{write}$. To make the effect more obvious, the sums of the read parameter time ($T_j^{RD}$) and wait time ($T_j^{Wait}$) values in the same group were made equal. Finally, we calculated the start and processing times based on Eqs (2) and (3), respectively.
Fig 13 displays the results of the pre-synthesis simulation of the SPA, obtained using the ModelSim 10.2 SE EDA tool (Mentor Co., Ltd., Wilsonville, OR, USA); the inputs to SPM0 and SPM1 were i_fir1 and i_fir2, and the corresponding outputs were o_fir1 and o_fir2, respectively. It can also be seen from the input signals through SPM0 and SPM1 and the output after the SPM delay that the input signals in0, in1, in2, and in3 in the SPA and signal processing were correctly restored as the output signals out0, out1, and out2. It is also apparent that the delay between the SPM input and output was 17 clock cycles, whereas the SPM delay itself was 15 cycles; hence, only a two-clock-cycle delay was caused by the mux and demux. In addition, it can be observed that the delay between the signal input to the FIFO after focus and the SPA output was only 36 clock cycles; subtracting the 15 clock cycles of the SPM delay, the delay owing to the entire scheduler architecture was approximately 19 clock cycles. Compared to the other time consumption, the delay caused by the SPA architecture was negligible; however, the architecture improved the utilization rate of the SPM considerably.
Fig 14 shows a comparison of the FCFS and SSPA algorithms. Four signals were scheduled in two SPMs and captured using SignalTap II in Quartus 13.0 (Intel Corporation, Santa Clara, CA, USA). The SPT/FCFS makespan was 2.88 μs, while that of the SSPA was 2.56 μs; thus, the makespan of the SSPA algorithm was lower. A time saving of 0.32 μs, or 32 clock cycles, was achieved for the same signals using SSPA. Therefore, a makespan improvement of approximately 11% was achieved and the utilization rate was improved by approximately 9.72%.
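The time saving and the relative makespan improvement follow directly from the two measured makespans and the 10 ns clock period. In the worked arithmetic below, the symbols $C_{\max}$ (makespan) and $T_{\mathrm{clk}}$ (clock period) are notation assumed here for illustration; the 9.72% utilization figure depends on the measured SPM busy time and is not rederived.

```latex
\begin{align*}
\Delta C_{\max} &= C_{\max}^{\mathrm{FCFS}} - C_{\max}^{\mathrm{SSPA}}
                 = 2.88\,\mu\mathrm{s} - 2.56\,\mu\mathrm{s} = 0.32\,\mu\mathrm{s}, \\
\frac{\Delta C_{\max}}{T_{\mathrm{clk}}} &= \frac{320\,\mathrm{ns}}{10\,\mathrm{ns}} = 32 \text{ clock cycles}, \\
\frac{\Delta C_{\max}}{C_{\max}^{\mathrm{FCFS}}} &= \frac{0.32}{2.88} \approx 11.1\%.
\end{align*}
```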

Conclusions and future research
A novel scheduling algorithm called the SSPA algorithm was proposed, together with a scheduler realized using FPGA technology. When the SSPA algorithm was applied to an asynchronous distributed ultrasonic TFM system, the utilization rate was improved by approximately 9.72% and the makespan was reduced by approximately 11% compared to the conventional FCFS and SPT algorithms. The system could also flexibly group the array without restricting the number of elements in each group.
A mathematical model of the problem was established, and the total number of permutations of the problem and its upper and lower bounds were indicated. The steps and procedures of the SSPA algorithm were presented and its performance was demonstrated. Because it improves the transmission efficiency of the considerable volume of data generated by an asynchronous distributed ultrasonic TFM system, as well as the real-time performance of algorithms realized using FPGA technology, the SPM-based SSPA scheduling algorithm has significant potential for application in multi-sensor systems. Future research is likely to focus on the design of special scheduling algorithm modules for different sensor systems, or on the scheduling problem of distributed ultrasonic systems based on multi-FPGA technology.